6. “
Cloud-native technologies empower
organizations to build and run scalable
applications in modern, dynamic
environments such as public, private,
and hybrid clouds. Containers, service
meshes, microservices, immutable
infrastructure, and declarative APIs
exemplify this approach.
7. “
These techniques enable loosely
coupled systems that are resilient,
manageable, and observable.
Combined with robust automation,
they allow engineers to make high-
impact changes frequently and
predictably with minimal toil.
22. @jimmydahlqvist
Event driven systems
◦ Run code by reacting to events
◦ Several parts of the system can
run at the same time
◦ Match made in heaven for
serverless systems
26. @jimmydahlqvist
Blue / Green
◦ Provision a new environment,
Green
◦ Shift all traffic at once to the Green
environment
◦ Keep Blue environment for fast
rollbacks
35. “
Chaos Engineering is the discipline of
experimenting on a system in order to
build confidence in the system’s
capability to withstand turbulent
conditions in production.
36. @jimmydahlqvist
Chaos Engineering
◦ Not about breaking things
◦ Controlled experiments to inject
failures
◦ Find weaknesses in a system
◦ Running on commodity hardware
Hi!
A cloud introduction.
I'm not telling you to do this and that; I'm giving you my view of things to consider.
Not going deep this time.
Parts are from my presentation at TestIt next week.
We can provision resources on the other side of the world rapidly, no lead time.
Stop guessing capacity; there is no need to over-provision.
Use managed services for faster development; don't do undifferentiated heavy lifting, like managing servers.
DevOps has increased the rate of change, we go to production in an automated way several times per day.
That brings us to Cloud Native applications.
What makes them so special? What separates them from an "ordinary" application?
The Cloud Native Computing Foundation describes them as:
Cloud native applications use the cloud's strengths, embracing the possibility to create new infrastructure fast and easily.
They consist of loosely coupled pieces and services that communicate over well-defined APIs.
What to remember is that an API doesn't have to be a REST-based or GraphQL-based API. It can be a message bus with well-defined messages; it can be queues and broadcasts. What is important is that the interface, whatever it is, is well defined.
A cloud native application is also aware that it is running on commodity hardware that can break or go away at any point.
It also uses the strength of, in AWS terms, running across multiple Availability Zones to be resilient to network and power loss in a single data center.
"Everything fails, all the time." – Werner Vogels
Now we get to one of the most important parts when it comes to Cloud and cloud native applications: Infra as code!
We need to have a repeatable process to create new and update current environments running in cloud.
If we do this manually, with a set of written instructions, and I give those instructions to 10 people, I will get 10 different environments; humans miss things and do things in the wrong order.
We must automate this! Automate everything!
It also gives us the "map" of how the infrastructure should look; it's the single source of truth!
We use code and tools to model our infra; it can be CloudFormation or CDK in the AWS realm, or Terraform, Pulumi, and other tools. Most important is that it is code that you can version control in Git.
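The point "it is code" can be as simple as this sketch: plain Python that emits a CloudFormation template as JSON, which you then commit to Git. The bucket resource and its logical name are made up for illustration; real projects would use CDK, Terraform, or hand-written templates.

```python
import json

# Build a minimal CloudFormation template as plain Python data.
# The "ArtifactBucket" resource is an illustrative example, not from the talk.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "ArtifactBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"VersioningConfiguration": {"Status": "Enabled"}},
        }
    },
}

# Serialize deterministically so the committed file diffs cleanly in Git.
print(json.dumps(template, indent=2, sort_keys=True))
```

Because the template is generated code, every change to the environment shows up as a reviewable diff with full history.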
You should store everything in Git! Infra setup, service code, documentation, just everything!! It gives us control and a built in history and audit trail.
Except….
Don't store secrets in Git… please! So go and tell your team not to store secrets in Git.
Secrets belong in a secure store like AWS Secrets Manager or Hashicorp Vault!!
So how do we test this? Do we need to test our infra as code? Yes we do!
What we should test is that we don't introduce security vulnerabilities, that we follow company policy, and that the templates and code are correct.
That can easily be done using static code inspection with tools for Policy as Code, where we define the company policies in code and then check against them. This can be that port 22 is never, ever allowed to be open, or that an EC2 instance is of a certain type or types.
Linting will check that the template is valid and doesn’t contain any syntax errors.
There are good tools for that, and these should of course be part of our CI/CD pipelines.
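As a minimal sketch of the policy-as-code idea: a check that scans a parsed template for the two example policies above (no port 22 ingress, only approved instance types). Real tools like cfn-guard, checkov, or OPA do this far more thoroughly; the template dict and the allowed-type list here are assumptions for illustration.

```python
# Company policies, expressed in code. The allowed list is illustrative.
ALLOWED_INSTANCE_TYPES = {"t3.micro", "t3.small"}

def check_template(template):
    """Return a list of policy violations found in a CloudFormation-style dict."""
    violations = []
    for name, res in template.get("Resources", {}).items():
        props = res.get("Properties", {})
        if res.get("Type") == "AWS::EC2::SecurityGroup":
            for rule in props.get("SecurityGroupIngress", []):
                # Policy 1: port 22 must never, ever be open.
                if rule.get("FromPort") == 22:
                    violations.append(f"{name}: port 22 ingress is forbidden")
        if res.get("Type") == "AWS::EC2::Instance":
            # Policy 2: instances must be of an approved type.
            if props.get("InstanceType") not in ALLOWED_INSTANCE_TYPES:
                violations.append(f"{name}: instance type not in allowed list")
    return violations

# A deliberately non-compliant template to show both rules firing.
template = {
    "Resources": {
        "BadSg": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {"SecurityGroupIngress": [
                {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "CidrIp": "0.0.0.0/0"}
            ]},
        },
        "Web": {"Type": "AWS::EC2::Instance",
                "Properties": {"InstanceType": "m5.24xlarge"}},
    }
}

for v in check_template(template):
    print("POLICY VIOLATION:", v)
```

In a pipeline, a non-empty violation list would fail the build before anything is deployed.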
Then we should test the actual setup of the environment: basically deploying a new environment and using different test tools to validate that the environment is correct.
That can be that an API is exposed, the database exists, your ECS container cluster exists, the S3 buckets exist.
This can be defined as unit tests and run using common tooling; however, it can be a bit tricky, since some resources live in isolated networks that can't be reached from local computers.
But it’s important to not forget about this!
Automation! I said it was important.
We need to automate everything… and I mean everything! Humans make mistakes; that is just our nature. It happens, and no one is to blame.
In this fast-moving world we need things to be 100% repeatable, and that is where automation comes in.
We like to create our infra automatically; therefore IaC is important.
We like to automate our tests so we can run them over and over and over and over again multiple times per day.
We like to automate our deployment and release processes using CI/CD tooling.
Automatically smoke testing our releases, automatic rollback in case of error.
Humans can click the single “Release” button but after that it should be 100% human hands off.
Our test automation is super important. What would happen if we sat with hours of manual testing for every release?
What if we then would like to release once per week? Once per day? Once per hour? Several times per hour?
The test automation is there to build confidence in our releases. It can't find everything, and there will be bugs that are not found; that is ok. It should build confidence that we are good to go!
Now we come to one of my favourite parts… Pull-Request testing and temporary environments!
Wait, what? Temporary environments? Yeah, that's right. This is one of the greatest strengths of the cloud: you can throw up a brand new environment, run your tests, and shut it down.
But what if my environment takes hours to spin up? Yes, that is sometimes the case, or the environment can be really expensive to set up. In those cases I recommend setting up one base environment and then using temporary parts, routing policies, and namespaces for your pull requests.
It could be that you just deploy the affected service and use advanced routing during testing. There is no one solution that fits all. How you create, use, and tear down temporary environments, or temporary parts of environments, is a case-by-case setup.
Why should we do pull-request testing in an environment then? Can’t we just run our unit tests locally?
First of all, we do PR testing to find problems early; we do it to make sure problems don't sneak into the main branch.
We should deploy to the cloud to be able to do all forms of integration tests.
Unit tests are fine for testing isolated parts, but how do they play out in the rest of the ecosystem? As a joke I tend to say that all unit tests passed on Titanic… integration tests, not so much.
So deploy it to the cloud and integration test the shit out of it! With serverless and event-driven systems, integration testing is more important than unit tests.
A small illustration of how I see PR testing.
What we should not forget is the communication back to the PR. There is no point in running tests if failing tests don't block the merge!
Now, that has been one huge section of the Cloud Mindset, where things like automation, infra as code, and CI/CD play a huge role.
Thinking "everything fails, all the time" (Werner Vogels) is very important in a cloud world.
Now let's move into something different: microservices… one essential part of a cloud native application.
But what is it?
Microservices are both an architectural and an organizational way to organize software.
A microservice is an individual part of the system that runs and scales independently.
Communication is over well-defined APIs, and as I said before, this doesn't mean it has to be a REST-based API; it can be over message buses, queues, etc.
On the organizational side, microservices can be owned, developed, and maintained by individual teams. Meaning that if we have 100 microservices in our system, we can have 100, or 20, development teams.
AWS two pizza team…
We also need a strong contract for our API, and this should be "signed off" by both the owning (producer) team and the consuming team.
What is good with this: you get the teams talking…
When it comes to testing microservices, I see two very, very important concepts. I normally refer to them as API tests and contract tests.
In the end they are of the same kind, but when describing them to teams it is easier to use different words.
API tests are tests towards the API that are written and maintained by the team owning the microservice: the producer of the API.
Contract tests are tests towards the API that are written and maintained by the consuming team.
So why both? Why not just the API test? Why the contract test?
Well, the API tests are there to verify the API, right? But what if something changes? Let's say you have an incorrect behaviour, a bug in the service. Then you correct the bug and update the test; no problem, right?
Well, it is a problem. The consumers may have built workarounds for the incorrect API, so if you roll out an update, their code might break.
The contract tests are the contract: the consumers basically write tests for how they are using the API, and both parties agree on that contract.
This way there cannot be a change that breaks the contract; we honor the contract at all times.
Contract tests shall be run every time there is a change in the microservice exposing the API.
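A minimal sketch of a consumer-owned contract test: the consumer pins down exactly the fields and types it depends on, and the producer runs this check on every change. The order payload and its field names are invented for the example; in a real pipeline the response would come from a deployed test stage (or a tool like Pact would manage the exchange).

```python
def consumer_contract(order):
    """The consumer relies only on these fields -- this IS the contract."""
    assert isinstance(order["id"], str)
    assert isinstance(order["total_cents"], int)
    assert order["status"] in {"pending", "shipped", "cancelled"}

# Faked producer response, to keep the sketch self-contained.
producer_response = {
    "id": "order-123",
    "total_cents": 4200,
    "status": "pending",
    "internal_flag": True,  # extra fields are fine; the contract ignores them
}

consumer_contract(producer_response)
print("contract honored")
```

Note that the contract only constrains what the consumer actually uses, so the producer stays free to add fields without breaking anyone.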
Where tests are maintained can differ…
Now we enter the world of async calls….
Event-driven serverless systems are a growing trend, and in an event-driven system actions are performed in an async way.
Which, from a testing perspective, can be highly challenging…
The code runs in the background, so to speak, and starts and finishes "at random".
So how do we write tests for this? How do we ensure that whatever is triggered by an event is executed as expected?
Sure, we can trigger and wait for a result, but how long should we wait? How do we avoid flaky tests? If we can't predict the turnaround time, how do we build the test?
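One common answer to "how long should we wait" is to poll with a deadline instead of sleeping for a fixed time: the test passes as soon as the async work lands, and only the upper bound is a guess. A minimal sketch; the `processed` list stands in for whatever the real consumer writes (a DynamoDB item, a queue message, etc.).

```python
import time

def wait_for(predicate, timeout=10.0, interval=0.2):
    """Poll until predicate() is truthy, or fail after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Illustrative async consumer: pretend background work lands a record here.
processed = []

def fake_background_work():
    processed.append("event-1")

fake_background_work()
assert wait_for(lambda: "event-1" in processed)
print("consumer output observed")
```

This doesn't remove flakiness entirely, but it turns "guess a sleep" into "assert within a generous deadline", which is far more stable.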
This is a challenge testers in my current project face every day.
More on that topic later…
Serverless… the trend that picks up momentum every day!
The #1 reason, I think, is that you can focus on solving your business problems and not worry about babysitting servers.
The turnaround time for a serverless PoC is way lower than with classic servers.
With services like AWS Lambda we can break microservices down into functions and almost enter the realm of nano services.
Serverless and event-driven go hand in hand.
Event-driven systems are a match made in heaven for serverless systems, due to the possibility to run small pieces of code in reaction to events, and the autoscaling of serverless.
In an event-driven system we run code in reaction to events: things that happen in other parts of our system that are then posted onto an event bus.
Here we don't have a one-to-one mapping; instead, a one-to-many (producer-consumer) relationship is possible.
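The one-to-many shape can be sketched with a tiny in-memory bus: one published event, several independent consumers reacting to it. Real systems would use something like Amazon EventBridge or SNS; the event name and handlers here are invented for the sketch.

```python
from collections import defaultdict

class EventBus:
    """Toy event bus: one producer, many consumers per event type."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Every subscriber reacts to the same event: one-to-many.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
log = []
bus.subscribe("order.created", lambda e: log.append(f"billing saw {e['id']}"))
bus.subscribe("order.created", lambda e: log.append(f"shipping saw {e['id']}"))

bus.publish("order.created", {"id": "order-1"})
print(log)
```

The producer never knows who is listening, which is exactly what makes these systems loosely coupled, and exactly what makes them hard to test end to end.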
So how do we test this? As said, a challenge in my current team.
What we have done is hook into the event bus, so we know that the event was triggered and reached the bus.
We then start monitoring the output from the consumers to ensure they run as expected.
Does it work all the time? Nope. Are the tests flaky? Yes. But it's better now that we have hooked into the bus.
I can guarantee that this will be a challenge for you as well… event-driven serverless systems are gaining traction, and the adoption rate, especially among startups, is huge.
As mentioned before, integration tests are the most critical part of a cloud native application, especially if it's an event-driven system.
We can verify each part individually but how do we ensure they work together? Integration testing!
And with more and more moving parts it becomes the most important thing to learn and understand.
Make sure to deploy everything to an environment in the cloud, like a test stage, and then integration test the shit out of the system.
And!!! Do this every night… every night! That will help catch problems early on!
So not only do you need to understand testing! You must understand CI/CD and how to use its strengths!
OK. Now we have tested, and we are confident that things work as intended.
Now let's go to production… just deploy it, right? Your work as testers is done, right?
WRONG! It has just started!
When going to production, our CI/CD pipeline and how we deploy become our last line of defense!
SMOKE TEST
And this is where you as testers come in! After we have deployed we must ensure that everything looks good. But why? Haven't we already done that?
DATA! Data is different in prod and can make things behave strangely.
Therefore your task is to create smoke tests that we can run directly after deployment to verify everything.
Just remember you are testing in prod…. So don’t mess with data!
Then, how we deploy should be driven by business requirements; there are so many different deployment methodologies.
Here are two of my favourites…
Blue/Green…. What is this magic stuff?
<Describe Blue/Green>
In this mode we can run smoke tests against Green!
Then we have canary….
<Describe canary>
So, we have already talked about smoke testing before.
But what is the responsibility of a smoke test?
As I see it, it has 3 typical tasks!
It’s there to build confidence in the release, that everything is working as expected.
It will ensure that the functionality that we have already tested also works in prod with prod data.
And last! It will trigger rollbacks in case of failing tests. And we should not forget this!
And since we trigger rollbacks, the tests MUST be rock solid; no flaky tests here! Imagine what would happen if we started to roll back every second release due to a flaky test.
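One way to keep a rollback gate from firing on a single blip is to retry the smoke test a few times before declaring failure. A minimal sketch of that decision logic; `run_smoke_test` is a stand-in for your real post-deploy checks, not an actual API.

```python
def gate(run_smoke_test, attempts=3):
    """Return 'keep' if any attempt passes, 'rollback' if all attempts fail.

    Retrying protects against transient infrastructure blips; it is no
    excuse for genuinely flaky tests, which must still be fixed."""
    for _ in range(attempts):
        if run_smoke_test():
            return "keep"
    return "rollback"

# A release that hits one transient blip, then passes: keep it.
results = iter([False, True])
assert gate(lambda: next(results)) == "keep"

# A genuinely broken release: every attempt fails, so roll back.
assert gate(lambda: False) == "rollback"
print("gate behaves as expected")
```

The retry budget is a trade-off: more attempts mean fewer false rollbacks, but also a longer window in which a broken release stays live.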
Now we are in production; we are done…
Well, there is one more thing…
There are a couple of other aspects to think about
Werner Vogels
Chaos engineering is one!
This sounds funky… breaking things in production? Really?
Well, no. Chaos Engineering is defined as:
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
So it's not about breaking things…
It's about doing controlled experiments where we inject failures. It can be servers going away, throttles, latency, you name it.
We inject these failures to ensure resiliency and find weaknesses in our system.
Chaos Engineering originated at Netflix/Amazon, and both are heavy users of it. Netflix runs experiments where they simulate an entire AWS region going down.
There are several videos on YouTube where you can see the flow of data during the experiment.
Then we have testing in production.
This is different from Chaos Engineering in that we don't inject failures.
Instead, we run tests in production to verify the system, and we do that because the data is different.
If someone says that their prod and staging environments are the same, they are probably lying…
One important thing! If customers are affected by your tests or by your chaos experiments: ABORT!!
Therefore there must be a key KPI to monitor. For Netflix it's Streams Start per Second, SPS.
Which leads us into….
Observability!
Which is the way to understand the state of a complex system by looking at its outputs.
Use a business-driven key KPI, such as Netflix's SPS (Streams Start per Second).
Thank you!
You can follow me on Twitter or connect on LinkedIn.