18. CHAOS ENGINEERING
โChaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the
systemโs capability to withstand
turbulent conditions in production.โ
https//principlesofchaos.org
19. CHAOS ENGINEERING
โChaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the
systemโs capability to withstand
turbulent conditions in production.โ
https//principlesofchaos.org
35. DEFINE HYPOTHESIS
BRAINSTORM WHAT CAN GO WRONG
BRING EVERYONE
DEVELOPERS
SRE /OPERATIONS
NETWORKS
BUSINESS
INFRASTRUCTURE
TESTERS
WHAT CAN GO WRONG?
WHAT IFDATABASE IS DOWN?
WHAT IFSERVICE RESPONDS SLOWER?
WHAT IFMY CACHE RESPONDS SLOW?
WHAT IFA POD DIES?
WHAT IF LOADBALANCER STOPS?
WHAT IFโฆ.?
46. MULTI PARALELLISM
PARALLELISM AVAILABILITY DOWNTIME PER YEAR
1 99% 3 DAYS 16 HOURS
2 99,99% 53 MINUTES
3 99,9999% 32 SECONDS
HOW PARALEL IS YOUR CLOUD COMPONENT ?
REGIONSAVAILABILITY ZONES
58. GEERT VAN DER CRUIJSEN
@GEERTVDC
THANK YOU!ALL PICTURES USED ARE FROM UNSPLASHED.COM
RESOURCES
BOOKS:
Chaosengineering-OโReilly
Chaosengineeringobservability -OโReilly
TOOLS:
chaostoolkit.org
gremlin.com
github.com/netflix/simianarmy
github.com/asobti/kube-monkey
RESOURCES:
principlesofchaos.org
github.com/dastergon/awesome-chaos-engineering
docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency
Editor's Notes
Why? Easy you can put โchaos engineerโ as function title on your resume
Who did fail over test to other data center?
Weโve started using the cloud and building distributed applications
Deathstar of amazon
Who did fail over test to other data center?
How do we monitor this kind of stuff?
Netflix SPS
Netflix SPS
It has beenย reportedย that every 100ms of latency costs Amazon 1% of profit
INFRA: cloud is providing this for us right?
NETWORK: The network is always reliable? We know it is not. How about switching over? How do we test that? Itโs often one of the easiest
APPLICATION: how do applications hold up when errors occur? What if the database is not accesible?
PEOPLE: People intervention. Is that making it wose? Fire drills. Do we fire drill for IT?
Partial failure mode
You have to think TOGETHER with business of the impact of failure.
Fail open example
Partial failure mode
Partial failure mode
Partial failure mode
Chaos engineering is like vaccination.
We add small amounts of harm to make the full system more immune to the effects
Why? Easy you can put โchaos engineerโ as function title on your resume
Why? Easy you can put โchaos engineerโ as function title on your resume
Partial failure mode
incident-response learnin
incident-response learning
MTTR
incident-response learnin
MULTI REGION, MULTI AVAILABILITY ZONE
Idempotency: can i do the same thing with same effect?
Safety: does it change the end state?
Example in kubernetes fixed CPU / memory reservations
Not 1 application can kill others by using max CPU/MEMORY
Fusebox
When things go wrong stop retrying
Polly
Command and Query Responsibility Segregationย
Chaos engineering is like vaccination.
We add small amounts of harm to make the full system more immune to the effects
Why? Easy you can put โchaos engineerโ as function title on your resume