Chaos Engineering - The Art of Breaking Things in Production

Chaos Engineering
Keet Sugathadasa | Pubudu Sitinamaluwa
Site Reliability Engineering

Keet & Pubudu
Keet
SRE, Contributor to NPM and Stack Overflow
Interested in Cyber Security, Cloud Computing, Distributed Computing
Pubudu
SRE, Cloud Computing, Big Data, ML, Distributed Computing
Interested in distributed computing and machine learning

AGENDA
What started all this
What is Chaos Engineering
How is Chaos Engineering different from Testing Procedures
What companies are doing this
Why do Chaos Engineering?
What is Chaos Monkey
Principles of Chaos Engineering
Demonstration
Challenges Faced in Chaos Engineering

What started all this...
Netflix started to move to the cloud in August 2008
Disappearing instances in the cloud were a pain
Chaos Monkey was born
Christmas Eve 2012: AWS region failure
Chaos Engineering as a discipline
Chaos Kong was born: best-case recovery time went from 50 minutes to 6 minutes
Chaos Community Day
• 2015: 40 participants
• 2017: Nora Jones gave a keynote on Chaos Engineering (40,000 in-person and 20,000 streaming attendees)

What is Chaos Engineering
"Chaos Engineering is the discipline of experimenting on a system, in
order to build confidence in the system’s capability to withstand
turbulent conditions in production"
Controlled and planned Chaos Engineering experiments
Preparing for unpredictable failure
Preparing engineers for failure
Making the chaos inherent in the system visible
A way to improve reliability
Helping to meet SLAs by fortifying systems
Preparing for Game Day

What Chaos Engineering is NOT
Random Chaos Engineering experiments
Unsupervised Chaos Engineering experiments
Unexpected Chaos Engineering experiments
Breaking production by accident

Chaos Engineering vs Other Testing Procedures
How is Chaos Engineering Different from Testing Procedures
In testing, an assertion is made: given specific conditions, a system will emit a specific
output based on the given specifications. Tests are typically binary, and determine whether a
property is true or false. Strictly speaking, this does not generate new knowledge about the
system; it just assigns valence to a known property of it.
Experimentation generates new knowledge, and often suggests new avenues of exploration.
Chaos Engineering is a form of experimentation: it covers a range of methods for generating
new knowledge about a system. For example, if you want to uncover complex behavioral defects
in a system, injecting communication failures is often a better choice than writing another test.

What Companies Are Doing This
Netflix may have started this, but the practice has since spread across
many industries all over the world.
Netflix
Amazon
Dropbox
Uber
Slack
Twilio
Facebook
And many more!
Amazon: Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli, "Resilience Engineering: Learning to Embrace Failure," ACM Queue, Vol. 10, Iss. 9, Sept. 13, 2012. http://queue.acm.org/detail.cfm?id=2371297
Microsoft: "Inside Azure Search: Chaos Engineering," Microsoft Azure Blog, July 1, 2015. https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/
Facebook: Yevgeniy Sverdlik, "Facebook Turned Off Entire Data Center to Test Resiliency," Data Center Knowledge, Sept. 15, 2014.

Why Chaos Engineering
Systems need to scale fast and smoothly
Microservice architecture is tricky
Services will fail
Dependencies on other companies will fail
Reduce the number of outages and the amount of downtime (lose less money)
Prepare for real-world scenarios
Train on-call engineers to be prepared for different kinds of outages
Train development engineers to build more resilient systems
Help engineering architects make solid and reliable decisions

"Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical
functions of our online activities. The monkey randomly rips cables, destroys devices and
returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT
managers is to design the information system they are responsible for so that it can work
despite these monkeys, which no one ever knows when they arrive and what they will destroy."

Netflix Simian Army
Netflix has built an entire army of monkeys to simulate chaotic situations in the
production environment. This is called the Simian Army. Some famous monkeys are:
Chaos Gorilla
Chaos Kong
Latency Monkey
Doctor Monkey
etc.
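The core idea behind Chaos Monkey, randomly terminating instances so that services must tolerate the loss, can be sketched in a few lines. This is an illustration only, not Netflix's implementation: the function names, the group layout, and the `terminate` callback are all hypothetical, and nothing here talks to a real cloud API.

```python
import random


def pick_victim(instances, rng=None):
    """Pick one instance at random, or None if the group is empty."""
    rng = rng or random.Random()
    if not instances:
        return None
    return rng.choice(sorted(instances))


def unleash_monkey(groups, terminate, rng=None):
    """Terminate at most one randomly chosen instance per group.

    `groups` maps a (hypothetical) group name to its instance ids, and
    `terminate` is whatever callable actually kills an instance.
    """
    killed = []
    for _group, instances in groups.items():
        victim = pick_victim(instances, rng)
        if victim is not None:
            terminate(victim)
            killed.append(victim)
    return killed


# Simulated run: "terminating" just records the instance id.
groups = {"api": ["i-1", "i-2"], "db": []}
terminated = []
killed = unleash_monkey(groups, terminated.append, random.Random(0))
print(killed)
```

Passing the random generator and the `terminate` callable in as arguments keeps the sketch testable and makes a dry run trivial.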

Principles of Chaos Engineering

Build a Hypothesis around the Steady State Behavior
Vary Real-world Events
Run Experiments in Production
Automate Experiments to Run Continuously
Minimize Blast Radius

Principle 1: Build a Hypothesis around Steady State Behavior
Steady state is the normal, measurable behavior of your system when it is healthy, for example:
5xx error rate is below 5%
p90 latency is below 500 ms
Ops per second is above 10,000
Think of hypotheses as the "what if" questions asked against this steady state:
What if the load balancer breaks?
What if the cluster goes down?
What if the auth server breaks?
What if Redis becomes slow?
What if latency increases by 300 ms?
etc.
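The example thresholds on this slide can be written down as an explicit steady-state check. This is a minimal sketch, assuming the metrics have already been collected by some monitoring system; the `Metrics` record, the function names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    """Hypothetical snapshot of system metrics over one observation window."""
    error_rate_5xx: float   # fraction of responses that were 5xx (0.0 to 1.0)
    p90_latency_ms: float   # 90th percentile latency in milliseconds
    ops_per_second: float   # throughput


def steady_state_violations(m: Metrics) -> List[str]:
    """Return the violated steady-state conditions (empty list means steady)."""
    violations = []
    if not m.error_rate_5xx < 0.05:
        violations.append("5xx error rate is not below 5%")
    if not m.p90_latency_ms < 500:
        violations.append("p90 latency is not below 500 ms")
    if not m.ops_per_second > 10_000:
        violations.append("ops per second is not above 10,000")
    return violations


def is_steady_state(m: Metrics) -> bool:
    return not steady_state_violations(m)


# Hypothesis: "if Redis becomes slow, the system stays in steady state".
# The numbers below are made up for illustration.
baseline = Metrics(error_rate_5xx=0.01, p90_latency_ms=320, ops_per_second=12_500)
during_experiment = Metrics(error_rate_5xx=0.02, p90_latency_ms=640, ops_per_second=11_000)

print(is_steady_state(baseline))
print(steady_state_violations(during_experiment))
```

If the check fails during the experiment, the hypothesis is disproved and the violated condition points at where to look next.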

Principle 2: Vary Real-world Events
Hardware failures
Functional bugs
State transmission errors (e.g., inconsistency of states between sender and receiver nodes)
Network latency and partition
Large fluctuations in input (up or down) and retry storms
Resource exhaustion
Unusual or unpredictable combinations of inter-service communication
Byzantine failures (e.g., a node believing it has the most current data when it actually does not)
Race conditions
Downstream dependencies malfunction

Principle 3: Run Experiments in Production
Simulating the failure of an entire region or datacenter.
Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.
Injecting latency between services for a selected percentage of traffic over a predetermined period of time.
Function-based chaos (runtime injection): randomly causing functions to throw exceptions.
Code insertion: adding instructions to the target program and allowing fault injection to occur prior to certain instructions.
Time travel: forcing system clocks out of sync with each other.
Executing a routine in driver code emulating I/O errors.
Maxing out CPU cores on an Elasticsearch cluster.
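Two of the experiments above, latency injection for a percentage of traffic and function-based chaos, can be sketched as Python decorators. This is an illustrative sketch rather than any particular tool's API: `inject_latency`, `inject_exceptions`, and `InjectedFault` are hypothetical names, and the time window mentioned on the slide is left out to keep the example short.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Raised by the injector so injected faults are distinguishable from real errors."""


def inject_latency(delay_seconds, fraction, rng=None, sleep=time.sleep):
    """Decorator: delay roughly `fraction` of calls by `delay_seconds`."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < fraction:
                sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def inject_exceptions(fraction, rng=None):
    """Decorator: make roughly `fraction` of calls raise InjectedFault."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < fraction:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical service call: about 10% of calls get an extra 300 ms of latency.
@inject_latency(delay_seconds=0.3, fraction=0.1)
def get_profile(user_id):
    return {"id": user_id}


print(get_profile(42))
```

Keeping `fraction` small is one simple way to limit how much traffic an experiment touches.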

Principle 4: Automate Experiments to Run Continuously
Chaos Engineering is a long-running and labour-intensive practice, so experiments should
be automated. Metrics to capture on each run include:
Time to detect
Time for notification and escalation
Time to public notification
Time for graceful degradation to kick in
Time for self-healing to happen
Time to recovery (partial or full)
Time to all clear and stable
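An automated run can record a timestamp for each milestone and derive the "time to" metrics from them. A minimal sketch, assuming the timestamps come from your own alerting and incident tooling; the milestone names and the sample times are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict


def time_since_injection(
    milestones: Dict[str, datetime], injected_at: datetime
) -> Dict[str, timedelta]:
    """Elapsed time from fault injection to each recorded milestone."""
    return {name: reached_at - injected_at for name, reached_at in milestones.items()}


# Illustrative timestamps for a single experiment run.
injected_at = datetime(2024, 1, 1, 10, 0, 0)
milestones = {
    "detected": datetime(2024, 1, 1, 10, 1, 30),
    "notified": datetime(2024, 1, 1, 10, 3, 0),
    "recovered": datetime(2024, 1, 1, 10, 9, 0),
    "all_clear": datetime(2024, 1, 1, 10, 15, 0),
}

report = time_since_injection(milestones, injected_at)
for name, elapsed in report.items():
    print(f"time to {name}: {elapsed}")
```

Tracking these values across repeated runs shows whether detection and recovery are actually getting faster.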

Principle 5: Minimize Blast Radius
When you perform a Chaos Engineering experiment, always remember to identify
metrics like the following, to ensure that the blast radius is contained
and identified:
Who is impacted
How many workloads
What functionality
How many locations
And more
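One way to make these questions actionable is to write down the planned blast radius before the experiment and compare it with what is observed while it runs. A minimal sketch under that assumption; the `BlastRadius` record, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class BlastRadius:
    """Hypothetical record answering the questions on this slide."""
    impacted_users: int                 # who is impacted
    workloads: int                      # how many workloads
    functionality: FrozenSet[str]       # what functionality
    locations: int                      # how many locations

    def exceeds(self, limit: "BlastRadius") -> bool:
        """True if this radius goes beyond the planned limit in any dimension."""
        return (
            self.impacted_users > limit.impacted_users
            or self.workloads > limit.workloads
            or self.locations > limit.locations
            or not self.functionality <= limit.functionality
        )


# Planned limit versus what monitoring reports during the experiment.
limit = BlastRadius(1000, 3, frozenset({"search"}), 1)
observed = BlastRadius(250, 2, frozenset({"search", "checkout"}), 1)

if observed.exceeds(limit):
    print("abort experiment and roll back")
```

In this example the user and workload counts are within the limit, but the experiment has spilled over into functionality that was never meant to be affected, so it should be stopped.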

Challenges in Chaos Engineering

No time or flexibility to simulate disasters
Teams will always be spending their time fixing things and building new features
This can be very political inside the organization
Cost involved in simulating disasters and fixing what they uncover
And many more company matters that build up resistance

References
https://chaos-mesh.org/
https://www.gremlin.com/community/tutorials/how-to-run-a-gameday/
https://principlesofchaos.org/
http://queue.acm.org/detail.cfm?id=2371297
https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/
https://en.wikipedia.org/wiki/Chaos_engineering
https://learning.oreilly.com/library/view/chaos-engineering/9781492043850/
