Chaos Engineering
Keet Sugathadasa| Pubudu Sitinamaluwa
Site Reliability Engineering
Keet & Pubudu
Keet
SRE, Contributor to NPM and Stack Overflow
Interested in Cyber Security, Cloud Computing, Distributed Computing
Pubudu
SRE, Cloud Computing, Big Data, ML, Distributed Computing
Interested in distributed computing and machine learning.
AGENDA
What started all this
What is Chaos Engineering
How is Chaos Engineering different from Testing Procedures
What companies are doing this
Why do Chaos Engineering?
What is Chaos Monkey
Principles of Chaos Engineering
Demonstration
Challenges Faced in Chaos Engineering
Chaos Engineering
What started all this...
Netflix started to move to the cloud in August 2008
Disappearing instances in the cloud were a pain
Chaos Monkey was born
Christmas Eve 2012 – AWS region failure
Chaos Engineering as a Discipline
Chaos Kong was born - Best-case recovery time went from 50 minutes to 6 minutes
Chaos community day
• 2015 – 40 participants
• 2016 – 60 participants
• 2017 – 150 participants
• 2017 – Nora Jones gave a keynote on Chaos Engineering to 40,000 in-person and 20,000 streaming attendees
What is Chaos Engineering
"Chaos Engineering is the discipline of experimenting on a system, in
order to build confidence in the system’s capability to withstand
turbulent conditions in production"
Controlled and planned Chaos Engineering experiments
Preparing for unpredictable failure
Preparing engineers for failure
Making the chaos inherent in the system visible
A way to improve reliability
Helping to meet SLAs by Fortifying Systems
Preparing for Game Day
What is NOT Chaos Engineering
What Chaos Engineering is NOT
Random Chaos Engineering Experiments
Unsupervised Chaos Engineering Experiments
Unexpected Chaos Engineering Experiments
Breaking production by Accident
Chaos Engineering vs Other Testing Procedures
How is Chaos Engineering Different from Testing Procedures
In Testing, an assertion is made: given specific conditions, a system will emit a specific
output based on the given specifications. Tests are typically binary, and determine whether a
property is true or false. Strictly speaking, this does not generate new knowledge about the
system, it just assigns valence to a known property of it.
Experimentation generates new knowledge, and often suggests new avenues of exploration.
Chaos Engineering is a form of experimentation: it uses a variety of methods to learn something
new about the system. For example, if you want to uncover complex behavioral defects in a
distributed system, injecting communication failures is often a better choice than a conventional test.
What Companies Are Doing This
Netflix may have started this, but the practice has since spread to
companies in many industries all over the world.
Netflix
Amazon
Dropbox
Uber
Slack
Twilio
Facebook
And many more!
Amazon: Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli, Resilience
Engineering: Learning to Embrace Failure, ACM Queue, Vol. 10, Iss. 9, Sept. 13, 2012.
http://queue.acm.org/detail.cfm?id=2371297
Microsoft: Inside Azure Search: Chaos Engineering, Microsoft Azure Blog, July 1, 2015.
https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/
Facebook: Yevgeniy Sverdlik, Facebook Turned Off Entire Data Center to Test Resiliency,
Data Center Knowledge, Sept. 15, 2014.
Why do Chaos Engineering
Why Chaos Engineering
Systems need to scale fast and smoothly
Microservice architecture is tricky
Services will fail
Dependencies on other companies will fail
Reduce the number of outages and the amount of downtime (lose less money)
Prepare for real-world scenarios
Train on-call engineers to be prepared for different kinds of outages
Train development engineers to build more resilient systems
Help engineering architects make solid and reliable decisions
Chaos Monkey
What is Chaos Monkey
Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical
functions of our online activities. The monkey randomly rips cables, destroys devices and
returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT
managers is to design the information system they are responsible for so that it can work
despite these monkeys, which no one ever knows when they arrive and what they will destroy.
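The monkey analogy can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Netflix's Chaos Monkey: the fleet is an in-memory dictionary, and the names `pick_victims`, `fleet`, and the instance IDs are hypothetical. It picks at most one random instance per group and removes it, which is the kind of disappearing instance a resilient service should survive.

```python
import random


def pick_victims(fleet, rng, probability=1.0):
    """Pick at most one random instance per group.

    fleet: dict mapping a group name to a list of instance IDs (hypothetical).
    rng: a random.Random instance, passed in so runs can be made repeatable.
    probability: chance that a given group is hit at all.
    """
    victims = {}
    for group, instances in fleet.items():
        # Skip empty groups; otherwise hit the group with the given probability.
        if instances and rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims


# Hypothetical fleet; a real tool would query the cloud provider instead.
fleet = {
    "api": ["api-1", "api-2", "api-3"],
    "auth": ["auth-1", "auth-2"],
    "batch": [],
}

victims = pick_victims(fleet, random.Random(7))
for group, instance in victims.items():
    fleet[group].remove(instance)  # simulated termination only
    print(f"terminated {instance} in group {group}")
```

Passing the random generator in explicitly keeps the experiment controlled and planned: the same seed reproduces the same choice of victims.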
Netflix Simian Army
Netflix has built an entire army of monkeys to simulate chaotic situations in the
production environment, and this is called the Simian Army. Some famous monkeys are...
Chaos Gorilla
Chaos Kong
Latency Monkey
Doctor Monkey
etc...
Principles of Chaos Engineering
Build a Hypothesis around the Steady State Behavior
Vary Real-world Events
Run Experiments in Production
Automate Experiments to Run Continuously
Minimize Blast Radius
Principle 1: Build a Hypothesis around Steady State Behavior
Steady state is the measurable behavior of your system when it is operating normally, for example:
5xx error rate is below 5%
p90 latency is below 500 ms
Ops per second is above 10,000
Think of the hypotheses as "what if" questions:
What if the load balancer breaks
What if the cluster goes down
What if the auth server breaks
What if Redis becomes slow
What if latency increases by 300 ms
etc
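A steady-state hypothesis can be written down as a check that runs before and during an experiment. The sketch below uses the three example thresholds from this slide; the `Metrics` class, its field names, and the sample values are assumptions made for illustration, and real numbers would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """A snapshot of system metrics (hypothetical shape)."""
    error_rate_5xx: float  # fraction of responses that are 5xx, 0.0 to 1.0
    p90_latency_ms: float  # 90th percentile latency in milliseconds
    ops_per_second: float  # throughput


def steady_state_violations(m):
    """Return the violated steady-state conditions (empty list if steady)."""
    violations = []
    if not m.error_rate_5xx < 0.05:
        violations.append("5xx error rate is not below 5%")
    if not m.p90_latency_ms < 500:
        violations.append("p90 latency is not below 500 ms")
    if not m.ops_per_second > 10_000:
        violations.append("ops per second is not above 10,000")
    return violations


# Made-up sample values for illustration only.
baseline = Metrics(error_rate_5xx=0.01, p90_latency_ms=320, ops_per_second=12_500)
redis_slow = Metrics(error_rate_5xx=0.02, p90_latency_ms=640, ops_per_second=11_000)

print(steady_state_violations(baseline))
print(steady_state_violations(redis_slow))
```

If the check reports violations while a fault such as "Redis becomes slow" is injected, the hypothesis is disproved and the experiment has found something worth fixing.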
Principle 2: Vary Real-world Events
Hardware failures
Functional bugs
State transmission errors (e.g., inconsistency of states between sender and receiver nodes)
Network latency and partition
Large fluctuations in input (up or down) and retry storms
Resource exhaustion
Unusual or unpredictable combinations of inter-service communication
Byzantine failures (e.g., a node believing it has the most current data when it actually does not)
Race conditions
Downstream dependency malfunctions
Principle 3: Run Experiments in Production
Simulating the failure of an entire region or datacenter.
Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.
Injecting latency between services for a selected percentage of traffic over a predetermined period of time.
Function-based chaos (runtime injection): randomly causing functions to throw exceptions.
Code insertion: adding instructions to the target program and allowing fault injection to occur prior to certain instructions.
Time travel: forcing system clocks out of sync with each other.
Executing a routine in driver code emulating I/O errors.
Maxing out CPU cores on an Elasticsearch cluster.
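Two of the experiments above, function-based chaos and latency injection for a percentage of traffic, can be sketched as a decorator. This is a minimal illustration, not the API of any chaos tool: `chaos`, `InjectedFault`, `get_profile`, and all rates are hypothetical names and values chosen for the example.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Raised by the chaos wrapper instead of calling the real function."""


def chaos(failure_rate=0.0, latency_rate=0.0, latency_s=0.0, rng=None):
    """Wrap a function so a fraction of calls fail or are delayed."""
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Function-based chaos: throw an exception for some calls.
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            # Latency injection: delay a selected percentage of calls.
            if rng.random() < latency_rate:
                time.sleep(latency_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(failure_rate=0.2, latency_rate=0.1, latency_s=0.001, rng=random.Random(42))
def get_profile(user_id):
    """Stand-in for a call to a downstream service."""
    return {"id": user_id}


failures = 0
for user_id in range(1000):
    try:
        get_profile(user_id)
    except InjectedFault:
        failures += 1
print(f"{failures} of 1000 calls failed")
```

Keeping the rates as explicit parameters makes it easy to start with a small percentage of traffic and raise it only once the system has shown it can cope.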
Principle 4: Automate Experiments to Run Continuously
Practised by hand, Chaos Engineering is a long-running and labour-intensive process,
so automate the experiments and track metrics such as:
Time to detect
Time for Notification and Escalation
Time to public notification
Time for graceful degradation to kick in
Time for self-healing to happen
Time to recovery - partial or full
Time to all clear and stable
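When experiments run automatically, these times can be computed from timestamps recorded during each run. The sketch below assumes a hypothetical `timeline` dictionary with made-up event names and times; it reports how long after the injected fault each milestone was reached.

```python
from datetime import datetime


def time_since_fault(timeline):
    """Return seconds elapsed from fault injection to every other event."""
    start = timeline["fault_injected"]
    return {
        event: (timestamp - start).total_seconds()
        for event, timestamp in timeline.items()
        if event != "fault_injected"
    }


# Hypothetical timestamps from one experiment run.
timeline = {
    "fault_injected": datetime(2024, 1, 1, 10, 0, 0),
    "detected": datetime(2024, 1, 1, 10, 1, 30),
    "escalated": datetime(2024, 1, 1, 10, 3, 0),
    "degraded_gracefully": datetime(2024, 1, 1, 10, 3, 30),
    "recovered": datetime(2024, 1, 1, 10, 12, 0),
    "all_clear": datetime(2024, 1, 1, 10, 20, 0),
}

for event, seconds in time_since_fault(timeline).items():
    print(f"time to {event}: {seconds:.0f} s")
```

Comparing these numbers across runs shows whether detection and recovery are getting faster over time.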
Principle 5: Minimize Blast Radius
When you perform a Chaos Engineering experiment, always remember to identify
metrics like the following. (This is to ensure that the blast radius is contained
and identified.)
Who is impacted
How many workloads
What functionality
How many locations
And more
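The blast-radius questions above can be recorded as a planned limit and compared with what is observed while the experiment runs. This is a sketch under assumptions: the `BlastRadius` fields, the `exceeded_limits` helper, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    """Who and what an experiment touches (hypothetical fields)."""
    impacted_users_pct: float        # who is impacted
    workloads: int                   # how many workloads
    functionality: tuple[str, ...]   # what functionality
    locations: int                   # how many locations


def exceeded_limits(observed, limit):
    """Return the reasons the observed impact exceeds the planned limit."""
    problems = []
    if observed.impacted_users_pct > limit.impacted_users_pct:
        problems.append("too many users impacted")
    if observed.workloads > limit.workloads:
        problems.append("too many workloads impacted")
    unexpected = set(observed.functionality) - set(limit.functionality)
    if unexpected:
        problems.append(
            "unexpected functionality impacted: " + ", ".join(sorted(unexpected))
        )
    if observed.locations > limit.locations:
        problems.append("too many locations impacted")
    return problems


planned = BlastRadius(1.0, 2, ("search",), 1)
contained = BlastRadius(0.5, 1, ("search",), 1)
spreading = BlastRadius(0.5, 3, ("search", "checkout"), 1)

print(exceeded_limits(contained, planned))
print(exceeded_limits(spreading, planned))
```

A non-empty result is a signal to abort the experiment and roll back before the impact grows further.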
Demonstration
Challenges in Chaos Engineering
Challenges Faced in Chaos Engineering
No time or flexibility to simulate disasters
Teams will always be spending their time fixing things and building new features
This can be very political inside the organization
Cost involved in fixing and simulating disasters
And many more company matters that build up resistance
Thank you
References
https://chaos-mesh.org/
https://www.gremlin.com/community/tutorials/how-to-run-a-gameday/
https://principlesofchaos.org/
http://queue.acm.org/detail.cfm?id=2371297
https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/
https://en.wikipedia.org/wiki/Chaos_engineering
https://learning.oreilly.com/library/view/chaos-engineering/9781492043850/

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 

Chaos Engineering - The Art of Breaking Things in Production

  • 1. Chaos Engineering. Keet Sugathadasa | Pubudu Sitinamaluwa. Site Reliability Engineering
  • 2. Keet & Pubudu. Keet: SRE, contributor to NPM and Stack Overflow; interested in cyber security, cloud computing, and distributed computing. Pubudu: SRE, cloud computing, big data, ML, distributed computing; interested in distributed computing and machine learning.
  • 3. AGENDA: What started all this; What is Chaos Engineering; How is Chaos Engineering different from Testing Procedures; What companies are doing this; Why do Chaos Engineering?; What is Chaos Monkey; Principles of Chaos Engineering; Demonstration; Challenges Faced in Chaos Engineering
  • 5. What started all this... Netflix started to move to the cloud in August 2008. Disappearing instances in the cloud were a pain: Chaos Monkey was born. Christmas Eve 2012: AWS region failure. Chaos Engineering as a discipline. Chaos Kong was born: best-case recovery time went from 50 minutes to 6 minutes. Chaos Community Day: 2015, 40 participants; 2016, 60 participants; 2017, 150 participants. 2017: Nora Jones gave a keynote on Chaos Engineering with 40,000 in-person and 20,000 streaming attendees.
  • 6. What is Chaos Engineering. "Chaos Engineering is the discipline of experimenting on a system, in order to build confidence in the system’s capability to withstand turbulent conditions in production". Controlled and planned Chaos Engineering experiments; Preparing for unpredictable failure; Preparing engineers for failure; Making the chaos inherent in the system visible; A way to improve reliability; Helping to meet SLAs by fortifying systems; Preparing for Game Day
  • 7. What Chaos Engineering is NOT: Random Chaos Engineering experiments; Unsupervised Chaos Engineering experiments; Unexpected Chaos Engineering experiments; Breaking production by accident
  • 8. Chaos Engineering vs Other Testing Procedures. How is Chaos Engineering different from testing procedures? In testing, an assertion is made: given specific conditions, a system will emit a specific output based on the given specifications. Tests are typically binary, and determine whether a property is true or false. Strictly speaking, this does not generate new knowledge about the system; it just assigns valence to a known property of it. Experimentation generates new knowledge, and often suggests new avenues of exploration. Chaos Engineering covers multiple methods of generating such new knowledge. If you want to detect or understand a complex behavioral defect in a system, injecting communication failures is often a good choice.
  • 9. What Companies Are Doing This. Netflix may have started this, but the practice has since spread to companies in many industries all over the world: Netflix, Amazon, Dropbox, Uber, Slack, Twilio, Facebook, and many more! Amazon: Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli, Resilience Engineering: Learning to Embrace Failure, ACM Queue, Vol. 10, Iss. 9, Sept. 13, 2012. http://queue.acm.org/detail.cfm?id=2371297 Microsoft: Inside Azure Search: Chaos Engineering, Microsoft Azure Blog, July 1, 2015. https://azure.microsoft.com/enus/blog/insideazuresearchchaosengineering/ Google: Yevgeniy Sverdlik, Facebook Turned Off Entire Data Center to Test Resiliency, Data Center Knowledge, Sept. 15, 2014,
  • 10. Why do Chaos Engineering
  • 11. Why Chaos Engineering: Systems need to scale fast and smoothly; Microservice architecture is tricky; Services will fail; Dependencies on other companies will fail; Reduce the number of outages and the amount of downtime (lose less money); Prepare for real-world scenarios; Train on-call engineers to be prepared for different kinds of outages; Train development engineers to build more resilient systems; Help engineering architects make solid and reliable decisions
  • 13. What is Chaos Monkey. Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.
  • 14. Netflix Simian Army. Netflix has built an entire army of monkeys to simulate chaotic situations in the production environment; this is called the Simian Army. Some famous monkeys are: Chaos Gorilla, Chaos Kong, Latency Monkey, Doctor Monkey, etc.
  • 15. Principles of Chaos Engineering
  • 16. Principles of Chaos Engineering: Build a Hypothesis around Steady State Behavior; Vary Real-world Events; Run Experiments in Production; Automate Experiments to Run Continuously; Minimize Blast Radius
  • 17. Principles of Chaos Engineering. Principle 1: Build a Hypothesis around Steady State Behavior. Steady state is the measurable behavior of your system when it is considered healthy, for example: 5xx error rate is below 5%; p90 latency is below 500 ms; ops per second is above 10,000. Think of hypotheses as "what if" questions: What if the load balancer breaks? What if the cluster goes down? What if the auth server breaks? What if Redis becomes slow? What if latency increases by 300 ms? etc.
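As a minimal sketch, the three steady-state thresholds from the slide above could be expressed as a check that runs before and during an experiment. The `Metrics` class, its field names, and the sample numbers are assumptions made for illustration; they are not from the talk or from any particular monitoring tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Hypothetical snapshot of system metrics (names are illustrative)."""
    error_rate_5xx: float   # fraction of responses, 0.0 to 1.0
    p90_latency_ms: float
    ops_per_second: float


def steady_state_violations(m: Metrics) -> list[str]:
    """Return the steady-state conditions from the slide that do not hold."""
    violations = []
    if m.error_rate_5xx >= 0.05:
        violations.append("5xx error rate is not below 5%")
    if m.p90_latency_ms >= 500:
        violations.append("p90 latency is not below 500 ms")
    if m.ops_per_second <= 10_000:
        violations.append("ops per second is not above 10,000")
    return violations


# Made-up sample values: a healthy baseline and a reading taken during a fault.
baseline = Metrics(error_rate_5xx=0.01, p90_latency_ms=320.0, ops_per_second=12_500.0)
during_fault = Metrics(error_rate_5xx=0.02, p90_latency_ms=640.0, ops_per_second=11_000.0)

print(steady_state_violations(baseline))
print(steady_state_violations(during_fault))
```

An empty list means the hypothesis still holds; any entry is a signal to stop the experiment and investigate.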
  • 18. Principles of Chaos Engineering. Principle 2: Vary Real-world Events. Hardware failures; Functional bugs; State transmission errors (e.g., inconsistency of states between sender and receiver nodes); Network latency and partition; Large fluctuations in input (up or down) and retry storms; Resource exhaustion; Unusual or unpredictable combinations of inter-service communication; Byzantine failures (e.g., a node believing it has the most current data when it actually does not); Race conditions; Downstream dependency malfunctions
  • 19. Principles of Chaos Engineering Principle 3: Run Experiments in Production Simulating the failure of an entire region or datacenter. Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production. Injecting latency between services for a selected percentage of traffic over a predetermined period of time. Function-based chaos (runtime injection): randomly causing functions to throw exceptions. Code insertion: Adding instructions to the target program and allowing fault injection to occur prior to certain instructions. Time travel: forcing system clocks out of sync with each other. Executing a routine in driver code emulating I/O errors. Maxing out CPU cores on an Elasticsearch cluster.
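One of the experiments listed above, injecting latency for a selected percentage of traffic, can be sketched as a wrapper around a service call. This is an illustration under stated assumptions: `with_injected_latency`, `lookup_user`, and the 10% / 300 ms values are hypothetical, and a real setup would more likely inject latency at the network or proxy layer.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_injected_latency(
    call: Callable[..., T],
    delay_s: float,
    fraction: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., T]:
    """Wrap `call` so that roughly `fraction` of invocations are delayed by `delay_s`."""
    def wrapped(*args, **kwargs):
        if rng.random() < fraction:
            sleep(delay_s)  # the injected fault: extra latency before the real call
        return call(*args, **kwargs)
    return wrapped


def lookup_user(user_id: int) -> str:
    """Hypothetical downstream call used only for this example."""
    return f"user-{user_id}"


# Record the delays instead of really sleeping, so the example runs instantly.
slept = []
chaotic_lookup = with_injected_latency(
    lookup_user, delay_s=0.3, fraction=0.1, rng=random.Random(42), sleep=slept.append
)
results = [chaotic_lookup(i) for i in range(1000)]
print(f"{len(slept)} of {len(results)} calls were delayed")
```

Passing the random generator and the sleep function in as parameters keeps the experiment reproducible and easy to switch off.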
  • 20. Principles of Chaos Engineering. Principle 4: Automate Experiments to Run Continuously. Chaos Engineering is a long-running and labour-intensive practice, so experiments should be automated. Timings to measure: Time to detect; Time for notification and escalation; Time to public notification; Time for graceful degradation to kick in; Time for self-healing to happen; Time to recovery (partial or full); Time to all clear and stable
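An automated run could derive some of the timings listed above from timestamps recorded during the experiment. The sketch below assumes a hypothetical event log; the event names and timestamps are invented for the example.

```python
from __future__ import annotations

from datetime import datetime


def experiment_timings(events: dict[str, datetime]) -> dict[str, float]:
    """Seconds elapsed from fault injection to every other recorded event."""
    start = events["fault_injected"]
    return {
        name: (timestamp - start).total_seconds()
        for name, timestamp in events.items()
        if name != "fault_injected"
    }


# Hypothetical event log from one experiment run.
events = {
    "fault_injected": datetime(2024, 1, 1, 12, 0, 0),
    "detected": datetime(2024, 1, 1, 12, 0, 45),
    "escalated": datetime(2024, 1, 1, 12, 2, 0),
    "recovered": datetime(2024, 1, 1, 12, 6, 0),
}

for name, seconds in experiment_timings(events).items():
    print(f"time to {name}: {seconds:.0f} s")
```

Tracking these numbers across repeated runs shows whether detection and recovery are getting faster over time.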
  • 21. Principles of Chaos Engineering. Principle 5: Minimize Blast Radius. When you perform a Chaos Engineering experiment, always remember to identify metrics like the following, to ensure that the blast radius is contained and identified: Who is impacted; How many workloads; What functionality; How many locations; And more
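The blast-radius questions above can be turned into a simple pre-flight check that compares a planned experiment against agreed limits. This is a sketch only: the `BlastRadius` fields mirror the slide's questions, while the field names, the limits, and the sample plan are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    impacted_users: int            # who is impacted (here reduced to a user count)
    workloads: int                 # how many workloads
    functionality: frozenset[str]  # what functionality
    locations: int                 # how many locations


def blast_radius_violations(planned: BlastRadius, limit: BlastRadius) -> list[str]:
    """List the ways in which the planned experiment exceeds the agreed limit."""
    problems = []
    if planned.impacted_users > limit.impacted_users:
        problems.append("too many users impacted")
    if planned.workloads > limit.workloads:
        problems.append("too many workloads")
    if not planned.functionality <= limit.functionality:
        problems.append("touches functionality outside the agreed scope")
    if planned.locations > limit.locations:
        problems.append("too many locations")
    return problems


# Hypothetical limit and plan.
limit = BlastRadius(1_000, 2, frozenset({"search", "recommendations"}), 1)
plan = BlastRadius(500, 3, frozenset({"search", "checkout"}), 1)

print(blast_radius_violations(plan, limit))
```

Running the check before every experiment, and refusing to start when the list is not empty, is one way to keep the blast radius contained.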
  • 22. Principles of Chaos Engineering
  • 25. Challenges in Chaos Engineering
  • 26. Challenges Faced in Chaos Engineering: No time or flexibility to simulate disasters; Teams will always be spending their time fixing things and building new features; This can be very political inside the organization; Cost involved in fixing and simulating disasters; And many more company matters that build up resistance