The Case for Chaos – AWS Pop-up Loft
Bruce Wong – Engineering Manager – Chaos Engineering, Netflix
Who am I?
Bruce Wong
@bruce_m_wong
• Netflix since 2010
• Computer Science
• Builds engineering teams
  • 5 different teams so far
Agenda
• Why?
• Case studies
• How you can start chaos testing
• Future chaos
Failure is Unavoidable
• Disks fail
• Power goes out, and then your generator fails
• Software bugs
• Human error
What about the cloud?
Cloud Case Study
• XSA-108 security vulnerability
• ~10% of EC2 instances rebooted
• Spread over 5 days
• One availability zone at a time
Chaos Validated + Public Cloud Validated
Netflix & Micro-Services
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
Graceful Degradation
Product + Engineering Decision
Designing for Failure
• Infrastructure failure
  • Instance terminations – single points of failure
  • Latency
  • Availability zone
  • Regional
• Application failure
  • Graceful degradation
  • Software bugs
Testing
• Unit testing
• Integration testing
• Functional testing
• Regression testing
• Chaos testing
Finding bugs earlier
Resilience needs to be tested
Testing is hard
• Large and growing data sets
• Internet-scale traffic
• Innovation and new features
• Change is constant
Resilience needs to be tested
• Validate resilience design
• Don’t wait for the next outage
  • Uncontrolled
  • Unpredictable
Hope is not a strategy
Types of Chaos
Instances Fail
Lessons
• Be as stateless as possible
• Autoscaling groups are good
• Invest in automation to rebuild state when necessary
• Running Chaos Monkey on Cassandra (C*)
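The instance-failure lessons above can be illustrated with a small simulation. This is a minimal sketch, not Netflix's Chaos Monkey: `ToyGroup`, `heal`, and `terminate_random_instance` are hypothetical names, and the group is an in-memory stand-in for an autoscaling group with no cloud API involved. It shows why a fixed desired capacity plus stateless instances makes a random termination survivable.

```python
import random
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyGroup:
    """In-memory stand-in for an autoscaling group (hypothetical, no cloud API)."""
    desired: int
    instances: set = field(default_factory=set)
    _ids: count = field(default_factory=count, repr=False)

    def heal(self) -> None:
        # Launch replacements until the group is back at desired capacity.
        while len(self.instances) < self.desired:
            self.instances.add(f"i-{next(self._ids):04d}")


def terminate_random_instance(group: ToyGroup, rng: random.Random) -> str:
    """Remove one randomly chosen instance from the group and return its id."""
    victim = rng.choice(sorted(group.instances))
    group.instances.remove(victim)
    return victim
```

Because the replacement instance gets a fresh id and carries no state from the victim, anything the service needs must live outside the instance or be rebuilt by automation.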
Types of Chaos
Many Instances can Fail
Lessons
• Cassandra works as expected
• Moving traffic back to steady state is just as hard
• Infrastructure management tools can be a bottleneck
Types of Chaos
Natural Disasters Happen
Lessons
• Cassandra works as expected
• Moving traffic back to steady state is just as hard
• Infrastructure management can be a bottleneck
• Smaller blast radius benefits
• Traffic + capacity orchestration is hard
Types of Chaos
Latency
Still Learning
• Functional fallbacks don’t account for system limitations
  • Thread pools
  • Connection pools
• Slow can be hard to find
• Slow can be hard to contain
• Unbounded queues are BAD
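One way to make a fallback respect thread-pool limits is sketched below. It is an illustration under stated assumptions, not the approach used at Netflix: `BoundedCaller` and `slow_dependency` are hypothetical names. The caller times out slow work and, when every worker is busy, falls back immediately instead of queueing more work behind the slow calls.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BoundedCaller:
    """Run calls on a small pool, rejecting work instead of queueing it."""

    def __init__(self, max_in_flight: int, timeout_s: float):
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._timeout_s = timeout_s

    def _run(self, fn):
        try:
            return fn()
        finally:
            # Free the slot only when the work really ends, so a timed-out
            # call still counts against the limit while its thread is busy.
            self._slots.release()

    def call(self, primary, fallback):
        # No free slot: fall back right away rather than wait in a queue.
        if not self._slots.acquire(blocking=False):
            return fallback()
        future = self._pool.submit(self._run, primary)
        try:
            return future.result(timeout=self._timeout_s)
        except Exception:
            # Covers both a timeout and a failure inside primary.
            return fallback()


def slow_dependency():
    """Hypothetical dependency that responds too slowly."""
    time.sleep(0.3)
    return "late"
```

With `BoundedCaller(max_in_flight=1, timeout_s=0.05)`, a call to `slow_dependency` times out and returns the fallback, and further calls keep getting the fallback until the slow call finishes and releases its slot.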
Unbounded Queues
• Come in many forms, to name a few
  • Threads
  • Memory
  • Disk
• Bounded by physical limitations
• VERY difficult to find
• Elastic is not infinite
For Example: Memory and Data
• Data is important
• In-memory queue grows and shrinks
• Failure mode #1: out of memory
  • NOT A MEMORY LEAK!
• If queue gets to size X
  • Write to disk
  • Flush later
• Failure mode #2
  • Disk full
  • File descriptors saturated
• Data is important … but not as important as uptime
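The trade-off of data against uptime can be sketched as a bounded buffer that sheds load instead of growing until memory or disk runs out. This is a minimal illustration, assuming that dropping the oldest entries is acceptable for the data in question; `SheddingQueue` is a hypothetical name, not something from the talk.

```python
from collections import deque


class SheddingQueue:
    """Bounded in-memory buffer that drops the oldest item when full."""

    def __init__(self, max_items: int):
        self._items = deque(maxlen=max_items)
        self.dropped = 0

    def put(self, item) -> None:
        if len(self._items) == self._items.maxlen:
            # Count what is shed so the loss shows up in metrics.
            self.dropped += 1
        # deque(maxlen=...) evicts the oldest entry when it is full.
        self._items.append(item)

    def drain(self) -> list:
        """Return all buffered items, oldest first, and empty the buffer."""
        items = list(self._items)
        self._items.clear()
        return items
```

Memory use stays bounded by `max_items` no matter how long the consumer is stalled, and the `dropped` counter makes the cost of that choice measurable.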
Starting Chaos
• Start small, very small
• Start with simple, stateless systems
• Start manually and in a coordinated way
  • Failure Injection Fridays
• Build confidence
• Outages are opportunities
Chaos takes time
[Timeline: 2010, 2012, 2014]
Aspirational Chaos
• Increase frequency & intensity
  • Reduces chance of drift
• Infrastructure
  • Continuous latency injection
  • Chaos Gorilla random AZ weekly
  • Latency Gorilla
  • CPU, memory, disk
• Application
  • Continuous validation of fallbacks
  • Startup dependency failure injection
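Latency injection at the application level can be sketched as a decorator that delays a fraction of calls. This is an illustrative sketch rather than any Netflix tool: `inject_latency` and its parameters are hypothetical, and the `sleep` argument exists only so the delay can be observed without actually waiting.

```python
import functools
import random
import time


def inject_latency(delay_s: float, probability: float, rng=None, sleep=time.sleep):
    """Delay roughly `probability` of the calls to the wrapped function by `delay_s`."""
    rng = rng or random.Random()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0.0, 1.0), so 0.0 never delays and 1.0 always does.
            if rng.random() < probability:
                sleep(delay_s)
            return fn(*args, **kwargs)

        return wrapper

    return decorator
```

Starting with a low probability on a single stateless service keeps the blast radius small, in line with the advice to start small and build confidence.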
Questions


Editor's notes

  1. Fantastic, because we didn’t have to do anything. If we were running our own hardware, we would have had to do the reboots ourselves.
  2. The Netflix experience. Services that make the experience possible: “because you watched”, search, profiles, localization.
  3. Gracefully degraded Netflix experience. Miss it? “Evidence”.
  4. Ratings