© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Chaos Engineering and Scalability
at Audible.com
Tyler Lund
Director, Software Engineering
Audible.com
ARC308
Agenda
What I talk about when I talk about distributed architectures
The curious case of the broken downloads
A song of ice and chaos
Principles of chaos and the Goblet of Fire
The art and Zen of implementing chaos
What happened?
Largest producer and retailer of audiobooks … worldwide!
To unleash the power of the spoken word
Audible members listened for almost 3 BILLION hours in 2017
Audible downloads
[Diagram: Ownership, Stats, Table DB, Activation, Audible Download Service, Content Delivery Service, Get Static Metadata, Content Delivery Engine]
[Diagram, repeated across several slides: Service A, Service B, Service C, Service D]
Audible downloads
[Diagram, shown on three consecutive slides: Ownership, Stats, Table DB, Activation, Audible Download Service, Content Delivery Service, Get Static Metadata, Content Delivery Engine]
Goals of chaos engineering
Distributed systems are complex. No one person can understand the entirety.
Chaos is inherent. It’s better to accept the chaos.
Gain confidence in understanding the system through experiments.
Unit Testing
[Diagram: Input, Component A, Output]
Integration Testing
[Diagram: Input, Service A, Output]
What is chaos engineering?
Chaos engineering is the discipline of experimenting on distributed systems to gain confidence in their behavior.
Fail calls between services or add latency
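As a rough illustration of failing calls between services or adding latency, the sketch below wraps an ordinary function call. The names `with_chaos`, `InjectedFailure`, and `get_metadata` are hypothetical and are not taken from the talk; the random source and the sleep function are passed in so the behavior can be checked without waiting.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real downstream error during an experiment."""


def with_chaos(call, latency_s=0.0, failure_rate=0.0,
               rng=random.random, sleep=time.sleep):
    """Wrap a service call so it can be delayed or failed on purpose."""
    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # add latency before the real call
        if rng() < failure_rate:
            # fail the call instead of reaching the dependency
            raise InjectedFailure("injected by chaos experiment")
        return call(*args, **kwargs)
    return wrapped


def get_metadata(book_id):
    # Hypothetical stand-in for a call from one service to another.
    return {"book_id": book_id, "title": "example"}


if __name__ == "__main__":
    slow_call = with_chaos(get_metadata, latency_s=0.2)
    print(slow_call("B000"))
```

A caller that handles `InjectedFailure` the same way it handles a real timeout or error is the behavior such an experiment would be looking for.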
Implementing chaos engineering
Socialization
Monitoring
Graceful restarts and degradation
Targeted chaos
Cause a cascading failure
Build a failure injection framework
Create a chaos automation platform
Socialization
Acknowledge the complexity of the system
Get support from the business
Never let a good problem go to waste
Socialization
• First CPU hog experiment in 2012
• Socialized cause and impact of download issue
• Documented system architecture
• Created file access experiments
Getting started with chaos
Define the steady state
Start with non-critical services, in QA
Only experiment on the services of teams that want to take part
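One possible way to define the steady state is as a band around a metric's normal value. The sketch below assumes a band of mean plus or minus three sample standard deviations; that threshold and the function names are illustrative choices, not something the talk specifies.

```python
from statistics import mean, stdev


def steady_state(samples, tolerance_sigmas=3.0):
    """Return the (low, high) band a metric normally stays inside."""
    centre = mean(samples)
    spread = stdev(samples)  # sample standard deviation, needs >= 2 samples
    return (centre - tolerance_sigmas * spread,
            centre + tolerance_sigmas * spread)


def within_steady_state(value, band):
    """True when an observed value falls inside the steady-state band."""
    low, high = band
    return low <= value <= high


if __name__ == "__main__":
    # Hypothetical per-second readings of a business metric.
    baseline = [100, 102, 98, 101, 99]
    band = steady_state(baseline)
    print(band, within_steady_state(101, band), within_steady_state(80, band))
```

An experiment then asks whether the metric stays inside this band while the fault is active.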
Monitoring
What are your key business metrics?
Playback starts/second
Adds to cart
Orders
Membership signups
Monitoring
Graceful restarts and degradation
• Start with on/off
• Bring down a host, service, or DB
• Spike the CPU
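A CPU spike can be approximated with a bounded busy loop. This is only a minimal sketch: `spike_cpu` is a hypothetical helper, it loads a single core, and a real experiment would likely run one such process per core and stop it from outside.

```python
import time


def spike_cpu(duration_s):
    """Busy-loop on one core for about duration_s seconds.

    Returns the number of loop iterations completed.
    """
    deadline = time.monotonic() + duration_s
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless work that keeps the core busy
    return iterations


if __name__ == "__main__":
    print(spike_cpu(1.0))
```

The fixed deadline matters: the experiment ends on its own even if whatever started it goes away.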
Targeted chaos
Start with your own team or one prepared for it
Look at large recent issues
Cause a cascading failure
Define a hypothesis about the system
Test the hypothesis
Break something other teams are dependent on
Types of experiments
Add latency
Make services and dependencies unavailable
Exceptions
Packet loss
Failed requests
Resource contention
Create a failure injection framework
• Web UI
• Host agent
• Service framework injection
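The three pieces listed above could fit together roughly as follows: something like a web UI writes fault settings, and a hook in the service framework reads them around each outbound call. Everything in this sketch (`Fault`, `FaultRegistry`, `call_dependency`, the in-memory store) is an assumption made for illustration, not Audible's framework.

```python
import time
from dataclasses import dataclass


@dataclass
class Fault:
    latency_s: float = 0.0  # delay to add before the call
    fail: bool = False      # whether to fail the call outright


class FaultRegistry:
    """In-memory stand-in for state a web UI would write and agents would read."""

    def __init__(self):
        self._faults = {}

    def enable(self, target, fault):
        self._faults[target] = fault

    def disable(self, target):
        self._faults.pop(target, None)

    def lookup(self, target):
        return self._faults.get(target)


def call_dependency(registry, target, call, sleep=time.sleep):
    """Hook a service framework could run around every outbound call."""
    fault = registry.lookup(target)
    if fault is not None:
        if fault.latency_s:
            sleep(fault.latency_s)
        if fault.fail:
            raise ConnectionError(f"injected failure calling {target}")
    return call()


if __name__ == "__main__":
    registry = FaultRegistry()
    registry.enable("content-delivery", Fault(latency_s=0.1))
    print(call_dependency(registry, "content-delivery", lambda: "ok"))
```

Keeping the fault settings outside the service code means an experiment can be switched off without a deployment.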
[Diagram: Service A, Service B, Latency Injection, Injector]
Gremlin proxy injector
Audible downloads
[Diagram: Ownership, Stats, Table DB, Activation, Audible Download Service, Content Delivery Service, Get Static Metadata, Content Delivery Engine]
Create a chaos automation platform
• Calculate the maximum impact the KPI can experience
• Run control and experiment clusters
• Route based on impact
• Stop experiment if KPI suffers or customers experience pain
• Investigate the problem when there is time, not while being paged
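The first bullet can be read as simple arithmetic: if only a fraction of traffic reaches the experiment cluster, the KPI can drop by at most that fraction times the worst-case loss inside the cluster. The sketch below works backwards from a KPI budget to a traffic share. The function name and both parameters are assumptions for illustration.

```python
def max_experiment_fraction(kpi_budget, worst_case_loss):
    """Largest share of traffic the experiment cluster may take.

    kpi_budget: fraction of the KPI we are willing to lose (0.005 = 0.5%).
    worst_case_loss: fraction of the KPI lost for traffic inside the
        experiment cluster if the fault hurts as much as it possibly can
        (1.0 means every affected request is lost).
    """
    if kpi_budget < 0:
        raise ValueError("kpi_budget must not be negative")
    if not 0 < worst_case_loss <= 1:
        raise ValueError("worst_case_loss must be in (0, 1]")
    return min(1.0, kpi_budget / worst_case_loss)


if __name__ == "__main__":
    # Willing to lose at most 1% of the KPI, fault may fail every request.
    print(max_experiment_fraction(0.01, 1.0))
```

With a 1% budget and a fault that can fail every request, the experiment cluster should receive no more than 1% of traffic.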
[Diagram: Client, LB, Chaos Automation, Service A - Prod, Service A - Control, Service A - Experiment; traffic split 98% / 1% / 1%]
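A weighted split like the one in the diagram can be sketched with a weighted random choice. The mapping assumed here (98% to Prod, 1% each to Control and Experiment) is inferred from the slide labels, and `pick_cluster` and `simulate` are hypothetical names.

```python
import random

# Assumed mapping of the slide's percentages to clusters.
WEIGHTS = {"prod": 98, "control": 1, "experiment": 1}


def pick_cluster(rng):
    """Choose a cluster for one request according to WEIGHTS."""
    clusters = list(WEIGHTS)
    weights = [WEIGHTS[c] for c in clusters]
    return rng.choices(clusters, weights=weights, k=1)[0]


def simulate(requests, seed=0):
    """Route a number of requests and count where they landed."""
    rng = random.Random(seed)
    counts = dict.fromkeys(WEIGHTS, 0)
    for _ in range(requests):
        counts[pick_cluster(rng)] += 1
    return counts


if __name__ == "__main__":
    print(simulate(10_000))
```

Giving the control cluster the same small share as the experiment cluster means the two see comparable traffic, so their KPIs can be compared directly.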
High Deviation:
Stop Experiment
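The stop condition can be sketched as a comparison between the control and experiment clusters. The 5% threshold, the use of relative deviation, and the function names are assumptions; the slide only says that a high deviation stops the experiment.

```python
def relative_deviation(control_kpi, experiment_kpi):
    """Deviation of the experiment KPI from control, as a fraction of control."""
    if control_kpi == 0:
        raise ValueError("control KPI is zero; cannot compare")
    return abs(experiment_kpi - control_kpi) / control_kpi


def should_stop(control_kpi, experiment_kpi, max_deviation=0.05):
    """True when the experiment has drifted too far from control."""
    return relative_deviation(control_kpi, experiment_kpi) > max_deviation


def monitor(samples, max_deviation=0.05):
    """Return the index of the first (control, experiment) pair that trips
    the stop condition, or None if the experiment may run to completion."""
    for i, (control, experiment) in enumerate(samples):
        if should_stop(control, experiment, max_deviation):
            return i
    return None


if __name__ == "__main__":
    print(monitor([(100, 99), (100, 97), (100, 80)]))
```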
Run chaos all the time, everywhere
• Run on all services, critical and non-critical
• Run with limited warning
• Run in production
• Run often
• Get feedback
Results
• Show examples of resilience.
• Link back to previous issues.
• Can’t be proven with integration tests.
• Find issues you wouldn’t otherwise find, such as retry storms.
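To see why retry storms are hard to spot without an experiment, consider a chain of services where each caller retries a failing call. In the worst case the load on the deepest service grows exponentially with the number of retrying callers. The sketch below is a general illustration of that arithmetic, not a description of Audible's system.

```python
def worst_case_calls(attempts_per_caller, retrying_callers):
    """Calls reaching the deepest service for one user request
    when every attempt fails."""
    return attempts_per_caller ** retrying_callers


def simulate(attempts_per_caller, retrying_callers):
    """Count the same number by actually walking the call chain."""
    calls = 0

    def call(depth):
        nonlocal calls
        if depth == retrying_callers:
            calls += 1
            return False  # the deepest service always fails
        for _ in range(attempts_per_caller):
            if call(depth + 1):
                return True
        return False

    call(0)
    return calls


if __name__ == "__main__":
    print(worst_case_calls(3, 3), simulate(3, 3))
```

With three attempts at each of three retrying callers, a single failing user request turns into 27 calls on the deepest service.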
Our results
• Development team ownership
• Prioritizing experiments
• Building a framework
• Preventing customer impact
Takeaways
• Everyone can and should do chaos
• Implementing chaos educates and improves engineering teams
• Do no harm to the customer
• Involve business partners
Chaos engineering doesn’t cause problems. Chaos engineering reveals them.
Related breakouts
Tuesday, Nov 27
ARC314 – Globalizing Player Accounts at Riot Games While Maintaining Availability
1:45 PM – 2:45 PM | Aria East, Plaza Level, Orovada 2
Tuesday, Nov 27
ARC307 – How Intuit TurboTax Ran Entirely on AWS for 2017 Taxes
10:45 AM - 11:45 AM | Venetian, Level 2, Venetian F
Thank you!
Tyler Lund
tlund@audible.com
@tylopoda
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Contenu connexe

Tendances

Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...
Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...
Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...Amazon Web Services
 
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...Amazon Web Services
 
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...Amazon Web Services
 
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...Amazon Web Services
 
Role of Central Teams in DevOps Organizations (DEV370) - AWS re:Invent 2018
Role of Central Teams in DevOps Organizations (DEV370) - AWS re:Invent 2018Role of Central Teams in DevOps Organizations (DEV370) - AWS re:Invent 2018
Role of Central Teams in DevOps Organizations (DEV370) - AWS re:Invent 2018Amazon Web Services
 
Move Data to AWS Faster for Migrations, DR, & Bidirectional Workflows (STG382...
Move Data to AWS Faster for Migrations, DR, & Bidirectional Workflows (STG382...Move Data to AWS Faster for Migrations, DR, & Bidirectional Workflows (STG382...
Move Data to AWS Faster for Migrations, DR, & Bidirectional Workflows (STG382...Amazon Web Services
 
How Cox Automotive Runs GitHub Enterprise on AWS (ENT356-S) - AWS re:Invent 2018
How Cox Automotive Runs GitHub Enterprise on AWS (ENT356-S) - AWS re:Invent 2018How Cox Automotive Runs GitHub Enterprise on AWS (ENT356-S) - AWS re:Invent 2018
How Cox Automotive Runs GitHub Enterprise on AWS (ENT356-S) - AWS re:Invent 2018Amazon Web Services
 
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018Amazon Web Services
 
How Enterprises Are Modernizing Their Security, Risk Management, & Compliance...
How Enterprises Are Modernizing Their Security, Risk Management, & Compliance...How Enterprises Are Modernizing Their Security, Risk Management, & Compliance...
How Enterprises Are Modernizing Their Security, Risk Management, & Compliance...Amazon Web Services
 
Enhancing Media Workflows with Machine Learning (MAE303) - AWS re:Invent 2018
Enhancing Media Workflows with Machine Learning (MAE303) - AWS re:Invent 2018Enhancing Media Workflows with Machine Learning (MAE303) - AWS re:Invent 2018
Enhancing Media Workflows with Machine Learning (MAE303) - AWS re:Invent 2018Amazon Web Services
 
Accelerate Digital Transformation for Telecom Operators with Cloud-Native Amd...
Accelerate Digital Transformation for Telecom Operators with Cloud-Native Amd...Accelerate Digital Transformation for Telecom Operators with Cloud-Native Amd...
Accelerate Digital Transformation for Telecom Operators with Cloud-Native Amd...Amazon Web Services
 
Improve Accessibility Using Machine Learning (AIM332) - AWS re:Invent 2018
Improve Accessibility Using Machine Learning (AIM332) - AWS re:Invent 2018Improve Accessibility Using Machine Learning (AIM332) - AWS re:Invent 2018
Improve Accessibility Using Machine Learning (AIM332) - AWS re:Invent 2018Amazon Web Services
 
Deploy Serverless Apps with Python: AWS Chalice Deep Dive (DEV427-R2) - AWS r...
Deploy Serverless Apps with Python: AWS Chalice Deep Dive (DEV427-R2) - AWS r...Deploy Serverless Apps with Python: AWS Chalice Deep Dive (DEV427-R2) - AWS r...
Deploy Serverless Apps with Python: AWS Chalice Deep Dive (DEV427-R2) - AWS r...Amazon Web Services
 
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...Amazon Web Services
 
Build Workflows with Amazon CloudFront, Amazon Route 53, & Lambda@Edge (CTD40...
Build Workflows with Amazon CloudFront, Amazon Route 53, & Lambda@Edge (CTD40...Build Workflows with Amazon CloudFront, Amazon Route 53, & Lambda@Edge (CTD40...
Build Workflows with Amazon CloudFront, Amazon Route 53, & Lambda@Edge (CTD40...Amazon Web Services
 
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...Amazon Web Services
 
Create a Frictionless, Connected Retail Checkout Experience Inspired by Amazo...
Create a Frictionless, Connected Retail Checkout Experience Inspired by Amazo...Create a Frictionless, Connected Retail Checkout Experience Inspired by Amazo...
Create a Frictionless, Connected Retail Checkout Experience Inspired by Amazo...Amazon Web Services
 
Operations for Containerized Applications (CON334-R1) - AWS re:Invent 2018
Operations for Containerized Applications (CON334-R1) - AWS re:Invent 2018Operations for Containerized Applications (CON334-R1) - AWS re:Invent 2018
Operations for Containerized Applications (CON334-R1) - AWS re:Invent 2018Amazon Web Services
 
Deep Dive on AWS CloudHSM (SEC358-R1) - AWS re:Invent 2018
Deep Dive on AWS CloudHSM (SEC358-R1) - AWS re:Invent 2018Deep Dive on AWS CloudHSM (SEC358-R1) - AWS re:Invent 2018
Deep Dive on AWS CloudHSM (SEC358-R1) - AWS re:Invent 2018Amazon Web Services
 
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...Amazon Web Services
 

Tendances (20)

Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...
Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...
Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...
 
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...
 
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...
 
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...
 
Role of Central Teams in DevOps Organizations (DEV370) - AWS re:Invent 2018
Role of Central Teams in DevOps Organizations (DEV370) - AWS re:Invent 2018Role of Central Teams in DevOps Organizations (DEV370) - AWS re:Invent 2018
Role of Central Teams in DevOps Organizations (DEV370) - AWS re:Invent 2018
 
Move Data to AWS Faster for Migrations, DR, & Bidirectional Workflows (STG382...
Move Data to AWS Faster for Migrations, DR, & Bidirectional Workflows (STG382...Move Data to AWS Faster for Migrations, DR, & Bidirectional Workflows (STG382...
Move Data to AWS Faster for Migrations, DR, & Bidirectional Workflows (STG382...
 
How Cox Automotive Runs GitHub Enterprise on AWS (ENT356-S) - AWS re:Invent 2018
How Cox Automotive Runs GitHub Enterprise on AWS (ENT356-S) - AWS re:Invent 2018How Cox Automotive Runs GitHub Enterprise on AWS (ENT356-S) - AWS re:Invent 2018
How Cox Automotive Runs GitHub Enterprise on AWS (ENT356-S) - AWS re:Invent 2018
 
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018
 
How Enterprises Are Modernizing Their Security, Risk Management, & Compliance...
How Enterprises Are Modernizing Their Security, Risk Management, & Compliance...How Enterprises Are Modernizing Their Security, Risk Management, & Compliance...
How Enterprises Are Modernizing Their Security, Risk Management, & Compliance...
 
Enhancing Media Workflows with Machine Learning (MAE303) - AWS re:Invent 2018
Enhancing Media Workflows with Machine Learning (MAE303) - AWS re:Invent 2018Enhancing Media Workflows with Machine Learning (MAE303) - AWS re:Invent 2018
Enhancing Media Workflows with Machine Learning (MAE303) - AWS re:Invent 2018
 
Accelerate Digital Transformation for Telecom Operators with Cloud-Native Amd...
Accelerate Digital Transformation for Telecom Operators with Cloud-Native Amd...Accelerate Digital Transformation for Telecom Operators with Cloud-Native Amd...
Accelerate Digital Transformation for Telecom Operators with Cloud-Native Amd...
 
Improve Accessibility Using Machine Learning (AIM332) - AWS re:Invent 2018
Improve Accessibility Using Machine Learning (AIM332) - AWS re:Invent 2018Improve Accessibility Using Machine Learning (AIM332) - AWS re:Invent 2018
Improve Accessibility Using Machine Learning (AIM332) - AWS re:Invent 2018
 
Deploy Serverless Apps with Python: AWS Chalice Deep Dive (DEV427-R2) - AWS r...
Deploy Serverless Apps with Python: AWS Chalice Deep Dive (DEV427-R2) - AWS r...Deploy Serverless Apps with Python: AWS Chalice Deep Dive (DEV427-R2) - AWS r...
Deploy Serverless Apps with Python: AWS Chalice Deep Dive (DEV427-R2) - AWS r...
 
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
 
Build Workflows with Amazon CloudFront, Amazon Route 53, & Lambda@Edge (CTD40...
Build Workflows with Amazon CloudFront, Amazon Route 53, & Lambda@Edge (CTD40...Build Workflows with Amazon CloudFront, Amazon Route 53, & Lambda@Edge (CTD40...
Build Workflows with Amazon CloudFront, Amazon Route 53, & Lambda@Edge (CTD40...
 
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
 
Create a Frictionless, Connected Retail Checkout Experience Inspired by Amazo...
Create a Frictionless, Connected Retail Checkout Experience Inspired by Amazo...Create a Frictionless, Connected Retail Checkout Experience Inspired by Amazo...
Create a Frictionless, Connected Retail Checkout Experience Inspired by Amazo...
 
Operations for Containerized Applications (CON334-R1) - AWS re:Invent 2018
Operations for Containerized Applications (CON334-R1) - AWS re:Invent 2018Operations for Containerized Applications (CON334-R1) - AWS re:Invent 2018
Operations for Containerized Applications (CON334-R1) - AWS re:Invent 2018
 
Deep Dive on AWS CloudHSM (SEC358-R1) - AWS re:Invent 2018
Deep Dive on AWS CloudHSM (SEC358-R1) - AWS re:Invent 2018Deep Dive on AWS CloudHSM (SEC358-R1) - AWS re:Invent 2018
Deep Dive on AWS CloudHSM (SEC358-R1) - AWS re:Invent 2018
 
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
 

Similaire à Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018

Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...Amazon Web Services
 
Remove Undifferentiated Heavy Lifting from CI/CD Toolsets with Corteva Agrisc...
Remove Undifferentiated Heavy Lifting from CI/CD Toolsets with Corteva Agrisc...Remove Undifferentiated Heavy Lifting from CI/CD Toolsets with Corteva Agrisc...
Remove Undifferentiated Heavy Lifting from CI/CD Toolsets with Corteva Agrisc...Amazon Web Services
 
Dev348 ReInvent Corteva Agriscience
Dev348   ReInvent Corteva AgriscienceDev348   ReInvent Corteva Agriscience
Dev348 ReInvent Corteva AgriscienceRandy Black
 
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringAmazon Web Services
 
Operating at Scale- Preparing for the Journey [Portuguese]
Operating at Scale- Preparing for the Journey [Portuguese]Operating at Scale- Preparing for the Journey [Portuguese]
Operating at Scale- Preparing for the Journey [Portuguese]Amazon Web Services
 
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech TalksLife of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech TalksAmazon Web Services
 
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...Amazon Web Services
 
Resiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the CloudResiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the CloudAmazon Web Services
 
Transforming Product Development - AWS Transformation Day 2018: Detroit
Transforming Product Development - AWS Transformation Day 2018: DetroitTransforming Product Development - AWS Transformation Day 2018: Detroit
Transforming Product Development - AWS Transformation Day 2018: DetroitAmazon Web Services
 
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...Amazon Web Services
 
Transforming Product Development- AWS Transformation Day Raleigh 2018.pdf
Transforming Product Development- AWS Transformation Day Raleigh 2018.pdfTransforming Product Development- AWS Transformation Day Raleigh 2018.pdf
Transforming Product Development- AWS Transformation Day Raleigh 2018.pdfAmazon Web Services
 
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedKeynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedAWS User Group Bengaluru
 
Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Adrian Hornsby
 
Transforming Product Development - AWS Transformation Day Boston 2018
Transforming Product Development - AWS Transformation Day Boston 2018Transforming Product Development - AWS Transformation Day Boston 2018
Transforming Product Development - AWS Transformation Day Boston 2018Amazon Web Services
 
Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...
Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...
Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...Amazon Web Services
 
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018Amazon Web Services Korea
 
Transforming Product Development- Transformation Day Philadelphia 2018
Transforming Product Development- Transformation Day Philadelphia 2018Transforming Product Development- Transformation Day Philadelphia 2018
Transforming Product Development- Transformation Day Philadelphia 2018Amazon Web Services
 

Similaire à Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018 (20)

Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
 
Remove Undifferentiated Heavy Lifting from CI/CD Toolsets with Corteva Agrisc...
Remove Undifferentiated Heavy Lifting from CI/CD Toolsets with Corteva Agrisc...Remove Undifferentiated Heavy Lifting from CI/CD Toolsets with Corteva Agrisc...
Remove Undifferentiated Heavy Lifting from CI/CD Toolsets with Corteva Agrisc...
 
Dev348 ReInvent Corteva Agriscience
Dev348   ReInvent Corteva AgriscienceDev348   ReInvent Corteva Agriscience
Dev348 ReInvent Corteva Agriscience
 
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos Engineering
 
Operating at Scale- Preparing for the Journey [Portuguese]
Operating at Scale- Preparing for the Journey [Portuguese]Operating at Scale- Preparing for the Journey [Portuguese]
Operating at Scale- Preparing for the Journey [Portuguese]
 
Chaos Engineering
Chaos EngineeringChaos Engineering
Chaos Engineering
 
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech TalksLife of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
 
Are you Well-Architected?
Are you Well-Architected?Are you Well-Architected?
Are you Well-Architected?
 
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
 
Resiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the CloudResiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the Cloud
 
Transforming Product Development - AWS Transformation Day 2018: Detroit
Transforming Product Development - AWS Transformation Day 2018: DetroitTransforming Product Development - AWS Transformation Day 2018: Detroit
Transforming Product Development - AWS Transformation Day 2018: Detroit
 
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
 
Transforming Product Development- AWS Transformation Day Raleigh 2018.pdf
Transforming Product Development- AWS Transformation Day Raleigh 2018.pdfTransforming Product Development- AWS Transformation Day Raleigh 2018.pdf
Transforming Product Development- AWS Transformation Day Raleigh 2018.pdf
 
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedKeynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practiced
 
Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.
 
TECHTalks - Boston MA - Tim Harney
TECHTalks - Boston MA - Tim HarneyTECHTalks - Boston MA - Tim Harney
TECHTalks - Boston MA - Tim Harney
 
Transforming Product Development - AWS Transformation Day Boston 2018
Transforming Product Development - AWS Transformation Day Boston 2018Transforming Product Development - AWS Transformation Day Boston 2018
Transforming Product Development - AWS Transformation Day Boston 2018
 
Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...
Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...
Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...
 
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
 
Transforming Product Development- Transformation Day Philadelphia 2018
Transforming Product Development- Transformation Day Philadelphia 2018Transforming Product Development- Transformation Day Philadelphia 2018
Transforming Product Development- Transformation Day Philadelphia 2018
 

Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos Engineering and Scalability at Audible.com. Tyler Lund, Director, Software Engineering, Audible.com. ARC308
  • 3. Agenda: What I talk about when I talk about distributed architectures; The curious case of the broken downloads; A song of ice and chaos; Principles of chaos and the Goblet of Fire; The art and Zen of implementing chaos; What happened?
  • 6. Largest producer and retailer of audiobooks … worldwide!
  • 7. To unleash the power of the spoken word
  • 8. Audible members listened for almost 3 BILLION hours in 2017
  • 10. Audible downloads (architecture diagram): Ownership; Stats Table; DB; Activation; Audible Download Service; Content Delivery Service; Get Static Metadata; Content Delivery Engine
  • 11. Diagram: Service A; Service B; Service C; Service D
  • 24. Goals of chaos engineering: Distributed systems are complex. No one person can understand the entirety. Chaos is inherent. It’s better to accept the chaos. Gain confidence in understanding the system through experiments.
  • 25. Unit Testing: Input → Component A → Output
  • 26. Integration Testing: Input → Service A → Output
  • 27. What is chaos engineering? Chaos engineering is the discipline of experimenting on distributed systems to gain confidence in their behavior. For example: fail calls between services, or add latency.
  • 28. Implementing chaos engineering: Socialization; Monitoring; Graceful restarts and degradation; Targeted chaos; Cause a cascading failure; Build a failure injection framework; Create a chaos automation platform
  • 29. Socialization: Acknowledge the complexity of the system; Get support from the business; Never let a good problem go to waste
  • 30. Socialization: First CPU hog experiment in 2012; Socialized cause and impact of download issue; Documented system architecture; Created file access experiments
  • 31. Getting started with chaos: Define the steady state; Start with non-critical services, in QA; Only experiment on the services of teams that want to be experimented on
  • 32. Monitoring: What are your key business metrics? Playback starts/second; Adds to cart; Orders; Membership signups
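One way to make "define the steady state" concrete is to turn a key business metric into a tolerance band and check live readings against it. The sketch below is only an illustration: the sample values, the three-sigma tolerance, and the function names are assumptions, not something shown in the talk.

```python
from statistics import mean, stdev


def steady_state_band(baseline, tolerance_sigmas=3.0):
    """Return the (low, high) range treated as steady state for one metric.

    `baseline` is a list of readings taken while no experiment is running.
    The three-sigma default is an assumed tolerance, not a recommendation.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    return mu - tolerance_sigmas * sigma, mu + tolerance_sigmas * sigma


def in_steady_state(value, band):
    """True if a live reading falls inside the steady-state band."""
    low, high = band
    return low <= value <= high


# Hypothetical playback starts/second readings collected before an experiment.
baseline_playback_starts = [980, 1010, 995, 1005, 1020, 990, 1000]
band = steady_state_band(baseline_playback_starts)
```

During an experiment, each new reading of the metric would be passed to `in_steady_state`; a reading outside the band is a signal that the system has left its steady state.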
  • 34. Graceful restarts and degradation: Start with on/off; Bring down a host, service, or DB; Spike the CPU
  • 35. Targeted chaos: Start with your own team or one prepared for it; Look at large recent issues
  • 36. Cause a cascading failure: Define a hypothesis about the system; Test the hypothesis; Break something other teams are dependent on
  • 37. Types of experiments: Add latency; Make services and dependencies unavailable; Exceptions; Packet loss; Failed requests; Resource contention
  • 40. Create a failure injection framework: Web UI; Host agent; Service framework injection
  • 41. Latency Injection (diagram): Service A; Service B; Injector
  • 42. Gremlin proxy injector
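The injector placed between Service A and Service B can be approximated in-process by wrapping the client call. This is a minimal sketch under stated assumptions: `with_chaos`, `InjectedFailure`, `service_a`, and `service_b` are hypothetical names, and it does not represent Audible's framework or the Gremlin proxy.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_chaos(call, latency_s=0.0, failure_rate=0.0,
               rng=random.random, sleep=time.sleep):
    """Wrap a dependency call so it can be delayed or made to fail.

    latency_s: seconds of delay added before every call.
    failure_rate: probability in [0, 1] that a call raises InjectedFailure.
    rng and sleep are parameters so tests can substitute deterministic ones.
    """
    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)
        if rng() < failure_rate:
            raise InjectedFailure("injected failure calling dependency")
        return call(*args, **kwargs)
    return wrapped


def service_b(title_id):
    """Hypothetical stand-in for the downstream service."""
    return {"title_id": title_id, "status": "ok"}


def service_a(title_id, get_metadata):
    """Hypothetical caller that degrades gracefully when the dependency fails."""
    try:
        return get_metadata(title_id)
    except InjectedFailure:
        return {"title_id": title_id, "status": "degraded"}


# Usage: add 50 ms of latency to every call from Service A to Service B.
slow_b = with_chaos(service_b, latency_s=0.05)
print(service_a("B001", slow_b))
```

Wrapping at the call site keeps the experiment scoped to one dependency edge, which is the same idea as the host agent and service framework injection options, only at a smaller scale.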
  • 43. Audible downloads (architecture diagram): Ownership; Stats Table; DB; Activation; Audible Download Service; Content Delivery Service; Get Static Metadata; Content Delivery Engine
  • 44. Create a chaos automation platform: Calculate the maximum impact the KPI can experience; Run control and experiment clusters; Route based on impact; Stop the experiment if the KPI suffers or customers experience pain; Figure out the problem with time to investigate, not while being paged
  • 45. Chaos Automation (diagram): Client; LB; Service A - Prod (98%); Service A - Control (1%); Service A - Experiment (1%)
  • 46. High Deviation: Stop Experiment
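The automation platform steps (bound the impact, split traffic across prod, control, and experiment clusters, stop on high deviation) can be sketched as follows. The 98/1/1 weights come from the diagram; the 5% deviation threshold, the function names, and the use of a single KPI value per cluster are assumptions made for illustration.

```python
import random

# Traffic weights as shown in the diagram: prod 98%, control 1%, experiment 1%.
WEIGHTS = {"prod": 98, "control": 1, "experiment": 1}


def max_kpi_impact(weights=WEIGHTS):
    """Worst case: every request routed to the experiment cluster fails."""
    return weights["experiment"] / sum(weights.values())


def pick_cluster(rng=random):
    """Choose a cluster for one request according to the traffic weights."""
    clusters = list(WEIGHTS)
    return rng.choices(clusters, weights=[WEIGHTS[c] for c in clusters], k=1)[0]


def relative_deviation(control_kpi, experiment_kpi):
    """Relative difference of the experiment KPI from the control KPI."""
    if control_kpi == 0:
        return 0.0 if experiment_kpi == 0 else float("inf")
    return abs(experiment_kpi - control_kpi) / control_kpi


def should_stop(control_kpi, experiment_kpi, max_deviation=0.05):
    """Stop when the experiment cluster deviates too far from control.

    The 5% threshold is an assumed value, not one taken from the talk.
    """
    return relative_deviation(control_kpi, experiment_kpi) > max_deviation
```

Comparing the experiment cluster against a same-sized control cluster, rather than against all of production, keeps the two samples comparable, so a deviation is more likely to be caused by the injected failure than by traffic mix.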
  • 47. Run chaos all the time, everywhere: Run on all services, critical and non-critical; Run with limited warning; Run in production; Run often; Get feedback
  • 48. Results: Show examples of resilience. Link back to previous issues. Can’t be proven with integration tests. Find issues you wouldn’t otherwise find, such as retry storms.
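Retry storms are hard to see in integration tests because the amplification only appears when several layers retry at once. As a rough worked example, assuming every layer in a call chain makes the same number of attempts and the deepest dependency keeps failing, the load on that dependency grows exponentially with depth. The function name and the numbers are illustrative only.

```python
def worst_case_calls(attempts_per_layer, layers):
    """Calls reaching the deepest dependency for one user request.

    Assumes each of `layers` callers makes `attempts_per_layer` attempts
    (the first try plus retries) and that every attempt fails.
    """
    return attempts_per_layer ** layers


# Three layers that each try 3 times turn one request into 27 calls.
print(worst_case_calls(3, 3))
```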
  • 49. Our results: Development team ownership; Prioritizing experiments; Building a framework; Preventing customer impact
  • 50. Takeaways: Everyone can and should do chaos; Implementing chaos educates and improves engineering teams; Do no harm to the customer; Involve business partners
  • 51. Chaos engineering doesn’t cause problems. Chaos engineering reveals them.
  • 52. Related breakouts: Tuesday, Nov 27, ARC314 – Globalizing Player Accounts at Riot Games While Maintaining Availability, 1:45 PM – 2:45 PM | Aria East, Plaza Level, Orovada 2; Tuesday, Nov 27, ARC307 – How Intuit TurboTax Ran Entirely on AWS for 2017 Taxes, 10:45 AM – 11:45 AM | Venetian, Level 2, Venetian F
  • 53. Thank you! Tyler Lund, tlund@audible.com, @tylopoda