Designing Services for
Resilience Experiments:
Lessons from Netflix
Nora Jones, Senior Chaos Engineer
@nora_js
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/netflix-microservices-resiliency
Presented at QCon San Francisco
www.qconsf.com
So, how can teams design services
for resilience testing?
● Failure Injection Enabled
● RPC enabled
● Fallback Paths
○ And ways to discover them
● Proper monitoring
○ Key business metrics to look for
● Proper timeouts
○ And ways to discover them
Known Ways to Increase
Confidence in Resilience
● Unit Tests
● Integration Tests
New Ways to Increase Confidence
in Resilience
● Chaos Experiments
SPS (Stream Starts per Second): Key Business Metric
Chaos Engineering: Netflix’s ChAP
[Diagram: API → Personalization, carrying 100% of traffic]
[Diagram: API Gateway and Personalization, with traffic split across API Control (1%), API Exp (1%), and the remaining 98%]
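The traffic split in the diagram can be sketched as follows. This is a minimal illustration, assuming customers are assigned to clusters by a stable hash of a customer ID; the function name `route_cluster`, the cluster labels, and the hashing scheme are hypothetical and are not taken from ChAP itself.

```python
import hashlib

# Hypothetical cluster labels mirroring the diagram (not ChAP's real names).
CONTROL, EXPERIMENT, BASELINE = "api-control", "api-exp", "api"


def route_cluster(customer_id: str, pct_each: float = 1.0) -> str:
    """Map a customer to a cluster using a stable hash bucket in [0, 10000)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    width = int(pct_each * 100)  # 1% of traffic -> 100 buckets out of 10,000
    if bucket < width:
        return CONTROL
    if bucket < 2 * width:
        return EXPERIMENT
    return BASELINE  # everything else stays on the regular cluster
```

Hashing keeps a given customer in the same cluster for the whole experiment, so control and experiment populations stay comparable.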
Monitoring
[Chart: experiment marked SHORTED]
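One way to read the SHORTED label is that an experiment is stopped early when the key business metric in the experiment cluster drifts too far from the control cluster. The sketch below assumes that reading; the function name `should_short`, the 5% threshold, and the use of simple averages are illustrative assumptions, not how ChAP actually decides.

```python
def should_short(control_sps: list[float], experiment_sps: list[float],
                 max_drop: float = 0.05) -> bool:
    """Return True when experiment SPS falls more than max_drop below control."""
    if not control_sps or not experiment_sps:
        raise ValueError("need at least one sample per cluster")
    control_avg = sum(control_sps) / len(control_sps)
    experiment_avg = sum(experiment_sps) / len(experiment_sps)
    if control_avg == 0:
        return False  # no baseline traffic to compare against
    # Relative drop of the experiment cluster versus the control cluster.
    drop = (control_avg - experiment_avg) / control_avg
    return drop > max_drop
```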
1. Have Failure Injection Testing Enabled.
Sample Failure Injection
Library
https://github.com/norajones/FailureInjectionLibrary
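To make "failure injection enabled" concrete, here is a minimal sketch of an injection point that can either fail a call or add latency. It is not the API of the library linked above; `InjectionRule`, `ACTIVE_RULES`, `injection_point`, and `get_rows` are hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass
from functools import wraps


@dataclass
class InjectionRule:
    """Hypothetical rule describing how to disturb one injection point."""
    target: str                 # name of the injection point
    fail: bool = False          # raise an error instead of calling through
    delay_seconds: float = 0.0  # add latency before calling through


class InjectedFailure(RuntimeError):
    """Raised when a rule asks for the call to fail."""


# Active rules keyed by injection point; empty means no injection.
ACTIVE_RULES: dict[str, InjectionRule] = {}


def injection_point(name: str):
    """Decorator that applies any active rule before running the function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            rule = ACTIVE_RULES.get(name)
            if rule is not None:
                if rule.delay_seconds > 0:
                    time.sleep(rule.delay_seconds)
                if rule.fail:
                    raise InjectedFailure(f"injected failure at {name}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@injection_point("personalization.get_rows")
def get_rows(customer_id: str) -> list[str]:
    # Stand-in for a real downstream call.
    return [f"row-for-{customer_id}"]
```

With no active rule the decorated function behaves normally, so the injection point can stay in production code and be switched on only for an experiment.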
Types of Chaos Failures
Criteria & API
Automating Creation of Chaos
Experiments
2. Have Good Monitoring in Place for Configuration Changes.
Have Good Monitoring in Place
● RPC Enabled
○ Associated Hystrix Commands
■ Associated Fallbacks
● Timeouts
● Retries
● All in One Place!
RPC/Ribbon
● Java library managing REST clients to/from different services
● Fast failing/fallback capability
RPC/Ribbon Timeouts
RPC Timeouts
At what point does the service give up?
Retries
Immediately retrying an operation after a failure is usually not a great idea.
Understand how your timeouts and your retries interact.
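A common alternative to immediate retries is to wait a randomized, growing delay between attempts. The sketch below shows exponential backoff with full jitter; the names `backoff_delays` and `call_with_retries` and the default values are assumptions for illustration, not settings from the talk.

```python
import random
import time


def backoff_delays(max_retries: int, base: float = 0.1, cap: float = 2.0):
    """Yield one delay per retry: exponential growth, capped, with full jitter."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield random.uniform(0, ceiling)


def call_with_retries(operation, max_retries: int = 3, sleep=time.sleep):
    """Call operation; on failure wait a jittered delay, then try again."""
    delays = backoff_delays(max_retries)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

The jitter spreads retries from many callers over time, so a struggling service is not hit by a synchronized wave of repeat requests.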
Circuit Breakers/Fallback Paths
Hystrix Commands/Fallback Paths
If your service is non-critical, ensure that there are fallback paths in place.
Fallback Strategies
● Static Content
● Cache
● Fallback Service
Know what your fallback strategy is and how to get that information.
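Two of the strategies named above, cache and static content, can be chained as in the sketch below. The function name `get_with_fallback`, the returned source label, and the in-memory `dict` cache are assumptions for illustration; a fallback service would slot in as one more step before the static default.

```python
def get_with_fallback(key, primary, cache: dict, static_default):
    """Return (value, source), trying primary, then cache, then static content."""
    try:
        value = primary(key)
        cache[key] = value  # remember the last good answer for later fallbacks
        return value, "primary"
    except Exception:
        if key in cache:
            return cache[key], "cache"
        return static_default, "static"
```

Returning the source alongside the value is one way to make the active fallback visible to monitoring, so it is possible to tell how often each path is actually used.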
3. Ensure Synergy between Hystrix Timeouts, RPC Timeouts, and Retry Logic.
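One reading of "synergy" is that the outer Hystrix timeout should leave room for every RPC attempt it wraps. The sketch below checks a configuration for that, assuming retries run back to back with no backoff delay; `ClientConfig` and its field names are hypothetical and are not real Hystrix or Ribbon property names.

```python
from dataclasses import dataclass


@dataclass
class ClientConfig:
    hystrix_timeout_ms: int  # timeout on the wrapping command
    rpc_timeout_ms: int      # timeout on each individual RPC attempt
    max_retries: int         # retries after the first attempt


def worst_case_rpc_ms(cfg: ClientConfig) -> int:
    """Time needed if every attempt runs to its full timeout."""
    return cfg.rpc_timeout_ms * (1 + cfg.max_retries)


def find_antipatterns(cfg: ClientConfig) -> list[str]:
    """Return human-readable problems found in the timeout/retry settings."""
    problems = []
    if cfg.hystrix_timeout_ms < cfg.rpc_timeout_ms:
        problems.append("Hystrix timeout is shorter than a single RPC timeout")
    elif cfg.hystrix_timeout_ms < worst_case_rpc_ms(cfg):
        problems.append("Hystrix timeout cuts off retries before they can finish")
    return problems
```

For example, a 1000 ms command timeout around a 500 ms RPC timeout with 2 retries needs up to 1500 ms in the worst case, so the last retry can never complete.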
ChAP’s Monocle
There isn’t always money in
microservices
Criticality Score
RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
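The formula above can be worked through in code. The slides do not say how the RPS (requests per second) range buckets are defined, so the boundaries in `RPS_BOUNDARIES` and the 1-based bucket numbering are assumptions made only to produce a runnable example.

```python
import bisect

# Hypothetical RPS range boundaries; the real buckets are not given in the slides.
RPS_BOUNDARIES = [10, 100, 1_000, 10_000]


def rps_bucket(rps: float) -> int:
    """Map requests per second to a 1-based range bucket."""
    return bisect.bisect_right(RPS_BOUNDARIES, rps) + 1


def criticality_score(rps: float, retries: int, hystrix_commands: int) -> int:
    """RPS range bucket * number of retries * number of Hystrix commands."""
    return rps_bucket(rps) * retries * hystrix_commands
```

Under these assumed boundaries, a dependency at 500 RPS with 2 retries and 4 Hystrix commands falls in bucket 3 and scores 3 * 2 * 4 = 24. Note that, taken literally, the formula gives a score of 0 whenever the number of retries is 0.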
Chaos Success Stories
“We ran a chaos experiment which verifies that our fallback path works and it successfully caught an issue in the fallback path and the issue was resolved before it resulted in any availability incident!”
“While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful. ... This likely means that whoever was consuming the fallback was retrying the call, causing an increase in license requests.”
Don’t lose sight of your
company’s customers.
Takeaways
● Designing for resiliency testability is a shared
responsibility.
● Configuration changes can cause outages.
● Have explicit monitoring in place on
antipatterns in configuration changes.
Questions?
@nora_js