Designing Services for Resilience: Netflix Lessons

Designing Services for
Resilience Experiments:
Lessons from Netflix
Nora Jones, Senior Chaos Engineer
@nora_js

Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/netflix-microservices-resiliency

Presented at QCon San Francisco
www.qconsf.com

So, how can teams design services
for resilience testing?
● Failure Injection Enabled
● RPC enabled
● Fallback Paths
○ And ways to discover them
● Proper monitoring
○ Key business metrics to look for
● Proper timeouts

Known Ways to Increase
Confidence in Resilience

● Unit Tests
● Integration Tests

New Ways to Increase Confidence
in Resilience
● Chaos Experiments

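Chaos Engineering: Netflix’s ChAP
[Diagram, shown in three builds. Build 1: API sends 100% of its traffic to Personalization. Build 2: an API Gateway splits traffic, with 98% going to API and 1% to an API Control cluster, both calling Personalization. Build 3: a further 1% goes to an API Exp (experiment) cluster, alongside 1% to API Control and 98% to API.]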
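The 1% / 1% / 98% split in the diagram can be illustrated with a small routing sketch. This is not how ChAP itself assigns traffic (the slides do not say); it is a minimal example that assumes requests carry a stable ID and hashes that ID into a bucket. The function name `assign_cluster` and the cluster labels are hypothetical.

```python
import hashlib

def assign_cluster(request_id: str,
                   control_pct: float = 1.0,
                   experiment_pct: float = 1.0) -> str:
    """Deterministically map a request ID to 'control', 'experiment', or 'baseline'.

    Hypothetical helper: the percentages default to the 1% / 1% split
    shown on the slide; everything else is an assumption.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a bucket in [0.00, 99.99].
    bucket = (int.from_bytes(digest[:8], "big") % 10_000) / 100.0
    if bucket < control_pct:
        return "control"
    if bucket < control_pct + experiment_pct:
        return "experiment"
    return "baseline"

def split_counts(n: int) -> dict:
    """Count how n synthetic request IDs are distributed across clusters."""
    counts = {"control": 0, "experiment": 0, "baseline": 0}
    for i in range(n):
        counts[assign_cluster(f"req-{i}")] += 1
    return counts
```

Hashing keeps a given request ID in the same cluster for the whole experiment, which makes the control and experiment groups comparable.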

1. Have Failure Injection
Testing Enabled.

Sample Failure Injection
Library
https://github.com/norajones/FailureInjectionLibrary
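To make "failure injection enabled" concrete, here is a minimal sketch of an injection point. It does not reproduce the API of the library linked above; the decorator `inject_failure`, the exception `InjectedFailure`, and the `experiment_active` flag are all hypothetical names chosen for illustration.

```python
import random
import time
from functools import wraps

class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""

def inject_failure(enabled, failure_rate=0.0, added_latency_s=0.0,
                   rng=random.random):
    """Decorator: while enabled() is true, delay and/or fail a fraction of calls."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                if added_latency_s > 0:
                    time.sleep(added_latency_s)  # simulate a slow dependency
                if rng() < failure_rate:
                    raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical experiment switch; a real system would read this from
# experiment configuration rather than a module-level dict.
experiment_active = {"on": False}

@inject_failure(lambda: experiment_active["on"], failure_rate=1.0)
def get_recommendations(user_id):
    return [f"title-for-{user_id}"]
```

With the switch off, `get_recommendations` behaves normally; with it on, every call raises `InjectedFailure`, which is what a caller's fallback path would need to handle.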

Automating Creation of Chaos
Experiments

2. Have Good Monitoring in
Place for Configuration
Changes.

Have Good Monitoring in Place
● RPC Enabled
○ Associated Hystrix Commands
■ Associated Fallbacks
● Timeouts
● Retries
● All in One Place!

RPC/Ribbon
● Java library managing REST clients to/from different services
● Fast failing/fallback capability

RPC Timeouts
At what point does the service give up?

Retries
Immediately retrying an operation after a failure
is usually not a great idea.
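One common alternative to immediate retries is exponential backoff with jitter. The slides do not name a specific strategy, so the sketch below is an assumption, and `backoff_delays` and `call_with_retries` are hypothetical helper names.

```python
import random
import time

def backoff_delays(max_retries, base_s=0.1, cap_s=2.0, rng=random.random):
    """Delay before each retry: exponential growth, capped, with full jitter."""
    return [rng() * min(cap_s, base_s * 2 ** attempt)
            for attempt in range(max_retries)]

def call_with_retries(operation, max_retries=3, sleep=time.sleep,
                      **backoff_kwargs):
    """Call operation(); on failure, wait a jittered delay and try again.

    Makes at most max_retries + 1 attempts and re-raises the last error.
    """
    delays = backoff_delays(max_retries, **backoff_kwargs)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(delays[attempt])
```

The random jitter spreads retries from many clients over time, so a briefly failing dependency is not hit by all of them at the same instant.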

Retries
Understand the logic between your timeouts and
your retries.

Circuit Breakers/Fallback Paths

Hystrix Commands/Fallback Paths
If your service is non-critical, ensure that there
are fallback paths in place.

Fallback Strategies
● Static Content
● Cache
● Fallback Service

Fallback Strategies
Know what your fallback strategy is and how to
get that information.
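The idea of knowing which fallback served a request can be sketched as a chain that reports the path it took. This is plain Python rather than Hystrix, and every name in it (`with_fallbacks`, `fetch_rows_from_service`, `STATIC_ROWS`, and so on) is hypothetical.

```python
def with_fallbacks(primary, fallbacks):
    """Try primary(), then each (name, fn) fallback in order.

    Returns (path_name, result) so callers and monitoring can see
    which path actually served the request.
    """
    try:
        return "primary", primary()
    except Exception:
        pass  # fall through to the fallback chain
    for name, fallback in fallbacks:
        try:
            return name, fallback()
        except Exception:
            continue
    raise RuntimeError("primary call and all fallbacks failed")

# Hypothetical example data and dependencies.
STATIC_ROWS = ["Popular titles"]
cache = {}

def fetch_rows_from_service():
    raise TimeoutError("personalization unavailable")  # simulated outage

def fetch_rows_from_cache():
    return cache["rows"]  # raises KeyError while the cache is cold

def fetch_static_rows():
    return STATIC_ROWS

path, rows = with_fallbacks(
    fetch_rows_from_service,
    [("cache", fetch_rows_from_cache), ("static", fetch_static_rows)],
)
```

Emitting `path` as a metric is one way to discover during an experiment that traffic is being served from a fallback, and from which one.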

3. Ensure Synergy
between Hystrix
Timeouts, RPC timeouts,
and retry logic.
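One way to check this kind of synergy is to compare the wrapping command's timeout with the worst-case time the RPC client could spend across all attempts. The rule below is a common rule of thumb, not something stated on the slides, and the function and parameter names are hypothetical. It also ignores any backoff delay between attempts.

```python
def worst_case_rpc_ms(rpc_timeout_ms: int, max_retries: int) -> int:
    """Upper bound on time spent in the RPC client: first try plus every retry."""
    return rpc_timeout_ms * (1 + max_retries)

def check_timeout_config(wrapper_timeout_ms: int,
                         rpc_timeout_ms: int,
                         max_retries: int) -> list:
    """Return warnings for a timeout/retry combination that cannot work as intended."""
    warnings = []
    worst = worst_case_rpc_ms(rpc_timeout_ms, max_retries)
    if wrapper_timeout_ms < worst:
        warnings.append(
            f"wrapper timeout {wrapper_timeout_ms} ms is shorter than the "
            f"worst-case RPC time {worst} ms; later retries may never finish"
        )
    return warnings
```

For example, a 1000 ms wrapper timeout around a client with a 500 ms RPC timeout and 3 retries produces a warning, because the client could need up to 2000 ms. A check like this could run whenever configuration changes.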

There isn’t always money in
microservices

Criticality Score
RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
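The formula can be written out as a short calculation. The slide does not say how requests per second are bucketed, so the boundaries in `RPS_BUCKET_BOUNDARIES` are invented purely for illustration, and the function names are hypothetical.

```python
import bisect

# Hypothetical bucket boundaries in requests per second (not from the talk).
RPS_BUCKET_BOUNDARIES = [100, 1_000, 10_000]

def rps_bucket(rps: float) -> int:
    """Map a requests-per-second figure to a 1-based range bucket."""
    return bisect.bisect_right(RPS_BUCKET_BOUNDARIES, rps) + 1

def criticality_score(bucket: int, retries: int, hystrix_commands: int) -> int:
    """Criticality Score = RPS range bucket * retries * Hystrix commands."""
    return bucket * retries * hystrix_commands
```

With these assumed boundaries, a service at 5,000 RPS falls in bucket 3, so with 2 retries and 4 Hystrix commands its score is 3 * 2 * 4 = 24. Taken literally, the product is zero for a service configured with zero retries.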

“We ran a chaos experiment which
verifies that our fallback path works
and it successfully caught an issue in
the fallback path and the issue was
resolved before it resulted in any
availability incident!”

“While [failing calls] we discovered an increase in
license requests for the experiment cluster even
though fallbacks were all successful. ...This likely
means that whoever was consuming the fallback
was retrying the call, causing an increase in
license requests.”

Don’t lose sight of your
company’s customers.

Takeaways
● Designing for resiliency testability is a shared
responsibility.
● Configuration changes can cause outages.
● Have explicit monitoring in place on
antipatterns in configuration changes.
@nora_js
