Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2Dv9VKr.
Nora Jones talks about designing microservices so that they can be resiliency tested, and about the moving parts to consider both at initial design time and throughout a service's lifetime. She shares tips and tricks on how to design microservices for resiliency tests, examples of poorly designed services, and how to ensure the relevant design decisions stay in place on a continuous basis. Filmed at qconsf.com.
Nora Jones is a Senior Chaos Engineer at Netflix. She is passionate about delivering high-quality software, improving processes, and promoting efficiency within architecture. Occasionally, she pokes holes in distributed systems to make them more resilient.
Watch the video with slide synchronization on InfoQ.com: https://www.infoq.com/presentations/netflix-microservices-resiliency
Presented at QCon San Francisco
www.qconsf.com
14. So, how can teams design services for resilience testing?
● Failure Injection Enabled
● RPC enabled
● Fallback Paths
○ And ways to discover them
● Proper monitoring
○ Key business metrics to look for
● Proper timeouts
○ And ways to discover them
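The checklist above can be sketched in a few lines. This is a minimal illustration, not Netflix's tooling: `FailureInjector`, `call_with_fallback`, the `ratings` dependency name, and the metric names are all hypothetical. It shows a remote call that has a failure injection point, an explicit timeout, a fallback path, and counters that make fallback usage visible to monitoring.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


class InjectedFailure(Exception):
    """Raised when an experiment deliberately fails a call (hypothetical)."""


@dataclass
class FailureInjector:
    # Names of dependencies that the current experiment is failing.
    failing: set = field(default_factory=set)

    def check(self, dependency: str) -> None:
        if dependency in self.failing:
            raise InjectedFailure(dependency)


def call_with_fallback(dependency, primary, fallback, injector, timeout_s, metrics):
    """Run primary() under a timeout; serve fallback() on any failure."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        injector.check(dependency)  # failure injection point
        result = pool.submit(primary).result(timeout=timeout_s)  # explicit timeout
        metrics[f"{dependency}.success"] += 1
        return result
    except Exception:
        # Injected failures, timeouts, and real errors all take the fallback path.
        metrics[f"{dependency}.fallback"] += 1
        return fallback()
    finally:
        pool.shutdown(wait=False)  # do not block on a call that timed out


metrics = Counter()
injector = FailureInjector()
live = call_with_fallback("ratings", lambda: "live", lambda: "cached", injector, 1.0, metrics)
injector.failing.add("ratings")  # start the experiment
degraded = call_with_fallback("ratings", lambda: "live", lambda: "cached", injector, 1.0, metrics)
```

Counting fallbacks separately from successes is one way to make a fallback path discoverable: if the counter never moves during an experiment, the path is probably not wired up.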
78. “We ran a chaos experiment which verifies that our fallback path works and it successfully caught an issue in the fallback path and the issue was resolved before it resulted in any availability incident!”
80. “While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful. ...This likely means that whoever was consuming the fallback was retrying the call, causing an increase in license requests.”
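The effect described in this quote can be reproduced with a toy model. The sketch below is an assumption-laden illustration, not the actual system: it assumes the consumer treats a fallback response as a failure and retries, and that every attempt issues one license request. The function name and parameters are hypothetical.

```python
def count_license_requests(calls: int, primary_failing: bool, max_retries: int) -> int:
    """Count requests that reach a (hypothetical) license service.

    Assumptions: every attempt issues one license request, and the consumer
    retries up to max_retries times whenever it is served a fallback.
    """
    requests = 0
    for _ in range(calls):
        attempts = 0
        while True:
            attempts += 1
            requests += 1  # each attempt hits the license service
            served_by_fallback = primary_failing
            if not served_by_fallback or attempts > max_retries:
                break
    return requests


baseline = count_license_requests(calls=100, primary_failing=False, max_retries=2)
experiment = count_license_requests(calls=100, primary_failing=True, max_retries=2)
```

Under these assumptions the experiment cluster sends three times the baseline load even though every fallback "succeeds", which is why the request rate of downstream services is worth watching alongside the fallback success rate.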
82. Takeaways
● Designing for resiliency testability is a shared responsibility.
● Configuration changes can cause outages.
● Have explicit monitoring in place on antipatterns in configuration changes.
@nora_js
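As a sketch of what monitoring for configuration antipatterns might look like, the check below flags one assumed antipattern: a client whose retries, at the per-attempt timeout, can outlast the timeout of the caller waiting on it. The slides do not say which antipatterns to watch, so this rule, the `ClientConfig` fields, and the `license-client` name are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientConfig:
    # Hypothetical shape of a client's timeout and retry settings.
    name: str
    caller_timeout_ms: int
    per_try_timeout_ms: int
    max_retries: int


def find_antipatterns(cfg: ClientConfig) -> list:
    """Return human-readable findings for one proposed configuration."""
    findings = []
    if cfg.per_try_timeout_ms <= 0:
        findings.append(f"{cfg.name}: no per-try timeout set")
    # Worst case: the first attempt plus every retry runs to its full timeout.
    worst_case_ms = cfg.per_try_timeout_ms * (cfg.max_retries + 1)
    if worst_case_ms > cfg.caller_timeout_ms:
        findings.append(
            f"{cfg.name}: retries can take {worst_case_ms} ms, "
            f"longer than the caller timeout of {cfg.caller_timeout_ms} ms"
        )
    return findings


risky = find_antipatterns(ClientConfig("license-client", 1000, 400, 2))
safe = find_antipatterns(ClientConfig("license-client", 1000, 300, 2))
```

A check like this could run whenever a configuration change is proposed, so that a risky combination is surfaced before it reaches production rather than during an outage.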