Synchronized video and slides, plus MP3 and slide downloads, are available at http://bit.ly/1NSdRSp.
Kolton Andrus presents how Netflix, to harden its systems, designed “Failure as a Service” so that anyone can test and validate how their systems handle failure. Filmed at qconnewyork.com.
Kolton Andrus (@deelyle) is a Chaos Engineer on Netflix’s Edge Platform team. He designed and built FIT, a failure injection service. Prior to Netflix, he worked at Amazon Retail, where he built Gremlin, Amazon’s failure service. At both companies he has served as a ‘Call Leader’, managing the resolution of large-scale incidents.
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/failure-as-a-service-netflix
3. Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. Overview
1. Why is Failure Testing Important?
2. How did we build Failure as a Service?
3. How has this made our systems more resilient?
7. Why Failure Testing?
1. Makes our systems immune to failure
2. Prevents larger outages
3. Production verification is requisite
9. Failure testing is a form of hormesis - we imbibe the poison to become immune.
12. Validating that our defenses will work when called upon - by exercising them at scale in production.
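As an illustration of exercising a defense rather than trusting it, here is a minimal Python sketch. It is not FIT or any Netflix API: FailureInjector, InjectedFailure, fetch_recommendations and fetch_with_fallback are hypothetical names assumed for this example, and the failure is injected in-process rather than at scale in production.

```python
import random


class InjectedFailure(Exception):
    """Raised when the (hypothetical) injector decides a call should fail."""


class FailureInjector:
    """Fails a configurable fraction of calls to a named dependency."""

    def __init__(self, failure_rate, seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0.0 and 1.0")
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)

    def maybe_fail(self, dependency):
        # random() returns a value in [0.0, 1.0), so a rate of 1.0 always
        # fails and a rate of 0.0 never fails.
        if self._rng.random() < self.failure_rate:
            raise InjectedFailure(f"injected failure for {dependency}")


def fetch_recommendations(user_id, injector):
    # Stand-in for a remote call; the injector sits in front of it.
    injector.maybe_fail("recommendations")
    return [f"personalized-title-for-{user_id}"]


def fetch_with_fallback(user_id, injector):
    # The defense under test: degrade to a static list when the call fails.
    try:
        return fetch_recommendations(user_id, injector)
    except InjectedFailure:
        return ["popular-title"]
```

Calling fetch_with_fallback with FailureInjector(failure_rate=1.0) forces the fallback path on every call, which is the point of the exercise: the fallback is verified by triggering it deliberately, not by waiting for a real outage.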
32. Takeaways
1. Failure Testing is a worthwhile investment
2. Testing in Production is sustainable
3. It can harden your systems against failure
Kolton Andrus (@deelyle)
33. Resources
● Netflix Techblog - FIT
● “On Designing and Deploying Internet-Scale Services” - James Hamilton
● Drift into Failure - Sidney Dekker
● Antifragile - Nassim Nicholas Taleb
34. Photo Credits
● Nuclear Blast - Mark Waldrep
● Forest Fire
● Poison
● Needle
● Explosion
● Robot