Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2mArA9C.
Kolton Andrus and Peter Alvaro present how a “big idea” -- lineage-driven fault injection -- evolved from a theoretical model into an automated failure testing service at Netflix. They describe the challenges (expected as well as unexpected, technical as well as ideological) that arose, and how they overcame them. Filmed at qconsf.com.
Kolton Andrus is the founder of Gremlin Inc. He is passionate about building resilient systems, primarily as it lets him break things for fun and profit. Peter Alvaro is an Assistant Professor of Computer Science at the University of California Santa Cruz. He is the creator of the Dedalus language and co-creator of the Bloom language.
1. Monkeys in Lab Coats
Automating Failure Testing Research at
2. InfoQ.com: News & Community Site
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/netflix-testing-research
3. Purpose of QCon
Presented at QCon San Francisco
www.qconsf.com
4. The whole is greater than the sum of its parts.
- Aristotle
[Metaphysics]
5. The Professor vs The Practitioner
Peter Alvaro
Ex-Berkeley, Ex-Industry
Assistant Prof @ Santa Cruz
Misses the calm of PhD life
Likes prototyping stuff
Kolton Andrus
Ex-Netflix, Ex-Amazon
‘Chaos’ Engineer
Misses his actual pager
Likes breaking stuff
25. How do we find the redundancy?
Could a bad ‘thing’ ever happen?
Why did a good ‘thing’ happen?
26. Lineage-driven fault injection
Why did a good thing happen?
Consider its lineage.
[Figure: lineage graph. Nodes: “The write is stable”, “Stored on RepA”, “Stored on RepB”, “Bcast1”, “Bcast2”, “Client” (x2).]
27. Lineage-driven fault injection
Why did a good thing happen?
Consider its lineage.
What could have gone wrong?
Faults are cuts in the lineage graph.
Is there a cut that breaks all supports?
[Figure: lineage graph. Nodes: “The write is stable”, “Stored on RepA”, “Stored on RepB”, “Bcast1”, “Bcast2”, “Client” (x2).]
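The idea that faults are cuts in the lineage graph can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the `SUPPORTS` dictionary is a hypothetical encoding of the drawing, and it assumes each replica's copy can be explained by either broadcast, which is what the four-clause formula on the later slides suggests. The helper names `support_paths` and `is_cut` are invented for this sketch.

```python
from typing import Dict, List, Set

# Hypothetical encoding of the slide's lineage graph (an assumption about the
# drawing): each node maps to the alternative nodes that can support it.
SUPPORTS: Dict[str, List[str]] = {
    "write_is_stable": ["RepA", "RepB"],
    "RepA": ["Bcast1", "Bcast2"],
    "RepB": ["Bcast1", "Bcast2"],
    "Bcast1": [],
    "Bcast2": [],
}


def support_paths(node: str, graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return every path from `node` down to a leaf, excluding `node` itself."""
    children = graph.get(node, [])
    if not children:
        return [[]]
    paths = []
    for child in children:
        for tail in support_paths(child, graph):
            paths.append([child] + tail)
    return paths


def is_cut(faults: Set[str], paths: List[List[str]]) -> bool:
    """A fault set is a cut if it removes at least one step from every path."""
    return all(any(step in faults for step in path) for path in paths)


PATHS = support_paths("write_is_stable", SUPPORTS)

if __name__ == "__main__":
    print(PATHS)
    print(is_cut({"Bcast1"}, PATHS))            # one broadcast lost: not a cut
    print(is_cut({"Bcast1", "Bcast2"}, PATHS))  # both broadcasts lost: a cut
```

Under these assumptions, losing a single broadcast leaves two support paths intact, while losing both (or crashing both replicas) breaks every support for the good outcome.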
29. What would have to go wrong?
(RepA OR Bcast1)
[Figure: lineage graph. Nodes: “The write is stable”, “Stored on RepA”, “Stored on RepB”, “Bcast1”, “Bcast2”, “Client” (x2).]
30. What would have to go wrong?
(RepA OR Bcast1)
AND (RepA OR Bcast2)
[Figure: lineage graph. Nodes: “The write is stable”, “Stored on RepA”, “Stored on RepB”, “Bcast1”, “Bcast2”, “Client” (x2).]
31. What would have to go wrong?
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
[Figure: lineage graph. Nodes: “The write is stable”, “Stored on RepA”, “Stored on RepB”, “Bcast1”, “Bcast2”, “Client” (x2).]
32. What would have to go wrong?
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
AND (RepB OR Bcast1)
[Figure: lineage graph. Nodes: “The write is stable”, “Stored on RepA”, “Stored on RepB”, “Bcast1”, “Bcast2”, “Client” (x2).]
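The four-clause formula can be checked mechanically. The sketch below is not from the talk: it swaps in a brute-force search over subsets where a real tool would presumably call a SAT solver, and the names `CLAUSES`, `satisfies`, and `minimal_solutions` are invented for illustration. Each clause lists the faults that would break one support path; a fault set satisfies the formula when it touches every clause.

```python
from itertools import combinations
from typing import FrozenSet, List, Set

# The CNF formula from the slide, one set per clause (OR inside, AND across).
CLAUSES: List[Set[str]] = [
    {"RepA", "Bcast1"},
    {"RepA", "Bcast2"},
    {"RepB", "Bcast2"},
    {"RepB", "Bcast1"},
]


def satisfies(faults: FrozenSet[str], clauses: List[Set[str]]) -> bool:
    """True if the fault set contains at least one variable from every clause."""
    return all(clause & faults for clause in clauses)


def minimal_solutions(clauses: List[Set[str]]) -> List[FrozenSet[str]]:
    """Brute-force search for minimal satisfying fault sets, smallest first."""
    variables = sorted(set().union(*clauses))
    found: List[FrozenSet[str]] = []
    for size in range(1, len(variables) + 1):
        for combo in combinations(variables, size):
            candidate = frozenset(combo)
            # Skip supersets of a solution that was already found.
            if any(prev <= candidate for prev in found):
                continue
            if satisfies(candidate, clauses):
                found.append(candidate)
    return found


SOLUTIONS = minimal_solutions(CLAUSES)

if __name__ == "__main__":
    for solution in SOLUTIONS:
        print(sorted(solution))
```

For this formula the search yields two minimal fault sets: both broadcasts, or both replicas. Brute force is exponential in the number of variables, so it only makes sense for toy formulas like this one.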
33. Lineage-driven fault injection
[Figure: lineage graph. Nodes: “The write is stable”, “Stored on RepA”, “Stored on RepB”, “Bcast1”, “Bcast2”, “Client” (x2).]
Hypothesis: {Bcast1, Bcast2}
35. The prototype system “Molly”
Recipe:
1. Start with a successful outcome. Work backwards.
2. Ask why it happened: Lineage
3. Convert lineage to a boolean formula and solve
4. Lather, rinse, repeat
[Figure: cycle diagram. Stages: “1. Success”, “2. Lineage”, “3. CNF”, “Fail”, “4. REPEAT”. Labels: “Why?”, “Encode”, “Solve”.]
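The recipe reads naturally as a loop. The following is a minimal sketch under stated assumptions, not Molly itself: `run_system` is a made-up toy in which the client falls back to a second broadcast only when the first is lost, so the fallback shows up in the lineage only after a fault has been injected. The solver is a brute-force stand-in, and all function names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Optional, Set, Tuple

Path = Tuple[str, ...]


def run_system(faults: Set[str]) -> Tuple[bool, List[Path]]:
    """Toy system (hypothetical): try Bcast1, fall back to Bcast2 if it is lost.

    Returns whether the write became stable, plus the lineage of that outcome
    as a list of support paths (replica, broadcast).
    """
    for bcast in ("Bcast1", "Bcast2"):
        if bcast in faults:
            continue  # this broadcast is lost; the client tries the next one
        paths = [(rep, bcast) for rep in ("RepA", "RepB") if rep not in faults]
        return bool(paths), paths
    return False, []


def minimal_hypotheses(clauses: List[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Step 3: treat each path as a clause and brute-force minimal solutions."""
    variables = sorted(set().union(*clauses))
    found: List[FrozenSet[str]] = []
    for size in range(1, len(variables) + 1):
        for combo in combinations(variables, size):
            candidate = frozenset(combo)
            if any(prev <= candidate for prev in found):
                continue
            if all(clause & candidate for clause in clauses):
                found.append(candidate)
    return found


def ldfi_loop(max_rounds: int = 10) -> Optional[FrozenSet[str]]:
    """Return a fault set that breaks the outcome, or None if none is found."""
    ok, lineage = run_system(set())          # Step 1: a successful outcome
    if not ok:
        raise RuntimeError("need a successful run to start from")
    clauses = [frozenset(path) for path in lineage]   # Step 2: its lineage
    tried: Set[FrozenSet[str]] = set()
    for _ in range(max_rounds):              # Step 4: repeat
        pending = [h for h in minimal_hypotheses(clauses) if h not in tried]
        if not pending:
            return None                      # every hypothesis was survived
        hypothesis = pending[0]
        tried.add(hypothesis)
        ok, lineage = run_system(set(hypothesis))
        if not ok:
            return hypothesis                # injected faults broke the outcome
        clauses.extend(frozenset(path) for path in lineage)
    return None


COUNTEREXAMPLE = ldfi_loop()

if __name__ == "__main__":
    print(sorted(COUNTEREXAMPLE) if COUNTEREXAMPLE else "no failure found")
```

In this toy, the first round proposes losing Bcast1 alone; the system survives via the fallback, the new lineage adds clauses, and the second round arrives at losing both broadcasts, which fails.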
70. Case study: “Netflix AppBoot”
Services: ~100
Search space (executions): 2^100 (roughly 1,000,000,000,000,000,000,000,000,000,000)
Experiments performed: 200
Critical bugs found: 11
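A quick calculation shows the scale of these numbers. It assumes the 2^100 figure comes from treating each of the roughly 100 services as either failed or healthy in a given execution; the slide does not spell this out, and the variable names are illustrative only.

```python
# Assumption: each of ~100 services is independently either failed or healthy,
# so the number of distinct failure combinations is 2 ** 100.
SERVICES = 100
SEARCH_SPACE = 2 ** SERVICES
EXPERIMENTS = 200  # experiments performed, from the slide

if __name__ == "__main__":
    print(f"{SEARCH_SPACE:,}")            # exact size of the search space
    print(f"{SEARCH_SPACE:.3e}")          # the same number in scientific notation
    print(EXPERIMENTS / SEARCH_SPACE)     # fraction of the space actually run
```

The exact value is a little over 1.26 x 10^30, so the 200 experiments cover a vanishingly small fraction of the space. That gap is why the search has to be guided rather than exhaustive.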
71. Future Work
Richer device metrics
Request class creation
Better experiment selection
Search prioritization
Richer lineage collection
Exploring temporal interleavings