Breaking things on purpose (with Gremlin)

Breaking Things on Purpose
@KoltonAndrus
kolton@gremlininc.com
Introduction
“Context, not Control”
Core Value
Breaking things on purpose (with Gremlin)
Breaking things on purpose (with Gremlin)
Inject something harmful (in a system),
in order to build an immunity
Bad
Good
Bounded Downside
Unbounded
Upside
Effective Failure
Testing
“What could go wrong?”
Breaking things on purpose (with Gremlin)
“How likely is this to occur?”
“What is the cost of being wrong?”
Breaking things on purpose (with Gremlin)
Breaking things on purpose (with Gremlin)
Validating our
assumptions
Experiment
Form a hypothesis If we lose the Ratings service,
members will get default ratings
Measurable Outcome This will manifest as increased Hystrix
Fallbacks
Success Criteria But the overall success rate will remain
constant
Abort Conditions Halt immediately if members are
unable to stream
Validate
Breaking things on purpose (with Gremlin)
Breaking things on purpose (with Gremlin)
Dial it up!
Breaking things on purpose (with Gremlin)
“Productionalization”
Breaking things on purpose (with Gremlin)
Case Studies
Breaking things on purpose (with Gremlin)
Chaos Kong
“What is Tier 1?”
Finding the Critical Services
Breaking things on purpose (with Gremlin)
Outcome
Reduce Toil
Increase Reliability
Better prepared for the outages that occur
Break things on purpose!
Breaking things on purpose (with Gremlin)
Breaking things on purpose (with Gremlin)
Breaking things on purpose (with Gremlin)
Breaking things on purpose (with Gremlin)
Breaking things on purpose (with Gremlin)
Breaking things on purpose (with Gremlin)
Breaking things on purpose (with Gremlin)
Breaking things on purpose (with Gremlin)
Thank You!
@KoltonAndrus
kolton@gremlininc.com
“Required Reading” and References
FIT: Failure Injection Testing - Netflix Tech Blog
Automated Failure Testing - Netflix Tech Blog
Automated Failure Testing at Internet Scale - ACM SOCC Paper
Antifragile: Things That Gain from Disorder by Nassim Nicholas Taleb
On Designing and Deploying Internet-Scale Services by James Hamilton
Drift into Failure by Sidney Dekker
Photo Credits
http://i.gyazo.com/38b53958cccde98b712acfde6d880336.png
http://www.thedoctorschannel.com/wp-
content/uploads/2013/01/Vaccine_Vials_Syringe_Needle.jpg
http://www.horizonservicesinc.com/wp/wp-content/uploads/Explosion.jpg
Star Trek: The Next Generation
http://www.joshuanhook.com/wp-content/uploads/2014/11/broken-
communication.jpg
http://s3.amazonaws.com/media.eremedia.com/uploads/2014/01/15174902/THINK-
small.jpg
1 sur 41

Contenu connexe

Breaking things on purpose (with Gremlin)

Notes de l'éditeur

  1. Failure Testing prepares us, both socially and technically, for how our systems will behave in the face of failure. By proactively testing, we can find and fix problems before they become crises. Practice makes perfect, yet a real calamity is not a good time for training. Knowing how our systems fail is paramount to building a resilient service. At Netflix, we run failure exercises on a regular basis to ensure we are prepared. These efforts hardened our Edge services and helped us to have a quiet holiday season and a smooth global launch. Come and learn how to run an effective “Game Day” and safely test in production. Then sleep peacefully knowing you are ready!
  2. Ex-Netflix Edge Platform - Edge Services Built FIT - ‘Failure as a Service’ “Engineering Crisis Manager” Ex-Amazonian Retail Website Performance and Reliability Build Amazon’s Failure Injection Service “Call Leader”
  3. What is the opposite of fragile? Robust/Resilient? Those are indifferent to change. We want something that improves with change.
  4. Vaccine Analogy
  5. The downside is the impact of the failure test The upside is the prevention of future outages Additionally the upside is in training our organizations to handle failure.
  6. Describe high level engagement with a team. Start with the high level questions
  7. Failure Scenarios :: Threat Model Scope your failures - “Who will I impact?”
  8. Value of Tracing and Observability
  9. Analysis of past events - Start with low hanging fruit Prepare for the Unknown - Black Swan
  10. If you run only in one AWS region, and that region goes down, what will happen? Cost/Benefit Analysis -> Prioritize the largest risks first
  11. How do you measure the cost of an outage? Lost revenue Average cost (various internet sources): $5,000-10,000 / minute, 20 minute outage = $100,000 -$200,000 eCommerce site (on Black Friday) Est that an hour of downtime costs Facebook $1.7M in lost advertising revenue Brand Reputation Customer Trust
  12. https://blog.acolyer.org/2016/10/06/simple-testing-can-prevent-most-critical-failures/ https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
  13. Example of a CDN selection ‘successful’ cdn selection failure test
  14. Setup Communication Let your team/dependencies/organization know you are running a test. Invite anyone interested or impacted Many eyes looking will spot errors faster Share your pass/fail criteria Command Center Team Bullpen Chat Room Conference call
  15. Smallest possible step Run it locally Run in test Run it for a single instance Validate the expected outcome
  16. Small Scale - Then to find functional failures Large Scale - Resource Constraints, Queuing, Cascading Failure Emergent Behavior?
  17. Test in Prod! Those unwilling to test in production aren't yet confident that the service will continue operating through failures. And, without production testing, recovery won't work when called upon. - James Hamilton
  18. Think through potential failure modes Setup Dashboards Setup Alarms Test Engagement Run a Drill
  19. Prepare your organization for what could go wrong. Run on your own terms. During the day, after the caffeine has kicked in. Practice. Train. Answer questions. Know how to turn it off up front.
  20. Run by the traffic team, this is one of the best examples of the power of failure testing. New learnings every few runs Ready when called upon AWS Outage - Q4 2015 AWS Outage - Q1 2016 - Jan 14th
  21. At Amazon, this was hard to answer Many known by name Managed by hand If you caused an outage, you might end up on the list Netflix had a similar problem We solved it by automating a set of failure testing On devices we’d: Log In, Browse, Choose a video, and stream Whitelist the ‘known’ critical service If this test failed, a new critical service was introduced From this, we can make a choice about it’s criticality
  22. Less Operational Burden Less Outages Better prepared for the outages that occur