
From resilient to antifragile - Chaos Engineering Primer DevSecCon


Can we inject failure scenarios into deployed systems to reduce platform risk? During this talk, demonstrations of the Simian Army, Chaos Lemur and Locust.io tools will be presented. We will go beyond reliability, stability and availability to help your platform operations team build a continuous process improvement program which will prepare your production systems for the unexpected.
92% of catastrophic system failures were the result of incorrect handling of non-fatal errors.
It is simply not possible to fully reproduce the entire architecture and run an end-to-end test.
Don't trust claims systems make about themselves & their dependencies. Verify by breaking.

Published in: Software


  1. Join the conversation #devseccon. From Resilient to Antifragile - Chaos Engineering Primer. By @Sergiu_Bodiu, @Pivotal Platform Architect, Asia Pacific & Japan
  2. Singapore Spring User Group. DevOpsDays Singapore Conference. Imagine having an idea in the morning and shipping in the evening.
  3. A new way to look at organizations
       • Fragile: At risk of total failure / financial ruin.
       • Resilient: Takes damage, avoids total failure, recovers.
       • Robust: Absorbs uncertainty, repels blows, avoids damage.
       • Antifragile: Responds to stress by mutating, maintains fitness for purpose. Identity change.
  4. Risk management. The new normal: from resilient to antifragile.
  5. Distributed Systems Complexity. Complexity is like addiction… It comes on slowly, forming weak bonds that you can barely feel. But as it continues, the bonds strengthen quietly until they calcify and become hard to break.
  6. Software is Single Point of Failure
       • Instapaper outage, cause and recovery: restored after 31 hours.
       • gitlab.com was down for about 18 hours and also lost production data: roughly 5,000 projects, 5,000 comments and 700 new user accounts.
       • Root cause analysis: component failures such as network, storage, server, hardware, and power failures are anticipated and thus guarded with extra redundancy, which leaves software as the single point of failure.
  7. Dejirafication. Alexey Krivitsky, https://www.slideshare.net/krivitsky/dejirafication-clean-your-process
  8. Chaos Engineering: the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. http://principlesofchaos.org
  9. Backups. "Backups always succeed. It's the restores that fail. Test your backups by practicing restores!" Using Chaos Monkey.
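The advice on the Backups slide, practicing restores rather than trusting that a backup job succeeded, could be sketched as a small restore drill. This is a minimal illustration and not something shown in the talk: the file name `orders.db`, the directory layout, and the helper names `restore_drill` and `run_demo` are all hypothetical, and a plain file copy stands in for a real backup tool.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def restore_drill(source: Path, backup_dir: Path, restore_dir: Path) -> bool:
    """Back up `source`, restore it elsewhere, and verify the restored copy."""
    backup = backup_dir / (source.name + ".bak")
    shutil.copy2(source, backup)        # stand-in for the real backup step
    restored = restore_dir / source.name
    shutil.copy2(backup, restored)      # stand-in for the real restore step
    # The drill passes only if the restored data matches the original.
    return sha256_of(source) == sha256_of(restored)


def run_demo() -> bool:
    """Run the drill against a throwaway file in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "backups").mkdir()
        (root / "restored").mkdir()
        data = root / "orders.db"       # hypothetical data file
        data.write_bytes(b"order-1\norder-2\n")
        return restore_drill(data, root / "backups", root / "restored")
```

Running a drill like this on a schedule turns "the backup job is green" into "a restore was actually verified".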
  10. Some outages in the Region
       • SingTel fined a record $6m for Bukit Panjang exchange fire
       • Telstra goes down again, people can't drink beer or catch Ubers
       • Amazon Web Services outage causes Australian website chaos
  11. Netflix Simian Army. The Simian Army is a suite of tools for keeping your cloud operating in top form. https://github.com/Netflix/SimianArmy
  12. Chaos Monkey
       • Active during normal working hours
       • Break things in production
       • Design better software services
       • Embracing failure
       http://techblog.netflix.com/2016/10/netflix-chaos-monkey-upgraded.html
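The "active during normal working hours" rule from the Chaos Monkey slide can be illustrated with a few lines of Python. This is a hedged sketch of the idea only, not Chaos Monkey's actual code or configuration: the function names, the 09:00 to 17:00 window, and the instance identifiers are assumptions made for the example.

```python
import random
from datetime import datetime
from typing import Optional, Sequence


def in_working_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """True Monday to Friday, from start_hour (inclusive) to end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances: Sequence[str], now: datetime,
                rng: random.Random) -> Optional[str]:
    """Return one instance to terminate, or None outside working hours."""
    if not instances or not in_working_hours(now):
        return None
    return rng.choice(list(instances))
```

Restricting terminations to working hours means engineers are at their desks when something breaks, which is the point of the first bullet on the slide.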
  13. Other Monkeys
       • Latency Monkey
       • Janitor Monkey
       • Conformity Monkey
       • Security Monkey
       • Doctor Monkey
  14. Principles of Chaos
       1. Build a Hypothesis around Steady State Behavior
       2. Vary Real-world Events
       3. Run Experiments in Production
       4. Automate Experiments to Run Continuously
       TIP: Intentionally break things, compare measured with expected impact, and correct any problems uncovered this way.
       Source: Chaos Engineering Whitepaper 2016
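The tip on the Principles of Chaos slide (break something, compare measured with expected impact) can be sketched as a tiny experiment harness. Everything here is illustrative: `run_experiment`, `ExperimentResult`, and the `ToyService` stand-in are hypothetical names, and a real experiment would read its steady-state metric from production monitoring rather than from an in-memory object.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float
    measured: float
    hypothesis_holds: bool


def run_experiment(measure: Callable[[], float],
                   inject: Callable[[], None],
                   rollback: Callable[[], None],
                   max_drop: float) -> ExperimentResult:
    """Measure steady state, inject a failure, measure again, always roll back."""
    baseline = measure()
    inject()
    try:
        measured = measure()
    finally:
        rollback()  # undo the injected failure even if measuring raises
    # Hypothesis: the metric drops by at most `max_drop` during the failure.
    return ExperimentResult(baseline, measured, baseline - measured <= max_drop)


class ToyService:
    """Hypothetical system: success rate halves with fewer than two replicas."""

    def __init__(self) -> None:
        self.replicas = 3

    def success_rate(self) -> float:
        return 1.0 if self.replicas >= 2 else 0.5

    def kill_replica(self) -> None:
        self.replicas -= 1

    def restore_replica(self) -> None:
        self.replicas += 1
```

With three replicas, killing one leaves the success rate unchanged and the hypothesis holds; starting from two replicas, the same experiment exposes the weakness.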
  15. Hypothesize. Build a hypothesis around steady state behavior. Characterize steady state with metrics that are visible at the boundary of the system and directly capture an interaction between the users and the system. TIP: Utilisation is virtually useless as a metric!
  16. Vary Events
       • terminate virtual machine instances
       • inject latency into requests between services
       • fail requests between services
       • fail an internal microservice
       • make an entire region unavailable
       TIP: Select only a subset of users
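Two of the events on the Vary Events slide, injecting latency and failing requests between services, together with the tip to select only a subset of users, can be sketched as a wrapper around a service call. This is an assumption-laden illustration: `in_experiment`, `call_with_chaos`, and `InjectedFailure` are hypothetical names, and hashing the user ID is just one simple way to pick a stable subset.

```python
import hashlib
import time
from typing import Callable


def in_experiment(user_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of users in the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = digest[0] * 256 + digest[1]        # value in 0..65535
    return bucket < 65536 * percent // 100


class InjectedFailure(Exception):
    """Raised in place of a real downstream error."""


def call_with_chaos(user_id: str, call: Callable[[], str], *, percent: int,
                    delay_s: float = 0.0, fail: bool = False,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Add latency and/or a failure to `call`, only for users in the experiment."""
    if in_experiment(user_id, percent):
        if delay_s > 0:
            sleep(delay_s)                      # injected latency
        if fail:
            raise InjectedFailure(f"injected failure for {user_id}")
    return call()
```

Because the bucket is derived from the user ID, the same users stay in the experiment across requests, which keeps the blast radius small and makes their experience easier to compare with everyone else's.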
  17. Experiment. 92% of catastrophic system failures were the result of incorrect handling of non-fatal errors. It is simply not possible to fully reproduce the entire architecture and run an end-to-end test. TIP: Customers don't behave like your JMeter script. https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
  19. Automate
       • Distributed systems change continuously over time.
       • Engineers modify the behavior of existing services and add new services.
       • Engineers change runtime configuration parameters, and upgrade and patch systems.
       TIP: Depending on the context, change the rate of each experiment.
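The tip on the Automate slide, giving each experiment its own rate, could look like the following scheduler sketch. The names `ScheduledExperiment` and `tick`, and the intervals used in the usage note, are hypothetical and only illustrate the idea of running experiments continuously at different frequencies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScheduledExperiment:
    name: str
    interval_s: float               # how often to run; tune per context
    run: Callable[[], None]
    next_due: float = 0.0           # timestamp of the next allowed run


def tick(experiments: List[ScheduledExperiment], now: float) -> List[str]:
    """Run every experiment that is due at time `now`; return the names run."""
    ran = []
    for exp in experiments:
        if now >= exp.next_due:
            exp.run()
            exp.next_due = now + exp.interval_s
            ran.append(exp.name)
    return ran
```

A driver loop would call `tick(experiments, time.monotonic())` periodically; a cheap experiment such as terminating one VM might get a short interval, while a region outage drill would get a much longer one.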
  20. Chaos Engineering Whitepaper 2016. Build a hypothesis around steady state behavior. Vary real-world events. Run experiments in production. Automate experiments to run continuously.
  21. The importance of reliability. Don't trust claims systems make about themselves & their dependencies. Verify by breaking.
  22. Locust demo
       Locust is an open-source Python load testing framework.
       • Define user behaviour in code
       • Can execute end-to-end user tests with sessions and cookies
       • Scales out to multiple worker nodes to increase load capacity
       • Allows for distributed user paths based on percentages
       Gatling is an open-source Scala load testing framework.
       • High performance
       • Ready-to-present HTML reports
       • Scenario recorder and developer-friendly DSL
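The "distributed user paths based on percentages" bullet on the Locust slide can be illustrated without Locust itself. The sketch below uses only the Python standard library, so it is a model of the idea rather than Locust code; the path names and percentages are made up, and `choose_path` and `simulate` are hypothetical helpers. In Locust, the analogous mechanism is the weight argument to the `@task` decorator.

```python
import random
from collections import Counter
from typing import List, Tuple

UserPath = Tuple[str, int]  # (path name, weight in percent)


def choose_path(paths: List[UserPath], rng: random.Random) -> str:
    """Pick one user path at random, proportionally to its weight."""
    names = [name for name, _ in paths]
    weights = [weight for _, weight in paths]
    return rng.choices(names, weights=weights, k=1)[0]


def simulate(paths: List[UserPath], users: int, seed: int = 0) -> Counter:
    """Assign a path to each simulated user and count how often each was chosen."""
    rng = random.Random(seed)
    return Counter(choose_path(paths, rng) for _ in range(users))
```

For example, `simulate([("browse", 70), ("search", 20), ("checkout", 10)], 10000)` sends most simulated users down the browse path, which is closer to real traffic than a script where every user performs the same steps.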
  23. Lessons Learned
       • Don’t wait so long to start load testing.
       • The conversations drive new requirements.
       • Changing architecture at the last minute is extremely dangerous.
       • This is incredibly hard under pressure.
       • Build relationships with the Networking Team, Database Team, Third Party Partners, Vendors, etc.
       • Make everything asynchronous (Embrace Failure, Background Tasks, Retry, Idempotence).
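The last lesson pairs retry with idempotence, and the reason is easy to show in code: a retry is only safe if repeating the request cannot repeat its side effect. The following is a minimal sketch under stated assumptions; `retry`, `TransientError`, and `PaymentService` are hypothetical, and the service simulates a response that is lost after the charge has already been recorded.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A failure that is worth retrying, such as a timeout."""


def retry(op: Callable[[], T], attempts: int = 3, base_delay_s: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `op`, retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                raise                               # out of attempts
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}
        self.calls = 0

    def charge(self, key: str, amount: int) -> int:
        self.calls += 1
        if key not in self.charges:
            self.charges[key] = amount              # side effect happens once per key
        if self.calls == 1:
            raise TransientError("response lost after the charge was recorded")
        return self.charges[key]
```

Calling `retry(lambda: svc.charge("order-42", 100))` hits the service twice but records a single charge, because the second call reuses the same idempotency key.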
  24. Demo
  25. Chaos Lemur demo. Chaos Lemur is an alternative to Chaos Monkey (which was designed for AWS), built with PCF in mind.
  26. Clean your process. From a lean thinking perspective, managing the inventory is a non-value-adding activity. Culture > Principles > Tools
  27. Testing Pyramid. https://watirmelon.blog/2012/01/31/introducing-the-software-testing-ice-cream-cone/
  28. Principles
       • https://12factor.net: for any developer building applications which run as a service, and for ops engineers who deploy or manage such applications.
       • http://www.10factor.ci: for anyone working in software who writes tests or maintains continuous integration pipelines.
  29. Agile Manifesto. The Agile movement is not anti-methodology, in fact, many of us want to restore credibility to the word methodology. Now, a bigger gathering of organizational anarchists would be hard to find, so what emerged from this meeting was symbolic.
  30. Further Reading
       • https://www.infoq.com/br/presentations/exercising-failure-at-netflix
       • https://www.infoq.com/podcasts/failure-as-a-service
       • Peter Alvaro: Orchestrated Chaos: Applying Failure Testing Research at Scale
       • Mathias Lafeldt: Writing your first postmortem
       • Adrian Colyer: Simple Testing Can Prevent Most Critical Failures
  31. Thank You. @sergiu_bodiu. Let's Build Something MEANINGFUL
  33. Blueprint for living in a Black Swan world. Antifragile, and only the antifragile, will make it.
  34. The Three R’s of Enterprise Security: Rotate, Repave, and Repair
