Chaos engineering

A talk on Chaos Engineering for the XPUG group

  1. 1. Chaos Engineering: organizations that ignore Chaos Engineering are leaving money on the table.
  2. 2. «Failures are given, and everything will eventually fail over time» (Werner Vogels – CTO Amazon)
  3. 3. Why? 1. Growth of microservices and distributed cloud architectures 2. The web has grown increasingly complex 3. We all depend on these systems more than ever 4. Failures have become much harder to predict 5. These failures cause severe outages for companies
  4. 4. From On-Premises ... 1. Before the Cloud, users were connected to our application through the company’s local network 2. A server’s downtime was planned and involved stopping production 3. One monolith
  5. 5. ... To Cloud 1. Now our users are connected through the Internet 2. The workload on our services increases significantly, because the applications themselves are far more widely used 3. Many microservices replace one monolith
  6. 6. Microservices: is it really a matter of size? Common characteristics: • Componentisation via services • Organised around business capabilities • Decentralised data management • Products not projects • Decentralised governance • Smart endpoints and dumb pipes • Evolutionary design • Infrastructure automation • Designed for failure «We cannot say there is a formal definition of the microservices architectural style, but we can attempt to describe what we see as common characteristics for architectures that fit the label.» (Martin Fowler, James Lewis)
  7. 7. Or is it a matter of paradigms?
  8. 8. Or is it a matter of paradigms?
  9. 9. Change Mindset Building a reliable application in the cloud is different than building a reliable application in an enterprise setting
  10. 10. Reactive Manifesto 1. Jonas Bonér, Dave Farley, Roland Kuhn, Martin Thompson – 16.01.2014 2. The single most important thing is for the system to be responsive. This means that a reactive system needs to remain responsive even when a failure occurs. • https://www.reactivemanifesto.org/it
  11. 11. Resilient System • Networks • Servers • Applications • Processes • People. Resilience is the ability of a system to adapt to changes, failures & disturbances. Resilience is a function of People & Culture.
  12. 12. Failures are given. Availability and downtime per year: 95% (1-nine): 18 days 6 hours; 99% (2-nines): 3 days 15 hours; 99.9% (3-nines): 8 hours 45 minutes; 99.99% (4-nines): 52 minutes; 99.999% (5-nines): 5 minutes; 99.9999% (6-nines): 31 seconds
  13. 13. The beauty of Math at work. Components in series (Component: Availability, Downtime): X: 99% (2-nines), 3 days 15 hours; Y: 99.99% (4-nines), 52 minutes; X and Y combined: 98.99%, 3 days 16 hours 33 minutes. Components in parallel (Component: Availability, Downtime): X: 99% (2-nines), 3 days 15 hours; two X in parallel: 99.99% (4-nines), 52 minutes; three X in parallel: 99.9999% (6-nines), 31 seconds
  14. 14. Chaos Engineering «Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.» https://principlesofchaos.org • Instead of trying to avoid failure, chaos engineering embraces it • Provide evidence of system weaknesses through scientific chaos engineering experiments • Which kind of weaknesses? Dark Debt
  15. 15. History 1. 1564-1642: Galileo Galilei introduces the experimental scientific method 2. 1879-1955: «No amount of experiments will prove me right; a single experiment will prove me wrong» (A. Einstein) 3. 2000: Game Day by Jesse Robbins, the Master of Disaster 4. 2010: Chaos Monkey by Netflix. Why? To support the move from physical infrastructure to cloud infrastructure 5. 2011: Simian Army. We have to design a cloud architecture where individual components can fail without affecting the availability of the entire system 6. 2012: Netflix shares Chaos Monkey on GitHub 7. 2014: A new role: Chaos Engineer
  16. 16. Once upon a time in Seattle «You don’t choose the moment, the moment chooses you» «You only choose how prepared you are when it does» Jesse Robbins, the Master of Disaster at Amazon
  17. 17. Chaos Experiment vs Testing. Testing: • Several sets of inputs and predicted outputs • Limited scopes • Is a programming practice that instructs developers • Testing, strictly speaking, does not create new knowledge. Chaos Experiment: • Discover weaknesses through experiments • Limited scopes • Experimentation creates new knowledge
  18. 18. Game Day • An exercise designed to increase Resilience through large-scale fault injection across critical systems. • The goal of a Game Day is to practice how you, your team, and your supporting systems deal with real-world turbulent conditions. • Creating Resiliency through destruction
  19. 19. Sociotechnical System «Before starting your journey into chaos engineering, make sure you’ve done your homework and have built resiliency into every level of your organization. Building resilient systems isn’t all about software. It starts at the infrastructure layer, progresses to the network and data, influences application design and extends to people and culture.» (Adrian Hornsby)
  20. 20. Notifications and Approvals • Remember Conway’s Law. Table of notifications and approvals (Name | Role | Approved?): Bob Jennifer | Owner (CEO) | Yes
  21. 21. Dark Debt • Dark Debt is not recognizable at the time of creation. • Dark Debt arises from the unforeseen interactions of hardware or software with other parts of the framework. • Dark Debt is invisible until an anomaly reveals its presence. • Platform • Applications • People, practices, and processes
  22. 22. The Phases of Chaos Engineering Chaos engineering is NOT about letting monkeys loose or allowing them to break things randomly without a purpose. Chaos engineering is about breaking things in a controlled environment.
  23. 23. Start with Experiments • Get your team together and come up with a picture of your system (including people, practices, processes) • Ask the right questions: - Where would it be most valuable to create an experiment that helps us build trust and confidence in our system under turbulent conditions? - What could possibly go wrong? • Chaos Engineering doesn’t guarantee you have the perfect system • Chaos Engineering never ends • Likelihood and Impact
  24. 24. Checkmate in three moves. Preparation: • Identification and mitigation of risks and impact from failure • Reduces frequency of failures (MTBF) • Reduces duration of recovery (MTTR). Participation: • Builds confidence & competence responding to failure and under stress • Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types. Exercises: • Trigger and expose «latent defects» • Choose to discover them, instead of letting that be determined by the next real disaster.
  25. 25. Likelihood-Impact Map • The likelihood that a failure may occur • The potential impact your system will experience if it does [Figure: likelihood-impact map; recoverable labels: «API products become unavailable», Contribution, Availability]
  26. 26. Describe Your Experiment • A steady-state hypothesis: A set of measurements that indicates that the system is working in an expected way from a business perspective, and within a given set of tolerances • A method: The set of activities you’re going to use to inject the turbulent conditions into the target system • Rollbacks: A set of remediating actions through which you will attempt to repair what you have done knowingly in your experiment’s method [Figure: cycle of Explore, Discover, Analyze, Validate, Improve]
  27. 27. Demo (Explore, Discover, Analyze, Validate, Improve) 1. Using a chaos experiment to explore and discover weaknesses in the target system 2. Using a chaos experiment to discover and begin to analyze any weaknesses surfaced in the system 3. Once the challenge of analysis is done, it’s time to apply an improvement to the system (if needed) 4. Your chaos experiment becomes a chaos test to detect whether the weakness has indeed been overcome.
  28. 28. Demo (Explore, Discover, Analyze, Validate, Improve) 1. Using a chaos experiment to explore and discover weaknesses in the target system 2. Using a chaos experiment to discover and begin to analyze any weaknesses surfaced in the system 3. Once the challenge of analysis is done, it’s time to apply an improvement to the system (if needed) 4. Your chaos experiment becomes a chaos test to detect whether the weakness has indeed been overcome.
  29. 29. Demo (Explore, Discover, Analyze, Validate, Improve) 1. Using a chaos experiment to explore and discover weaknesses in the target system 2. Using a chaos experiment to discover and begin to analyze any weaknesses surfaced in the system 3. Once the challenge of analysis is done, it’s time to apply an improvement to the system (if needed) 4. Your chaos experiment becomes a chaos test to detect whether the weakness has indeed been overcome.
  30. 30. Demo (Explore, Discover, Analyze, Validate, Improve) 1. Using a chaos experiment to explore and discover weaknesses in the target system 2. Using a chaos experiment to discover and begin to analyze any weaknesses surfaced in the system 3. Once the challenge of analysis is done, it’s time to apply an improvement to the system (if needed) 4. Your chaos experiment becomes a chaos test to detect whether the weakness has indeed been overcome.
  31. 31. Under the skin of chaos run. Start: is the experiment valid? No: experiment aborted. Yes: check the steady-state hypothesis. Not within tolerances: experiment aborted. Within tolerances: execute the method, then check the steady-state hypothesis again. Within tolerances: no deviations found. Not within tolerances: deviations found.
  32. 32. Steady-state hypothesis: a model that characterizes the steady state of the system based on expected values of the business metrics.
  33. 33. Canary Deployment. Start small and slowly build confidence within your team and your organization. - How many customers are affected? - What functionality is impaired? - Which locations are impacted?
  34. 34. Benefits of Chaos Engineering - First, chaos engineering helps you uncover the unknowns in your system and fix them before they cause failures in production at 3am during the weekend: improved resiliency, and improved sleep. - Second, a successful chaos engineering practice always generates a lot more changes than anticipated, and these are mostly cultural. Probably the most important of these changes is a natural evolution towards a “non-blaming” culture: the “Why did you do that?” turns into a “How can we avoid doing that in the future?”, resulting in happier and more efficient, empowered, engaged and successful teams. And that’s gold!
  35. 35. Books and Resources: Principles of Chaos Engineering; https://github.com/chaostoolkit
  36. 36. Thank you @aacerbis Linkedin alberto.acerbis@4solid.it
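
The availability figures on slides 12 and 13 follow from two simple rules: availabilities multiply when every component is required, and unavailabilities multiply when redundant copies back each other up. The sketch below is illustrative only. The function names are made up for this example, failures are assumed to be independent, and a year is assumed to have 365 days, so the downtime values may differ by a few minutes from the rounded ones on the slides.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def serial(*availabilities):
    """Availability when every component is required (X and Y combined)."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def parallel(availability, copies):
    """Availability of redundant copies, assuming independent failures
    and that a single surviving copy is enough."""
    return 1.0 - (1.0 - availability) ** copies


def downtime_hours_per_year(availability):
    """Expected downtime per year, in hours."""
    return (1.0 - availability) * HOURS_PER_YEAR


x, y = 0.99, 0.9999
print(f"X and Y combined:    {serial(x, y):.4%}")       # about 98.99%
print(f"Two X in parallel:   {parallel(x, 2):.4%}")     # 4-nines
print(f"Three X in parallel: {parallel(x, 3):.6%}")     # 6-nines
print(f"Downtime of X alone: {downtime_hours_per_year(x):.1f} hours per year")
```

The independence assumption is the weak point of this arithmetic: two copies of X that share a network, a deployment pipeline, or a configuration file can fail together, which is exactly the kind of weakness a chaos experiment is meant to surface.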

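The experiment structure from slide 26 (steady-state hypothesis, method, rollbacks) and the run flow from slide 31 can be sketched in a few lines. This is a toy model under stated assumptions, not the Chaos Toolkit API: the `Experiment` class, the `run` function, and the simulated `system` dictionary are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Probe = Callable[[], float]
Tolerance = Tuple[float, float]  # (lowest, highest) accepted measurement


@dataclass
class Experiment:  # hypothetical structure, for illustration only
    steady_state: Dict[str, Tuple[Probe, Tolerance]]
    method: List[Callable[[], None]]
    rollbacks: List[Callable[[], None]] = field(default_factory=list)


def within_tolerances(experiment: Experiment) -> bool:
    """True when every probe reports a value inside its tolerance range."""
    return all(low <= probe() <= high
               for probe, (low, high) in experiment.steady_state.values())


def run(experiment: Experiment) -> str:
    # An experiment without a hypothesis or a method is not valid.
    if not experiment.steady_state or not experiment.method:
        return "experiment aborted: not valid"
    # The system must be healthy before any turbulence is injected.
    if not within_tolerances(experiment):
        return "experiment aborted: steady state not met"
    try:
        for action in experiment.method:
            action()
        deviated = not within_tolerances(experiment)
    finally:
        # Rollbacks always run, even if the method raised an error.
        for rollback in experiment.rollbacks:
            rollback()
    return "deviations found" if deviated else "no deviations found"


# Simulated target system: the service answers while a replica is alive.
system = {"replicas": 2}


def success_rate() -> float:
    return 1.0 if system["replicas"] > 0 else 0.0


def kill_one_replica() -> None:
    system["replicas"] -= 1


def restore_replicas() -> None:
    system["replicas"] = 2


experiment = Experiment(
    steady_state={"success rate": (success_rate, (0.99, 1.0))},
    method=[kill_one_replica],
    rollbacks=[restore_replicas],
)
print(run(experiment))  # prints "no deviations found"
```

Starting the same experiment with a single replica makes the second steady-state check fail, so `run` reports "deviations found": evidence of a weakness to analyze and improve before the experiment is rerun as a chaos test.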