
Principles of Chaos Engineering

Slide deck from my talk "Principles of Chaos Engineering" at the first-ever Chaos Engineering Hamburg meetup.

Come join us at http://www.meetup.com/Chaos-Engineering-Hamburg/ and stay up to date with new events and other news.

Published in: Software

  1. Chaos Engineering Hamburg
     Marvin Hoffmann | Computer Scientist
     15.12.2015
  2. Agenda
     1. AWS Basics and Intro
     2. Evolution of Chaos Testing
     3. Tooling
     4. Chaos Engineering
  3. AWS Basics
     • Regions: Europe West (Ireland), US East (N. Virginia)
     • Availability Zones (AZs)
     • Instances
  4. Chaos? What do we mean?
  5. “A way to improve availability is to install proven hardware and software, and then leave it alone”
     Jim Gray, “Why Do Computers Stop and What Can Be Done About It?”
  6. Be reliable!
     • Systems need to be reliable
     • Nuclear weapon arsenals, heart rate monitoring, World of Warcraft servers, streaming businesses
     • Third-party dependencies (software and hardware)
  7. DynamoDB Outage US-East
     • “… there was a brief network disruption that impacted a portion of DynamoDB’s storage servers.”
     • 2:19am until 7:10am PDT
     • “There are several other AWS services that use DynamoDB that experienced problems during the event.”
     • SQS, EC2 auto scaling, CloudWatch
     Source: https://aws.amazon.com/message/5467D2/
  8. It’s not always the infrastructure
     • Deployments themselves may cause issues
     • Unpredicted behaviour after a change has been rolled out
     • Issues during rollback
     • Change in client / user behaviour
  9. Evolution of Chaos Testing
  10. Do the simplest thing first
     • Prepare for your machines to die
     • “Cattle, not pets” (Adrian Cockcroft)
     • Resilience through redundancy
     • Stateless machines
  11. Deal with infrastructure issues
     • Latency between instances
     • Packet loss
     • Blocked ports
     • Or even the outage of an entire AZ
  12. Think big!
     • Remember that DynamoDB failure?
     • Outage of an entire AWS region!
     • You’ll need more than one region in the first place
     • Re-routing of all traffic from one region to another
     • Any region needs to be able to scale to take the load of two regions
  13. Tooling (meet the Monkeys)
  14. Chaos Monkey
     Kills random instances in your account
  15. Chaos Gorilla
     Kills a random AZ in your account
  16. Chaos Kong
     Kills an entire AWS region in your account
  17. What’s in it?
     • A compilation of scripts
     • Scripts mess with your AWS account
     • Thus, they are very AWS specific
     • If not on AWS, get inspired and build your toolset around these ideas
     • Not a comprehensive toolset
  18. Simian Army
     • Latency Monkey
     • Conformity Monkey
     • Security Monkey
     • Doctor Monkey
     • 10-18 Monkey
  19. Chaos Engineering
  20. Chaos Engineering
     • Systematic approach to Chaos Testing
     • Started by Netflix
     • They talk about it a lot to attract talent
     • Many other companies are doing similar things in that field
     • They want to grow a community around it
  21. “Experiment on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
     Netflix
  22. Four Principles of Chaos Engineering
  23. Know your system
     • Operational insight
     • What is “normal”? What does a failure look like?
  24. Four Principles of Chaos Engineering
     1. Build a hypothesis around steady-state behaviour
  25. The “Happy Path”
     • Trace through code where nothing bad happens
     • Testing usually happens first on the happy path
     • Bad things usually happen off the happy path
     Source: https://bethtrissel.files.wordpress.com/2014/06/176869567.jpg
  26. Four Principles of Chaos Engineering
     1. Build a hypothesis around steady-state behaviour
     2. Vary real-world events
  27. Laboratory
     • “Works on my machine” (or “works in stage env.”)
     Source: http://www.memegasms.com/media/created/vhyfxm.jpg
  28. Four Principles of Chaos Engineering
     1. Build a hypothesis around steady-state behaviour
     2. Vary real-world events
     3. Run experiments in production
  29. Four Principles of Chaos Engineering
     1. Build a hypothesis around steady-state behaviour
     2. Vary real-world events
     3. Run experiments in production
     4. Automate experiments to run continuously
  30. Chaos Engineering Culture
     • http://principlesofchaos.com
     • More resources:
     • https://github.com/Netflix/SimianArmy
     • https://github.com/Netflix/atlas
     • https://www.youtube.com/watch?v=vq4QZ4_YDok
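
The slides on Chaos Monkey and the steady-state hypothesis can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Simian Army's implementation: `Fleet`, `pick_victim`, `run_experiment`, and the toy `success_rate` metric are all hypothetical names invented here, and nothing in it talks to AWS. It shows the shape of one experiment: confirm the steady state, terminate one random instance, then check whether the hypothesis still holds.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """Hypothetical in-memory stand-in for a group of redundant, stateless instances."""
    instances: set = field(default_factory=set)

    def success_rate(self) -> float:
        # Toy steady-state metric (an assumption for this sketch):
        # requests succeed as long as at least one instance is alive.
        return 1.0 if self.instances else 0.0

    def terminate(self, instance_id: str) -> None:
        self.instances.discard(instance_id)


def pick_victim(fleet: Fleet, rng: random.Random) -> str:
    """Chaos Monkey-style selection: choose one random instance."""
    return rng.choice(sorted(fleet.instances))


def run_experiment(fleet: Fleet, rng: random.Random, threshold: float = 0.99) -> bool:
    """Return True if the steady-state hypothesis survives losing one instance."""
    # Step 1: only experiment on a system that is currently healthy.
    if fleet.success_rate() < threshold:
        raise RuntimeError("system is not in steady state; aborting experiment")
    # Step 2: vary a real-world event, here the loss of a single machine.
    fleet.terminate(pick_victim(fleet, rng))
    # Step 3: compare the metric against the hypothesis.
    return fleet.success_rate() >= threshold


if __name__ == "__main__":
    redundant = Fleet({"i-a", "i-b", "i-c"})
    single = Fleet({"i-only"})
    print(run_experiment(redundant, random.Random(42)))  # redundancy absorbs the loss
    print(run_experiment(single, random.Random(42)))     # a lone "pet" does not
```

In a real setup the metric would come from monitoring and the termination from the cloud provider's API; the point of the sketch is only the order of the steps.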

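The "Think big!" slide states that any region needs to be able to take the load of two regions. The arithmetic behind that can be sketched as follows, assuming (this is not stated in the deck) that load is spread evenly across regions and that the traffic of failed regions is redistributed evenly over the survivors. The function name `required_capacity_per_region` is hypothetical.

```python
def required_capacity_per_region(load_per_region: float, regions: int, failed: int = 1) -> float:
    """Capacity each surviving region needs after `failed` regions go down.

    Assumes even load per region and even redistribution (a simplification).
    """
    if regions < 2:
        raise ValueError("need at least two regions to survive a region outage")
    if not 0 <= failed < regions:
        raise ValueError("failed must be between 0 and regions - 1")
    total_load = load_per_region * regions
    return total_load / (regions - failed)


if __name__ == "__main__":
    # Two regions at 1000 requests/s each: the survivor must carry both loads.
    print(required_capacity_per_region(1000, regions=2))
    # Three regions: each of the two survivors carries 1.5x its normal load.
    print(required_capacity_per_region(1000, regions=3))
```

If all traffic from the failed region is instead re-routed to a single other region, as the slide describes, that one region needs twice its normal capacity regardless of how many regions exist.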