Chaos Engineering with Containers

Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2USKOWb.

Ana Medina discusses the benefits of using Chaos Engineering to inject failures in order to make our container infrastructure more reliable. She also shares how to improve container monitoring and observability and lessons learned from running Chaos Engineering GameDays with Gremlin customers. Filmed at qconsf.com.

Ana Medina is currently working as a Chaos Engineer at Gremlin, helping companies avoid outages by running proactive chaos engineering experiments. She last worked at Uber where she was an engineer on the SRE and Infrastructure teams specifically focusing on chaos engineering and cloud computing.


  1. 1. #QConSF @ana_m_medina Chaos Engineering with Containers 1 Ana Medina
 Chaos Engineer at
  2. 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/chaos-engineering-gamedays
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  4. 4. #QConSF @ana_m_medina 2 Ana Medina @ana_m_medina 
 Chaos Engineer @ Gremlin Previously Software Engineer / SRE @ Uber. Also worked/interned @ SFEFCU, Google, Quicken Loans, Stanford University and Miami Dade College. College dropout. Self-taught engineer.
  5. 5. #QConSF @ana_m_medina 3 How many of you have heard of Chaos Engineering?
  6. 6. #QConSF @ana_m_medina 4 How many of you have run a Chaos Engineering experiment?
  7. 7. #QConSF @ana_m_medina 5 Thoughtful, planned experiments designed to reveal the weakness in our systems. 
 Chaos Engineering
  8. 8. #QConSF @ana_m_medina 6 Inject something harmful to build an immunity. -@KoltonAndrus
 Gremlin Founder and CEO Chaos Engineering
  9. 9. #QConSF @ana_m_medina 7 Why? ● Microservices ● Systems are scaling fast ● Downtime is really expensive ● Our dependencies will fail ● Pager fatigue and burnout really hurt
  10. 10. #QConSF @ana_m_medina 8 “Chaos Engineering Without Observability ... Is Just Chaos”
 -@mipsytipsy Charity Majors, CEO of Honeycomb

  11. 11. #QConSF @ana_m_medina 9 Prerequisite of Chaos Engineering ● Monitoring/Observability ● On-Call and Incident Management ● Cost of Downtime Per Hour
  12. 12. #QConSF @ana_m_medina 10 Use Cases for Chaos Engineering ● Outage reproduction ● On-call training ● Strengthen new products ● Battle test new infrastructure and services
  13. 13. #QConSF @ana_m_medina 11 Use Cases for Chaos Engineering - Containers ● Testing provider-specific reliability (e.g., EKS vs AKS vs GKE) ● Auto scaling ● Logs, disk failure
  14. 14. #QConSF @ana_m_medina Minimize the Blast radius 12
  15. 15. #QConSF @ana_m_medina Monitoring / Observability 13
  16. 16. #QConSF @ana_m_medina 14 What to measure and monitor? ● System Metrics: CPU, Disk, I/O ● Availability ● Service specific KPIs ● Customer complaints
  17. 17. #QConSF @ana_m_medina 15 Demo
  18. 18. #QConSF @ana_m_medina 16 #1 - Battle Test Cloud Infrastructure
 Real World Scenario: A company or user is evaluating cloud providers' managed Kubernetes offerings. Which one is more reliable?
 The Hypothesis: Shutting down a container (1/1) should cause only a short delay before the app is reachable again.
 The Experiment: Shut down the Kubernetes dashboard container.
 Abort Conditions: The app is unreachable after 60 seconds.
  19. 19. #QConSF @ana_m_medina 17
  20. 20. #QConSF @ana_m_medina
  21. 21. #QConSF @ana_m_medina
  22. 22. #QConSF @ana_m_medina
  23. 23. #QConSF @ana_m_medina 21 #2 - Shutdown of a Container
 Real World Scenario: A company or user is evaluating containers. Are they as reliable as promised?
 The Hypothesis: Yes, they will come back up.
 The Experiment: Shut down a container, wait a few seconds, and check whether it is up again.
 Abort Conditions: The app is unreachable after 60 seconds.
  24. 24. #QConSF @ana_m_medina 22
  25. 25. #QConSF @ana_m_medina 23 #3 - Blackholing Traffic to Catalog
 Real World Scenario: A company or user is working with their UI team to provide a good user experience when there are API/DB issues.
 The Hypothesis: Images will not load, but the product listing will.
 The Experiment: Blackhole all traffic from the front end to the REST API and DB ports.
 Abort Conditions: The app is unreachable after 60 seconds.
  26. 26. #QConSF @ana_m_medina 24
  27. 27. #QConSF @ana_m_medina Case Study 25
  28. 28. #QConSF @ana_m_medina 26 Companies doing Chaos Engineering
  29. 29. #QConSF @ana_m_medina 27 Tools You Can Use
 Gremlin
 Chaos Toolkit
 Litmus
 PowerfulSeal
  30. 30. #QConSF @ana_m_medina 28 Break Things Together bit.ly/chaos-eng-slack
 2,000+ members across the world
  31. 31. #QConSF @ana_m_medina THANKS! @ana_m_medina ana@gremlin.com
  32. 32. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/chaos-engineering-gamedays
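
The three demo experiments in the slides share one shape: a hypothesis, one injected failure, and an abort condition of the app being unreachable after 60 seconds. The following is a minimal Python sketch of that shape, not the tooling used in the talk. The names `run_experiment`, `inject_failure`, `is_reachable` and `ExperimentResult` are hypothetical; the caller supplies the failure injection and the health probe, and the clock and sleep functions are parameters only so the loop can be exercised without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    recovered: bool         # True if the app answered before the abort deadline
    seconds_elapsed: float  # time from injection to recovery, or to the abort


def run_experiment(
    inject_failure: Callable[[], None],   # hypothetical: e.g. stops one container
    is_reachable: Callable[[], bool],     # hypothetical: e.g. an HTTP health probe
    abort_after_seconds: float = 60.0,    # abort condition from the slides
    poll_interval_seconds: float = 1.0,   # assumed polling cadence
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> ExperimentResult:
    """Inject one failure, then poll until the app recovers or the deadline passes."""
    start = clock()
    inject_failure()
    while True:
        reachable = is_reachable()
        elapsed = clock() - start
        if reachable:
            # Hypothesis held: the app came back within the allowed window.
            return ExperimentResult(recovered=True, seconds_elapsed=elapsed)
        if elapsed >= abort_after_seconds:
            # Abort condition met: stop the experiment and roll back.
            return ExperimentResult(recovered=False, seconds_elapsed=elapsed)
        sleep(poll_interval_seconds)
```

A result with `recovered=False` is the signal to halt the experiment and restore the target, which is also what keeps the blast radius small.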

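Experiment #3 expects the product listing to survive while images fail. One way a front end could behave that way is sketched below, under stated assumptions: the product names are already available locally (for example from a cache), and only the image lookup crosses the blackholed path. `render_catalog`, `blackholed_fetch` and `PLACEHOLDER_IMAGE` are hypothetical names for illustration and do not come from the demo application.

```python
from typing import Callable, Dict, List

PLACEHOLDER_IMAGE = "placeholder.png"  # hypothetical fallback asset


def render_catalog(
    product_names: List[str],              # assumed to be available locally
    fetch_image_url: Callable[[str], str],  # hypothetical call over the network
) -> List[Dict[str, str]]:
    """Build the listing, substituting a placeholder when an image lookup fails."""
    listing = []
    for name in product_names:
        try:
            image = fetch_image_url(name)
        except (ConnectionError, TimeoutError):
            # Dropped traffic usually surfaces as a timeout, not a refusal.
            image = PLACEHOLDER_IMAGE
        listing.append({"name": name, "image": image})
    return listing


def blackholed_fetch(name: str) -> str:
    """Stand-in for an image lookup whose packets are being dropped."""
    raise TimeoutError(f"no response while fetching image for {name}")
```

Calling `render_catalog(["mug", "shirt"], blackholed_fetch)` still returns both products, each with the placeholder image, which matches the hypothesis on the slide.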