
How Did Things Go Right? Learning More From Incidents

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/31YfGIV.

Ryan Kitchens describes more rewarding ways to approach incident investigation without overly focusing on failure prevention. Filmed at qconnewyork.com.

Ryan Kitchens is a Site Reliability Engineer on the Core team at Netflix where he works on building capacity across the organization to ensure its availability and reliability. Before that, he was a founding member of the SRE team at Blizzard Entertainment.

Published in: Technology

How Did Things Go Right? Learning More From Incidents

  1. How Did Things Go Right? LEARNING MORE FROM INCIDENTS. RYAN KITCHENS, @this_hits_home
  2. InfoQ.com: News & Community Site. 750,000 unique visitors/month. Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese). Post content from our QCon conferences. News 15-20 / week; Articles 3-4 / week; Presentations (videos) 12-15 / week; Interviews 2-3 / week; Books 1 / month. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/incident-investigation-failure-prevention/
  3. Presented at QCon New York, www.qconnewyork.com. Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation. Strategy: practitioner-driven conference designed for YOU, influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators. Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide.
  4. SRE WORLD: SLOs & SLIs, War Rooms, Error Budgets, Incident Reviews
  5. Failure is so important
  6. Failure is so important that it’s no longer interesting.
  8. TOTALLY CORRECT OPINIONS: Failure is so important that it’s no longer interesting. Failure is the normal state.
  9. TOTALLY CORRECT OPINIONS: Failure is so important that it’s no longer interesting. Failure is the normal state. We’re pretty good @ preventing incidents.
  10. TOTALLY CORRECT OPINIONS: Failure is so important that it’s no longer interesting. Failure is the normal state. We’re pretty good @ preventing incidents. Recovery > prevention.
  11. TOTALLY CORRECT OPINIONS: Failure is so important that it’s no longer interesting. Failure is the normal state. We’re pretty good @ preventing incidents. Recovery > prevention. The most important thing we can learn is how to build the capacity to encounter failure successfully.
  13. TOTALLY CORRECT OPINIONS: Failure is so important that it’s no longer interesting. Failure is the normal state. We’re pretty good @ preventing incidents. Recovery > prevention. Availability is made up & the nines don’t matter.
  15. TOTALLY CORRECT OPINIONS: Failure is so important that it’s no longer interesting. Failure is the normal state. We’re pretty good @ preventing incidents. Recovery > prevention. Availability is made up & the nines don’t matter. Success isn’t the absence of failure.
  16. TOTALLY CORRECT OPINIONS: Failure is so important that it’s no longer interesting. Failure is the normal state. We’re pretty good @ preventing incidents. Recovery > prevention. Availability is made up & the nines don’t matter. Success isn’t the absence of failure. There is no panacea
  17. TOTALLY CORRECT OPINIONS: Failure is so important that it’s no longer interesting. Failure is the normal state. We’re pretty good @ preventing incidents. Recovery > prevention. Availability is made up & the nines don’t matter. Success isn’t the absence of failure. There is no panacea (except for expertise).
  19. There is no root cause.
  23. There is no root cause. A PERFECT STORM
  24. There is no root cause. A PERFECT STORM: EACH NECESSARY, BUT ONLY JOINTLY SUFFICIENT
  25. There is no root cause. Multiple contributing factors. A PERFECT STORM: EACH NECESSARY, BUT ONLY JOINTLY SUFFICIENT
  26. Three Pillars of Fallacy: Comprehension, Understandability, Predictability. NONE OF THESE ARE ENTIRELY POSSIBLE
  27. Three Pillars of Fallacy (NONE OF THESE ARE ENTIRELY POSSIBLE). Comprehension: Incidents cannot be fully comprehended. They are fraught with uncertainty. Remediation items will contribute to further incidents.
  28. Three Pillars of Fallacy (NONE OF THESE ARE ENTIRELY POSSIBLE). Comprehension: Incidents cannot be fully comprehended. They are fraught with uncertainty. Remediation items will contribute to further incidents. Understandability: Managing the small stuff does not prevent big incidents. Incidents are not made up of causes. We do not find them; we construct them.
  29. Three Pillars of Fallacy (NONE OF THESE ARE ENTIRELY POSSIBLE). Comprehension: Incidents cannot be fully comprehended. They are fraught with uncertainty. Remediation items will contribute to further incidents. Understandability: Managing the small stuff does not prevent big incidents. Incidents are not made up of causes. We do not find them; we construct them. Predictability: Learning from the last incident will not allow you to predict the next one. Complex systems are not deterministic. Their state cannot be precisely or repeatedly foretold.
  31. Ask more ‘how’ than ‘why’ TO LEARN MORE FROM INCIDENTS
  32. 5 Whys is a competitive disadvantage.
  34. There is no root cause.
  36. SAME PROCESS, DIFFERENT OUTCOMES. [Diagram labels: Normal Work; Performance Variability; Workarounds; Non-Compliance; SUCCESS; FAILURE.] Make sure that things go right rather than preventing them from going wrong.
  44. Constraining ‘PERFORMANCE VARIABILITY’ in order to remove failures will also remove SUCCESSFUL EVERYDAY WORK!
  45. [Figure by Kelvin Genn: Unwanted Outcomes, Planned Outcomes, Positive Surprises]
  46. Traditional investigation uses a small portion of the total experience. [Figure by Kelvin Genn: Unwanted Outcomes, Planned Outcomes, Positive Surprises]
  47. Traditional investigation uses a small portion of the total experience. What about the 99.999% of the time in which things go right? [Figure by Kelvin Genn: Unwanted Outcomes, Planned Outcomes, Positive Surprises]
  53. Availability is perceived.
  54. Success is often invisible
  55. Success is often invisible. Typical, everyday performance goes unknown & ignored.
  56. SUCCESS AND FAILURE IS NOT REALLY A DUALITY. How do we
  57. SUCCESS AND FAILURE IS NOT REALLY A DUALITY. How do we keep normal work normal
  58. SUCCESS AND FAILURE IS NOT REALLY A DUALITY. How do we keep normal work normal and keep incidents weird?
  59. DO YOUR INCIDENTS GET FILED OR DO THEY GET READ?
      PROD ALERT / INC-1806: Signups Impact Triggered By Website Going Cattywampus
      DETAILS. Type: Incident. Regions: Global. Severity: Major. Signup Impact?: Yes. Notifications: Yes. Source of Detection: Paging Alert.
      ACTIVITY. Status: STABLE. Assignee: Ryan Kitchens. Watchers: 0.
      DATES. Created: 03/25/2019 1:45 PM. Updated: 03/25/2019 1:50 PM. Incident Start: 03/25/2019 1:40 PM. Incident Detected: 03/25/2019 1:41 PM. Stabilization Time: 03/25/2019 1:48 PM.
      DESCRIPTION. We were paged for an abnormal RPS trend which kicked off the investigation. Signup flows were depressed in conjunction with a spike in traffic to the website that sent Node.js into a ‘death spiral’. During this time, customers would have been experiencing errors when attempting to load the site. We are still looking into the nature of the requests to surface any patterns that may be helpful.
  66. Contributors & Enablers, Mitigators, Risks, Difficulties in Handling, Follow-up Items, Artifacts, Timeline, References, Open Questions
  70. Contributors & Enablers: An unstaged, global property change was performed. Large volume of redemptions during the holiday window. Invalid rules added to redemption workflows. Removal of rules contributed to an increase of traffic. Validate commands started failing in mid-tier APIs. ID validation was not working as expected. Recent move to new instance types had unexpected perf. Mitigators: SR rule to handle broken headers. Only regions in peak traffic window observed the issue. Website clusters auto-scaled as expected to meet demand. An old ASG was able to be repurposed into additional capacity. Rate limiting was observed via recently created dashboards. New ASG was deployed which patched a bug in the validation. All relevant teams were alerted to the issue in under one minute. An engineer discovered the property change by searching events. Risks: Division of labor across teams. Usage of deprecated property UI over the recommended one. No type safety or constraints were in place for the property. The team had confusion about how to revert the property. Property parser is overly optimistic of intended context. Clearing massive amounts of individual rules is not trivial. Afterwards, we still observed short-circuiting on redemption. No feedback to users on effects of removing rules en masse.
  71. Risks: Use of deprecated UI over the recommended one. No type safety or property constraints. Division of labor across teams. The team had confusion about how to revert. Property parser is overly optimistic of intended context. Clearing massive amounts of individual rules is not easy. Short-circuiting for redemption persisted long after. No system feedback to users during mass rule removal.
  72. Mitigators: Proxy rules added to handle broken headers. Only peak traffic regions observed the issue. Clusters eventually auto-scaled to meet demand. Old ASG was repurposed into additional capacity. New ASG deployed to patch a bug in validation. Rate-limiting observed via recently saved dashboard. All relevant teams were alerted to the issue in under 1 min. A person discovered the change by searching events.
  73. Contributors & Enablers: An unstaged, global property change was performed. Large volume of redemptions during holiday window. Invalid rules added to redemption workflows. Removal of rules contributed to an increase of traffic. Validate cmd started failing in mid-tier APIs. ID validation was not working as expected. Unexpected perf from upgrade to new instance types.
  75. ‘Islands of Knowledge’
  76. ‘Islands of Knowledge’: “I didn’t know it worked like that.”
  77. ‘Islands of Knowledge’: “I didn’t know it worked like that.” “We didn’t make any changes… did we?”
  83. Trade-offs Under Pressure: HEURISTICS AND OBSERVATIONS OF TEAMS RESOLVING INTERNET SERVICE OUTAGES, by John Allspaw
  87. What’s going on when it seems like nothing is happening? (KEEP CALM: NOTHING IS HAPPENING)
  88. How hard are people working just to keep the system healthy?
  89. How hard are people working just to keep the system healthy? How do you think about feeding this back up the chain?
  90. How hard are people working just to keep the system healthy? How do you think about feeding this back up the chain? Or do you not, because everything looks good?
  91. THERE IS NO GETTING AROUND THIS. THE HARDEST PROBLEM IN TECH
  92. Interviews: getting from ‘why did things go wrong?’ to ‘how did things go right?’ EVERY EXPERIENCE IS VALID & ESSENTIAL
  93. MIND IF I TAKE NOTES?
  94. UP FOR A QUICK CHAT?
  95. Learning Teams: HOW DO TEAMS ADAPT SUCCESSFULLY?
  96. ASK BETTER QUESTIONS
  97. Interviews try to convey what the world looked like from their perspective. Elicit descriptions to construct how we got here.
  98. How we respond is important. “WE MUST... Be more careful. Avoid making mistakes. Have more discipline. Show more thoughtfulness.” ‘BEST PRACTICES’ ARE NOT REAL PRACTICES
  101. ‘Incidents’ are SURPRISES
  102. Takeaways: Recovery > Prevention. There is no root cause. Stop reporting on the nines. Learn how things go right.
  103. Thank you. @this_hits_home http://continuous.wtf
  104. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/incident-investigation-failure-prevention/
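
Slides 24 and 25 describe an incident as a perfect storm of contributing factors, each necessary but only jointly sufficient. The Python sketch below is a toy illustration of that phrase only. It assumes, purely for illustration and not as something the talk claims, that an incident can be modelled as a conjunction of boolean factors. The factor names are hypothetical labels loosely borrowed from the Contributors & Enablers list on slide 73.

```python
# Toy model (an assumption for illustration): the incident happens only
# when every contributing factor is present at the same time.
factors = {
    "unstaged_global_property_change": True,
    "holiday_redemption_volume": True,
    "invalid_rules_in_workflow": True,
}


def incident_occurs(present: dict[str, bool]) -> bool:
    """Jointly sufficient: all factors together produce the incident."""
    return all(present.values())


def is_necessary(name: str, present: dict[str, bool]) -> bool:
    """Necessary: removing this one factor alone prevents the incident."""
    without = {**present, name: False}
    return incident_occurs(present) and not incident_occurs(without)


all_together = incident_occurs(factors)
each_necessary = {name: is_necessary(name, factors) for name in factors}
# No single factor, present on its own, is enough to cause the incident.
any_single_sufficient = any(
    incident_occurs({n: (n == name) for n in factors}) for name in factors
)
```

In this toy model no single factor qualifies as a "root cause": removing any one of them prevents the outcome, and none of them produces it alone.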
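
The DATES block of INC-1806 on slide 59 supports a small worked calculation. The sketch below uses only Python's standard library to derive time to detect and time to stabilize from the timestamps shown on the slide. The variable names, and the choice to measure stabilization from incident start rather than from detection, are assumptions; the slides do not define either metric.

```python
from datetime import datetime

# Format assumed from the ticket text, e.g. "03/25/2019 1:40 PM".
FMT = "%m/%d/%Y %I:%M %p"

# Timestamps copied from the DATES block of INC-1806.
dates = {
    "Incident Start": "03/25/2019 1:40 PM",
    "Incident Detected": "03/25/2019 1:41 PM",
    "Stabilization Time": "03/25/2019 1:48 PM",
}
parsed = {label: datetime.strptime(value, FMT) for label, value in dates.items()}

time_to_detect = parsed["Incident Detected"] - parsed["Incident Start"]
time_to_stabilize = parsed["Stabilization Time"] - parsed["Incident Start"]

print(time_to_detect)     # 0:01:00
print(time_to_stabilize)  # 0:08:00
```

These durations record how long things took, not how the responders achieved them, which is the question the talk argues is more worth asking.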
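
The review sections listed on slide 66 could be captured as a simple record, which makes it easy to see which sections of a write-up are still empty. This is a minimal sketch, not the format the speaker's team uses: the class name `IncidentReview` and the field names are hypothetical, chosen only to mirror the section titles on the slide, and the two sample entries are copied from slides 71 and 72.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentReview:
    """Hypothetical record; field names mirror the slide's section titles."""

    contributors_and_enablers: list[str] = field(default_factory=list)
    mitigators: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)
    difficulties_in_handling: list[str] = field(default_factory=list)
    follow_up_items: list[str] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)
    timeline: list[str] = field(default_factory=list)
    references: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)


review = IncidentReview(
    mitigators=["Old ASG was repurposed into additional capacity."],
    risks=["No type safety or property constraints."],
)

# Sections nobody has filled in yet, in the order the slide lists them.
empty_sections = [name for name, value in vars(review).items() if not value]
```

Note that the record gives mitigators (what went right) the same standing as contributors and risks, in line with the talk's emphasis.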
