Nothing Good Ever Happens After 2am
Reversim 2019
Daniel Korn
Engineering Team Lead at BigPanda 
korndaniel1
BigPanda’s Outage Procedure
Roles and responsibilities
• On-call
• Incident Manager On-Call (IMOC)
• Tech Lead On-Call (TLOC)
• Support On-Call (SOC)
Incident Priority Definitions
Priority | Affects | Outage Resolution
P1 | Core feature, multiple customers | 24/7
P2 | Core feature, single customer | 24/7
P3 | Secondary feature, no workaround | Next business day
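The priority definitions above can be read as a small decision rule. The following is a minimal sketch of that reading, not BigPanda's actual tooling: the `Impact` fields, the `classify` function, and the choice to return `None` for cases the table does not cover are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Priority(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


@dataclass(frozen=True)
class Impact:
    # Hypothetical fields; the slides only name the criteria, not a schema.
    core_feature: bool
    customers_affected: int
    has_workaround: bool


# Resolution targets copied from the priority table.
RESOLUTION = {
    Priority.P1: "24/7",
    Priority.P2: "24/7",
    Priority.P3: "Next business day",
}


def classify(impact: Impact) -> Optional[Priority]:
    """Map an impact assessment to a priority, following the table."""
    if impact.core_feature:
        # Core feature: multiple customers is P1, a single customer is P2.
        if impact.customers_affected > 1:
            return Priority.P1
        return Priority.P2
    if not impact.has_workaround:
        # Secondary feature with no workaround.
        return Priority.P3
    # Assumption: the table does not define this case, so report no priority.
    return None
```

For example, `classify(Impact(core_feature=True, customers_affected=5, has_workaround=False))` yields `Priority.P1`, and `RESOLUTION[Priority.P1]` gives the matching resolution window.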
Tools
• Alerting
• Communication
• Observability
1. Alert/Support notifies On-call
2. IMOC assesses impact, determines P1/P2/P3
3. On-call performs simple mitigation
4. On-call escalates to IMOC
5. IMOC escalates to TLOC and SOC
6. On-call: If (P1) { StatusPage; dedicated channel; }
7. SOC updates customers
8. R&D mitigates until solved, updates StatusPage
9. IMOC verifies resolved, posts summary in channel
10. IMOC runs postmortem, shares with stakeholders
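The `If (P1) { StatusPage; dedicated channel; }` pseudocode above could be written out roughly as follows. This is only a sketch: the function name and the action strings are hypothetical, and it returns a list of actions to take rather than calling any real StatusPage or chat API.

```python
from typing import List


def on_call_actions(priority: str) -> List[str]:
    """Return the extra actions On-call takes for a given priority."""
    actions: List[str] = []
    if priority == "P1":
        # Both actions come from the slide's P1 branch.
        actions.append("open StatusPage incident")
        actions.append("open dedicated channel")
    # Assumption: P2 and P3 trigger neither action at this step.
    return actions
```

Returning the actions as data keeps the rule easy to check against the written procedure before wiring it to real tools.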
The Long Night
THIS IS A TRUE STORY.
The events depicted in this postmortem
took place in Tel Aviv and San Francisco
in 2018.



Despite the request of the survivors, the
names have not been changed.
Out of respect for our customers, the
story has been told exactly as it occurred.
Michal
On-call
Almog & Pini
TLOCs
Daniel (Me)
TLOC
Shmeff Andru
SOC Support
Julio
Support
Background
• REMINDER: BigPanda’s SLA
• New Access Control (RBAC) service
• Not all customers migrated
• Sunday: Multi-service deployment
[MON 05:03 PM] SOC: Multiple tickets: “cannot update environments”
[05:05 PM] On-call: Asks SOC for details, opens a dedicated Slack channel
[05:08 PM] On-call: Identifies the issue as Auth-related, notifies TLOCs
[05:33 PM] SOC: Considers opening a status page, but “might be a P3”
[05:35 PM] On-call: “We think it’s related to a deploy, working on a fix”
[06:16 PM] SOC: Opens status page
TAKEAWAY: Stick to the Plan
[06:50-07:30 PM] TLOCs: Fix is tested, issue not reproduced; debate whether to fix or revert
[07:41 PM] TLOCs: Deploy fix to production
[07:45-08:05 PM] SOC: Verifies together with TLOCs that the issue is resolved
[08:10 PM] SOC: Closes status page. On-call and TLOCs leave
TAKEAWAY: Rule of Thumb: REVERT FIRST
[12:45 AM] Support: “Some customers can’t see roles in the env editor”
[12:57 AM] SOC: “So it appears to be just a UI issue”. Notifies On-call
[12:59 AM] On-call: Notifies TLOC
[01:01 AM] TLOC: Starts investigating the issue
“If it looks like an outage, and (support) sounds like an outage, then it might be just a bug”
– Someone smart
TAKEAWAY: Do not Assume an Outage
[01:20 AM] TLOCs: Identify the cause, start working on a fix
[01:54 AM] TLOCs: Deploy fix to production, ask SOC to verify with customers
“If you think this has a happy ending, you haven’t been paying attention.”
– Ramsay Bolton
[01:57 AM] Support: Customers report the initial issue: “cannot update environments”
[02:00 AM] SOC + Support: Debate re-opening the StatusPage
[02:03 AM] TLOCs: Start investigating the issue
[02:10 AM] TLOCs: Identify the cause: lack of permissions (migration)
[02:15-02:51 AM] TLOC: Manually adds missing permissions to customers DB
TAKEAWAY: Time to Call it a Night
[02:52 AM] TLOC: Having problems with a specific customer
[02:56 AM] SOC: Verifies this customer is facing the issue
[02:56-03:25 AM] TLOCs: Identify the problem: an edge case involving FT and manual customizations
[03:25 AM] SOC: Asks TLOC to discuss the situation on a phone call
[03:29- AM] SOC + TLOC: Sensitive customer, no changes, issue remains
[-04:07 AM] SOC + TLOC: SOC asks TLOC to commit to a fix by EOD
[09:30 AM - 05:12 PM] TLOCs: Implement a fix, deploy to production, ask SOC to verify
[05:25 PM] SOC: Verifies issue resolved
TAKEAWAY: Do not Commit to Action Items
[07:00 PM] CS + R&D + PM: Joint postmortem, preparing customer updates
[WED 11:00 AM] R&D: Conduct a postmortem, share with R&D and CS
“Chaos isn’t a pit. Chaos is a ladder.”
– Petyr “Littlefinger” Baelish
Recap
• Stick to the plan
• Rule of thumb: REVERT FIRST
• Do not assume an outage
• Time to call it a night
• Do not commit to action items