In October 2012, Google App Engine had an 8-hour global outage. This session walks through the incident and the "Reliability Fixit" it inspired. Learn how the team came together and, over the next 6 months, reduced reliability issues by 10x. Also take away broader insights on engineering tradeoffs, incident management, and driving improvement.
5. App Engine Reliability Fixit
• Step 1: Identify the Problem
o All team leads and senior engineers met in a room with a whiteboard
o Enumerated all known and suspected reliability issues
o Too much technical debt had accumulated
o Reliability issues had not been prioritized
o Identified 8-10 themes
@randyshoup
6. App Engine Reliability Fixit
• Step 2: Understand the Problem
o Each theme assigned to a senior engineer to investigate
o Timeboxed for 1 week
o After 1 week, all leads came back with
• Detailed list of issues
• Recommended steps to address them
• Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)
7. App Engine Reliability Fixit
• Step 3: Consensus and Prioritization
o Leads discussed themes and prioritized work
o Assigned engineers to tasks
8. App Engine Reliability Fixit
• Step 4: Implementation and Follow-up
o Engineers worked on assigned tasks
o Simple spreadsheet of task status, which engineers updated weekly
o Minimal effort from management (~1 hour / week) to summarize progress at weekly team meeting
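The tracking described in Steps 2 through 4 can be sketched in a few lines of Python. This is only an illustration, not the team's actual tooling (the slides mention a simple spreadsheet). The `Task` fields, the `EFFORT_DAYS` mapping of effort buckets to working days, the status values, and the sample themes and tasks are all assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed mapping from order-of-magnitude effort buckets to working days.
EFFORT_DAYS = {"1 day": 1, "1 week": 5, "1 month": 20}


@dataclass
class Task:
    theme: str    # one of the reliability themes (hypothetical names below)
    title: str
    owner: str
    effort: str   # key into EFFORT_DAYS
    status: str   # assumed values: "todo", "in progress", "done"


def weekly_summary(tasks):
    """Return task counts per status and the estimated effort still open."""
    counts = Counter(task.status for task in tasks)
    remaining = sum(
        EFFORT_DAYS[task.effort] for task in tasks if task.status != "done"
    )
    return {"counts": dict(counts), "remaining_days": remaining}


# Hypothetical rows, standing in for the spreadsheet engineers updated weekly.
tasks = [
    Task("monitoring", "Add alert for queue depth", "alice", "1 day", "done"),
    Task("monitoring", "Dashboard for error rates", "bob", "1 week", "in progress"),
    Task("deploys", "Automate rollback", "carol", "1 month", "todo"),
]

summary = weekly_summary(tasks)
print(summary)
```

A summary like this is roughly what the ~1 hour / week of management effort would produce for the weekly team meeting.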
9. App Engine Reliability Fixit
• Results
o 10x reduction in reliability issues
o Improved team cohesion and camaraderie
o Broader participation and ownership of the future health of the platform
o Still remembered several years later
16. During the Incident
• Focus on restoring service
o Everything else is secondary, and should wait
• Shield the team
• Clear, structured communication
o Even when there is nothing to report!
https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
17. After the Incident
• Blameless postmortem
• Identify and understand the contributing factors
• Action items and learnings
• Follow Up!
https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
18. “Finally we can prioritize fixing that broken system!”
19. Psychological Safety
• Team is safe for interpersonal risk-taking
• “Being able to show and employ one’s self without fear of negative consequences”
• More important than any other factor in team success
20. Inclusive Decisionmaking
• Make better business decisions 87% of the time
• Make decisions 2x faster with 1/2 the meetings
• Deliver 60% better business results
Cloverpop Inclusive Decisionmaking study, 2016
As we improve diversity, decisionmaking improves
22. “Incidents are unplanned investments, and they are also opportunities. Your challenge is to maximize the ROI on the sunk cost.”
-- John Allspaw, Adaptive Capacity Labs
26. Improvement Budget
• Explicit resource investment
o Agree on an up-front investment (e.g., 25% or 30% of engineering effort)
• Retain autonomy, provide transparency
o Making these decisions is exactly why they hired you
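As a rough illustration of what an up-front investment of 25% or 30% means in practice, the arithmetic can be sketched as below. The function name `improvement_budget`, the team size, and the quarter length are assumptions for the example and do not come from the talk.

```python
def improvement_budget(total_engineer_weeks, fraction):
    """Split total capacity into improvement work and all other work."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    improvement = total_engineer_weeks * fraction
    return improvement, total_engineer_weeks - improvement


# Hypothetical team: 10 engineers over a 12-week quarter.
capacity = 10 * 12

for fraction in (0.25, 0.30):
    improvement, other = improvement_budget(capacity, fraction)
    print(
        f"{fraction:.0%} budget: {improvement:.0f} engineer-weeks of improvement, "
        f"{other:.0f} engineer-weeks of other work"
    )
```

Agreeing on the fraction once, up front, is what lets the team decide how to spend that capacity on its own while still reporting on it transparently.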