One Terrible Day at Google, and How It Made Us Better

Randy Shoup
@randyshoup
linkedin.com/in/randyshoup

http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
App Engine Outage - Oct 2012

App Engine Reliability Fixit
• Step 1: Identify the Problem
o All team leads and senior engineers met in a room with a whiteboard
o Enumerated all known and suspected reliability issues
o Too much technical debt had accumulated
o Reliability issues had not been prioritized
o Identified 8-10 themes

• Step 2: Understand the Problem
o Each theme assigned to a senior engineer to investigate
o Timeboxed for 1 week
o After 1 week, all leads came back with:
   • Detailed list of issues
   • Recommended steps to address them
   • Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)

• Step 3: Consensus and Prioritization
o Leads discussed themes and prioritized work
o Assigned engineers to tasks

• Step 4: Implementation and Follow-up
o Engineers worked on assigned tasks
o Simple spreadsheet of task status, which engineers updated weekly
o Minimal effort from management (~1 hour / week) to summarize progress at weekly team meeting
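
The weekly summary in Step 4 could be produced from an export of the status spreadsheet along these lines. This is only a minimal sketch: the column names (theme, task, owner, status), the status values, and the sample rows are all made-up placeholders, not the team's actual spreadsheet.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV export of the task-status spreadsheet.
# Column names, status values, and rows are illustrative assumptions.
SAMPLE_CSV = """theme,task,owner,status
Theme A,Task 1,engineer1,done
Theme A,Task 2,engineer2,in_progress
Theme B,Task 3,engineer3,not_started
Theme B,Task 4,engineer4,done
"""


def summarize(csv_text):
    """Count tasks per status for each theme."""
    summary = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        summary.setdefault(row["theme"], Counter())[row["status"]] += 1
    return summary


def format_summary(summary):
    """Render one 'done out of total' line per theme, sorted by theme name."""
    lines = []
    for theme in sorted(summary):
        counts = summary[theme]
        total = sum(counts.values())
        lines.append(f"{theme}: {counts['done']}/{total} done")
    return lines


if __name__ == "__main__":
    for line in format_summary(summarize(SAMPLE_CSV)):
        print(line)
```

A per-theme count like this is roughly what a lead would need to read out at the weekly team meeting; anything more elaborate would cut against the "minimal effort" point.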

• Results
o 10x reduction in reliability issues
o Improved team cohesion and camaraderie
o Broader participation and ownership of the future health of the platform
o Still remembered several years later

Lessons
• Engineering Tradeoffs
• Compelling Event
• Driving Improvement

“We don’t have time to do it right!”
“Do you have time to do it twice?”

The more constrained you are on time or resources, the more important it is to get it done the first time.

Negotiating Tradeoffs
Scope
Time
Quality

During the Incident
• Focus on restoring service
o Everything else is secondary, and should wait
• Shield the team
• Clear, structured communication
o Even when there is nothing to report!
https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response

After the Incident
• Blameless postmortem
• Identify and understand the contributing factors
• Action items and Learnings
• Follow Up!

“Finally we can prioritize fixing that broken system!”

Psychological Safety
• Team is safe for interpersonal risk-taking
• “Being able to show and employ one’s self without fear of negative consequences”
• More important than any other factor in team success

Inclusive Decisionmaking
• Make better business decisions 87% of the time
• Make decisions 2x faster with 1/2 the meetings
• Deliver 60% better business results
Cloverpop Inclusive Decisionmaking study, 2016
As we improve diversity, decisionmaking improves

15 Million
“Never let a good crisis go to waste.”

“Incidents are unplanned investments, and they are also opportunities. Your challenge is to maximize the ROI on the sunk cost.”
-- John Allspaw, Adaptive Capacity Labs

Frame the Problem:
Quality and reliability are business concerns

Use Common Currency
Time
Money
People

Improvement Budget
• Explicit resource investment
o Agree on an up-front investment (e.g., 25% or 30% of engineering effort)
• Retain autonomy, Provide transparency
o Making these decisions is exactly why they hired you
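
To express an improvement budget in the "common currency" of people and time, the arithmetic can be sketched as below. The function name, the team size, and the 12-week planning period are hypothetical; only the 25% and 30% figures come from the slide.

```python
def improvement_budget(engineers, weeks, percent):
    """Return engineer-weeks reserved for improvement work.

    engineers and weeks describe a hypothetical planning period;
    percent is the agreed up-front share of engineering effort.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    total_engineer_weeks = engineers * weeks
    return total_engineer_weeks * percent / 100


if __name__ == "__main__":
    # Assumed example: 10 engineers over a 12-week quarter.
    for pct in (25, 30):
        budget = improvement_budget(10, 12, pct)
        print(f"{pct}% budget: {budget} engineer-weeks")
```

Stating the budget as engineer-weeks up front is one way to give the business transparency while leaving the choice of what to fix with the engineering team.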

Common Elements
• Unintentional, long-term accumulation of small, individually reasonable decisions
• “Compelling event” catalyzes long-term change
• Blameless culture makes learning and improvement possible
• Structured post-incident approach

Incident Response Patterns
• Incident Roles
• Incident Triggers
• On-Call Rotation and Onboarding
• Incident Command Training
• Incident Communication Plan
• Periodic Incident Updates
• Shared Incident State Doc
• Incident Call Recording
• Incident Swarming
• Local / Global Incident Reviews
• Post-Review Improvement Items

Thank you!
@randyshoup
linkedin.com/in/randyshoup
medium.com/@randyshoup
