Most of us dread failures, but things go wrong. We can become paralyzed by the fear of being the creator of the next outage or critical bug. After a failure, we often hold a postmortem, but this rarely addresses how we can be more proactive in preventing catastrophes. By considering our missteps, failures, and outright crash-and-burns, we can learn how to ask the right questions at the right time. Siva Katir has thirteen years of experience causing and surviving failures, from the mundane to the maddening. Siva shares the lessons he has learned analyzing his own, his co-workers’, and his organization’s snafus, from shopping carts that let people purchase things for free to personnel tracking software that couldn't track personnel. Come listen, laugh, and learn how to see patterns that help us improve. Discover how to train yourself to anticipate potential problems well before any code is ever written. It is not about being risk averse, but risk aware.
Risk Aware, Not Risk Averse
November 9th, 2017 Siva Katir
Better Software East - 2017
Risk Aware Buys Flexibility & Speed
Metric                PlayFab      Xbox Live
Players per month     79 million   55 million
Employees             15           ~500
Players per employee  5m           100k
Last 30 Days – Features and Fixes
It Buys Stability
PlayFab: Success over Error, 30 Days
99.99% Uptime
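To put the 99.99% figure in perspective, here is a small arithmetic sketch. It assumes the number is a time-based availability over the 30-day window; the slide may instead be reporting a request success ratio, in which case the same formula applies to request counts rather than minutes. The function name `downtime_budget_minutes` is made up for illustration.

```python
def downtime_budget_minutes(availability_percent: float, days: int) -> float:
    """Minutes of downtime allowed in a window at a given availability."""
    total_minutes = days * 24 * 60  # 30 days -> 43,200 minutes
    return total_minutes * (1 - availability_percent / 100)


# 99.99% over 30 days leaves roughly 4.32 minutes of allowed downtime.
print(round(downtime_budget_minutes(99.99, 30), 2))
```

Under that assumption, the whole month's error budget is a little over four minutes.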
First, we fail
“Why do we fall, Bruce? So we can learn to pick ourselves up.” - Thomas Wayne
Learn To Love Failure
• Humans naturally avoid discomfort
• Failure must be about learning
Embracing learning is embracing risk
“I can accept failure, but I can’t accept not trying.”
- Michael Jordan
Passing The Buck
• It is not my fault
• It is my fault
taxRate = 0.8;
subTotal = "1,070.00";
tax = subTotal * taxRate;
total = subTotal + tax;
tax and total are both “0”
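The slide does not say which language the snippet was written in or how the bug was fixed. As a hedged sketch in Python, one way to avoid this class of failure is to parse the formatted subtotal into a decimal value explicitly and to fail loudly on input that cannot be parsed, rather than letting it quietly become zero. The names `parse_amount` and `compute_total` are hypothetical, the comma-as-thousands-separator convention is an assumption, and the tax rate simply reuses the value shown on the slide.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP


def parse_amount(text: str) -> Decimal:
    """Parse a display string such as "1,070.00" into a Decimal.

    Assumes a comma thousands separator and a dot decimal point
    (an assumption; the slide does not specify a locale).
    """
    cleaned = text.strip().replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Fail loudly instead of silently treating the value as 0.
        raise ValueError(f"not a valid amount: {text!r}")


def compute_total(sub_total_text: str, tax_rate: Decimal) -> tuple[Decimal, Decimal]:
    """Return (tax, total), both rounded to cents."""
    sub_total = parse_amount(sub_total_text)
    cent = Decimal("0.01")
    tax = (sub_total * tax_rate).quantize(cent, rounding=ROUND_HALF_UP)
    total = (sub_total + tax).quantize(cent, rounding=ROUND_HALF_UP)
    return tax, total


# Same inputs as the slide; tax rate kept exactly as shown there.
tax, total = compute_total("1,070.00", Decimal("0.8"))
print(tax, total)
```

The design choice here is that a malformed amount raises an error at the boundary, so a free shopping cart shows up as a failed request instead of a completed order.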
Not Our Fault…
But our responsibility.
PlayFab: Not Authenticated Errors
Datacenter On Fire
• System failures occur
• Still our responsibility
Preparing for Risk
• Preparation means training
• Avoid failures through past lessons
“I never stopped getting ready. Just in case.”
- Cmdr. Chris Hadfield
Preparation Starts at Design
• Ask hard questions
• Ask lots of questions
• Never assume answers
Red Flags
– Broad solution
– No one thinks it’s a bad idea
Green Flags
– Focused
– Detailed
Document & Code Reviews
Red Flags
– It’s huge!
– Unrelated changes
Green Flags
– It’s small and focused
– Everything is related
Monitoring
• What is shipped?
• What is nominal?
• Hundreds of metrics: pick 2 or 3
– Netflix: streams started & call center calls
Red Flags
– Too many alarms
– Healthy not defined
Green Flags
– Health metrics known
– Nominal defined
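The monitoring advice above (pick two or three health metrics and define what nominal means for each) can be sketched as follows. This is a minimal illustration, not how PlayFab or Netflix implement monitoring: the metric names, thresholds, and the `NominalRange` and `out_of_range` helpers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NominalRange:
    """The range in which a single health metric counts as nominal."""
    name: str
    low: float
    high: float

    def is_nominal(self, value: float) -> bool:
        return self.low <= value <= self.high


# Hypothetical metrics and thresholds, loosely modeled on the Netflix example.
KEY_METRICS = [
    NominalRange("streams_started_per_min", low=900.0, high=5000.0),
    NominalRange("support_calls_per_min", low=0.0, high=20.0),
]


def out_of_range(readings: dict[str, float]) -> list[str]:
    """Return the key metrics that are missing or outside their nominal range."""
    alarms = []
    for metric in KEY_METRICS:
        value = readings.get(metric.name)
        if value is None or not metric.is_nominal(value):
            alarms.append(metric.name)
    return alarms


# A healthy reading raises no alarms; an unhealthy one names both metrics.
print(out_of_range({"streams_started_per_min": 1200.0, "support_calls_per_min": 3.0}))
print(out_of_range({"streams_started_per_min": 150.0, "support_calls_per_min": 45.0}))
```

Keeping the list of key metrics this short is deliberate: with only a couple of alarms, each one that fires means something.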
Playing Devil’s Advocate
Why build it?
Should we build it?
Can we build it?
– No doesn’t mean don’t.
How will we build it?
Risk Aware is Hard
Risk aware is hard.
It’s hard not to cut corners to avoid risk.
It’s hard to get buy-in for time-consuming best practices.
The quality that being risk-aware buys is worth every penny.
“The only easy day was yesterday.”
- U.S. Navy SEALs