Engineers are frequently tasked with being front and center in intense, highly demanding situations that require clear lines of communication. Our systems fail not because of a lack of attention or laziness but due to cognitive dissonance between what we believe about our environments and the objective interactions both internal and external to them.
It’s time to revisit your established beliefs surrounding failure scenarios, with an emphasis not on the “who” in decision making but instead on the “why” behind those decisions. With attention to growth mindset, you can encourage your teams to reject shallow explanations of human error for said failures and focus on how to gain greater understanding of these complexities and push the boundaries on what you believe to be static, unchanging context outside your sphere of influence.
Will Gallego walks you through the structure of postmortems used at large tech companies with real-world examples of failure scenarios and debunks myths regularly attributed to failures. You’ll learn how to incorporate open dialogue within and between teams to bridge these gaps in understanding.
2. Hey, I’m Will!
Systems Engineer @ Etsy
Intro slide - Hi, I’m Will Gallego. I’m a Staff Engineer on Etsy’s Systems Engineering team. I’ve facilitated dozens of post mortems over the last 6+ years there, ranging from major site outages due to accidental upgrades to coffee pots overflowing in the morning. All involved our shared beliefs about how our complex systems function. I’m excited to be here today sharing some of the knowledge I’ve picked up during this time with all of you.
3. Two Questions
(CfR) Two questions for you
1. By show of hands, who here has never participated in a post mortem (doc or discussion) before?
2. The second I want you to think about for a moment - Why do we put together post mortems?
(For those folks who have raised their hands, have folks give some answers, keep them in mind for later)
4. Storytelling
How I start every PM. It’s my “Once upon a time”. PM meetings are not a rote listing of events to be dictated to a group of people. They are a collection of stories that we build together.
A Senior Engineer is new to his company. He takes it upon himself to tweak code he doesn’t own. He deletes a file and deploys the code to production. The site then goes down. He reverts the file in git,
deploys - the site stays down. He then finds other experienced engineers close by to fix the bug to restore stability in production.
(CfR) What should we do with this engineer - retrain them, transfer, demote, fire? Walk them out that day or give them two weeks?
Who else do we correct in response to this incident?
• The hiring manager?
• The original author of the 404 page?
• The Ops team that set up the alerts for this?
• The testing team for not writing tests to catch this flaw?
• Dell - We had to ILO the machines to power cycle them to reboot them. Why can’t we power cycle a couple hundred machines at once?
5. Blame as a Barrier
What has firing/demotion/etc taught us? What about hiring manager, QA team, Ops team w/ automation tools, Dell for machines that required us to power cycle?
What happens when the next engineer wants to change this file? This is where our discussions into the importance of Post Mortems are so critical, because we move past knee-jerk reactions. Blame is preventing us from any real learning.
6. Retributive vs. Restorative
We have two approaches when things go wrong. We can take a retributive approach to managing failure or a restorative one.
7. Retributive Justice
An emphasis on finding culprits, determining
who broke the rules and the punishment for
the infraction. In particular, it’s necessary to
make sure guilty parties pay.
The “lines to be crossed” when something goes wrong are constructed after the fact. Fault is built into the biases we insert into the story constructed after the fact. Retributive justice decides what is and isn’t legitimate after actions are taken, long after decisions are made.
8. Questions involved with Retributive Justice
• What rule or rules are broken?
• How bad is the breach of contract?
• What should the consequence(s) be?
Story: Jr. Engineer drops production database. College degree, hired, follows instructions in wiki. Told not to use the credentials in the document, though. Drops production database, backups don’t take. Is immediately fired and told legal may be involved.
(CfR) What are the outcomes, directly and indirectly, from retributive justice?
9. Restorative Justice
Collaboratively discuss when things go wrong
and what we can do, together, to repair the
harm. Also known as “Just Culture”
Needs of victims before the needs of rules broken
Open transparency is key. Nothing is hidden or squirreled away
10. Questions involved with Restorative Justice
• Who has been hurt?
• What are their needs?
• Whose obligation is it to meet those?
PM Culture. How do we get folks to believe that they will not be punished for opening up and teaching? Likewise, how can we feel comfortable learning without feeling judged for not knowing?
Being blame aware is not the same as a lack of accountability. It’s tied up directly with accountability. We want to feel comfortable knowing what everyone’s actions were to avoid covering up and then to
empower ourselves to move towards stability.
12. “Blameless” → “Blame Aware” Post Mortems
Blamelessness is often associated with a lack of responsibility, which is far from the truth. Our tendency is to avoid naming names for fear we’re pointing fingers. This lends itself to hiding necessary data points that shed light on the nature of the decision making.
13. “Blameless” → “Blame Aware” Post Mortems?
Questioning whether “Post Mortem” is a negative term. No one died! Why are we hyping it up as if someone did?
It’s borrowed from other fields
We still call problems in our code “bugs” and ascribe negativity to that
Changing verbiage is really difficult
14. Alternatives to “Post Mortem”
• Learning Review
• Post Incident Review
• Retrospective
• Debriefing
• Correction of Error
• Root Cause Analysis
15. “Learning through shared discussion of our
beliefs on what transpired over an agreed
upon limited number of events”
Before we can talk about how to run a Post Mortem, it’s important to highlight my definition. It’s a tool and this is the tool I’ve found useful for expressing these ideas. Let’s break this into a few parts
16. “Learning through shared discussion of our
beliefs on what transpired over an agreed
upon limited number of events”
“Learning” - We’ve gathered together because we believe that through the sharing of experiences in open dialogue we can gain greater understanding.
No discussion of bug fixes, remediation items, etc. We’re here to learn, to understand how our systems work and what surprised us from these events.
17. Learning Culture
- Through the sharing of experiences in open dialogue we can gain greater understanding.
- Blame aware post mortems do not guarantee you learn anything simply because you’re not blaming anyone!
- A facilitator clears the path of blame, which blocks us from uncovering hidden truths. The path can be clear, but it’s up to the participants to learn and share.
18. Fixed Mindset vs Growth Mindset
All of this is predicated on the belief that we can get better.
Fixed Mindset - Some things we’re good at and some we’re not. If you’re not a good fit, you don’t belong. Remove all the people who made wrong choices and you’ve fixed the system.
Growth Mindset - Our knowledge of the working world is mutable. We can learn from our mistakes and in fact be made better by them.
19. “Learning through shared discussion of our
beliefs on what transpired over an agreed
upon limited number of events”
• We were surprised. We had one interpretation of our world before these events. Now we have a different one.
• We will never know with complete clarity and thoroughness everything that occurs within a single event. There is always something else to dig into
• Our memories are faulty - we smooth out “rough edges”, limiting our scope of knowing what happened
• We can’t know exactly all the parts that are involved in a person’s decision making during an event. Recreating it to its exact nature is impossible
20. “As the complexity of a system increases, the
accuracy of any single agent's own model of
that system decreases rapidly.”
Woods’ Theorem, Stella Report (https://stella.report)
The Stella Report presents the findings from a collaboration between the members of the “SNAFU Catchers’ Consortium” (IBM, Etsy, IEX, and several researchers from Ohio State University) on several outages and the investigations into them.
(CfR) Show of hands, who here knows every
• line of code?
• micro service in their company?
• within their primary stack?
Now think about every time we want to add a new feature.
21. “Learning through shared discussion of our
beliefs on what transpired over an agreed
upon limited number of events”
We have finite time surrounding an event. Everything we can possibly know about a collection of events is limited - we have to agree on the constraints of what we can dig into, for how long, and to what
extent.
1. Only so much time to prepare a timeline and to review it.
2. Our ability to understand the entirety of a system is limited
3. The system underneath us is constantly changing. So our understanding of a situation is pinpointing the location of a moving target. Systems adapt and so must we.
22. Understanding Decision Making
Spectre/Meltdown bugs (CfR)
• What if we make a design optimization to give us a speed boost, should we do it?
• It’s gone through QA, rigorous testing, and seems to be without flaws. Do we deploy to production?
• How long would you wait on a design optimization until you consider it a success?
• What if it were a hardware optimization?
• Every Intel processor since 1995 (!) is affected, along with AMDs, etc. 20 years+ to discover this bug!
• “Speculative execution”, Abuse: at worst, arbitrary kernel virtual memory read vulnerabilities
• Were they “wrong” to implement it?
23. Local Rationality
People act in accordance with what they believe is the best course of action given the information available at the time. No one goes to work wanting to break something.
When your instinct is to dismiss someone’s decision, remember this. Empathy can be a powerful tool in learning.
24. The ETTO Principle
Efficiency-Thoroughness Trade-Off :
“…people (and organizations) as part of their activities
frequently – or always – have to make a trade-off between the
resources (primarily time and effort) they spend on preparing
to do something and the resources (primarily time and effort)
they spend on doing it”
(semi CfR) - git merge someone else’s code in large company. Check everyone’s commits?
25. In search of Efficiency
• Not enough information
• Too much information:
Omission, Reduced Precision, Queuing, Filtering, Decentralization
• Not enough time
• Pressure to achieve
Omission - look at main dashboard!
Reduced Precision - What’s Pi?
Queuing - Site takes 30 seconds to load and also the logo is 3 pixels too far to the left. Which do you fix first?
Filtering - Errors coming in, which log lines do you look at first?
Decentralization - Too many tasks to dig into, DBAs look at error logs on db machines, Ops look at Apache logs, etc.
26. Safety
(CfR): Are we ever more safe than we should be? (How do you know you’re safer than necessary until after?)
• Can you give an example of safe behavior in our daily engineering practice? Is that rule always obeyed?
• What determines “safe” behavior in a given situation?
• Are the bounds for safe behavior equally shared? If you were to ask someone else what “safe” is…
27. “Despite the universal agreement that safety
is important, there is no unequivocal
definition of what safety is”
Source: The ETTO Principle
We want to be “just” thorough enough to do the job without things going *too* wrong. Conflict over decisions can often come from our disagreement on exactly what “safe” is. An action I believe is safe and act upon might not be seen as safe by another person. This is easy to see after the fact.
28. “Safety is not the absence of accidents. It is the
presence of capacity”
- Todd Conklin
Be wary not to confuse a lack of accidents with being safe. Accidents are the consequence, but we need to inspect the context before.
Capacity here is defined by the ability for our systems to meet escalating demands. We are safe when we have the ability to meet an increase in these demands.
29. “Workers are as safe as they need to be without
being too safe in order to be productive.”
- Todd Conklin
Why wouldn’t we always be safer if it meant fewer mistakes? “More safe” is increasing our thoroughness, but other factors limit that ability to be thorough. Efficiency pushes us to do more, thoroughness pulls us back to add more capacity for both planned and unplanned demands.
32. Systems Fail. Constantly.
Our systems are *constantly* failing! It is the capacity of our systems to handle these failures that allows us to succeed at all despite them. This includes:
• Bad input
• Missing pages
• Deprecated features
• Undocumented features
• Unknown or misunderstood interactions
All of this is unplanned. Surprises are at the core of every failure
33. “If we dig deep enough and spend
enough time on a problem, we can find
the root cause (or causes)”
34. Root Cause is a Fallacy
Root Causes don’t exist either - events and decisions are interrelated. We can choose to stop (ETTO) but that doesn’t make it a root cause
• It’s looking for simple answers, but they’re shallow.
• Success has no “root cause”. There are countless small decisions to make something work. There is no difference with failure.
35. “If you follow the proper instructions to
the letter, you will always succeed”
We fail all the time despite following what we believe to be “the correct way” or “the rules”.
Common myth: Following rules produces success, not following rules produces failures, and if you failed, then you must not have been following the rules.
This line of thinking is very closely correlated with retributive justice.
36. For this to be true, a person must be…
1. Infinitely informed - knowledgeable of all alternatives and their consequences
2. Infinitely sensitive - able to distinguish any and all perceptible differences
3. Rational - able to rank all choices based on some criterion and choose the optimal one
37. Break
Ok, next we’re going to use these theories on understanding our decision making to inform how we run our post mortems. During the break, think of an instance where something unexpected went wrong. It can be work related or not, but I’ll ask some of you to share your stories to practice a little later in the session.
39. FAQs
“If ‘fixing things’ isn’t the focus of a Post
Mortem, what place do remediation items
have?”
Nice-to-haves! PMs are a time to think about them, but not to solve or assign them. Leave them for “soak time” to discuss in depth.
40. FAQs
“How do I get buy in from senior
management?”
Work in small teams to build out trust. Focus on how much you can educate your groups at large and how much they’ll be willing to open up if given the chance/time.
41. FAQs
“What if I give details that are later used
against me in performance reviews or worse
I get fired from this?”
• You probably don’t want to work at a company that does that in the first place, but it’s understandable
• For Coworkers - Directly state your intent to learn, not to judge.
• “Well we don’t have time for this, we need to deploy the next iteration of the product”. What is the cost of another outage like this? An outage worse than this? In comparison, what is the cost of not
knowing how your system works?
42. FAQs
“How do I get folks to participate at all?”
Ask what would make a better engineering team. Find likeminded coworkers you can trust. Build from there
43. FAQs
“How do I get people to pick up remediation
items after?”
Accountability! How do you assign your tasks now? Fostering a desire to want to help others takes time, but it’s worth it.
44. FAQs
“What if people involved in an incident are
reluctant to hold a Post Mortem?”
Don’t force, don’t do it behind their backs. Invite them to PMs where they aren’t the focus. Encourage the learning, make it ok to admit you’ve made mistakes.
46. Before the Debriefing
Info gathering and Organizing before discussions are held. Interviewing folks individually, putting together graphs and chat logs
47. When do we hold a Post Mortem?
• If you have to ask…
• Is there something to learn/share?
• The best Post Mortems build from surprise
• Discussions on failure drive your systems
48. Who facilitates?
Typically someone experienced in PMs
If possible, avoid anyone who participated in the events surrounding the incident. They have a perspective that can be colored by their decision making and drive the timeline based on their own view
Keep this on a rotating pattern. Having multiple people facilitate on a regular basis can invite new questions/discussions and avoid local maxima
49. Who is invited?
Anyone else who wants to join. Why limit learning? Open it up to whoever wants to listen in and ask questions
Consult with the main actor or actors involved. This is a collaboration, not a demand.
What teams were affected directly or indirectly? This includes
• Downstream consumers
• Those involved in the incident
• Teams owning parts of the architecture that are affected (those who built it)
Participation is optional. If you’re forcing people to pile into a room for an hour to drag details from them, you won’t find anything useful.
50. Building a Timeline
Productive ways to recall a series of events
• Looking through chat logs
• Read through emails
• Inflection points on a graph
• Pages/Alerting
• User forum posts
Walk through their version of the timeline with them as they can best relate it. Resist asking questions if you can. This is because you want to ask those when everyone is gathered together.
What happens when one or more actors disagree on specifics of a timeline, or if they simply can’t remember details in particular?
These are places to highlight for your debriefing
51. Scheduling
Get information down ASAP. Memory loss is real
Schedule the PM within 2 weeks. Any more than that and people both lose interest (“oh that? That happened weeks ago, it’s fixed”) and smooth out details (Nodding your heads, agreeing with facts laid out
as opposed to asking questions to clarify them)
It’s helpful to have a recurring calendar event at specific times during the week. Finding time to collaborate can be difficult. Simplify with a known point in time when you can gather; people are less likely to schedule over it.
52. Taking Notes
Determine a note taker, try to avoid it being you as facilitator. It’s incredibly difficult to take useful notes while analyzing the timeline for an event. You’re dividing your attention over two tasks and doing neither
well.
53. Shadow/Co-Facilitator
Co-facilitators and shadow facilitators are very useful for helping surface questions you hadn’t thought of (being in a position of power lets you feel more confident in doing such)
They can also help with the meta surrounding getting better at being a facilitator
Also great for assisting in timeboxing. It can be difficult keeping an eye on the clock when you’re deep into your debriefing
54. Know Your Inflection Points
Places that you want to dig in to ask questions or open up to the room for questions
Helps to keep pace through the timeline. You don’t want to rush through it in 5 minutes or spend 40 minutes on the first part. Allow time for questions throughout
Keeps focus on places that you think are important to look into. That doesn’t mean you know all of the important places. Debriefings are places to learn about what surprises you!
56. The Criticality of Conversation
Some companies skip this step. There is value in the questions/document, definitely, but when do you get to ask *each other*? The facilitator can (and should!) ask questions to dig deeper. There are so many questions and threads that an interviewer or interviewers won’t raise, but a group might.
57. Timeboxing
One hour for a PM
• 5 min intro - First Post Mortems, why we’re here
• 35-40 min for timeline. Know inflection points
• 10 minutes for follow up Q&A
• Remaining time for Remediation, if needed
• Keep an eye on the clock
• Know inflection points and expectations on how much left to cover.
• tangents? Ask to put a pin in them “That’s really interesting and deserves more attention. Can we go into more detail at the end?”
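The arithmetic behind the agenda above can be sketched in a few lines of Python. This is only an illustration: the function name `remediation_minutes` and its parameters are hypothetical, and the durations are the ones listed on the slide.

```python
MEETING_MINUTES = 60  # one hour for a PM, per the slide


def remediation_minutes(timeline_minutes, intro=5, qa=10, total=MEETING_MINUTES):
    """Minutes left for remediation after the intro, timeline, and Q&A."""
    remaining = total - intro - timeline_minutes - qa
    if remaining < 0:
        raise ValueError("agenda exceeds the meeting length")
    return remaining


# The slide allows 35-40 minutes for the timeline.
short_timeline = remediation_minutes(35)
long_timeline = remediation_minutes(40)
print(short_timeline, long_timeline)
```

With a 40 minute timeline only a few minutes remain for remediation, which is one reason to put a pin in tangents rather than follow them.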
58. Difficulty with Data
This may seem like a lot of time, but it goes quickly. Do we explain *everything*? What do we sacrifice for time? This is where your prep work pays off
59. Over Reliance On…
• MTTR?
• MTBF?
• Time to Detect/Resolve?
• Severity?
(CfR) How often have you said “oh, that? it’s fine” and then been surprised by an outage shortly after
• MTTR - Not all incidents are the same. Averaging out time loses granularity.
• Frequency of incidents / MTBF
• Time to resolve
• Time to detect
• Severity - Your idea of the severity can grow and shrink through an incident. A false page at 2am can feel like high severity until you investigate. Likewise, a “normal routine” page (“It always does that”) can be the start of an outage (ASK - How often have you said “that’s fine” and then been surprised by an outage shortly thereafter?)
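To make the point about averaging concrete, here is a minimal Python sketch. The incident durations are invented for illustration and are not data from the talk; `summarize` is a hypothetical helper.

```python
from statistics import mean, median

# Hypothetical time-to-resolve values in minutes (invented for illustration).
quarter_a = [30, 30, 30, 30]  # four similar, moderate incidents
quarter_b = [5, 5, 5, 105]    # three quick fixes and one long outage


def summarize(durations):
    """Return the mean (MTTR), median, and worst case for a list of durations."""
    return {
        "mttr": mean(durations),
        "median": median(durations),
        "worst": max(durations),
    }


summary_a = summarize(quarter_a)
summary_b = summarize(quarter_b)
print(summary_a)
print(summary_b)
```

Both quarters report the same MTTR, yet the second one hides a single long outage that only shows up in the median and the worst case. The average alone cannot tell these two stories apart.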
60. Opening Up
This is the core of what a facilitator does
- You’re not answering questions and you’re not uncovering the mysteries. The actors have already done that - they need to share it with each other
- You don’t even have to be knowledgeable of the systems you’re working in.
61. Can be intimidating when you’re…
• New to the company
• Unfamiliar with the system in question
• Uncomfortable in large gatherings
• One or more participants are senior leadership
• The hindsight/outcome bias is strong
• Folks are afraid others’ language may be blameful
• Tensions escalating in the room
hindsight bias - creeping determinism or “knew-it-all-along” bias
outcome - quality of decision is determined by outcome
62. How can we help?
• Give folks time to speak!
• Ask questions, don’t give answers
• Have a system/queue for asking questions - Avoid
free-for-all conversations
• Allow for anonymous questions
• Take the time to stop and say “Anything unclear?”
• Faking ignorance on a question encourages others to
ask theirs
Remember: This can be difficult. You’re asking people to be vulnerable, especially in front of people they expect to be judged by for their ideas.
63. Looking Deeper
• Assumptions before an action and how they changed
• Acting (or not acting!) believed to be the right
decision
• Documentation, alerts, graph - when are they useful
and when they are discarded
• Sources of truth - People
• Get knowledgeable people to say out loud what they
think is common knowledge
This is the single most important job for a facilitator.
- What are people afraid to talk about?
- What do people think isn’t important because it’s “every day” to them but a black box to others?
- Where are people going down tangents that aren’t critical?
66. Root Cause / Root Cause Analysis
(CfP) Why is Root Cause harmful to investigating?
Abandoning the fallacy of a quick fix by avoiding Root Cause.
• When you say “root cause”, were there factors that were involved in making that event/decision occur?
• Were there other influencing factors?
• Discuss further the issues with Root Cause if not already clear
• Five Whys
• “What Ultimately resolved…”
67. Counterfactuals
• “If only they had…”
• “They failed to…”
• “They should have…”
• “They could have…”
You’re creating a branching pathway of events that don’t exist and judging actions on that. When we create a false narrative, we fail to learn from the actions that did take place.
• This ignores context/pressures actors existed in at the time of decision making which is key
69. Other language to focus on
• Obviously
• Basically
• Simply
• Of Course
• Clearly
• Just
• Everyone knows
• Easy
Source: https://css-tricks.com/words-avoid-educational-writing/
related to hindsight bias
70. Pushing back on Non-constructive Language
• Ask questions to dig deeper
• Highlight what context a person was in during that time that made it seem like a good idea
• Building supportive channels to reach out
• Focus on the role of learning
• Take a pause/put a pin in a conversation
Emphasize Empathy
71. Defusing Strong Emotions
With outages comes stress. That doesn’t go away when the incident is resolved. A lot of that emotion is carried over to the PM as well. This is where your prep work comes in handy
• Review the timeline and interview folks who may be carrying this into the PM
• Find out who might feel like they’re up against the wall for “doing the wrong thing”. Relating your own mistakes can be helpful here
• You want people to feel confident and self assured. That’s when they speak up and consequently people learn!
• If you feel like people are getting defensive, pull back a bit on questions leading with “who” for a bit.
73. Send out a Recap
This should be sent to anyone interested as well! Any documents put together should be easily accessible
Leaving “soak time” in place as well. Sometimes a lot of great ideas and questions happen after the fact
Summarize, but…be wary of leaving out important details/language!
Encourage follow up questions and further digging
Remediation items (if applicable), confirming owners for such, but not up to you to make sure they are completed
83. Post Mortems are not a bubble
The mantle of blame awareness is not something we put on and take off depending on whether we’re currently in a post mortem. Your colleagues won’t be open and honest with you in a PM if they can’t be outside of one. PMs are a great place to be mindful of this, to then practice elsewhere in our daily lives.
85. All Incidents Can Be Worse
For every incident in which we dissect the nuances of our decisions and get down on ourselves or others for their actions, it could have been much worse. Consider all the decisions you make (and those you *avoid*) that prevent the situation from deteriorating further.
86. There’s no Etsy Magic
You can do it too! It’ll be rough for a while. That’s how we get good at things.