- Real-life stories of blameless post-mortems in action
- An understanding of why blaming members of a team for an outage or issue is counterproductive
- Actionable steps to help you perform post-mortems in a blameless manner.
4. A little about me…
Dir. of Platform Support - AppDirect
Dir. of Technical Support - Standing Cloud
Dir. of Operational Systems - AFI Supply
Hiker, climber, brewer, runner, biker, boarder, surfer,
painter, singer, reader, writer, picker, coder, racer,
camper, volunteer …. all the usual “Colorado 1-upper”
@GoVictorOps #VOwebinar
@jasonhand
5. Alternative names
Also known as: (Note: Public & Internal)
Project Retrospectives
Post-mortem analysis
Post-project review
Quality Improvement Review
Project Analysis Review
After Action Review
Autopsy Review
Santayana Review
Touchdown Meeting
Learning Review
6. Post-mortem
Defined
What ?
A process intended to inform improvements by
determining aspects that were successful or
unsuccessful.
7. Post-mortem
Defined
When ?
As soon as feasible after the Incident is resolved.
8. Post-mortem
Poll
Who should be involved in a blameless post-mortem?
A. Management
B. The Dev & Ops teams
C. Only those that played a part in the outage &
resolution
D. All of the above
9. Post-mortem
Who ?
Answer
D. All of the above
i.e. Everybody
10. Post-mortem
Defined
Why ?
To communicate with your team
To understand what happened for learning and improving
11. Post-mortem
Defined
How ?
Talk about the incident timeline
Escalation steps
What was done to resolve the problem
Create a remediation plan
Make it available
12. The Three R’s ( simple format )
Regret
Acknowledgement and apology
Reason
Initial incident detection to resolution, including the so-called “root causes.”
Remedy
Actionable remediation items
Dave Zwieback
VP Engineering - Next Big Sound
13. Moving from Reaction to Action
(Remedy)
Use SMART recommendations
Specific
Measurable
Agreed Upon/Agreeable
Realistic
Timebound
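One way to keep remediation items honest is to record each one with a field per SMART criterion and flag whatever is still blank. The sketch below is only an illustration: the `RemediationItem` structure, its field names and the `missing_smart_criteria` helper are assumptions made for this example, not part of any tool mentioned in the talk.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RemediationItem:
    """One follow-up action from a post-mortem (hypothetical structure)."""
    description: str                 # Specific: what exactly will be done
    success_metric: str = ""         # Measurable: how we will know it worked
    owner: str = ""                  # Agreed Upon: who accepted the work
    realistic: bool = False          # Realistic: the team confirmed it is feasible
    due: Optional[date] = None       # Timebound: when it should be finished


def missing_smart_criteria(item: RemediationItem) -> List[str]:
    """Return the SMART criteria this item does not yet satisfy."""
    missing = []
    if not item.description.strip():
        missing.append("Specific")
    if not item.success_metric.strip():
        missing.append("Measurable")
    if not item.owner.strip():
        missing.append("Agreed Upon")
    if not item.realistic:
        missing.append("Realistic")
    if item.due is None:
        missing.append("Timebound")
    return missing
```

An item that only says "Improve monitoring" would come back with four criteria still missing, which is a cue to keep talking before the ticket is filed.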
15. Cool story, bro
2011 - Hired to Standing Cloud
Cloud marketplace & automated deployment of
apps
Build Support team
Provide Managed services
16. – Sidney Dekker
“Reprimanding bad apples may
seem like a quick and rewarding
fix, but it’s like peeing in your
pants.
You feel relieved and perhaps even
nice and warm for a little while,
but then it gets cold and
uncomfortable.
And you look like a fool”
17. What is a blameless post-mortem?
Team members are accountable but not responsible
Complete Transparency
Deeper look at circumstances
What happened and how to improve it (specific
details)
Real conditions of failure in complex systems
Avoid counterfactuals
18. “Your organization must continually affirm
that individuals are NEVER the “root
cause” of outages.”
– Dave Zwieback
20. Paraphrased from “Fallible Humans” by Ian Malpass
- DevOpsDays - Minneapolis
source: http://www.indecorous.com/fallible_humans/
21. ETTO
(Efficiency Thoroughness Trade Off)
The trade off between:
being efficient
vs
being thorough
Efficient
Thorough
22. “We can be thorough and really dig into the
task at hand and understand it well but this
takes time:
It is inefficient.”
- Ian Malpass
23. Cause & Effect
source: http://xkcd.com
There are many factors that played a part in the
problem
“may be”
24. How many times does the letter “F”
appear in the following sentence?
Finished files are the result
of years of scientific
study combined with the
experience of many years.
25. How many times do you see the letter F
A. 3
B. 4
C. 5
D. 6
31. Reduce Stress?
Simulate many types of problems and outages as “practice” …
… build muscle memory
32. Evaluative Threat
Being negatively
judged plays a big role
in stress
33. What is stress surface?
Evaluative threats
ALSO
Variables of a situation
Novel or unusual
Unpredictable
Controllable situation
Negative judgement
Relationships
Health
Problems at home
Lack of sleep
Etc…
35. Stress Questionnaire
0 = Never 1 = Almost Never 2 = Sometimes
3 = Fairly Often 4 = Very Often
During the outage, how often have you felt or thought
that:
The situation was novel or unusual?
The situation was unpredictable?
You were unable to control the situation?
Others could judge your actions negatively?
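As a rough illustration of how answers on the 0 to 4 scale above could be tallied, here is a minimal sketch that averages each question across independently collected responses. The question keys and function names are made up for this example; the talk does not prescribe a scoring method.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical short keys for the four questions on the slide.
QUESTIONS = [
    "novel_or_unusual",
    "unpredictable",
    "unable_to_control",
    "judged_negatively",
]


def validate(answers: Dict[str, int]) -> None:
    """Check that every question is answered with a whole number from 0 to 4."""
    for question in QUESTIONS:
        if question not in answers:
            raise ValueError(f"missing answer for {question}")
        if answers[question] not in range(0, 5):
            raise ValueError(f"answer for {question} must be 0-4")


def average_by_question(responses: List[Dict[str, int]]) -> Dict[str, float]:
    """Average each question across all respondents (answered independently)."""
    if not responses:
        raise ValueError("need at least one response")
    for answers in responses:
        validate(answers)
    return {q: mean(r[q] for r in responses) for q in QUESTIONS}
```

A consistently high average on "judged_negatively" would point at evaluative threat rather than at the technical difficulty of the outage.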
36. Why we DON’T punish
De-incentivized to give the details
Practically guarantees a repeat of the problem
Understand why actions made sense (at the time)
Create safety AND accountability
Move away from idea of “individuals are problems”
Create new “experts”
38. Promoting from within
Where do we start?
The basics:
• Document your timeline or log data
• Document conversations
• Leave room for notes
• Mean time to resolution / Time calculations
• Level of severity
• Archive it for historical retrieval
• Remediation. Make it actionable
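The time calculations in the basics above can be sketched in a few lines. This is a minimal example assuming each incident is recorded with a detection and a resolution timestamp; the `Incident` structure, its field names and the severity scale are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """Minimal post-mortem record (hypothetical structure)."""
    title: str
    severity: int            # assumed scale: 1 = most severe
    detected_at: datetime    # first entry in the incident timeline
    resolved_at: datetime    # last entry in the incident timeline

    def time_to_resolution(self) -> timedelta:
        """Elapsed time from detection to resolution."""
        return self.resolved_at - self.detected_at


def mean_time_to_resolution(incidents: List[Incident]) -> timedelta:
    """Average time to resolution across the archived incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.time_to_resolution() for i in incidents), timedelta())
    return total / len(incidents)
```

Keeping the timestamps in the archived record, rather than only the computed durations, lets the same data answer later questions such as time to acknowledge.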
43. Next Webinar:
ChatOps Unplugged
http://victorops.com/chatops-webinar
Try VictorOps!
Now What?
See for yourself how we're solving
the problem of post-mortem
reporting, on-call scheduling, alert
management and remote
collaboration. Start your free trial
or join our weekly product demo
at www.victorops.com
Post-mortem Guides
44. Resources
“The Human Side of Postmortems” - Dave Zwieback
“The Field Guide to Understanding Human Error” - Sidney Dekker
“A Look at Looking in the Mirror” - J. Paul Reed
“Fallible Humans” - Ian Malpass (http://www.indecorous.com/fallible_humans/)
“4 Questions to ask for an effective Technical Post Mortem” - Jeffrey O’Brien (http://www.maintenanceassistant.com/blog/4-questions-effective-technical-post-mortem/)
“Nine steps to IT post-mortem excellence” - Michael Krigsman (http://www.zdnet.com/blog/projectfailures/nine-steps-to-it-post-mortem-excellence/1069)
“Postmortem reviews: purpose and approaches in software engineering” - Torgeir Dingsøyr (http://www.uio.no/studier/emner/matnat/ifi/INF5180/v10/undervisningsmateriale/reading-materials/p08/post-mortems.pdf)
“Blameless PostMortems and a Just Culture” - John Allspaw (http://codeascraft.com/2012/05/22/blameless-postmortems/)
“What blameless really means” - Jessica Harllee (http://www.jessicaharllee.com/notes/what-blameless-really-means/)
“Each necessary, but only jointly sufficient” - John Allspaw (http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/)
Editor's notes
Good afternoon everyone.
Thank you for attending today’s webinar
As you know, I’ll be presenting on the subject of Post-mortems (specifically… blameless)
I’ll have a Q&A towards the end of the presentation, but please feel free to reach out to me any time after…
Here are a few ways to connect with me.
Good afternoon everyone.
Thank you for joining us for today’s webinar – I’m Tara. I’m a Pisces that enjoys long hikes on deserted trails and…
I’m honored to introduce our DevOps evangelist, Jason Hand, who will be presenting on the subject of Post-mortems (specifically… blameless)
Dir. of Platform Support - AppDirect
Dir. of Tech Support - Standing Cloud
Dir. of Operational Systems - AFI Supply … where I started my professional career.
Public vs Internal
A post-mortem exists in many different formats across all industries. They can be commonly referred to as:
Everyone here has a pretty good idea of what a post-mortem is.
Let’s QUICKLY review.
First of all. It’s totally common for organizations to hold post-mortems after a successful event.
What we are talking about today are post-mortems related to an outage, so that we can focus on the idea of “blameless”.
Definition beginning:
- What happened (in detail) … the good … the bad … all of it.
It’s important that everyone involved .. be fully recovered. Take a step back and get “some” rest.
… but you don’t want to wait so long that important details begin to fade.
Who should be involved in a blameless post-mortem?
Ideally, the entire team takes part in the post-mortem.
If that’s not possible, then you should have all team members that played a part in the outage and resolution.. as well as any senior people.. or other vital teams that need to know all the details of exactly what happened.
Not specific enough?
- Introduced the problem
- Identified the problem
- Responded to the problem
- Debugged the problem
Anyone else that is interested
Keep in mind … we are participating … to LEARN
We want to know what happened in as much detail as possible so we can learn and improve our systems and processes.
…Once you begin leaning towards complete transparency .. you’re on your way towards a truly blameless post-mortem.
Here’s a general suggestion on the “How” of a post-mortem
Mention Dave Zwieback’s book “The Human Side of Post-mortems”
Regret - an acknowledgement of the impact of the outage and an apology. (usually customer facing)
Reason - a linear outage timeline.. from initial incident detection to resolution, including the so-called “root causes.” - Notice that he says “so called” root causes. More on that later.
Remedy - a list of remediation items to ensure that this particular outage won’t repeat.
- Let’s talk about Remedy for a minute as the others are pretty straight forward.
- Are you using the SMART method?
- Are you entering a JIRA ticket?
How are you following up with real “ACTION”?
- Those are the basics and overview of post-mortems
Now, let’s focus on the blameless aspect.
Let’s start with a story
- Standing Cloud was a cloud marketplace for automated deployment of web apps.
I was brought on to build a support program and
Provide basic managed services for customers.
A new role for me .. AND my first startup. VERY exciting times for me.
Tell my post-mortem story on losing customer data and how I felt like the “bad apple”
Give audience a moment to read the quote.
- Earlier this summer I attended Velocity - Santa Clara and one of the presentations I caught was titled:
“A Look at Looking in the Mirror” by J. Paul Reed.
This quote was included in his presentation but I liked it so much I had to include it as well.
Blameless post-mortems. What are they exactly?
- Blameless post-mortems means, team members should be accountable but not responsible. This is a gray area. Define better.
- Transparency; Be open and honest about what took place
- What was the larger set of circumstances that caused the “incident” or “outage”?
- The purpose of your post-mortem is not to put blame on anyone on the team, the purpose is to figure out what happened and how to improve it. Focus more on “How” rather than “Why”
The idea of a blameless post-mortem stems from the understanding that the real conditions of failure in the complex systems (that all of us are likely building or striving to build) are VERY real and play a HUGE role in how we approach an emergency situation.
Counterfactuals are NOT allowed. A counterfactual talks about something that did not happen, which is not useful in a post-mortem.
Watch for words like “should” (e.g. “you should have seen this problem”).
This is my favorite quote from Dave’s book. (That I mentioned earlier)
- Earlier he mentioned “so called” root cause. People .. Humans .. Individuals … are never it!
- Searching for a root cause is a dead end. There isn’t a single root cause!
- You’re not going to find the root cause of a failure any easier than you’ll find the root cause of a success.
If you think back to my unfortunate example, I was lucky to be part of a team that got that idea.
They never blamed me. They never pointed fingers at someone who didn’t test something correctly.
They simply looked for ways to improve. Improve the system. Improve the product. Improve the business.
• Instead of asking why .. ask how.
• Why; insinuates you are looking for a root cause.
• How; brings us to the conditions that allowed the event to take place to begin with.
Quote:
“Cause is not something found in the rubble. Cause is created in the minds of the investigators” - Sidney Dekker
Where do you stand on the following question? Is stress good or bad?
Good
Bad
As a young athlete and musician, most of my “break out” performances were under high stress situations.
That was a trick question.
Stress can be good … up to a point.
This diagram is extremely interesting … although I feel like I’ve known it for years.
- Stress can sometimes be good. In fact, many claim that they work better under stress.
- You can see in the diagram that simple tasks are much more resilient to the effects of stress than complex ones.
Simple tasks would be things that are well-learned, practiced, and performed with little-to-no effort.
Complex tasks will be more unpredictable, or a feeling of a lack of control over the situation.
If any of you have ever participated in competitive sports or games..
Do you ever feel like you play at your competition’s level (sometimes)?
I believe much of that has to do with stress. I finally found time to sit down and work on this presentation just this past weekend. The stress of the deadline forced me to pull together.
Should we seek methods to reduce stress, despite what the Yerkes-Dodson model shows us? No. We manage it.
- Netflix’s Chaos Monkey (part of their Simian Army) to simulate outages.
- Develop a muscle-memory of how to deal with problems which can reduce stress levels when charged with dealing with actual outages you are already familiar with. ** Like fire drills in school.
- Compare this to a guitar and the muscle memory that is developed over lots and lots of practice. You eventually just get to where your hands and fingers are doing all of the work with very little effort from your brain.
- All of this indicates that you SHOULDN’T make an effort to eliminate stress, but rather manage it and use it as a tool.
BUT don’t lose sight of the fact that it CAN and DOES play a role in outages.
And keep in mind:
Reducing the impact of stress through practice and developing this “muscle memory” doesn’t address the “evaluative threat”.
“Evaluative threat”. What’s that? An example is the finger pointing and blaming of outages to specific people or teams.
Organizations where postmortems are far from blameless and where being “the root cause” of an outage could result in a demotion or getting fired… creates larger stress surfaces. .. What kind of impact does that have on your team .. and as a result .. your product .. and business?
How do we address these stress surfaces?
What are stress surfaces?
To a certain degree, it’s the evaluative threats we just mentioned.
ALSO But it’s also stress as a whole when you look at situations
… due to many factors
Such as …
… all of these make up the stress surface.
Advance slide
Some companies will issue a questionnaire immediately after an outage to measure stress levels during the incident. - Team members are asked a series of questions independently from each other to avoid group think.
- This is all rooted in the understanding that real conditions of failure in complex systems exist.. and finding ways to improve performance during outages can only be achieved by reducing their stress surface.
- Create an environment where there is no fear of punishment
- Punishment de-incentivizes everyone from giving the details necessary to get an understanding of what actually took place.
- Lack of understanding of how the accident occurred pretty much guarantees that it will repeat. If not with the original person, certainly someone else in the future. Because the facts weren’t allowed to surface.
- It made sense to take that action (at that time) … why?
- We want a culture where team members make an effort to find balance between safety AND accountability
- Get away from idea that individuals, not situations, cause errors.
We create a situation where people who do make mistakes become experts on it and can educate the rest of the team on how not to make them in the future.
Think of it in terms of offering criminals immunity. We do it to get more information rather than just punish and stop the flow of information.
- Becoming more accepting of failure at an organization level isn’t a new concept.
I didn’t just come up with this hippie-dippie idea because I work for a startup in Boulder.
M.J. here didn’t come up with it either.
This also isn’t something that only really intuitive companies like tech startups are doing. We see it in all kinds of industries and companies of many sizes.
Why? Because it works!
As J. Paul Reed has said to me more than once…
“There is science behind it”.
What if your company isn’t doing post-mortems at all? Where do you start?
Begin by documenting everything. … the log details, conversations, escalations .. all of it
Have a place to keep notes
Calculations on time (mean time to resolution)
What was the severity of it?
Save it somewhere with easy access
Enter JIRA tickets
- I know several of you use VictorOps so you are likely aware of the post-mortem report tool.
- Those who aren’t should check out Etsy’s “Morgue”, which is available as open source on GitHub.
- Even if it’s an internal wiki, it’s important to use some sort of tool to build and store your post-mortem.
Leave you with one final thought.