WEBINAR: VictorOps Blameless Post-Mortems

(Blameless)
post-mortems
@GoVictorOps #VOwebinar
@jasonhand

Jason Hand
DevOps
“Handyman”
jason@VictorOps.com
@jasonhand

Tara Calihman
Social Media
Marketing Director
@Tarable

A little about me…
Dir. of Platform Support - AppDirect
Dir. of Technical Support - Standing Cloud
Dir. of Operational Systems - AFI Supply
Hiker, climber, brewer, runner, biker, boarder, surfer,
painter, singer, reader, writer, picker, coder, racer,
camper, volunteer …. all the usual “Colorado 1-upper”

Alternative names
Also known as: (Note: Public & Internal)
Project Retrospectives
Post-mortem analysis
Post-project review
Quality Improvement Review
Project Analysis Review
After Action Review
Autopsy Review
Santayana Review
Touchdown Meeting
Learning Review

Post-mortem
Defined
What ?
A process intended to inform improvements by
determining aspects that were successful or
unsuccessful.

Post-mortem
Defined
When ?
As soon as feasible after the Incident is resolved.

Post-mortem
Poll
Who should be involved in a blameless post-mortem?
A. Management
B. The Dev & Ops teams
C. Only those that played a part in the outage &
resolution
D. All of the above

Post-mortem
Answer
Who ?
D. All of the above
i.e. Everybody

Post-mortem
Defined
Why ?
To communicate with your team
To understand what happened for learning and improving

Post-mortem
Defined
How ?
Talk about the incident timeline
Escalation steps
What was done to resolve the problem
Create a remediation plan
Make it available

The Three R’s (simple format)
Regret
Acknowledgement and apology
Reason
Initial incident detection to resolution, including the so-called “root causes.”
Remedy
Actionable remediation items
– Dave Zwieback, VP Engineering - Next Big Sound

Moving from Reaction to Action
(Remedy)
Use SMART recommendations
Specific
Measurable
Agreed Upon/Agreeable
Realistic
Timebound

Blameless
image from “Across the Universe”

Cool story, bro
2011 - Hired by Standing Cloud
Cloud marketplace & automated deployment of
apps
Build Support team
Provide Managed services

“Reprimanding bad apples may seem like a quick and rewarding fix, but it’s like peeing in your pants.
You feel relieved and perhaps even nice and warm for a little while, but then it gets cold and uncomfortable.
And you look like a fool.”
– Sidney Dekker

What is a blameless
post-mortem?
Team members are accountable but not responsible
Complete Transparency
Deeper look at circumstances
What happened and how to improve it (specific
details)
Real conditions of failure in complex systems
Avoid counterfactuals

“Your organization must continually affirm
that individuals are NEVER the “root
cause” of outages.”
– Dave Zwieback

Why
vs
How

Paraphrased from “Fallible Humans” by Ian Malpass
- DevOpsDays - Minneapolis
source: http://www.indecorous.com/fallible_humans/

ETTO
(Efficiency Thoroughness Trade Off)
The trade off between:
being efficient
vs
being thorough

“We can be thorough and really dig into the
task at hand and understand it well but this
takes time:
It is inefficient.”
- Ian Malpass

Cause & Effect
source: http://xkcd.com
There are many factors that played a part in the
problem
“may be”

How many times does the letter “F”
appear in the following sentence?
Finished files are the result
of years of scientific
study combined with the
experience of many years.

How many times do you see the letter F?
A. 3
B. 4
C. 5
D. 6

Answer:
6
Cognitive Bias
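
The answer can be checked mechanically. The short Python sketch below counts the letter in the sentence from the slide. The variable names are illustrative only and are not part of the webinar.

```python
# The sentence from the slide, written out in full.
sentence = (
    "Finished files are the result of years of scientific "
    "study combined with the experience of many years."
)

# Count every "f", upper or lower case.
f_count = sentence.lower().count("f")

# List the words that contain an "f". Readers tend to skip
# the short word "of", which is where the missed letters hide.
f_words = [word for word in sentence.split() if "f" in word.lower()]

print(f_count)
print(f_words)
```

Three of the six occurrences sit in the word "of", which is the part most people skim past.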

Stress
& Cognitive Bias

Is stress good or bad?
A. Good
B. Bad

Yerkes-Dodson Model
source: The Human Side of Postmortems

Is stress good or bad?
Answer:
Both

Reduce Stress?
Simulate many types of problems and outages as “practice” …
… build muscle memory

Evaluative Threat
Being negatively
judged plays a big role
in stress

What is stress surface?
Variables of a situation
Novel or unusual
Unpredictable
Controllable situation
Negative judgement (evaluative threats)
ALSO
Relationships
Health
Problems at home
Lack of sleep
Etc…

Capturing the
Human-side
Ask questions

Stress Questionnaire
0 = Never 1 = Almost Never 2 = Sometimes
3 = Fairly Often 4 = Very Often
During the outage, how often did you feel or think that:
The situation was novel or unusual?
The situation was unpredictable?
You were unable to control the situation?
Others could judge your actions negatively?
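
One way to tally the questionnaire is sketched below in Python. The slide gives only the scale and the four questions; summing the answers per person and averaging each question across the team are assumptions made here for illustration, and the function names and sample responses are hypothetical.

```python
# Scale and questions as shown on the slide.
SCALE = {0: "Never", 1: "Almost Never", 2: "Sometimes",
         3: "Fairly Often", 4: "Very Often"}
QUESTIONS = [
    "The situation was novel or unusual?",
    "The situation was unpredictable?",
    "You were unable to control the situation?",
    "Others could judge your actions negatively?",
]

def score_responses(responses):
    """Sum one participant's answers (assumed aggregation, not from the slide)."""
    if len(responses) != len(QUESTIONS):
        raise ValueError("expected one answer per question")
    for answer in responses:
        if answer not in SCALE:
            raise ValueError(f"answer {answer!r} is outside the 0-4 scale")
    return sum(responses)

def average_per_question(all_responses):
    """Average each question across participants to see which factor dominated."""
    if not all_responses:
        raise ValueError("no responses to average")
    for responses in all_responses:
        score_responses(responses)  # reuse the validation above
    count = len(all_responses)
    return [sum(r[i] for r in all_responses) / count
            for i in range(len(QUESTIONS))]

# Hypothetical answers from two responders after an outage.
team = [[3, 4, 2, 1], [1, 2, 2, 3]]
print(score_responses(team[0]))
print(average_per_question(team))
```

Tracking the per-question averages from one post-mortem to the next shows whether practice is reducing novelty and unpredictability, without pointing at any one individual.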

Why we DON’T punish
Punishment disincentivizes people from giving the details
Practically guarantees a repeat of the problem
Understand why actions made sense (at the time)
Create safety AND accountability
Move away from idea of “individuals are problems”
Create new “experts”

Promoting from within
Where do we start?
The basics:
• Document your timeline or log data
• Document conversations
• Leave room for notes
• Mean time to resolution / Time calculations
• Level of severity
• Archive it for historical retrieval
• Remediation. Make it actionable
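
The time calculation in the list above can be sketched in a few lines of Python. This is a minimal example, assuming each incident is recorded as a (detected, resolved) pair of timestamps; the sample incidents and function names are hypothetical and do not come from the webinar or from any VictorOps tool.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected_at, resolved_at) for each incident.
incidents = [
    (datetime(2015, 3, 1, 9, 0), datetime(2015, 3, 1, 9, 45)),
    (datetime(2015, 3, 8, 22, 10), datetime(2015, 3, 8, 23, 40)),
    (datetime(2015, 3, 15, 14, 30), datetime(2015, 3, 15, 14, 45)),
]

def time_to_resolution(detected_at, resolved_at):
    """Duration of one incident, from detection to resolution."""
    if resolved_at < detected_at:
        raise ValueError("incident resolved before it was detected")
    return resolved_at - detected_at

def mean_time_to_resolution(records):
    """Average duration across all recorded incidents."""
    durations = [time_to_resolution(d, r) for d, r in records]
    if not durations:
        raise ValueError("no incidents recorded")
    return sum(durations, timedelta()) / len(durations)

print(mean_time_to_resolution(incidents))
```

Recording the timestamps during the incident, rather than reconstructing them afterwards, keeps this number trustworthy.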

Tools
Etsy’s Morgue
VictorOps Post-mortem Report
Internal Wiki

VictorOps Post-Mortem Report
Etsy’s Morgue Tool

Seek the truth
Don’t blame others …
Don’t blame yourself
Thank You

Questions ?

Now What?
Next Webinar: ChatOps Unplugged
http://victorops.com/chatops-webinar
Post-mortem Guides
Try VictorOps!
See for yourself how we're solving the problem of post-mortem reporting, on-call scheduling, alert management and remote collaboration. Start your free trial or join our weekly product demo at www.victorops.com

Resources
“The Human Side of Postmortems” - Dave Zwieback
“The Field Guide to Understanding Human Error” - Sidney Dekker
“A Look at Looking in the Mirror” - J. Paul Reed
“Fallible Humans” - Ian Malpass (http://www.indecorous.com/fallible_humans/)
“4 Questions to ask for an effective Technical Post Mortem” - Jeffrey O’Brien (http://www.maintenanceassistant.com/blog/4-questions-effective-technical-post-mortem/)
“Nine steps to IT post-mortem excellence” - Michael Krigsman (http://www.zdnet.com/blog/projectfailures/nine-steps-to-it-post-mortem-excellence/1069)
“Postmortem reviews: purpose and approaches in software engineering” - Torgeir Dingsøyr (http://www.uio.no/studier/emner/matnat/ifi/INF5180/v10/undervisningsmateriale/reading-materials/p08/post-mortems.pdf)
“Blameless PostMortems and a Just Culture” - John Allspaw (http://codeascraft.com/2012/05/22/blameless-postmortems/)
“What blameless really means” - Jessica Harllee (http://www.jessicaharllee.com/notes/what-blameless-really-means/)
“Each necessary, but only jointly sufficient” - John Allspaw (http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/)
