Incident Reviews for a Learning Organisation
We all aspire to have a culture of learning and continuous improvement in our teams and organisations but learning and improving when things go wrong is far from easy.
When dealing with the fallout from failure - Incident reviews, Incident reports, investigations etc. - the way in which we respond is crucial to improving safety and the performance of our organisations.
Andy will talk about how Major Incident Reviews are run in IT Operations at Auto Trader. He’ll discuss what works well for them and will bring together practical advice from industry experts for creating a culture of safety and learning. Andy will also cover what mistakes they’ve made, what to avoid and the factors that can prevent learning.
3. What is a Learning Organisation?
What is the Reality?
What are my Choices?
Incident Reviews - things to Avoid
Incident Reviews - things to Encourage
What about holding people to Account?
A bit on Our process
Learning from Incidents
4. Our People
Auto Trader Staff: 850
Product & Tech Teams: 275
Our Customers
Private Car Sellers: 30,000
Trade Car Dealers: 15,000
5. Our Technology Platform
1.2 billion page views per month
70 million peak page views per day
15 million unique visitors per month
Supported by 100 live applications
6. Further Reading up front
Links:
John Allspaw - The Infinite Hows
Steve Shorrock - If It Weren't for the People
EUROCONTROL - Systems Thinking for Safety
Lindsay Holmwood - Blame-Language-Sharing
Sidney Dekker - Just Culture
Books:
Matthew Syed - Black Box Thinking
Sidney Dekker - The Field Guide to Understanding Human Error
Dave Zwieback - Beyond Blame
Nancy Leveson - Engineering a Safer World
People:
Steven Shorrock
Erik Hollnagel
Sidney Dekker
Matthew Syed
John Allspaw
Lindsay Holmwood
Dave Zwieback
Nancy Leveson
31. Don’t go too Deep!
Dilts Model
Environment: Contexts – WHERE?
Behavior: Skills and Actions – WHAT?
Capabilities: Methods, Approaches – HOW?
Values and Beliefs: What is important/true – WHY?
Identity: Sense of Self – WHO?
38. Incident Review Prompts
(from The Field Guide To Understanding Human Error, by Sidney Dekker)
At each juncture in the sequence of events (if that is how you want to structure this part of the accident story), you want to get to know:
• Which cues were observed (what did he or she notice/see or did not notice what he or she had expected to notice?)
• What knowledge was used to deal with the situation? Did participants have any experience with similar situations that was useful in dealing with this one?
• What expectations did participants have about how things were going to develop, and what options did they think they have to influence the course of events?
• How did other influences (operational or organizational) help determine how they interpreted the situation and how they would act?
Here are some questions Gary Klein and his researchers typically ask to find out how the situation looked to people on the inside at each of the critical junctures:
Cues:
• What were you seeing?
• What were you focusing on?
• What were you expecting to happen?
Interpretation:
• If you had to describe the situation to your colleague at that point, what would you have told?
Errors:
• What mistakes (for example in interpretation) were likely at this point?
Previous experience/knowledge:
• Were you reminded of any previous experience?
• Did this situation fit a standard scenario?
• Were you trained to deal with this situation?
• Were there any rules that applied clearly here?
• Did any other sources of knowledge suggest what to do?
Goals:
• What were you trying to achieve?
• Were there multiple goals at the same time?
• Was there time pressure or other limitations on what you could do?
Taking action:
• How did you judge you could influence the course of events?
• Did you discuss or mentally imagine a number of options or did you know straight away what to do?
Outcome:
• Did the outcome fit your expectation?
• Did you have to update your assessment of the situation?
Communications:
• What communication medium(s) did you prefer to use? (phone, chat, email, video conf, etc.?)
• Did you make use of more than one communication channel at once?
Help:
• Did you ask anyone for help?
• What signal brought you to ask for support or assistance?
• Were you able to contact the people you needed to contact?
Debriefings need not follow such a scripted set of questions, of course, as the relevance of questions depends on the event. Also, the questions can come across to participants as too conceptual to make any sense. You may need to reformulate them in the language of the domain.
39. Timelines
14:00 Alert received from Site Confidence
15:15 Incident communication sent
16:00 Incident closure comms sent
1. Factual timeline entries can be filled in prior to the Review Meeting
40. Timelines
14:00 Alert received from Site Confidence
15:15 Incident communication sent
16:00 Incident closure comms sent
1. Factual timeline entries can be filled in prior to the Review Meeting
13:10 Slow server performance observed by Bill
14:20 Bill spoke to John about SC issues and decided to recover DB
15:50 John finished DB recovery
2. As a group, overlay the basic timeline with key decisions and junctures
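The two-step timeline above can be sketched in code. This is only an illustration, not a tool used at Auto Trader: the `TimelineEntry` type, the `overlay` function and the `kind` labels are made-up names, and the entries are the example times from the slide.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class TimelineEntry:
    at: time    # when it happened (24-hour clock)
    text: str   # what happened
    kind: str   # "fact" (pre-filled) or "group" (added in the review); labels are assumptions


def overlay(facts, group_entries):
    """Merge the pre-filled facts with the group's entries, in time order."""
    return sorted(facts + group_entries, key=lambda entry: entry.at)


# Step 1: factual entries, filled in before the Review Meeting.
facts = [
    TimelineEntry(time(14, 0), "Alert received from Site Confidence", "fact"),
    TimelineEntry(time(15, 15), "Incident communication sent", "fact"),
    TimelineEntry(time(16, 0), "Incident closure comms sent", "fact"),
]

# Step 2: key decisions and junctures, added by the group during the review.
group_entries = [
    TimelineEntry(time(13, 10), "Slow server performance observed by Bill", "group"),
    TimelineEntry(time(14, 20), "Bill spoke to John about SC issues and decided to recover DB", "group"),
    TimelineEntry(time(15, 50), "John finished DB recovery", "group"),
]

timeline = overlay(facts, group_entries)
for entry in timeline:
    print(f"{entry.at:%H:%M} [{entry.kind}] {entry.text}")
```

Keeping the `kind` label on each entry makes it easy to see which parts of the story came from logs and which came from people's own accounts.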
51. Why we are here:
We understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.
We are here to learn and find solutions to improve our ways of working.
52. Things that help us learn:
• Open Minded
• Go back in time
• No single ‘Root Cause’
• How not Why
53. Things that stop us learning:
• Blaming people
• Human Error
• Arse Covering
• Points scoring
• ‘Trying Harder’
• Talking over people
54. After the review:
• Incident details recorded
• Actions (owners, dates) recorded
• Owned by Service Management Team
Private Sellers: us selling our Cars
Trade Car Dealers 15,000 - Independent dealers, Franchise dealers, Car Supermarkets
Availability at 99.99%
supporting products for Consumers, Private Sellers and Trade Retailers.
Supporting access across multiple platforms
Supporting Commercial and International Autotrader sites
e.g. Dealer Websites
Automotive leader for dealer websites with just under 5000 dealers’ sites hosted
Peter Senge – 1990 – The Fifth Discipline
Learning and transformation are central functions of the organisation – always changing, never in a steady state.
A Learning Organisation is a term given to a company that facilitates the learning of its members and continuously transforms itself.
A learning organisation is a place where people are continually discovering how they create their reality.
The loss of the stable state means that our society and all of its institutions are in continuous processes of transformation. We cannot expect new stable states that will endure for our own lifetimes.
We must learn to understand, guide, influence and manage these transformations. We must make the capacity for undertaking them integral to ourselves and to our institutions.
We must, in other words, become adept at learning. We must become able not only to transform our institutions, in response to changing situations and requirements; we must invent and develop institutions which are ‘learning systems’, that is to say, systems capable of bringing about their own continuing transformation. (Schon 1973: 28)
http://infed.org/mobi/the-learning-organization/
A story from Toyota’s origins when it used to build automatic looms. Upon hearing that the plans for one of the looms had been stolen, Kiichiro Toyoda is said to have remarked:
Certainly the thieves may be able to follow the design plans and produce a loom. But we are modifying and improving our looms every day. So by the time the thieves have produced a loom from the plans they stole, we will have already advanced well beyond that point. And because they do not have the expertise gained from the failures it took to produce the original, they will waste a great deal more time than us as they move to improve their loom. We need not be concerned about what happened. We need only continue as always, making our improvements.
The long-term value of an enterprise is not captured by the value of its products and intellectual property but rather by its ability to continuously increase the value it provides to customers—and to create new customers—through learning and innovation.
(Lean Enterprise p18)
And we all do this within our organisations right???
ITIL – continuous improvement
Deming Cycle – PDCA
OODA
DMAIC – Six Sigma lean process improvement
Our attitudes, culture and behavior prevent learning
WHY DO WE DO IT???
Fundamental Attribution Error:
How do we explain the behavior of others?
It turns out we are biased. We tend to:
Explain the behavior of others as due to their personality
Explain our own behavior as a result of context.
We need to overcome this bias to learn from Incidents and other kinds of failure
Image
http://www.ffxiah.com/forum/topic/26676/fundamental-attribution-error
WHY DO WE DO IT???
We assume that
All accidents or incidents require a human mistake
The severity of the accident is proportional to the size of the mistake
Punishment acts as a deterrent to prevent issues happening in the future.
Need for Retributive justice
Punishment
Deterrent
We often diminish the need for restorative justice.
Preventing the issue happening again
WHY DO WE DO IT???
Hindsight Bias (the knew-it-all-along effect) is the inclination, after an event has occurred, to see the event as having been predictable, despite there having been little or no objective basis for predicting it.
Narrative written after the fact
Does not make sense
England football manager Fabio Capello – From Black Box thinking – Matthew Syed.
Was in English football from 2008 to 2012
He introduced a strict regime of diet, rules around lateness, bans for family members from training and tournaments
He was pretty successful and lots of commentators put this down to
Retributive vs. Restorative Justice
This table illustrates the differences in the approach to justice between Retributive Justice and Restorative Justice. As you will see, Restorative Justice is much more community centric and focuses on making the victim whole.
Retributive Justice | Restorative Justice
Crime is an act against the state, a violation of a law, an abstract idea | Crime is an act against another person and the community
The criminal justice system controls crime | Crime control lies primarily in the community
Offender accountability defined as taking punishment | Accountability defined as assuming responsibility and taking action to repair harm
Crime is an individual act with individual responsibility | Crime has both individual and social dimensions of responsibility
Punishment is effective: threats of punishment deter crime; punishment changes behavior | Punishment alone is not effective in changing behavior and is disruptive to community harmony and good relationships
Victims are peripheral to the process | Victims are central to the process of resolving a crime
The offender is defined by deficits | The offender is defined by capacity to make reparation
Focus on establishing blame or guilt, on the past (did he/she do it?) | Focus on the problem solving, on liabilities/obligations, on the future (what should be done?)
Emphasis on adversarial relationship | Emphasis on dialogue and negotiation
Imposition of pain to punish and deter/prevent | Restitution as a means of restoring both parties; goal of reconciliation/restoration
Community on sideline, represented abstractly by state | Community as facilitator in restorative process
Response focused on offender’s past behavior | Response focused on harmful consequences of offender’s behavior; emphasis is on the future
Dependence upon proxy professionals | Direct involvement by participants
WHY DO WE DO IT???
Bad Apple Theory:
Complacency
We assume that systems and procedures are safe and reliable
It’s only a few ‘bad apples’
http://radar.oreilly.com/2014/11/if-it-werent-for-the-people.html
Steve Shorrock
Our view is often that the system is basically safe, so long as the human works as imagined. When things go wrong, we have a seemingly innate human tendency to blame the person at the sharp end. We don’t seem to think of that someone – pilot, controller, train driver or surgeon – as a human being who goes to work to ensure things go right in a messy, complex, demanding and uncertain environment.
Work as imagined
vs
Work as done
We don’t understand:
Trade offs
Competing pressures
Conflicting incentives
Procedures adapted for real world
Blame is easy
It removes accountability from the organisation
We don’t need to consider organizational changes, system changes (difficult things)
It removes the need for self criticism
It’s cheap
It’s quick
Miss Universe 2015
Steve Harvey – veteran TV presenter in America
Announced the winner as Miss Colombia and not Miss Philippines
It’s just one of those things that happen
Human Error
https://www.linkedin.com/pulse/how-bad-design-wrecked-steve-harveys-universe-eric-Thomas
It’s just one of those things that happen
Human Error
Lights,
Sounds
What was on the card?
What was on the teleprompter?
Blame impact:
Fewer issues reported
Culture of fear
Less responsibility taken – safety is someone else’s responsibility to implement.
The wrong data – incorrect accounts
Dishonesty – distortion
Denying error, diminishing the impact
Our attitudes, culture and behavior prevent learning
Often Learnt behavior from Leaders
Will prevent learning
John Allspaw – Blameless Postmortem – Web Operations
We need to find ways to allow practitioners to tell their stories
Without fear that there will be retribution
In a supportive atmosphere where failure is not stigmatized
Where we regularly talk about (celebrate) our mistakes and take ownership of improving things
This cycle of name/blame/shame can be looked at like this:
1. Engineer takes action and contributes to a failure or incident.
2. Engineer is punished, shamed, blamed, or retrained.
3. Reduced trust between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat
4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment)
5. Management becomes less aware and informed on how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure due to silence mentioned in #4, above
6. Errors more likely, latent conditions can’t be identified due to #5, above
7. Repeat from step 1
Need a wide range of review attendees taking actions
Trust between engineers taking action (sharp end) and managers (blunt end)
Especially important to share these actions across teams, departments, disciplines
Things to avoid – managers taking no action, or all the actions!
Refer John Allspaw – Infinite Hows
Dilts Model – logical levels – levels of learning and change
Useful as a coaching aim
How the language you use can affect the impact and depth to which you get a response.
Asking Who and Why really probe deep through these logical levels
John Allspaw – infinite hows
1. A new release disabled a feature for some customers. WHY? Because a particular server failed
2.
Environment – Contexts
Behavior – Skills and Actions
Capabilities – Methods, Approaches, Strategies
Values and Beliefs – What is important and true
Identity – Your sense of self
Why
Asks people to justify their actions
Leads to
Who
No single root cause for any incident involving complex systems (all our incidents)
Cherry picking of data to prove pre-existing ideas about what happened.
WHAT YOU FIND IS WHAT YOU LOOK FOR
Points scoring: It’s easy to use examples of when things go wrong to prove a point or win battles with others. This is generally cherry picking of information and unhelpful to us as an organisation. If unchallenged it will lead to more defensiveness, hiding/manipulation of data etc.
Our attitudes, culture and behavior prevent learning
Good psychological effect
States what’s expected
Frames the conversation
Example from Matthew Syed again – priming experiment and walking the corridor
Good example – Agile Prime Directive
"Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand."
Open Minded
Everyone is expected to come to an incident review keen to learn new information and listen to the experiences and stories of their colleagues. It’s not acceptable to bring your pre-formulated, rigid ideas of what happened/causes/solutions.
Explore differences of opinion
Listen to people’s stories of how events unfolded
Focus on Going back in Time
Understand the nature of the events AS THEY UNFOLDED over time
Are we talking about ordinary routine work
Special event?
Something never seen before?
Consider the predictability of the event at the time.
What was known at the time?
Go back in time
What information was available to you at this point?
What cues, what alerts, what information was available
What other pressures did you have
Time pressure, multiple focusses
Actual vs Ideal
Timeline should probably be created by the people attending the incident review, but we’ve amended this so that a ‘factual’ bare bones of the timeline is pre-populated by the Incident owner prior to the meeting to save time.
Some facts can be added to the timeline before the meeting to save time – Duty Manager can collate this data from Chat, Logs, Emails etc.
When adding information as a group about decisions made, explore the differences between people’s perceptions of what happened
Be careful not to get trapped into ‘single root cause’ and listen to as many contributing factors as possible.
Ensure everyone gets a chance to speak and be listened to by everyone
Need to keep the whole room to one conversation.
Actions shared, visible, completed
Don’t always have to have an action!
It might be that understanding how colleagues dealt with the incident and learning more about normal working of your organisation is enough.
Are you the right person to run the incident review?
Are you seen as impartial?
I’ve done this! Give example of not doing this right.
Are you the right person to lead this Review?
Are you independent enough to be fair and impartial?
And be seen by others as such?
Celebrate the things that went well
(timeline shows how well response unfolded, can report on how well people worked together)
Not just pat on back
Analyse the things that worked well – good decisions that prevented more downtime, how people adapted what they knew to a new situation
How can the good patterns be replicated or enhanced even further?
If you truly understand what went well and HOW it went well – you can reproduce this in more situations.
Retrospectives
Reviews
If you only review the most serious of incidents – you will not get the atmosphere right, people will not be used to it
People will be defensive
So this is all great, but what about when people need to be held responsible for their actions?
Good PDF
http://www.saa.com.sg/saaWeb2011/export/sites/saa/en/Publication/downloads/JustCulture_ReportingtheLine_Accountability.pdf
Negligence (turning up to work drunk)
Malicious damage (intentionally trying to harm people or the organisation)
Incompetence (Making stupid mistakes, not following clear procedures)
Gross Misconduct
e.g.
Defined in a nursing malpractice situation, negligence means the following: The doing of something which a reasonably prudent person would not do, or the failure to do something which a reasonably prudent person would do, under circumstances similar to those shown by the evidence.
http://ccn.aacnjournals.org/content/23/5/72.full
http://www.saa.com.sg/saaWeb2011/export/sites/saa/en/Publication/downloads/JustCulture_ReportingtheLine_Accountability.pdf
Accountability is often interpreted as blaming practitioners for mistakes. This creates a conflict between learning and accountability.
This paper proposes three simultaneous directions to achieve a Just Culture:
not using incident reports as evidence for disciplinary action,
deciding and getting broad support for who gets to decide what is acceptable and unacceptable behaviour
switching from blame and backward-looking accountability to forward-looking accountability.
We have a poor view of accountability
BLAME does not equal accountability
Blame has a massive cost
Blame limits accountability
We need to encourage forward-accountability
a) not using incident reports as evidence for disciplinary action,
b) deciding and getting broad support for who gets to decide what is acceptable and unacceptable behaviour, and
c) switching from blame and backward-looking accountability to forward-looking accountability
Cost:
The fear of blame, sanction and punishment, however, is known to change the behaviour of practitioners. They might be induced to hide, downplay or redefine incidents, rather than reporting and sharing them (Merry & Smith, 2001), creating a culture of ‘risk secrecy.’ The possibility of disciplinary action (or worse, prosecution) creates a conflict between accountability and learning. Blame is known as the enemy of safety (Leveson, 2011).
Forward-looking accountability needs an environment which encourages sharing accounts and takes away the idea of blame
All Major Incidents and High Severity Incidents have a review
All failed large changes have a review (including things that should have been large)
All failed Releases have a review
We use a similar format for Team / Project retrospectives - certainly in atmosphere
Timeline written on wall
Paper prompts at the top are the ‘priming’ bit and read out at the start of the meeting
They are also emailed with the Incident Review invite.
State what we are here for:
State our approach:
A note from Martin Fowler on PRIMING
http://martinfowler.com/bliki/PrimingPrimeDirective.html
Open Minded
Everyone is expected to come to an incident review keen to learn new information and listen to the experiences and stories of their colleagues. It’s not acceptable to bring your pre-formulated, rigid ideas of what happened/causes/solutions.
Go back in time
It’s critical we avoid using HINDSIGHT – we need to understand what information was available at the time when decisions were made and actions were performed. Best way to do this is put yourself back in time – into the context of how things unfolded.
No single ‘Root Cause’
In complex systems (of people interacting with technology) there is never a single root cause. We often have many contributing factors to how events unfold – lots of those contributing factors are present all the time even when things go right. Stopping at one root cause will miss all this information.
How not Why
Questions that start WHY (or even worse WHO) tend to force people to justify actions, to attribute and apportion blame. WHY focuses the inquiry on people which is not what we want.
We want to gather information about how events unfolded – asking HOW things appeared, changed, worked, WHAT happened next, WHAT was expected. Questions around HOW, WHAT, WHEN are much more effective for this.
Blaming people: It’s a popular belief that we have basically safe systems and if you just sorted out the behaviour of a few ‘bad apples’ things would be OK. That’s not the way to improve safety and is generally a cop-out. Please see Agile Prime Directive for what we do believe.
Human Error: As above, we are all on the same side trying to do the best job we can. Everyone has variable performance and we need systems/processes etc that are better able to accommodate and expect that.
Arse Covering: Hiding information that could help us improve as an organisation would be a terrible symptom of something that is wrong with our company culture. We need to make every effort to remove fear of judgement/consequences etc. from Incident Reviews. We need everyone to be open and honest.
Points scoring: It’s easy to use examples of when things go wrong to prove a point or win battles with others. This is generally cherry picking of information and unhelpful to us as an organisation. If unchallenged it will lead to more defensiveness, hiding/manipulation of data etc.
‘Trying Harder’: We will never take an action to ‘be more careful’ or ‘try harder’ not to break things. We all try pretty damn hard already and that is never the solution we are looking for.
Talking over people: We can only have one conversation at a time if we are to get a shared understanding of what’s happened and what we can do about it.