2. In this lecture…
• What do we mean by human failure
• Human error and socio-technical systems
• Designing error-tolerant systems
3. Human failure
• Human failures are said to account for:
– 50-70% of aviation disasters
– 44,000 – 98,000 deaths each year in America resulting from medical errors
– 60-85% of shuttle incidents at NASA
– 92-95% of car crashes
– 70% of shipping accidents
• A common reaction to human failure is to blame the user
• But "to err is human", so we should be concerned with designing systems that are resilient to human error
4. What is human error?
5. Definitions of human error
• An inappropriate or undesirable human decision or behaviour that reduces, or has the potential to reduce, the effectiveness, dependability or performance of a system
• Examples:
– Errors of omission - forgetting to do something
– Errors of commission - doing something incorrectly
– Sequence errors - doing things out of order
– Timing errors - too slow, too fast or too late
6. System dependability model
• System fault: a system characteristic that can (but need not) lead to a system error
• System error: an erroneous system state that can (but need not) lead to a system failure
• System failure: externally-observed, unexpected and undesirable system behaviour
7. A dependability perspective
• Human error - behaviour that leads to the introduction of a fault into a system:
– Development errors
– Operational errors
– Maintenance errors
• Emphasises that human errors do not necessarily lead to system failures
• We are not just interested in errors in system operation
8. Example
• The operator specifies a value of 12 rather than -12 for the temperature of a freezer
• The developer has not included a check that values set are below 0 degrees (see the sketch after this slide)
• The resultant system fault is that the system thermostat is set to the wrong value
• The resultant system error (erroneous state) is that the refrigerant pump is not switched on
• The observed system failure is that the freezer is warm and its contents have defrosted
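A minimal sketch, in Python with hypothetical names, of the developer check that is missing in this example. Rejecting setpoints at or above 0 degrees stops the operator's slip (12 typed instead of -12) from ever becoming a system fault:

    class Thermostat:
        """Stand-in for the freezer's thermostat controller (illustrative only)."""
        def set(self, celsius):
            print(f"Thermostat set to {celsius} C")

    def set_freezer_temperature(thermostat, celsius):
        # Guard against the slip described above: 12 typed instead of -12.
        if celsius >= 0.0:
            raise ValueError(f"Freezer setpoint must be below 0 C, got {celsius}")
        thermostat.set(celsius)

    set_freezer_temperature(Thermostat(), -12.0)   # accepted
    # set_freezer_temperature(Thermostat(), 12.0)  # rejected: the fault never enters the system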
9. What is an error?
• Determining whether an action is a human error often involves a judgement:
– Sometimes a human action is clearly a human error
– Sometimes a human action is only clearly an error with hindsight
– Sometimes a human action that would ordinarily be an error is not an error
• Users are not just people who cause errors; they are often the ones who trap and correct errors (human or technical)
– Never simply assume that systems are inherently safe and humans introduce errors
– Many now prefer the terms "human reliability" or "resilience"
10. Human error ambiguity?
• It can be difficult to distinguish between safe and erroneous behaviour:
– An action that is an error in one context may not be in another
• Failing to follow procedures for the safe use of ladders is not an error if the goal is to rescue a child trapped in a fire
• Erroneous actions?
– Following a rule or instruction when following it causes a failure
– Deliberately not following a rule or instruction when resources are unavailable, or when not following the rule avoids a failure
– Deviating from a defined process or procedure to save time or improve quality
11. GEMS
• GEMS (Generic Error Modelling System) was developed by the psychologist James Reason at Manchester University
• It is based on the notion that human actions are organised around intentions, goals, plans and actions
• In GEMS, human error occurs as:
– The failure to perform some plan or task properly
– The failure to apply the correct plan
12. Types of human activity
• GEMS distinguishes three ways in which actions are performed:
– Skill-based performance
• Routine things done without much cognitive effort, e.g. driving a car
– Rule-based performance
• Following a set of rules or a procedure, e.g. transferring data from one system to another
– Knowledge-based performance
• Applying knowledge to complete some task, e.g. planning travel from St Andrews to a meeting in Rome
13. Human error classification
• Slips, which occur in skill-based performance
– An "execution failure", where the operator's intentions are correct but the actions are not carried out properly
• Lapses, which also occur in skill-based performance
– Also an "execution failure", but here the operator forgets to do something, loses their place in a task, etc.
• Mistakes, which occur in rule-based and knowledge-based performance
– These are "planning failures", where an inappropriate set of actions is carried out
14. Human error in complex socio-technical systems
15. Influences on human actions
[Diagram: layers of influence on human actions, from the "blunt end" (regulations, organisations, groups) down to the "sharp end" (users and technology)]
16. The socio-technical systems stack
[Diagram: the socio-technical systems stack]
17. Human fallibility and dependability
• Human fallibility can influence the dependability of an LSCITS:
– During the development process
– During the deployment process
– During the maintenance/management process
– During the operational process
• Errors made during development, deployment and maintenance create vulnerabilities that may interact with "errors" during the operational process to cause system failure
18. Example
• A maintenance error leads to a vulnerability in the system:
– Say the automatic backup disk is switched from A to B to check that a change to the backup system has been made correctly. The maintainer then forgets to switch the backup disk back to A and dismounts B
– The consequence of the maintenance error is that backups are not made
• An operator error leads to an erroneous command being input to the system:
– The operator accidentally overwrites a file in the system with incorrect data
– They go to the backup system to recover the previous version of the file
• The file cannot be recovered (a check that might have caught the latent condition is sketched after this slide)
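A hedged sketch, with hypothetical paths, of a defensive check that could have exposed the maintenance error above before the operator needed the backup: verify that a recent backup actually exists on the expected disk.

    import os
    import time

    BACKUP_DIR = "/mnt/backup_a"         # illustrative mount point for disk A
    MAX_BACKUP_AGE_S = 24 * 60 * 60      # alert if the newest backup is over a day old

    def newest_backup_age(directory):
        """Return the age in seconds of the newest file, or None if none exists."""
        try:
            files = [os.path.join(directory, f) for f in os.listdir(directory)]
        except FileNotFoundError:
            return None  # disk dismounted - the latent condition in the example
        if not files:
            return None
        return time.time() - max(os.path.getmtime(f) for f in files)

    age = newest_backup_age(BACKUP_DIR)
    if age is None or age > MAX_BACKUP_AGE_S:
        print("ALERT: no recent backup on disk A - check the backup configuration")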
19. Failure trajectories
• Failures rarely have a single cause. Generally, they arise because several events occur simultaneously
– Loss of data in a critical system:
• The user mistypes a command and instructs data to be deleted
• The system does not check and ask for confirmation of the destructive action (a confirmation step is sketched after this slide)
• No backup of the data is available
• A failure trajectory is a sequence of undesirable events that coincide in time, usually initiated by some human action. It represents a failure of the defensive layers in the system
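A minimal sketch, with hypothetical names, of the confirmation step missing from this trajectory. Requiring the user to retype the dataset name forces a second, deliberate action, which traps most mistyped-delete slips:

    def delete_dataset(name):
        """Destructive operation guarded by an explicit confirmation step."""
        answer = input(f"Really delete dataset '{name}'? Type its name to confirm: ")
        if answer != name:
            print("Deletion cancelled - confirmation did not match.")
            return
        print(f"Dataset '{name}' deleted.")  # stand-in for the real deletion

    delete_dataset("experiment-42")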
20. Vulnerabilities and defences
• Vulnerabilities
– Faults in the (socio-technical) system which, if triggered by a human error, can lead to system failure
– e.g. a missing check on input validity
• Defences
– System features that avoid, tolerate or recover from human error
– e.g. type checking that disallows assignment of incorrect types of value (sketched after this slide)
• When an adverse event happens, the key question is not "whose fault was it?" but "why did the system defences fail?"
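A small sketch of a type-based defence, with hypothetical names. Wrapping raw numbers in distinct types lets a static checker such as mypy reject a value of the wrong kind before it reaches the running system:

    from typing import NewType

    Celsius = NewType("Celsius", float)
    Fahrenheit = NewType("Fahrenheit", float)

    def set_setpoint(temperature: Celsius) -> None:
        print(f"Setpoint: {temperature} C")

    set_setpoint(Celsius(-12.0))       # accepted
    # set_setpoint(Fahrenheit(10.0))   # flagged by a static type checker
    # set_setpoint(-12.0)              # also flagged: a bare float is not Celsius

Note that NewType adds no runtime enforcement; this is a development-time defensive layer rather than an operational one.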
21. Reason's Swiss Cheese Model
[Diagram: Reason's Swiss cheese model - defensive layers drawn as cheese slices whose holes, when aligned, let a failure trajectory pass through]
22. Active failures and latent conditions
• Active failures
– The unsafe acts committed by people who are in direct contact with the system (slips, lapses, mistakes and procedural violations)
– Active failures have a direct and usually short-lived effect on the integrity of the defences
• Latent conditions
– Fundamental vulnerabilities in one or more layers of the socio-technical system, such as system faults, system and process misfit, alarm overload, inadequate maintenance, etc.
– Latent conditions may lie dormant within the system for many years before they combine with active failures and local triggers to create an accident opportunity
23. Defensive layers
• Complex IT systems should have many defensive layers:
– some are engineered - alarms, physical barriers, automatic shutdowns;
– others rely on people - surgeons, anaesthetists, pilots, control room operators;
– and others depend on procedures and administrative controls
• In an ideal world, each defensive layer would be intact
• In reality, they are more like slices of Swiss cheese, having many holes - although unlike in the cheese, these holes are continually opening, shutting and shifting their location
24. Dynamic vulnerabilities
• While some vulnerabilities are static (e.g. programming errors), others are dynamic and depend on the context in which the system is used
• For example:
– Vulnerabilities may be related to human actions whose performance depends on workload, state of mind, etc. An operator may be distracted and forget to check something
– Vulnerabilities may depend on configuration - a check may depend on particular programs being up and running, so the check is made if program A is running but not if program B is running (illustrated in the sketch after this slide)
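A toy illustration, with hypothetical names, of such a configuration-dependent vulnerability: the validity check exists on only one code path, so the system's exposure changes with its configuration.

    def set_speed_via_a(speed):
        # Path A enforces the limit - the check is made.
        if speed > 100.0:
            raise ValueError(f"speed limit exceeded: {speed}")
        return speed

    def set_speed_via_b(speed):
        # Path B silently accepts any value - the latent hole.
        return speed

    set_speed_via_a(250.0)  # trapped here...
    set_speed_via_b(250.0)  # ...but passes through here

The same erroneous operator input is trapped or passed through depending on which program happens to be handling it.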
25. Human error and complexity
• System complexity and change add to the ambiguity of human errors:
– Many human errors are insignificant and do not lead to failure
– Many human errors are spotted and resolved by defensive layers in the system
– Human actions that are correct may become erroneous because of a change elsewhere in the system
– An error can be made many times without contributing to a failure but then suddenly, because some system component has changed in some way, it will
26. Human error and complexity
• An error can be made many times without contributing to a failure, but then suddenly one day it will
• Example (a defensive check for this case is sketched after this slide):
– An operator logs information by sending it to an email address which uses an obsolete domain name (dcs.st-and.ac.uk)
– Version X of the system relies on a DNS that maps the obsolete name to the new name (cs.st-andrews.ac.uk), so this works OK - no error is reported
– Four years after the initial change, a new DNS is installed and the domain name mapping is removed
– The day after this happens, the system fails because the email log message cannot be sent
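A hedged sketch, with hypothetical names, of a defence against this latent condition: check that the log destination's domain still resolves, and fall back to local logging instead of failing outright.

    import socket

    LOG_ADDRESS = "ops-log@dcs.st-and.ac.uk"  # the obsolete domain from the example

    def domain_resolves(domain):
        """Return True if DNS can resolve the domain to an address."""
        try:
            socket.getaddrinfo(domain, None)
            return True
        except socket.gaierror:
            return False

    def send_log(message):
        domain = LOG_ADDRESS.split("@", 1)[1]
        if not domain_resolves(domain):
            # The latent condition has been triggered: surface it and degrade
            # gracefully rather than letting the whole system fail.
            print(f"WARNING: {domain} no longer resolves; logging locally: {message}")
            return
        print(f"Emailed log to {LOG_ADDRESS}: {message}")  # stand-in for a real send

    send_log("nightly job completed")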
28. System resilience
• Failure avoidance
– Fault avoidance
– Fault detection
– Fault tolerance
• Failure recovery
– Returning to normal operation after the occurrence of a system failure
29. Incident reduction
• Reduce the number of latent conditions in the different layers of the system (plug the holes)
– If the number of faults in a software system is reduced, this increases the strength of the defensive layer
– However, this technical approach on its own cannot be completely effective, as it is practically impossible to reduce the number of latent conditions in the system to zero
• Increase the number of defensive layers, and hence reduce the probability of an accident trajectory occurring
• Reduce the number of active failures that occur
30. Conditions leading to human error
• Distractions
• Incomplete or incorrect data
• Boredom
• Inadequate resources
• Cognitive overload
• Stress
• Illness
• Time pressure
31. Systems design and human error
• Once we begin to understand what human errors are possible and how they come about, we can start designing systems that better withstand human error
• Avoidance
– Design the system so that certain classes of human error are eliminated
• Detection
– Make it easier for the operator and others to spot errors
• Tolerance
– Ensure that individual errors are unlikely to lead to system failure
32. Design guidelines
• Minimise the potential for slips, lapses and mistakes by designing systems and work environments where people aren't distracted or overwhelmed, but aren't bored either
• Minimise the potential for mistakes by designing systems and work environments where people are able to understand what is happening and the consequences of an action
• Minimise the potential for mistakes by making sure people are trained properly
• Minimise the potential for deliberate violations by making sure rules are well designed and well understood
33. Detection and tolerance
• Detecting and correcting error
– Automated correction can be useful, but can be dangerous!
– Alarms and alerts may be better than automated correction, but need to be well designed
– Allow for human correction by making it possible to "undo" actions (an undo sketch follows this slide)
– Make it easier for the user or another person to spot errors
• Tolerating human error
– Remember, breaking the rules might be for good reasons. Think of users as people attempting to do things, rather than simply as operators of the system
– Human error is common, so try not to create systems where a single human error can cause a failure
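A minimal sketch of "undo" support, with hypothetical names: recording each change together with the value it replaced lets an operator reverse a slip after noticing it.

    class UndoableStore:
        """A key-value store where every change can be reversed.
        (Sketch only: None marks 'key was absent', so storing None
        itself is not supported here.)"""
        def __init__(self):
            self._data = {}
            self._history = []  # stack of (key, previous value) pairs

        def set(self, key, value):
            self._history.append((key, self._data.get(key)))
            self._data[key] = value

        def undo(self):
            if not self._history:
                return
            key, previous = self._history.pop()
            if previous is None:
                del self._data[key]
            else:
                self._data[key] = previous

    store = UndoableStore()
    store.set("setpoint", -12)
    store.set("setpoint", 12)   # the operator's slip
    store.undo()                # noticed and reversed
    print(store._data)          # {'setpoint': -12}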
34. Recovery
• Design for failure
– Discussed in the previous module on systems engineering for LSCITS
• Make work visible
• Switch from enforcing mode to auditing mode
• Support role transferability
• Balance recovery and security
35. Key points
• Human error accounts for the majority of all system failures
• Human error is very common, but only occasionally leads to failure
• The same human action may or may not be an error, depending on the context of that action
• There are methods for analysing and predicting human errors, and these can be used to improve system design
• Some systems are more prone to accidents than others because of the way they have been designed
• Critical systems should be designed to minimise or detect human error
• Blaming the user is a common response to human error, but the fault lies with the system and the system engineers
Speaker notes
• Following rules leading to failure: example of a train departing with no passengers.
• Not following rules to avoid failure.
• Not following rules because it is impractical to do so: the rule is that applications must be checked and approved by Finance, but no-one is available to check, so the application is submitted without checking.
• Not following safety procedures, but no failure ensues: a ticket collector on a train does not charge extra for an invalid ticket because the ticket holder genuinely misunderstood the validity restrictions.
• Example of a slip: taking a corner too fast when driving.
• Example of a lapse: forgetting to set the permissions on files copied from system A to system B.
• Example of a mistake: failing to understand that the meeting is in Rome, NY rather than Rome, Italy.