Socio-technical systems failure (LSCITS EngD 2012)
1. Systems failure – a socio-technical perspective
2. Complex software systems
• Multi-purpose. Organisational systems that support different functions within an organisation.
• System of systems. Usually distributed and normally constructed by integrating existing systems/components/services.
• Unlimited. Not subject to limitations derived from the laws of physics (so, no natural constraints on their size).
• Data intensive. System data is orders of magnitude larger than code; long-lifetime data.
• Dynamic. Changing quickly in response to changes in the business environment.
3. Systems of systems
• Operational independence
• Managerial independence
• Multiple stakeholder viewpoints
• Evolutionary development
• Emergent behaviour
• Geographic distribution
4. Complex system realities
• There is no definitive specification of what the system should ‘do’, and it is practically impossible to create such a specification.
• The complexity of the system is such that it is not ‘understandable’ as a whole.
• It is likely that, at all times, some parts of the system will not be fully operational.
• Actors responsible for different parts of the system are likely to have conflicting goals.
6. System dependability model
• System fault: a system characteristic that can (but need not) lead to a system error.
• System error: an erroneous system state that can (but need not) lead to a system failure.
• System failure: externally-observed, unexpected and undesirable system behaviour.
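The fault → error → failure chain is easier to see in running code. Below is a minimal, hypothetical Python sketch (all names are invented here, not from the slides): a dormant off-by-one fault produces an erroneous state on every call, but only surfaces as an externally observed failure for some inputs.

```python
def free_beds(occupied, capacity):
    # FAULT: a dormant off-by-one bug, a system characteristic
    # that can (but need not) lead to a system error.
    return capacity - occupied - 1

def report(occupied, capacity):
    b = free_beds(occupied, capacity)   # ERROR: b is an erroneous state
    if b < 0:
        # FAILURE: externally-observed, undesirable behaviour.
        raise RuntimeError("negative bed count reported")
    return b

print(report(10, 20))   # erroneous state (9, not 10), yet no failure is seen
try:
    report(20, 20)      # the same fault now surfaces as a failure
except RuntimeError as exc:
    print("failure observed:", exc)
```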
7. A hospital system
• A hospital system is designed to maintain information about available beds for incoming patients and to provide information about the number of beds to the admissions unit.
• It is assumed that the hospital has a number of empty beds and that this changes over time. The variable B reflects the number of empty beds known to the system.
• Sometimes the system reports the actual number of empty beds; sometimes it reports that fewer than the actual number are available.
• In circumstances where the system reports an incorrect number of available beds, is this a failure?
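A minimal simulation makes the set-up concrete (the counts and update probability below are invented for illustration): B tracks the true number of empty beds, but updates can lag, so the system sometimes under-reports.

```python
import random

actual_empty = 4    # ground truth on the wards
B = actual_empty    # the number of empty beds known to the system

def discharge():
    """A bed becomes empty; the system may hear about it late."""
    global actual_empty, B
    actual_empty += 1
    if random.random() < 0.7:   # updates sometimes lag behind reality
        B = actual_empty

discharge()
print(f"system reports {B} empty beds; {actual_empty} are actually empty")
# When B under-counts like this, is the report a failure?
```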
8. What is failure?
• Technical, engineering view: a failure is ‘a deviation from a specification’.
• An oracle can examine a specification, observe a system’s behaviour and detect failures.
• Failure is an absolute: the system has either failed or it hasn’t.
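On this view, failure detection is mechanical. A sketch of such an oracle (hypothetical names, assuming the specification can be written as a predicate over inputs and observed outputs):

```python
def spec(occupied, capacity, reported):
    # The specification as a predicate: the system must report
    # exactly the number of empty beds.
    return reported == capacity - occupied

def oracle(system, occupied, capacity):
    # The oracle's judgement is absolute: conform or fail.
    reported = system(occupied, capacity)
    return "OK" if spec(occupied, capacity, reported) else "FAILURE"

# Any deviation counts as a failure here, however harmless users find it.
print(oracle(lambda o, c: c - o, 10, 20))       # OK
print(oracle(lambda o, c: c - o - 1, 10, 20))   # FAILURE
```

The contrast with the next slide is the point: this oracle flags behaviour that the system's actual users did not judge to be a failure.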
9. Bed management system
• The percentage of system users who considered the system’s incorrect reporting of the number of available beds to be a failure was 0%.
• Mostly, the number did not matter so long as it was greater than 1. What mattered was whether or not patients could be admitted to the hospital.
• When the hospital was very busy (available beds = 0), people understood that it was practically impossible for the system to be accurate.
• They used other methods to find out whether or not a bed was available for an incoming patient.
10. Failure is a judgement
• Specifications are a gross simplification of reality for complex systems.
• Users don’t read, and don’t care about, specifications.
• Whether or not system behaviour should be considered a failure depends on the observer’s judgement.
• This judgement depends on:
– The observer’s expectations
– The observer’s knowledge and experience
– The observer’s role
– The observer’s context or situation
– The observer’s authority
11. Failures are inevitable
• Technical reasons
– When systems are composed of opaque and uncontrolled components, the behaviour of these components cannot be completely understood.
– Failures can often be considered failures in data rather than failures in behaviour.
• Socio-technical reasons
– Changing contexts of use mean that the judgement of what constitutes a failure changes as the effectiveness of the system in supporting work changes.
– Different stakeholders will interpret the same behaviour in different ways because of different interpretations of ‘the problem’.
12. Conflict inevitability
• It is impossible to establish a set of requirements in which all stakeholder conflicts are resolved.
• Therefore, successful operation of a system for one set of stakeholders will inevitably mean ‘failure’ for another set of stakeholders.
• Groups of stakeholders in organisations are often in perennial conflict (e.g. managers and clinicians in a hospital). The support delivered by a system depends on the power held at some time by a stakeholder group.
13. Normal failures
• ‘Failures’ are not just catastrophic events but normal, everyday system behaviour that disrupts normal work and means that people have to spend more time on a task than necessary.
• A system failure occurs when a direct or indirect user of a system has to carry out extra work, over and above that normally required for some task, in response to some inappropriate or unexpected system behaviour.
• This extra work constitutes the cost of recovery from system failure.
14. The Swiss Cheese model
15. Failure trajectories
• Failures rarely have a single cause. Generally, they arise because several events occur simultaneously. For example, loss of data in a critical system:
– the user mistypes a command and instructs data to be deleted;
– the system does not check and ask for confirmation of the destructive action;
– no backup of the data is available.
• A failure trajectory is a sequence of undesirable events that coincide in time, usually initiated by some human action. It represents a failure of the defensive layers in the system (sketched in code below).
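This data-loss trajectory is easy to mimic in code. A hedged sketch (the function names are invented here, not a real CLI): each defensive layer can stop the trajectory, and the loss only happens when every layer has a hole at the same time.

```python
import os, shutil, tempfile

def confirm(action):
    # Defensive layer 1: a confirmation check on destructive actions.
    return input(f"Really {action}? [y/N] ").strip().lower() == "y"

def delete(path, ask=True, keep_backup=True):
    if ask and not confirm(f"delete {path}"):
        return                            # trajectory stopped at layer 1
    if keep_backup:
        shutil.copy(path, path + ".bak")  # layer 2: recovery stays possible
    os.remove(path)

# A mistyped command (the human trigger) is only fatal when every
# defensive layer has a hole at the same time:
f = tempfile.NamedTemporaryFile(delete=False)
f.close()
delete(f.name, ask=False, keep_backup=False)  # all holes aligned: data gone
```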
16. Vulnerabilities and defences
• Vulnerabilities
– Faults in the (socio-technical) system which, if triggered by a human or technical error, can lead to system failure.
– e.g. a missing check on input validity.
• Defences
– System features that avoid, tolerate or recover from human error.
– e.g. type checking that disallows allocation of incorrect types of value (see the sketch after this list).
• When an adverse event happens, the key question is not ‘whose fault was it?’ but ‘why did the system defences fail?’
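Both example defences named above can be written down directly. A minimal sketch (the admission function and its rules are invented for illustration):

```python
def admit_patient(ward, beds_free):
    # Defence: type checking disallows allocation of incorrect types
    # of value (e.g. a bed count that arrives as a string).
    if not isinstance(beds_free, int):
        raise TypeError("beds_free must be an integer")
    # Defence: an input-validity check; leaving it out is exactly the
    # kind of vulnerability that a human or technical error can trigger.
    if beds_free <= 0:
        raise ValueError(f"no free beds in ward {ward!r}")
    return beds_free - 1

print(admit_patient("A", 3))   # defences pass: 2 beds remain
# admit_patient("A", "3")      # stopped by the type-checking defence
# admit_patient("A", 0)        # stopped by the validity-check defence
```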
17. Reason’s Swiss Cheese Model
18. Active failures
• Active failures
– Active failures are the unsafe acts committed by people who are in direct contact with the system, or failures in the system technology.
– Active failures have a direct and usually short-lived effect on the integrity of the defences.
• Latent conditions
– Fundamental vulnerabilities in one or more layers of the socio-technical system, such as system faults, system and process misfit, alarm overload, inadequate maintenance, etc.
– Latent conditions may lie dormant within the system for many years before they combine with active failures and local triggers to create an accident opportunity.
19. Defensive layers
• Complex IT systems should have many defensive layers:
– some are engineered: alarms, physical barriers, automatic shutdowns;
– others rely on people: surgeons, anaesthetists, pilots, control room operators;
– and others depend on procedures and administrative controls.
• In an ideal world, each defensive layer would be intact.
• In reality, they are more like slices of Swiss cheese, having many holes, although unlike in the cheese, these holes are continually opening, shutting and shifting their location.
20. Dynamic vulnerabilities
• While some vulnerabilities are static (e.g. programming errors), others are dynamic and depend on the context in which the system is used.
• For example:
– vulnerabilities may be related to human actions whose performance depends on workload, state of mind, etc. An operator may be distracted and forget to check something;
– vulnerabilities may depend on configuration: checks may depend on particular programs being up and running, so if program A is running in a system a check may be made, but if program B is running the check is not made (see the sketch below).
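The configuration-dependent case can be made concrete. A hypothetical sketch, assuming a simple registry of running programs: the same input is validated or not depending on which program happens to be up, so the vulnerability opens and closes over time.

```python
running = {"program_a"}     # hypothetical registry of live programs

def store_reading(value):
    # The range check exists only while program A is up, so the same
    # unchecked-input vulnerability opens and closes dynamically.
    if "program_a" in running and not (0.0 <= value <= 100.0):
        raise ValueError("reading out of range")
    return value

try:
    store_reading(250.0)     # program A running: the defence holds
except ValueError as exc:
    print("rejected:", exc)

running = {"program_b"}      # the configuration changes...
print(store_reading(250.0))  # ...and the bad value slips through
```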
22. Coping with failure
• People are good at coping with unexpected situations when things go wrong.
– They can take the initiative, adopt responsibilities and, where necessary, break the rules or step outside the normal process of doing things.
– People can prioritise and focus on the essence of a problem.
23. Recovery strategies
• Local knowledge
– Who to call; who knows what; where things are.
• Process reconfiguration
– Doing things in a different way from that defined in the ‘standard’ process.
– Work-arounds; breaking the rules (safe violations).
• Redundancy and diversity
– Maintaining copies of information in different forms from those held in a software system.
– Informal information annotation.
– Using multiple communication channels.
• Trust
– Relying on others to cope.
24. Design for recovery
• Holistic systems engineering
– Software systems design has to be seen as part of a wider process of socio-technical systems engineering.
• We cannot build ‘correct’ systems
– We must therefore design systems that allow the broader socio-technical system to recognise, diagnose and recover from failures.
• Extend current systems to support recovery.
• Develop recovery support systems as an integral part of systems of systems.
25. Recovery strategy
• Designing for recovery is a holistic approach to system design, not (just) the identification of ‘recovery requirements’.
• It should support the natural ability of people and organisations to cope with problems:
– Ensure that system design decisions do not increase the amount of recovery work required.
– Make system design decisions that make it easier to recover from problems (i.e. reduce the extra work required): earlier recognition of problems; visibility, to make hypotheses easier to formulate; flexibility, to support recovery actions.
26. Key points
• Failures are inevitable in complex systems because multiple stakeholders see these systems in different ways and because there is no single manager of these systems.
• Failures are a judgement, not an absolute; the judgement depends on the system observer.
• The Swiss cheese model is a failure model based on active failures (trigger events) and latent conditions (system vulnerabilities).
• People have developed strategies for coping with failure, and systems should not be designed to make coping more difficult.