SlideShare une entreprise Scribd logo
1  sur  28
Design for Recovery, York EngD Programme, 2010 Slide 1
Resilience and recovery
Prof. Ian Sommerville
Design for Recovery, York EngD Programme, 2010 Slide 2
Objectives
• To discuss the notion of ‘failure’ in software systems
• To explain why this conventional notion of ‘failure’ is
not appropriate for many LSCITS
• To propose an approach to failure management in
LSCITS based on recoverability rather than failure
avoidance
Design for Recovery, York EngD Programme, 2010 Slide 4
Resilience
• A judgment of the ability of a system to continue to
deliver services to its user community in the presence
of failure of some of its components
• Closely related to recovery – the recovery of a
system to normal service after a system failure
• In order to provide a resilient system, it must be
possible to recover from failure as quickly as possible
without disrupting the operation of the existing system
services.
Design for Recovery, York EngD Programme, 2010 Slide 5
What is failure?
• Reductionist view: a
failure is ‘a deviation
from a specification’.
• An oracle can examine a
specification, observe a
system’s behaviour and
detect failures.
• Failure is an absolute -
the system has either
failed or it hasn’t
Design for Recovery, York EngD Programme, 2010 Slide 7
Is this a failure?
• A hospital system is designed to maintain information about
available beds for incoming patients and to provide information
about the number of beds to the admissions unit.
• It is assumed that the hospital has a number of empty beds and
this changes over time. The variable B reflects the number of
empty beds known to the system.
• Sometimes the system reports that the number of empty beds is
the actual number available; sometimes the system reports that
fewer than the actual number are available .
• In circumstances where the system reports that an incorrect
number of beds are available, is this a failure?
Design for Recovery, York EngD Programme, 2010 Slide 8
Bed management system
• The percentage of system users who considered the
system’s incorrect reporting of the number of
available beds to be a failure was 0%.
• Mostly, the number did not matter so long as it was
greater than 1. What mattered was whether or not
patients could be admitted to the hospital.
• When the hospital was very busy (available beds =
0), then people understood that it was practically
impossible for the system to be accurate.
• They used other methods to find out whether or not a
bed was available for an incoming patient.
Design for Recovery, York EngD Programme, 2010 Slide 9
Failure is a judgement
• Specifications are a simplification of reality.
• Users don’t read and don’t care about specifications
• Whether or not system behaviour should be considered
to be a failure, depends on the judgement of an
observer of that behaviour
• This judgement depends on:
• The observer’s expectations
• The observer’s knowledge and experience
• The observer’s role
• The observer’s context or situation
• The observer’s authority
Design for Recovery, York EngD Programme, 2010 Slide 10
System failure
• ‘Failures’ are not just catastrophic events but normal,
everyday system behaviour that disrupts normal work
and that mean that people have to spend more time
on a task than necessary
• A system failure occurs when a direct or indirect user
of a system has to carry out extra work, over and
above that normally required to carry out some task,
in response to some inappropriate system behaviour
• This extra work constitutes the cost of recovery from
system failure
Design for Recovery, York EngD Programme, 2010 Slide 11
Failures are inevitable
• Technical reasons
• When systems are composed of opaque and uncontrolled
components, the behaviour of these components cannot be
completely understood
• Failures often can be considered to be failures in data rather than
failures in behaviour
• Socio-technical reasons
• Changing contexts of use mean that the judgement on what
constitutes a failure changes as the effectiveness of the system in
supporting work changes
• Different stakeholders will interpret the same behaviour in different
ways because of different interpretations of ‘the problem’
Design for Recovery, York EngD Programme, 2010 Slide 12
Conflict inevitability
• Impossible to establish a set of requirements where
stakeholder conflicts are all resolved
• Therefore, successful operation of a system for one
set of stakeholders will inevitably mean ‘failure’ for
another set of stakeholders
• Groups of stakeholders in organisations are often in
perennial conflict (e.g. managers and clinicians in a
hospital). The support delivered by a system
depends on the power held at some time by a
stakeholder group.
Design for Recovery, York EngD Programme, 2010 Slide 13
Where are we?
• Large-scale information systems are inevitably
complex systems
• Such systems cannot be created using a reductionist
approach
• Failures are a judgement and this may change over
time
• Failures are inevitable and cannot be engineered out
of a system
Design for Recovery, York EngD Programme, 2010 Slide 14
The way forward
• We need to accept that technical system
failures will always occur and examine how
we can design these systems to allow the
broader socio-technical systems, in which
these technical systems are used, to
recognise, diagnose and recover from these
failures
Design for Recovery, York EngD Programme, 2010 Slide 15
Software dependability
• A reductionist approach to software dependability takes the view
that software failures are a consequence of software faults
• Techniques to improve dependability include
• Fault avoidance
• Fault detection
• Fault tolerance
• These approaches have taken us quite a long way in improving
software dependability. However, further progress is unlikely to
be achieved by further improvement of these techniques as they
rely on a reductionist view of failure.
Design for Recovery, York EngD Programme, 2010 Slide 16
Failure recovery
• Recognition
• Recognise that inappropriate behaviour has occurred
• Hypothesis
• Formulate an explanation for the unexpected behaviour
• Recovery
• Take steps to compensate for the problem that has arisen
Design for Recovery, York EngD Programme, 2010 Slide 18
Coping with failure
• People are good at
coping with
unexpected situations
when things go
wrong.
• They can take the
initiative, adopt
responsibilities and,
where necessary, break
the rules or step
outside the normal
process of doing things.
• People can prioritise
and focus on the
essence of a problem
Design for Recovery, York EngD Programme, 2010 Slide 19
Recovering from failure
• Local knowledge
• Who to call; who knows what; where things are
• Process reconfiguration
• Doing things in a different way from that defined in the ‘standard’
process
• Work-arounds, breaking the rules (safe violations)
• Redundancy and diversity
• Maintaining copies of information in different forms from that
maintained in a software system
• Informal information annotation
• Using multiple communication channels
• Trust
• Relying on others to cope
Design for Recovery, York EngD Programme, 2010 Slide 20
Design for recovery
• Holistic systems engineering
• Software systems design has to be seen as part of a wider
process of socio-technical systems engineering
• We cannot build ‘correct’ systems
• We must therefore design systems to allow the broader
socio-technical systems to recognise, diagnose and recover
from failures
• Extend current systems to support recovery
• Develop recovery support systems as an integral part
of systems of systems
Design for Recovery, York EngD Programme, 2010 Slide 22
Recovery strategy
• Designing for recovery is a holistic approach to system design and
not (just) the identification of ‘recovery requirements’
• Should support the natural ability of people and organisations to
cope with problems
• Ensure that system design decisions do not increase the amount
of recovery work required
• Make system design decisions that make it easier to recover
from problems (i.e. reduce extra work required)
• Earlier recognition of problems
• Visibility to make hypotheses easier to formulate
• Flexibility to support recovery actions
Design for Recovery, York EngD Programme, 2010 Slide 28
Design guidelines
• Local knowledge
• Process reconfiguration
• Redundancy and diversity
Design for Recovery, York EngD Programme, 2010 Slide 29
Local knowledge
• Local knowledge includes
knowledge of who does what,
how authority structures can be
bypassed, what rules can be
broken, etc.
• Impossible to replicate entirely in
distributed systems but some
steps can be taken
• Maintain information about the
provenance of data
• Maintain organisational models,
contact lists, etc.
Design for Recovery, York EngD Programme, 2010 Slide 31
Process reconfiguration
• Make workflows explicit rather than
embedding them in the software
• Not just ‘continue’ buttons! Users should
know where they are and where they are
supposed to go
• Support workflow
navigation/interruption/restart
• Design systems with an ‘emergency
mode’ where the the system changes
from enforcing policies to auditing actions
• This would allow the rules to be broken
but the system would maintain a log of
what has been done and why so that
subsequent investigations could trace
what happened
Design for Recovery, York EngD Programme, 2010 Slide 34
Redundancy and diversity
• Data replication
• Encourage the creation of ‘shadow systems’ and provide
import and export from these systems
• Schema extension
• Schemas for data are rarely designed for problem solving.
Always allow informal extension (a free text box) so that
annotations, explanations and additional information can be
provided
• Organisational models
• To allow for multi-channel communications when things go
wrong
• Paper
• How to recover when the computers go off
Design for Recovery, York EngD Programme, 2010 Slide 35
Barriers to recovery
• Security
• Automation
• Process
• Organisations
Design for Recovery, York EngD Programme, 2010 Slide 36
Security and recoverability
• Recoverability
• Relies on trusting operators
of the system not to abuse
privileges that they may
have been granted to help
recover from problems
• Security
• Relies on mistrusting users
and restricting access to
information on a ‘need to
know’ basis
Design for Recovery, York EngD Programme, 2010 Slide 37
Automation hiding
• A problem with automation is that
information becomes subject to
organizational policies that restrict
access to that information.
• Even when access is not restricted, we
don’t have any shared culture in how to
organise a large information store
• Generally, authorisation models
maintained by the system are based on
normal rather than exceptional
operation.
• When problems arise and/or when
people are unavailable, breaking the
rules to solve these problems is made
more difficult.
Design for Recovery, York EngD Programme, 2010 Slide 38
Process tyranny
• There is a notion that ‘standard’
business processes can be defined
and embedded in systems that
support these processes
• Implicitly or explicitly, the system
enforces the use of the ‘standard’
process
• But this assumes three things:
• The standard process is always
appropriate
• The standard process has
anticipated all possible failures
• The system can be respond in a
timely way to process changes
Design for Recovery, York EngD Programme, 2010 Slide 39
Multi-organisational systems
• Many rules enforced in
different ways by different
systems.
• Information is distributed -
users may not be aware of
where information is located,
who owns information, etc.
• Processes involve remote
actors so process
reconfiguration is more
difficult
• Restricted information
channels (e.g. help
unavailable outside normal
business hours; no phone
Design for Recovery, York EngD Programme, 2010 Slide 41
Take away messages
• Improving existing software engineering methods will help but
will not deal with the problems of complexity that are inherent in
distributed systems of systems or LSCITS
• We must learn to live with normal, everyday failures
• Failure management in complex IT systems is not simply a
technical problem.
• Don’t stop clever people working out how to recover from
failures
• Recovery strategies include supporting information redundancy
and annotation and maintaining organisational models

Contenu connexe

Tendances

Problem solving method
Problem solving methodProblem solving method
Problem solving method
BSEPhySci14
 
Information Technology Project Management
Information Technology Project ManagementInformation Technology Project Management
Information Technology Project Management
Goutama Bachtiar
 

Tendances (20)

PERT, GANTT CHART and BENCHMARKING
PERT, GANTT CHART and BENCHMARKINGPERT, GANTT CHART and BENCHMARKING
PERT, GANTT CHART and BENCHMARKING
 
Problem solving techniques
Problem solving techniquesProblem solving techniques
Problem solving techniques
 
PERT & Gantt Chart
PERT & Gantt Chart PERT & Gantt Chart
PERT & Gantt Chart
 
Problem solving
Problem solvingProblem solving
Problem solving
 
Project Management Basics
Project Management BasicsProject Management Basics
Project Management Basics
 
Goal attainment theory by Imogene.M.King
Goal attainment theory by Imogene.M.KingGoal attainment theory by Imogene.M.King
Goal attainment theory by Imogene.M.King
 
Fundamentals of project management
Fundamentals of project managementFundamentals of project management
Fundamentals of project management
 
Teamwork in Software Engineering Projects
Teamwork in Software Engineering ProjectsTeamwork in Software Engineering Projects
Teamwork in Software Engineering Projects
 
Rehabilitation and recovery from mental illness
Rehabilitation and  recovery from mental illnessRehabilitation and  recovery from mental illness
Rehabilitation and recovery from mental illness
 
transcultural nursing.ppt
transcultural nursing.ppttranscultural nursing.ppt
transcultural nursing.ppt
 
Requirements Engineering - Domain Models
Requirements Engineering - Domain ModelsRequirements Engineering - Domain Models
Requirements Engineering - Domain Models
 
Problem solving method
Problem solving methodProblem solving method
Problem solving method
 
Pert and gantt chart
Pert and gantt chartPert and gantt chart
Pert and gantt chart
 
Supervision santhosh
Supervision santhoshSupervision santhosh
Supervision santhosh
 
Project Management Concepts
Project Management ConceptsProject Management Concepts
Project Management Concepts
 
Information Technology Project Management
Information Technology Project ManagementInformation Technology Project Management
Information Technology Project Management
 
Time management in nursing
Time management in nursingTime management in nursing
Time management in nursing
 
Discipline in Nursing Education
Discipline in Nursing EducationDiscipline in Nursing Education
Discipline in Nursing Education
 
Critical path method
Critical path methodCritical path method
Critical path method
 
General systems theory - a brief introduction
General systems theory - a brief introductionGeneral systems theory - a brief introduction
General systems theory - a brief introduction
 

Similaire à Resilience and recovery

46393833 e banking
46393833 e banking46393833 e banking
46393833 e banking
dipali2009
 
system development life cycle
system development life cycle system development life cycle
system development life cycle
Sumit Yadav
 

Similaire à Resilience and recovery (20)

Designing Complex Systems for Recovery (LSCITS EngD 2011)
Designing Complex Systems for Recovery (LSCITS EngD 2011)Designing Complex Systems for Recovery (LSCITS EngD 2011)
Designing Complex Systems for Recovery (LSCITS EngD 2011)
 
System success and failure
System success and failureSystem success and failure
System success and failure
 
REQUIREMENT ENGINEERING
REQUIREMENT ENGINEERINGREQUIREMENT ENGINEERING
REQUIREMENT ENGINEERING
 
Dependability requirements for LSCITS
Dependability requirements for LSCITSDependability requirements for LSCITS
Dependability requirements for LSCITS
 
SRE.pptx
SRE.pptxSRE.pptx
SRE.pptx
 
Requirements Engineering for LSCITS
Requirements Engineering for LSCITSRequirements Engineering for LSCITS
Requirements Engineering for LSCITS
 
Requirementengg
RequirementenggRequirementengg
Requirementengg
 
system development life cycle
system development life cyclesystem development life cycle
system development life cycle
 
Usability Evaluation
Usability EvaluationUsability Evaluation
Usability Evaluation
 
Lesson 9 system develpment life cycle
Lesson 9 system develpment life cycleLesson 9 system develpment life cycle
Lesson 9 system develpment life cycle
 
Towards Dementia-friendly Smart Home
Towards Dementia-friendly Smart HomeTowards Dementia-friendly Smart Home
Towards Dementia-friendly Smart Home
 
Presentation2
Presentation2Presentation2
Presentation2
 
46393833 e banking
46393833 e banking46393833 e banking
46393833 e banking
 
Requirement Gathering
Requirement GatheringRequirement Gathering
Requirement Gathering
 
system development life cycle
system development life cycle system development life cycle
system development life cycle
 
Unit 2 SEPM_ Requirement Engineering
Unit 2 SEPM_ Requirement EngineeringUnit 2 SEPM_ Requirement Engineering
Unit 2 SEPM_ Requirement Engineering
 
Observability in highly distributed systems
Observability in highly distributed systemsObservability in highly distributed systems
Observability in highly distributed systems
 
Unit 3_Evaluation Technique.pptx
Unit 3_Evaluation Technique.pptxUnit 3_Evaluation Technique.pptx
Unit 3_Evaluation Technique.pptx
 
Intro to requirements eng.
Intro to requirements eng.Intro to requirements eng.
Intro to requirements eng.
 
SDLC
SDLCSDLC
SDLC
 

Plus de Ian Sommerville

Security case buffer overflow
Security case buffer overflowSecurity case buffer overflow
Security case buffer overflow
Ian Sommerville
 
CS5032 Case study Ariane 5 launcher failure
CS5032 Case study Ariane 5 launcher failureCS5032 Case study Ariane 5 launcher failure
CS5032 Case study Ariane 5 launcher failure
Ian Sommerville
 
CS5032 Case study Kegworth air disaster
CS5032 Case study Kegworth air disasterCS5032 Case study Kegworth air disaster
CS5032 Case study Kegworth air disaster
Ian Sommerville
 
CS5032 L19 cybersecurity 1
CS5032 L19 cybersecurity 1CS5032 L19 cybersecurity 1
CS5032 L19 cybersecurity 1
Ian Sommerville
 
CS5032 L20 cybersecurity 2
CS5032 L20 cybersecurity 2CS5032 L20 cybersecurity 2
CS5032 L20 cybersecurity 2
Ian Sommerville
 
L17 CS5032 critical infrastructure
L17 CS5032 critical infrastructureL17 CS5032 critical infrastructure
L17 CS5032 critical infrastructure
Ian Sommerville
 
CS5032 Case study Maroochy water breach
CS5032 Case study Maroochy water breachCS5032 Case study Maroochy water breach
CS5032 Case study Maroochy water breach
Ian Sommerville
 
CS 5032 L18 Critical infrastructure 2: SCADA systems
CS 5032 L18 Critical infrastructure 2: SCADA systemsCS 5032 L18 Critical infrastructure 2: SCADA systems
CS 5032 L18 Critical infrastructure 2: SCADA systems
Ian Sommerville
 
CS5032 L9 security engineering 1 2013
CS5032 L9 security engineering 1 2013CS5032 L9 security engineering 1 2013
CS5032 L9 security engineering 1 2013
Ian Sommerville
 
CS5032 L10 security engineering 2 2013
CS5032 L10 security engineering 2 2013CS5032 L10 security engineering 2 2013
CS5032 L10 security engineering 2 2013
Ian Sommerville
 
CS5032 L11 validation and reliability testing 2013
CS5032 L11 validation and reliability testing 2013CS5032 L11 validation and reliability testing 2013
CS5032 L11 validation and reliability testing 2013
Ian Sommerville
 
CS 5032 L12 security testing and dependability cases 2013
CS 5032 L12  security testing and dependability cases 2013CS 5032 L12  security testing and dependability cases 2013
CS 5032 L12 security testing and dependability cases 2013
Ian Sommerville
 

Plus de Ian Sommerville (20)

Ultra Large Scale Systems
Ultra Large Scale SystemsUltra Large Scale Systems
Ultra Large Scale Systems
 
Resp modellingintro
Resp modellingintroResp modellingintro
Resp modellingintro
 
LSCITS-engineering
LSCITS-engineeringLSCITS-engineering
LSCITS-engineering
 
Requirements reality
Requirements realityRequirements reality
Requirements reality
 
Conceptual systems design
Conceptual systems designConceptual systems design
Conceptual systems design
 
An introduction to LSCITS
An introduction to LSCITSAn introduction to LSCITS
An introduction to LSCITS
 
Internet worm-case-study
Internet worm-case-studyInternet worm-case-study
Internet worm-case-study
 
Designing software for a million users
Designing software for a million usersDesigning software for a million users
Designing software for a million users
 
Security case buffer overflow
Security case buffer overflowSecurity case buffer overflow
Security case buffer overflow
 
CS5032 Case study Ariane 5 launcher failure
CS5032 Case study Ariane 5 launcher failureCS5032 Case study Ariane 5 launcher failure
CS5032 Case study Ariane 5 launcher failure
 
CS5032 Case study Kegworth air disaster
CS5032 Case study Kegworth air disasterCS5032 Case study Kegworth air disaster
CS5032 Case study Kegworth air disaster
 
CS5032 L19 cybersecurity 1
CS5032 L19 cybersecurity 1CS5032 L19 cybersecurity 1
CS5032 L19 cybersecurity 1
 
CS5032 L20 cybersecurity 2
CS5032 L20 cybersecurity 2CS5032 L20 cybersecurity 2
CS5032 L20 cybersecurity 2
 
L17 CS5032 critical infrastructure
L17 CS5032 critical infrastructureL17 CS5032 critical infrastructure
L17 CS5032 critical infrastructure
 
CS5032 Case study Maroochy water breach
CS5032 Case study Maroochy water breachCS5032 Case study Maroochy water breach
CS5032 Case study Maroochy water breach
 
CS 5032 L18 Critical infrastructure 2: SCADA systems
CS 5032 L18 Critical infrastructure 2: SCADA systemsCS 5032 L18 Critical infrastructure 2: SCADA systems
CS 5032 L18 Critical infrastructure 2: SCADA systems
 
CS5032 L9 security engineering 1 2013
CS5032 L9 security engineering 1 2013CS5032 L9 security engineering 1 2013
CS5032 L9 security engineering 1 2013
 
CS5032 L10 security engineering 2 2013
CS5032 L10 security engineering 2 2013CS5032 L10 security engineering 2 2013
CS5032 L10 security engineering 2 2013
 
CS5032 L11 validation and reliability testing 2013
CS5032 L11 validation and reliability testing 2013CS5032 L11 validation and reliability testing 2013
CS5032 L11 validation and reliability testing 2013
 
CS 5032 L12 security testing and dependability cases 2013
CS 5032 L12  security testing and dependability cases 2013CS 5032 L12  security testing and dependability cases 2013
CS 5032 L12 security testing and dependability cases 2013
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 

Resilience and recovery

  • 1. Design for Recovery, York EngD Programme, 2010 Slide 1 Resilience and recovery Prof. Ian Sommerville
  • 2. Design for Recovery, York EngD Programme, 2010 Slide 2 Objectives • To discuss the notion of ‘failure’ in software systems • To explain why this conventional notion of ‘failure’ is not appropriate for many LSCITS • To propose an approach to failure management in LSCITS based on recoverability rather than failure avoidance
  • 3. Design for Recovery, York EngD Programme, 2010 Slide 4 Resilience • A judgment of the ability of a system to continue to deliver services to its user community in the presence of failure of some of its components • Closely related to recovery – the recovery of a system to normal service after a system failure • In order to provide a resilient system, it must be possible to recover from failure as quickly as possible without disrupting the operation of the existing system services.
  • 4. Design for Recovery, York EngD Programme, 2010 Slide 5 What is failure? • Reductionist view: a failure is ‘a deviation from a specification’. • An oracle can examine a specification, observe a system’s behaviour and detect failures. • Failure is an absolute - the system has either failed or it hasn’t
  • 5. Design for Recovery, York EngD Programme, 2010 Slide 7 Is this a failure? • A hospital system is designed to maintain information about available beds for incoming patients and to provide information about the number of beds to the admissions unit. • It is assumed that the hospital has a number of empty beds and this changes over time. The variable B reflects the number of empty beds known to the system. • Sometimes the system reports that the number of empty beds is the actual number available; sometimes the system reports that fewer than the actual number are available . • In circumstances where the system reports that an incorrect number of beds are available, is this a failure?
  • 6. Design for Recovery, York EngD Programme, 2010 Slide 8 Bed management system • The percentage of system users who considered the system’s incorrect reporting of the number of available beds to be a failure was 0%. • Mostly, the number did not matter so long as it was greater than 1. What mattered was whether or not patients could be admitted to the hospital. • When the hospital was very busy (available beds = 0), then people understood that it was practically impossible for the system to be accurate. • They used other methods to find out whether or not a bed was available for an incoming patient.
  • 7. Design for Recovery, York EngD Programme, 2010 Slide 9 Failure is a judgement • Specifications are a simplification of reality. • Users don’t read and don’t care about specifications • Whether or not system behaviour should be considered to be a failure, depends on the judgement of an observer of that behaviour • This judgement depends on: • The observer’s expectations • The observer’s knowledge and experience • The observer’s role • The observer’s context or situation • The observer’s authority
  • 8. Design for Recovery, York EngD Programme, 2010 Slide 10 System failure • ‘Failures’ are not just catastrophic events but normal, everyday system behaviour that disrupts normal work and that mean that people have to spend more time on a task than necessary • A system failure occurs when a direct or indirect user of a system has to carry out extra work, over and above that normally required to carry out some task, in response to some inappropriate system behaviour • This extra work constitutes the cost of recovery from system failure
  • 9. Design for Recovery, York EngD Programme, 2010 Slide 11 Failures are inevitable • Technical reasons • When systems are composed of opaque and uncontrolled components, the behaviour of these components cannot be completely understood • Failures often can be considered to be failures in data rather than failures in behaviour • Socio-technical reasons • Changing contexts of use mean that the judgement on what constitutes a failure changes as the effectiveness of the system in supporting work changes • Different stakeholders will interpret the same behaviour in different ways because of different interpretations of ‘the problem’
  • 10. Design for Recovery, York EngD Programme, 2010 Slide 12 Conflict inevitability • Impossible to establish a set of requirements where stakeholder conflicts are all resolved • Therefore, successful operation of a system for one set of stakeholders will inevitably mean ‘failure’ for another set of stakeholders • Groups of stakeholders in organisations are often in perennial conflict (e.g. managers and clinicians in a hospital). The support delivered by a system depends on the power held at some time by a stakeholder group.
  • 11. Design for Recovery, York EngD Programme, 2010 Slide 13 Where are we? • Large-scale information systems are inevitably complex systems • Such systems cannot be created using a reductionist approach • Failures are a judgement and this may change over time • Failures are inevitable and cannot be engineered out of a system
  • 12. Design for Recovery, York EngD Programme, 2010 Slide 14 The way forward • We need to accept that technical system failures will always occur and examine how we can design these systems to allow the broader socio-technical systems, in which these technical systems are used, to recognise, diagnose and recover from these failures
  • 13. Design for Recovery, York EngD Programme, 2010 Slide 15 Software dependability • A reductionist approach to software dependability takes the view that software failures are a consequence of software faults • Techniques to improve dependability include • Fault avoidance • Fault detection • Fault tolerance • These approaches have taken us quite a long way in improving software dependability. However, further progress is unlikely to be achieved by further improvement of these techniques as they rely on a reductionist view of failure.
  • 14. Design for Recovery, York EngD Programme, 2010 Slide 16 Failure recovery • Recognition • Recognise that inappropriate behaviour has occurred • Hypothesis • Formulate an explanation for the unexpected behaviour • Recovery • Take steps to compensate for the problem that has arisen
  • 15. Design for Recovery, York EngD Programme, 2010 Slide 18 Coping with failure • People are good at coping with unexpected situations when things go wrong. • They can take the initiative, adopt responsibilities and, where necessary, break the rules or step outside the normal process of doing things. • People can prioritise and focus on the essence of a problem
  • 16. Design for Recovery, York EngD Programme, 2010 Slide 19 Recovering from failure • Local knowledge • Who to call; who knows what; where things are • Process reconfiguration • Doing things in a different way from that defined in the ‘standard’ process • Work-arounds, breaking the rules (safe violations) • Redundancy and diversity • Maintaining copies of information in different forms from that maintained in a software system • Informal information annotation • Using multiple communication channels • Trust • Relying on others to cope
  • 17. Design for Recovery, York EngD Programme, 2010 Slide 20 Design for recovery • Holistic systems engineering • Software systems design has to be seen as part of a wider process of socio-technical systems engineering • We cannot build ‘correct’ systems • We must therefore design systems to allow the broader socio-technical systems to recognise, diagnose and recover from failures • Extend current systems to support recovery • Develop recovery support systems as an integral part of systems of systems
  • 18. Design for Recovery, York EngD Programme, 2010 Slide 22 Recovery strategy • Designing for recovery is a holistic approach to system design and not (just) the identification of ‘recovery requirements’ • Should support the natural ability of people and organisations to cope with problems • Ensure that system design decisions do not increase the amount of recovery work required • Make system design decisions that make it easier to recover from problems (i.e. reduce extra work required) • Earlier recognition of problems • Visibility to make hypotheses easier to formulate • Flexibility to support recovery actions
  • 19. Design for Recovery, York EngD Programme, 2010 Slide 28 Design guidelines • Local knowledge • Process reconfiguration • Redundancy and diversity
  • 20. Design for Recovery, York EngD Programme, 2010 Slide 29 Local knowledge • Local knowledge includes knowledge of who does what, how authority structures can be bypassed, what rules can be broken, etc. • Impossible to replicate entirely in distributed systems but some steps can be taken • Maintain information about the provenance of data • Maintain organisational models, contact lists, etc.
  • 21. Design for Recovery, York EngD Programme, 2010 Slide 31 Process reconfiguration • Make workflows explicit rather than embedding them in the software • Not just ‘continue’ buttons! Users should know where they are and where they are supposed to go • Support workflow navigation/interruption/restart • Design systems with an ‘emergency mode’ where the the system changes from enforcing policies to auditing actions • This would allow the rules to be broken but the system would maintain a log of what has been done and why so that subsequent investigations could trace what happened
  • 22. Design for Recovery, York EngD Programme, 2010 Slide 34 Redundancy and diversity • Data replication • Encourage the creation of ‘shadow systems’ and provide import and export from these systems • Schema extension • Schemas for data are rarely designed for problem solving. Always allow informal extension (a free text box) so that annotations, explanations and additional information can be provided • Organisational models • To allow for multi-channel communications when things go wrong • Paper • How to recover when the computers go off
  • 23. Design for Recovery, York EngD Programme, 2010 Slide 35 Barriers to recovery • Security • Automation • Process • Organisations
  • 24. Design for Recovery, York EngD Programme, 2010 Slide 36 Security and recoverability • Recoverability • Relies on trusting operators of the system not to abuse privileges that they may have been granted to help recover from problems • Security • Relies on mistrusting users and restricting access to information on a ‘need to know’ basis
  • 25. Design for Recovery, York EngD Programme, 2010 Slide 37 Automation hiding • A problem with automation is that information becomes subject to organizational policies that restrict access to that information. • Even when access is not restricted, we don’t have any shared culture in how to organise a large information store • Generally, authorisation models maintained by the system are based on normal rather than exceptional operation. • When problems arise and/or when people are unavailable, breaking the rules to solve these problems is made more difficult.
  • 26. Design for Recovery, York EngD Programme, 2010 Slide 38 Process tyranny • There is a notion that ‘standard’ business processes can be defined and embedded in systems that support these processes • Implicitly or explicitly, the system enforces the use of the ‘standard’ process • But this assumes three things: • The standard process is always appropriate • The standard process has anticipated all possible failures • The system can be respond in a timely way to process changes
  • 27. Design for Recovery, York EngD Programme, 2010 Slide 39 Multi-organisational systems • Many rules enforced in different ways by different systems. • Information is distributed - users may not be aware of where information is located, who owns information, etc. • Processes involve remote actors so process reconfiguration is more difficult • Restricted information channels (e.g. help unavailable outside normal business hours; no phone
  • 28. Design for Recovery, York EngD Programme, 2010 Slide 41 Take away messages • Improving existing software engineering methods will help but will not deal with the problems of complexity that are inherent in distributed systems of systems or LSCITS • We must learn to live with normal, everyday failures • Failure management in complex IT systems is not simply a technical problem. • Don’t stop clever people working out how to recover from failures • Recovery strategies include supporting information redundancy and annotation and maintaining organisational models