SERENE 2014 School on Engineering Resilient Cyber Physical Systems
Talk: Resilience in Cyber-Physical Systems: Challenges and Opportunities, by Gabor Karsai
1. Resilience in Cyber-Physical Systems:
Challenges and Opportunities
Gabor Karsai
Institute for Software-Integrated Systems
Vanderbilt University
SERENE 2014 – Autumn School
2. Acknowledgements
People: Janos Sztipanovits, Daniel Balasubramanian,
Abhishek Dubey, Tihamer Levendovszky, Nag Mahadevan,
and many others at the Institute for Software-Integrated
Systems @ Vanderbilt University
Sponsors: AFRL, DARPA, NASA, NSF through various
programs
3. Outline
Introduction
Cyber-physical Systems
Resilience
Building resilient CPS
System-level fault diagnostics
Software health management
Resilient architectures and autonomy
Conclusions
5. What is a Cyber-Physical System?
An engineered system that integrates physical and cyber
components where relevant functions are realized
through the interactions between the physical and cyber
parts.
Physical = some tangible, physical device + environment
Cyber = computational + communicational
6. Cyber-Physical Systems (CPS):
Integrating networked computational resources with physical systems
[Collage of application domains: factory automation (courtesy of Kuka Robotics Corp.), automotive (E-Corner, Siemens; Daimler-Chrysler), power generation and distribution (courtesy of General Electric), military systems, avionics, transportation (air traffic control at SFO), telecommunications, instrumentation (Soleil Synchrotron), building systems]
Courtesy of Doug Schmidt and Ed Lee, UCB
10. A Typical Cyber-Physical System
Printing Press
• Application aspects
• local (control)
• distributed (coordination)
• global (modes)
• Ethernet network
• Synchronous, Time-Triggered
• IEEE 1588 time-sync protocol
• High-speed, high precision
• Speed: 1 inch/ms (~100km/hr)
• Precision: 0.01 inch
Bosch-Rexroth -> Time accuracy: 10us
Courtesy of Ed Lee, UCB
11. Example – Flying Paster
Source: http://offsetpressman.blogspot.com/2011/03/how-flying-paster-works.html
Courtesy of Ed Lee, UCB
[Diagram: flying paster mechanism — sensor at top dead center, active paper feed, paper cutter, idle rollers, flying paster, drive rollers, dancer]
13. Example: Medical Devices
Emerging direction: Cell phone
based medical devices for
affordable healthcare
e.g. “Telemicroscopy” project
at Berkeley
e.g. Cell-phone based blood
testing device developed at
UCLA
Courtesy of Ed Lee, UCB
16. The Good News…
Networking and computing deliver unique precision and flexibility in interaction and coordination.
Computing/Communication side:
• Rich time models
• Flexible, dynamic communication mechanisms
• Precise time-variant, nonlinear behavior
• Introspection, learning, reasoning
Integrated CPS benefits:
• Precise interactions across highly extended spatial/temporal dimensions
• Elaborate coordination of physical processes
• Hugely increased system size with controllable, stable behavior
• Dynamic, adaptive architectures; adaptive, autonomic systems
• Self-monitoring, self-healing system architectures and better safety/security guarantees
17. …and the Challenges
Fusing networking and computing with physical processes brings new problems.
Computing/Communication side:
• Cyber vulnerability
• New types of interactions across highly extended spatial/temporal dimensions
• Flexible, dynamic communication mechanisms
• Precise time-variant, nonlinear behavior
• Introspection, learning, reasoning
Integrated CPS consequences:
• Physical behavior of systems can be manipulated
• Lack of composition theories for heterogeneous systems: many unsolved problems
• Vastly increased complexity and emergent behaviors
• Lack of theoretical foundations for CPS dynamics
• Fundamentally new challenges for verification, certification, and predictability
18. Example for a CPS Approach
Key Idea: Manage design complexity by creating abstraction layers in the design flow.
Abstraction layers define platforms: Physical Platform, Software Platform, Computation/Communication Platform.
Abstraction layers allow the verification of different properties.
Abstractions are linked through mapping.
Claire Tomlin, UC Berkeley
19. Abstraction layers and models:
Real-time Software
Sifakis et al., “Building Models of Real-Time Systems from Application Software,” Proceedings of the IEEE, Vol. 91, No. 1, pp. 100-111, January 2003.
Software models: f : T_In → 2^T_Out (correctness: the implementation satisfies f)
• f: reactive program. Program execution creates a mapping between logical-time inputs and outputs.
Real-time system models: f_R : T_In^R → 2^(T_Out^R)
• f_R: real-time system. Programs are packaged into interacting components. A scheduler controls access to computational and communication resources according to the time constraints P established by timing analysis; the real-time behavior f_R is thus determined by the program f, the execution platform, and P.
In CPS, essential system properties such as stability, safety, and performance are expressed in terms of physical behavior.
20. Abstraction layers and models:
Cyber-Physical Systems
Physical models: p : T_In^R → 2^(T_Out^R) (behavior over real-valued time). The closed-loop CPS behavior is the composition p ; f_R of the physical model with the real-time system model f_R : T_In^R → 2^(T_Out^R), which implements the software model f : T_In → 2^T_Out (correctness: the implementation satisfies f; time constraints P from timing analysis).
Re-defined goals:
• Compositional verification of essential dynamic properties
  − stability
  − safety
• Derive dynamics offering robustness against implementation changes and against uncertainties caused by faults and cyber attacks
  − fault/intrusion-induced reconfiguration of SW/HW
  − network uncertainties (packet drops, delays)
• Decreased verification complexity
21. Why is CPS Hard?
Software Control Systems
package org.apache.tomcat.session;
import org.apache.tomcat.core.*;
import org.apache.tomcat.util.StringManager;
import java.io.*;
import java.net.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;
/**
* Core implementation of a server session
*
* @author James Duncan Davidson [duncan@eng.sun.com]
* @author James Todd [gonzo@eng.sun.com]
*/
public class ServerSession {
private StringManager sm =
StringManager.getManager("org.apache.tomcat.session");
private Hashtable values = new Hashtable();
private Hashtable appSessions = new Hashtable();
private String id;
private long creationTime = System.currentTimeMillis();;
private long thisAccessTime = creationTime;
private long lastAccessed = creationTime;
private int inactiveInterval = -1;
ServerSession(String id) {
this.id = id;
}
public String getId() {
return id;
}
public long getCreationTime() {
return creationTime;
}
public long getLastAccessedTime() {
return lastAccessed;
}
public ApplicationSession getApplicationSession(Context context,
boolean create) {
ApplicationSession appSession =
(ApplicationSession)appSessions.get(context);
if (appSession == null && create) {
// XXX
// sync to ensure valid?
appSession = new ApplicationSession(id, this, context);
appSessions.put(context, appSession);
}
// XXX
// make sure that we haven't gone over the end of our
// inactive interval -- if so, invalidate and create
// a new appSession
return appSession;
}
void removeApplicationSession(Context context) {
appSessions.remove(context);
}
/**
* Called by context when request comes in so that accesses and
* inactivities can be dealt with accordingly.
*/
void accessed() {
// set last accessed to thisAccessTime as it will be left over
// from the previous access
lastAccessed = thisAccessTime;
thisAccessTime = System.currentTimeMillis();
}
void validate() { /* ... remainder of the class truncated on the slide ... */ }
Crosses Interdisciplinary Boundaries
• Disciplinary boundaries need to be realigned
• New fundamentals need to be created
• New technologies and tools need to be developed
• Education needs to be restructured
23. Cyber-Physical Systems:
Software Intensive Systems
Embedded software ….
is a crucial ingredient in modern systems
is the ‘universal system integrator’
could exhibit faults that lead to system failures
has progressed in complexity to the point that zero-defect systems (containing both hardware and software) are very difficult to build
needs to evolve while in operation
The challenge is to build software-intensive systems that
anticipate change: uncertain environments, faults, updates, and
exhibit resilience: they survive and adapt to changes while
being dependably functional.
24. Resilience
Webster:
Capable of withstanding shock without permanent deformation or
rupture
Tending to recover from or adjust easily to misfortune or change.
Technical:
The persistence of the avoidance of failures that are unacceptably
frequent or severe, when facing changes. [Laprie, ‘04]
A resilient system is trusted and effective out of the box in a wide
range of contexts, and easily adapted to many others through
reconfiguration or replacement. [R. Neches, OSD]
A resilient system detects anomalies in itself, diagnoses
their causes, and is able to recover lost functionality.
Research issues
•Model-driven engineering of Resilient Software Systems
•Design-time + Run-time aspects
•Resilience to: (1) faults, (2) environmental changes, (3) updates
•Target system category: distributed, real-time, embedded systems
Objective: Model-based
engineering approach
and tools to build
verifiably resilient
systems
28. Cyber-Physical Systems
Faults and resilience
In CPS faults can appear in (and cascade to) any place
Physical system
Hardware (computing and communication) system
Software (application and platform) system
In CPS physical and cyber elements are integrated
Many interaction pathways: P2P, P2C, C2C, P2C2P, C2P2P2C
Many modeling paradigms for physical systems
Consider engineering or physics!
Heterogeneous models need to be integrated for detailed analysis
In CPS recovery can take many forms
Physical action
Cyber restart
Software adaptation
29. CPS and Model-based Design
Design of CPS layers via MDE
Software models
Platform models
Physical models
Challenge: How to integrate the models so that cross-domain
interactions can be understood and managed?
30. A Strategy for Resilient CPS
Overall scheme:
Faults can originate in any layer of a hierarchy, in any component
Anomalies caused by the fault can be detected in the same or a higher layer
Based on anomalies a fault source isolation (diagnosis) is performed. The
diagnosis result may be reported to a higher layer, depending on the nature of
the fault.
The fault is locally mitigated first, but when that mitigation fails the higher layer is
informed about the anomaly, the diagnosed fault, and the mitigation action taken.
High-level view: Fault management is a control problem.
Faults are disturbances in the system whose effects prevent the system from
providing the required service(s)
Anomalies are the sensory inputs, mitigation actions are the actuators of the
fault management system
Fault mitigation must happen by considering (1) the current functional goals and
(2) the actual state of the system, on the right level of abstraction
31. A Strategy for Resilient CPS
Layered fault management
Concepts:
1. Faults propagate to neighboring layers via
guaranteed behaviors
2. Each layer includes pro-active and reactive fault
management mechanisms
Each layer provides a ‘fault reporting’ and
‘fault management’ interface
Fault management services are built into the
‘middleware’:
Temporal/spatial Isolation
Fault Tolerant Clock Sync
Time-triggered Communications
Group Communication and Transactions
Fault-tolerant Resource Sharing
Component/Service Migration
Primary/Backup
Replication
Autonomous Failure Management
Safe Dynamic Composition of Components
33. The need for resilience
In complex systems even simple
failures lead to complex cascades of
events that are difficult to understand
and manage.
How to
•detect and isolate faults?
•react to faults to mitigate their effect?
34. FACT:
A model-driven toolsuite for system-level diagnostics
Visual modeling tool for creating:
•System architecture models
•Timed failure propagation graph models
Run-time Platform (RTOS)
Modular run-time environment contains:
•Monitors detect anomalies in sensor data
and track mode changes
•TFPG Diagnostics Engine performs
diagnosis and isolates the source(s) of
observed anomalies
•Reports are generated for operators and
maintainers
Modules can be used standalone on an
embedded target processor with an RTOS
[Diagram: monitors feed the TFPG Diagnostics Engine, whose reports go to the operator and maintainer]
35. Modeling Language
Timed Failure Propagation Graphs (TFPG)
•Failure modes
•Discrepancies
•Monitors/Alarms
•Propagation links with:
•Time delay
•Mode
Fault model:
Known physical failure modes whose functional effects (discrepancies) are
monitored.
Diagnostic problem:
Given a set of active monitors and their temporal activation sequence,
which failure mode(s) explain the observations?
A causal network-like
model describing how
component failure effects
propagate across the
system activating
monitors.
Failure propagation links
and monitors could be
mode-dependent.
37. TFPG Example
Example
Not shown:
- Timing on propagation links
- Components/hierarchy
- Modal propagation
TFPG captures cause-effect
relationships that can be modal
and temporal. Effects may be
cumulative and/or monitored.
38. Timed Failure Propagation Graphs
Causal models that describe the system behavior in
presence of faults.
Model is a labeled directed graph where
Nodes represent either failure modes or discrepancies
Edges between nodes in the graph represent causality
Edges are attributed with timing and mode constraints on failure
propagation.
A discrepancy can be either monitored or unmonitored.
The monitor detects a sensory manifestation of an anomaly and
generates alarms.
[Diagram: failure cascades — failure modes propagate through discrepancies via propagation links; alarms are allocated to monitored discrepancies]
43. TFPG Reasoning
On-line diagnostics:
Input: Sequence of alarms and mode changes
Output: Sequence of sorted and ranked hypotheses containing failure mode(s)
that explain the observations (alarms, mode changes)
44. TFPG Hypothesis
TFPG Hypothesis: estimation of the current system state.
Directly, points to failure modes that “best” explain the
current set of observed alarms.
Indirectly, points to failed monitored discrepancies; those
with a state that is inconsistent with the (hypothesized) state
of the failure modes
Structure
List of possible Failure Modes
List of alarms in each set (Consistent (C) / Inconsistent (IC) / Missing (M) / Expected (E))
Metrics: Plausibility / Robustness / Failure Rate
45. Hypotheses Evaluation Metrics
Hypotheses are evaluated based on the following
metrics:
Plausibility: reflects the support of a hypothesis based on the
current observed alarm state. It answers the question: Which
hypothesis to consider?
Robustness: reflects the potential of a hypothesis (evidence)
to change based on remaining alarms. It answers the question:
When to take an action?
Failure Rate: is a measure of how often a particular failure
mode has occurred in the past.
46. Run-time System
Diagnostics Engine
Algorithm outline:
Check if new evidence is explained by
current hypotheses.
If not, create a new hypothesis that
assumes a hypothetical state of the
system consistent with observations
Rank hypotheses for plausibility and
robustness
Discard low-rank hypotheses, keep
plausible ones
Fault state: ‘total state vector’ of the
system, i.e. all failure modes and
discrepancies
Alarms could be
Missing: should have fired but did not
Inconsistent: fired, but it is not consistent
with the hypothesis
Robust diagnostics: tolerates missing and
inconsistent alarms
Metrics:
Plausibility: how plausible the hypothesis is w.r.t. alarm consistency
Robustness: how likely it is that the hypothesis will change in the future
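The reasoning loop above (extend hypotheses that explain new evidence, spawn a new hypothesis otherwise, rank by plausibility and robustness, prune) can be sketched as follows. The metric formulas here are simplified stand-ins, not the reasoner's actual definitions.

```python
# Illustrative sketch of the hypothesis-ranking loop described above.
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Hypothesis:
    failure_modes: Set[str]                              # failure modes assumed active
    consistent: Set[str] = field(default_factory=set)    # alarms explained
    inconsistent: Set[str] = field(default_factory=set)  # alarms contradicting
    missing: Set[str] = field(default_factory=set)       # expected but silent
    expected: Set[str] = field(default_factory=set)      # may still fire

    def plausibility(self) -> float:
        """Support from the alarms observed so far (simplified stand-in)."""
        total = len(self.consistent) + len(self.inconsistent) + len(self.missing)
        return len(self.consistent) / total if total else 0.0

    def robustness(self) -> float:
        """How unlikely the hypothesis is to change as more alarms arrive."""
        seen = len(self.consistent) + len(self.inconsistent) + len(self.missing)
        pending = seen + len(self.expected)
        return seen / pending if pending else 1.0

def update(hypotheses: List[Hypothesis], alarm: str,
           explainers: Set[str], keep: int = 10) -> List[Hypothesis]:
    """Process one alarm: extend hypotheses that explain it, spawn a new one
    if none does, then rank and prune."""
    explained = False
    for h in hypotheses:
        if h.failure_modes & explainers:        # evidence fits this hypothesis
            h.consistent.add(alarm)
            h.expected.discard(alarm)
            explained = True
        else:
            h.inconsistent.add(alarm)
    if not explained:                           # new hypothesis needed
        hypotheses.append(Hypothesis(failure_modes=set(explainers),
                                     consistent={alarm}))
    hypotheses.sort(key=lambda h: (h.plausibility(), h.robustness()), reverse=True)
    return hypotheses[:keep]                    # discard low-ranked hypotheses
```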
47. Run-time System
Diagnostics Engine
Novel properties:
Multi-fault hypothesis is the default
Fault state == State of all failure
modes/discrepancies
Reasoner works with sets of failure modes
(instead of individual failure modes)
Robust algorithm: can tolerate
missing/inconsistent alarms
Parsimony principle: Use simplest
explanation
Time-dependent diagnosis
Reasoner can be asked to recompute
diagnosis upon the advance of time
Extensions:
Modal edges: Propagation happens only if
edge is enabled (controlled by system
mode)
Diagnosis takes into consideration the last
propagation effect
Non-monotonic alarms:
Alarm retraction triggers a re-computation
of the diagnosis
48. Run-time System
Discrete (TFPG) Diagnostics
Additional capabilities:
Intermittent failure modes
Consequence: alarm/s change to ‘Off’
Assumption: low frequency intermittents
Upon alarm changing to ‘Off’, backtrack to
last change to ‘On’ and re-evaluate
Maintain alternate branches (for alarms ‘On’
and ‘Off’)
Test alarms: can be considered only
after activation
If inactive, it is an un-monitored
discrepancy.
If activated, it is used but timing may be
inconsistent (re: parent’s timing)
Metrics summary: Plausibility, Robustness
49. Performance Evaluation
For n failure modes and m discrepancies, the maximum number of hypotheses is n·m, but it is more likely to be O(n).
Updating a hypothesis is polynomial in the number of nodes and exponential in the number of sensor faults.
Model | #C | #FM | #D | #A | #M | #P | #R | Avg. Time (sec)
#1 | 15 | 36 | 48 | 21 | 0 | 120 | 1 | 0.000311
#2 | 11 | 36 | 120 | 174 | 27 | 3 | 1 | 0.000445
#3 | 153 | 481 | 1973 | 270 | 9 | 3409 | 1 | 0.013589
#4 | 24 | 64 | 116 | 116 | 0 | 695 | 4 | 0.016
#5 | 21 | 100 | 282 | 69 | 0 | 431 | 18 | 0.00288
• Key: #C – number of components / #FM – failure modes / #D – discrepancies / #A – alarms / #M – modes / #P – propagation links / #R – regions
• Avg. Time = average computation time taken by the reasoner (in seconds) after every event, on a 2.67 GHz Intel Xeon® CPU with 8 GB RAM.
50. Tool Operations
1. Modeling
2. Desktop experimentation,
validation
3. Feedback
4. Deployment on
embedded platform
Model
Interpretation
52. Motivation: Software as Failure Source?
Qantas 72 - Oct 7, 2008 – A330 (Australia) – ATSB Report
At 1240:28, while the aircraft was cruising at 37,000 ft, the autopilot disconnected. From about
the same time there were various aircraft system failure indications. At 1242:27, while the
crew was evaluating the situation, the aircraft abruptly pitched nose-down. The aircraft reached a
maximum pitch angle of about 8.4 degrees nose-down, and descended 650 ft during the event.
After returning the aircraft to 37,000 ft, the crew commenced actions to deal with multiple
failure messages. At 1245:08, the aircraft commenced a second uncommanded pitch-down event.
The aircraft reached a maximum pitch angle of about 3.5 degrees nose-down, and descended
about 400 ft during this second event. At 1249, the crew made a PAN urgency broadcast to air
traffic control, and requested a clearance to divert to and track direct to Learmonth. At 1254,
after receiving advice from the cabin of several serious injuries, the crew declared a MAYDAY.
The aircraft subsequently landed at Learmonth at 1350.
The investigation to date has identified two significant safety factors related to the pitch-down
movements. Firstly, immediately prior to the autopilot disconnect, one of the air data
inertial reference units (ADIRUs) started providing erroneous data (spikes) on
many parameters to other aircraft systems. The other two ADIRUs continued to
function correctly. Secondly, some of the spikes in angle of attack data were not
filtered by the flight control computers, and the computers subsequently commanded
the pitch-down movements.
http://www.atsb.gov.au/publications/investigation_reports/2008/AAIR/pdf/AO2008070_interim.pdf
53. Understanding the Problem
Embedded software is a complex engineering artifact that can have latent
faults, uncaught by testing and verification. Such faults become apparent
during operation when unforeseen modes and/or (system) faults appear.
The problem:
General: How to construct a Software Health Management system that
detects such faults, isolates their source/s, prognosticates their progression,
and takes mitigation actions in the system context?
Specific: How to specify, design, and implement such a system using a model-based
framework?
The larger picture:
General: Software Health Management must be integrated with System
Health Management – ‘Software Health Effects’ must be understood on the
System (Vehicle) Level.
54. What is ‘Systems Health Management’ ?
The ‘on-line’ view:
1. Detection of anomalies in system or component behavior
2. Identification and isolation of the fault source/s
3. Prognostication of impending faults that could lead to system failures
4. Mitigation of current or impending fault effects while preserving mission objective/s
Reports
Observations Corrections
Detection
Isolation
Prognostics
Mitigation
Examples:
- Automotive OBD (detection)
- Boeing 777 CMC (detection + isolation)
- Spacecraft fault protection (detection + isolation + mitigation)
- Aircraft fleet (detection + isolation + prognostics)
55. Software Health Management
Software is a complex
engineering artifact.
Software can have latent faults.
Faults appear during operation
when unforeseen modes or
interactions happen.
Techniques like Voting and Self-
Checking pairs have
shortcomings
Common mode faults
Fault cascades
• SHM is the extension of FDIR techniques used in physical systems to software.
[Diagram: FDIR loop — stimuli and responses; fault detection over observed inputs and observed behavior, under environmental and domain assumptions; fault isolation; fault mitigation]
56. Why ‘Software Health Management’?
Complexity of systems necessitates an additional layer ‘above’ software fault tolerance (SFT) that
manages ‘Software Health’
Embedded software ….
is a crucial ingredient in aerospace systems
is a method for implementing functionality
is the ‘universal system integrator’
could exhibit faults that lead to system failures
complexity has progressed to the point that zero-defect systems (containing
both hardware and software) are very difficult to build
Systems Health Management is an emerging field that addresses precisely
this problem: How to manage systems’ health in case of faults ?
‘Software Health Management’ is not…
A replacement for existing and robust engineering processes and standards
(DO-178B)
A substitute for hardware- and software fault tolerance
An ‘ultimate’ solution for fault tolerance
57. Software Health Management
Key ideas
Use software components as units of fault management: detection, diagnosis,
and mitigation
Components must be observable, provide fault isolation, and be capable of mitigation
Use a two-level architecture:
Component level: detect anomalies and mitigate locally
System level: receive anomaly reports, isolate the faulty component(s), and issue mitigation
commands to the component(s)
Use models to represent
anomalous conditions
fault cascades
mitigation actions (when / what)
Use model-based generators to synthesize code artifacts
Developer can use higher-level abstractions to design and implement the
software health management functions of a system
58. Software Component Framework
The Component Model should enable:
Monitoring
Interfaces (synchronous/asynchronous calls)
Component state
Scheduling and timing (WCET)
Resource usage
Anomaly Detection via:
Pre/post conditions over call parameters, rates, and component state
Conditions over (1) timing properties, (2) resource usage (e.g. memory footprint), and (3)
usage patterns
Combinations of the above
Mitigation:
Given detected anomaly and state of the component take action
Can be time- or event-triggered
Actions: restart, initialize, block call, inject value, inject call, release resource, modify state;
checkpoint/restore, combination of the above
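One way to realize such checks is to wrap a component operation with pre/post-condition and deadline monitors, as in the hedged sketch below; the decorator and the reporting hook are illustrative assumptions, not framework code.

```python
# Sketch: wrap a component operation with pre/post-condition checks and a
# deadline (WCET) check; names are illustrative, not the ACM framework API.
import time
from typing import Callable

def monitored(pre: Callable[..., bool], post: Callable[[object], bool],
              deadline_s: float, report: Callable[[str], None]):
    """Decorator: check pre-condition, execution time, and post-condition."""
    def wrap(operation):
        def inner(*args, **kwargs):
            if not pre(*args, **kwargs):
                report("precondition_violation")
            start = time.monotonic()
            result = operation(*args, **kwargs)
            if time.monotonic() - start > deadline_s:
                report("deadline_violation")          # WCET exceeded
            if not post(result):
                report("postcondition_violation")
            return result
        return inner
    return wrap

anomalies = []

@monitored(pre=lambda raw: 0 <= raw <= 4095,          # plausible sensor raw range
           post=lambda alt: -500.0 <= alt <= 50000.0, # plausible altitude (ft)
           deadline_s=0.002,
           report=anomalies.append)
def compute_altitude(raw: int) -> float:
    return raw * 12.5 - 500.0                          # toy conversion

compute_altitude(5000)   # out-of-range input -> pre- and post-condition violations
print(anomalies)         # a local health manager would react to these reports
```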
59. Notional Component Model
Parameter
Component
Resource Trigger
Subscribe
(Event)
Publish
(Event)
Provided
(Interface)
Required
(Interface)
State
A component is a unit (containing potentially many objects). The component is parameterized, has
state, it consumes resources, publishes and subscribes to events, provides interfaces to
and requires interfaces from other components.
Publish/Subscribe: Event-driven, asynchronous communication (publisher does not wait)
Required/Provided: Synchronous communication using call/return semantics.
Triggering can be periodic or sporadic.
Extension of a Component Model defined by OMG (CCM) : state, resource, trigger interfaces.
60. Example: Component Interactions
Sampler
Component GPS
Component
Display
Component
Components can interact via asynchronous/event-triggered and synchronous/call-driven connections.
Example: The Sampler component is triggered periodically and it publishes an event upon each
activation. The GPS component subscribes to this event and is triggered sporadically to obtain
GPS data from the receiver, and when ready it publishes its own output event. The Display
component is triggered sporadically via this event and it uses a required interface to retrieve the
position data from the GPS component.
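A toy rendering of this interaction pattern follows: asynchronous publish/subscribe for event-triggered activation, plus a synchronous provided/required interface for the Display-to-GPS query. Class and method names are invented for illustration and are not the ACM framework's API.

```python
# Toy publish/subscribe plus provided/required interface interaction.
from typing import Callable, Dict, List

class EventBus:
    """Asynchronous publish/subscribe channel (publisher does not wait)."""
    def __init__(self):
        self._subs: Dict[str, List[Callable]] = {}
    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subs.setdefault(topic, []).append(handler)
    def publish(self, topic: str, event) -> None:
        for handler in self._subs.get(topic, []):
            handler(event)

class GPSComponent:
    """Subscribes to sampler ticks, publishes GPS data, provides get_position()."""
    def __init__(self, bus: EventBus):
        self.bus = bus
        self.state = {"position": None}             # component state
        bus.subscribe("sampler.tick", self.on_tick)
    def on_tick(self, event) -> None:               # sporadic, event-triggered
        self.state["position"] = (47.0, -122.0)     # pretend to read the receiver
        self.bus.publish("gps.data_ready", {"valid": True})
    def get_position(self):                         # provided (synchronous) interface
        return self.state["position"]

class DisplayComponent:
    """Triggered by GPS events; uses a required interface to pull the position."""
    def __init__(self, bus: EventBus, gps: GPSComponent):
        self.gps = gps                              # required interface binding
        bus.subscribe("gps.data_ready", self.on_data)
    def on_data(self, event) -> None:
        print("Display:", self.gps.get_position())  # synchronous call/return

bus = EventBus()
display = DisplayComponent(bus, GPSComponent(bus))
bus.publish("sampler.tick", {})                     # the Sampler's periodic activation
```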
61. Component Monitoring
Component
Monitor arriving
events
Monitor incoming
calls
Monitor published
events
Monitor outgoing
calls
Observe state
Monitor resource
usage
Monitor control flow/
triggering
62. ACM:
The ARINC Component Model
Provide a CCM-like layer on top of ARINC-653 abstractions
Notional model:
Terminology:
Synchronous: call/return
Asynchronous: publish-return/trigger-process
Periodic: time-triggered
Aperiodic: event-triggered
Note:
All component interactions are realized via the framework
Process (method) execution time has deadline, which is monitored
63. ACM:
The ARINC Component Model
Each ‘input interface’ has its own process
Process must obtain read-write/lock on component
Asynchronous publisher (subscriber) interface:
Listener (publisher) process
Pushes (receives) one event (a struct), with a validity flag
Can be event-triggered or time-triggered (i.e. 4 variations)
Synchronous provided (required) interface:
Handles incoming synchronous RMI call
Forwards outgoing synchronous RMI call
Other interfaces:
State: to observe component state variables
Resource: to monitor resource usage
Trigger: to monitor execution timing
64. ACM:
A Prototype Implementation
ARINC-653 Emulator
Emulates APEX services using Linux APIs
Partition → Process, Process → Thread
Module manager: schedules partition set
Partition level scheduler: schedules threads within partition
CORBA foundation
CCM Implementation
No modification
ACM component interactions
Mainly implemented via APEX
RMI interactions use threads
65. Implementation: Mapping ACM to APEX
APEX abstraction → Platform (Linux): Module → Host/Processor; Partition → Process; Process → Thread
ACM (APEX Component Model) → APEX realization (APEX concepts used):
• Component method
  – Periodic → periodic process (process start/stop, semaphores)
  – Sporadic → aperiodic process
• Invocation, synchronous call-return
  – Periodic target, co-located: N/A
  – Periodic target, non-co-located: N/A
  – Sporadic target, co-located: caller method signals callee to release, then waits for the callee until completion (event, blackboard)
  – Sporadic target, non-co-located: caller method sends RMI (via CM) to release the callee, then waits for the RMI to complete (TCP/IP, semaphore, event)
• Invocation, asynchronous publish-subscribe
  – Periodic target, co-located: callee is periodically triggered and polls the ‘event buffer’; a validity flag indicates whether data is stale or fresh (blackboard)
  – Periodic target, non-co-located: sampling port, channel
  – Sporadic target, co-located: callee is released when an event is available (blackboard, semaphore, event)
  – Sporadic target, non-co-located: caller notifies via TCP/IP, callee is released upon receipt (queuing port, semaphore, event)
66. ACM:
Modeling Language
Modeling elements:
Data types: primitive, structs, vectors
Interfaces: methods with arguments
Components:
Publish/Subscribe ports (with data type)
Provided/Required interfaces (with i/f type)
Health Manager
Assemblies
Deployment: Modules, Partitions; Component → Partition assignment
68. Anomaly Detection
Model-Based Specification of
monitoring expressions
Post/Pre condition violations:
threshold, rate, custom filter
(moving average)
Resource Violations: Deadline
Validity Violation: Stale data on
a consumer
Concurrency Violations: Lock
timeouts.
User code violations: reported
error conditions from
application code.
Code Generators
Synthesize code for
implementing the monitors
[Diagram: port monitors (arriving events, incoming calls, published events, outgoing calls) and non-port monitors (component state, resource usage, control flow/triggering)]
• Based on these local detections, each component developer can implement a local health manager.
• It is a reactive timed state machine with pre-specified actions.
• All alarms and actions are reported to the system health manager.
69. ACM:
Modeling Language: Monitoring
Monitoring on component interfaces
Subscriber port → ‘Subscriber process’ and Publisher port → ‘Publisher process’
Monitor: pre-conditions and post-conditions
On subscriber: data validity (‘age’ of data)
Deadline (hard / soft)
Provided interface → ‘Provider methods’ and Required interface → ‘Required methods’
Monitor: pre-conditions and post-conditions
Deadline (hard / soft)
Can be specified on a per-component basis
Monitoring language:
Simple, named expressions over input (output) parameters, component state, delta(var), and rate(var,dt). The expression yields a Boolean condition.
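The monitoring expressions could be evaluated at run time roughly as sketched below; the delta/rate/validity helpers mirror the language primitives described above, but the implementation details are assumptions, not the generated monitor code.

```python
# Sketch of delta()/rate()/validity() style monitoring-expression evaluation.
import time

class VarHistory:
    """Keeps the last two samples of a monitored variable."""
    def __init__(self):
        self.prev = None   # (timestamp, value)
        self.curr = None
    def sample(self, value, t=None):
        self.prev, self.curr = self.curr, (t if t is not None else time.time(), value)

def delta(h: VarHistory):
    """Change of the variable between the last two samples."""
    if h.prev is None or h.curr is None:
        return 0.0
    return h.curr[1] - h.prev[1]

def rate(h: VarHistory):
    """Change per unit time between the last two samples."""
    if h.prev is None or h.curr is None:
        return 0.0
    dt = h.curr[0] - h.prev[0]
    return (h.curr[1] - h.prev[1]) / dt if dt > 0 else 0.0

def validity(h: VarHistory, now=None):
    """Age of the newest sample (data-validity / staleness check)."""
    if h.curr is None:
        return float("inf")
    return (now if now is not None else time.time()) - h.curr[0]

# Boolean monitor conditions in the spirit of the slide-73 examples:
gps_data = VarHistory()
gps_data.sample(100.0, t=0.0)
gps_data.sample(102.5, t=1.0)
alarm_stale  = validity(gps_data, now=1.005) > 0.004   # Validity(...) < 4 ms violated
alarm_frozen = not (delta(gps_data) > 0)                # Delta(...) > 0 violated
alarm_rate   = not (rate(gps_data) > 1)                 # Rate(...) > 1 violated
```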
70. Component-level Health Management
Manager’s behavioral model:
Finite-state machine
Triggers: monitored events, time
Actions: mitigation activities
Manager is local to component
container (for efficiency) but shall be
protected from the faults of functional
components
Notional behavior:
Track component state changes via
detected events and progression of
time
Take mitigation actions as needed
Design issues:
Co-location with component (fault containment)
Local detection may implicate another component
[Diagram: a component with monitors (e.g., WCET) in the component framework reporting events to a manager, which issues actions back; manager state machine with states Idle, Exec, and InvA and transitions start, finish, timeout/init, and invA_violation/reset]
71. ACM - Modeling Language:
Component Health Manager
Reactive Timed State Machine
Event trigger:
Predefined conditions (e.g. deadline violation, data validity violation)
User-defined conditions (e.g. pre-condition violation)
Reaction: mitigation action (start, reset, refuse, ignore, etc.)
State: current state of the machine
(Event × State) → Action
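A minimal (Event × State) → Action machine in this style might look as follows; the states, events, and actions are placeholders rather than generated CLHM code.

```python
# Minimal reactive state machine in the (Event x State) -> Action style.
from typing import Callable, Dict, Tuple

State, Event, Action = str, str, Callable[[], None]

class ComponentHealthManager:
    def __init__(self, initial: State,
                 table: Dict[Tuple[State, Event], Tuple[State, Action]]):
        self.state = initial
        self.table = table      # (state, event) -> (next state, mitigation action)

    def dispatch(self, event: Event) -> None:
        key = (self.state, event)
        if key in self.table:
            next_state, action = self.table[key]
            action()            # mitigation action (e.g. RESTART, IGNORE)
            self.state = next_state
        # unmatched events are ignored (stay in the current state)

def ignore():  print("IGNORE")
def restart(): print("RESTART component")

clhm = ComponentHealthManager("Idle", {
    ("Idle", "start"):                  ("Exec", ignore),
    ("Exec", "finish"):                 ("Idle", ignore),
    ("Exec", "deadline_violation"):     ("Idle", restart),
    ("Exec", "precondition_violation"): ("Idle", ignore),
})
clhm.dispatch("start")
clhm.dispatch("deadline_violation")     # -> RESTART component, back to Idle
```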
72. Component Health Management
Available Actions
[Diagram: CLHM architecture — the Component Health Manager runs as a high-priority ARINC-653 process; component ports (ARINC-653 processes) report errors via a buffered blackboard with blocking reads, and the CLHM replies with an HM response/action; the manager cycles NOMINAL → ERROR → CHECK RESULT, returning to NOMINAL on ‘action successful’ and entering FAILURE on timeout or failed action]
73. Assembly Definition
Validity(GPS.data_in)<4ms
Delta(Nav.data_in.time)>0
Rate(gps_data_src.data)>1
Specified Monitoring Conditions
The Sensor component is triggered periodically and it publishes an event upon each
activation.
The GPS component subscribes to this event and is triggered periodically to obtain GPS
data from the receiver. It publishes its own output event.
The Nav Display component is triggered sporadically via this event and it uses a required interface to retrieve the position data from the GPS component.
74. System-level Health Management
Focus issue: Cascading faults
Hypothesis: Fault effects cascade via component interactions
Anomalies detected at the component level are not ‘diagnosed’: they can be caused by other components
Problem:
How to model fault cascades?
How to diagnose and isolate fault cascade root causes?
How to mitigate fault cascades?
76. Recap: Fault diagnosis
Fault diagnosis algorithm:
• Outline:
– Check if new evidence is explained
by current hypotheses.
– If not, create a new hypothesis that
assumes a hypothetical state of the
system consistent with observations
– Rank hypotheses for plausibility and
robustness metrics
– Discard low-rank hypotheses, keep plausible ones
Fault state: ‘total state vector’ of the system, i.e. all failure modes and discrepancies
Alarms could be
Missing: should have fired but did not
Inconsistent: fired, but it is not consistent
with the hypothesis
Robust diagnostics: tolerates missing and
inconsistent alarms
Metrics:
Plausibility: how plausible the hypothesis is w.r.t. alarm consistency
Robustness: how likely it is that the hypothesis will change in the future
77. Modeling Cascading Faults
Not needed - the cascades can be computed from the
component assemblies, if the anomaly types and their
interactions are known.
Component ‘elements’: every method belongs to one of these (7 element types).
Fault cascades within a component follow patterns (a few of the 38 patterns are shown on the slide).
78. Modeling Cascading Faults
Inter-component propagation is regular – always follows the
same pattern
Intra-component propagation depends on the component!
Need to model internal dataflow and control flow of the
component.
Note: Could be determined via source code analysis.
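One way such a propagation graph could be derived is sketched below: inter-component edges follow the assembly's port connections, while intra-component edges come from each component's declared internal dataflow. The data shapes are illustrative assumptions, not the tool's model format.

```python
# Sketch: derive fault propagation edges from an assembly description.
from typing import Dict, List, Tuple

# Each component declares internal dataflow: input port -> output port(s).
internal_flow: Dict[str, Dict[str, List[str]]] = {
    "Sensor":  {"trigger": ["data_out"]},
    "GPS":     {"data_in": ["gps_out"]},
    "Display": {"gps_in": []},          # sink: no outgoing flow
}

# Assembly connections: (component, output port) -> (component, input port).
connections: List[Tuple[Tuple[str, str], Tuple[str, str]]] = [
    (("Sensor", "data_out"), ("GPS", "data_in")),
    (("GPS", "gps_out"),     ("Display", "gps_in")),
]

def propagation_edges():
    edges = []
    # Intra-component propagation (depends on each component's dataflow).
    for comp, flows in internal_flow.items():
        for src_port, dst_ports in flows.items():
            for dst_port in dst_ports:
                edges.append(((comp, src_port), (comp, dst_port)))
    # Inter-component propagation (always follows the port connection).
    for src, dst in connections:
        edges.append((src, dst))
    return edges

for edge in propagation_edges():
    print(edge)
```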
79. Modeling Cascading Faults
Fault Propagation Graph for GPS Example
Here: hand-crafted, but it is generated automatically in the
system
80. System-level Fault Mitigation
Model-based system-level mitigation engine
Model-based diagnoser is automatically generated
Designer specifies fault mitigation
strategies using a reactive state machine
Advantages:
• Models are higher-level programs to specify (potentially complex) behavior – more readable and comprehensible
• Models lend themselves to formal analysis – e.g. model checking
[Diagram: managed components with component-level health managers (CHM) on the component platform report discrepancies (D) to the Diagnoser Engine, which uses the component fault models to isolate failure modes (FM) and drives the Mitigation Engine]
81. System-level Fault
Mitigation
Model-based mitigation specification at
two levels
Component level: quick action
System level: Reactive action taking the
system state into consideration
System designer specifies them as a
parallel timed state machine.
Fixed set of mitigation actions are
available
Runtime code is generated from
models
Advantages:
Models are higher-level programs to
specify (potentially complex) behavior –
more readable and comprehensible
Models lend themselves to formal
analysis – e.g. model checking
List of predefined Mitigation Actions (HM Action → Semantics):
• CLHM: IGNORE – Continue as if nothing has happened
• CLHM: ABORT – Discontinue the current operation, but the operation can run again
• CLHM: USE PAST DATA – Use the most recent data (only for operations that expect fresh data)
• CLHM: STOP – Discontinue the current operation; aperiodic processes (ports): the operation can run again; periodic processes (ports): the operation must be enabled by a future START HM action
• CLHM: START – Re-enable a STOP-ped periodic operation
• CLHM: RESTART – A macro for STOP followed by a START for the current operation
• SLHM: RESET – Stop all operations, initialize the state of the component, start all periodic operations
• SLHM: STOP – Stop all operations
[Diagram: alarms flow from components to the Diagnoser Engine, which feeds the Mitigation Engine]
82. System-level Health Management
Functional components
1. Aggregator:
Integrates (collates) health information coming
from components (typically in one hyperperiod)
2. Diagnoser:
Performs fault diagnosis, based on the fault
propagation graph model
Ranks hypotheses
Component that appears in all hypotheses with
the highest rank is chosen for mitigation
3. Response Engine:
Issues mitigation actions to components based
on diagnosis results
Based on a state machine model that maps
diagnostic results to mitigation actions
These components are generated
automatically from the models
The Health Management Approach:
1. Locally detected anomalies are mitigated
locally first. – Quick reactive response.
2. Anomalies and local mitigation actions are
reported to the system level.
3. Aggregated reports are subjected to
diagnosis, potentially followed by a system-level
mitigation action.
4. System-level response commands are
propagated down to components.
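The three-stage pipeline could be approximated as in the following sketch: aggregate reports from one hyperperiod, pick the component implicated by all top-ranked hypotheses, and map it to a mitigation command. The function names and the RESET/STOP policy are assumptions for illustration, not the generated components.

```python
# Illustrative aggregate -> diagnose -> respond pipeline.
from typing import Dict, List, Set

def aggregate(reports: List[dict]) -> List[dict]:
    """Collate health reports received within one hyperperiod."""
    return [r for r in reports if r.get("kind") in ("alarm", "local_mitigation")]

def implicated_component(hypotheses: List[dict]) -> str:
    """Component appearing in every highest-ranked hypothesis."""
    top_rank = max(h["rank"] for h in hypotheses)
    top = [set(h["components"]) for h in hypotheses if h["rank"] == top_rank]
    common: Set[str] = set.intersection(*top)
    return sorted(common)[0] if common else ""

def respond(component: str, policy: Dict[str, str]) -> str:
    """Map the diagnosed component to an SLHM command (e.g. RESET or STOP)."""
    return f"SLHM:{policy.get(component, 'RESET')} {component}"

reports = [{"kind": "alarm", "component": "Accel5"},
           {"kind": "local_mitigation", "component": "Accel5"}]
hypotheses = [{"rank": 3, "components": ["Accel5"]},
              {"rank": 3, "components": ["Accel5", "Accel6"]},
              {"rank": 1, "components": ["Voter1"]}]
print(len(aggregate(reports)),
      respond(implicated_component(hypotheses), {"Accel5": "STOP"}))
```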
83. Example:
2005 Malaysian Air Boeing 777 in-flight upset
Low airspeed advisory.
Airplane’s autopilot experienced excessive acceleration values.
Vertical acceleration decreased to -2.3g within ½ second
Lateral acceleration decreased to -1.01g (left) within ½ second
Longitudinal acceleration increases to +1.2 g within ½ second
Autopilot pitched nose-up to 17.6 degrees and climbed at a vertical speed of 10,650 fpm.
Airspeed reduced to 241 knots.
Stick shaker activated at top of the climb.
Aircraft descended 4,000 ft.
Re-engagement of autopilot followed by another climb of 2,000 ft.
Maximum rate of climb = 4440 fpm.
84. B777 ADIRU Architecture
• Designed to be serviceable with
one fault in each FCA
• Can fly but maintenance
required upon landing with two
faults in each FCA
• Each ARINC 629 end unit voted
on the processor data bit-by-bit.
• Processors monitor the ARINC
629 modules by full data wrap-around
• Processors also monitor the
power supplies, any one of
which can power the entire unit
• Accelerometer and gyro in
skewed redundant configuration
• A secondary unit (SAARU) also
provided inertial data
Based on Air Data Inertial Reference Unit (ADIRU)
Architecture (ATSB, 2007, p.5)
85. Cause of Inflight Upset
June 2001: accelerometer 5 fails with high output value, ADIRU disregards it.
A power cycle on the ADIRU occurs. A latent software bug causes the previously recorded faulty status
of accelerometer 5 to be ignored.
Status of failed unit was recorded on-board maintenance memory, but that memory was
not checked by the software.
An inflight fault was recorded in accelerometer 6 and it was disregarded.
FDI software allowed use of accelerometer 5.
High acceleration value was passed to all computers.
Due to common-mode nature of fault, voters allowed high accelerometer data to
go on all channels.
This high value was used by primary flight computer.
The mid-value select function used by the flight computer lessened the effect of the pitch motion.
Problem: The system relied on redundancy to mask a fault. But due to a latent software
bug and a common-mode fault, the effect cascaded into a system failure.
Reading Material: The dangers of failure masking in fault-tolerant software: aspects of a recent in-flight upset event
C.W. Johnson and C.M. Holloway, IET Conf. Pub. 2007, 60 (2007), DOI:10.1049/cp:20070442
86. Case Study
• Modeled the architecture as a
software component assembly
• Created the fault scenario
• Only modeled part of the system
to illustrate the point of SHM
• Accelerometers are arranged on six faces of a dodecahedron; their outputs are used in the regression equations.
93. System Health Manager
Other state machines have similar specifications.
These components are auto-generated.
The hypothesis generated by the diagnoser is translated into the component(s) most likely to be
faulty. This list is fed to the Response Engine, which triggers the mitigation state machine.
94. Demonstration
Fault Scenario
Accelerometer 5 has an initial fault
It is started, which causes an alarm
Then Accelerometer 6 develops a fault
Successful mitigation
Identifying the faulty components
Stopping the faulty components
The processors can still function with four accelerometers.
97. Resilience and autonomy
Model-based Software Health Management
Requires explicit specification of component-level and system-level
health management (recovery) actions
Complex and error-prone… too many options!
Resilient systems should recover autonomously
Concepts:
Model the system architecture + functions.
Express what is needed from the system to implement
functions.
Embed models into the run-time system
Use a reasoner to figure out how to recover function upon
failures
98. Modeling
Functional Requirements for IMU
Inertial Position
• Determine inertial position.
• Functional requirement (AND): GPS Position, Position Tracking
GPS Position
• Sense GPS position for computing Inertial Position.
Position Tracking
• Continuously track position to compute Inertial Position.
• Functional requirement: Body Acceleration Measurement
Body Acceleration Measurement
• Sense body acceleration for Position Tracking.
[Diagram: requirement tree — Inertial Position depends on GPS Position and Position Tracking; Position Tracking depends on Body Acceleration Measurement]
100. Modeling the Architecture
Function Allocation
Body Acceleration
Measurement
EXACTLY ONE (Primary /Secondary ADIRU
Subsystem)
ADIRU Subsystem has
• Accelerometers (6)
• ADIRU Computers (4)
• Voters (3)
Functional / Operational ADIRU Subsystem
requires
• ATLEAST 4 of 6 Accelerometers
• ATLEAST 2 of 4 Filters or ADIRU
computers
• ATLEAST 1 of 3 Voter
Inside one ADIRU:
101. Modeling the Architecture
Function Allocation
GPS Position
EXACTLY ONE (Primary/Secondary
GPS Subsystem)
GPS Subsystem includes
GPS Receiver (1)
GPS Processor (1)
Functional / Operational GPS
subsystem requires
EXACTLY ONE of GPS Receiver
EXACTLY ONE of GPS Processor
Inside one GPS Subsystem:
102. Modeling the Architecture
Function Allocation
POSITION TRACKING
ATLEAST ONE OF ( LEFT/ CENTER/
RIGHT PFC NavFilter Subsystem)
PFC NavFilter Subsystem includes
PFC Nav Filter (1)
PFC Processor (1)
Functional/ Operational Requirement
for PFC Subsystem
EXACTLY ONE PFC NavFilter
EXACTLY ONE PFC Processor
Inside one PFC Subsystem:
103. Component Operational Requirement
EXPLICIT – Local dependency
Display Subsystem
ATLEAST 1 of 3 Consumers (Left, Center, Right)
EXPLICIT – Local dependency
ADIRU Computer inside ADIRU Subsystem
ATLEAST 4 of 6 Consumer Port
Implies
ATLEAST 4 of 6 Accelerometer Components
104. Component Operational Requirement
IMPLICIT – Inferred dependency
PFC NavFilter in PFC Subsystem
EXACTLY 1 of 1 Consumer Port AND
ATLEAST 1 of 1 Requires Port
Implies
EXACTLY 1 of 2 ADIRU Subsystems AND
ATLEAST 1 of 2 GPS Subsystem
105. Component Operational Requirement
IMPLICIT – Inferred dependency
PFC Processor inside PFC Subsystem
EXACTLY 1 of 1 Consumer Port
Implies
EXACTLY 1 of 1 PFC NavFilter
GPS Processor inside GPS Subsystem
EXACTLY 1 of 1 Consumer Port
Implies
EXACTLY 1 of 1 GPS Receiver
106. Modeling the problem:
Boolean SAT
Functional Requirements + Function allocation +
Component operational requirements + Component states
Encoded as Boolean (CNF) Expression for SATisfiability
problem
Solution: valid component architecture
Size: #Variables: 493/ #Clauses: 1776
Fault / Scenario | SAT-solver reconfiguration compute time (s) | Reconfiguration commands
Verifying initial state | 0.004228 | No commands; the initial state is accepted as satisfying/meeting the functional requirements.
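To make the encoding concrete, the sketch below shows how "at least k of n" and "exactly one" constraints can be turned into CNF clauses and checked for a toy version of the accelerometer scenario. The encoding and the brute-force check are illustrative assumptions; the actual tool uses a real SAT solver over the full 493-variable model.

```python
# Toy CNF encoding of architecture constraints plus a brute-force SAT check.
from itertools import combinations, product
from typing import List

Clause = List[int]   # positive int = variable true, negative = variable false

def at_least_k(variables: List[int], k: int) -> List[Clause]:
    """CNF for 'at least k of these variables are true': every choice of
    n-k+1 variables must contain at least one true literal."""
    n = len(variables)
    return [list(subset) for subset in combinations(variables, n - k + 1)]

def exactly_one(variables: List[int]) -> List[Clause]:
    """CNF for 'exactly one is true': at least one, and no two together."""
    clauses = [list(variables)]
    clauses += [[-a, -b] for a, b in combinations(variables, 2)]
    return clauses

# Variables 1..6: accelerometers healthy; 7..8: primary/secondary ADIRU selected.
accels, adirus = [1, 2, 3, 4, 5, 6], [7, 8]
cnf = at_least_k(accels, 4) + exactly_one(adirus)
# Inject the observed fault state: accelerometers 5 and 6 have failed.
cnf += [[-5], [-6]]

def satisfiable(cnf: List[Clause], n_vars: int) -> bool:
    """Tiny brute-force check (fine for a toy model, not for 493 variables)."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in cnf):
            return True
    return False

print(satisfiable(cnf, 8))            # True: the four remaining accelerometers
                                      # still satisfy 'at least 4 of 6'
print(satisfiable(cnf + [[-4]], 8))   # False: a third accelerometer failure
                                      # cannot be reconfigured around
```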