SERENE 2014 School on Engineering Resilient Cyber Physical Systems
Talk: Resilience in Cyber-Physical Systems: Challenges and Opportunities, by Gabor Karsai
1. Resilience in Cyber-Physical Systems:
Challenges and Opportunities
Gabor Karsai
Institute for Software-Integrated Systems
Vanderbilt University
SERENE 2014 – Autumn School
2. Acknowledgements
People: Janos Sztipanovits, Daniel Balasubramanian,
Abhishek Dubey, Tihamer Levendovszky, Nag Mahadevan,
and many others at the Institute for Software-Integrated
Systems @ Vanderbilt University
Sponsors: AFRL, DARPA, NASA, NSF through various
programs
3. Outline
Introduction
Cyber-physical Systems
Resilience
Building resilient CPS
System-level fault diagnostics
Software health management
Resilient architectures and autonomy
Conclusions
5. What is a Cyber-Physical System?
An engineered system that integrates physical and cyber
components where relevant functions are realized
through the interactions between the physical and cyber
parts.
Physical = some tangible, physical device + environment
Cyber = computational + communicational
6. Cyber-Physical Systems (CPS):
Integrating networked computational resources with physical systems
[Collage of application domains: factory automation (courtesy of Kuka Robotics Corp.), automotive (E-Corner, Siemens; Daimler-Chrysler), power generation and distribution (courtesy of General Electric), military systems, avionics, transportation (air traffic control at SFO), telecommunications, instrumentation (Soleil Synchrotron), building systems]
Courtesy of Doug Schmidt and Ed Lee, UCB
10. A Typical Cyber-Physical System
Printing Press
• Application aspects
• local (control)
• distributed (coordination)
• global (modes)
• Ethernet network
• Synchronous, Time-Triggered
• IEEE 1588 time-sync protocol
• High-speed, high precision
• Speed: 1 inch/ms (~100km/hr)
• Precision: 0.01 inch
Bosch-Rexroth -> Time accuracy: 10us
Courtesy of Ed Lee, UCB
11. Example – Flying Paster
Source: http://offsetpressman.blogspot.com/2011/03/how-flying-paster-works.html
Courtesy of Ed Lee, UCB
[Diagram: flying paster mechanism — sensor at top dead center, active paper feed, paper cutter, idle rollers, flying paster, drive rollers, dancer]
13. Example: Medical Devices
Emerging direction: Cell phone
based medical devices for
affordable healthcare
e.g. “Telemicroscopy” project
at Berkeley
e.g. Cell-phone based blood
testing device developed at
UCLA
Courtesy of Ed Lee, UCB
16. The Good News…
Networking and computing deliver unique precision and flexibility in interaction and coordination.
Computing/Communication side:
• Rich time models
• Flexible, dynamic communication mechanisms
• Precise time-variant, nonlinear behavior
• Introspection, learning, reasoning
Integrated CPS benefits:
• Precise interactions across highly extended spatial/temporal dimensions
• Elaborate coordination of physical processes
• Hugely increased system size with controllable, stable behavior
• Dynamic, adaptive architectures; adaptive, autonomic systems
• Self-monitoring, self-healing system architectures and better safety/security guarantees
17. …and the Challenges
Fusing networking and computing with physical processes brings new problems.
Computing/Communication side:
• Cyber vulnerability
• New types of interactions across highly extended spatial/temporal dimensions
• Flexible, dynamic communication mechanisms
• Precise time-variant, nonlinear behavior
• Introspection, learning, reasoning
Integrated CPS consequences:
• Physical behavior of systems can be manipulated
• Lack of composition theories for heterogeneous systems: many unsolved problems
• Vastly increased complexity and emergent behaviors
• Lack of theoretical foundations for CPS dynamics
• Fundamentally new challenges for verification, certification, and predictability
18. Example for a CPS Approach
Key Idea: Manage design complexity by creating abstraction layers in the design flow.
Abstraction layers define platforms: Physical Platform, Software Platform, Computation/Communication Platform.
Abstraction layers allow the verification of different properties.
Abstractions are linked through mapping.
Claire Tomlin, UC Berkeley
19. Abstraction layers and models:
Real-time Software
Sifakis et al., “Building Models of Real-Time Systems from Application Software,” Proceedings of the IEEE, Vol. 91, No. 1, pp. 100-111, January 2003.
Software models: f : T_In → 2^T_Out (correctness: the implementation satisfies f)
• f: reactive program. Program execution creates a mapping between logical-time inputs and outputs.
Real-time system models: f_R : T_In^R → 2^(T_Out^R)
• f_R: real-time system. Programs are packaged into interacting components. A scheduler controls access to computational and communication resources according to the time constraints P established by timing analysis; the real-time behavior f_R is thus determined by the program f, the execution platform, and P.
In CPS, essential system properties such as stability, safety, and performance are expressed in terms of physical behavior.
20. Abstraction layers and models:
Cyber-Physical Systems
Physical models: p : T_In^R → 2^(T_Out^R) (behavior over real-valued time). The closed-loop CPS behavior is the composition p ; f_R of the physical model with the real-time system model f_R : T_In^R → 2^(T_Out^R), which implements the software model f : T_In → 2^T_Out (correctness: the implementation satisfies f; time constraints P from timing analysis).
Re-defined goals:
• Compositional verification of essential dynamic properties
  − stability
  − safety
• Derive dynamics offering robustness against implementation changes and against uncertainties caused by faults and cyber attacks
  − fault/intrusion-induced reconfiguration of SW/HW
  − network uncertainties (packet drops, delays)
• Decreased verification complexity
21. Why is CPS Hard?
Software Control Systems
package org.apache.tomcat.session;
import org.apache.tomcat.core.*;
import org.apache.tomcat.util.StringManager;
import java.io.*;
import java.net.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;
/**
* Core implementation of a server session
*
* @author James Duncan Davidson [duncan@eng.sun.com]
* @author James Todd [gonzo@eng.sun.com]
*/
public class ServerSession {
private StringManager sm =
StringManager.getManager("org.apache.tomcat.session");
private Hashtable values = new Hashtable();
private Hashtable appSessions = new Hashtable();
private String id;
private long creationTime = System.currentTimeMillis();;
private long thisAccessTime = creationTime;
private long lastAccessed = creationTime;
private int inactiveInterval = -1;
ServerSession(String id) {
this.id = id;
}
public String getId() {
return id;
}
public long getCreationTime() {
return creationTime;
}
public long getLastAccessedTime() {
return lastAccessed;
}
public ApplicationSession getApplicationSession(Context context,
boolean create) {
ApplicationSession appSession =
(ApplicationSession)appSessions.get(context);
if (appSession == null && create) {
// XXX
// sync to ensure valid?
appSession = new ApplicationSession(id, this, context);
appSessions.put(context, appSession);
}
// XXX
// make sure that we haven't gone over the end of our
// inactive interval -- if so, invalidate and create
// a new appSession
return appSession;
}
void removeApplicationSession(Context context) {
appSessions.remove(context);
}
/**
* Called by context when request comes in so that accesses and
* inactivities can be dealt with accordingly.
*/
void accessed() {
// set last accessed to thisAccessTime as it will be left over
// from the previous access
lastAccessed = thisAccessTime;
thisAccessTime = System.currentTimeMillis();
}
void validate() { /* ... remainder of the class truncated on the slide ... */ }
Crosses Interdisciplinary Boundaries
• Disciplinary boundaries need to be realigned
• New fundamentals need to be created
• New technologies and tools need to be developed
• Education needs to be restructured
23. Cyber-Physical Systems:
Software Intensive Systems
Embedded software ….
is a crucial ingredient in modern systems
is the ‘universal system integrator’
could exhibit faults that lead to system failures
has progressed in complexity to the point that zero-defect systems (containing both hardware and software) are very difficult to build
needs to evolve while in operation
The challenge is to build software-intensive systems that
anticipate change: uncertain environments, faults, updates, and
exhibit resilience: they survive and adapt to changes while
being dependably functional.
24. Resilience
Webster:
Capable of withstanding shock without permanent deformation or
rupture
Tending to recover from or adjust easily to misfortune or change.
Technical:
The persistence of the avoidance of failures that are unacceptably
frequent or severe, when facing changes. [Laprie, ‘04]
A resilient system is trusted and effective out of the box in a wide
range of contexts, and easily adapted to many others through
reconfiguration or replacement. [R. Neches, OSD]
A resilient system detects anomalies in itself, diagnoses
their causes, and is able to recover lost functionality.
Research issues
•Model-driven engineering of Resilient Software Systems
•Design-time + Run-time aspects
•Resilience to: (1) faults, (2) environmental changes, (3) updates
•Target system category: distributed, real-time, embedded systems
Objective: Model-based
engineering approach
and tools to build
verifiably resilient
systems
28. Cyber-Physical Systems
Faults and resilience
In CPS faults can appear in (and cascade to) any place
Physical system
Hardware (computing and communication) system
Software (application and platform) system
In CPS physical and cyber elements are integrated
Many interaction pathways: P2P, P2C, C2C, P2C2P, C2P2P2C
Many modeling paradigms for physical systems
Consider engineering or physics!
Heterogeneous models need to be integrated for detailed analysis
In CPS recovery can take many forms
Physical action
Cyber restart
Software adaptation
29. CPS and Model-based Design
Design of CPS layers via MDE
Software models
Platform models
Physical models
Challenge: How to integrate the models so that cross-domain
interactions can be understood and managed?
30. A Strategy for Resilient CPS
Overall scheme:
Faults can originate in any layer of a hierarchy, in any component
Anomalies caused by the fault can be detected in the same or a higher layer
Based on anomalies a fault source isolation (diagnosis) is performed. The
diagnosis result may be reported to a higher layer, depending on the nature of
the fault.
The fault is locally mitigated first, but when that mitigation fails the higher layer is
informed about the anomaly, the diagnosed fault, and the mitigation action taken.
High-level view: Fault management is a control problem.
Faults are disturbances in the system whose effects prevent the system from
providing the required service(s)
Anomalies are the sensory inputs, mitigation actions are the actuators of the
fault management system
Fault mitigation must happen by considering (1) the current functional goals and
(2) the actual state of the system, on the right level of abstraction
31. A Strategy for Resilient CPS
Layered fault management
Concepts:
1. Faults propagate to neighboring layers via
guaranteed behaviors
2. Each layer includes pro-active and reactive fault
management mechanisms
Each layer provides a ‘fault reporting’ and
‘fault management’ interface
Fault management services are built into the
‘middleware’:
Temporal/spatial Isolation
Fault Tolerant Clock Sync
Time-triggered Communications
Group Communication and Transactions
Fault-tolerant Resource Sharing
Component/Service Migration
Primary/Backup
Replication
Autonomous Failure Management
Safe Dynamic Composition of Components
33. The need for resilience
In complex systems even simple
failures lead to complex cascades of
events that are difficult to understand
and manage.
How to
•detect and isolate faults?
•react to faults to mitigate their effect?
34. FACT:
A model-driven toolsuite for system-level diagnostics
Visual modeling tool for creating:
•System architecture models
•Timed failure propagation graph models
Run-time Platform (RTOS)
Modular run-time environment contains:
•Monitors detect anomalies in sensor data
and track mode changes
•TFPG Diagnostics Engine performs
diagnosis and isolates the source(s) of
observed anomalies
•Reports are generated for operators and
maintainers
Modules can be used standalone on an
embedded target processor with an RTOS
[Diagram: monitors feed the TFPG Diagnostics Engine, whose reports go to the operator and maintainer]
35. Modeling Language
Timed Failure Propagation Graphs (TFPG)
•Failure modes
•Discrepancies
•Monitors/Alarms
•Propagation links with:
•Time delay
•Mode
Fault model:
Known physical failure modes whose functional effects (discrepancies) are
monitored.
Diagnostic problem:
Given a set of active monitors and their temporal activation sequence,
which failure mode(s) explain the observations?
A causal network-like
model describing how
component failure effects
propagate across the
system activating
monitors.
Failure propagation links
and monitors could be
mode-dependent.
37. TFPG Example
Example
Not shown:
- Timing on propagation links
- Components/hierarchy
- Modal propagation
TFPG captures cause-effect
relationships that can be modal
and temporal. Effects may be
cumulative and/or monitored.
38. Timed Failure Propagation Graphs
Causal models that describe the system behavior in
presence of faults.
Model is a labeled directed graph where
Nodes represent either failure modes or discrepancies
Edges between nodes in the graph represent causality
Edges are attributed with timing and mode constraints on failure
propagation.
A discrepancy can be either monitored or unmonitored.
The monitor detects a sensory manifestation of an anomaly and
generates alarms.
[Diagram: failure cascades — failure modes propagate through discrepancies via propagation links; alarms are allocated to monitored discrepancies]
43. TFPG Reasoning
On-line diagnostics:
Input: Sequence of alarms and mode changes
Output: Sequence of sorted and ranked hypotheses containing failure mode(s)
that explain the observations (alarms, mode changes)
44. TFPG Hypothesis
TFPG Hypothesis: estimation of the current system state.
Directly, points to failure modes that “best” explain the
current set of observed alarms.
Indirectly, points to failed monitored discrepancies; those
with a state that is inconsistent with the (hypothesized) state
of the failure modes
Structure
List of possible Failure Modes
List of alarms in each set (Consistent (C) / Inconsistent (IC) / Missing (M) / Expected (E))
Metrics: Plausibility / Robustness / Failure Rate
45. Hypotheses Evaluation Metrics
Hypotheses are evaluated based on the following
metrics:
Plausibility: reflects the support of a hypothesis based on the
current observed alarm state. It answers the question: Which
hypothesis to consider?
Robustness: reflects the potential of a hypothesis (evidence)
to change based on remaining alarms. It answers the question:
When to take an action?
Failure Rate: is a measure of how often a particular failure
mode has occurred in the past.
46. Run-time System
Diagnostics Engine
Algorithm outline:
Check if new evidence is explained by
current hypotheses.
If not, create a new hypothesis that
assumes a hypothetical state of the
system consistent with observations
Rank hypotheses for plausibility and
robustness
Discard low-rank hypotheses, keep
plausible ones
Fault state: ‘total state vector’ of the
system, i.e. all failure modes and
discrepancies
Alarms could be
Missing: should have fired but did not
Inconsistent: fired, but it is not consistent
with the hypothesis
Robust diagnostics: tolerates missing and
inconsistent alarms
Metrics:
Plausibility: how plausible the hypothesis is w.r.t. alarm consistency
Robustness: how likely it is that the hypothesis will change in the future
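The reasoning loop above (extend hypotheses that explain new evidence, spawn a new hypothesis otherwise, rank by plausibility and robustness, prune) can be sketched as follows. The metric formulas here are simplified stand-ins, not the reasoner's actual definitions.

```python
# Illustrative sketch of the hypothesis-ranking loop described above.
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Hypothesis:
    failure_modes: Set[str]                              # failure modes assumed active
    consistent: Set[str] = field(default_factory=set)    # alarms explained
    inconsistent: Set[str] = field(default_factory=set)  # alarms contradicting
    missing: Set[str] = field(default_factory=set)       # expected but silent
    expected: Set[str] = field(default_factory=set)      # may still fire

    def plausibility(self) -> float:
        """Support from the alarms observed so far (simplified stand-in)."""
        total = len(self.consistent) + len(self.inconsistent) + len(self.missing)
        return len(self.consistent) / total if total else 0.0

    def robustness(self) -> float:
        """How unlikely the hypothesis is to change as more alarms arrive."""
        seen = len(self.consistent) + len(self.inconsistent) + len(self.missing)
        pending = seen + len(self.expected)
        return seen / pending if pending else 1.0

def update(hypotheses: List[Hypothesis], alarm: str,
           explainers: Set[str], keep: int = 10) -> List[Hypothesis]:
    """Process one alarm: extend hypotheses that explain it, spawn a new one
    if none does, then rank and prune."""
    explained = False
    for h in hypotheses:
        if h.failure_modes & explainers:        # evidence fits this hypothesis
            h.consistent.add(alarm)
            h.expected.discard(alarm)
            explained = True
        else:
            h.inconsistent.add(alarm)
    if not explained:                           # new hypothesis needed
        hypotheses.append(Hypothesis(failure_modes=set(explainers),
                                     consistent={alarm}))
    hypotheses.sort(key=lambda h: (h.plausibility(), h.robustness()), reverse=True)
    return hypotheses[:keep]                    # discard low-ranked hypotheses
```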
47. Run-time System
Diagnostics Engine
Novel properties:
Multi-fault hypothesis is the default
Fault state == State of all failure
modes/discrepancies
Reasoner works with sets of failure modes
(instead of individual failure modes)
Robust algorithm: can tolerate
missing/inconsistent alarms
Parsimony principle: Use simplest
explanation
Time-dependent diagnosis
Reasoner can be asked to recompute
diagnosis upon the advance of time
Extensions:
Modal edges: Propagation happens only if
edge is enabled (controlled by system
mode)
Diagnosis takes into consideration the last
propagation effect
Non-monotonic alarms:
Alarm retraction triggers a re-computation
of the diagnosis
48. Run-time System
Discrete (TFPG) Diagnostics
Additional capabilities:
Intermittent failure modes
Consequence: alarm/s change to ‘Off’
Assumption: low frequency intermittents
Upon alarm changing to ‘Off’, backtrack to
last change to ‘On’ and re-evaluate
Maintain alternate branches (for alarms ‘On’
and ‘Off’)
Test alarms: can be considered only
after activation
If inactive, it is an un-monitored
discrepancy.
If activated, it is used but timing may be
inconsistent (re: parent’s timing)
Metrics summary: Plausibility, Robustness
49. Performance Evaluation
For n failure modes and m discrepancies, the maximum number of hypotheses is n·m, but it is more likely to be O(n).
Updating a hypothesis is polynomial in the number of nodes and exponential in the number of sensor faults.
Model | #C | #FM | #D | #A | #M | #P | #R | Avg. Time (sec)
#1 | 15 | 36 | 48 | 21 | 0 | 120 | 1 | 0.000311
#2 | 11 | 36 | 120 | 174 | 27 | 3 | 1 | 0.000445
#3 | 153 | 481 | 1973 | 270 | 9 | 3409 | 1 | 0.013589
#4 | 24 | 64 | 116 | 116 | 0 | 695 | 4 | 0.016
#5 | 21 | 100 | 282 | 69 | 0 | 431 | 18 | 0.00288
• Key: #C – number of components / #FM – failure modes / #D – discrepancies / #A – alarms / #M – modes / #P – propagation links / #R – regions
• Avg. Time = average computation time taken by the reasoner (in seconds) after every event, on a 2.67 GHz Intel Xeon® CPU with 8 GB RAM.
50. Tool Operations
1. Modeling
2. Desktop experimentation,
validation
3. Feedback
4. Deployment on
embedded platform
Model
Interpretation
52. Motivation: Software as Failure Source?
Qantas 72 - Oct 7, 2008 – A330 (Australia) – ATSB Report
At 1240:28, while the aircraft was cruising at 37,000 ft, the autopilot disconnected. From about
the same time there were various aircraft system failure indications. At 1242:27, while the
crew was evaluating the situation, the aircraft abruptly pitched nose-down. The aircraft reached a
maximum pitch angle of about 8.4 degrees nose-down, and descended 650 ft during the event.
After returning the aircraft to 37,000 ft, the crew commenced actions to deal with multiple
failure messages. At 1245:08, the aircraft commenced a second uncommanded pitch-down event.
The aircraft reached a maximum pitch angle of about 3.5 degrees nose-down, and descended
about 400 ft during this second event. At 1249, the crew made a PAN urgency broadcast to air
traffic control, and requested a clearance to divert to and track direct to Learmonth. At 1254,
after receiving advice from the cabin of several serious injuries, the crew declared a MAYDAY.
The aircraft subsequently landed at Learmonth at 1350.
The investigation to date has identified two significant safety factors related to the pitch-down
movements. Firstly, immediately prior to the autopilot disconnect, one of the air data
inertial reference units (ADIRUs) started providing erroneous data (spikes) on
many parameters to other aircraft systems. The other two ADIRUs continued to
function correctly. Secondly, some of the spikes in angle of attack data were not
filtered by the flight control computers, and the computers subsequently commanded
the pitch-down movements.
http://www.atsb.gov.au/publications/investigation_reports/2008/AAIR/pdf/AO2008070_interim.pdf
53. Understanding the Problem
Embedded software is a complex engineering artifact that can have latent
faults, uncaught by testing and verification. Such faults become apparent
during operation when unforeseen modes and/or (system) faults appear.
The problem:
General: How to construct a Software Health Management system that
detects such faults, isolates their source/s, prognosticates their progression,
and takes mitigation actions in the system context?
Specific: How to specify, design, and implement such a system using a model-based
framework?
The larger picture:
General: Software Health Management must be integrated with System
Health Management – ‘Software Health Effects’ must be understood on the
System (Vehicle) Level.
54. What is ‘Systems Health Management’ ?
The ‘on-line’ view:
1. Detection of anomalies in system or component behavior
2. Identification and isolation of the fault source/s
3. Prognostication of impending faults that could lead to system failures
4. Mitigation of current or impending fault effects while preserving mission objective/s
Reports
Observations Corrections
Detection
Isolation
Prognostics
Mitigation
Examples:
- Automotive OBD (detection)
- Boeing 777 CMC (detection + isolation)
- Spacecraft fault protection (detection + isolation + mitigation)
- Aircraft fleet (detection + isolation + prognostics)
55. Software Health Management
Software is a complex
engineering artifact.
Software can have latent faults.
Faults appear during operation
when unforeseen modes or
interactions happen.
Techniques like Voting and Self-
Checking pairs have
shortcomings
Common mode faults
Fault cascades
• SHM is the extension of FDIR techniques used in physical systems to software.
[Diagram: FDIR loop — stimuli and responses; fault detection over observed inputs and observed behavior, under environmental and domain assumptions; fault isolation; fault mitigation]
56. Why ‘Software Health Management’?
Complexity of systems necessitates an additional layer ‘above’ software fault tolerance (SFT) that
manages ‘Software Health’
Embedded software ….
is a crucial ingredient in aerospace systems
is a method for implementing functionality
is the ‘universal system integrator’
could exhibit faults that lead to system failures
complexity has progressed to the point that zero-defect systems (containing
both hardware and software) are very difficult to build
Systems Health Management is an emerging field that addresses precisely
this problem: How to manage systems’ health in case of faults ?
‘Software Health Management’ is not…
A replacement for existing and robust engineering processes and standards
(DO-178B)
A substitute for hardware- and software fault tolerance
An ‘ultimate’ solution for fault tolerance
57. Software Health Management
Key ideas
Use software components as units of fault management: detection, diagnosis,
and mitigation
Components must be observable, provide fault isolation, and be capable of mitigation
Use a two-level architecture:
Component level: detect anomalies and mitigate locally
System level: receive anomaly reports, isolate the faulty component(s), and issue mitigation
commands to the component(s)
Use models to represent
anomalous conditions
fault cascades
mitigation actions (when / what)
Use model-based generators to synthesize code artifacts
Developer can use higher-level abstractions to design and implement the
software health management functions of a system
58. Software Component Framework
The Component Model should enable:
Monitoring
Interfaces (synchronous/asynchronous calls)
Component state
Scheduling and timing (WCET)
Resource usage
Anomaly Detection via:
Pre/post conditions over call parameters, rates, and component state
Conditions over (1) timing properties, (2) resource usage (e.g. memory footprint), and (3)
usage patterns
Combinations of the above
Mitigation:
Given detected anomaly and state of the component take action
Can be time- or event-triggered
Actions: restart, initialize, block call, inject value, inject call, release resource, modify state;
checkpoint/restore, combination of the above
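One way to realize such checks is to wrap a component operation with pre/post-condition and deadline monitors, as in the hedged sketch below; the decorator and the reporting hook are illustrative assumptions, not framework code.

```python
# Sketch: wrap a component operation with pre/post-condition checks and a
# deadline (WCET) check; names are illustrative, not the ACM framework API.
import time
from typing import Callable

def monitored(pre: Callable[..., bool], post: Callable[[object], bool],
              deadline_s: float, report: Callable[[str], None]):
    """Decorator: check pre-condition, execution time, and post-condition."""
    def wrap(operation):
        def inner(*args, **kwargs):
            if not pre(*args, **kwargs):
                report("precondition_violation")
            start = time.monotonic()
            result = operation(*args, **kwargs)
            if time.monotonic() - start > deadline_s:
                report("deadline_violation")          # WCET exceeded
            if not post(result):
                report("postcondition_violation")
            return result
        return inner
    return wrap

anomalies = []

@monitored(pre=lambda raw: 0 <= raw <= 4095,          # plausible sensor raw range
           post=lambda alt: -500.0 <= alt <= 50000.0, # plausible altitude (ft)
           deadline_s=0.002,
           report=anomalies.append)
def compute_altitude(raw: int) -> float:
    return raw * 12.5 - 500.0                          # toy conversion

compute_altitude(5000)   # out-of-range input -> pre- and post-condition violations
print(anomalies)         # a local health manager would react to these reports
```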
59. Notional Component Model
Parameter
Component
Resource Trigger
Subscribe
(Event)
Publish
(Event)
Provided
(Interface)
Required
(Interface)
State
A component is a unit (containing potentially many objects). The component is parameterized, has
state, it consumes resources, publishes and subscribes to events, provides interfaces to
and requires interfaces from other components.
Publish/Subscribe: Event-driven, asynchronous communication (publisher does not wait)
Required/Provided: Synchronous communication using call/return semantics.
Triggering can be periodic or sporadic.
Extension of a Component Model defined by OMG (CCM) : state, resource, trigger interfaces.
60. Example: Component Interactions
Sampler
Component GPS
Component
Display
Component
Components can interact via asynchronous/event-triggered and synchronous/call-driven connections.
Example: The Sampler component is triggered periodically and it publishes an event upon each
activation. The GPS component subscribes to this event and is triggered sporadically to obtain
GPS data from the receiver, and when ready it publishes its own output event. The Display
component is triggered sporadically via this event and it uses a required interface to retrieve the
position data from the GPS component.
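A toy rendering of this interaction pattern follows: asynchronous publish/subscribe for event-triggered activation, plus a synchronous provided/required interface for the Display-to-GPS query. Class and method names are invented for illustration and are not the ACM framework's API.

```python
# Toy publish/subscribe plus provided/required interface interaction.
from typing import Callable, Dict, List

class EventBus:
    """Asynchronous publish/subscribe channel (publisher does not wait)."""
    def __init__(self):
        self._subs: Dict[str, List[Callable]] = {}
    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subs.setdefault(topic, []).append(handler)
    def publish(self, topic: str, event) -> None:
        for handler in self._subs.get(topic, []):
            handler(event)

class GPSComponent:
    """Subscribes to sampler ticks, publishes GPS data, provides get_position()."""
    def __init__(self, bus: EventBus):
        self.bus = bus
        self.state = {"position": None}             # component state
        bus.subscribe("sampler.tick", self.on_tick)
    def on_tick(self, event) -> None:               # sporadic, event-triggered
        self.state["position"] = (47.0, -122.0)     # pretend to read the receiver
        self.bus.publish("gps.data_ready", {"valid": True})
    def get_position(self):                         # provided (synchronous) interface
        return self.state["position"]

class DisplayComponent:
    """Triggered by GPS events; uses a required interface to pull the position."""
    def __init__(self, bus: EventBus, gps: GPSComponent):
        self.gps = gps                              # required interface binding
        bus.subscribe("gps.data_ready", self.on_data)
    def on_data(self, event) -> None:
        print("Display:", self.gps.get_position())  # synchronous call/return

bus = EventBus()
display = DisplayComponent(bus, GPSComponent(bus))
bus.publish("sampler.tick", {})                     # the Sampler's periodic activation
```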
61. Component Monitoring
Component
Monitor arriving
events
Monitor incoming
calls
Monitor published
events
Monitor outgoing
calls
Observe state
Monitor resource
usage
Monitor control flow/
triggering
62. ACM:
The ARINC Component Model
Provide a CCM-like layer on top of ARINC-653 abstractions
Notional model:
Terminology:
Synchronous: call/return
Asynchronous: publish-return/trigger-process
Periodic: time-triggered
Aperiodic: event-triggered
Note:
All component interactions are realized via the framework
Process (method) execution time has deadline, which is monitored
63. ACM:
The ARINC Component Model
Each ‘input interface’ has its own process
Process must obtain read-write/lock on component
Asynchronous publisher (subscriber) interface:
Listener (publisher) process
Pushes (receives) one event (a struct), with a validity flag
Can be event-triggered or time-triggered (i.e. 4 variations)
Synchronous provided (required) interface:
Handles incoming synchronous RMI call
Forwards outgoing synchronous RMI call
Other interfaces:
State: to observe component state variables
Resource: to monitor resource usage
Trigger: to monitor execution timing
64. ACM:
A Prototype Implementation
ARINC-653 Emulator
Emulates APEX services using Linux APIs
Partition → Process, Process → Thread
Module manager: schedules partition set
Partition level scheduler: schedules threads within partition
CORBA foundation
CCM Implementation
No modification
ACM component interactions
Mainly implemented via APEX
RMI interactions use threads
65. Implementation: Mapping ACM to APEX
APEX abstraction → Platform (Linux): Module → Host/Processor; Partition → Process; Process → Thread
ACM (APEX Component Model) → APEX realization (APEX concepts used):
• Component method
  – Periodic → periodic process (process start/stop, semaphores)
  – Sporadic → aperiodic process
• Invocation, synchronous call-return
  – Periodic target, co-located: N/A
  – Periodic target, non-co-located: N/A
  – Sporadic target, co-located: caller method signals callee to release, then waits for the callee until completion (event, blackboard)
  – Sporadic target, non-co-located: caller method sends RMI (via CM) to release the callee, then waits for the RMI to complete (TCP/IP, semaphore, event)
• Invocation, asynchronous publish-subscribe
  – Periodic target, co-located: callee is periodically triggered and polls the ‘event buffer’; a validity flag indicates whether data is stale or fresh (blackboard)
  – Periodic target, non-co-located: sampling port, channel
  – Sporadic target, co-located: callee is released when an event is available (blackboard, semaphore, event)
  – Sporadic target, non-co-located: caller notifies via TCP/IP, callee is released upon receipt (queuing port, semaphore, event)
66. ACM:
Modeling Language
Modeling elements:
Data types: primitive, structs, vectors
Interfaces: methods with arguments
Components:
Publish/Subscribe ports (with data type)
Provided/Required interfaces (with i/f type)
Health Manager
Assemblies
Deployment: Modules, Partitions; Component → Partition assignment
68. Anomaly Detection
Model-Based Specification of
monitoring expressions
Post/Pre condition violations:
threshold, rate, custom filter
(moving average)
Resource Violations: Deadline
Validity Violation: Stale data on
a consumer
Concurrency Violations: Lock
timeouts.
User code violations: reported
error conditions from
application code.
Code Generators
Synthesize code for
implementing the monitors
[Diagram: port monitors (arriving events, incoming calls, published events, outgoing calls) and non-port monitors (component state, resource usage, control flow/triggering)]
• Based on these local detections, each component developer can implement a local health manager.
• It is a reactive timed state machine with pre-specified actions.
• All alarms and actions are reported to the system health manager.
69. ACM:
Modeling Language: Monitoring
Monitoring on component interfaces
Subscriber port → ‘Subscriber process’ and Publisher port → ‘Publisher process’
Monitor: pre-conditions and post-conditions
On subscriber: data validity (‘age’ of data)
Deadline (hard / soft)
Provided interface → ‘Provider methods’ and Required interface → ‘Required methods’
Monitor: pre-conditions and post-conditions
Deadline (hard / soft)
Can be specified on a per-component basis
Monitoring language:
Simple, named expressions over input (output) parameters, component state, delta(var), and rate(var,dt). The expression yields a Boolean condition.
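The monitoring expressions could be evaluated at run time roughly as sketched below; the delta/rate/validity helpers mirror the language primitives described above, but the implementation details are assumptions, not the generated monitor code.

```python
# Sketch of delta()/rate()/validity() style monitoring-expression evaluation.
import time

class VarHistory:
    """Keeps the last two samples of a monitored variable."""
    def __init__(self):
        self.prev = None   # (timestamp, value)
        self.curr = None
    def sample(self, value, t=None):
        self.prev, self.curr = self.curr, (t if t is not None else time.time(), value)

def delta(h: VarHistory):
    """Change of the variable between the last two samples."""
    if h.prev is None or h.curr is None:
        return 0.0
    return h.curr[1] - h.prev[1]

def rate(h: VarHistory):
    """Change per unit time between the last two samples."""
    if h.prev is None or h.curr is None:
        return 0.0
    dt = h.curr[0] - h.prev[0]
    return (h.curr[1] - h.prev[1]) / dt if dt > 0 else 0.0

def validity(h: VarHistory, now=None):
    """Age of the newest sample (data-validity / staleness check)."""
    if h.curr is None:
        return float("inf")
    return (now if now is not None else time.time()) - h.curr[0]

# Boolean monitor conditions in the spirit of the slide-73 examples:
gps_data = VarHistory()
gps_data.sample(100.0, t=0.0)
gps_data.sample(102.5, t=1.0)
alarm_stale  = validity(gps_data, now=1.005) > 0.004   # Validity(...) < 4 ms violated
alarm_frozen = not (delta(gps_data) > 0)                # Delta(...) > 0 violated
alarm_rate   = not (rate(gps_data) > 1)                 # Rate(...) > 1 violated
```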
70. Component-level Health Management
Manager’s behavioral model:
Finite-state machine
Triggers: monitored events, time
Actions: mitigation activities
Manager is local to component
container (for efficiency) but shall be
protected from the faults of functional
components
Notional behavior:
Track component state changes via
detected events and progression of
time
Take mitigation actions as needed
Design issues:
Co-location with component (fault containment)
Local detection may implicate another component
[Diagram: a component with monitors (e.g., WCET) in the component framework reporting events to a manager, which issues actions back; manager state machine with states Idle, Exec, and InvA and transitions start, finish, timeout/init, and invA_violation/reset]
71. ACM - Modeling Language:
Component Health Manager
Reactive Timed State Machine
Event trigger:
Predefined conditions (e.g. deadline violation, data validity violation)
User-defined conditions (e.g. pre-condition violation)
Reaction: mitigation action (start, reset, refuse, ignore, etc.)
State: current state of the machine
(Event × State) → Action
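A minimal (Event × State) → Action machine in this style might look as follows; the states, events, and actions are placeholders rather than generated CLHM code.

```python
# Minimal reactive state machine in the (Event x State) -> Action style.
from typing import Callable, Dict, Tuple

State, Event, Action = str, str, Callable[[], None]

class ComponentHealthManager:
    def __init__(self, initial: State,
                 table: Dict[Tuple[State, Event], Tuple[State, Action]]):
        self.state = initial
        self.table = table      # (state, event) -> (next state, mitigation action)

    def dispatch(self, event: Event) -> None:
        key = (self.state, event)
        if key in self.table:
            next_state, action = self.table[key]
            action()            # mitigation action (e.g. RESTART, IGNORE)
            self.state = next_state
        # unmatched events are ignored (stay in the current state)

def ignore():  print("IGNORE")
def restart(): print("RESTART component")

clhm = ComponentHealthManager("Idle", {
    ("Idle", "start"):                  ("Exec", ignore),
    ("Exec", "finish"):                 ("Idle", ignore),
    ("Exec", "deadline_violation"):     ("Idle", restart),
    ("Exec", "precondition_violation"): ("Idle", ignore),
})
clhm.dispatch("start")
clhm.dispatch("deadline_violation")     # -> RESTART component, back to Idle
```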
72. Component Health Management
Available Actions
[Diagram: CLHM architecture — the Component Health Manager runs as a high-priority ARINC-653 process; component ports (ARINC-653 processes) report errors via a buffered blackboard with blocking reads, and the CLHM replies with an HM response/action; the manager cycles NOMINAL → ERROR → CHECK RESULT, returning to NOMINAL on ‘action successful’ and entering FAILURE on timeout or failed action]
73. Assembly Definition
Validity(GPS.data_in)<4ms
Delta(Nav.data_in.time)>0
Rate(gps_data_src.data)>1
Specified Monitoring Conditions
The Sensor component is triggered periodically and it publishes an event upon each
activation.
The GPS component subscribes to this event and is triggered periodically to obtain GPS
data from the receiver. It publishes its own output event.
The Nav Display component is triggered sporadically via this event and it uses a required interface to retrieve the position data from the GPS component.
74. System-level Health Management
Focus issue: Cascading faults
Hypothesis: Fault effects cascade via component interactions
Anomalies detected at the component level are not ‘diagnosed’: they can be caused by other components
Problem:
How to model fault cascades?
How to diagnose and isolate fault cascade root causes?
How to mitigate fault cascades?
76. Recap: Fault diagnosis
Fault diagnosis algorithm:
• Outline:
– Check if new evidence is explained
by current hypotheses.
– If not, create a new hypothesis that
assumes a hypothetical state of the
system consistent with observations
– Rank hypotheses for plausibility and
robustness metrics
– Discard low-rank hypotheses, keep plausible ones
Fault state: ‘total state vector’ of the system, i.e. all failure modes and discrepancies
Alarms could be
Missing: should have fired but did not
Inconsistent: fired, but it is not consistent
with the hypothesis
Robust diagnostics: tolerates missing and
inconsistent alarms
Metrics:
Plausibility: how plausible the hypothesis is w.r.t. alarm consistency
Robustness: how likely it is that the hypothesis will change in the future
77. Modeling Cascading Faults
Not needed - the cascades can be computed from the
component assemblies, if the anomaly types and their
interactions are known.
Component ‘elements’: every method belongs to one of these (7 element types).
Fault cascades within a component follow patterns (a few of the 38 patterns are shown on the slide).
78. Modeling Cascading Faults
Inter-component propagation is regular – always follows the
same pattern
Intra-component propagation depends on the component!
Need to model internal dataflow and control flow of the
component.
Note: Could be determined via source code analysis.
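One way such a propagation graph could be derived is sketched below: inter-component edges follow the assembly's port connections, while intra-component edges come from each component's declared internal dataflow. The data shapes are illustrative assumptions, not the tool's model format.

```python
# Sketch: derive fault propagation edges from an assembly description.
from typing import Dict, List, Tuple

# Each component declares internal dataflow: input port -> output port(s).
internal_flow: Dict[str, Dict[str, List[str]]] = {
    "Sensor":  {"trigger": ["data_out"]},
    "GPS":     {"data_in": ["gps_out"]},
    "Display": {"gps_in": []},          # sink: no outgoing flow
}

# Assembly connections: (component, output port) -> (component, input port).
connections: List[Tuple[Tuple[str, str], Tuple[str, str]]] = [
    (("Sensor", "data_out"), ("GPS", "data_in")),
    (("GPS", "gps_out"),     ("Display", "gps_in")),
]

def propagation_edges():
    edges = []
    # Intra-component propagation (depends on each component's dataflow).
    for comp, flows in internal_flow.items():
        for src_port, dst_ports in flows.items():
            for dst_port in dst_ports:
                edges.append(((comp, src_port), (comp, dst_port)))
    # Inter-component propagation (always follows the port connection).
    for src, dst in connections:
        edges.append((src, dst))
    return edges

for edge in propagation_edges():
    print(edge)
```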
79. Modeling Cascading Faults
Fault Propagation Graph for GPS Example
Here: hand-crafted, but it is generated automatically in the
system
80. System-level Fault Mitigation
Model-based system-level mitigation engine
Model-based diagnoser is automatically generated
Designer specifies fault mitigation
strategies using a reactive state machine
Advantages:
• Models are higher-level programs to specify (potentially complex) behavior – more readable and comprehensible
• Models lend themselves to formal analysis – e.g. model checking
[Diagram: managed components with component-level health managers (CHM) on the component platform report discrepancies (D) to the Diagnoser Engine, which uses the component fault models to isolate failure modes (FM) and drives the Mitigation Engine]
81. System-level Fault
Mitigation
Model-based mitigation specification at
two levels
Component level: quick action
System level: Reactive action taking the
system state into consideration
System designer specifies them as a
parallel timed state machine.
Fixed set of mitigation actions are
available
Runtime code is generated from
models
Advantages:
Models are higher-level programs to
specify (potentially complex) behavior –
more readable and comprehensible
Models lend themselves to formal
analysis – e.g. model checking
List of predefined Mitigation Actions (HM Action → Semantics):
• CLHM: IGNORE – Continue as if nothing has happened
• CLHM: ABORT – Discontinue the current operation, but the operation can run again
• CLHM: USE PAST DATA – Use the most recent data (only for operations that expect fresh data)
• CLHM: STOP – Discontinue the current operation; aperiodic processes (ports): the operation can run again; periodic processes (ports): the operation must be enabled by a future START HM action
• CLHM: START – Re-enable a STOP-ped periodic operation
• CLHM: RESTART – A macro for STOP followed by a START for the current operation
• SLHM: RESET – Stop all operations, initialize the state of the component, start all periodic operations
• SLHM: STOP – Stop all operations
[Diagram: alarms flow from components to the Diagnoser Engine, which feeds the Mitigation Engine]
82. System-level Health Management
Functional components
1. Aggregator:
Integrates (collates) health information coming
from components (typically in one hyperperiod)
2. Diagnoser:
Performs fault diagnosis, based on the fault
propagation graph model
Ranks hypotheses
Component that appears in all hypotheses with
the highest rank is chosen for mitigation
3. Response Engine:
Issues mitigation actions to components based
on diagnosis results
Based on a state machine model that maps
diagnostic results to mitigation actions
These components are generated
automatically from the models
The Health Management Approach:
1. Locally detected anomalies are mitigated
locally first. – Quick reactive response.
2. Anomalies and local mitigation actions are
reported to the system level.
3. Aggregated reports are subjected to
diagnosis, potentially followed by a system-level
mitigation action.
4. System-level response commands are
propagated down to components.
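The three-stage pipeline could be approximated as in the following sketch: aggregate reports from one hyperperiod, pick the component implicated by all top-ranked hypotheses, and map it to a mitigation command. The function names and the RESET/STOP policy are assumptions for illustration, not the generated components.

```python
# Illustrative aggregate -> diagnose -> respond pipeline.
from typing import Dict, List, Set

def aggregate(reports: List[dict]) -> List[dict]:
    """Collate health reports received within one hyperperiod."""
    return [r for r in reports if r.get("kind") in ("alarm", "local_mitigation")]

def implicated_component(hypotheses: List[dict]) -> str:
    """Component appearing in every highest-ranked hypothesis."""
    top_rank = max(h["rank"] for h in hypotheses)
    top = [set(h["components"]) for h in hypotheses if h["rank"] == top_rank]
    common: Set[str] = set.intersection(*top)
    return sorted(common)[0] if common else ""

def respond(component: str, policy: Dict[str, str]) -> str:
    """Map the diagnosed component to an SLHM command (e.g. RESET or STOP)."""
    return f"SLHM:{policy.get(component, 'RESET')} {component}"

reports = [{"kind": "alarm", "component": "Accel5"},
           {"kind": "local_mitigation", "component": "Accel5"}]
hypotheses = [{"rank": 3, "components": ["Accel5"]},
              {"rank": 3, "components": ["Accel5", "Accel6"]},
              {"rank": 1, "components": ["Voter1"]}]
print(len(aggregate(reports)),
      respond(implicated_component(hypotheses), {"Accel5": "STOP"}))
```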
83. Example:
2005 Malaysian Air Boeing 777 in-flight upset
Low airspeed advisory.
Airplane’s autopilot experienced excessive acceleration values.
Vertical acceleration decreased to -2.3g within ½ second
Lateral acceleration decreased to -1.01g (left) within ½ second
Longitudinal acceleration increases to +1.2 g within ½ second
Autopilot pitched nose-up to 17.6 degrees and climbed at a vertical speed of 10,650 fpm.
Airspeed reduced to 241 knots.
Stick shaker activated at top of the climb.
Aircraft descended 4,000 ft.
Re-engagement of autopilot followed by another climb of 2,000 ft.
Maximum rate of climb = 4440 fpm.
84. B777 ADIRU Architecture
• Designed to be serviceable with
one fault in each FCA
• Can fly but maintenance
required upon landing with two
faults in each FCA
• Each ARINC 629 end unit voted
on the processor data bit-by-bit.
• Processors monitor the ARINC
629 modules by full data wrap-around
• Processors also monitor the
power supplies, any one of
which can power the entire unit
• Accelerometer and gyro in
skewed redundant configuration
• A secondary unit (SAARU) also
provided inertial data
Based on Air Data Inertial Reference Unit (ADIRU)
Architecture (ATSB, 2007, p.5)
85. Cause of Inflight Upset
June 2001: accelerometer 5 fails with high output value, ADIRU disregards it.
A power cycle on the ADIRU occurs. A latent software bug causes the previously recorded faulty status
of accelerometer 5 to be ignored.
Status of failed unit was recorded on-board maintenance memory, but that memory was
not checked by the software.
An inflight fault was recorded in accelerometer 6 and it was disregarded.
FDI software allowed use of accelerometer 5.
High acceleration value was passed to all computers.
Due to common-mode nature of fault, voters allowed high accelerometer data to
go on all channels.
This high value was used by primary flight computer.
The mid-value select function used by the flight computer lessened the effect of the pitch motion.
Problem: The system relied on redundancy to mask a fault. But due to a latent software
bug and a common-mode fault, the effect cascaded into a system failure.
Reading Material: The dangers of failure masking in fault-tolerant software: aspects of a recent in-flight upset event
C.W. Johnson and C.M. Holloway, IET Conf. Pub. 2007, 60 (2007), DOI:10.1049/cp:20070442
86. Case Study
• Modeled the architecture as a
software component assembly
• Created the fault scenario
• Only modeled part of the system
to illustrate the point of SHM
• Accelerometers are arranged on six faces of a dodecahedron; their outputs are used in the regression equations.
93. System Health Manager
Other state machines have similar specifications.
These components are auto-generated.
The hypothesis generated by the diagnoser is translated into the component(s) most likely to be
faulty. This list is fed to the Response Engine, which triggers the mitigation state machine.
94. Demonstration
Fault Scenario
Accelerometer 5 has an initial fault
It is started, which causes an alarm
Then Accelerometer 6 develops a fault
Successful mitigation
Identifying the faulty components
Stopping the faulty components
The processors can still function with four accelerometers.
97. Resilience and autonomy
Model-based Software Health Management
Requires explicit specification of component-level and system-level
health management (recovery) actions
Complex and error-prone… too many options!
Resilient systems should recover autonomously
Concepts:
Model the system architecture + functions.
Express what is needed from the system to implement
functions.
Embed models into the run-time system
Use a reasoner to figure out how to recover function upon
failures
98. Modeling
Functional Requirements for IMU
Inertial Position
• Determine inertial position.
• Functional requirement (AND): GPS Position, Position Tracking
GPS Position
• Sense GPS position for computing Inertial Position.
Position Tracking
• Continuously track position to compute Inertial Position.
• Functional requirement: Body Acceleration Measurement
Body Acceleration Measurement
• Sense body acceleration for Position Tracking.
[Diagram: requirement tree — Inertial Position depends on GPS Position and Position Tracking; Position Tracking depends on Body Acceleration Measurement]
100. Modeling the Architecture
Function Allocation
Body Acceleration
Measurement
EXACTLY ONE (Primary /Secondary ADIRU
Subsystem)
ADIRU Subsystem has
• Accelerometers (6)
• ADIRU Computers (4)
• Voters (3)
Functional / Operational ADIRU Subsystem
requires
• ATLEAST 4 of 6 Accelerometers
• ATLEAST 2 of 4 Filters or ADIRU
computers
• ATLEAST 1 of 3 Voter
Inside one ADIRU:
101. Modeling the Architecture
Function Allocation
GPS Position
EXACTLY ONE (Primary/Secondary
GPS Subsystem)
GPS Subsystem includes
GPS Receiver (1)
GPS Processor (1)
Functional / Operational GPS
subsystem requires
EXACTLY ONE of GPS Receiver
EXACTLY ONE of GPS Processor
Inside one GPS Subsystem:
102. Modeling the Architecture
Function Allocation
POSITION TRACKING
ATLEAST ONE OF ( LEFT/ CENTER/
RIGHT PFC NavFilter Subsystem)
PFC NavFilter Subsystem includes
PFC Nav Filter (1)
PFC Processor (1)
Functional/ Operational Requirement
for PFC Subsystem
EXACTLY ONE PFC NavFilter
EXACTLY ONE PFC Processor
Inside one PFC Subsystem:
103. Component Operational Requirement
EXPLICIT – Local dependency
Display Subsystem
ATLEAST 1 of 3 Consumers (Left, Center, Right)
EXPLICIT – Local dependency
ADIRU Computer inside ADIRU Subsystem
ATLEAST 4 of 6 Consumer Port
Implies
ATLEAST 4 of 6 Accelerometer Components
104. Component Operational Requirement
IMPLICIT – Inferred dependency
PFC NavFilter in PFC Subsystem
EXACTLY 1 of 1 Consumer Port AND
ATLEAST 1 of 1 Requires Port
Implies
EXACTLY 1 of 2 ADIRU Subsystems AND
ATLEAST 1 of 2 GPS Subsystem
105. Component Operational Requirement
IMPLICIT – Inferred dependency
PFC Processor inside PFC Subsystem
EXACTLY 1 of 1 Consumer Port
Implies
EXACTLY 1 of 1 PFC NavFilter
GPS Processor inside GPS Subsystem
EXACTLY 1 of 1 Consumer Port
Implies
EXACTLY 1 of 1 GPS Receiver
106. Modeling the problem:
Boolean SAT
Functional Requirements + Function allocation +
Component operational requirements + Component states
Encoded as Boolean (CNF) Expression for SATisfiability
problem
Solution: valid component architecture
Size: #Variables: 493/ #Clauses: 1776
Fault / Scenario | SAT-solver reconfiguration compute time (s) | Reconfiguration commands
Verifying initial state | 0.004228 | No commands; the initial state is accepted as satisfying/meeting the functional requirements.
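To make the encoding concrete, the sketch below shows how "at least k of n" and "exactly one" constraints can be turned into CNF clauses and checked for a toy version of the accelerometer scenario. The encoding and the brute-force check are illustrative assumptions; the actual tool uses a real SAT solver over the full 493-variable model.

```python
# Toy CNF encoding of architecture constraints plus a brute-force SAT check.
from itertools import combinations, product
from typing import List

Clause = List[int]   # positive int = variable true, negative = variable false

def at_least_k(variables: List[int], k: int) -> List[Clause]:
    """CNF for 'at least k of these variables are true': every choice of
    n-k+1 variables must contain at least one true literal."""
    n = len(variables)
    return [list(subset) for subset in combinations(variables, n - k + 1)]

def exactly_one(variables: List[int]) -> List[Clause]:
    """CNF for 'exactly one is true': at least one, and no two together."""
    clauses = [list(variables)]
    clauses += [[-a, -b] for a, b in combinations(variables, 2)]
    return clauses

# Variables 1..6: accelerometers healthy; 7..8: primary/secondary ADIRU selected.
accels, adirus = [1, 2, 3, 4, 5, 6], [7, 8]
cnf = at_least_k(accels, 4) + exactly_one(adirus)
# Inject the observed fault state: accelerometers 5 and 6 have failed.
cnf += [[-5], [-6]]

def satisfiable(cnf: List[Clause], n_vars: int) -> bool:
    """Tiny brute-force check (fine for a toy model, not for 493 variables)."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in cnf):
            return True
    return False

print(satisfiable(cnf, 8))            # True: the four remaining accelerometers
                                      # still satisfy 'at least 4 of 6'
print(satisfiable(cnf + [[-4]], 8))   # False: a third accelerometer failure
                                      # cannot be reconfigured around
```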