SERENE 2014 School: Luigi pomante serene2014_school

SERENE'14 Autumn School
ENGINEERING RESILIENT CYBER PHYSICAL SYSTEMS
System-Level Concurrent Error Detection
Dr. Luigi Pomante
Università degli Studi dell’Aquila
Center of Excellence DEWS
luigi.pomante@univaq.it

Introduction
Resilience
Reliability
Fault Tolerance
Concurrent
Error
Detection
System Level CED - 2 - © 2014 - Luigi Pomante

Introduction
Error detection is one of the basic features needed
to support reliability, and hence resilience, in CPS
So, this talk focuses on error detection issues in the cyber
part of a CPS
Such a part is normally a customized electronic digital system,
with an ad-hoc hw/sw architecture, typically embedded in a
more complex heterogeneous system that heavily interacts
with some physical processes

Introduction
Error Detection Methodologies
Off-line vs. Concurrent
System-Level Design Methodologies
System-Level Specification
Functional characterization of the system without dealing
with implementation aspects
Specification of implementation objectives and constraints
Timing, Power Consumption, Area
Estimation of the influence of different alternatives on the
final implementation
HW/SW system composition
Different processors and/or alternative technologies

Introduction
Typically, system resilience/reliability aspects are neglected
while dealing with the higher levels of the system synthesis process
They are postponed to lower abstraction levels, but the use of
resilience/reliability methodologies can significantly
impact timing, energy and area
It is necessary to transfer these aspects toward the upper
levels of the synthesis flow by adding the resilience/reliability
constraint to the classical cost parameters
This work investigates the problem of adopting design for
reliability/resilience approaches at system level, when all the
solutions are still open for the implementation of the device,
presenting a set of design methodologies to provide concurrent
error detection (CED) properties to the final implementation

Goal
Achieving the goal of this broad resilience/reliability
co-design project involves the following aspects
specification of systems in a co-design environment
supporting resilience/reliability constraints
design methodologies providing the desired CED properties
hw/sw system partitioning on the basis of metrics taking
into account both traditional co-design issues and
resilience/reliability constraints

Overview
Problem Definition
Target System Architecture
Fault Model
System Specification
Design Methodologies for Reliability
Design Analysis and Metrics
Hw/Sw System Partitioning
A Case Study: a Reliable Pacemaker

Problem Definition
A Section is a subset of the system specification
A Critical Section is a section where the CED
property is required
A Reliable Section is a critical section that
propagates either error free critical results or faulty
critical results associated with an error indication

Problem Definition
The underlying assumption refers to the fact that the input
data processed by the reliable section is error free
The upstream sections provide either correct data by
definition or they are designed to be reliable themselves
The downstream sections also need to be designed reliable or
no reliability constraint applies to them
In the former case reliability is extended to all downstream
elements, in the latter the property has a pure local effect

Problem Definition
To formally define these two different
characterizations, the following definitions are
introduced
Local Reliability
The Local Reliability property of a critical section specifies
that the reliability constraints involve only the related critical
section
Global Reliability
The Global Reliability property of a critical section specifies
that the reliability constraints involve the related sections
and recursively all the downstream sections
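
The two definitions can be illustrated with a minimal sketch. It assumes the specification is available as a directed graph that maps each section to the sections it feeds; the function name `reliable_sections` and the example graph are hypothetical and are not part of the original flow.

```python
from collections import deque


def reliable_sections(graph, critical, kind):
    """Return the sections covered by a reliability constraint.

    graph    -- dict mapping each section to its downstream sections
    critical -- the section tagged as critical
    kind     -- "LOCAL" or "GLOBAL"
    """
    if kind == "LOCAL":
        # Local reliability: only the tagged section is involved.
        return {critical}
    if kind != "GLOBAL":
        raise ValueError("kind must be 'LOCAL' or 'GLOBAL'")
    # Global reliability: the tagged section plus, recursively,
    # every section reachable downstream of it.
    covered = {critical}
    queue = deque([critical])
    while queue:
        section = queue.popleft()
        for successor in graph.get(section, ()):
            if successor not in covered:
                covered.add(successor)
                queue.append(successor)
    return covered


# Hypothetical data-flow graph: A feeds B, B feeds C and D, D feeds E.
example_graph = {"A": ["B"], "B": ["C", "D"], "C": [], "D": ["E"], "E": []}

print(sorted(reliable_sections(example_graph, "B", "LOCAL")))
print(sorted(reliable_sections(example_graph, "B", "GLOBAL")))
```

In this sketch the upstream section A is never included, which mirrors the assumption that the input data of a reliable section are error free.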

Problem Definition
Local and Global Reliability Specification
[Figure: two section graphs with nodes labelled A to E, one annotated for local and one for global reliability on B]
Local reliability on B: the data provided to A are reliable
Global reliability on B: the data provided to A and B are reliable

Problem Definition
Two kinds of reliability are needed because a specification
may also include the description of the environment, which
needs no reliability property, or a set of functionalities of
which only one must be reliable
For example, a digital control system specification for a car could
include tachometer, temperature and ABS control: reliability
is needed only for the ABS
In order to specify which sections must be reliable and what
kind of reliability is desired, particular system-level
specification languages (or proper extensions to the existing
ones) are required

Two languages have been considered for system
specification: Occam II and SystemC
The first one has been selected since the TOSCA
environment (a Co-design environment for embedded
systems), used in our studies to verify the proposed
approaches, is based on it
The second language is becoming increasingly popular for
system level specification, thus making its adoption almost
a requirement when pursuing the integration of the
proposed approaches in a real design flow

Reliability constraints in Occam II
The language has been extended with the introduction of
statements for identifying critical sections to be added to
the standard constraint definition section
CS FROM label1 TO label2 IS LOCAL (GLOBAL)
INT a, b:
CHAN OF INT in, out:            -- declaration of communication channels
TAG A:                          -- tag definition
SEQ
  a := 0
  WHILE TRUE
    TAG B:
    SEQ
      a := a + 1
      out ! a
      TAG C:
      in ? b
      a := a + b
TAG D:
MAXDELAY FROM B TO C IS 10:     -- timing constraints
MAXRATE OF B IS 100:
CS FROM A TO D IS LOCAL:        -- reliability constraint

Reliability constraints in SystemC
The language allows an intervention at different
abstraction levels: module or process
While working at module level, reliability constraints are
imposed by extending the basic class using the inheritance
mechanisms
SC_MODULE_GCS, SC_MODULE_LCS
– A reliability constraint imposed on the module applies directly to
all processes included in the module itself
When moving to process level, macro mechanisms can be
adopted, by introducing additional macros for specifying
critical sections and the local/global reliability constraint
SC_GCS, SC_LCS

Target System Architecture
The reference architecture consists of the basic processor
block (either general purpose or DSP), which executes software
processes, main memory and a set of co-processors (ASIC or
FPGA) implementing hardware functionalities if required
Communication between hardware modules uses the available
bus, memory otherwise
[Figure: reference architecture with CPU, memory, I/O interface and co-processors]

Fault Model
The adopted fault model is represented by the Single
Functional Failure, where any number of physical faults
causes a functional module to perform incorrectly
The considered faults affect the hardware structure of the
system, undermining the behavior of the software too, but no software
failures are considered in this work
The modules that may fail are, thus, the main processor, the
co-processors, the main memory, the system bus and the
dedicated channels for hardware-hardware module
communication
Such a single failure model is based on a commonly adopted
hypothesis: module failure is detected before another module
fails

Design Methodologies
for Reliability
The resilience/reliability project has investigated design
methodologies for guaranteeing error detection capabilities
based on the adoption of redundancy strategies
Architectural and information redundancy
The methodologies that have been analyzed and developed can
be classified
On the basis of the functionality to be performed and controlled
Data Processing or Communication
On the partitions involved
HW or SW
On the CED techniques adopted for guaranteeing the reliability
properties

Design Methodologies
for Reliability
The design approach considers as the basic element any
functionality that the system must provide in a reliable way
Nominal (N)
Denotes such basic element
Checking (C)
Identifies the redundant functional elements designed to provide
error detection capabilities
Checker (CK)
Is the functional element that detects a mismatching behavior
between N and C due to failures
Each one of these three elements (N, C and CK) can be
independently implemented in hardware or in software,
leading to several classes of methodologies
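
The roles of the three elements can be sketched in software. This is only an illustrative model of the all-software case (N, C and CK as plain functions); the names `reliable`, `mean_nominal`, `mean_checking` and `mean_faulty` are hypothetical, and the injected fault is a simple bit flip chosen for the example.

```python
class ErrorDetected(Exception):
    """Raised by the checker when N and C disagree."""


def checker(nominal_result, checking_result):
    """CK: compare the two results and signal any mismatch."""
    if nominal_result != checking_result:
        raise ErrorDetected(
            f"mismatch: nominal={nominal_result}, checking={checking_result}"
        )
    return nominal_result


def reliable(nominal, checking):
    """Wrap a nominal element N with a checking element C and the checker CK."""
    def wrapped(*args):
        return checker(nominal(*args), checking(*args))
    return wrapped


# Hypothetical functionality: integer mean of a list of samples.
def mean_nominal(samples):
    return sum(samples) // len(samples)


def mean_checking(samples):
    # Redundant computation, written differently from the nominal one.
    total = 0
    for value in samples:
        total += value
    return total // len(samples)


def mean_faulty(samples):
    # Models a failure of the nominal element: lowest result bit flipped.
    return mean_nominal(samples) ^ 1


print(reliable(mean_nominal, mean_checking)([2, 4, 6]))
```

Replacing `mean_nominal` with `mean_faulty` makes the wrapped call raise `ErrorDetected`, so a faulty result never propagates without an error indication.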

Design Methodologies
for Reliability
Reliable Data Processing
[Figure: nominal architecture, checking architecture and checker, each implemented either in HW or in SW]
Solution  Nominal  Checker  Checking
1         SW       SW       SW
2         SW       HW       SW
3         SW       SW       HW
4         SW       HW       HW
5         HW       SW       SW
6         HW       HW       SW
7         HW       SW       HW
8         HW       HW       HW

Design Methodologies
for Reliability
Class 1: SW Nominal, SW Checker, and SW Checking
Self-Checking SW
Assertions
Dual-Processor Checking
VLIW Checking
Class 2: SW Nominal, HW Checker, and SW Checking
Interface for Functional Redundancy Check
DMA Checker
VLIW Checking with HW Checker

Design Methodologies
for Reliability
Class 4: SW Nominal, HW Checker, and HW Checking
Dynamically Re-Configurable Checker
Class 8: HW Nominal, HW Checker, and HW Checking
Device Duplication
TSC Scheduling
TSC Devices

Design Methodologies
for Reliability
Reliable Communications
It is necessary to guarantee that any fault on
communication lines is detected
Either hardware redundancy (lines duplication) or
information redundancy (data encoding) can be adopted
Two possibilities should be considered
Communications between procedures implemented in HW
Other kinds of communication
– SW-SW, SW-HW, HW-SW

Design Methodologies
for Reliability
Communications between procedures implemented in HW
A pair of HW sections communicates by means of dedicated
lines
– Line Duplication vs. Data Encoding
Other kinds of communication
When the communication involves a SW section then it makes
use of the system bus
– The only viable solution is the use of error detection codes
– The best results are obtained by keeping the data in memory in
coded form and letting the CPU work only with non-coded data
» HW TSC Encoder/Decoder/Checker for the processor and
one (or more) for the HW devices
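
The idea of keeping coded data in memory while the CPU works on non-coded data can be sketched as follows. The slides do not name a specific code, so a single even-parity bit on 8-bit words is assumed here purely for illustration; `encode`, `decode` and `CodeError` are hypothetical names, and the hardware TSC encoder/decoder/checker is only modelled as two functions.

```python
class CodeError(Exception):
    """Raised when a codeword read from memory fails the parity check."""


def encode(word):
    """Encoder: append an even-parity bit to an 8-bit data word."""
    if not 0 <= word < 256:
        raise ValueError("word must fit in 8 bits")
    parity = bin(word).count("1") & 1
    return (word << 1) | parity


def decode(codeword):
    """Decoder/checker: verify parity, then return the non-coded data word."""
    if bin(codeword).count("1") & 1:
        raise CodeError("parity error detected")
    return codeword >> 1


memory = {}                      # memory holds coded data only
memory[0x10] = encode(0xB2)      # write path: encode before storing
value = decode(memory[0x10])     # read path: check and decode for the CPU
print(hex(value))
```

A single parity bit detects any odd number of flipped bits in a codeword; a stronger code would be needed for other error patterns.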

Design Methodologies
for Reliability
Architecture with reliable communications
[Figure: block diagram with CPU, memory (coded data), I/O interface and co-processors, three TSC EDCK blocks and one TSC CK block]

Design Analysis
and Metrics
All the methodologies have been analyzed in detail
in order to highlight the main design issues
and to evaluate benefits and costs
The design issues have been analyzed qualitatively
according to a reference schema in order to quickly show
the main differences between different approaches
Benefits and costs have been analyzed defining a set of
significant parameters, constituting the basic elements
needed to build metrics useful to compare the quality of
different solutions, metrics that play an important role in
the partitioning step

Design Analysis
and Metrics
Design issues reference schema: key concepts
Selection of number and typology of processing elements
Detection of the need for a special architecture
Analysis of synchronization issues between processing elements
Analysis for possible physical and logical resources sharing
Detection of modification needs of the original specification
Selection of the execution policies for each processing element
Allocation of the checker memory space
Selection of the checking policies
Analysis of the checker structure and complexity
Selection of a mechanism to enable the checker to raise exceptions
to report error detection

Design Analysis
and Metrics
Benefits and Costs
Let us define the Efficiency of a given methodology as its
characterization with respect to three factors
Coverage
– It is the percentage of functional faults that it is possible to
detect with respect to the complete fault set
Detection Latency (DL)
– It is the time between the instant a fault causes an error and the
instant the error is detected
Performance Degradation (PD)
– It is related to the overhead (i.e., additional execution time)
caused by fault detection tasks with respect to the original
system
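
The three efficiency factors can be written as small formulas. This is a sketch under assumptions: coverage is computed over explicit fault sets, and performance degradation is taken as the relative execution-time overhead, which is one possible reading of "related to the overhead". The function names are hypothetical.

```python
def coverage(detected_faults, all_faults):
    """Percentage of the complete fault set that is detected."""
    all_faults = set(all_faults)
    return 100.0 * len(set(detected_faults) & all_faults) / len(all_faults)


def detection_latency(t_error, t_detected):
    """Time between the instant a fault causes an error and its detection."""
    return t_detected - t_error


def performance_degradation(t_original, t_with_ced):
    """Relative execution-time overhead due to the fault detection tasks."""
    return (t_with_ced - t_original) / t_original


print(coverage({"f1", "f2"}, {"f1", "f2", "f3", "f4"}))   # percent
print(detection_latency(10, 25))                          # time units
print(performance_degradation(100.0, 125.0))              # fraction
```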

Design Analysis
and Metrics
Benefits and Costs
Let us define the Cost of a given solution as the overhead
with respect to the original system
Physical cost (Cp)
– It represents the cost of the physical components added to the
original architecture
Design Cost (Cd)
– It represents the effort needed to design and implement a given
solution

Once the system, the constraints, and the set of possible
design solutions are specified, the partitioning step selects the
implementation of each task, either hardware or software
The achieved solution is checked against the designer's
constraints and, if they are met, the solution is accepted,
otherwise a backtrack is performed and another allocation
solution is pursued
This process is extremely complex and time consuming, due to
the large number of possible alternatives and to the fact that,
although heuristics and tuned estimation functions have been
defined, it is the final co-simulation of the suggested system
implementation that confirms it to be a solution or not
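
The accept-or-backtrack loop described above can be sketched as a depth-first search over hardware/software allocations. This is a toy model, not the TOSCA partitioner: the final co-simulation is replaced by additive estimates, tasks are assumed to run sequentially, and the task tuples, `deadline` and `area_budget` are hypothetical inputs.

```python
def partition(tasks, deadline, area_budget):
    """Search HW/SW allocations with backtracking.

    tasks -- list of (name, sw_time, hw_time, hw_area) estimates
    Returns the first allocation meeting both constraints, or None.
    """
    allocation = {}

    def search(index, time_used, area_used):
        # Estimates only grow, so a violated constraint cannot recover.
        if time_used > deadline or area_used > area_budget:
            return False
        if index == len(tasks):
            return True
        name, sw_time, hw_time, hw_area = tasks[index]
        # Try software first (no extra area), then hardware.
        for side, dt, da in (("SW", sw_time, 0), ("HW", hw_time, hw_area)):
            allocation[name] = side
            if search(index + 1, time_used + dt, area_used + da):
                return True
        # Neither choice works here: undo and backtrack.
        del allocation[name]
        return False

    return allocation if search(0, 0, 0) else None


tasks = [("t1", 6, 2, 5), ("t2", 5, 1, 4), ("t3", 4, 2, 6)]
print(partition(tasks, deadline=11, area_budget=6))
print(partition(tasks, deadline=3, area_budget=6))
```

The exhaustive search is exponential in the number of tasks, which is consistent with the remark that the process is complex and time consuming.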

The reliability aspects add a significant number of parameters
to the partitioning step for the selection of the final
implementation, making this task too complex
In order to cope with the complexity of the partitioning step
when reliability goals are also included, a two-level approach
is here proposed
A first partitioning is performed which takes into account only the
classical aspects and cost functions, meeting the usually stringent
time constraints
Given the first assessed solution, a second-level partitioning
considers the additional reliability constraints, analyzes the
possible approaches, within the set of defined methodologies
which fulfill them, and provides the solution that has the best
tradeoff (if it exists)

[Figure: two-level reliability co-design flow. The specification, tagged with timing, power, area, cost and reliability constraints, feeds a first partitioning step that yields an initial solution (architecture: HW, SW, INT, O.S.). If reliability is required, the sections tagged for reliability go through a reliability partitioning step driven by a reliability model (strength: hard/soft; parameters: fault coverage, detection latency, area overhead, performance degradation), followed by optimization, until a specific solution with fault detection is passed to HW/SW synthesis.]

The second-level partitioning problem consists of both
Reliability Model Identification
Defining a criterion for the identification of the relation
between the constrained procedure and the most suitable CED
method
Optimization
Optimizing the result produced by the assignment criteria
with respect to the global solution

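For each approach, an exact evaluation or a qualitative
estimate of each considered parameter is identified
Methodology  Fault Coverage  Detection Latency  Performance Degradation  Area Overhead
SCS          min/med/max     med/max            med/max                  med/max
A            min/med/max     min/med            med/max                  med/max
DP           100%            med/max            min/med                  med/max
VLIWS        100%            0                  med/max                  min
IFRC         100%            0                  0                        max
DMAC         100%            med/max            med/max                  max
VLIWH        100%            0                  0                        max
DCC          100%            med                med                      max
D            100%            0                  0                        max
TSCS         100%            med/max            med/max                  med/max
TSCD         100%            0                  0                        min/med
The rows follow the order of the methodology classes listed earlier: SCS (Self-Checking SW), A (Assertions), DP (Dual-Processor Checking), VLIWS (VLIW Checking), IFRC (Interface for Functional Redundancy Check), DMAC (DMA Checker), VLIWH (VLIW Checking with HW Checker), DCC (Dynamically Re-Configurable Checker), D (Device Duplication), TSCS (TSC Scheduling), TSCD (TSC Devices)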

A crisp tag (100% fault coverage, 0 detection latency, etc.)
represents a hard system constraint that has to be
enforced at any cost
A fuzzy tag (i.e. min, med, max) represents a soft system
requirement that is a design directive of the required
effort for the identification of anomalies during the device
operational time
Note that, for soft requirements, a maximum requirement
includes methodologies belonging to the medium or minimum
partitions; and a medium requirement includes minimum
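
The inclusion rule for soft requirements can be sketched as an ordering on the fuzzy levels. The mapping `LEVELS` and the function `satisfies_soft` are hypothetical names; how a methodology rated with a range (for example min/med) should be treated is not stated in the slides, so only single levels are handled here.

```python
# Ordering of the fuzzy tags: a higher requirement includes the lower ones.
LEVELS = {"min": 0, "med": 1, "max": 2}


def satisfies_soft(methodology_level, requirement):
    """True if a methodology in partition `methodology_level`
    is admitted by the soft requirement `requirement`."""
    return LEVELS[methodology_level] <= LEVELS[requirement]


print(satisfies_soft("min", "max"))
print(satisfies_soft("max", "med"))
```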

Crisp tags force a partition on the methodologies set
In particular, 100% fault coverage induces the partitions
hard_fc and soft_fc, 0 detection latency induces the
partitions hard_dl and soft_dl, while 0 performance
degradation induces the partitions hard_pd and soft_pd
Since the applicability of a methodology to a specific
procedure depends on its hardware/software
characteristic, a further partition is induced

By analyzing the properties of the methodologies, the
following partitions are identified:
swfc = { {IFRC, DP, DMAC, DCC, VLIWH, VLIWS} ; {A, SCS} }
hwfc = { {TSCS, TSCD, D} ; {} }
swdl = { {IFRC, VLIWH, VLIWS} ; {DP, DMAC, DCC, A, SCS} }
hwdl = { {D, TSCD} ; {TSCS} }
swpd = { {IFRC, VLIWH} ; {DMAC, DP, DCC, VLIWS, A, SCS} }
hwpd = { {D, TSCD} ; {TSCS} }

The second level partitioning takes into account the hard
parameters first for selecting suitable CED techniques, and
uses the soft parameters for selecting among them
More precisely, for each critical procedure, on the basis of its
allocation in hardware or in software, the partitions
fulfilling the hard/soft requirements are selected, and the
intersection between them provides the set of suitable CED
techniques
The partitioning thus proceeds with the next critical
procedure and moves toward the end of this local CED
allocation analysis. At the end, all procedures are associated
with a set of admissible CED implementations
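
The selection by intersection can be sketched with the partitions listed earlier (swfc, hwfc, swdl, hwdl, swpd, hwpd). The sketch assumes that a parameter without a hard constraint admits both its hard and its soft partition, and it does not filter by soft level; `PARTITIONS` and `admissible_ced` are hypothetical names.

```python
# (allocation, parameter) -> (hard partition, soft partition)
PARTITIONS = {
    ("SW", "fc"): ({"IFRC", "DP", "DMAC", "DCC", "VLIWH", "VLIWS"}, {"A", "SCS"}),
    ("HW", "fc"): ({"TSCS", "TSCD", "D"}, set()),
    ("SW", "dl"): ({"IFRC", "VLIWH", "VLIWS"}, {"DP", "DMAC", "DCC", "A", "SCS"}),
    ("HW", "dl"): ({"D", "TSCD"}, {"TSCS"}),
    ("SW", "pd"): ({"IFRC", "VLIWH"}, {"DMAC", "DP", "DCC", "VLIWS", "A", "SCS"}),
    ("HW", "pd"): ({"D", "TSCD"}, {"TSCS"}),
}


def admissible_ced(allocation, hard):
    """Return the CED techniques suitable for one critical procedure.

    allocation -- "HW" or "SW", from the first-level partitioning
    hard       -- set of parameters ("fc", "dl", "pd") with a crisp tag
    """
    result = None
    for param in ("fc", "dl", "pd"):
        hard_set, soft_set = PARTITIONS[(allocation, param)]
        # Assumption: a soft parameter does not exclude any technique here.
        candidates = hard_set if param in hard else hard_set | soft_set
        result = set(candidates) if result is None else result & candidates
    return result


print(sorted(admissible_ced("SW", {"fc", "dl"})))
print(sorted(admissible_ced("HW", {"fc", "dl", "pd"})))
```

Running the function once per critical procedure yields the per-procedure sets of admissible CED implementations on which the optimization step then works.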

Optimization
The global solution determining for each procedure the CED
technique actually adopted is pursued by means of a
process of solution extraction and simulation, to verify that
the constraints of the first partitioning are still met
This process takes into account the fact that there are
techniques with a global effect (such as IFRC, DP), which
prevail over those with a local impact (A, SCS)
As an optimization policy, the final solution does not
include overlapped methods in order to achieve a
significant efficiency

A Case Study:
a Reliable Pacemaker
The goal of this case study is to co-design a reliable pacemaker
able to detect any anomalies in its behavior due to physical
faults in its components
In order to obtain this goal, by starting from system-level
specification and following a reliable co-design flow, the
design space is explored, identifying an optimal partitioning
between hardware and software, validated through system-level
co-simulation
Hence, by taking into account the reliability requirements, the
proper CED methodologies able to meet all the constraints are
selected and then the one with the best cost-benefit tradeoff
is identified and adopted for the final design

A Case Study:
Behavioral analysis
[Figure: electrocardiographic diagram showing the relevant timing parameters (LRL, PVARP, AEIr, BP, AVIr, CSW, AVI)]
Typical values for each interval
Time Interval  Min-Max (ms)
PVARP          300-400
AEIr           0-400
BP             25
CSW            75
AVIr           100

A Case Study:
State Diagram
[Figure: pacemaker state diagram with states Start, BP, PVARP, AEIr, AVIr, AVIrp and CSW. Transition labels (event / actions):
time_out / reset_timer, set_AEIr_timer
Natural V / reset_timer, set_PVARP_timer
Natural A / reset_timer, set_BP_timer
time_out / Stimulated A, reset_timer, set_BP_timer
time_out / set_CSW_timer
time_out / reset_timer, set_AVIr_timer
Natural V / set_AVIrp_timer
time_out / Stimulated V, reset_timer, set_PVARP_timer
time_out / Stimulated V
Natural V (two further transitions)]

A Case Study:
Timing Constraints
Timing bounds for the intervals
State  Min-Max (ms)
PVARP  300-400
AEIr   300-800
BP     325-825
CSW    400-900
AVIr   500-1000
Other Constraints
The other constraints to be considered in the first-level
partitioning step are the classical ones: power dissipation,
area and cost
They must be kept as low as possible
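
A check of a measured time against the bounds above can be sketched as follows. The table values are taken from the slide; the function `check_timing`, the idea of classifying one measured value per state, and the "Min" label are assumptions made for the example ("OK" and "Max" are the labels that appear in the test results).

```python
# Min-Max bounds in milliseconds, as listed in the timing constraints.
TIMING_BOUNDS_MS = {
    "PVARP": (300, 400),
    "AEIr": (300, 800),
    "BP": (325, 825),
    "CSW": (400, 900),
    "AVIr": (500, 1000),
}


def check_timing(state, elapsed_ms):
    """Classify a measured time for `state` against its bounds."""
    low, high = TIMING_BOUNDS_MS[state]
    if elapsed_ms < low:
        return "Min"    # lower bound violated (label assumed)
    if elapsed_ms > high:
        return "Max"    # upper bound violated
    return "OK"


print(check_timing("PVARP", 350))
print(check_timing("AEIr", 850))
```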

A Case Study:
Reliability Constraints
Considering the criticality of the system for human
safety, a hard reliability constraint is imposed on the whole system
More in detail
100% fault coverage is required
Performance degradation is allowed as long as timing constraints
are still met
Detection latency and area overhead must be kept as low as
possible

A Case Study:
System Level Specification: the Environment
[Figure: block diagram of the environment, with blocks Main, Heart System, Test bench, RTS[0] and RTS[1]; legend: Channels, Calls]
The heart ... inside

A Case Study:
System Level Specification: the System
[Figure: block diagram of the system, with blocks Pacemaker, PVARP, AEIr, AVIr and Timeout[0] to Timeout[4]; legend: Channels, Calls]

A Case Study:
Timing and Reliability Requirements Specification
PROC Pacemaker( CHAN OF BIT R; CHAN OF BIT V; CHAN OF BIT P;
                CHAN OF BIT A; CHAN OF BIT inh_R; CHAN OF BIT inh_P )
  BIT val:
  -- Main body
  SEQ
    R ? val
    WHILE (TRUE)
      SEQ
        TAG P1:
        PVARP[0]( R, V, P, A, inh_R, inh_P, val)
        TAG P2:
:
MINDELAY FROM P1 TO P2 IS 500 (MS):
MAXDELAY FROM P1 TO P2 IS 1000 (MS):
CS FROM P1 TO P2 IS GLOBAL:

A Case Study:
1st Level Partitioning
TOSCA
Embedded Ultra-Low Power Intel 486 GX
Genetic Algorithm
Communication Costs
Procedures allocation and test results (tests T1 to T6)
Pacemaker  PVARP  AEIr  AVI  Timeout[0]  [1]  [2]  [3]  [4]  Test results
SW         SW     SW    SW   SW          SW   SW   SW   SW   all OK
SW         SW     SW    SW   HW          HW   HW   HW   HW   OK, OK, then Max violations
SW         HW     HW    HW   SW          SW   SW   SW   SW   OK, then Max violations
HW         HW     HW    HW   HW          HW   HW   HW   HW   all OK
The Max violations reported for the mixed HW/SW solutions involve AEIr, AVI and PVARP
Selected Solution
All-in-sw implementation (E486 16 MHz)

A Case Study:
2nd Level Partitioning
Reliability Constraints
FC = 100%
PD = medium
DL = maximum
A = maximum
Partitions
FC 100%
– swfc = {hard_fc} = {IFRC, DP, DMAC, DCC, VLIWH, VLIWS}
PD medium
– swpd = {hard_pd; soft_pd}
= {{IFRC, VLIWH };{DMAC, DP, DCC, VLIWS, A, SCS}}
– swpd = {{IFRC, VLIWH };{DP}}
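
The intersection of the two partitions above can be worked out directly. The sets are copied from the slide; the sketch assumes that the DL and area requirements, both set to maximum, exclude no technique, and the variable names are hypothetical.

```python
# Hard partition for 100% fault coverage (software allocation).
swfc_hard = {"IFRC", "DP", "DMAC", "DCC", "VLIWH", "VLIWS"}

# Techniques admitted by "PD = medium": the hard partition plus DP.
swpd_medium = {"IFRC", "VLIWH"} | {"DP"}

# Suitable CED techniques: intersection of the selected partitions.
potential_solutions = swfc_hard & swpd_medium
print(sorted(potential_solutions))
```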

A Case Study:
2nd Level Partitioning
Potential Solutions
{IFRC, DP, VLIWH}
Methodologies Comparison
IFRC and VLIWH don’t affect system behavior
DP requires co-simulation (Nominal, Checking, Checker)
Test results
T1  T2  T3        T4              T5  T6
OK  OK  Max AEIr  Max AVI, PVARP  OK  Max AEIr, PVARP
– The timing constraints aren’t met: the solution is discarded

A Case Study:
Selected Solution
The feasible solutions are IFRC and VLIWH
These alternatives are characterized by the same area
overhead and detection latency, so they are equivalent
The designer, considering the particular aspects related to
other steps of the co-design flow, can make the final choice
For example, the IFRC is applicable independently of the
number of reliable procedures, while VLIWH requires a specific
software synthesis step for each reliable procedure
– The first solution has thus a cost that is independent of the
number of critical sections, which is not true for VLIWH solutions
– Since in the present case study all the system procedures are
made reliable, the first architectural solution requires a lower
effort and design cost and may be preferable

A Case Study:
Selected Solution
The final architectural solution for the reliable pacemaker
[Figure: final architecture with CPU, CPU_chk, memory, bus interface and checker, and I/O interface]
The selected solution doesn't allow any significant back
annotation to the first level partitioning, since the initial
hw/sw partitioning achieved an acceptable all-in-software
solution, loading all tasks efficiently on one processor

Conclusions
The resilience/reliability co-design project aims at
integrating in a standard co-design flow the
elements for achieving a final system able to
autonomously detect the occurrence of faults during
the operational life of the system
The entire flow has been presented in this work,
discussing the key elements of the proposed
framework
Specification
System Partitioning

Conclusions
Language specification extensions have been
defined to specify reliability requirements
A set of possible hw/sw architectural design
methodologies has been analyzed considering the
possibilities to implement any part of the complete
system (nominal, checking and checker) either in
hardware or in software
A metric has been introduced taking into account
the peculiar elements of reliability properties

Conclusions
A two-level hw/sw partitioning process has been
defined, acting initially as a traditional approach to
determine a valid solution, while the second step
explores the alternatives taking into account the
fault detection properties
A case study shows the results of our work
Further research efforts are directed toward the
tuning of metrics with respect to the selected suite
of design methodologies, to better support the
partitioning step

