2. A fault-tolerant system is one that is able to
continue operating despite the failure of a limited
subset of its hardware or software.
Such a system degrades gracefully, i.e. as the size of the
faulty set increases, the system does not collapse
suddenly but continues executing part of its
workload.
The goal of this design is to ensure that the
probability of system failure is acceptably small.
3. FAULT TYPES
Hardware Fault: A hardware fault is some physical
defect that can cause a component to malfunction.
E.g. A broken wire or the output of a logic gate
that is perpetually stuck at some logic value(0 or 1).
Software Fault: A software fault is a bug that can
cause the program to fail for a given set of inputs.
4. ERROR
An error is the manifestation of a fault.
E.g. a broken wire will cause an error if
the system tries to propagate a signal
through it.
A program with a fault that induces
incorrect output for some set of inputs will
generate errors if that set of inputs is
applied.
5. FAULT LATENCY
The fault latency is the duration between
the onset of a fault and its manifestation as
an error.
Since faults themselves are invisible to
the outside world and show themselves only
when they cause errors, this latency
affects the reliability of the overall system.
6. ERROR RECOVERY
It is the process by which the system attempts to
recover from the effects of an error.
TYPES OF ERROR RECOVERY
Forward Error Recovery: In this type the error is
masked without any computations having to be
redone.
Backward Error Recovery: In this type the system is
rolled back to a moment in time before the error is
believed to have occurred, and the computation is carried
out again. It consumes additional time to mask the effects
of the failure.
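
Backward error recovery can be sketched as checkpoint, rollback, and redo. The following is only a minimal illustration under stated assumptions: the class `CheckpointedTask`, the helper `make_flaky_step`, the retry limit, and the use of `RuntimeError` as the "error detected" signal are all hypothetical names chosen for this example, not part of any particular system.

```python
import copy


class CheckpointedTask:
    """Toy backward error recovery: checkpoint the state, roll back on error, redo."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Save a copy of the state that is believed to be error-free.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Return to the last saved state, discarding everything done since.
        self.state = copy.deepcopy(self._checkpoint)

    def run_step(self, step, max_retries=3):
        """Run step(state); on a detected error, roll back and carry it out again."""
        for attempt in range(1, max_retries + 1):
            try:
                step(self.state)
                self.checkpoint()
                return attempt
            except RuntimeError:
                self.rollback()
        raise RuntimeError("step kept failing; the fault is probably not transient")


def make_flaky_step(failures):
    """Hypothetical step that corrupts the state and fails `failures` times."""
    remaining = {"n": failures}

    def step(state):
        state["total"] += 10
        if remaining["n"] > 0:
            remaining["n"] -= 1
            state["total"] = -999  # simulated corruption before detection
            raise RuntimeError("error detected")

    return step


task = CheckpointedTask({"total": 0})
attempts = task.run_step(make_flaky_step(failures=1))
print(attempts, task.state)
```

The second attempt is the extra time that backward recovery costs. Forward error recovery avoids this redo by masking the error instead.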
7. CAUSES FOR FAULTS
Errors in the specification or design.
Defects in the components.
Environmental effects.
8. Errors In The Specification Or Design
This type of fault arises from the communication
gap between the person who writes the
specification and the system designer.
The specification is the link between the design
process and the real-world application.
If the specification is wrong, everything that
proceeds from it is likely to be wrong.
9. Defects In Components
This type of fault arises from defects caused by the
wear and tear of use.
E.g. a MOSFET may fail due to electromigration,
which is the gradual drift of metal atoms in a
conductor over time under the flow of current.
10. Environmental Effects
This type of fault arises from the operating environment.
Devices can be subjected to a whole array of
stresses, depending on the application.
Poor ventilation or excessively high ambient
temperatures can melt components or damage
them.
E.g. if a computer is in a missile, it can undergo
high g-forces and vibrational stress.
11. FAULT TYPES
Faults are classified according to their temporal
behavior and output behavior.
A fault is said to be active when it is physically
capable of generating errors and to be benign when
it is not.
12. TEMPORAL BEHAVIOR CLASSIFICATION
Fault types: Permanent, intermittent, transient.
A permanent fault does not die away with time,
but remains until it is repaired or the affected unit is
replaced.
An intermittent fault cycles between the
fault-active and fault-benign states.
A transient fault dies away after some time.
13. Intermittent faults can be caused by loosely
connected components.
Transient faults can be caused by environmental
effects.
e.g. If there is a burst of electromagnetic
radiation and the memory is not properly shielded,
the contents of the memory can be altered without
the memory chips themselves suffering any
structural damage. When the memory is rewritten,
the fault will go away.
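
The memory example above can be imitated in a few lines. This is only a toy model: `MemoryWord`, its 8-bit width, and the `upset` method are hypothetical names invented for the illustration, and the bit flip stands in for the burst of radiation.

```python
class MemoryWord:
    """Hypothetical 8-bit memory word used only for this illustration."""

    def __init__(self, value):
        self.value = value & 0xFF

    def write(self, value):
        # Rewriting replaces whatever the upset left behind.
        self.value = value & 0xFF

    def read(self):
        return self.value

    def upset(self, bit):
        # Simulated radiation burst: one stored bit flips, the hardware stays intact.
        self.value ^= 1 << bit


intended = 0b1010_0101
word = MemoryWord(intended)

word.upset(bit=3)                       # the transient fault becomes active
error_seen = word.read() != intended    # it shows up as an error on the next read

word.write(intended)                    # rewriting the memory clears the fault
recovered = word.read() == intended

print(error_seen, recovered)
```

A permanent fault would differ in that the wrong value would come back even after the rewrite.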
14. OUTPUT BEHAVIOR CLASSIFICATION
Malicious faults
• Inconsistent output errors, which are harder
to neutralize
• The faulty unit behaves arbitrarily
Non-malicious faults
• Consistent output errors
• Easier to neutralize
15. Fail-stop
A fail-stop system responds to up to a certain maximum
number of failures by simply stopping,
rather than putting out incorrect outputs.
Fail-safe
A fail-safe system's failure mode is biased so that the
application process does not suffer
a catastrophe upon failure.
16. INDEPENDENCE AND CORRELATION
Component failures may be independent or
correlated.
Independent: A failure is said to be
independent if it does not directly or indirectly
cause another failure.
Correlated: Failures are said to be correlated if
they are related in some way, e.g. they may be
triggered by the same cause, or one of them might
cause the others to occur.
17. FAULT DETECTION
There are two ways to determine that a processor is
malfunctioning:
• Online
• Offline
Online Detection:
• This detection goes on in parallel with normal system operation.
• It is done by checking for behavior that is inconsistent with
correct operation.
• Indications of a faulty processor:
- Branching to an invalid destination.
- Fetching an opcode from a location that
contains data rather than instructions.
18. - Writing into a portion of memory to which the
process has no write access.
- Fetching an illegal opcode.
- Being inactive for more than a prescribed period.
• A monitor is associated with each processor,
looking for signs that the processor is faulty. The
monitor watches the data and address lines.
• Another approach is to have multiple processors,
which are supposed to put out the same result, and
compare the results. If a discrepancy arises, it
indicates a fault.
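
The monitor idea can be sketched as a small checker that watches bus activity for the symptoms listed above. Everything concrete here is an assumption made for the illustration: the opcode set, the address ranges, the idle limit, and the names `BusEvent` and `Monitor` are hypothetical, and a real monitor would be hardware on the data and address lines rather than Python.

```python
from dataclasses import dataclass

VALID_OPCODES = {0x01, 0x02, 0x03}        # hypothetical instruction set
CODE_REGION = range(0x0000, 0x1000)       # hypothetical: instructions live here
WRITABLE_REGION = range(0x2000, 0x3000)   # hypothetical: the process may write here
MAX_IDLE_TICKS = 5                        # hypothetical inactivity limit


@dataclass
class BusEvent:
    tick: int
    kind: str        # "fetch", "branch" or "write"
    address: int
    opcode: int = 0  # only meaningful for "fetch"


class Monitor:
    """Watches one processor and reports symptoms of a fault."""

    def __init__(self):
        self.last_tick = 0

    def check(self, ev):
        """Return the list of symptoms seen in one bus event."""
        symptoms = []
        # Inactivity is only noticed here when the next event finally arrives.
        if ev.tick - self.last_tick > MAX_IDLE_TICKS:
            symptoms.append("inactive too long")
        self.last_tick = ev.tick
        if ev.kind == "branch" and ev.address not in CODE_REGION:
            symptoms.append("branch to invalid destination")
        if ev.kind == "fetch":
            if ev.address not in CODE_REGION:
                symptoms.append("opcode fetched from outside the code region")
            if ev.opcode not in VALID_OPCODES:
                symptoms.append("illegal opcode")
        if ev.kind == "write" and ev.address not in WRITABLE_REGION:
            symptoms.append("write without write access")
        return symptoms


monitor = Monitor()
print(monitor.check(BusEvent(tick=1, kind="fetch", address=0x0010, opcode=0x01)))
print(monitor.check(BusEvent(tick=2, kind="write", address=0x0100)))
```

An empty list means the event looked consistent with correct operation. Any non-empty list is a sign that the processor may be faulty.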
19. OFFLINE DETECTION
It is done by running diagnostic tests.
These tests are scheduled just like ordinary tasks.
20. FAULT AND ERROR CONTAINMENT
Containment is the process of preventing an error from
spreading from one part of the system to another.
When a fault or error occurs in one part of a system, it
can spread through the system like an infectious disease.
E.g. a fault in one part of the system might cause
large voltage swings in another.
A fault-free processor can give erroneous results
when getting input from a faulty unit.
21. FAULT CONTAINMENT IS ACCOMPLISHED BY
Dividing the system into fault containment zones (FCZs)
and error containment zones (ECZs).
An FCZ is a subset of the system that operates
correctly despite arbitrary logical or electrical faults
outside the subset, i.e. the failure of some part of
the computer outside an FCZ cannot cause any
element inside the FCZ to fail.
22. Hardware inside an FCZ must be isolated from
hardware outside it. It should withstand either a short
circuit or the application of the maximum voltage
imposed on the lines connecting the FCZ to the
outside world.
Each FCZ should have an independent power supply
and its own clocks. These clocks are synchronized
with the clocks in other FCZs, but a malfunction in
the outside clocks won't affect the clocks inside the
FCZ.
The function of an ECZ is to prevent errors from
propagating across zone boundaries. This is achieved
by voting on redundant outputs.
23. REDUNDANCY
A fault-tolerant system (FTS) relies on properly managed
redundancy, i.e. if the system is to be kept
running despite the failure of some of its parts,
it must have spare capacity to begin with.
TYPES OF REDUNDANCY
• Hardware redundancy
• Software redundancy
• Time redundancy
• Information redundancy
24. Hardware redundancy
Hardware redundancy is the use of additional
hardware to compensate for failures. This can be
accomplished in two ways.
• The first is fault detection, correction, and masking.
Fault detection: Multiple hardware units may be
assigned to do the same task in parallel and their results
are compared.
If one or more units are faulty, we can expect
this to show up as a disagreement in the results.
25. Fault masking: If a minority of the units are faulty and a
majority of the units produce the same output, the majority
result can be used and the effect of the failure is masked.
Fault correction: If a minority of the units disagree, the fault
is detected, and the computation is repeated on other
processors to correct it.
• The second is replacing the
malfunctioning unit. The system can be
designed so that faulty units can be easily replaced with
spare ones.
26. Two methods used in hardware redundancy
• Static pairing
• N-modular redundancy (NMR)
28. • Static pairing is a very simple scheme: hardwire
processors in pairs and discard the entire pair if one
of the processors fails.
• The pair runs identical software with identical inputs
and should generate identical outputs. If the outputs are
not identical, the pair is nonfunctional, so the
entire pair is discarded.
• This approach is depicted in the following figure. It
works only when the interface is working properly and
both processors do not fail identically at around
the same time.
29. • The interface is monitored by means of a
monitor. If the interface fails, the monitor takes
care of it, and if the monitor fails, the interface
takes care of it. If both the interface and the monitor fail,
the system is down.
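
The comparison at the heart of static pairing can be sketched as follows. This is a simplified model under stated assumptions: `ProcessorPair` is a hypothetical name, the two "processors" are plain Python functions, and the interface and monitor are not modelled at all.

```python
class ProcessorPair:
    """Two processors hardwired together; the pair is discarded on any mismatch."""

    def __init__(self, cpu_a, cpu_b):
        self.cpu_a = cpu_a
        self.cpu_b = cpu_b
        self.in_service = True

    def run(self, x):
        if not self.in_service:
            raise RuntimeError("pair has been discarded")
        a = self.cpu_a(x)
        b = self.cpu_b(x)
        if a != b:
            # The pair cannot tell which processor is wrong, so both are dropped.
            self.in_service = False
            raise RuntimeError("outputs differ; pair discarded")
        return a


def healthy(x):
    return x * x


def stuck_at_zero(x):
    return 0  # simulated faulty processor


good_pair = ProcessorPair(healthy, healthy)
print(good_pair.run(3))

bad_pair = ProcessorPair(healthy, stuck_at_zero)
try:
    bad_pair.run(3)
except RuntimeError as err:
    print(err, bad_pair.in_service)
```

Note that `ProcessorPair(stuck_at_zero, stuck_at_zero).run(3)` would return 0 without complaint, which is the case of both processors failing identically that the scheme cannot catch.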
31. • NMR is a scheme for forward error recovery.
• It uses N processors instead of one and
votes on their outputs. N is usually odd.
• NMR can be arranged in either of the following
two ways:
- There are N voters, and the entire cluster
produces N outputs.
- There is just one voter.
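
The single-voter arrangement can be sketched as a majority vote over the N outputs. This is a minimal sketch, assuming the outputs are hashable values that can be compared for exact equality; the names `majority_vote` and `nmr_run` are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the units."""
    winner, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among the outputs")
    return winner


def nmr_run(units, x):
    """Run every unit on the same input and vote on the results."""
    return majority_vote([unit(x) for unit in units])


def good_unit(x):
    return x + 1


def faulty_unit(x):
    return -1  # simulated faulty processor


# N = 3: one faulty unit is outvoted, so its error is masked.
print(nmr_run([good_unit, good_unit, faulty_unit], 41))
```

Because the wrong output is simply outvoted, nothing has to be recomputed, which is why NMR counts as forward error recovery. An odd N avoids ties between two equally sized groups.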
32. • NMR clusters are designed to allow the purging
of malfunctioning units. That is, when a failure is
detected, the failed unit is checked to see
whether or not the failure is transient. If it is not, it
must be electrically isolated from the rest of the
cluster and a replacement unit is switched on.
The faster the unit is replaced, the more reliable
the cluster.
33. • Purging can be done either by hardware or by the operating
system.
• Self-purging consists of a monitor at each unit comparing its
output against the voted output. If there is a difference, the
monitor disconnects the unit from the system.
• The monitor can be described as a finite state machine with
two states, connect and isolate. There are two signals: diff,
which is set to 1 whenever the module output disagrees
with the voter output, and reconnect, which is a command
from the system to reconnect the module.
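
The two-state monitor can be written down as a transition function. The sketch below assumes the simplest reading of the description: diff moves a connected module to isolate, and reconnect moves an isolated module back to connect. How the signals interact in other combinations is an assumption of this example, and the names `State` and `next_state` are hypothetical.

```python
from enum import Enum


class State(Enum):
    CONNECT = "connect"
    ISOLATE = "isolate"


def next_state(state, diff, reconnect):
    """One step of the self-purging monitor.

    diff      -- 1 when the module output disagrees with the voter output
    reconnect -- 1 when the system commands the module to be reconnected
    """
    if state is State.CONNECT:
        # Assumption: a disagreement isolates the module even if reconnect is set.
        return State.ISOLATE if diff else State.CONNECT
    # Assumption: an isolated module stays isolated until told to reconnect.
    return State.CONNECT if reconnect else State.ISOLATE


state = State.CONNECT
trace = []
for diff, reconnect in [(0, 0), (1, 0), (0, 0), (0, 1), (0, 0)]:
    state = next_state(state, diff, reconnect)
    trace.append(state.value)
print(trace)
```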
35. SOFTWARE REDUNDANCY
• Software faults are not like hardware faults: software
never wears out, and its faults are not
generated spontaneously during system operation.
• Software faults can be regarded as faults in
design.
• For software redundancy, simply replicating the
same software N times will not work: all N copies
will fail for the same inputs.
• Instead, N versions of the software can be
implemented. The N versions can be developed by
independent teams, with no contact between them.
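
The difference between replicating one program and running N independent versions can be shown with a small example. This is only a sketch under assumptions: the leap-year specification, the three versions, and the majority vote used to combine them are all chosen for illustration, and `version_c` carries a deliberately planted design fault.

```python
import calendar
from collections import Counter


def version_a(year):
    # Version A relies on the standard library.
    return calendar.isleap(year)


def version_b(year):
    # Version B spells out the Gregorian rule by hand.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def version_c(year):
    # Version C has a design fault: it ignores the century rule.
    return year % 4 == 0


def vote(versions, x):
    """Run every version on the same input and return the majority result."""
    results = [version(x) for version in versions]
    winner, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("versions disagree with no majority")
    return winner


# Three independent versions: the faulty one is outvoted for 1900.
print(vote([version_a, version_b, version_c], 1900))

# Three copies of the same faulty version: all fail for the same input.
print(vote([version_c, version_c, version_c], 1900))
```

The second call returns the wrong answer unanimously, which is why the versions must be developed independently rather than copied.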
36. • Each version is developed by a team of
developers who never communicate with the other teams.
• To minimize common-mode failures:
- The specifications should be written in formal
terms and subjected to a rigorous process of
checking.
- Multiple software versions should be developed in
different programming languages.
- The tools that are used should be
selected carefully.
- The training and quality of the programmers should
be maintained.
37. There are two approaches to software redundancy:
• N-version programming
• Recovery block approach