FAULT TOLERANT SYSTEM
 A fault tolerant system is one that is able to
  continue operating despite the failure of a limited
  subset of its hardware or software.

 Such systems are gracefully degradable, i.e. as the size of the
  faulty set increases, the system does not collapse
  suddenly but continues executing part of its
  workload.

 The goal of this design is to ensure that the
  probability of system failure is acceptably small.
FAULT TYPES

Hardware Fault: A hardware fault is some physical
defect that can cause a component to malfunction.
      E.g. a broken wire, or the output of a logic gate
that is perpetually stuck at some logic value (0 or 1).

Software Fault: A software fault is a bug that can
cause the program to fail for a given set of inputs.
ERROR
 An error is the manifestation of a fault.
   E.g. a broken wire will cause an error if
the system tries to propagate a signal
through it.
A program with a fault that induces
incorrect output for some set of inputs will
generate errors if that set of inputs is
applied.
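As a minimal illustration of the distinction (a hypothetical example, not taken from the slides), the function below contains a software fault that stays dormant for most uses and produces an error only when a triggering input is applied. The function name and the bug are made up for the example.

```python
def average(values):
    """Return the mean of a list of numbers.

    Hypothetical fault: the divisor is hard-coded as 3, so the
    result is correct only when exactly three values are supplied.
    """
    return sum(values) / 3  # fault: should be len(values)


# The fault is present in both calls, but only the second one
# applies inputs that turn it into an error (a wrong output).
ok = average([2, 4, 6])        # three values: the fault stays dormant
wrong = average([2, 4, 6, 8])  # four values: the fault manifests as an error
```

The correct mean of the second list is 5, so `wrong` is an error even though the faulty line was executed in both calls.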
FAULT LATENCY
The fault latency is the duration between
the onset of a fault and its manifestation as
an error.

Faults themselves are invisible to
the outside world and show themselves only
when they cause errors, so such latency
affects the reliability of the overall system.
ERROR RECOVERY
   Error recovery is the process by which the system attempts to
recover from the effects of an error.

TYPES OF ERROR RECOVERY
Forward Error Recovery: In this type the error is
masked without any computations having to be
redone.
Backward Error Recovery: In this type the system is
rolled back to a moment in time before the error is
believed to have occurred, and the computation is carried out
again. This takes additional time to mask the effects
of the failure.
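Backward error recovery can be sketched as checkpoint and rollback. The class below is a minimal sketch, not a definitive implementation: the names `CheckpointedTask`, `run_step`, and `flaky_increment` are hypothetical, and the error is assumed to be detectable by a validity check supplied by the caller.

```python
import copy


class CheckpointedTask:
    """Backward error recovery: save state, redo the work on error."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Record a state that is believed to be error-free.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard everything done since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)

    def run_step(self, step, is_valid, max_retries=3):
        """Apply step(state); roll back and redo it if the result is invalid."""
        self.checkpoint()
        for _ in range(max_retries):
            step(self.state)
            if is_valid(self.state):
                return True
            self.rollback()  # costs time: the step must be recomputed
        return False


attempts = {"n": 0}


def flaky_increment(state):
    attempts["n"] += 1
    # Simulated transient error on the first attempt only.
    state["total"] += 100 if attempts["n"] == 1 else 1


task = CheckpointedTask({"total": 0})
recovered = task.run_step(flaky_increment, lambda s: s["total"] == 1)
```

The second attempt is the additional time that backward recovery pays; forward recovery would instead mask the bad result without redoing the step.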
CAUSES FOR FAULTS

Errors in the specification or design.

Defects in the components.

Environmental effects.
Errors In The Specification Or Design

These errors arise from the communication
gap between the person who writes the
specification and the system designer.

The specification is the link between the design
process and the real-world application.

If the specification is wrong, everything that
proceeds from it is likely to be wrong.
Defects In Components
  These faults arise from defects caused by
wear and tear in use.

  E.g. a MOSFET may fail due to electromigration,
which is the gradual drift over time of metal
atoms in its conductors.
Environmental Effects

These faults arise from the operating environment.

 Devices can be subjected to a whole array of
stresses, depending on the application.

Poor ventilation or excessively high ambient
temperatures can melt components or damage
them.

  E.g. if a computer is in a missile, it can undergo
high g-forces and vibrational stress.
FAULT TYPES
Faults are classified according to their temporal
behavior and output behavior.

A fault is said to be active when it is physically
capable of generating errors and to be benign when
it is not.
TEMPORAL BEHAVIOR CLASSIFICATION

 Fault types: Permanent, intermittent, transient.
A permanent fault does not die away with time,
but remains until it is repaired or the affected unit is
replaced.

An intermittent fault cycles between the
fault-active and fault-benign states.

A transient fault dies away after some time.
Intermittent faults can be caused by loosely
 connected components.

Transient faults can be caused by environmental
 effects.
     e.g. If there is a burst of electromagnetic
 radiation and the memory is not properly shielded,
 the contents of the memory can be altered without
 the memory chips themselves suffering any
 structural damage. When the memory is rewritten,
 the fault will go away.
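The three temporal classes can be told apart, roughly, from how the fault state changes after onset. The function below is only a sketch under stated assumptions: the fault state is sampled as a list of booleans, the list starts at the onset, and the observation window is long enough to be representative. The name `classify_fault` is hypothetical.

```python
def classify_fault(trace):
    """Classify a fault from samples of its state after onset.

    trace: list of booleans, True = fault-active, False = fault-benign.
    Assumes trace[0] is True (the onset of the fault).
    """
    if not trace or not trace[0]:
        raise ValueError("trace must start at the onset of the fault")
    # Count active -> benign and benign -> active transitions.
    flips = sum(1 for a, b in zip(trace, trace[1:]) if a != b)
    if flips == 0:
        return "permanent"      # never went away within the window
    if flips == 1:
        return "transient"      # died away and did not return
    return "intermittent"       # cycles between active and benign
```

With a finite window the result is only a best guess: a fault labelled transient here could still reappear later and turn out to be intermittent.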
OUTPUT BEHAVIOR CLASSIFICATION
  Malicious faults

   • Produce inconsistent output errors, which are
     harder to neutralize

   • The faulty unit behaves arbitrarily
  Non-malicious faults
   • Produce consistent output errors

   • These errors are easier to neutralize
Fail stop
   A fail-stop unit responds to up to a certain maximum
   number of failures by simply stopping,
   rather than putting out incorrect outputs.

Fail safe
   A fail-safe unit has a failure mode that is biased so that the
   application process does not suffer
   catastrophe upon failure.
INDEPENDENCE AND CORRELATION
  Component failures may be independent or
correlated.

         Independent: A failure is said to be
independent if it does not directly or indirectly
cause another failure.

 Correlated: Failures are said to be correlated if
they are related in some way, e.g. they may be
triggered by the same cause, or one of them might
cause the others to occur.
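The practical difference shows up in the probability that two units fail together. As a worked illustration (the numbers are assumed for the example, not taken from the slides), let A and B be the events that unit 1 and unit 2 fail during a mission, each with probability 10^-3:

```latex
% Independent failures: the joint probability is the product.
P(A \cap B) = P(A)\,P(B) = 10^{-3} \times 10^{-3} = 10^{-6}

% Correlated failures: the conditional probability replaces P(B).
% Assume (for illustration) a common cause gives P(B \mid A) = 0.5.
P(A \cap B) = P(A)\,P(B \mid A) = 10^{-3} \times 0.5 = 5 \times 10^{-4}
```

Under these assumed numbers, correlation makes a double failure 500 times more likely, which is why arguments for redundancy usually depend on failures being independent.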
FAULT DETECTION
    There are two ways to determine that a processor is
malfunctioning:
• Online
• Offline

Online Detection:

• This detection goes on in parallel with normal system operation.
• It is done by checking for behavior that is inconsistent with
correct operation.
• Indications of a faulty processor:
     - Branching to an invalid destination.
     - Fetching an opcode from a location that contains data.
     - Writing into a portion of memory to which the
       process has no write access.
     - Fetching an illegal opcode.
     - Being inactive for more than a prescribed period.

• A monitor is associated with each processor,
  looking for signs that the processor is faulty. The
  monitor watches the data and address lines.

• Another approach is to have multiple processors,
  which are supposed to put out the same result, and
  compare the results. If a discrepancy arises, it
  indicates a fault.
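The per-processor monitor can be sketched as a set of checks on the addresses and opcodes it observes. This is a minimal sketch, not a real hardware interface: the class `ProcessorMonitor`, its method names, the memory map, the opcode set, and the timeout are all hypothetical values chosen for illustration.

```python
class ProcessorMonitor:
    """Online checks for the indications of a faulty processor listed above."""

    def __init__(self, code_range, writable_range, legal_opcodes, timeout):
        self.code_range = code_range          # (start, end) of instruction memory
        self.writable_range = writable_range  # (start, end) the process may write
        self.legal_opcodes = set(legal_opcodes)
        self.timeout = timeout                # maximum allowed inactivity
        self.last_activity = 0.0

    @staticmethod
    def _inside(addr, bounds):
        start, end = bounds
        return start <= addr < end

    def on_branch(self, now, target):
        self.last_activity = now
        if not self._inside(target, self.code_range):
            return "branch to invalid destination"
        return None

    def on_fetch(self, now, addr, opcode):
        self.last_activity = now
        if not self._inside(addr, self.code_range):
            return "opcode fetched from outside instruction memory"
        if opcode not in self.legal_opcodes:
            return "illegal opcode"
        return None

    def on_write(self, now, addr):
        self.last_activity = now
        if not self._inside(addr, self.writable_range):
            return "write without write access"
        return None

    def check_inactivity(self, now):
        # Watchdog-style check, called periodically by the monitor.
        if now - self.last_activity > self.timeout:
            return "inactive for too long"
        return None


monitor = ProcessorMonitor(code_range=(0x0000, 0x1000),
                           writable_range=(0x2000, 0x3000),
                           legal_opcodes={0x01, 0x02, 0x03},
                           timeout=5.0)
```

Each method returns `None` when the observed behavior is consistent with correct operation and a short description of the violation otherwise.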
OFFLINE DETECTION

Offline detection is done by running diagnostic tests.


These tests are scheduled just like ordinary tasks.
FAULT AND ERROR CONTAINMENT

The process of preventing an error from spreading from one
part of the system to another is called containment.

When a fault or error occurs in one part of a system, it
can spread through the system like an infectious disease.
   E.g. a fault in one part of the system might cause
large voltage swings in another.

 A fault-free processor can give erroneous results
when it gets its input from a faulty unit.
FAULT CONTAINMENT IS ACCOMPLISHED BY

The system is divided into fault containment zones (FCZs)
and error containment zones (ECZs).

An FCZ is a subset of the system that operates
correctly despite arbitrary logical or electrical faults
outside the subset, i.e. the failure of some part of
the computer outside an FCZ cannot cause any
element inside the FCZ to fail.
 Hardware inside an FCZ must be isolated from
  hardware outside it. It should withstand either a short
  circuit or the application of the maximum voltage
  on the lines connecting the FCZ to the
  outside world.

 Each FCZ should have an independent power supply
  and its own clocks. These clocks are synchronized
  with the clocks in other FCZs, but a malfunction in
  the outside clocks won't affect the clocks inside the
  FCZ.

 The function of an ECZ is to prevent errors from
  propagating across zone boundaries. This is achieved
  by voting on redundant outputs.
REDUNDANCY
     Fault tolerant systems rely on properly managed
redundancy, i.e. if the system is to be kept
running despite the failure of some of its parts,
it must have spare capacity to begin with.

TYPES OF REDUNDANCY
• Hardware redundancy
• Software redundancy
• Time redundancy
• Information redundancy
Hardware redundancy
         Hardware redundancy is the use of additional
hardware to compensate for failures. This can be
accomplished in two ways.

• The first is fault detection, correction, and masking.

Fault detection: Multiple hardware units may be
assigned to do the same task in parallel, and their results
are compared.
          If one or more units are faulty, we can expect
this to show up as a disagreement in the results.
Fault masking: If a minority of the units are faulty and a
majority of the units produce the same output, the majority
result is used and the effect of the failure is masked.

Fault correction: If a minority of the units disagree, the fault
is detected, and the computation is repeated on other
processors to correct it.

• The second is replacing the
malfunctioning unit. The system can be
designed so that faulty units are easily replaced with
spare ones.
Two methods used in hardware redundancy

  •Static Pairing

  •N-Modular Redundancy (NMR)
STATIC PAIRING
• Processors are hardwired in pairs, and the
entire pair is discarded if one of the processors fails. This is a very
simple scheme.

• The pair runs identical software with identical inputs
and should generate identical outputs. If the outputs are
not identical, the pair is non-functional, so the
entire pair is discarded.

• This approach is depicted in the following figure. It
works only when the interface is working correctly and
the two processors do not fail identically at around
the same time.
• The interface is checked by a
  monitor. If the interface fails, the monitor
  detects it, and if the monitor fails, the interface
  detects it. If both the interface and the monitor fail,
  the system is down.
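The comparison at the heart of static pairing can be sketched as follows. This is a minimal sketch: the class `StaticPair` and the two stand-in processors `good` and `faulty` are hypothetical, and the interface and monitor described above are not modelled.

```python
class StaticPair:
    """Two processors hardwired together; any disagreement retires the pair."""

    def __init__(self, processor_a, processor_b):
        self.processors = (processor_a, processor_b)
        self.functional = True

    def run(self, inputs):
        if not self.functional:
            raise RuntimeError("pair has been discarded")
        out_a, out_b = (p(inputs) for p in self.processors)
        if out_a != out_b:
            # The pair cannot tell which processor is wrong,
            # so the whole pair is taken out of service.
            self.functional = False
            raise RuntimeError("outputs differ: pair discarded")
        return out_a


def good(x):
    return x * 2


def faulty(x):
    return x * 2 + (1 if x == 3 else 0)  # errs only for x == 3


pair = StaticPair(good, faulty)
first = pair.run(2)      # both processors agree
try:
    pair.run(3)          # the disagreement is detected
except RuntimeError:
    pass
```

If both processors fail identically, for example `StaticPair(faulty, faulty)`, the comparison sees no disagreement and the wrong result passes through, which is the limitation noted above.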
N MODULAR REDUNDANCY
• It is a scheme for forward error recovery.

• It uses N processors instead of one and
votes on their outputs; N is usually odd.

• NMR can be illustrated in the following
two ways:
   - There are N voters, and the entire cluster
     produces N outputs.

   - There is just one voter.
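The single-voter arrangement can be sketched in a few lines. This is a sketch under assumptions: the function names `majority_vote` and `nmr` are hypothetical, unit outputs are assumed to be hashable values that can be compared exactly, and the voter itself is assumed to be fault-free.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the N units.

    Raises ValueError if no value has a strict majority, in which case
    the error cannot be masked by voting alone.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among unit outputs")
    return value


def nmr(units, inputs):
    """Run every unit on the same inputs and vote on the results."""
    return majority_vote([unit(inputs) for unit in units])


# Three units (N = 3); the third one is faulty.
units = [lambda x: x + 1, lambda x: x + 1, lambda x: x - 1]
masked = nmr(units, 10)
```

With N = 3 a single faulty unit is outvoted, so the error is masked without any computation being redone.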
•   NMR clusters are designed to allow the purging
    of malfunctioning units. That is, when a failure is
    detected, the failed unit is checked to see
    whether or not the failure is transient. If it is not, it
    must be electrically isolated from the rest of the
    cluster and a replacement unit is switched on.
    The faster the unit is replaced, the more reliable
    the cluster.
• Purging can be done either by hardware or by the operating
  system.

• Self purging consists of a monitor at each unit comparing its
  output against the voted output. If there is a difference, the
  monitor disconnects the unit from the system.

• The monitor can be described as a finite state machine with
  two states, connect and isolate. There are two signals: diff,
  which is set to 1 whenever the module output disagrees
  with the voter output, and reconnect, which is a command
  from the system to reconnect the module.
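The two-state monitor can be written down directly. The slides name the states and signals but do not give the transition table, so the transitions below are an assumption: diff = 1 moves connect to isolate, and reconnect = 1 moves isolate back to connect. The class name `PurgeMonitor` is hypothetical.

```python
CONNECT, ISOLATE = "connect", "isolate"


class PurgeMonitor:
    """Two-state monitor for one module of a self-purging cluster."""

    def __init__(self):
        self.state = CONNECT

    def step(self, diff, reconnect):
        """Advance one step given the diff and reconnect signals (0 or 1)."""
        if self.state == CONNECT and diff:
            self.state = ISOLATE      # module disagreed with the voter
        elif self.state == ISOLATE and reconnect:
            self.state = CONNECT      # system commanded a reconnect
        return self.state


m = PurgeMonitor()
history = [m.step(0, 0), m.step(1, 0), m.step(0, 0), m.step(0, 1)]
```

In this sketch a module that is reconnected while still faulty is simply isolated again on the next step in which diff is 1.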
SOFTWARE REDUNDANCY
• Software faults are not like hardware faults: software
never wears out, and its faults are not
generated spontaneously during system operation.

• Software faults can be regarded as faults in
design.

• For software redundancy, simply replicating the
same software N times will not work: all N copies
will fail for the same inputs.

• Instead, N versions of the software can be
implemented. Each version is developed by an
independent team, with no contact between the teams.
• To minimize common-mode failures:

      - The specifications should be written in formal
        terms and subjected to a rigorous process of
        checking.

      - The software versions should be developed in
        different programming languages.

      - The nature of the tools used should be
        chosen carefully.

      - The training and quality of the programmers should
        be maintained.
There are two approaches to software redundancy:

   •N Version Programming

   •Recovery Block Approach
N Version Programming
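In N-version programming, the independently developed versions all run on the same inputs and a voter selects the majority result. A minimal sketch, assuming three hypothetical versions of a function that sums the integers 1 to n, one of which contains a deliberate fault; all names here are made up for the example.

```python
from collections import Counter


def sum_formula(n):
    """Version 1: closed-form expression."""
    return n * (n + 1) // 2


def sum_loop(n):
    """Version 2: explicit accumulation."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_faulty(n):
    """Version 3: deliberate off-by-one fault, wrong for every n >= 1."""
    return sum(range(n))


def n_version(versions, n):
    """Run all versions on the same input and return the majority result."""
    results = [version(n) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("versions disagree with no majority")
    return value


voted = n_version([sum_formula, sum_loop, sum_faulty], 10)
replicated = n_version([sum_faulty, sum_faulty, sum_faulty], 10)
```

The different versions outvote the faulty one, while three copies of the same faulty version agree with each other on the wrong answer, which is why plain replication does not help against software faults.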
Recovery Block Approach
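In the recovery block approach, a primary version runs first and its result is checked by an acceptance test; if the test fails, the state is restored and an alternate version is tried. A minimal sketch, assuming a sorting task with a deliberately faulty primary; the function names and the acceptance test are hypothetical choices for the example.

```python
def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order until one passes the acceptance test.

    state acts as the recovery point: every alternate starts from an
    untouched copy, so a rejected attempt leaves no side effects.
    """
    for alternate in alternates:
        candidate = alternate(list(state))   # work on a copy of the checkpoint
        if acceptance_test(state, candidate):
            return candidate
    raise RuntimeError("all alternates failed the acceptance test")


def primary_sort(xs):
    # Deliberate fault: duplicates are lost.
    return sorted(set(xs))


def alternate_sort(xs):
    # Simpler, independently written insertion sort.
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def acceptable(original, result):
    # Cheap sanity check: result is ordered and no element was lost.
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and len(result) == len(original)


result = recovery_block([3, 1, 3, 2], [primary_sort, alternate_sort], acceptable)
```

Unlike N-version programming, only one version runs at a time, so the cost of the alternates is paid only when the acceptance test rejects a result.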
THANK YOU

Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Fault tolerant system

  • 2. A fault-tolerant system is one that is able to continue operating despite the failure of a limited subset of its hardware or software. Such systems degrade gracefully: as the size of the faulty set increases, the system does not collapse suddenly but continues executing part of its workload. The goal of this design is to ensure that the probability of system failure is acceptably small.
  • 3. FAULT TYPES Hardware fault: a physical defect that can cause a component to malfunction, e.g. a broken wire, or the output of a logic gate that is perpetually stuck at some logic value (0 or 1). Software fault: a bug that can cause the program to fail for a given set of inputs.
  • 4. ERROR An error is a manifestation of a fault. E.g. a broken wire will cause an error if the system tries to propagate a signal through it. A program with a fault that induces incorrect output for some set of inputs will generate errors if that set of inputs is applied.
  • 5. FAULT LATENCY The fault latency is the duration between the onset of a fault and its manifestation as an error. Faults themselves are invisible to the outside world and show themselves only when they cause errors, so this latency affects the reliability of the overall system.
  • 6. ERROR RECOVERY Error recovery is the process by which the system attempts to recover from the effects of an error. TYPES OF ERROR RECOVERY Forward error recovery: the error is masked without any computations having to be redone. Backward error recovery: the system is rolled back to a moment in time before the error is believed to have occurred, and the computation is carried out again. This consumes additional time to mask the effects of the failure.
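Backward error recovery can be illustrated with a small checkpoint-and-rollback sketch. This is only a minimal illustration under stated assumptions: the names `TransientError`, `run_with_rollback`, and `flaky_increment` are hypothetical, and the state is assumed to be a small in-memory dictionary that can be copied cheaply.

```python
import copy


class TransientError(Exception):
    """Hypothetical signal that an error was detected during a step."""


def run_with_rollback(state, step, max_retries=3):
    """Apply `step` to `state`; on a detected error, roll back and redo."""
    checkpoint = copy.deepcopy(state)          # save the state before computing
    for attempt in range(max_retries + 1):
        try:
            step(state)                        # may corrupt state before failing
            return state, attempt              # attempt == number of rollbacks
        except TransientError:
            state = copy.deepcopy(checkpoint)  # roll back to the saved state
    raise RuntimeError("error persisted after all retries")


# Example step that fails once (a transient fault) and then succeeds.
failures_left = [1]


def flaky_increment(state):
    state["count"] += 100                      # partial update that corrupts state
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise TransientError("bad intermediate result detected")
    state["count"] -= 99                       # net effect of a clean run: +1


final_state, rollbacks = run_with_rollback({"count": 0}, flaky_increment)
```

The redone computation is the extra time cost of backward recovery noted on the slide; forward recovery avoids it by masking the error instead.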
  • 7. CAUSES OF FAULTS Errors in the specification or design. Defects in the components. Environmental effects.
  • 8. Errors In The Specification Or Design This kind of fault arises from a communication gap between the person who writes the specification and the system designer. The specification is the link between the design process and the real-world application. If the specification is wrong, everything that proceeds from it is likely to be wrong.
  • 9. Defects In Components These faults arise from defects caused by the wear and tear of use. E.g. a MOSFET may fail due to electromigration, which is the gradual drift of metal atoms in a conductor in the direction of electron flow.
  • 10. Environmental Effects These faults arise from the operating environment. Devices can be subjected to a whole array of stresses, depending on the application. Poor ventilation or excessively high ambient temperatures can melt components or otherwise damage them. E.g. a computer in a missile can undergo high g-forces and vibrational stress.
  • 11. FAULT TYPES Faults are classified according to their temporal behavior and output behavior. A fault is said to be active when it is physically capable of generating errors and to be benign when it is not.
  • 12. TEMPORAL BEHAVIOR CLASSIFICATION Fault types: permanent, intermittent, transient. A permanent fault does not die away with time, but remains until it is repaired or the affected unit is replaced. An intermittent fault cycles between the fault-active and fault-benign states. A transient fault dies away after some time.
  • 13. Intermittent faults can be caused by loosely connected components. Transient faults can be caused by environmental effects. E.g. if there is a burst of electromagnetic radiation and the memory is not properly shielded, the contents of the memory can be altered without the memory chips themselves suffering any structural damage. When the memory is rewritten, the fault goes away.
  • 14. OUTPUT BEHAVIOR CLASSIFICATION Malicious faults • Inconsistent output errors; the faulty unit behaves arbitrarily • Harder to neutralize Nonmalicious faults • Consistent output errors • Easier to neutralize
  • 15. Fail-stop: a fail-stop unit responds to up to a certain maximum number of failures by simply stopping, rather than putting out incorrect outputs. Fail-safe: a fail-safe unit has its failure mode biased so that the application process does not suffer catastrophe upon failure.
  • 16. INDEPENDENCE AND CORRELATION Component failures may be independent or correlated. Independent: a failure is said to be independent if it does not directly or indirectly cause another failure. Correlated: failures are said to be correlated if they are related in some way, e.g. they may be triggered by the same cause, or one of them might cause the others to occur.
  • 17. FAULT DETECTION There are two ways to determine that a processor is malfunctioning • Online • Offline Online detection: • This detection goes on in parallel with normal system operation • It is done by checking for behavior that is inconsistent with correct operation. • Indications of a faulty processor - Branching to an invalid destination. - Fetching an opcode from a location that contains data rather than instructions.
  • 18. - Writing into a portion of memory to which the process has no write access. - Fetching an illegal opcode. - Being inactive for more than a prescribed period. • A monitor is associated with each processor, looking for signs that the processor is faulty. The monitor watches the data and address lines. • Another approach is to have multiple processors that are supposed to put out the same result, and to compare their results. If a discrepancy arises, it indicates a fault.
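Two of the online checks above, inactivity beyond a prescribed period and comparison of replicated results, can be sketched in a few lines. This is a software analogy only, not the hardware monitor the slides describe; `Watchdog`, `heartbeat`, `is_suspect`, and `outputs_disagree` are hypothetical names, and time is passed in explicitly so that the example is deterministic.

```python
class Watchdog:
    """Flags a processor that has been inactive for longer than `timeout` seconds."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_heartbeat = now

    def heartbeat(self, now):
        # The monitored processor calls this to show it is still making progress.
        self.last_heartbeat = now

    def is_suspect(self, now):
        # True once the silence exceeds the prescribed period.
        return (now - self.last_heartbeat) > self.timeout


def outputs_disagree(results):
    """Online check by comparison: any mismatch among replicas signals a fault."""
    return len(set(results)) > 1


wd = Watchdog(timeout=2.0)
wd.heartbeat(now=1.0)
healthy = not wd.is_suspect(now=2.5)   # 1.5 s of silence: within the limit
stalled = wd.is_suspect(now=3.5)       # 2.5 s of silence: limit exceeded
```

Note that comparison alone only reveals that some replica is faulty; identifying which one needs at least three replicas and a vote.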
  • 19. OFFLINE DETECTION Offline detection is done by running diagnostic tests. These tests are scheduled just like ordinary tasks.
  • 20. FAULT AND ERROR CONTAINMENT The process of preventing an error from spreading from one part of the system to another is called containment. Without containment, a fault or error that occurs in one part of a system can spread through the system like an infectious disease. E.g. a fault in one part of the system might cause large voltage swings in another. A fault-free processor can give erroneous results when it gets its input from a faulty unit.
  • 21. HOW FAULT CONTAINMENT IS ACCOMPLISHED The system is divided into fault containment zones (FCZs) and error containment zones (ECZs). An FCZ is a subset of the system that operates correctly despite arbitrary logical or electrical faults outside the subset, i.e. the failure of some part of the computer outside an FCZ cannot cause any element inside the FCZ to fail.
  • 22. Hardware inside an FCZ must be isolated from hardware outside it. It should withstand either a short circuit or the application of the maximum voltage on the lines connecting the FCZ to the outside world. Each FCZ should have an independent power supply and its own clocks. These clocks are synchronized with the clocks in other FCZs, but a malfunction in the outside clocks will not affect the clocks inside the FCZ. The function of an ECZ is to prevent errors from propagating across zone boundaries. This is achieved by voting on redundant outputs.
  • 23. REDUNDANCY A fault-tolerant system (FTS) relies on properly managed redundancy: if the system is to be kept running despite the failure of some of its parts, it must have spare capacity to begin with. TYPES OF REDUNDANCY • Hardware redundancy • Software redundancy • Time redundancy • Information redundancy
  • 24. Hardware redundancy Hardware redundancy is the use of additional hardware to compensate for failures. This can be accomplished in two ways. • One of them is fault detection, correction, and masking. Fault detection: multiple hardware units may be assigned to do the same task in parallel and their results compared. If one or more units are faulty, we can expect this to show up as a disagreement in the results.
  • 25. Fault masking: if a minority of the units are faulty and a majority of the units produce the same output, the majority result can be used and the effect of the failure is masked. Fault correction: if a minority of the units disagree, the fault is detected, and the computation is repeated on other processors to correct it. • The second way is to replace the malfunctioning unit. The system can be designed so that faulty units are easily replaced with spare ones.
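The difference between detecting and masking a fault can be made concrete with a small voting routine over the redundant outputs. This is a minimal sketch, assuming the outputs are hashable values that can be compared for exact equality; the function name `vote` and its status labels are illustrative, not taken from the slides.

```python
from collections import Counter


def vote(outputs):
    """Return (value, status) for a list of redundant outputs.

    status is 'ok' when all units agree, 'masked' when a strict majority
    agrees (the minority is outvoted), and 'detected' when there is a
    disagreement but no majority to trust.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count == len(outputs):
        return value, "ok"
    if count > len(outputs) / 2:
        return value, "masked"
    return None, "detected"


all_agree = vote([5, 5, 5])     # (5, 'ok')
one_faulty = vote([5, 9, 5])    # (5, 'masked')
no_majority = vote([1, 2, 3])   # (None, 'detected')
```

In the 'detected' case the caller would have to fall back on correction, for example by repeating the computation on other processors.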
  • 26. Two methods used in hardware redundancy • Static pairing • N-modular redundancy (NMR)
  • 28. • Static pairing hardwires processors in pairs and discards the entire pair if one of the processors fails; this is a very simple scheme. • The pair runs identical software with identical inputs and should generate identical outputs. If the outputs are not identical, the pair is nonfunctional, so the entire pair is discarded. • This approach is depicted in the following figure, and it works only when the interface is working correctly and the two processors do not fail identically at around the same time.
  • 29. • The interface is checked by a monitor. If the interface fails, the monitor takes care of it, and if the monitor fails, the interface takes care of it. If both the interface and the monitor fail, the system is down.
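The pairing scheme can be sketched as follows. This is a software analogy under stated assumptions: each "processor" is modeled as a plain Python function, the interface and its monitor are not modeled, and `StaticPair`, `good`, and `stuck` are hypothetical names.

```python
class StaticPair:
    """Two processors run the same input; a mismatch discards the whole pair."""

    def __init__(self, cpu_a, cpu_b):
        self.cpu_a = cpu_a
        self.cpu_b = cpu_b
        self.functional = True

    def run(self, x):
        if not self.functional:
            raise RuntimeError("pair has been discarded")
        a, b = self.cpu_a(x), self.cpu_b(x)
        if a != b:
            self.functional = False   # outputs differ: discard the entire pair
            return None
        return a


def good(x):
    return x * 2


def stuck(x):
    return 0                          # models an output stuck at a fixed value


healthy_pair = StaticPair(good, good)
broken_pair = StaticPair(good, stuck)
blind_pair = StaticPair(stuck, stuck)  # both units fail identically

ok_result = healthy_pair.run(3)        # 6
bad_result = broken_pair.run(3)        # None, and the pair is discarded
undetected = blind_pair.run(3)         # 0: wrong, but the two outputs agree
```

The last case shows the limitation stated on the slide: if both processors fail identically, the comparison cannot notice.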
  • 31. • N-modular redundancy (NMR) is a scheme for forward error recovery. • It works by using N processors instead of one and voting on their outputs; N is usually odd. • NMR can be implemented in either of two ways: there are N voters and the entire cluster produces N outputs, or there is just one voter.
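To see what voting buys, one can compute the probability that a strict majority of the N units is working. The sketch below rests on assumptions that are not in the slides: units fail independently, each works with the same probability `r`, and the voter itself never fails. The function name `nmr_reliability` is hypothetical.

```python
from math import comb


def nmr_reliability(n, r):
    """Probability that a strict majority of n independent units is working,
    where each unit works with probability r (perfect voter assumed)."""
    majority = n // 2 + 1
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k)
               for k in range(majority, n + 1))


tmr_good_units = nmr_reliability(3, 0.9)   # about 0.972, better than one unit
tmr_poor_units = nmr_reliability(3, 0.4)   # about 0.352, worse than one unit
```

Under these assumptions the cluster beats a single unit only when each unit is more likely to work than not, and correlated failures would weaken the result further.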
  • 32. NMR clusters are designed to allow the purging of malfunctioning units. That is, when a failure is detected, the failed unit is checked to see whether or not the failure is transient. If it is not, it must be electrically isolated from the rest of the cluster and a replacement unit is switched on. The faster the unit is replaced, the more reliable the cluster.
  • 33. • Purging can be done either by hardware or by the operating system. • Self-purging consists of a monitor at each unit comparing the unit's output against the voted output. If there is a difference, the monitor disconnects the unit from the system. • The monitor can be described as a finite state machine with two states, connect and isolate. There are two signals: diff, which is set to 1 whenever the module output disagrees with the voter output, and reconnect, which is a command from the system to reconnect the module.
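The two-state monitor can be written down directly. The states and the signals `diff` and `reconnect` come from the slide; the exact transition rules are an assumption here (in particular, that `reconnect` alone returns the module to the connect state), and the class name `PurgeMonitor` is hypothetical.

```python
class PurgeMonitor:
    """Self-purging monitor as a two-state finite state machine."""

    CONNECT = "connect"
    ISOLATE = "isolate"

    def __init__(self):
        self.state = self.CONNECT

    def step(self, diff, reconnect):
        # Assumed rules: a disagreement isolates a connected module, and only
        # an explicit reconnect command brings an isolated module back.
        if self.state == self.CONNECT and diff:
            self.state = self.ISOLATE
        elif self.state == self.ISOLATE and reconnect:
            self.state = self.CONNECT
        return self.state


monitor = PurgeMonitor()
trace = [
    monitor.step(diff=0, reconnect=0),   # agrees with the voter: stays connected
    monitor.step(diff=1, reconnect=0),   # disagrees: module is isolated
    monitor.step(diff=0, reconnect=0),   # stays isolated until told otherwise
    monitor.step(diff=0, reconnect=1),   # system command: reconnected
]
```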
  • 34.
  • 35. SOFTWARE REDUNDANCY • Software faults are not like hardware faults: software never wears out, so its faults are not generated spontaneously during system operation. • Software faults can be regarded as faults in design. • For software redundancy, simply replicating the same software N times will not work: all N copies will fail for the same inputs. • Instead, N versions of the software can be implemented. The N versions can be developed by independent teams, with no contact between them.
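A toy illustration of why independently written versions help: three implementations of the same specification (integer square root) are run on the same input and a majority vote picks the result. The vote is how N-version programming is usually described, not something stated on this slide; all function names are hypothetical, and the third version contains a deliberately planted design fault.

```python
import math
from collections import Counter


def isqrt_v1(n):
    return math.isqrt(n)             # version 1: library routine


def isqrt_v2(n):
    r = 0                            # version 2: linear search
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r


def isqrt_v3(n):
    return round(n ** 0.5)           # version 3: planted fault, rounds to nearest


def n_version(versions, x):
    """Run every version on x and return the strict-majority result."""
    results = [version(x) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among versions")
    return value


versions = [isqrt_v1, isqrt_v2, isqrt_v3]
voted = n_version(versions, 15)      # v3 answers 4, but the majority answers 3
```

Running three copies of `isqrt_v3` instead would return the wrong answer unanimously, which is the point the slide makes about plain replication.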
  • 36. Each version is developed by a team of developers who never communicate with the other teams. • To minimize common-mode failures: The specifications should be written in formal terms and subjected to a rigorous process of checking. The software versions should be developed in different programming languages. The tools used should be selected carefully. The training and quality of the programmers should be maintained.
  • 37. There are two approaches to software redundancy • N-version programming • Recovery block approach
  • 40. THANK YOU