Fault Tolerance (Distributed computing)


  1. Fault Tolerance. Paul Krzyzanowski. Distributed Systems. Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
  2. Faults
     - Deviation from expected behavior
     - Variety of factors
       - hardware
       - software
       - operator
       - network
     - Three categories
       - transient faults
       - intermittent faults
       - permanent faults
  3. Faults
     - Three categories
       - transient faults
       - intermittent faults
       - permanent faults
     - Any fault may be
       - fail-silent (fail-stop)
       - Byzantine
     - Synchronous system vs. asynchronous system
       - E.g., serial port transmission (synchronous) versus IP packets (asynchronous)
  4. Achieving fault tolerance
     - Redundancy
       - information redundancy
         - Hamming codes, parity memory, ECC memory
       - time redundancy
         - timeout and retransmit
       - physical redundancy
         - TMR, RAID disks, backup servers
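Time redundancy ("timeout and retransmit") can be sketched in a few lines. This is only an illustration: `send_with_retransmit`, `make_flaky_channel`, and the `send` callable are hypothetical names that do not appear in the slides, and a real channel would use actual timers rather than a simulated failure count.

```python
def send_with_retransmit(send, message, max_attempts=3):
    """Call send(message) until it succeeds or attempts run out.

    `send` is a hypothetical callable that returns a reply, or raises
    TimeoutError when no acknowledgement arrives in time.
    Returns (reply, number_of_attempts_used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message), attempt
        except TimeoutError as err:
            last_error = err  # treat as a transient fault and retry
    raise TimeoutError(f"gave up after {max_attempts} attempts") from last_error


def make_flaky_channel(failures):
    """Build a fake channel that times out `failures` times, then acknowledges."""
    state = {"remaining": failures}

    def send(message):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TimeoutError("no acknowledgement")
        return f"ack:{message}"

    return send
```

Retransmission masks transient and some intermittent faults; a permanent fault exhausts the attempts and surfaces as an error.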
  5. How much fault tolerance?
     - 100% fault tolerance cannot be achieved.
       - The closer we wish to get to 100%, the more expensive the system will be.
     - A system is k-fault tolerant if it can withstand k faults.
       - With fail-silent faults, k+1 components are needed: k can fail and one will still be working.
       - With Byzantine faults, 2k+1 components are needed: k can generate false replies, and the remaining k+1 will provide a majority vote.
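The two replica counts above reduce to a small calculation. The function name `replicas_needed` is an assumption made for illustration.

```python
def replicas_needed(k, byzantine=False):
    """Minimum number of components for a k-fault-tolerant system.

    Fail-silent faults: k + 1 (one survivor is enough).
    Byzantine faults:   2k + 1 (the k + 1 correct replies outvote k false ones).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * k + 1 if byzantine else k + 1
```

For example, tolerating one fault takes two components if faults are fail-silent and three if they are Byzantine.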
  6. Active replication
     - Technique for fault tolerance through physical redundancy
     - No redundancy (diagram omitted)
     - Triple Modular Redundancy (TMR) (diagram omitted)
       - Threefold component replication to detect and correct a single component failure
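A TMR voter can be sketched as a majority function over three replica outputs. The names `tmr_vote` and `run_tmr` are hypothetical, and the sketch assumes the outputs are hashable values that can be compared for equality.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value reported by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three disagree: more than one replica has failed.
        raise ValueError("no majority among replicas")
    return value


def run_tmr(replicas, x):
    """Feed the same input to three replica functions and vote on the result."""
    a, b, c = (replica(x) for replica in replicas)
    return tmr_vote(a, b, c)
```

With one faulty replica, for example `run_tmr([lambda x: x * 2, lambda x: x * 2, lambda x: -1], 21)`, the vote still yields the correct result because the two healthy replicas agree.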
  7. Primary backup
     - One server does all the work
     - When it fails, backup takes over
       - Backup may ping primary with "are you alive" messages
     - Simpler design: no need for multicast
     - Works poorly with Byzantine faults
     - Recovery may be time-consuming and/or complex
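The "are you alive" check can be sketched as a backup that promotes itself when the primary has been silent for too long. The `Backup` class, its method names, and the idea of passing the current time in as `now` are all assumptions for illustration; the slides do not prescribe a mechanism.

```python
class Backup:
    """Backup server that takes over when the primary stops answering pings."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence tolerated
        self.last_reply = 0.0       # time of the last "I am alive" reply
        self.is_primary = False

    def on_reply(self, now):
        """Record that the primary answered a ping at time `now`."""
        self.last_reply = now

    def check(self, now):
        """Promote this backup if the primary has been silent too long."""
        if not self.is_primary and now - self.last_reply > self.timeout:
            self.is_primary = True
        return self.is_primary
```

A timeout can only suspect a failure: in an asynchronous system a slow primary looks the same as a dead one, which is one reason recovery can become complex.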
  8. Agreement in faulty systems
     - Two army problem
       - good processors, faulty communication lines
       - coordinated attack
       - multiple acknowledgement problem
  9. Agreement in faulty systems
     - Byzantine Generals problem
       - reliable communication lines, faulty processors
       - n generals head different divisions
       - m generals are traitors and are trying to prevent others from reaching agreement
         - 4 generals agree to attack
         - 4 generals agree to retreat
         - 1 traitor tells the 1st group that he'll attack and tells the 2nd group that he'll retreat
       - can the loyal generals reach agreement?
  10. Agreement in faulty systems
     - Byzantine Generals problem
       - Solutions require:
         - 3m+1 participants for m traitors (2m+1 loyal generals)
         - m+1 rounds of message exchanges
         - O(m^2) messages
       - Costly solution!
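For m = 1 these bounds give 4 participants and 2 rounds. One well-known solution of this shape is the oral-messages algorithm of Lamport, Shostak, and Pease; the slides do not name it, so treating it as the intended solution is an assumption. The sketch below covers only the one-traitor case with one commander and three lieutenants, and the names `om1`, `majority`, and `lie` are hypothetical.

```python
from collections import Counter


def majority(values, default="retreat"):
    """Most common value, falling back to `default` on a tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return default
    return counts[0][0]


def om1(commander_sends, traitors, lie):
    """One-traitor agreement among a commander and three lieutenants.

    commander_sends: dict mapping lieutenant -> order sent by the commander
                     (round 1; a traitorous commander may send different orders).
    traitors:        set of lieutenant names that are traitors.
    lie:             function(sender, receiver, true_value) giving the value
                     a traitorous lieutenant relays in round 2.
    Returns the decision of each loyal lieutenant.
    """
    lieutenants = list(commander_sends)
    decisions = {}
    for me in lieutenants:
        if me in traitors:
            continue
        seen = [commander_sends[me]]          # round 1: order from commander
        for other in lieutenants:
            if other == me:
                continue
            value = commander_sends[other]    # round 2: relayed order
            if other in traitors:
                value = lie(other, me, value)
            seen.append(value)
        decisions[me] = majority(seen)
    return decisions
```

With a loyal commander and one lying lieutenant, the two loyal lieutenants both follow the commander's order. With a traitorous commander who sends mixed orders to three loyal lieutenants, all three see the same set of values and therefore reach the same decision.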
  11. The end.
