APPLICATION LEVEL CHECKPOINT-BASED
APPROACH FOR CRUSH FAILURE IN
DISTRIBUTED SYSTEM
Presented By
Moh Moh Khaing
OUTLINE
- Abstract
- Introduction
- Objectives
- Background Theory
- Proposed System
- System Flow of Proposed System
- Two Phases of Proposed System
- Implementation
- Conclusion
ABSTRACT
- Fault tolerance against computing-node failure is a critical issue in distributed and parallel processing systems.
- As the number of computing nodes in a network grows concurrently and dynamically, node failures become more frequent.
- This work proposes an application-level, checkpoint-based fault tolerance approach for distributed computing.
- The proposed system uses coordinated checkpointing and systematic process logging as its global monitoring mechanism.
- The proposed system is implemented on a distributed multiple sequence alignment (MSA) application that uses a genetic algorithm (GA).
DISTRIBUTED MULTIPLE SEQUENCE ALIGNMENT WITH GENETIC ALGORITHM (MSAGA)
[Figure: the head node divides the input DNA sequences (2 … n) among the worker nodes; each worker runs MSA with GA and returns an aligned sequence result; the head node combines the alignment results and displays the result.]
SEQUENCE ALIGNMENT EXAMPLE
Input: multiple DNA sequences
>DNAseq1: AAGGAAGGAAGGAAGGAAGGAAGG
>DNAseq2: AAGGAAGGAATGGAAGGAAGGAAGG
>DNAseq3: AAGGAACGGAATGGTAGGAAGGAAGG
Output: aligned DNA sequences
>DNAseq1: A-AGGA-AGGA-AGGAA-------GG-----AA-GGAAGG
>DNAseq2: ----------------AAGGAAGGAATGGAAGGAAGGAAGG
>DNAseq3: ----------------AAGGAACGGAATGGTAGGAAGGAAGG
NODE FAILURE CONDITIONS
- A node failure can occur at any of three points: when the worker node connects to the head node, when it accepts the input sequence, or when it sends the resulting sequence back to the head node. The failure conditions are:
1. The worker node is denied as soon as it has connected to the head node, before it has done any job.
2. The worker node rejects the input sequence after the connection has been established and the head node has prepared the input sequence for it.
3. The worker node sends a “No Send” message to the head node at the stage where it should deliver the result sequence.
4. The worker node crashes when it cannot connect to the head node at the correct address.
5. The worker node crashes when it disconnects from the head node.
COORDINATED CHECKPOINTING
- Checkpointing is used as a fault tolerance mechanism in distributed systems.
- A checkpoint is a snapshot of the current state of a process; it also assists in monitoring that process.
- Coordinated checkpointing takes checkpoints periodically and saves them in a log file.
- This monitoring information is used when a node failure occurs.
- If a node fails, another available node can reconstruct the process state from the checkpoint information saved for the failed node.
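The save-and-reconstruct idea can be sketched in a few lines of Python. This is a minimal illustration, not the presented implementation: the JSON-lines log format, the function names `save_checkpoint` and `restore_latest`, and the `generation` field are all assumptions made for the example.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Append one timestamped snapshot of a task's state to a log file."""
    record = {"time": time.time(), "state": state}
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")


def restore_latest(path):
    """Return the most recently saved state, or None if nothing was logged."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as log:
        lines = [line for line in log if line.strip()]
    return json.loads(lines[-1])["state"] if lines else None


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "worker1.ckpt")
    # A worker snapshots its progress periodically (hypothetical fields).
    save_checkpoint(log_path, {"job": "DNAseq1", "generation": 10})
    save_checkpoint(log_path, {"job": "DNAseq1", "generation": 20})
    # After a failure, another node resumes from the last snapshot.
    print(restore_latest(log_path))
```

Appending rather than overwriting keeps the full history, so the same file can double as monitoring information.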
SYSTEMATIC PROCESS LOGGING
- Systematic Process Logging (SPL) is derived from a log-based method.
- The motivation for SPL is to reduce the amount of computation that can be lost, which is bounded by the execution time of a single failed task.
- SPL saves the checkpoint information from coordinated checkpointing in log-file format, with the exact time and contents of each checkpoint.
- Depending on the fault, the stored log file is used to decide which node can take over the job from the failed node.
PROPOSED FAULT TOLERANCE SYSTEM
- The checkpoint-based fault tolerance approach is implemented at the application layer, without any operating system support.
- In the distributed multiple sequence alignment application, one head node and one or more worker nodes are connected over a local area network.
- Every worker node runs MSAGA and aligns the input sequences it receives from the head node independently.
- Each worker node takes local checkpoints of its own MSA process, and the head node takes global checkpoints of events in all workers' conditions.
ARCHITECTURE OF PROPOSED FAULT TOLERANCE SYSTEM
[Figure: the head node, hosting the GRM and GCS, is connected over a local area network to Worker 1, Worker 2, and Worker 3, each with its own LCS and LC.]
GRM – Global Resource Monitor
GCS – Global Checkpoint Storage
LCS – Local Checkpoint Storage
LC – Local Checkpoint
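SYSTEM FLOW OF PROPOSED SYSTEM
[Figure: flowchart between Start and End with two phases. Checkpointing Phase: Coordinated Checkpointing (GRM on the HN, LC on the WN) and Systematic Process Logging (GCS on the HN, LCS on the WN). Load Balancing Phase: GRM and GCS on the HN.]
HN – Head Node
WN – Worker Node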
IMPLEMENTATION OF HEAD NODE
Checkpointing Phase
- The Global Resource Monitor (GRM) plays the main role in both the coordinated checkpointing phase and the systematic process logging phase.
- In the coordinated checkpointing phase, the GRM takes global checkpoints of the events of all worker nodes.
- In the systematic process logging phase, the GCS saves the global checkpoint information in log-file format.
GLOBAL CHECKPOINT
Global Resource Monitor (GRM)
Begin
  1. Take global checkpoints of the current condition of each WN, recording the WN's IP, port, status, and time duration
  2. Detect failure conditions of WNs
  3. Find the available worker nodes and decide which node is suitable to continue the failed WN's jobs
End
TYPES OF CHECKPOINT
No.  Checkpoint Name  Checkpoint Content
1    Available        Worker node is connected to the head node and waits for jobs from the head node
2    Denied           Worker node is disconnected from the server
3    Busy             Worker node is processing jobs
4    Receive          Worker node sends the result to the head node and exits, or sends an error message and exits
5    Crush            Worker node sends the crush message to the head node
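The five checkpoint types map naturally onto an enumeration. The following Python sketch is illustrative only; the class name `Checkpoint` and the helper `can_accept_job` are hypothetical, and the rule that only Available workers may take a job is an assumption drawn from the table's wording.

```python
from enum import Enum


class Checkpoint(Enum):
    """The five checkpoint names from the table, keyed by checkpoint number."""
    AVAILABLE = 1  # connected to the head node, waiting for a job
    DENIED = 2     # disconnected from the head node
    BUSY = 3       # processing a job
    RECEIVE = 4    # returned a result, or an error message, and exited
    CRUSH = 5      # sent the crush message to the head node


def can_accept_job(checkpoint):
    # Assumption: only a worker in the Available state may be given a job.
    return checkpoint is Checkpoint.AVAILABLE


if __name__ == "__main__":
    for cp in Checkpoint:
        print(cp.value, cp.name.capitalize(), can_accept_job(cp))
```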
CHECKPOINT INFORMATION
- Each checkpoint records five fields:
  - Worker Type: the worker number
  - IP Address: identifies the WN
  - Checkpoint Name: the worker node's condition
  - Current Time: the current time of the process
  - Time Duration: the time elapsed from a worker's running state to its accept and receive states, or from its running state to its reject state
Log columns: Worker Type | IP Address | Checkpoint Name | Current Time | Time Duration
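One way to picture a checkpoint entry is as a small record type with the five fields above. This Python sketch is an assumption-laden illustration: the class name `CheckpointRecord`, the tab-separated line format, and the sample IP address and date are invented for the example and do not come from the presented system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CheckpointRecord:
    worker_type: str          # worker number, e.g. "Worker 1"
    ip_address: str           # identifies the WN
    checkpoint_name: str      # Available, Denied, Busy, Receive, or Crush
    current_time: datetime    # when the checkpoint was taken
    time_duration: timedelta  # elapsed time between the worker's states

    def to_log_line(self):
        """Render the record as one tab-separated line (format is assumed)."""
        return "\t".join([
            self.worker_type,
            self.ip_address,
            self.checkpoint_name,
            self.current_time.isoformat(sep=" ", timespec="seconds"),
            str(self.time_duration.total_seconds()),
        ])


if __name__ == "__main__":
    record = CheckpointRecord(
        "Worker 1", "192.168.0.11", "Busy",
        datetime(2017, 1, 1, 9, 30, 0), timedelta(seconds=12.5),
    )
    print(record.to_log_line())
```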
AVAILABLE CHECKPOINT OF ALL WORKERS
- The GRM records the Available checkpoint when all worker nodes are connected to the head node.
CHECKPOINT CHANGES FROM AVAILABLE
GlobalCheckpoint_Available ( )
Begin
  1. IF HN and WNs are connected THEN
       GRM takes checkpoint as Available
     END IF
  2. IF checkpoint is Available THEN
       IF WN is continuously connected to HN THEN
         HN selects a sequence and sends it to the WN
         IF WN does not accept the sequence THEN
           GRM takes checkpoint as Crush
           The sequence goes to the crush queue
         ELSE
           GRM takes checkpoint as Busy
           WN runs the MSA application
         END IF
       ELSE
         GRM takes checkpoint as Denied
       END IF
     END IF
End
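Step 2 of the pseudocode above is a three-way decision, which the following Python sketch makes explicit. The function name, its boolean parameters, and the use of a `deque` for the crush queue are assumptions for illustration; the presented system's actual interfaces are not shown in the slides.

```python
from collections import deque


def checkpoint_from_available(still_connected, accepts_sequence, sequence, crush_queue):
    """Return the next checkpoint name for a worker whose checkpoint is Available."""
    if not still_connected:
        return "Denied"
    # At this point the head node has selected `sequence` and sent it to the worker.
    if not accepts_sequence:
        crush_queue.append(sequence)  # keep the job so another worker can take it
        return "Crush"
    return "Busy"  # the worker now runs the MSA application


if __name__ == "__main__":
    queue = deque()
    print(checkpoint_from_available(True, True, "DNAseq1", queue))    # Busy
    print(checkpoint_from_available(True, False, "DNAseq2", queue))   # Crush
    print(checkpoint_from_available(False, False, "DNAseq3", queue))  # Denied
    print(list(queue))  # ['DNAseq2']
```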
DETECTING NODE FAILURE BY GRM
BUSY CHECKPOINT OF ALL WORKERS
CHECKPOINT CHANGES FROM BUSY
GlobalCheckpoint_Busy ( )
Begin
  1. IF WN accepted the input sequence from HN THEN
       GRM takes checkpoint as Busy
     END IF
  2. IF the checkpoint is Busy THEN
       IF WN sends an error message to HN THEN
         GRM takes checkpoint as Receive for error
       ELSE
         GRM takes checkpoint as Receive for result
       END IF
     END IF
End
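The Busy transition reduces to classifying whatever the worker sends back. In this small Python sketch, the function name and the convention that an error arrives as the literal text "No Send" are assumptions made for the example.

```python
def checkpoint_from_busy(message):
    """Classify the message a Busy worker sends back to the head node.

    Assumption: an error is reported as the literal text "No Send";
    anything else is treated as an aligned-sequence result.
    """
    if message == "No Send":
        return ("Receive", "error")
    return ("Receive", "result")


if __name__ == "__main__":
    print(checkpoint_from_busy("No Send"))         # ('Receive', 'error')
    print(checkpoint_from_busy("AAGG-AAGG--AAGG"))  # ('Receive', 'result')
```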
RECEIVE CHECKPOINT WITH RESULT
RECEIVE CHECKPOINT WITH NO SEND MESSAGE
GLOBAL CHECKPOINT STORAGE (GCS)
Global_Checkpoint_Storage ( )
Begin
  1. GCS stores the current condition of every WN in the network, as checkpointed by the GRM
  2. GCS records the detailed condition of each WN
  3. Create the GCS log file for all checkpoints of the nodes
End
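A storage component of this kind could keep the latest condition per worker in memory and append every change to a log file. The Python sketch below is a hedged illustration: the class name `GlobalCheckpointStorage`, its methods, and the space-separated log format are hypothetical, not the system's actual design.

```python
import os
import tempfile
from datetime import datetime


class GlobalCheckpointStorage:
    """Keeps each worker's latest checkpoint and appends every change to a log file."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.latest = {}  # worker IP -> latest checkpoint name

    def store(self, worker_ip, checkpoint_name, when=None):
        when = when or datetime.now()
        self.latest[worker_ip] = checkpoint_name
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(f"{when.isoformat(timespec='seconds')} {worker_ip} {checkpoint_name}\n")

    def workers_with(self, checkpoint_name):
        """Return the workers whose latest checkpoint matches, in sorted order."""
        return sorted(ip for ip, name in self.latest.items() if name == checkpoint_name)


if __name__ == "__main__":
    gcs = GlobalCheckpointStorage(os.path.join(tempfile.mkdtemp(), "gcs.log"))
    gcs.store("192.168.0.11", "Available")
    gcs.store("192.168.0.12", "Available")
    gcs.store("192.168.0.11", "Busy")
    print(gcs.workers_with("Available"))  # ['192.168.0.12']
```

Keeping the in-memory view separate from the append-only log lets the monitor answer "who is available?" quickly while the file preserves the full history.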
GCS LOG FILE
LOAD BALANCING PHASE
GRM_LoadBalancing ( )
BEGIN
  IF GRM detects Denied, Crush, or Receive “No Send” THEN
    1. Treat the condition as a worker node failure
    2. GRM finds the available nodes using GCS and decides which node is suitable to receive the job
    3. If a suitable node exists, HN sends the failed node's jobs to that available node
    4. Call the Available and Busy algorithms
  END IF
END
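The reassignment step can be sketched as follows in Python. The slides do not say how a "suitable" node is chosen, so the policy here (take the first Available worker in sorted order) is purely an assumption, as are the function name `rebalance`, the dictionary shapes, and the label "Receive (No Send)".

```python
FAILURE_CHECKPOINTS = {"Denied", "Crush", "Receive (No Send)"}


def rebalance(latest, pending_jobs):
    """Move jobs from failed workers to Available ones.

    latest: worker -> checkpoint name; pending_jobs: worker -> job.
    Returns {job: new_worker}; jobs with no Available worker stay unassigned.
    """
    available = sorted(w for w, cp in latest.items() if cp == "Available")
    assignments = {}
    for worker, checkpoint in sorted(latest.items()):
        if checkpoint in FAILURE_CHECKPOINTS and worker in pending_jobs and available:
            target = available.pop(0)  # assumed policy: first Available worker
            assignments[pending_jobs[worker]] = target
            latest[target] = "Busy"    # the target now works on the reassigned job
    return assignments


if __name__ == "__main__":
    latest = {"W1": "Crush", "W2": "Available", "W3": "Busy", "W4": "Denied"}
    pending = {"W1": "DNAseq1", "W4": "DNAseq4"}
    print(rebalance(latest, pending))  # {'DNAseq1': 'W2'}
    print(latest["W2"])                # Busy
```

In this run only one worker is Available, so the second failed job stays unassigned until another worker reports in.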
LOAD BALANCING ACCORDING TO NODE FAILURE AS DENIED CHECKPOINT
LOAD BALANCING ACCORDING TO NODE FAILURE AS CRUSH CHECKPOINT
LOAD BALANCING ACCORDING TO NODE FAILURE AS RECEIVE CHECKPOINT (NO SEND)
IMPLEMENTATION OF WORKER NODE
- The worker node processes DNA sequences into aligned sequences using the MSAGA application.
- The worker node takes local checkpoints at the application level of MSAGA.
- The worker node implements the checkpointing phase of the proposed fault tolerance system.
- The local checkpoint (LC) and the local checkpoint storage (LCS) play the main roles in that phase.
- Every worker node takes its own local checkpoints and has its own local checkpoint storage.
- The LC takes all checkpoints of its worker node.
- The LCS stores the processing state of its worker.
LOCAL CHECKPOINT
- The local checkpoint (LC) is responsible for taking local checkpoints of the worker's process states.
- The LC starts taking checkpoints of the worker's processing state when the worker node (WN) connects to the head node.
- The LC continues until the worker's processes finish normally or the worker leaves the local area network because of a node failure.
LOCAL CHECKPOINT OF EACH WORKER
LocalCheckpoint ( )
BEGIN
  1. Record WN starting time, ending time, and connection time
  2. Record all process states of MSA for the sequence
END
LOCAL CHECKPOINT STORAGE (LCS)
- SPL produces a checkpoint log file and a processing log file for the local condition of each node.
- All local checkpoint monitoring information is therefore stored in the local checkpoint storage (LCS).
- Each LCS is stored on its corresponding WN.
LocalCheckpointStorage ( )
BEGIN
  1. Store WN starting time, ending time, and connection time
  2. Store all process states of MSA for the sequence
END
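On the worker side, recording and storing can be kept as two small steps. The Python sketch below is illustrative only: the class `LocalCheckpoint`, the function `store_local_checkpoint`, the JSON file format, and the `generation` and `best_score` fields are all assumptions, since the slides do not show what an MSA process state contains.

```python
import json
import os
import tempfile
import time


class LocalCheckpoint:
    """Records a worker's timing marks and MSA process states in memory."""

    def __init__(self):
        self.times = {}   # "connection", "start", "end" -> timestamp
        self.states = []  # process states of the MSA run, in order

    def mark(self, event, when=None):
        self.times[event] = time.time() if when is None else when

    def record_state(self, state):
        self.states.append(state)


def store_local_checkpoint(lc, path):
    """Write the recorded times and states to the worker's own storage file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"times": lc.times, "states": lc.states}, f)


if __name__ == "__main__":
    lc = LocalCheckpoint()
    lc.mark("connection")
    lc.mark("start")
    lc.record_state({"generation": 1, "best_score": 42})  # hypothetical state
    lc.mark("end")
    path = os.path.join(tempfile.mkdtemp(), "worker1_lcs.json")
    store_local_checkpoint(lc, path)
    with open(path, encoding="utf-8") as f:
        print(sorted(json.load(f)["times"]))  # ['connection', 'end', 'start']
```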
LCS LOG FILE
CONCLUSION
- The GRM does not take wrong checkpoints, whatever the number of worker nodes.
- The GRM can distinguish exactly between an old worker node and a new one when a worker node reconnects to the head node.
- While the GRM takes a checkpoint for one worker node, the remaining workers do not need to stop their operation, so worker nodes are never blocked.
- With this approach, distributed multiple sequence alignment can continue to run and produce the final result when a node failure occurs in the network.
- The system computes the exact execution time of each worker node and of the whole system. Its checkpoints are portable and need no operating system support.
THANK YOU!!