2. OUTLINES
Abstract
Introduction
Objectives
Background Theory
Proposed System
System flow of proposed system
Two phases of proposed system
Implementation
Conclusion
3. ABSTRACT
Fault tolerance against computing-node failure is an important and
critical issue in distributed and parallel processing systems.
As the number of computing nodes in a network grows concurrently
and dynamically, node failures occur more often.
This system proposes an application-level, checkpoint-based fault
tolerance approach for distributed computing.
The proposed system uses coordinated checkpointing and
systematic process logging as a global monitoring mechanism.
The proposed system is implemented on a distributed multiple
sequence alignment (MSA) application using a genetic algorithm
(GA).
4. DISTRIBUTED MULTIPLE SEQUENCE ALIGNMENT WITH
GENETIC ALGORITHM (MSAGA)
[Figure: the head node divides the input DNA sequences (2…n) among the worker nodes, each running MSA with GA; the aligned sequence results are combined and the final result is displayed.]
5. SEQUENCE ALIGNMENT EXAMPLE
Input multiple DNA Sequences
>DNAseq1: AAGGAAGGAAGGAAGGAAGGAAGG
>DNAseq2: AAGGAAGGAATGGAAGGAAGGAAGG
>DNAseq3: AAGGAACGGAATGGTAGGAAGGAAGG
Output for aligned DNA Sequences
>DNAseq1: A-AGGA-AGGA-AGGAA-------GG-----AA-GGAAGG
>DNAseq2: ----------------AAGGAAGGAATGGAAGGAAGGAAGG
>DNAseq3: ----------------AAGGAACGGAATGGTAGGAAGGAAGG
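As an illustrative sketch (not part of the slides), a multiple alignment like the one above can be scored with a sum-of-pairs function; the match/mismatch/gap values and the toy alignment here are hypothetical:

```python
from itertools import combinations

# Hypothetical scoring scheme (not from the slides): match +1, mismatch -1,
# and -2 for any pair involving a gap.
MATCH, MISMATCH, GAP = 1, -1, -2

def sum_of_pairs(alignment):
    """Score a multiple alignment column by column over all sequence pairs."""
    score = 0
    for column in zip(*alignment):  # assumes all rows have equal length
        for a, b in combinations(column, 2):
            if a == "-" or b == "-":
                score += GAP
            elif a == b:
                score += MATCH
            else:
                score += MISMATCH
    return score

aligned = ["AAGG-AAGG", "AAGGTAAGG", "AA-GTAAGG"]  # toy alignment, 3 rows
print(sum_of_pairs(aligned))  # prints 15
```

A GA-based aligner like MSAGA typically uses such a column-wise score as the fitness of a candidate alignment.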
6. NODE FAILURE CONDITION
A node failure condition can occur while the worker node connects
to the head node, accepts the input sequence, or sends the resulting
sequence back to the head node. The failure
conditions are:
1. The worker node is denied as soon as it has connected
to the head node, without doing any job.
2. The worker node rejects the input sequence after the head node
and worker node have connected and the head
node has prepared the input sequence for it.
3. The worker node sends a "No Send" message to the head node
after it has accepted the input sequence, instead of sending the result.
4. The worker node crashes when it cannot connect to the head
node with the correct address.
5. The worker node crashes when it disconnects from the head node.
7. COORDINATED CHECKPOINTING
Checkpointing is used as a fault tolerance mechanism in distributed
systems.
A checkpoint is a snapshot of the current state of a process and
assists in monitoring that process.
Coordinated checkpointing takes checkpoints periodically and
saves them in a log file.
This monitoring information is provided at the node failure
condition.
If a node failure occurs in distributed computing, another available
node can reconstruct the process state from the checkpoint
information saved for the failed node.
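A minimal sketch of this snapshot-and-restore idea, assuming a worker whose state fits in a Python dictionary; the state keys and the `.ckpt` file name are illustrative assumptions, not the system's actual format:

```python
import os
import pickle
import tempfile

# Snapshot-and-restore sketch; the state dictionary and the ".ckpt" file
# name are illustrative assumptions, not the system's actual format.
def save_checkpoint(path, state):
    with open(path, "wb") as f:
        pickle.dump(state, f)  # serialize the process-state snapshot

def restore_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)  # another node can rebuild the state from this

state = {"generation": 42, "best_alignment": "AAGG-AAGG"}
path = os.path.join(tempfile.gettempdir(), "worker1.ckpt")
save_checkpoint(path, state)
recovered = restore_checkpoint(path)  # e.g. on a replacement node after failure
print(recovered == state)  # prints True
```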
8. SYSTEMATIC PROCESS LOGGING
Systematic Process Logging (SPL) is derived from a
log-based method.
The motivation for SPL is to reduce the amount of computation
that can be lost, which is bounded by the execution time of a
single failed task.
SPL saves the checkpoint information from coordinated
checkpointing in a log file format, with the exact time and its
contents.
Depending on the fault, it decides which node can take over
the failed node's job using the stored log file.
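A minimal SPL-style sketch, assuming one tab-separated log line per event (the exact log format is not specified in the slides):

```python
import time

# SPL-style entry: exact time plus the checkpoint contents, one line per
# event; the tab-separated layout is an assumption, not the slides' format.
def log_checkpoint(log, worker, checkpoint_name):
    entry = f"{time.strftime('%Y-%m-%d %H:%M:%S')}\t{worker}\t{checkpoint_name}"
    log.append(entry)
    return entry

log = []
log_checkpoint(log, "worker1", "Busy")
log_checkpoint(log, "worker2", "Crush")

# On a fault, scan the log for each worker's last recorded state:
last = {e.split("\t")[1]: e.split("\t")[2] for e in log}
print(last["worker2"])  # prints Crush
```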
9. PROPOSED FAULT TOLERANCE SYSTEM
The checkpoint-based fault tolerance approach is implemented
at the application layer without any operating system
support.
In the distributed multiple sequence alignment application, one head
node and one or more worker nodes are connected by a local
area network.
All worker nodes implement MSAGA and independently align the
input sequences from the head node.
The proposed fault tolerance system takes a local checkpoint at
the MSA process of each computing worker node itself,
and a global checkpoint of all workers' condition events at the
head node.
10. ARCHITECTURE OF PROPOSED FAULT TOLERANCE
SYSTEM
[Figure: the head node, running the GRM and GCS, connects over a local area network to Worker 1, Worker 2, and Worker 3, each with its own LCS and LC.]
GRM – Global Resource Monitor
GCS – Global Checkpoint Storage
LCS – Local Checkpoint Storage
LC – Local Checkpoint
11. SYSTEM FLOW OF PROPOSED SYSTEM
[Figure: the system flow runs from Start through the Load Balancing phase (GRM, HN, GCS) and the Checkpointing phase to End; in the checkpointing phase, coordinated checkpointing between WN and HN (GRM, LC) feeds systematic process logging (GCS, LCS).]
HN – Head Node
WN – Worker Node
12. IMPLEMENTATION OF HEAD NODE
Checkpointing Phase
The global resource monitor (GRM) plays the main role in
both the coordinated checkpointing phase and the systematic process
logging phase.
The GRM takes the global checkpoint of all worker nodes' events
in the coordinated checkpointing phase.
The GCS saves the global checkpoint information in log file
format in the systematic process logging phase.
13. GLOBAL CHECKPOINT
Global Resource Monitor (GRM)
Begin
1. Taking global checkpoints of the current condition of each WN,
with the WN's IP, port, status, and time duration
2. Detecting the failure condition of WNs
3. Finding the available worker nodes and deciding which node
is suitable to continue the failed WN's jobs
End
14. TYPES OF CHECKPOINT
No  Checkpoint Name  Checkpoint Content
1   Available        Worker node is connected to the head node and waits for jobs from the head node
2   Denied           Worker node is disconnected from the head node
3   Busy             Worker node is processing the jobs
4   Receive          Worker node sends the result to the head node and exits, or sends an error message and exits
5   Crush            Worker node sends the Crush message to the head node
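The five checkpoint types in the table could be represented as an enum; this Python sketch is an illustrative assumption, not the system's actual code:

```python
from enum import Enum

# The five checkpoint types from the table as a hypothetical Python enum.
class Checkpoint(Enum):
    AVAILABLE = 1  # connected to the head node, waiting for a job
    DENIED = 2     # disconnected from the head node
    BUSY = 3       # processing a job
    RECEIVE = 4    # result (or error message) sent, worker exits
    CRUSH = 5      # worker sent the Crush message

print(Checkpoint.BUSY.name, Checkpoint.BUSY.value)  # prints BUSY 3
```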
15. CHECKPOINT INFORMATION
For each checkpoint, five fields are described:
Worker Type, to show the worker number,
IP Address, to identify the WN,
Checkpoint Name, to show the worker node's condition,
Current Time, to show the process's current time,
Time Duration, to show the time from each worker's
running state to its accept-and-receive state, or from its
running state to its reject state.

Worker Type | IP Address | Checkpoint Name | Current Time | Time Duration
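A hypothetical record holding the five fields above (the field names and types are assumptions):

```python
from dataclasses import dataclass

# Hypothetical record matching the five checkpoint fields in the slide;
# field names and types are assumptions.
@dataclass
class CheckpointRecord:
    worker_type: int       # worker number
    ip_address: str
    checkpoint_name: str   # Available / Denied / Busy / Receive / Crush
    current_time: str
    time_duration: float   # seconds from running to accept/receive or reject

rec = CheckpointRecord(1, "192.168.1.10", "Busy", "10:15:32", 4.2)
print(rec.checkpoint_name)  # prints Busy
```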
16. AVAILABLE CHECKPOINT OF ALL WORKERS
The GRM takes the checkpoint as Available when all worker nodes
are connected to the head node.
17. CHECKPOINT CHANGES FROM AVAILABLE
GlobalCheckpoint_Available ( )
Begin
1. IF HN and WNs are connected THEN
       GRM takes the checkpoint as Available
   END IF
2. IF the checkpoint is Available THEN
       IF the WN stays connected to the HN THEN
           HN selects a sequence and sends it to the WN
           IF the WN does not accept the sequence THEN
               GRM takes the checkpoint as Crush
               The sequence goes to the crush queue
           ELSE
               GRM takes the checkpoint as Busy
               WN runs the MSA application
           END IF
       ELSE
           GRM takes the checkpoint as Denied
       END IF
   END IF
End
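The Available-state transitions above can be sketched as a single function; the function name, boolean arguments, and the crush queue list are illustrative assumptions:

```python
# Sketch of the Available-state transitions; the function name, boolean
# arguments, and the crush queue list are illustrative assumptions.
def global_checkpoint_available(connected, accepts_sequence, crush_queue, sequence):
    """Return the checkpoint the GRM would record for one worker."""
    if not connected:
        return "Denied"
    if not accepts_sequence:
        crush_queue.append(sequence)  # re-queue the job for another worker
        return "Crush"
    return "Busy"  # the worker goes on to run the MSA application

crush_queue = []
print(global_checkpoint_available(True, False, crush_queue, "DNAseq1"))  # prints Crush
print(crush_queue)  # prints ['DNAseq1']
```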
20. CHECKPOINT CHANGES FROM BUSY
GlobalCheckpoint_Busy ( )
Begin
1. IF the WN has accepted the input sequence from the HN THEN
       GRM takes the checkpoint as Busy
   END IF
2. IF the checkpoint is Busy THEN
       IF the WN sends an error message to the HN THEN
           GRM takes the checkpoint as Receive for the error
       ELSE
           GRM takes the checkpoint as Receive for the result
       END IF
   END IF
End
23. GLOBAL CHECKPOINT STORAGE(GCS)
Global_Checkpoint_Storage ( )
Begin
1. GCS stores the current condition of all WNs in the network
as checkpoints taken by the GRM
2. GCS records the detailed condition of each WN
3. GCS creates a log file for all checkpoints of the nodes
End
25. LOAD BALANCING PHASE
GRM_LoadBalancing( )
BEGIN
IF GRM detects Denied, Crush, or Receive with "No Send" THEN
1. These conditions are assumed to be worker node failures.
2. The GRM finds an available node using the GCS and decides
which node is suitable to receive the job.
3. If one is found, the HN sends the failed node's jobs to that
available node.
4. Call the Available and Busy algorithms
END IF
END
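A hedged sketch of this load-balancing decision, assuming the GRM keeps each worker's last checkpoint in a dictionary (all names here are illustrative assumptions):

```python
# Load-balancing sketch: reassign a failed worker's job to a worker whose
# last checkpoint is Available; all names here are illustrative assumptions.
def reassign_failed_jobs(last_checkpoint, jobs):
    """Map each failed worker's job to an available worker, if any."""
    failed = [w for w, c in last_checkpoint.items() if c in ("Denied", "Crush")]
    available = [w for w, c in last_checkpoint.items() if c == "Available"]
    plan = {}
    for worker in failed:
        if worker in jobs and available:
            plan[jobs[worker]] = available.pop(0)  # hand the job over
    return plan

last = {"worker1": "Crush", "worker2": "Available", "worker3": "Busy"}
jobs = {"worker1": "DNAseq2"}
print(reassign_failed_jobs(last, jobs))  # prints {'DNAseq2': 'worker2'}
```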
29. IMPLEMENTATION OF WORKER NODE
The worker node aligns the DNA sequences using the MSAGA
application.
The worker node takes the local checkpoint at the application level
of MSAGA.
The worker node implements the checkpointing phase of the proposed
fault tolerance system.
The local checkpoint (LC) and the local checkpoint storage
(LCS) play the main role in that phase.
Every worker node makes local checkpoints and has its own
local checkpoint storage.
The local checkpoint (LC) takes all checkpoints of each worker node.
The local checkpoint storage (LCS) stores one
worker's processing state.
30. LOCAL CHECKPOINT
The local checkpoint (LC) is responsible for taking the local checkpoint
of the worker's process states.
The LC starts taking checkpoints of the worker's
processing state when the worker node (WN) connects to the head
node.
This responsibility continues until all the worker's
processes have finished normally, or the worker exits from the local
area network because of a node failure.
31. LOCAL CHECKPOINT OF EACH WORKER
LocalCheckpoint( )
BEGIN
1. Record the WN's starting time, ending time, and connection time
2. Record all process states of MSA for the sequence
END
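An illustrative local-checkpoint recorder for one worker, matching the two recording steps above (the class and method names are assumptions):

```python
import time

# Illustrative local-checkpoint recorder for one worker; the class and
# method names are assumptions matching the two recording steps above.
class LocalCheckpoint:
    def __init__(self):
        self.start_time = None
        self.end_time = None
        self.states = []  # MSA process states for the current sequence

    def start(self):
        self.start_time = time.time()  # WN starting/connection time

    def record(self, state):
        self.states.append(state)  # one MSA process state

    def finish(self):
        self.end_time = time.time()  # WN ending time
        return self.end_time - self.start_time

lc = LocalCheckpoint()
lc.start()
lc.record("generation 1 aligned")
duration = lc.finish()
print(len(lc.states), duration >= 0)  # prints 1 True
```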
32. LOCAL CHECKPOINT STORAGE(LCS)
SPL produces a checkpoint log file and a processing log file for
the local condition of each node.
All local checkpoint monitoring information is stored in the
local checkpoint storage (LCS).
Each WN stores its own corresponding LCS.
LocalCheckpointStorage( )
BEGIN
1. Store WN Starting time, Ending time and
connection time
2. Store all process states of MSA for the sequence
END
34. CONCLUSION
The GRM does not take wrong checkpoints for any number of
worker nodes.
The GRM can exactly distinguish an old worker node from a new
worker node when a worker node connects to the head node
again.
While the GRM takes the checkpoint for one worker node, the
remaining workers do not need to stop their operation; therefore,
worker nodes are never blocked.
This approach lets the distributed multiple sequence
alignment processing operate continuously and obtain the final
result even when node failures occur within the network.
This system computes the exact time of each worker node and
the whole system's execution time. It provides a portable
checkpoint feature and does not need any operating system
support.