In this presentation, I provide in-depth information about how MapReduce works. It covers the execution steps, fault tolerance, and master/worker responsibilities.
2. Contents
Computation Models for Distributed Computing
MPI
MapReduce
Why MapReduce?
How MapReduce works
Simple example
References
3. Distributed Computing
Why?
Rapid growth in big data generation (social media, e-commerce, banks, etc.)
Big data, machine learning, data mining and AI have become bread and butter: better results
come from analyzing bigger data sets.
How does it work?
Data partitioning: divide the data into segments, one per task; every task runs the same procedure
(computation) on its own data segment in a given phase
Task partitioning: assign different tasks to different computation units
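The two ideas above can be sketched in a few lines of Python. This is only an illustration: the function names (`partition`, `sum_of_squares`, `run_partitioned`) are made up for this example, and threads on one machine stand in for the computation units, which in a real cluster would be separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_tasks):
    """Data partitioning: cut `data` into num_tasks contiguous segments of near-equal size."""
    size, extra = divmod(len(data), num_tasks)
    segments, start = [], 0
    for i in range(num_tasks):
        # The first `extra` segments get one additional element.
        end = start + size + (1 if i < extra else 0)
        segments.append(data[start:end])
        start = end
    return segments


def sum_of_squares(segment):
    """The same procedure that every task applies to its own segment."""
    return sum(x * x for x in segment)


def run_partitioned(data, num_tasks=4):
    segments = partition(data, num_tasks)
    # Task partitioning: each segment's task is handed to a different worker thread
    # (an assumption of this sketch; real systems use different machines).
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        partial_results = list(pool.map(sum_of_squares, segments))
    return sum(partial_results)
```

For example, `run_partitioned(list(range(10)), num_tasks=3)` splits the ten numbers into three segments and combines the three partial sums.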
Hardware for distributed computing
Multiple processors (multi-core processors)
4. Metrics
How to judge the suitability of a computational model?
Simplicity: the level of developer experience required
Scalability: adding more computational nodes increases throughput or improves response time
Fault tolerance: support for recovering computed results when a node goes down
Maintainability: how easy it is to fix bugs and add features
Cost: whether special hardware is needed (multi-core processors, large RAM, InfiniBand) or a
common Ethernet cluster of commodity machines can be used
No one size fits all
Sometimes it is better to use hybrid computational models
5. MPI (Message Passing Interface)
● Workload is divided among different
processes (each process may have
multiple threads)
● Communication is via Message passing
● Data exchange is via shared memory
(Physical / Virtual)
● Pros
○ Flexibility: the programmer can customize messages and communication between nodes
○ Speed: relies on sharing data via memory
Source :
https://computing.llnl.gov/tutorials/mpi/
6. MapReduce
Objective: design a scalable parallel programming
framework to be deployed on
a large cluster of commodity machines
Data is divided into splits, each processed
by a map function, whose output is
processed by reduce functions.
Originated at Google Inc., with the first practical
implementation published in 2004
MapReduce implementations
Apache Hadoop (Computation)
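As an illustration of the map/reduce idea, here is a minimal single-process word count in Python. It is a sketch only: `map_fn`, `reduce_fn` and `run_job` are names chosen for this example, and nothing here is distributed or taken from the Hadoop API.

```python
from collections import defaultdict


def map_fn(_split_id, text):
    """Map: emit (word, 1) for every word in one input split."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: add up all the counts emitted for one word."""
    return word, sum(counts)


def run_job(splits):
    """Run every map call, group the intermediate values by key, then reduce each key."""
    intermediate = defaultdict(list)
    for split_id, text in enumerate(splits):
        for key, value in map_fn(split_id, text):
            intermediate[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(intermediate.items()))
```

Calling `run_job(["a b a", "b c"])` treats each string as one split and returns the total count per word.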
8. MapReduce - Execution (2)
Platform:
Nodes communicating over an Ethernet network using TCP/IP
Two main types of processes:
Master: orchestrates the work
Worker: processes data
Units of work:
Job: a MapReduce job is a unit of work that the client wants to be performed; it consists of the
input data, the MapReduce program, and configuration.
Task: either a map task (processes input into intermediate data) or a reduce task (processes intermediate
data into output). A job is divided into several map and reduce tasks.
9. MapReduce Execution (3)
1. A copy of the master process is created.
2. Input data is divided into M splits, each of 16 to 64 MB (user configurable).
3. M map tasks are created and given unique IDs; each parses the key/value pairs
of its split and starts processing, and its output is written to a memory buffer.
4. Map output is partitioned into R partitions. When the buffers are full, they are
spilled to the local hard disk, and the master is notified of the locations of the
saved buffers. All records with the same key are put in the same partition.
Note: map output is stored in the worker's local file system, not in the distributed file system, because it is intermediate
data and to avoid complexity.
5. Shuffling: when a reduce worker receives a notification from the master that one of …
10. MapReduce Execution (4)
6. When the reduce worker has received all of its intermediate output, it sorts it by
key (sorting is needed because a reduce task may handle several keys). (1)
7. When sorting has finished, the reduce worker iterates over the keys, passing each
key and its list of values to the reduce function.
8. The output of the reduce function is appended to the output file of this
reduce worker.
9. For each HDFS block of the reduce task's output, one replica is stored locally
on the reduce worker and the other two (assuming a replication factor of 3) are
stored on two other off-rack nodes for reliability.
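Steps 6 to 8 on the reduce side can be sketched as follows. This is a simplified illustration, not the Hadoop or Google implementation: `run_reduce_task` is a hypothetical name, the fetched intermediate data is assumed to fit in memory, and a Python list stands in for the output file.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(fetched_pairs, reduce_fn):
    """Sort the fetched (key, value) pairs, group them by key, and reduce each group."""
    # Step 6: sort by key so that all values of one key become adjacent.
    ordered = sorted(fetched_pairs, key=itemgetter(0))
    output = []  # stands in for the reduce worker's output file
    # Step 7: iterate over each key with its list of values.
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        # Step 8: append the result of the reduce function to the output.
        output.append((key, reduce_fn(key, values)))
    return output
```

For instance, `run_reduce_task([("b", 1), ("a", 2), ("b", 3)], lambda key, values: sum(values))` reduces the two keys in sorted order.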
■ Notes:
11. Master responsibilities
Find idle nodes (workers) and assign
map and reduce tasks to them.
Monitor the status of each task
(idle, in-progress, finished).
Keep track of the locations of the R
intermediate output partitions on each map
worker machine.
Keep a record of worker IDs and
other info (CPU, memory, disk size).
Continuously push information about
intermediate map output to reduce …
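The bookkeeping listed on this slide could look roughly like the sketch below. All names (`State`, `Task`, `Master` and its methods) are hypothetical. The sketch also simplifies one point: reducers ask for the locations through `locations_for_reducer`, whereas the slide describes the master pushing that information.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    FINISHED = "finished"


@dataclass
class Task:
    task_id: str
    kind: str  # "map" or "reduce"
    state: State = State.IDLE
    worker_id: Optional[str] = None
    # For finished map tasks: partition index -> file location on the map worker.
    output_locations: dict = field(default_factory=dict)


class Master:
    def __init__(self, tasks, workers):
        self.tasks = {task.task_id: task for task in tasks}
        self.idle_workers = list(workers)

    def assign_next(self):
        """Give the first idle task to the first idle worker; return the task or None."""
        if not self.idle_workers:
            return None
        for task in self.tasks.values():
            if task.state is State.IDLE:
                task.worker_id = self.idle_workers.pop(0)
                task.state = State.IN_PROGRESS
                return task
        return None

    def map_finished(self, task_id, locations):
        """Record where a finished map task stored its R intermediate partitions."""
        task = self.tasks[task_id]
        task.state = State.FINISHED
        task.output_locations = dict(locations)
        self.idle_workers.append(task.worker_id)

    def locations_for_reducer(self, partition):
        """List (worker, file) pairs that the reduce task for `partition` has to fetch."""
        return [
            (task.worker_id, task.output_locations[partition])
            for task in self.tasks.values()
            if task.kind == "map"
            and task.state is State.FINISHED
            and partition in task.output_locations
        ]
```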
12. Fault tolerance (1)
Objective: handle machine failures gracefully,
i.e. the programmer does not need to handle
them or be aware of the details.
Two types of failure:
Master failure
Worker failure
Two main activities:
Failure detection
Recover lost (computed) data with least …
13. Fault tolerance (2)
Worker failure
Detection: when a master ping times out, the worker is marked as failed.
Remove the worker from the list of available workers.
For all map tasks assigned to that worker:
mark these tasks as idle
these tasks become eligible for rescheduling on other workers
map tasks are re-executed because their output is stored in the local file system of the failed machine
all reduce workers are notified of the re-execution so they can fetch the intermediate data they
do not have yet.
No need to re-execute reduce tasks, as their output is stored in the distributed file system and …
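A rough sketch of worker-failure handling, under stated assumptions: the function names and the dictionary layout of `tasks` are invented for this example, and resetting in-progress reduce tasks of the failed worker is not stated on the slide; it follows the description in the original MapReduce paper.

```python
def detect_failed_workers(last_reply, now, timeout):
    """Return the workers whose last ping reply is older than `timeout` seconds."""
    return [worker for worker, seen in last_reply.items() if now - seen > timeout]


def handle_worker_failure(failed_worker, tasks, available_workers):
    """Reset the tasks lost with a failed worker and return their ids.

    tasks: dict of task_id -> {"kind": "map" | "reduce", "state": str, "worker": str | None}
    available_workers: set of worker ids (modified in place)
    """
    available_workers.discard(failed_worker)
    rescheduled = []
    for task_id, task in tasks.items():
        if task["worker"] != failed_worker:
            continue
        # Map output lives on the failed machine's local disk, so every map task
        # of that worker (finished or not) goes back to idle.
        # Assumption (from the original paper): in-progress reduce tasks are reset too;
        # finished reduce tasks keep their output in the distributed file system.
        if task["kind"] == "map" or task["state"] == "in-progress":
            task["state"] = "idle"
            task["worker"] = None
            rescheduled.append(task_id)
    return rescheduled
```

The ids returned by `handle_worker_failure` are the tasks the master would make eligible for scheduling on other workers.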
14. Semantics in the Presence of Failures
Deterministic and nondeterministic functions
Deterministic functions always return the same result any time they are called with a specific set
of input values.
Nondeterministic functions may return different results each time they are called with a specific
set of input values.
If the map and reduce functions are deterministic, the distributed implementation of the
MapReduce framework must produce the same output as a non-faulting
sequential execution of the program.
Several copies of the same map/reduce task might run on different nodes for the
sake of reliability and fault tolerance
15. Semantics in the Presence of Failures (2)
Mappers always write their output to temporary files (atomic commits).
When a map task finishes, it:
Renames the temporary file to the final output name.
Sends a message to the master informing it of the filename.
If another copy of the same map task has already finished, the master ignores the message; otherwise it stores the filename.
Reducers do the same, and if multiple copies of the same reduce task finish,
the MapReduce framework relies on the atomic rename provided by the file system.
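The write-then-rename commit can be sketched with the Python standard library. The names `commit_task_output` and `CompletionLog` are hypothetical, and the sketch uses a local file system, where `os.replace` is atomic on POSIX systems as long as the temporary and final paths are on the same file system.

```python
import os
import tempfile


def commit_task_output(records, final_path):
    """Write records to a temporary file, then atomically rename it to final_path."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file next to the final one so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp_file:
        for key, value in records:
            tmp_file.write(f"{key}\t{value}\n")
    # Readers see either no final file or the complete one, never a partial file.
    os.replace(tmp_path, final_path)
    return final_path


class CompletionLog:
    """Master-side view: keep only the first reported output file of each map task."""

    def __init__(self):
        self.outputs = {}

    def report(self, task_id, filename):
        """Return True if the report was stored, False if a copy had already finished."""
        if task_id in self.outputs:
            return False
        self.outputs[task_id] = filename
        return True
```

A second completion report for the same task, for example from a backup copy, is simply ignored by `CompletionLog.report`.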
If map tasks are nondeterministic and multiple copies of a map task run on
different machines, a weak semantic condition can occur: two reducers read …
16. Semantics in the Presence of Failures (3)
● Workers #1 and #2 each run a
copy of the same map task M1
● Reducer task R1 reads its
input for M1 from worker #1
● Reducer task R2 reads its
input for M1 from worker #2,
as worker #1 has failed by the
time R2 starts.
● If M1's function is deterministic,
we have complete
consistency.
● If M1's function is not
deterministic, R1 and R2 may
receive different results from
M1.
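The scenario above can be imitated in one process. This is only a model: the two "workers" are two calls of the same function, and the differing seeds passed to the hypothetical `make_nondeterministic_map` stand in for whatever makes a nondeterministic function behave differently on two machines.

```python
import random


def deterministic_map(record):
    """Same input always gives the same output."""
    return (record, len(record))


def make_nondeterministic_map(seed):
    """Build a map function whose output depends on hidden per-worker state."""
    rng = random.Random(seed)

    def map_fn(record):
        return (record, rng.random())

    return map_fn


def run_map_copy(map_fn, records):
    """One execution of map task M1 on one worker."""
    return [map_fn(record) for record in records]


RECORDS = ["alpha", "beta"]

# Deterministic M1: R1 (reading from worker #1) and R2 (reading from worker #2) agree.
det_w1 = run_map_copy(deterministic_map, RECORDS)
det_w2 = run_map_copy(deterministic_map, RECORDS)

# Nondeterministic M1: the two copies may produce different intermediate data.
nondet_w1 = run_map_copy(make_nondeterministic_map(seed=1), RECORDS)
nondet_w2 = run_map_copy(make_nondeterministic_map(seed=2), RECORDS)
```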
17. Task granularity
Load balancing: fine-grained tasks are better; faster machines tend to take more tasks than slower machines
over time, which leads to a shorter overall job execution time.
Failure recovery: less time is needed to re-execute failed tasks.
Very fine-grained tasks may not be desirable: they add management overhead and too much data shuffling
(which consumes valuable bandwidth).
Optimal granularity: split size = HDFS block size (128 MB by default)
An HDFS block is guaranteed to be stored on a single node.
We need to maximize the work done locally by one mapper:
if split size < block size: the possibility of local data processing is not fully utilized
if split size > block size: data transfer may be needed for the map function to complete
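The arithmetic behind these points can be checked with a small sketch. The helper names are hypothetical, and the sketch assumes that splits and blocks are laid out back to back from the start of the file.

```python
MB = 1024 * 1024


def num_map_tasks(file_size, split_size):
    """Number of splits (and so map tasks) needed to cover the file, rounded up."""
    return (file_size + split_size - 1) // split_size


def blocks_touched_by_split(split_index, split_size, block_size):
    """Indices of the blocks that one split reads from."""
    first_byte = split_index * split_size
    last_byte = first_byte + split_size - 1
    return list(range(first_byte // block_size, last_byte // block_size + 1))
```

With a 128 MB block, `blocks_touched_by_split(0, 128 * MB, 128 * MB)` covers exactly one block, while a 256 MB split spans two blocks, which may live on different nodes. A 64 MB split stays inside one block but doubles the number of map tasks returned by `num_map_tasks`.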
18. Data locality
Network bandwidth is a valuable resource.
We assume rack-based server hardware.
The MapReduce scheduler works as follows:
1. Try to assign the map task to the node where the
corresponding split block(s) reside; if that node is free,
assign the task, else go to step 2.
2. Try to find a free node in the same rack and assign
the map task to it; if none can be found, find a free off-rack node
and assign the task to it.
● More complex implementations use a network cost
model.
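The scheduling preference described above (node-local, then rack-local, then off-rack) can be sketched as below. `pick_node` and its arguments are hypothetical, and ties between equally good nodes are broken alphabetically only to keep the example predictable.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose a node for a map task and report how local the choice is.

    replica_nodes: nodes that hold a copy of the split's block(s)
    free_nodes: set of nodes that can accept a task right now
    rack_of: dict of node -> rack id
    """
    # Step 1: a free node that already stores the data.
    for node in replica_nodes:
        if node in free_nodes:
            return node, "node-local"
    # Step 2: a free node in the same rack as one of the replicas.
    replica_racks = {rack_of[node] for node in replica_nodes}
    for node in sorted(free_nodes):
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # Fallback: any free node in another rack.
    for node in sorted(free_nodes):
        return node, "off-rack"
    return None, "no free node"
```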
19. Backup tasks
Stragglers: machines that run their
assigned (MapReduce) tasks very slowly.
Slow running can have many causes:
a bad disk, a slow network, a low-speed CPU.
Other tasks scheduled on stragglers cause
more load and longer execution time.
Solution mechanism:
When a MapReduce job is close to finishing,
issue backup tasks for all the "in-progress" tasks.
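A minimal sketch of the backup-task decision. The function name and the 0.9 completion threshold are assumptions made for this example; the slide only says "close to finishing".

```python
def select_backup_tasks(task_states, threshold=0.9):
    """Return the ids of in-progress tasks that should get a backup copy.

    task_states: dict of task_id -> "idle" | "in-progress" | "finished"
    threshold: fraction of finished tasks at which the job counts as close to
               finishing (an assumed value, not taken from the slides)
    """
    total = len(task_states)
    if total == 0:
        return []
    finished = sum(1 for state in task_states.values() if state == "finished")
    if finished / total < threshold:
        return []  # too early: backups now would only add load
    return sorted(task for task, state in task_states.items() if state == "in-progress")
```

Whichever copy of a task finishes first, original or backup, is the one whose result is used; the other copy's completion is ignored.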
20. Refinements
Partitioning function:
Partitions the output of map tasks into R partitions (one for each reduce task).
A good function should make the partitions as equal in size as possible.
Default: hash(key) mod R
Usually works fine
A problem arises when specific keys have many more records
than the others.
In that case, design a custom hash function or change the key.
Combiner function
Reduces the size of the intermediate map output.
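Both refinements can be sketched for the word-count case. The function names are hypothetical, and `zlib.crc32` is used as the hash only because Python's built-in `hash()` for strings changes between runs; it is not the hash that Hadoop or Google's implementation uses.

```python
import zlib
from collections import Counter


def partition_for(key, num_reducers):
    """Default-style partitioner: hash(key) mod R, using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Combiner for word count: pre-sum the counts of each word on the map side."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def partition_map_output(pairs, num_reducers):
    """Distribute (key, value) pairs over R partitions; equal keys land in the same one."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions
```

Running `combine` before `partition_map_output` shrinks a map output such as `[("the", 1)] * 1000` to a single pair. It does not remove skew between keys, though: a key with far more records than the others still sends all of its data to one reducer.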
21. Refinements (2)
Skipping bad records
A bug in a third-party library that can't be fixed causes the code to crash on specific records.
Terminating a job that has been running for hours or days is more expensive than sacrificing a small percentage of
accuracy (if the context allows, e.g. statistical analysis of large data).
How does MapReduce handle that?
1. Each worker process installs a signal handler that catches segmentation violations, bus
errors and other possible fatal errors.
2. Before a map / reduce task runs, the MapReduce library stores the key value in a global
variable.
3. When the map / reduce task function code generates a signal, the worker sends UDP …
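The idea of skipping bad records can be imitated in plain Python. This sketch differs from the mechanism on the slide in several labeled ways: Python exceptions stand in for fatal signals, a dictionary inside one process stands in for the message sent to the master, and the rule "skip a record after more than one failure" is an assumption based on the description in the original MapReduce paper. `run_with_skipping` is a hypothetical name.

```python
def run_with_skipping(records, map_fn, max_failures=1, max_attempts=10):
    """Re-run a map task until it completes, skipping records that keep crashing it.

    Returns (output, skipped_record_numbers).
    """
    failures = {}  # record sequence number -> crash count (the master's view)
    for _ in range(max_attempts):
        skip = {seq for seq, count in failures.items() if count > max_failures}
        output = []
        current = None  # plays the role of the global variable from step 2
        try:
            for seq, record in enumerate(records):
                if seq in skip:
                    continue
                current = seq
                output.extend(map_fn(record))
            return output, sorted(skip)
        except Exception:
            # Stand-in for the signal handler reporting the offending record.
            failures[current] = failures.get(current, 0) + 1
    raise RuntimeError("task did not complete within max_attempts")
```

For example, with records `["1", "2", "x", "4"]` and a map function that parses integers, the third record crashes the task twice and is skipped on the third attempt.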
(1) In the Google paper, sorting is described as the reducer's responsibility; in Hadoop: The Definitive Guide, however, sorting is described as the mapper's responsibility, and the reducer is responsible for merging the sorted intermediate output.