MapReduce: Simplified Data Processing on Large Clusters

by Jeffrey Dean and Sanjay Ghemawat
Communications of the ACM, 2008
Presented by: Abolfazl Asudeh

Map Reduce
7/4/2013
▪ Patented by Google
▪ A parallel programming model and an associated implementation for processing and generating large datasets
▪ Users specify the computation in terms of a map and a reduce function
▪ The system automatically parallelizes the computation across large-scale clusters

▪ Previously, users had to handle the parallelization of their programs across hundreds or thousands of machines themselves:
  ▪ Distribute the data
  ▪ Handle failures
  ▪ Schedule inter-machine communication to make efficient use of resources
▪ Observation from experience:
  ▪ Most computations at Google involved applying a map operation to produce intermediate key/value pairs
  ▪ and then applying a reduce operation to combine/aggregate the pairs

Programming Model
▪ Input: a set of input key/value pairs
▪ Output: a set of output key/value pairs
▪ Map (written by the user):
  ▪ Takes an input pair and produces a set of intermediate key/value pairs.
▪ MapReduce library:
  ▪ Groups all intermediate values associated with the same intermediate key and passes them to the reduce function.
▪ Reduce (written by the user):
  ▪ Accepts an intermediate key and a set of values for that key.
  ▪ Merges these values together to form a possibly smaller set of values.
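
The contract between the user's two functions and the library can be illustrated with a minimal single-process sketch. The names `run_mapreduce`, `map_fn`, and `reduce_fn` and the sensor-reading data are hypothetical and not part of any real MapReduce library; a real system distributes these phases across many machines.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    # Map phase: each input pair may yield any number of intermediate pairs.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            # Library's job: collect all values that share an intermediate key.
            intermediate[ikey].append(ivalue)

    # Reduce phase: one call per intermediate key, in sorted key order.
    output = []
    for ikey in sorted(intermediate):
        for result in reduce_fn(ikey, intermediate[ikey]):
            output.append((ikey, result))
    return output


def map_fn(_line_no, line):
    # Input pair: (line number, "sensor,reading") -> (sensor, reading)
    sensor, reading = line.split(",")
    yield sensor, float(reading)


def reduce_fn(_sensor, readings):
    # Merge all readings for one sensor into a single value.
    yield max(readings)


records = [(0, "a,1.5"), (1, "b,0.2"), (2, "a,3.0")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

Here the user writes only `map_fn` and `reduce_fn`; the grouping step in the middle is what the library provides.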

Basic Example: Word Counting
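[Figure: word counting with four map tasks (M) and two reduce tasks (R)]

Input (one line per map task):
  How now
  Brown cow
  How does
  It work now

Map output:
  <How,1> <now,1>
  <brown,1> <cow,1>
  <How,1> <does,1>
  <it,1> <work,1> <now,1>

Grouped by key (input to the reduce tasks):
  <How,1 1> <now,1 1> <brown,1> <cow,1> <does,1> <it,1> <work,1>

Output:
  brown 1
  cow 1
  does 1
  How 2
  it 1
  now 2
  work 1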
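
The word-count example can be written as a pair of small functions. The following is a sketch only: `map_words`, `reduce_counts`, and `word_count` are hypothetical names, everything runs in one process, and the input lines are spelled to match the intermediate pairs in the figure (lowercase "brown" and "it").

```python
from collections import defaultdict


def map_words(_doc_name, text):
    # Emit <word, 1> for every word in the input line.
    for word in text.split():
        yield word, 1


def reduce_counts(_word, counts):
    # Sum all the 1s emitted for this word.
    return sum(counts)


def word_count(documents):
    # Group the intermediate pairs by word (the library's job in a real system).
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, one in map_words(name, text):
            grouped[word].append(one)
    return {word: reduce_counts(word, ones) for word, ones in grouped.items()}


lines = {
    "line1": "How now",
    "line2": "brown cow",
    "line3": "How does",
    "line4": "it work now",
}
counts = word_count(lines)
for word in sorted(counts, key=str.lower):
    print(word, counts[word])
```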

Execution Overview
1. The MapReduce library splits the input files into M pieces, typically 16-64 MB each, and starts many copies of the program on a cluster of machines.
2. One copy of the program becomes the master; the rest are workers. The master assigns the M map tasks and R reduce tasks, picking idle workers and giving each one a task.
3. A worker assigned a map task reads the corresponding input split, parses key/value pairs out of it, and passes each pair to the user-defined map function. The intermediate pairs produced by the map function are buffered in memory.
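
Step 1 can be illustrated with a small sketch that cuts a file into fixed-size byte ranges, one per map task. The function name `split_offsets` and the 64 MB default are assumptions for illustration; the sketch also ignores record boundaries, which a real implementation has to respect.

```python
MB = 1024 * 1024


def split_offsets(file_size, split_size=64 * MB):
    """Return (start, length) byte ranges that together cover the whole file."""
    return [
        (start, min(split_size, file_size - start))
        for start in range(0, file_size, split_size)
    ]


# A 200 MB input yields M = 4 splits: three of 64 MB and one of 8 MB.
splits = split_offsets(200 * MB)
for start, length in splits:
    print(f"offset {start // MB:>3} MB, length {length // MB} MB")
```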

Execution Overview
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.
  ▪ The locations of the buffered pairs on the local disk are passed back to the master, which forwards these locations to the reduce workers.
5. A reduce worker, once notified of these locations, remotely reads the buffered data from the local disks of the map workers.
  ▪ It then sorts the data by intermediate key so that all occurrences of the same key are grouped together.
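
Steps 4 and 5 can be sketched as follows, assuming the common default partitioning scheme of hashing the key modulo R. The helper names are hypothetical, lists stand in for files on local disk, and `zlib.crc32` is used instead of Python's built-in `hash()` only because string hashing is randomized per process.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reduce_tasks):
    # Deterministic hash(key) mod R.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def write_regions(pairs, num_reduce_tasks):
    # Map side: spread the buffered pairs over R regions.
    regions = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        regions[partition(key, num_reduce_tasks)].append((key, value))
    return regions


def sort_and_group(region_parts):
    # Reduce side: merge one region from every mapper, sort by key, then group.
    merged = [pair for part in region_parts for pair in part]
    merged.sort(key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


R = 2
mapper_a = write_regions([("How", 1), ("now", 1)], R)
mapper_b = write_regions([("How", 1), ("does", 1)], R)
# Reduce task r reads region r from every mapper.
reduce_inputs = [sort_and_group([mapper_a[r], mapper_b[r]]) for r in range(R)]
print(reduce_inputs)
```

Because every mapper uses the same partitioning function, all pairs for a given key end up at the same reduce task.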

Execution Overview
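6. The reduce worker passes each unique intermediate key and its set of values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.
7. When all map and reduce tasks are done, the master wakes up the user program and the MapReduce call returns to the user code.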

Fault Tolerance
▪ Failure detection: a worker is marked as failed if it does not respond to the master's periodic pings
▪ If a map worker fails:
  ▪ Re-execute all of its map tasks, including completed ones, because their output is stored on the failed machine's local disk
  ▪ Notify the corresponding reduce workers of the new output location; any reduce worker that has not yet read the data from the failed worker reads it from the new one
▪ If a reduce worker fails:
  ▪ Only its in-progress reduce tasks are rescheduled; completed reduce tasks do not need to be re-executed since their output is stored in a global file system
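
The rescheduling rules can be sketched as bookkeeping on the master. The `Task` class, its state names, and `handle_worker_failure` are hypothetical; the sketch covers only which tasks return to the idle pool, not ping handling or notifying reduce workers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # worker the task is (or was) assigned to


def handle_worker_failure(tasks: List[Task], failed_worker: str) -> List[Task]:
    """Reset every task that must be re-executed and return those tasks."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output lives on the failed machine's local disk, so it is lost.
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in_progress" or lost_map_output:
            task.state = "idle"
            task.worker = None
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),    # output is in the global file system
    Task("reduce", "in_progress", "w1"),
    Task("map", "completed", "w2"),       # unaffected worker
]
redo = handle_worker_failure(tasks, "w1")
print([task.kind for task in redo])
```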

Execution Optimization
▪ Locality
  ▪ Network bandwidth is a relatively scarce resource
  ▪ Run each map task on (or near) a machine that holds a local replica of its input, as placed by the distributed file system (GFS in the paper, HDFS in Hadoop)
▪ Task granularity
  ▪ Ideally, M and R should be much larger than the number of worker machines
  ▪ Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails
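
The load-balancing effect of fine-grained tasks can be illustrated with a toy simulation in which each idle worker pulls the next task. The worker speeds and task sizes below are made-up numbers chosen for illustration, and `makespan` is a hypothetical helper, not part of any MapReduce implementation.

```python
import heapq


def makespan(task_sizes, worker_speeds):
    """Time at which the last task finishes when idle workers pull tasks in order."""
    # Each heap entry is (time at which the worker becomes idle, worker index).
    idle_at = [(0.0, i) for i in range(len(worker_speeds))]
    heapq.heapify(idle_at)
    finish = 0.0
    for size in task_sizes:
        free_time, i = heapq.heappop(idle_at)      # earliest idle worker
        done = free_time + size / worker_speeds[i]
        finish = max(finish, done)
        heapq.heappush(idle_at, (done, i))
    return finish


speeds = [1.0, 1.0, 1.0, 0.5]            # one worker runs at half speed
coarse = makespan([25.0] * 4, speeds)    # one large task per worker
fine = makespan([1.0] * 100, speeds)     # same total work in 100 small tasks
print(coarse, fine)
```

With one large task per worker, the job waits for the slow machine to finish its full share; with many small tasks, the faster workers absorb most of the work.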

Practice
▪ Write the map and reduce functions for the PageRank algorithm.
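
One possible answer, offered as a sketch rather than a reference solution: each iteration of PageRank is a single map/reduce pass. The sketch assumes a damping factor of 0.85, a graph in which every page has at least one outgoing link, and a hypothetical in-memory driver `iterate`; all names are illustrative.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor


def pagerank_map(page, state):
    rank, links = state
    # Pass the link structure through so the next iteration can reuse it.
    yield page, ("links", links)
    # Split this page's rank evenly among the pages it links to.
    for target in links:
        yield target, ("rank", rank / len(links))


def pagerank_reduce(_page, values, num_pages):
    links, total = [], 0.0
    for tag, payload in values:
        if tag == "links":
            links = payload
        else:
            total += payload
    new_rank = (1 - DAMPING) / num_pages + DAMPING * total
    return new_rank, links


def iterate(graph):
    # One MapReduce pass: map every page, group by key, reduce every page.
    grouped = defaultdict(list)
    for page, state in graph.items():
        for key, value in pagerank_map(page, state):
            grouped[key].append(value)
    return {
        page: pagerank_reduce(page, values, len(graph))
        for page, values in grouped.items()
    }


# page -> (current rank, outgoing links)
graph = {"A": (1 / 3, ["B", "C"]), "B": (1 / 3, ["C"]), "C": (1 / 3, ["A"])}
for _ in range(20):
    graph = iterate(graph)
ranks = {page: rank for page, (rank, _) in graph.items()}
print(ranks)
```

The map function emits two kinds of values per key (the link list and rank contributions), so the reduce function has to tell them apart; tagging each value is one simple way to do that.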
