MapReduce

Introduction to MapReduce

Zuhair Khayyat
3/11/2012

What is MapReduce

● A programming model introduced by Google at OSDI '04 for
processing large datasets efficiently.
● Features:
– Automatic parallelization, no parallel-programming experience required.
– Data and process redundancy for failure recovery.
– Automatic scheduling and load balancing.
– Easy to program, based on two simple functions:
● Map
● Reduce

CS245 - 2012 Introduction to MapReduce 2

Why MapReduce?

● For a cluster of:
– 2000 machines.
– Total 16 TB RAM (≈ 8 GB each).
– Total 2 PB Disk space (≈ 1 TB each).
● Use the maximum capacity of the cluster to:
– Implement a parallel word count for input size 100 TB.
– Implement a parallel sort for the same input file.
● Can you use the same code for both applications?


How Fast is MapReduce (Hadoop)

● Sort Benchmark competition (http://sortbenchmark.org/):
– 2009: 100 TB in 173 minutes using 3452 nodes:
● 2 x Quad core Xeons @ 2.5 GHz.
● 8 GB RAM.

– 2008: 1 TB in 3.48 minutes using 910 nodes:
● 4 x Dual core Xeons @ 2.0 GHz.
● 8 GB RAM.


Who uses MapReduce?


Map & Reduce functions

● The Mapper (pick a key):
– Input: reads one partition of the input from disk.
– Output: creates <key, value> pairs, known as
intermediate pairs.
– More input partitions == more parallel Mappers.
● The Reducer (process values):
– Input: the list of intermediate <key, value> pairs that share one unique key.
– Output: one or more <key, value> pairs.
– More unique keys == more parallel Reducers.
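
The two functions can be sketched for word counting as below. This is a minimal illustration rather than any framework's actual API; the names map_words and reduce_counts are hypothetical.

```python
from typing import Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: pick each word as the key and emit the pair <word, 1>."""
    for word in line.split():
        yield (word, 1)


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: receive every value emitted for one key and sum them."""
    return (key, sum(values))


pairs = list(map_words("snow cat cat"))
print(pairs)                               # [('snow', 1), ('cat', 1), ('cat', 1)]
print(reduce_counts("cat", [1, 1, 1]))     # ('cat', 3)
```

The mapper never sees more than one line and the reducer never sees more than one key, which is what lets a framework run many copies of each in parallel.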

How MapReduce Works

1) Partition the input file into M partitions.
2) Create M Map tasks that read the M partitions in parallel and emit
intermediate <key, value> pairs, storing them on local storage.
3) Wait for all Map workers to finish, then sort and partition the
intermediate <key, value> pairs into R regions.
4) Start R Reduce workers; each reads the list of intermediate pairs for
a unique key from remote disks.
5) Write the output of the Reduce workers to file(s).
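
The five steps can be imitated in a single process. The sketch below is a simplification under stated assumptions: run_mapreduce, word_mapper and sum_reducer are hypothetical names, the "disks" are Python lists, and a byte-sum of the key stands in for the partitioning function that assigns keys to the R regions.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def run_mapreduce(
    lines: List[str],
    mapper: Callable[[str], Iterable[Pair]],
    reducer: Callable[[str, List[int]], Pair],
    m: int,
    r: int,
) -> List[Pair]:
    # 1) Partition the input into M partitions (round-robin by line here).
    partitions = [lines[i::m] for i in range(m)]

    # 2) One map task per partition; each keeps its intermediate pairs locally.
    map_outputs = [
        [pair for line in part for pair in mapper(line)] for part in partitions
    ]

    # 3) After all map tasks finish, partition the pairs into R regions by key
    #    and group the values that share a key inside each region.
    regions: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(r)]
    for output in map_outputs:
        for key, value in output:
            regions[sum(key.encode()) % r][key].append(value)

    # 4) One reduce task per region, visiting its keys in sorted order.
    results = []
    for region in regions:
        for key in sorted(region):
            results.append(reducer(key, region[key]))

    # 5) "Write" the output; here it is simply returned, sorted by key.
    return sorted(results)


def word_mapper(line: str) -> List[Pair]:
    return [(word, 1) for word in line.split()]


def sum_reducer(key: str, values: List[int]) -> Pair:
    return (key, sum(values))


lines = ["cat flower picture", "snow cat cat", "prince flower sun", "king queen AC"]
result = run_mapreduce(lines, word_mapper, sum_reducer, m=4, r=3)
print(result)
```

Changing m or r changes how the work is split but not the final counts, because every pair with the same key always lands in the same region.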


Example – Word count

● Assume an input as following:

cat flower picture
snow cat cat
prince flower sun
king queen AC


● Step 1: Partition the input file into M partitions (here M = 4, one line each).

Partition 1: cat flower picture
Partition 2: snow cat cat
Partition 3: prince flower sun
Partition 4: king queen AC


● Step 2: Create M Map tasks that read the M partitions in parallel and
emit intermediate <key, value> pairs, storing them on local storage.

Mapper 1: cat flower picture → <cat,1> <flower,1> <picture,1>
Mapper 2: snow cat cat       → <snow,1> <cat,1> <cat,1>
Mapper 3: prince flower sun  → <prince,1> <flower,1> <sun,1>
Mapper 4: king queen AC      → <king,1> <queen,1> <AC,1>

● Step 3: Wait for all Map workers to finish, then sort and partition the
intermediate <key, value> pairs into R regions.

Mapper outputs:
<cat,1> <flower,1> <picture,1>
<snow,1> <cat,1> <cat,1>
<prince,1> <flower,1> <sun,1>
<king,1> <queen,1> <AC,1>

Sorted by key and partitioned into regions (one per unique key):
<AC,1>
<cat,1> <cat,1> <cat,1>
<flower,1> <flower,1>
<king,1>
<picture,1>
<prince,1>
<queen,1>
<snow,1>
<sun,1>

● Step 4: Start R Reduce workers; each reads the list of intermediate
pairs for a unique key from remote disks.

Reducer 1: <AC,1>                  → <AC,1>
Reducer 2: <cat,1> <cat,1> <cat,1> → <cat,3>
Reducer 3: <flower,1> <flower,1>   → <flower,2>
Reducer 4: <king,1>                → <king,1>
Reducer 5: <picture,1>             → <picture,1>
Reducer 6: <prince,1>              → <prince,1>
Reducer 7: <queen,1>               → <queen,1>
Reducer 8: <snow,1>                → <snow,1>
Reducer 9: <sun,1>                 → <sun,1>

● Step 5: Write the output of the Reduce workers to file(s).

<AC,1>
<cat,3>
<flower,2>
<king,1>
<picture,1>
<prince,1>
<queen,1>
<snow,1>
<sun,1>

MapReduce framework


MapReduce Failure Recovery

● The framework follows the master-worker paradigm.
● The master keeps a record of the work done on each worker.
● If a worker fails, the master assigns the same work to another
worker.
● If a worker is slow, a second copy of the same work is assigned
to another worker.
● If the master fails, a backup copy of the master can pick
up and continue execution from the last checkpoint.
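
The worker-failure rule can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: run_with_reassignment, WorkerFailure and the two worker functions are hypothetical, a crash is modelled as an exception, and slow workers and master checkpoints are not covered.

```python
from typing import Callable, Dict, List


class WorkerFailure(Exception):
    """Raised by a simulated worker that crashes while running a task."""


def run_with_reassignment(
    tasks: List[str],
    workers: List[Callable[[str], str]],
) -> Dict[str, str]:
    """Master loop: record finished tasks and reassign work after a failure."""
    done: Dict[str, str] = {}  # the master's record of completed work
    for i, task in enumerate(tasks):
        # Try the workers in turn, starting with a different one for each task.
        for attempt in range(len(workers)):
            worker = workers[(i + attempt) % len(workers)]
            try:
                done[task] = worker(task)
                break  # task finished, move on to the next one
            except WorkerFailure:
                continue  # reassign the same task to another worker
        else:
            raise RuntimeError(f"no worker could finish {task!r}")
    return done


def healthy_worker(task: str) -> str:
    return f"{task}: ok"


def crashing_worker(task: str) -> str:
    raise WorkerFailure(task)


result = run_with_reassignment(
    ["map-0", "map-1", "map-2"], [crashing_worker, healthy_worker]
)
print(result)
```

Because a task's output depends only on its input partition, running it again on a different worker gives the same result, which is what makes simple reassignment safe.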


Advantages of MapReduce

● Parallel I/O: hides disk latency.
● Parallel processing:
– Map functions work independently in parallel, each
processing one unique partition.
– Reduce functions work independently in parallel, each
on a unique intermediate key.
● Using large clusters of commodity machines gives results
comparable to small, expensive clusters.


Hadoop vs. others
● Algorithm: Sorting 100 TB data.

             Hadoop               DEMSort              TritonSort
Node count   3452                 195                  47
Processor    2x Quad-core Xeons   2x Quad-core Xeons   2x Quad-core Xeons
             @ 2.5 GHz            @ 2.6 GHz            @ 2.27 GHz
Memory       8 GB                 16 GB                24 GB
Network      1 Gigabit Ethernet   InfiniBand           10 Gigabit Fiber
Throughput   0.578 TB/min         0.564 TB/min         0.582 TB/min
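
The cluster throughputs are close, but the node counts are not. Dividing one by the other gives a rough per-node figure. The sketch below uses only the numbers in the table; the helper name per_node_gb_per_min and the conversion 1 TB = 1000 GB are assumptions made for illustration.

```python
# Figures copied from the table: (node count, cluster throughput in TB/min).
systems = {
    "Hadoop": (3452, 0.578),
    "DEMSort": (195, 0.564),
    "TritonSort": (47, 0.582),
}


def per_node_gb_per_min(nodes: int, tb_per_min: float) -> float:
    """Cluster throughput divided by node count, in GB/min (1 TB = 1000 GB)."""
    return tb_per_min * 1000 / nodes


for name, (nodes, tb_per_min) in systems.items():
    rate = per_node_gb_per_min(nodes, tb_per_min)
    print(f"{name:<11}{rate:6.2f} GB/min per node")
```

On these figures a Hadoop node sorts roughly 0.17 GB/min, against about 2.89 GB/min for DEMSort and 12.38 GB/min for TritonSort. The comparison is only indicative, since the three clusters also differ in memory and network.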


MapReduce weak points

● The overhead of MapReduce is huge.
● Data-dependent applications may need multiple iterations of
MapReduce, for example:
– K-means.
– PageRank.
● Complex algorithms can be very hard to implement, for example:
– Range queries.
● Sensitive to skew in the distribution of <key, value> pairs.
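
To see why K-means needs multiple iterations, note that each round depends on the centroids produced by the previous one. The sketch below runs a one-dimensional K-means in which every round is one map phase plus one reduce phase. The function name kmeans_round, the sample points and the starting centroids are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def kmeans_round(points: List[float], centroids: List[float]) -> List[float]:
    """One MapReduce round of 1-D K-means."""
    # Map: emit <index of nearest centroid, point>.
    groups: Dict[int, List[float]] = defaultdict(list)
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        groups[nearest].append(p)
    # Reduce: for each centroid index, average the points assigned to it
    # (a centroid with no points keeps its old position).
    return [
        sum(groups[i]) / len(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [0.0, 5.0]
rounds = 0
while True:
    updated = kmeans_round(points, centroids)  # a full MapReduce job each time
    rounds += 1
    if updated == centroids:
        break
    centroids = updated
print(centroids, rounds)  # [2.0, 11.0] 3
```

Even this tiny input takes three rounds to converge, and on a real cluster each round pays the full job start-up and shuffle cost again.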


Implementations of MapReduce

● Hadoop in Java.
● Mars in C++ & CUDA.
● Skynet in Ruby.
● Phoenix in C++.
● Microsoft Dryad:
– Schedules multiple levels of MapReduce-like operations.


MapReduce in Database


MapReduce in Database - Ex1

● Select Name from Students where age = 23;

Students:
Name ID Age
Ahmed 1177 23
Bob 1131 20
Sara 1197 22
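
One possible way to express this selection is a map-only job: the mapper filters rows and no reducer is needed. The sketch below assumes rows arrive as (Name, ID, Age) tuples; select_mapper is a hypothetical name.

```python
from typing import Iterator, List, Tuple

Student = Tuple[str, int, int]  # (Name, ID, Age)

STUDENTS: List[Student] = [
    ("Ahmed", 1177, 23),
    ("Bob", 1131, 20),
    ("Sara", 1197, 22),
]


def select_mapper(row: Student) -> Iterator[Tuple[str, None]]:
    """Emit <Name, None> only for rows that satisfy age = 23."""
    name, _student_id, age = row
    if age == 23:
        yield (name, None)


names = [key for row in STUDENTS for key, _ in select_mapper(row)]
print(names)  # ['Ahmed']
```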



MapReduce in Database - Ex2

● Select COUNT(Name) from Students where age > 20 group by Name;

Students:
Name ID Age
Ahmed 1177 23
Bob 1131 20
Sara 1197 22
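
A group-by with an aggregate maps naturally onto both phases: the mapper applies the where clause and emits the grouping column as the key, and the reducer computes the aggregate. The sketch below is one possible formulation with hypothetical names (count_mapper, count_reducer); the grouping dictionary stands in for the framework's shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Student = Tuple[str, int, int]  # (Name, ID, Age)

STUDENTS: List[Student] = [
    ("Ahmed", 1177, 23),
    ("Bob", 1131, 20),
    ("Sara", 1197, 22),
]


def count_mapper(row: Student) -> Iterator[Tuple[str, int]]:
    """Apply 'age > 20' and emit <Name, 1> so the reducer can count per name."""
    name, _student_id, age = row
    if age > 20:
        yield (name, 1)


def count_reducer(name: str, ones: List[int]) -> Tuple[str, int]:
    """COUNT for one group: sum the 1s emitted for this name."""
    return (name, sum(ones))


# Shuffle: group the mapper output by key, as the framework would.
grouped: Dict[str, List[int]] = defaultdict(list)
for row in STUDENTS:
    for key, value in count_mapper(row):
        grouped[key].append(value)

counts = [count_reducer(name, ones) for name, ones in sorted(grouped.items())]
print(counts)  # [('Ahmed', 1), ('Sara', 1)]
```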



MapReduce in Database - Ex3

● Select Name, Term from Students, Enrolment where ID = SID and age != 20;

Students:
Name   ID    Age
Ahmed  1177  23
Bob    1131  20
Sara   1197  22

Enrolment:
CID      SID   Term
CS290    1177  042
CS260    1177  052
ME222    1131  051
AMCS220  1197  051
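
An equi-join like this one is commonly written as a reduce-side join: both tables are mapped to the join key, each value is tagged with its source table, and the reducer pairs up the two sides. The sketch below is one such formulation, not necessarily the one intended by the slides; the mapper and reducer names and the "S"/"E" tags are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Tagged = Tuple[str, str]  # (source tag, payload)

STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
ENROLMENT = [
    ("CS290", 1177, "042"),
    ("CS260", 1177, "052"),
    ("ME222", 1131, "051"),
    ("AMCS220", 1197, "051"),
]


def student_mapper(row: Tuple[str, int, int]) -> Iterator[Tuple[int, Tagged]]:
    """Apply 'age != 20' and emit <ID, ("S", Name)>."""
    name, student_id, age = row
    if age != 20:
        yield (student_id, ("S", name))


def enrolment_mapper(row: Tuple[str, int, str]) -> Iterator[Tuple[int, Tagged]]:
    """Emit <SID, ("E", Term)> so it meets the matching student at one reducer."""
    _cid, sid, term = row
    yield (sid, ("E", term))


def join_reducer(_key: int, tagged: List[Tagged]) -> List[Tuple[str, str]]:
    """Pair every student name with every term that shares the join key."""
    names = [value for tag, value in tagged if tag == "S"]
    terms = [value for tag, value in tagged if tag == "E"]
    return [(name, term) for name in names for term in terms]


# Shuffle: group the output of both mappers by join key.
shuffled: Dict[int, List[Tagged]] = defaultdict(list)
for student in STUDENTS:
    for key, value in student_mapper(student):
        shuffled[key].append(value)
for enrolment in ENROLMENT:
    for key, value in enrolment_mapper(enrolment):
        shuffled[key].append(value)

result = [pair for key in sorted(shuffled) for pair in join_reducer(key, shuffled[key])]
print(result)  # [('Ahmed', '042'), ('Ahmed', '052'), ('Sara', '051')]
```

Bob's enrolment still reaches a reducer, but with no student row on the other side it produces no output.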



MapReduce in Database - Ex4

● Select Name, Term from Students, Enrolment where ID != SID;

Students:
Name   ID    Age
Ahmed  1177  23
Bob    1131  20
Sara   1197  22

Enrolment:
CID      SID   Term
CS290    1177  042
CS260    1177  052
ME222    1131  051
AMCS220  1197  051

● What if the condition is ID > SID?



MapReduce in Database - Ex5

● Select Name, Term from Students, Enrolment where ID = SID and Admission != Term;

Students:
Name   ID    Age  Admission
Ahmed  1177  23   042
Bob    1131  20   051
Sara   1197  22   042

Enrolment:
CID      SID   Term
CS290    1177  042
CS260    1177  052
ME222    1131  051
AMCS220  1197  051



MapReduce in Database - Ex6

● Select y from R, S, T where R.x = S.x and T.a = S.a;

R:
x  y  z

S:
a  b  x

T:
m  n  a


MapReduce in Academic Papers
● NIPS '07: Map-Reduce for Machine Learning on Multicore.
● eScience '08: CloudBLAST: Combining MapReduce and Virtualization on
Distributed Resources for Bioinformatics Applications.
● KDD '09: Large-scale behavioral targeting.
● GCC '09: Spatial Queries Evaluation with MapReduce.
● SIGIR '09: On single-pass indexing with MapReduce.
● MDAC '10: A novel approach to multiple sequence alignment using
hadoop data grids.
● VLDB Endowment '11: Social Content Matching in MapReduce.
● VLDB '12: Building Wavelet Histograms on Large Data in MapReduce.

Links
● http://code.google.com/edu/parallel/mapreduce-tutorial.html
● http://hadoop.apache.org/mapreduce/
● http://www.cse.ust.hk/gpuqp/Mars.html
● http://skynet.rubyforge.org/
● http://mapreduce.stanford.edu/
● http://wiki.apache.org/hadoop/PoweredBy
● http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/

