2. What is MapReduce
● A programming model introduced by Google at OSDI '04 for
processing large datasets efficiently.
● Features:
– Automatic parallelization; no parallel-programming experience required.
– Data and process redundancy for failure recovery.
– Automatic scheduling and load balancing.
– Easy to program, based on two simple functions:
● Map
● Reduce.
CS245 - 2012 Introduction to MapReduce 2
4. Why MapReduce?
● For a cluster of:
– 2000 machines.
– Total 16 TB RAM (≈ 8 GB each).
– Total 2 PB disk space (≈ 1 TB each).
● Use the maximum capacity of the cluster to:
– Implement a parallel word count for a 100 TB input file.
– Implement a parallel sort for the same input file.
● Can you use the same code for both applications?
5. How Fast is MapReduce (Hadoop)?
● Sort Benchmark competition (http://sortbenchmark.org/):
– 2009: 100 TB in 173 minutes using 3452 nodes:
● 2 x Quad core Xeons @ 2.5 GHz.
● 8 GB RAM.
– 2008: 1 TB in 3.48 minutes using 910 nodes:
● 4 x Dual core Xeons @ 2.0 GHz.
● 8 GB RAM.
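The two results above can be turned into rough throughput figures. The following is a back-of-the-envelope sketch only; it assumes decimal units (1 TB = 10^12 bytes), and the function name `throughput` is made up for illustration.

```python
# Rough throughput implied by the two sort results quoted above.
# Assumption: decimal units (1 TB = 1e12 bytes, 1 GB = 1e9, 1 MB = 1e6).

def throughput(total_tb, minutes, nodes):
    """Return (aggregate GB/s, per-node MB/s) for one sort run."""
    bytes_per_second = total_tb * 1e12 / (minutes * 60)
    return bytes_per_second / 1e9, bytes_per_second / nodes / 1e6

for label, tb, minutes, nodes in [("2009", 100, 173, 3452), ("2008", 1, 3.48, 910)]:
    aggregate, per_node = throughput(tb, minutes, nodes)
    print(f"{label}: {aggregate:.2f} GB/s aggregate, {per_node:.2f} MB/s per node")
```

These numbers describe end-to-end sort rate (read, shuffle, sort, write), not raw disk bandwidth.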
7. Map & Reduce functions
● The Mapper (Pick a key):
– Input: one partition of the input, read from disk.
– Output: Create pairs of <key, value>, known as
intermediate pairs.
– More input partitions == More parallel Mappers.
● The Reducer (Process values):
– Input: the list of intermediate <key, value> pairs that share one key.
– Output: one or more <key, value> pairs.
– More unique keys == More parallel Reducers.
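As an illustration of the two roles, here is a minimal sketch of a Mapper and a Reducer for word counting. The function names and signatures are assumptions made for this example, not the API of any particular framework.

```python
from typing import Iterable, Iterator, Tuple

def word_count_map(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: pick the key (each word) and emit an intermediate <word, 1> pair."""
    for word in line.split():
        yield (word, 1)

def word_count_reduce(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reducer: process all values that share one key; here, sum them."""
    return (key, sum(values))

print(list(word_count_map("snow cat cat")))  # [('snow', 1), ('cat', 1), ('cat', 1)]
print(word_count_reduce("cat", [1, 1, 1]))   # ('cat', 3)
```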
8. How MapReduce Works
1) Partition the input file into M partitions.
2) Create M Map tasks that read the M partitions in parallel and emit
intermediate <key, value> pairs, storing them on local storage.
3) Wait for all Map workers to finish, then sort and partition the
intermediate <key, value> pairs into R regions.
4) Start R Reduce workers; each reads the list of intermediate pairs
for a unique key from remote disks.
5) Write the output of the Reduce workers to file(s).
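The five steps can be walked through in a toy, single-process sketch. Nothing here is distributed: `run_mapreduce` is a hypothetical helper, the round-robin input split and the CRC32-based key partitioning are arbitrary choices for the example, and keys are assumed to be strings.

```python
import zlib
from collections import defaultdict

def run_mapreduce(records, mapper, reducer, M, R):
    """Toy in-memory walk through the five steps; keys are assumed to be strings."""
    # 1) Partition the input into M partitions (round-robin is an arbitrary choice).
    partitions = [records[i::M] for i in range(M)]

    # 2) One Map task per partition; each keeps its intermediate pairs "locally".
    map_outputs = [[pair for record in part for pair in mapper(record)]
                   for part in partitions]

    # 3) Once every Map task is done, partition the pairs into R regions by key.
    regions = [defaultdict(list) for _ in range(R)]
    for local_pairs in map_outputs:
        for key, value in local_pairs:
            regions[zlib.crc32(key.encode()) % R][key].append(value)

    # 4) One Reduce worker per region, visiting its keys in sorted order.
    output = []
    for region in regions:
        for key in sorted(region):
            output.append(reducer(key, region[key]))

    # 5) "Write" the output; here it is returned, sorted for readability.
    return sorted(output)

counts = run_mapreduce(
    ["a b", "b c", "a a"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda key, values: (key, sum(values)),
    M=2, R=3,
)
print(counts)
```

A real framework runs steps 2 and 4 on different machines and moves the intermediate pairs over the network between them.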
9. Example – Word count
● Assume the following input:
cat flower picture
snow cat cat
prince flower sun
king queen AC
10. Example – Word count
● Step 1: Partition the input file into M partitions (here M = 4, one line each).
Partition 1: cat flower picture
Partition 2: snow cat cat
Partition 3: prince flower sun
Partition 4: king queen AC
11. Example – Word count
● Step 2: Create M Map tasks that read the M partitions in parallel and
emit intermediate <key, value> pairs, storing them on local storage.
Mapper 1: cat flower picture → <cat,1> <flower,1> <picture,1>
Mapper 2: snow cat cat → <snow,1> <cat,1> <cat,1>
Mapper 3: prince flower sun → <prince,1> <flower,1> <sun,1>
Mapper 4: king queen AC → <king,1> <queen,1> <AC,1>
12. Example – Word count
● Step 3: Wait for all Map workers to finish, then sort and partition the
intermediate <key, value> pairs into R regions (here R = 9, one per unique key).
Region 1: <AC,1>
Region 2: <cat,1> <cat,1> <cat,1>
Region 3: <flower,1> <flower,1>
Region 4: <king,1>
Region 5: <picture,1>
Region 6: <prince,1>
Region 7: <queen,1>
Region 8: <snow,1>
Region 9: <sun,1>
13. Example – Word count
● Step 4: Start R Reduce workers; each reads the list of intermediate pairs
for a unique key from remote disks.
Reducer 1: <AC,1> → <AC,1>
Reducer 2: <cat,1> <cat,1> <cat,1> → <cat,3>
Reducer 3: <flower,1> <flower,1> → <flower,2>
Reducers 4 to 8: <king,1>, <picture,1>, <prince,1>, <queen,1>, <snow,1> (one key each, count unchanged)
Reducer 9: <sun,1> → <sun,1>
14. Example – Word count
● Step 5: Write the output of the Reduce workers to file(s).
<AC,1>
<cat,3>
<flower,2>
<king,1>
<picture,1>
<prince,1>
<queen,1>
<snow,1>
<sun,1>
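The whole word count example can be reproduced in a few lines. This is a compact single-process sketch that mirrors the map, sort, and reduce steps; it uses `itertools.groupby` on the sorted pairs in place of a real shuffle, and the function name `word_count` is chosen for this example.

```python
from itertools import groupby
from operator import itemgetter

# The four input partitions from the example.
PARTITIONS = ["cat flower picture", "snow cat cat", "prince flower sun", "king queen AC"]

def word_count(partitions):
    # Map: every word becomes an intermediate <word, 1> pair.
    intermediate = [(word, 1) for line in partitions for word in line.split()]
    # Sort by key so that equal keys are adjacent (uppercase sorts before lowercase).
    intermediate.sort(key=itemgetter(0))
    # Reduce: sum the values of each run of equal keys.
    return [(key, sum(value for _, value in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]

for key, count in word_count(PARTITIONS):
    print(f"<{key},{count}>")
```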
16. MapReduce Failure Recovery
● The framework follows the master-worker paradigm.
● The master keeps a record of the work done on each worker.
● If a worker fails, the master assigns the same work to another
worker.
● If a worker is slow, a copy of the same work is assigned
to another worker.
● If the master fails, a backup copy of the master can pick
up and continue execution from the last checkpoint.
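The reassignment rule for failed workers can be sketched as a small simulation. This is an illustration under simplifying assumptions: `failing_workers` is a hypothetical stand-in for real failure detection, tasks are assigned one at a time, and slow workers and master failover are not modelled.

```python
def run_with_recovery(tasks, workers, failing_workers):
    """Assign each task to a worker; if that worker fails, give the task to another.

    `failing_workers` is a hypothetical set of worker names that always fail.
    Returns the master's record: task -> worker that completed it.
    """
    record = {}
    for i, task in enumerate(tasks):
        # Try workers in round-robin order, starting with the default assignee.
        for attempt in range(len(workers)):
            worker = workers[(i + attempt) % len(workers)]
            if worker not in failing_workers:
                record[task] = worker
                break
        else:
            raise RuntimeError(f"no healthy worker left for {task}")
    return record

print(run_with_recovery(["map-0", "map-1", "map-2"], ["w0", "w1", "w2"], {"w1"}))
```

In this run, `map-1` is first given to `w1`, which fails, so the master's record ends up listing `w2` for it.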
18. Advantages of MapReduce
● Parallel I/O: hides disk latency.
● Parallel processing:
– Map functions work independently in parallel, each
processing one unique partition.
– Reduce functions work independently in parallel, each
on a unique intermediate key.
● Large clusters of commodity machines give results
comparable to those of small, expensive clusters.
20. MapReduce weak points
● MapReduce incurs a large overhead.
● Data-dependent applications may need multiple iterations of
MapReduce, for example:
– K-means.
– PageRank.
● Complex algorithms can be very hard to implement, for example:
– Range queries.
● Sensitive to skewed distributions of <key, value> pairs.
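The sensitivity to skew can be made concrete with a small sketch. All pairs with the same key go to the same reducer, so one very frequent key overloads a single reducer no matter how many reducers exist. The CRC32-based partitioning and the sample key distribution below are assumptions chosen for illustration.

```python
import zlib
from collections import Counter

def reducer_loads(keys, R):
    """Count how many intermediate pairs each of R reducers receives."""
    loads = Counter()
    for key in keys:
        loads[zlib.crc32(key.encode()) % R] += 1
    return [loads[r] for r in range(R)]

# Made-up skewed workload: one key dominates.
skewed_keys = ["the"] * 1000 + ["cat", "dog", "sun", "king"] * 10
loads = reducer_loads(skewed_keys, R=4)
mean = sum(loads) / len(loads)
print(loads, f"max/mean = {max(loads) / mean:.2f}")
```

The job finishes only when the most loaded reducer finishes, so the ratio of maximum to mean load is a rough measure of wasted parallelism.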
21. Implementations of MapReduce
● Hadoop in Java.
● Mars in C++ & CUDA.
● Skynet in Ruby.
● Phoenix in C++.
● Microsoft Dryad:
– Schedules multiple levels of “MapReduce”-like
operations.
23. MapReduce in Database - Ex1
● Select Name from Students where age = 23;
Students:
Name ID Age
Ahmed 1177 23
Bob 1131 20
Sara 1197 22
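One possible way to express this selection is as a map-only job: the Mapper applies the WHERE condition and emits the projected column, and no Reducer is needed. A minimal sketch, with the function names chosen for this example:

```python
# Rows copied from the Students table above: (Name, ID, Age).
STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]

def select_map(row):
    """Mapper: keep the row only if age = 23, and project the Name column."""
    name, _student_id, age = row
    if age == 23:
        yield (name, None)  # the value is unused in a map-only job

def select_names(rows):
    return [key for row in rows for key, _ in select_map(row)]

print(select_names(STUDENTS))  # ['Ahmed']
```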
24. MapReduce in Database - Ex2
● Select COUNT(Name) from Students where age > 20 group
by Name;
Students:
Name ID Age
Ahmed 1177 23
Bob 1131 20
Sara 1197 22
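A grouped aggregate maps naturally onto both phases: the Mapper filters on age and emits <Name, 1>, and the Reducer sums the ones per name. The sketch below simulates the shuffle with a dictionary; `count_by_name` is a name made up for this example.

```python
from collections import defaultdict

# Rows copied from the Students table above: (Name, ID, Age).
STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]

def count_by_name(rows):
    # Map + shuffle: filter on age > 20 and group the emitted 1s by Name.
    groups = defaultdict(list)
    for name, _student_id, age in rows:
        if age > 20:
            groups[name].append(1)
    # Reduce: COUNT is the sum of the 1s in each group.
    return {name: sum(ones) for name, ones in groups.items()}

print(count_by_name(STUDENTS))  # {'Ahmed': 1, 'Sara': 1}
```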
25. MapReduce in Database - Ex3
● Select Name, Term from Students, Enrolment where ID = SID
and age != 20;
Students:
Name ID Age
Ahmed 1177 23
Bob 1131 20
Sara 1197 22
Enrolment:
CID SID Term
CS290 1177 042
CS260 1177 052
ME222 1131 051
AMCS220 1197 051
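An equi-join such as this one is commonly done as a reduce-side join: the Mapper tags each row with its source table and emits it under the join key, and the Reducer pairs the rows that meet under the same key. The sketch below is one such arrangement, not the only one; the tags "S" and "E" and the function names are assumptions for this example.

```python
from collections import defaultdict

# Rows copied from the tables above; terms are strings to keep leading zeros.
STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
ENROLMENT = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]

def join_map(students, enrolment):
    """Mapper: emit every row under its join key, tagged with its source table."""
    for name, student_id, age in students:
        if age != 20:  # single-table condition, applied before the shuffle
            yield student_id, ("S", name)
    for _cid, sid, term in enrolment:
        yield sid, ("E", term)

def join_reduce(values):
    """Reducer: pair every student name with every term seen under the same key."""
    names = [v for tag, v in values if tag == "S"]
    terms = [v for tag, v in values if tag == "E"]
    return [(name, term) for name in names for term in terms]

def equi_join(students, enrolment):
    groups = defaultdict(list)  # stands in for the shuffle
    for key, value in join_map(students, enrolment):
        groups[key].append(value)
    return [pair for key in sorted(groups) for pair in join_reduce(groups[key])]

print(equi_join(STUDENTS, ENROLMENT))
```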
26. MapReduce in Database - Ex4
● Select Name, Term from Students, Enrolment where ID !=
SID;
Students:
Name ID Age
Ahmed 1177 23
Bob 1131 20
Sara 1197 22
Enrolment:
CID SID Term
CS290 1177 042
CS260 1177 052
ME222 1131 051
AMCS220 1197 051
● What if the condition is ID > SID?
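With an inequality condition there is no single key under which matching rows meet, so grouping by key does not help. One possible workaround, sketched here as an assumption rather than the intended answer, is to copy the small Students table to every Mapper and compare it against each Enrolment row (a map-side nested-loop join). The predicate is a parameter so the same code covers both ID != SID and ID > SID.

```python
import operator

# Rows copied from the tables above; terms are strings to keep leading zeros.
STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
ENROLMENT = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]

def theta_join(students, enrolment, predicate):
    """Compare every Enrolment row with every (broadcast) Students row."""
    out = []
    for _cid, sid, term in enrolment:            # a Mapper scans its Enrolment partition
        for name, student_id, _age in students:  # Students is available to every Mapper
            if predicate(student_id, sid):
                out.append((name, term))
    return out

print(len(theta_join(STUDENTS, ENROLMENT, operator.ne)))    # ID != SID
print(sorted(theta_join(STUDENTS, ENROLMENT, operator.gt)))  # ID > SID
```

The cost grows with the product of the two table sizes, which is why such joins are considered hard to do efficiently in this model.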
27. MapReduce in Database - Ex5
● Select Name, Term from Students, Enrolment where ID = SID
and Admission != Term;
Students:
Name ID Age Admission
Ahmed 1177 23 042
Bob 1131 20 051
Sara 1197 22 042
Enrolment:
CID SID Term
CS290 1177 042
CS260 1177 052
ME222 1131 051
AMCS220 1197 051
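Here the second condition compares columns from both tables, so it cannot be checked by the Mapper on a single row. A hedged sketch of one approach: join on ID = SID as usual, then apply Admission != Term inside the Reducer once both sides are available. The function name and the dictionary-based shuffle are assumptions for this example.

```python
from collections import defaultdict

# Rows copied from the tables above; terms are strings to keep leading zeros.
STUDENTS = [("Ahmed", 1177, 23, "042"), ("Bob", 1131, 20, "051"), ("Sara", 1197, 22, "042")]
ENROLMENT = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]

def join_with_residual_filter(students, enrolment):
    # Map + shuffle: group both tables by the join key (ID = SID).
    groups = defaultdict(lambda: {"students": [], "terms": []})
    for name, student_id, _age, admission in students:
        groups[student_id]["students"].append((name, admission))
    for _cid, sid, term in enrolment:
        groups[sid]["terms"].append(term)
    # Reduce: pair rows per key, then apply the cross-table condition.
    out = []
    for key in sorted(groups):
        for name, admission in groups[key]["students"]:
            for term in groups[key]["terms"]:
                if admission != term:
                    out.append((name, term))
    return out

print(join_with_residual_filter(STUDENTS, ENROLMENT))
```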
28. MapReduce in Database - Ex6
● Select y from R, S, T where R.x = S.x and T.a = S.a;
R:
x y z
S:
a b x
T:
m n a
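A three-way join can be run as two chained MapReduce jobs: first join R and S on x, then join that result with T on a. The sketch below illustrates the chaining only; the original gives just the schemas, so the sample rows are invented, and `reduce_side_join` is a hypothetical helper.

```python
from collections import defaultdict

def reduce_side_join(left, right, key):
    """Equi-join two lists of dict rows on `key`, grouping by key as a shuffle would."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return [{**l, **r}
            for k in sorted(groups)
            for l in groups[k][0]
            for r in groups[k][1]]

# Hypothetical sample rows; only the column names come from the schemas above.
R = [{"x": 1, "y": "y1", "z": 10}, {"x": 2, "y": "y2", "z": 20}]
S = [{"a": 7, "b": 0, "x": 1}, {"a": 8, "b": 0, "x": 2}]
T = [{"m": 0, "n": 0, "a": 7}]

def three_way_join(r_rows, s_rows, t_rows):
    rs = reduce_side_join(r_rows, s_rows, key="x")   # job 1: R.x = S.x
    rst = reduce_side_join(rs, t_rows, key="a")      # job 2: T.a = S.a
    return [row["y"] for row in rst]

print(three_way_join(R, S, T))  # ['y1']
```

The output of the first job has to be written out and read back by the second, which is one source of the overhead of multi-step queries.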
29. MapReduce in Academic Papers
● NIPS '07: Map-Reduce for Machine Learning on Multicore.
● eScience '08: CloudBLAST: Combining MapReduce and Virtualization on
Distributed Resources for Bioinformatics Applications.
● KDD '09: Large-scale behavioral targeting.
● GCC '09: Spatial Queries Evaluation with MapReduce.
● SIGIR '09: On single-pass indexing with MapReduce.
● MDAC '10: A novel approach to multiple sequence alignment using
hadoop data grids.
● VLDB Endowment '11: Social Content Matching in MapReduce.
● VLDB '12: Building Wavelet Histograms on Large Data in MapReduce.