Topic 6: MapReduce Applications
1. 6: MapReduce Applications
Zubair Nabi
zubair.nabi@itu.edu.pk
April 18, 2013
Zubair Nabi 6: MapReduce Applications April 18, 2013 1 / 27
2. Outline
1 The Anatomy of a MapReduce Application
2 MapReduce Design Patterns
3 Common MapReduce Application Types
10. MapReduce job phases
A MapReduce job can be divided into four phases:
1 Input split: The input dataset is sliced into M splits, one per map task
2 Map logic: The user-supplied map function is invoked
In tandem, a sort phase ensures that map output is locally sorted by key
In addition, the key space is partitioned amongst the reducers
3 Shuffle: Each reduce task fetches its partition of the map output from
every map task
4 Reduce logic: The user-provided reduce function is invoked
Before the reduce function is applied, the fetched map outputs are merged
to get globally sorted key/value pairs
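The four phases can be traced end to end with a minimal single-process sketch. The helper names (`split_input`, `partition`, `run_job`) and the hash-based partitioning rule are assumptions made for illustration; they are not the API of any MapReduce framework.

```python
from collections import defaultdict
from itertools import groupby
import heapq
import zlib


def split_input(lines, num_splits):
    """Phase 1: slice the input into M splits, one per map task."""
    return [lines[i::num_splits] for i in range(num_splits)]


def partition(key, num_reducers):
    """Assign a key to a reducer (assumed rule: stable hash modulo R)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(lines, map_fn, reduce_fn, num_splits=2, num_reducers=2):
    # Phase 2: run map on every record, partition and locally sort its output.
    map_outputs = []  # one dict {partition: sorted pairs} per map task
    for split in split_input(lines, num_splits):
        parts = defaultdict(list)
        for line_no, line in enumerate(split):  # line number within the split
            for key, value in map_fn(line_no, line):
                parts[partition(key, num_reducers)].append((key, value))
        map_outputs.append({p: sorted(kv) for p, kv in parts.items()})

    results = {}
    for r in range(num_reducers):
        # Phase 3: shuffle, reducer r fetches partition r from every map task.
        runs = [out.get(r, []) for out in map_outputs]
        # Phase 4: merge the sorted runs, then call reduce once per key.
        merged = heapq.merge(*runs)
        for key, group in groupby(merged, key=lambda kv: kv[0]):
            results[key] = reduce_fn(key, (v for _, v in group))
    return results


def wc_map(_key, line):
    return [(word, 1) for word in line.split()]


def wc_reduce(_key, values):
    return sum(values)


counts = run_job(["a b a", "b c", "a"], wc_map, wc_reduce)
print(counts)
```

A real framework runs the map and reduce tasks on different machines and keeps the intermediate data on disk; the sketch only mirrors the order of the steps.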
13. Of mappers and reducers
In the common case, programmers only need to write a map and a
reduce function
The user-provided map function is invoked for every line in the input
file (the record format can be changed) and is passed the line number as
key and the line contents as value
The user-provided reduce function is invoked once for each key output by
the map phase and is passed the associated values as an iterable
19. Wordcount: High-level view
Input: A text corpus such as a Wikipedia dump, books from Project
Gutenberg, etc.
The map function is invoked once for each text line
Map output: Words as keys and 1 as values
Reduce input: Key/value pairs of words and values (1)
The reduce function is invoked once for each word with a list of 1s
Reduce output: Words and their final counts
24. Wordcount: Low-level view
A new process, called MapRunner, is created for each map task
MapRunner has a RecordReader instance that is used to read the
input file
RecordReader reads the input file in chunks and parses the chunks
into lines
MapRunner also has a Mapper instance with a map function,
WordCountMapper in this case
For each line parsed by RecordReader, MapRunner calls
WordCountMapper.map() and passes it the line
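The interplay between the three components can be sketched as follows. The class names come from the slide, but their bodies, the `chunk_size` parameter, and the `emitted` list (a stand-in for the OutputCollector) are assumptions made for illustration.

```python
import io


class RecordReader:
    """Reads a file-like object in fixed-size chunks and yields whole lines."""

    def __init__(self, stream, chunk_size=8):
        self._stream = stream
        self._chunk_size = chunk_size

    def __iter__(self):
        pending = ""
        while True:
            chunk = self._stream.read(self._chunk_size)
            if not chunk:
                break
            pending += chunk
            # Everything before the last newline is a complete line.
            *lines, pending = pending.split("\n")
            yield from lines
        if pending:
            yield pending  # final line without a trailing newline


class WordCountMapper:
    def __init__(self):
        self.emitted = []  # stand-in for the OutputCollector

    def map(self, key, value):
        for word in value.split():
            self.emitted.append((word, 1))


class MapRunner:
    """Drives one map task: pulls lines from the reader, calls map()."""

    def __init__(self, reader, mapper):
        self._reader = reader
        self._mapper = mapper

    def run(self):
        for line_number, line in enumerate(self._reader):
            self._mapper.map(line_number, line)


mapper = WordCountMapper()
MapRunner(RecordReader(io.StringIO("to be or\nnot to be\n")), mapper).run()
print(mapper.emitted)
```

The small chunk size is chosen so that lines straddle chunk boundaries, which is the case the reader has to handle.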
30. Wordcount: Low-level view (2)
WordCountMapper has an OutputCollector instance which maintains
an in-memory buffer for each output partition (one partition per reduce)
Each time WordCountMapper.map() is invoked, it tokenizes the line
into words
For each word, it writes the word as key and 1 as value to
OutputCollector
OutputCollector uses the Partitioner instance to select a partition
buffer for each key
Whenever the size of a partition buffer exceeds a configurable
threshold, its contents are first sorted by key and then flushed to disk
This process is repeated till the map logic has been applied to all lines
within the input file
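The buffering behaviour described above can be sketched as follows. This is an illustrative model only: the `hash_partitioner` rule, the `spill_threshold` parameter, the `close()` method, and the use of in-memory lists in place of spill files are all assumptions.

```python
import zlib


def hash_partitioner(key, num_partitions):
    """Assumed partitioning rule: stable hash of the key modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class OutputCollector:
    def __init__(self, num_partitions, spill_threshold=3,
                 partitioner=hash_partitioner):
        self._num_partitions = num_partitions
        self._threshold = spill_threshold
        self._partitioner = partitioner
        self._buffers = [[] for _ in range(num_partitions)]
        # Each spill is a sorted run; a list stands in for a file on disk.
        self.spills = [[] for _ in range(num_partitions)]

    def collect(self, key, value):
        p = self._partitioner(key, self._num_partitions)
        self._buffers[p].append((key, value))
        if len(self._buffers[p]) > self._threshold:
            self._spill(p)

    def _spill(self, p):
        if self._buffers[p]:
            # Sort by key first, then flush the buffer as one run.
            self.spills[p].append(sorted(self._buffers[p]))
            self._buffers[p] = []

    def close(self):
        """Flush whatever is left once the map task has seen all lines."""
        for p in range(self._num_partitions):
            self._spill(p)


collector = OutputCollector(num_partitions=2)
for word in "to be or not to be that is the question".split():
    collector.collect(word, 1)
collector.close()
print(collector.spills)
```

Every run is sorted on its own, so a reduce task later only needs to merge runs rather than sort from scratch.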
34. Wordcount: Low-level view (3)
Once all maps have completed their execution, the reduce phase is
started
For each reduce task, a ReduceRunner process is created
Each reduce task fetches its input partitions from the machines on which
the map tasks ran
All input partitions are then merged to get a globally sorted partition of
key/value pairs
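Because every fetched partition is already sorted by key, the merge step does not need a full sort. A minimal sketch using the standard library's `heapq.merge`, with made-up data standing in for the runs fetched from three map tasks:

```python
import heapq

# Sorted runs fetched by one reduce task from three map tasks (made-up data).
run_from_map_0 = [("be", 1), ("be", 1), ("to", 1)]
run_from_map_1 = [("be", 1), ("not", 1)]
run_from_map_2 = [("or", 1), ("to", 1)]

# heapq.merge streams the runs in key order without re-sorting them.
merged = list(heapq.merge(run_from_map_0, run_from_map_1, run_from_map_2))
print(merged)
```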
40. Wordcount: Low-level view (4)
ReduceRunner contains a Reducer instance with a reduce function,
WordCountReducer in this case
For each word, ReduceRunner invokes WordCountReducer.reduce()
and passes it the word and a list of its values (1s)
WordCountReducer also has an OutputCollector instance with an
in-memory buffer
WordCountReducer.reduce() sums the list of values it is passed and
writes the word and its final count to the OutputCollector
This process is repeated till the reduce logic has been applied to all
key/value pairs
At the end of the entire job, each reduce produces an output file with
words and their number of occurrences
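The way ReduceRunner turns a sorted stream of pairs into one reduce() call per word can be sketched with `itertools.groupby`. The class bodies and the `output` list (a stand-in for the OutputCollector buffer) are assumptions made for illustration.

```python
from itertools import groupby


class WordCountReducer:
    def __init__(self):
        self.output = []  # stand-in for the OutputCollector buffer

    def reduce(self, key, values):
        self.output.append((key, sum(values)))


class ReduceRunner:
    """Calls reduce() once per distinct key of a key-sorted input."""

    def __init__(self, reducer):
        self._reducer = reducer

    def run(self, sorted_pairs):
        for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
            self._reducer.reduce(key, (value for _, value in group))


reducer = WordCountReducer()
ReduceRunner(reducer).run(
    [("be", 1), ("be", 1), ("not", 1), ("or", 1), ("to", 1), ("to", 1)]
)
print(reducer.output)
```

Grouping only works because the input is sorted by key, which is why the merge step must come first.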
41. Wordcount map in Java
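// Fields of the enclosing Mapper<Object, Text, Text, IntWritable> subclass
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

// Emits (word, 1) for every whitespace-delimited token in the line
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
    }
}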
42. Wordcount reduce in Java
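// Field of the enclosing Reducer<Text, IntWritable, Text, IntWritable> subclass
private IntWritable result = new IntWritable();

// Sums the counts emitted for a word and writes (word, total)
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}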
43. Wordcount map in Python
def map(self, key, value):
    # Emit (word, 1) per word; split() skips runs of whitespace
    for word in value.split():
        self._output_collector.collect(word, 1)
44. Wordcount reduce in Python
def reduce(self, key, values):
    # Sum the counts for this word and emit (word, total)
    self._output_collector.collect(key, sum(values))
51. Bird’s-eye view
The MapReduce paradigm is amenable to divide-and-conquer
algorithms
One way to look at MapReduce is that it is just a large-scale sorting
platform
User logic is invoked only at specific hook points
Algorithms must be expressed in terms of a small number of specific
components that fit together in preset ways
Like putting together a jigsaw puzzle in which all the other pieces have
already been assembled and you only need to add two pieces: The map
and the reduce pieces
Fortunately, a large number of algorithms fit this rigid pattern easily
55. Programmer control
The programmer has no control over
1 The location of a map or reduce task in terms of nodes in the cluster
2 The start and end time of a map or a reduce task
3 The input key/value pairs processed by a specific map task
4 The intermediate key/value pairs processed by a specific reduce task
60. Programmer control (2)
The programmer does have control over
1 The data structures to be used as keys and values
2 Initialization code at the beginning of map/reduce tasks and
termination code at the end
3 Preservation of state across multiple invocations of the map/reduce
function within a task
4 The sort order of intermediate keys and in turn, the order in which a
reducer encounters keys
5 Partitioning of key space and in turn, the set of keys that a particular
reducer encounters
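Points 4 and 5 can be illustrated with a small sketch in which both the partitioner and the sort order are supplied by the programmer. The `shuffle` helper, `first_letter_partitioner`, and `by_length` are hypothetical names used only for this example.

```python
def first_letter_partitioner(key, num_reducers):
    """Hypothetical partitioner: route keys by their first letter."""
    return (ord(key[0].lower()) - ord("a")) % num_reducers


def by_length(pair):
    """Hypothetical sort order: shorter keys first, ties broken by key."""
    key, _value = pair
    return (len(key), key)


def shuffle(pairs, num_reducers, partitioner, sort_key):
    """Group pairs per reducer, then order each group with sort_key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partitioner(key, num_reducers)].append((key, value))
    return [sorted(part, key=sort_key) for part in partitions]


pairs = [("apple", 1), ("cherry", 1), ("banana", 1), ("avocado", 1)]
partitions = shuffle(pairs, 2, first_letter_partitioner, by_length)
print(partitions)
```

Swapping in a different partitioner changes which reducer sees a key; swapping in a different sort key changes the order in which that reducer sees its keys.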
64. Multi-job algorithms
Many algorithms cannot be easily expressed as a single MapReduce
job
Complex algorithms need to be decomposed into a sequence of jobs
The output of one job becomes the input to the next
Most iterative algorithms need to be run by an external driver
program that performs the convergence check
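A driver of this kind can be sketched as a loop that feeds each job's output into the next job and stops when nothing changes. The example problem (labelling connected components of a graph by propagating the smallest node ID) and the names `run_job` and `driver` are assumptions chosen for illustration; the job itself is simulated in memory.

```python
from collections import defaultdict


def run_job(labels, edges):
    """One MapReduce-style pass: each node adopts the smallest label it sees."""
    # Map: every node sends its label to itself and to its neighbours.
    intermediate = defaultdict(list)
    for node, label in labels.items():
        intermediate[node].append(label)
        for neighbour in edges.get(node, ()):
            intermediate[neighbour].append(label)
    # Reduce: keep the minimum label received for each node.
    return {node: min(received) for node, received in intermediate.items()}


def driver(edges, max_iterations=100):
    """External driver: chain jobs until the output stops changing."""
    labels = {node: node for node in edges}
    for iteration in range(1, max_iterations + 1):
        new_labels = run_job(labels, edges)
        if new_labels == labels:  # convergence check
            return labels, iteration
        labels = new_labels  # output of one job is the input of the next
    raise RuntimeError("did not converge")


# Undirected graph with two components: {1, 2, 3} and {4, 5}.
edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, iterations = driver(edges)
print(labels, iterations)
```

The convergence check lives outside the jobs because no single map or reduce task sees the whole output.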
68. Local aggregation
Network and disk latencies are expensive compared to other
operations
Decreasing the amount of data transferred over the network during the
shuffle phase improves efficiency
Aggressive use of combiners for commutative and associative
operations can greatly reduce intermediate data
Another strategy, dubbed “in-mapper combining”, can decrease not only
the amount of intermediate data but also the number of key/value pairs
emitted by the map tasks
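The effect of in-mapper combining on word count can be sketched by comparing the pairs emitted by a plain mapper with those emitted by a mapper that aggregates in memory. The class name `InMapperCombiningMapper` and its `close()` method are assumptions made for illustration.

```python
from collections import Counter


def plain_map(line):
    """Emits one (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


class InMapperCombiningMapper:
    """Keeps partial counts in memory and emits them when the task ends."""

    def __init__(self):
        self._counts = Counter()  # state preserved across map() calls

    def map(self, line):
        self._counts.update(line.split())  # nothing is emitted here

    def close(self):
        return sorted(self._counts.items())  # one pair per distinct word


lines = ["to be or not to be", "to do is to be"]

plain_pairs = [pair for line in lines for pair in plain_map(line)]

mapper = InMapperCombiningMapper()
for line in lines:
    mapper.map(line)
combined_pairs = mapper.close()

print(len(plain_pairs), len(combined_pairs))
```

The trade-off is memory: the in-memory table grows with the number of distinct keys, so a practical version would flush it whenever it gets too large.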
73. Counting and Summing
1 Problem
A number of documents, each containing a set of terms
Need to calculate the number of occurrences of each term (word count)
or some other function over the terms (average response time in log
files)
2 Solution
Map: For each term, emit the term and “1”
Reduce: Take the sum (or any other operation) of each term's values
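The log-file variant of this pattern can be sketched as follows. The log format (`endpoint response_time_ms`), the sample lines, and the function names are assumptions made for illustration; a dictionary stands in for the shuffle.

```python
from collections import defaultdict


def map_log_line(line):
    """Parses an assumed 'endpoint response_time_ms' log format."""
    endpoint, millis = line.split()
    return endpoint, float(millis)


def reduce_average(values):
    values = list(values)
    return sum(values) / len(values)


log_lines = ["/home 120", "/search 300", "/home 80", "/search 500"]

grouped = defaultdict(list)  # stands in for the shuffle
for line in log_lines:
    key, value = map_log_line(line)
    grouped[key].append(value)

averages = {key: reduce_average(values) for key, values in grouped.items()}
print(averages)
```

Unlike a sum, a mean of means is not the overall mean, so this reducer cannot be reused as a combiner; a combiner-friendly version would emit (sum, count) pairs and divide only at the end.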
78. Collating
1 Problem
A set of items and some function of one item
Need to group all items that have the same value of the function, either
to store them together or to perform some computation over them
2 Solution
Map: For each item, compute the given function and emit the function value
as key and the item as value
Reduce: Either save all grouped items or perform further computation
Example: Inverted index: Items are (word, document ID) occurrences and
the function returns the word, so each word is grouped with the IDs of
the documents that contain it
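An inverted index built with this pattern can be sketched in a few lines. The three-document corpus is made up, and sorting plus a dictionary stand in for the shuffle.

```python
from collections import defaultdict

documents = {  # made-up corpus: document ID -> text
    "d1": "map reduce sort",
    "d2": "map shuffle",
    "d3": "reduce shuffle sort",
}

# Map: emit (word, document ID) for every distinct word in a document.
pairs = [(word, doc_id)
         for doc_id, text in documents.items()
         for word in set(text.split())]

# Shuffle + reduce: collect the sorted document IDs for each word.
postings = defaultdict(list)
for word, doc_id in sorted(pairs):
    postings[word].append(doc_id)

inverted_index = dict(postings)
print(inverted_index["sort"])
```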
83. Filtering, Parsing, and Validation
1 Problem
A set of records
Need to collect all records that meet some condition or transform each
record into another representation
2 Solution
Map: For each record, emit it if it passes the condition or emit its
transformed version
Reduce: Identity
Example: Text parsing or transformation such as word capitalization
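Both variants of the map function, paired with an identity reduce, can be sketched as follows. The sample records, the "starts with error" condition, and the function names are assumptions made for illustration.

```python
def filter_map(record, predicate):
    """Emit the record only if it satisfies the condition."""
    return [record] if predicate(record) else []


def transform_map(record):
    """Emit a transformed version of the record (here: capitalized words)."""
    return [" ".join(word.capitalize() for word in record.split())]


def identity_reduce(records):
    """Pass every record through unchanged."""
    return list(records)


def is_error(record):
    return record.startswith("error")


records = ["error disk full", "info started", "error timeout"]

errors = identity_reduce(
    out for rec in records for out in filter_map(rec, is_error)
)
capitalized = identity_reduce(
    out for rec in records for out in transform_map(rec)
)
print(errors, capitalized)
```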
88. Distributed Task Execution
1 Problem
Large computational problem
Need to divide it into multiple parts and combine results from all parts to
obtain a final result
2 Solution
Map: Perform corresponding computation
Reduce: Combine all emitted results into a final one
Example: RGB histogram calculation of bitmap images
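The histogram example can be sketched with each map task computing a partial histogram for one image and the reduce step adding the partial results together. The pixel data and function names are made up for illustration.

```python
from collections import Counter

# Made-up images: each is a list of (R, G, B) pixels with 8-bit channels.
images = [
    [(255, 0, 0), (255, 0, 0), (0, 0, 255)],
    [(0, 0, 255), (0, 255, 0)],
]


def map_image(pixels):
    """Compute the partial histogram of one image."""
    histogram = Counter()
    for red, green, blue in pixels:
        histogram[("R", red)] += 1
        histogram[("G", green)] += 1
        histogram[("B", blue)] += 1
    return histogram


def reduce_histograms(partials):
    """Combine the partial histograms into the final one."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


histogram = reduce_histograms(map_image(image) for image in images)
print(histogram[("R", 255)], histogram[("B", 255)])
```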
93. Sorting
1 Problem
A set of records
Need to sort records in some order
2 Solution
Map: Identity
Reduce: Identity
It is also possible to sort by value: either perform a secondary sort or
perform a value-to-key conversion
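Sorting by value can be sketched by moving the value into a composite key, so that the framework's own sort of intermediate keys does the work. The sample records and function names are assumptions, and `sorted()` stands in for the framework's sort phase.

```python
records = [("apple", 3), ("banana", 1), ("cherry", 2), ("date", 1)]


def value_to_key_map(key, value):
    """Move the value into a composite key so the framework sorts by it."""
    return (value, key), None


def restore_reduce(composite_key):
    """Unpack the composite key back into the original (key, value) pair."""
    value, key = composite_key
    return key, value


# sorted() stands in for the framework's sort of intermediate keys.
intermediate = sorted(value_to_key_map(k, v) for k, v in records)
by_value = [restore_reduce(composite) for composite, _ in intermediate]
print(by_value)
```

With more than one reducer, each output file is sorted only within itself; a total order across files additionally needs a partitioner that assigns contiguous key ranges to reducers.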
94. References
1 Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with
MapReduce. Morgan and Claypool Publishers.
2 MapReduce Patterns, Algorithms, and Use Cases:
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/