6: MapReduce Applications

                            Zubair Nabi

                  zubair.nabi@itu.edu.pk


                          April 18, 2013




Zubair Nabi      6: MapReduce Applications   April 18, 2013   1 / 27
Outline




  1    The Anatomy of a MapReduce Application



  2    MapReduce Design Patterns



  3    Common MapReduce Application Types




  1    The Anatomy of a MapReduce Application

MapReduce job phases


 A MapReduce job can be divided into 4 phases:
     1    Input split: The input dataset is sliced into M splits, one per map task
     2    Map logic: The user-supplied map function is invoked
                In tandem, a sort phase ensures that the map output is locally
                sorted by key
                In addition, the key space is partitioned amongst the reducers
     3    Shuffle: Each reduce task fetches its partition of the map output
          from every map task
     4    Reduce logic: The user-provided reduce function is invoked
                Before the reduce function is applied, the fetched partitions are
                merged so that each reduce task sees its key/value pairs sorted by key
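
The four phases can be traced end to end with a small in-memory sketch. This is only an illustration under stated assumptions, not how any real framework is implemented: run_job, num_maps and num_reduces are hypothetical names, splits are plain list slices, and "shuffle" is just a list lookup instead of a network transfer.

```python
from collections import defaultdict


def run_job(lines, map_fn, reduce_fn, num_maps=2, num_reduces=2):
    """Run a toy MapReduce job in memory (hypothetical helper)."""
    # Phase 1: input split. Slice the input into contiguous splits, one per map task.
    size = max(1, -(-len(lines) // num_maps))  # ceiling division
    splits = [lines[i:i + size] for i in range(0, len(lines), size)]

    # Phase 2: map. Apply map_fn, partition the key space, sort each partition locally.
    map_outputs = []
    for split in splits:
        partitions = defaultdict(list)
        for line_no, line in enumerate(split):
            for key, value in map_fn(line_no, line):
                partitions[hash(key) % num_reduces].append((key, value))
        map_outputs.append({p: sorted(kv) for p, kv in partitions.items()})

    results = {}
    for r in range(num_reduces):
        # Phase 3: shuffle. Reduce task r fetches partition r from every map task.
        fetched = [kv for out in map_outputs for kv in out.get(r, [])]
        # Phase 4: reduce. Merge into one sorted run, group by key, call reduce_fn.
        grouped = defaultdict(list)
        for key, value in sorted(fetched):
            grouped[key].append(value)
        for key, values in grouped.items():
            results[key] = reduce_fn(key, values)
    return results


def wordcount_map(line_no, line):
    return [(word, 1) for word in line.split()]


def wordcount_reduce(word, counts):
    return sum(counts)
```

Calling run_job(["a b a", "b c"], wordcount_map, wordcount_reduce) returns a dictionary of per-word counts; the result does not depend on how keys happen to be assigned to partitions.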




Of mappers and reducers




          In the common case, programmers only need to write a map and a
          reduce function
          The user-provided map function is invoked for every record in the
          input file (a line by default, although this can be changed) and is
          passed the line number as key and the line contents as value
          The user-provided reduce function is invoked for each key output by
          the map phase and is passed the associated values as an iterable




Wordcount: High-level view




          Input: A text corpus such as a Wikipedia dump or books from Project
          Gutenberg
          The map function is invoked once for each text line
          Map output: Words as keys and 1 as values
          Reduce input: Key/value pairs of words and values (1)
          The reduce function is invoked once for each word with a list of 1s
          Reduce output: Words and their final counts




Wordcount: Low-level view



          For each map task, a new process called MapRunner is created
          MapRunner has a RecordReader instance that is used to read the
          input file
          RecordReader reads the input file in chunks and parses the chunks
          into lines
          MapRunner also has a Mapper instance with a map function,
          WordCountMapper in this case
          For each line parsed by RecordReader, MapRunner calls
          WordCountMapper.map() and passes it the line




Wordcount: Low-level view (2)


          WordCountMapper has an OutputCollector instance which maintains
          an in-memory buffer for each output partition (one partition per reduce)
          Each time WordCountMapper.map() is invoked, it tokenizes the line
          into words
          For each word, it writes the word as key and 1 as value to
          OutputCollector
          OutputCollector uses the Partitioner instance to select a partition
          buffer for each key
          Whenever the size of a partition buffer exceeds a configurable
          threshold, its contents are first sorted by key and then flushed to disk
          This process is repeated till the map logic has been applied to all lines
          within the input file
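
The buffering behaviour described above can be sketched as follows. This is a hedged illustration, not the course framework's actual OutputCollector: the class name SketchOutputCollector, the spill_threshold parameter, and the byte-sum partition function are all assumptions, and "disk" is modelled as a list of sorted runs.

```python
from collections import defaultdict


class SketchOutputCollector:
    """Hypothetical collector with one in-memory buffer per reduce partition."""

    def __init__(self, num_partitions, spill_threshold=2):
        self.num_partitions = num_partitions
        self.spill_threshold = spill_threshold  # buffer size that triggers a spill
        self.buffers = defaultdict(list)        # partition -> [(key, value), ...]
        self.spills = defaultdict(list)         # partition -> sorted runs ("disk")

    def partition(self, key):
        # Stand-in for the Partitioner: any deterministic mapping
        # from key to partition number would do.
        return sum(key.encode("utf-8")) % self.num_partitions

    def collect(self, key, value):
        p = self.partition(key)
        self.buffers[p].append((key, value))
        # Spill once the buffer exceeds the configured threshold.
        if len(self.buffers[p]) > self.spill_threshold:
            self.flush(p)

    def flush(self, p):
        # Sort the buffer by key, then "write" it out as one sorted run.
        if self.buffers[p]:
            self.spills[p].append(sorted(self.buffers[p]))
            self.buffers[p] = []

    def close(self):
        # Called once the map logic has seen every line: flush what is left.
        for p in list(self.buffers):
            self.flush(p)
```

Each partition ends up as several sorted runs rather than one sorted file, which is why a merge step is needed later on the reduce side.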



Wordcount: Low-level view (3)




          Once all maps have completed their execution, the reduce phase is
          started
          For each reduce task, a ReduceRunner process is created
          Each reduce task fetches its input partitions from machines on which
          map tasks were run
          All input partitions are then merged to get a globally sorted partition of
          key/value pairs
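
Because every fetched partition is already sorted by key, the merge step is a k-way merge rather than a full sort. A minimal sketch using Python's standard heapq.merge; the function name merge_partitions is hypothetical and the runs are plain in-memory lists.

```python
import heapq
from operator import itemgetter


def merge_partitions(sorted_runs):
    """Merge already-sorted runs of (key, value) pairs into one sorted list."""
    # key=itemgetter(0) compares pairs by key only, so values never
    # need to be comparable with each other.
    return list(heapq.merge(*sorted_runs, key=itemgetter(0)))
```

The merge is lazy internally, so a real implementation could stream pairs to the reducer instead of materialising the list as done here.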




Wordcount: Low-level view (4)


          ReduceRunner contains a Reducer instance with a reduce function,
          WordCountReducer in this case
          For each word, ReduceRunner invokes WordCountReducer.reduce()
          and passes it the word and a list of its values (1s)
          WordCountReducer also has an OutputCollector instance with an
          in-memory buffer
          WordCountReducer.reduce() sums the list of values it is passed and
          writes the word and its final count to the OutputCollector
          This process is repeated till the reduce logic has been applied to
          all key/value pairs
          At the end of the entire job, each reduce task produces an output
          file with words and their number of occurrences
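
The way a runner turns one sorted stream of pairs into one reduce() call per key can be sketched with itertools.groupby. This is an assumption-laden illustration: run_reduce and word_count_reduce are hypothetical names, and the output is returned as a list instead of being written to a file.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce(sorted_pairs, reduce_fn):
    """Call reduce_fn once per distinct key of a key-sorted pair stream."""
    output = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        # Hand the reducer an iterable of values, not the (key, value) pairs.
        values = (value for _, value in group)
        output.append((key, reduce_fn(key, values)))
    return output


def word_count_reduce(word, counts):
    # Sum the 1s emitted for this word.
    return sum(counts)
```

Grouping only works because the input is sorted by key; on unsorted input groupby would produce several groups for the same word.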



Wordcount map in Java




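    // word (a Text) and one (an IntWritable holding 1) are assumed to be
    // fields of the enclosing Mapper class
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into whitespace-separated tokens
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emit (word, 1)
        }
    }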




Wordcount reduce in Java



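    // result (an IntWritable) is assumed to be a field of the enclosing
    // Reducer class
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // add up the 1s emitted for this word
        }
        result.set(sum);
        context.write(key, result);     // emit (word, total count)
    }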




Wordcount map in Python




    def map(self, key, value):
        # Emit (word, 1) for every whitespace-separated word in the line
        for word in value.split():
            self._output_collector.collect(word, 1)




Wordcount reduce in Python




    def reduce(self, key, values):
        # Sum the 1s collected for this word and emit the final count
        total = 0
        for value in values:
            total += value
        self._output_collector.collect(key, total)




  2    MapReduce Design Patterns

Bird’s-eye view


          The MapReduce paradigm is amenable to divide-and-conquer
          algorithms
          One way to look at MapReduce is that it is just a large-scale sorting
          platform
          User logic is invoked only at specific hook points
          Algorithms must be expressed in terms of a small number of specific
          components that fit together in preset ways
                Like putting together a jigsaw puzzle in which all the other pieces have
                already been assembled and you only need to add two pieces: The map
                and the reduce pieces
          Fortunately, a large number of algorithms easily fit this rigid pattern




Programmer control




  The programmer has no control over
     1    The location of a map or reduce task in terms of nodes in the cluster
     2    The start and end time of a map or a reduce task
     3    The input key/value pairs processed by a specific map task
     4    The intermediate key/value pairs processed by a specific reduce task




Programmer control (2)



  The programmer does have control over
     1    The data structures to be used as keys and values
     2    Initialization code at the beginning of map/reduce tasks and
          termination code at the end
     3    Preservation of state across multiple invocations of map/reduce tasks
     4    The sort order of intermediate keys and in turn, the order in which a
          reducer encounters keys
     5    Partitioning of key space and in turn, the set of keys that a particular
          reducer encounters
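
Two of these control points, the partitioning of the key space and the sort order of intermediate keys, can be illustrated with a short sketch. Everything here is hypothetical: first_letter_partitioner and partition_and_sort are made-up names, keys are assumed to be non-empty strings, and a real framework would expose these hooks through its own partitioner and comparator interfaces.

```python
def first_letter_partitioner(key, num_reducers):
    """Hypothetical partitioner: route keys by their first character."""
    return ord(key[0].lower()) % num_reducers


def partition_and_sort(pairs, num_reducers, partitioner,
                       sort_key=None, reverse=False):
    """Assign each pair to a reducer, then order every reducer's keys."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partitioner(key, num_reducers)].append((key, value))
    for part in partitions:
        # sort_key controls the order in which a reducer encounters its keys.
        part.sort(key=lambda kv: sort_key(kv[0]) if sort_key else kv[0],
                  reverse=reverse)
    return partitions
```

With this partitioner, all keys that start with the same letter reach the same reducer, and passing sort_key=str.lower with reverse=True makes each reducer see its keys in case-insensitive descending order.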




Multi-job algorithms




          Many algorithms cannot be easily expressed as a single MapReduce
          job
          Complex algorithms need to be decomposed into a sequence of jobs
                The output of one job becomes the input to the next
          Most iterative algorithms need to be run by an external driver
          program that performs the convergence check
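
A driver of this kind is essentially a loop around job submission. The sketch below assumes, purely for illustration, that each job maps a dictionary of numeric values to a new dictionary and that convergence means the largest change falls below a tolerance; run_until_converged, tolerance and max_iterations are hypothetical names.

```python
def run_until_converged(job, state, tolerance=1e-6, max_iterations=100):
    """Chain jobs, feeding each output into the next, until values settle."""
    for iteration in range(1, max_iterations + 1):
        new_state = job(state)  # stands in for one complete MapReduce job
        # Convergence check: largest change of any value since the last job.
        delta = max((abs(new_state[k] - state.get(k, 0.0)) for k in new_state),
                    default=0.0)
        state = new_state
        if delta < tolerance:
            return state, iteration
    return state, max_iterations
```

The max_iterations cap is a safety net so that an algorithm that never converges does not submit jobs forever.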




Local aggregation



          Network and disk latencies are expensive compared to other
          operations
          Decreasing the amount of data transferred over the network during the
          shuffle phase improves efficiency
          Aggressive use of combiners for commutative and associative
          operations can greatly reduce intermediate data
          Another strategy, dubbed “in-mapper combining”, can decrease not only
          the amount of intermediate data but also the number of key/value pairs
          emitted by the map tasks
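
The difference between the two strategies is easiest to see side by side for word count. This is a sketch with hypothetical function names: combine models a combiner that runs over already-emitted map output, while in_mapper_combining_map keeps counts in task-local state and emits each word once when the task finishes.

```python
from collections import Counter


def plain_map(lines):
    """Emit (word, 1) for every token: one pair per word occurrence."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combiner: locally sum the values of emitted pairs that share a key."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def in_mapper_combining_map(lines):
    """In-mapper combining: aggregate in task state, emit once at the end."""
    totals = Counter()
    for line in lines:
        totals.update(line.split())
    return sorted(totals.items())
```

Both approaches send the same data into the shuffle, but with a combiner the map function still emits one pair per occurrence first, whereas in-mapper combining never creates those pairs. The price is that the counts must fit in the map task's memory.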




  3    Common MapReduce Application Types

Counting and Summing




     1    Problem
                A number of documents, each containing a set of terms
                Need to calculate the number of occurrences of each term (word count)
                or some arbitrary function over the terms (average response time in log
                files)
     2    Solution
                Map: For each term, emit the term and “1”
                Reduce: Take the sum (or any other operation) of each term's values
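
For the "arbitrary function" case, here is a hedged sketch of average response time per endpoint. The log format (an endpoint followed by a response time in milliseconds on each line) is an assumption made for this example, as are the function names; grouping is done in memory in place of a real shuffle.

```python
from collections import defaultdict


def map_log_line(line):
    """Emit (endpoint, (sum, count)) for one log line."""
    endpoint, millis = line.split()
    return endpoint, (float(millis), 1)


def reduce_average(values):
    """Combine partial (sum, count) pairs and return the mean."""
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return total / count


def average_response_times(lines):
    grouped = defaultdict(list)  # stands in for the shuffle
    for line in lines:
        key, value = map_log_line(line)
        grouped[key].append(value)
    return {key: reduce_average(vals) for key, vals in grouped.items()}
```

Emitting (sum, count) pairs rather than raw times is a deliberate choice: an average of averages is generally wrong, whereas sums and counts can be added up in any order, so partial aggregation remains possible.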




Collating



     1    Problem
                A set of items (for example, terms in documents) and a function of a
                single item
                Need to group all items that have the same function value, either to
                store them together or to perform some computation over them
     2    Solution
                Map: For each item, compute the given function and emit the function
                value as key and the item as value
                Reduce: Either save all grouped items or perform further computation
                Example: Inverted index: items are (word, document ID) occurrences and
                the function returns the word, so document IDs are grouped per word
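
The collating pattern can be sketched with a small inverted index. This is a toy single-process illustration, not framework code: collate, the function names, and the sample documents are hypothetical, and the sketch assumes the grouping key is the word so that each reduce call receives the document IDs for one word.

```python
from collections import defaultdict

def map_occurrences(document):
    """Map: for each word occurrence, emit (function value, item)."""
    doc_id, text = document
    for word in text.lower().split():
        yield word, doc_id  # key = word, value = ID of the containing document

def reduce_collect(word, doc_ids):
    """Reduce: store the grouped items, here as sorted distinct document IDs."""
    return word, sorted(set(doc_ids))

def collate(inputs, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, groups[key]) for key in sorted(groups))

documents = [(1, "map reduce"), (2, "reduce side join"), (3, "map side join")]
inverted_index = collate(documents, map_occurrences, reduce_collect)
print(inverted_index)
```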




Filtering, Parsing, and Validation



     1    Problem
                A set of records
                Need to collect all records that meet some condition or transform each
                record into another representation
     2    Solution
                Map: For each record, emit it if it passes the condition, or emit its
                transformed version
                Reduce: Identity
                Example: Text parsing or transformation such as word capitalization
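
As a rough sketch of this pattern, the mapper below validates, filters, and parses raw records in one pass, and the reducer is the identity. The record layout ("METHOD PATH STATUS"), the status threshold, and all names are assumptions chosen for illustration; run_job is a hypothetical single-process stand-in for the framework.

```python
from collections import defaultdict

def map_filter_parse(line):
    """Map: validate and parse one raw line; emit it only if status >= 500."""
    fields = line.split()
    if len(fields) != 3 or not fields[2].isdigit():
        return  # validation: drop malformed records
    method, path, status = fields[0], fields[1], int(fields[2])
    if status >= 500:  # filtering condition
        yield path, (method, status)  # transformed (parsed) representation

def reduce_identity(key, values):
    """Reduce: identity, pass every grouped value through unchanged."""
    for value in values:
        yield key, value

def run_job(inputs, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key in sorted(groups) for pair in reducer(key, groups[key])]

lines = ["GET /index 200", "POST /login 500", "garbage",
         "GET /report 503", "GET /img abc"]
server_errors = run_job(lines, map_filter_parse, reduce_identity)
print(server_errors)
```

Because the reducer does no work, a job like this can often be configured to run with map tasks only, which avoids the shuffle altogether.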




Distributed Task Execution




     1    Problem
                Large computational problem
                Need to divide it into multiple parts and combine results from all parts to
                obtain a final result
     2    Solution
                Map: Perform corresponding computation
                Reduce: Combine all emitted results into a final one
                Example: RGB histogram calculation of bitmap images
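
The RGB histogram example could look roughly as follows. Each image is treated as one part of the problem: the mapper computes a partial histogram for it and the reducer merges the partial counts per bin. Images are modelled as plain lists of (r, g, b) tuples, which is an assumption made to keep the sketch self-contained; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def map_histogram(pixels):
    """Map: compute the partial histogram of one image."""
    partial = Counter()
    for r, g, b in pixels:
        partial[("R", r)] += 1
        partial[("G", g)] += 1
        partial[("B", b)] += 1
    # Emit ((channel, intensity), count) pairs for this image only.
    yield from partial.items()

def reduce_merge(bin_key, counts):
    """Reduce: combine the partial counts of one histogram bin."""
    return bin_key, sum(counts)

def run_job(inputs, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, groups[key]) for key in sorted(groups))

images = [
    [(255, 0, 0), (255, 0, 0)],  # two red pixels
    [(255, 255, 0)],             # one yellow pixel
]
histogram = run_job(images, map_histogram, reduce_merge)
print(histogram)
```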




Sorting




     1    Problem
                A set of records
                Need to sort records in some order
     2    Solution
                Map: Identity
                Reduce: Identity
                Also possible to sort by value: either perform a secondary sort or
                perform a value-to-key conversion




References




     1    Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with
          MapReduce. Morgan and Claypool Publishers.
     2    MapReduce Patterns, Algorithms, and Use Cases:
          http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/




Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 

Topic 6: MapReduce Applications

  • 1. 6: MapReduce Applications Zubair Nabi zubair.nabi@itu.edu.pk April 18, 2013 Zubair Nabi 6: MapReduce Applications April 18, 2013 1 / 27
  • 2. Outline
      1. The Anatomy of a MapReduce Application
      2. MapReduce Design Patterns
      3. Common MapReduce Application Types
  • 10. MapReduce job phases
      A MapReduce job can be divided into 4 phases:
      1. Input split: The input dataset is sliced into M splits, one per map task
      2. Map logic: The user-supplied map function is invoked
           - In tandem, a sort phase is also applied that ensures that map output is locally sorted by key
           - In addition, the key space is also partitioned amongst the reducers
      3. Shuffle: Map output is relayed to all reduce tasks
      4. Reduce logic: The user-provided reduce function is invoked
           - Before the application of the reduce function, the input keys are merged to get globally sorted key/value pairs
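The four phases can be sketched as a single-process simulation. This is only an illustrative model, not how any real framework is implemented: `run_job`, `partition`, and the byte-sum partitioning rule are hypothetical names and choices made for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def partition(key, num_reduces):
    # Deterministic stand-in for a hash partitioner (hypothetical helper).
    return sum(key.encode("utf-8")) % num_reduces


def run_job(records, map_fn, reduce_fn, num_maps=2, num_reduces=2):
    # 1. Input split: slice the input into one split per map task.
    splits = [records[i::num_maps] for i in range(num_maps)]

    # 2. Map logic: apply map_fn, partition the output by key,
    #    and sort each partition locally by key.
    map_outputs = []
    for split in splits:
        buckets = [[] for _ in range(num_reduces)]
        for line_number, line in enumerate(split):
            for key, value in map_fn(line_number, line):
                buckets[partition(key, num_reduces)].append((key, value))
        map_outputs.append([sorted(b, key=itemgetter(0)) for b in buckets])

    results = {}
    for r in range(num_reduces):
        # 3. Shuffle: reduce task r fetches partition r from every map task.
        runs = [output[r] for output in map_outputs]
        # 4. Reduce logic: merge the sorted runs, then call reduce_fn per key.
        merged = heapq.merge(*runs, key=itemgetter(0))
        for key, group in groupby(merged, key=itemgetter(0)):
            for out_key, out_value in reduce_fn(key, [v for _, v in group]):
                results[out_key] = out_value
    return results


def wordcount_map(line_number, line):
    return [(word, 1) for word in line.split()]


def wordcount_reduce(word, counts):
    return [(word, sum(counts))]
```

Calling `run_job(["a b a", "b c", "a"], wordcount_map, wordcount_reduce)` runs a word count through all four phases on a three-line input.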
  • 13. Of mappers and reducers
      - In the common case, programmers only need to write a map and a reduce function
      - The user-provided map function is invoked for every line in the input file (the record unit can be changed) and is passed the line number as key and the line contents as value
      - The user-provided reduce function is invoked for each key output by the map phase and is passed the associated values as an iterable
  • 19. Wordcount: High-level view
      - Input: A text corpus such as a Wikipedia dump, books from Gutenberg, etc.
      - The map function is invoked once for each text line
      - Map output: Words as keys and 1 as values
      - Reduce input: Key/value pairs of words and values (1)
      - The reduce function is invoked once for each word with a list of 1s
      - Reduce output: Words and their final counts
  • 24. Wordcount: Low-level view
      - A new process, called MapRunner, is created for each map task
      - MapRunner has a RecordReader instance that is used to read the input file
      - RecordReader reads the input file in chunks and parses the chunks into lines
      - MapRunner also has a Mapper instance with a map function, WordCountMapper in this case
      - For each line parsed by RecordReader, MapRunner calls WordCountMapper.map() and passes it the line
  • 30. Wordcount: Low-level view (2)
      - WordCountMapper has an OutputCollector instance which maintains an in-memory buffer for each output partition (one partition per reduce task)
      - Each time WordCountMapper.map() is invoked, it tokenizes the line into words
      - For each word, it writes the word as key and 1 as value to OutputCollector
      - OutputCollector uses the Partitioner instance to select a partition buffer for each key
      - Whenever the size of a partition buffer exceeds a configurable threshold, its contents are first sorted by key and then flushed to disk
      - This process is repeated till the map logic has been applied to all lines within the input file
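The buffering behaviour described above can be sketched as follows. `SpillingCollector` is a hypothetical stand-in, not the real OutputCollector; the "disk" is modelled as in-memory lists of sorted runs, and the threshold counts pairs rather than bytes.

```python
class SpillingCollector:
    """Hypothetical sketch of a collector with one buffer per partition."""

    def __init__(self, num_partitions, spill_threshold=3):
        self.num_partitions = num_partitions
        self.spill_threshold = spill_threshold
        self.buffers = [[] for _ in range(num_partitions)]
        # Each entry is a list of sorted runs; stands in for files on disk.
        self.spills = [[] for _ in range(num_partitions)]

    def _partition(self, key):
        # Deterministic stand-in for the Partitioner (assumption).
        return sum(key.encode("utf-8")) % self.num_partitions

    def collect(self, key, value):
        p = self._partition(key)
        self.buffers[p].append((key, value))
        # Spill once the buffer grows past the configurable threshold.
        if len(self.buffers[p]) > self.spill_threshold:
            self._flush(p)

    def _flush(self, p):
        if self.buffers[p]:
            # Sort by key first, then "write" the run out.
            self.spills[p].append(sorted(self.buffers[p], key=lambda kv: kv[0]))
            self.buffers[p] = []

    def close(self):
        # Flush whatever is left once the map logic has seen every line.
        for p in range(self.num_partitions):
            self._flush(p)
```

Each spilled run is sorted by key, which is what later allows the reduce side to merge runs instead of re-sorting everything.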
  • 34. Wordcount: Low-level view (3)
      - Once all maps have completed their execution, the reduce phase is started
      - For each reduce task, a ReduceRunner process is created
      - Each reduce task fetches its input partitions from machines on which map tasks were run
      - All input partitions are then merged to get a globally sorted partition of key/value pairs
  • 40. Wordcount: Low-level view (4)
      - ReduceRunner contains a Reducer instance with a reduce function, WordCountReducer in this case
      - For each word, ReduceRunner invokes WordCountReducer.reduce() and passes it the word and a list of its values (1s)
      - WordCountReducer also has an OutputCollector instance with an in-memory buffer
      - WordCountReducer.reduce() sums the list of values it is passed and writes the word and its final count to the OutputCollector
      - This process is repeated till the reduce logic has been applied to all key/value pairs
      - At the end of the entire job, each reduce task produces an output file with words and their number of occurrences
  • 41. Wordcount map in Java
      // Assumes the enclosing Mapper declares the fields `word` (a Text)
      // and `one` (an IntWritable set to 1).
      public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
          // Split the line into whitespace-separated tokens.
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, one);  // emit (word, 1)
          }
      }
  • 42. Wordcount reduce in Java
      // Assumes the enclosing Reducer declares the field `result` (an IntWritable).
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
              sum += val.get();  // add up the 1s emitted for this word
          }
          result.set(sum);
          context.write(key, result);  // emit (word, total count)
      }
  • 43. Wordcount map in Python
      def map(self, key, value):
          # Emit (word, 1) for every whitespace-separated word in the line.
          for word in value.split():
              self._output_collector.collect(word, 1)
  • 44. Wordcount reduce in Python
      def reduce(self, key, values):
          # Sum the 1s collected for this word and emit the total.
          self._output_collector.collect(key, sum(values))
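To try the Python map and reduce outside a cluster, a small harness along these lines may help. It is a sketch under assumptions: `ListCollector`, `WordCount`, and `word_count` are hypothetical names, and the only thing taken from the slides is that the collector exposes `collect(key, value)`.

```python
from collections import defaultdict


class ListCollector:
    """Hypothetical collector that simply records emitted pairs."""

    def __init__(self):
        self.pairs = []

    def collect(self, key, value):
        self.pairs.append((key, value))


class WordCount:
    def __init__(self, output_collector):
        self._output_collector = output_collector

    def map(self, key, value):
        for word in value.split():
            self._output_collector.collect(word, 1)

    def reduce(self, key, values):
        self._output_collector.collect(key, sum(values))


def word_count(lines):
    # Map phase: one map() call per line, keyed by line number.
    map_out = ListCollector()
    mapper = WordCount(map_out)
    for line_number, line in enumerate(lines):
        mapper.map(line_number, line)

    # Group values by key, standing in for the shuffle and merge.
    grouped = defaultdict(list)
    for word, one in map_out.pairs:
        grouped[word].append(one)

    # Reduce phase: one reduce() call per word, in sorted key order.
    reduce_out = ListCollector()
    reducer = WordCount(reduce_out)
    for word in sorted(grouped):
        reducer.reduce(word, grouped[word])
    return reduce_out.pairs
```

For example, `word_count(["the quick fox", "the lazy dog the end"])` returns the words in sorted order with their counts.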
  • 51. Bird’s-eye view
      - The MapReduce paradigm is amenable to divide-and-conquer algorithms
      - One way to look at MapReduce is that it is just a large-scale sorting platform
      - User logic is only involved at specific hook points
      - Algorithms must be expressed in terms of a small number of specific components that fit together in preset ways
      - Like putting together a jigsaw puzzle in which all the other pieces have already been assembled and you only need to add two pieces: the map and the reduce pieces
      - Fortunately, a large number of algorithms easily fit this rigid pattern
  • 55. Programmer control
      The programmer has no control over:
      1. The location of a map or reduce task in terms of nodes in the cluster
      2. The start and end time of a map or a reduce task
      3. The input key/value pairs processed by a specific map task
      4. The intermediate key/value pairs processed by a specific reduce task
  • 60. Programmer control (2)
      The programmer does have control over:
      1. The data structures to be used as keys and values
      2. Initialization code at the beginning of map/reduce tasks and termination code at the end
      3. Preservation of state across multiple invocations of map/reduce tasks
      4. The sort order of intermediate keys and, in turn, the order in which a reducer encounters keys
      5. Partitioning of the key space and, in turn, the set of keys that a particular reducer encounters
  • 64. Multi-job algorithms
      - Many algorithms cannot be easily expressed as a single MapReduce job
      - Complex algorithms need to be decomposed into a sequence of jobs
      - The output of one job becomes the input to the next
      - Most iterative algorithms need to be run by an external driver program that performs the convergence check
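A driver loop with a convergence check might look like the sketch below. The algorithm chosen here (connected components by minimum-label propagation) is an illustrative assumption, not something the slides prescribe; `run_iteration` stands in for submitting one MapReduce job, and the adjacency dictionary is assumed to list every edge in both directions.

```python
from collections import defaultdict


def run_iteration(edges, labels):
    """One 'job': map emits candidate labels, reduce keeps the minimum per node."""
    candidates = defaultdict(list)
    for node, label in labels.items():          # map phase
        candidates[node].append(label)          # keep the node's own label
        for neighbour in edges.get(node, []):
            candidates[neighbour].append(label)  # offer it to each neighbour
    # Reduce phase: the smallest label seen for each node wins.
    return {node: min(values) for node, values in candidates.items()}


def driver(edges, max_iterations=50):
    # Every node starts in its own component.
    labels = {node: node for node in edges}
    for _ in range(max_iterations):
        new_labels = run_iteration(edges, labels)  # output feeds the next job
        if new_labels == labels:                   # convergence check
            break
        labels = new_labels
    return labels


# Two components: {1, 2, 3} and {4, 5}.
example_edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
```

The driver owns the loop and the stopping rule; each individual job stays a plain map followed by a reduce.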
  • 68. Local aggregation
      - Network and disk latencies are expensive compared to other operations
      - Decreasing the amount of data transferred over the network during the shuffle phase improves efficiency
      - Aggressive use of combiners for commutative and associative operations can greatly reduce intermediate data
      - Another strategy, dubbed “in-mapper combining”, can decrease not only the amount of intermediate data but also the number of key/value pairs emitted by the map tasks
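In-mapper combining can be sketched for word count as below. The class name, the `emit` callback, and the `close()` hook are assumptions made for illustration; the idea relies on the mapper being allowed to keep state across `map()` calls and run termination code at the end of the task.

```python
from collections import Counter


class CombiningWordCountMapper:
    """Hypothetical mapper that aggregates counts before emitting anything."""

    def __init__(self, emit):
        self._emit = emit          # callback taking (key, value)
        self._counts = Counter()   # state preserved across map() calls

    def map(self, key, value):
        # Accumulate locally instead of emitting (word, 1) per occurrence.
        self._counts.update(value.split())

    def close(self):
        # Termination code: emit one pair per distinct word seen by this task.
        for word, count in self._counts.items():
            self._emit(word, count)
        self._counts.clear()
```

The trade-off is memory: the mapper holds one counter entry per distinct key until `close()`, so a real implementation would likely flush periodically when the table grows large.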
  • 73. Counting and Summing
      1. Problem
           - A number of documents with a set of terms
           - Need to calculate the number of occurrences of each term (word count) or some arbitrary function over the terms (average response time in log files)
      2. Solution
           - Map: For each term, emit the term and “1”
           - Reduce: Take the sum (or any other operation) of each term's values
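For the average-response-time variant, one workable sketch is to emit (sum, count) pairs so that partial results can be added safely. The log format `"<url> <milliseconds>"` and all function names here are assumptions for the example.

```python
from collections import defaultdict


def map_log_line(line):
    # Assumed log format: "<url> <response_time_ms>".
    url, millis = line.split()
    # Emit a (sum, count) pair rather than a bare value, because averages
    # of averages are not correct in general but sums and counts do add up.
    yield url, (float(millis), 1)


def reduce_average(url, partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    yield url, total / count


def average_response_times(lines):
    # Group map output by key, standing in for the shuffle.
    grouped = defaultdict(list)
    for line in lines:
        for url, partial in map_log_line(line):
            grouped[url].append(partial)
    averages = {}
    for url, partials in grouped.items():
        for key, average in reduce_average(url, partials):
            averages[key] = average
    return averages
```

The same (sum, count) shape also makes the computation usable with a combiner, since partial pairs can be merged by component-wise addition.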
  • 78. Collating
      1. Problem
           - A number of documents with a set of terms and some function of one item
           - Need to group all items that have the same value of the function, either to store the items together or to perform some computation over them
      2. Solution
           - Map: For each item, compute the given function and emit the function value as key and the item as value
           - Reduce: Either save all grouped items or perform further computation
           - Example: Inverted Index: Items are words and function is document ID
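A small inverted index illustrates the collating pattern. This sketch assumes the common layout in which the word is the grouping key and document IDs are the collected values; the function names and the lower-casing step are choices made for the example.

```python
from collections import defaultdict


def map_document(doc_id, text):
    # Emit each distinct word once per document, with the document ID as value.
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_postings(word, doc_ids):
    # Save the grouped items together as a sorted postings list.
    yield word, sorted(doc_ids)


def build_inverted_index(documents):
    # documents: dict mapping document ID to its text.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_document(doc_id, text):
            grouped[word].append(emitted_id)
    index = {}
    for word, doc_ids in grouped.items():
        for key, postings in reduce_postings(word, doc_ids):
            index[key] = postings
    return index
```

The reducer here only stores the group; the "perform further computation" branch of the pattern would replace `sorted(doc_ids)` with, say, a count or a ranking.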
  • 83. Filtering, Parsing, and Validation
      1. Problem
           - A set of records
           - Need to collect all records that meet some condition or transform each record into another representation
      2. Solution
           - Map: For each record, emit it if it passes the condition, or emit its transformed version
           - Reduce: Identity
           - Example: Text parsing or transformation such as word capitalization
  • 88. Distributed Task Execution. Problem: A large computational problem. We need to divide it into multiple parts and combine the results from all parts to obtain a final result. Solution: Map: perform the corresponding computation on one part. Reduce: combine all emitted results into a final one. Example: RGB histogram calculation of bitmap images.
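A rough sketch of the RGB histogram example, assuming each image is already decoded into a list of `(r, g, b)` pixel tuples (real bitmap decoding is left out). `run_mapreduce`, `histogram_map`, and `histogram_reduce` are hypothetical names, and the runner only simulates the framework in memory.

```python
from collections import Counter, defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map, shuffle (group by key, keys sorted), and reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def histogram_map(image):
    """Compute a partial histogram for one image; emit ((channel, level), count)."""
    partial = Counter()
    for r, g, b in image:
        partial[("R", r)] += 1
        partial[("G", g)] += 1
        partial[("B", b)] += 1
    yield from partial.items()


def histogram_reduce(key, counts):
    """Sum the partial counts for one (channel, level) bin."""
    yield key, sum(counts)


# Assumed toy input: two tiny "images" of two pixels each.
images = [
    [(255, 0, 0), (255, 0, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
histogram = dict(run_mapreduce(images, histogram_map, histogram_reduce))
print(histogram)
```

Each map call handles one part of the problem (one image) independently, and the reducer merges the partial histograms bin by bin.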
  • 93. Sorting. Problem: A set of records. We need to sort the records in some order. Solution: Map: identity. Reduce: identity. The framework does the actual work, because it sorts intermediate pairs by key before they reach the reducers. It is also possible to sort by value: either perform a secondary sort or perform a key-to-value conversion.
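A small sketch of sorting with identity map and reduce functions, assuming an in-memory runner that mimics the framework's sort-by-key step. `run_mapreduce`, `identity_map`, `swap_map`, and `identity_reduce` are hypothetical names. `swap_map` illustrates the key-to-value conversion: the value becomes the key, so the same machinery sorts by value.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map, shuffle (group by key, keys sorted), and reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def identity_map(record):
    """Emit the (key, value) record unchanged."""
    yield record


def swap_map(record):
    """Key-to-value conversion: emit (value, key) so the sort runs on the value."""
    key, value = record
    yield value, key


def identity_reduce(key, values):
    """Emit every value unchanged, in the order the framework delivers keys."""
    for value in values:
        yield key, value


records = [("pear", 3), ("apple", 7), ("fig", 1)]
by_key = run_mapreduce(records, identity_map, identity_reduce)
by_value = run_mapreduce(records, swap_map, identity_reduce)
print(by_key)
print(by_value)
```

This simulation behaves like a job with a single reducer. With several reducers, each output partition is sorted on its own, so a globally sorted result also needs a partitioner that assigns key ranges to reducers in order.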
  • 94. References. 1. Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers. 2. MapReduce Patterns, Algorithms, and Use Cases: http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/