Introduction to
MapReduce

   Zuhair Khayyat
     3/11/2012
What is MapReduce

   ●   A programming model introduced by Google at OSDI '04 for
       processing large datasets efficiently.
   ●   Features:
               –   Automatic parallelization; no parallel-programming experience required.
               –   Data and process redundancy for failure recovery.
               –   Automatic scheduling and load balancing.
               –   Easy to program, based on two simple functions:
                        ●   Map
                        ●   Reduce.

CS245 - 2012                      Introduction to MapReduce                 2
Why MapReduce?

   ●   For a cluster of:
               –   2000 machines.
               –   Total 16 TB RAM (≈ 8 GB each).
               –   Total 2 PB Disk space (≈ 1 TB each).
   ●   Use the maximum capacity of the cluster to:
               –   Implement a parallel word count for input size 100 TB.
               –   Implement a parallel sort for the same input file.
   ●    Can you use the same code for both applications?

How Fast is MapReduce (Hadoop)

   ●   Sort Benchmark competition (http://sortbenchmark.org/):
               –   2009: 100 TB in 173 minutes using 3452 nodes:
                        ● 2 x Quad core Xeons @ 2.5 GHz.
                        ● 8 GB RAM.


               –   2008: 1 TB in 3.48 minutes using 910 nodes:
                        ●   4 x Dual core Xeons @ 2.0 GHz.
                        ●   8 GB RAM.
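
The throughput behind these two results follows from dividing the data size by the elapsed time. A quick check, using only the figures quoted on this slide:

```python
# Sort Benchmark figures quoted above: year -> (terabytes sorted, minutes, nodes).
results = {
    2009: (100.0, 173.0, 3452),
    2008: (1.0, 3.48, 910),
}

def throughput_tb_per_min(terabytes, minutes):
    """Aggregate sort throughput in TB per minute."""
    return terabytes / minutes

throughput = {
    year: throughput_tb_per_min(terabytes, minutes)
    for year, (terabytes, minutes, _nodes) in results.items()
}

for year, value in sorted(throughput.items()):
    print(f"{year}: {value:.3f} TB/min")
```

The 2009 run works out to roughly 0.578 TB/min and the 2008 run to roughly 0.287 TB/min.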



Who uses MapReduce?




Map & Reduce functions

   ●   The Mapper (Pick a key):
               –   Input: Read input from disk.
               –   Output: Create pairs of <key, value>, known as
                    intermediate pairs.
               –   More input partitions == More parallel Mappers.
   ●   The Reducer (Process values):
               –   Input: a list of <key, value> pairs that all share one key.
               –   Output: one or more <key, value> pairs.
               –   More unique keys == More Parallel Reducers.
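
As a minimal sketch of the two functions, here is a word-count Mapper and Reducer. The names map_words and reduce_counts are illustrative and not part of any framework API:

```python
from typing import Iterable, Iterator, Tuple

def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: pick each word as the key and emit the pair <word, 1>."""
    for word in line.split():
        yield word, 1

def reduce_counts(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Reducer: receive every value emitted for one key and sum them."""
    yield key, sum(values)

print(list(map_words("snow cat cat")))
print(list(reduce_counts("cat", [1, 1, 1])))
```

The Mapper never sees other partitions and the Reducer never sees other keys, which is what lets the framework run many copies of each in parallel.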
How MapReduce Works

   1) Partition the input file into M partitions.
   2) Create M Map tasks that read the M partitions in parallel and emit
      intermediate <key, value> pairs, stored on local storage.
   3) Wait for all Map workers to finish, then sort and partition the
      intermediate <key, value> pairs into R regions.
   4) Start R Reduce workers; each reads the list of intermediate pairs
      for a unique key from remote disks.
   5) Write the output of the Reduce workers to file(s).
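
The five steps can be sketched as a single-process simulation. This is a toy under stated assumptions, not how Hadoop or Google's system is implemented: run_mapreduce is a hypothetical helper, input is partitioned round-robin by line, and a CRC32 hash decides which of the R regions a key belongs to.

```python
import zlib
from collections import defaultdict

def map_words(line):
    for word in line.split():
        yield word, 1

def reduce_counts(key, values):
    yield key, sum(values)

def run_mapreduce(lines, mapper, reducer, m, r):
    # 1) Partition the input into M partitions (round-robin by line, an assumption).
    partitions = [lines[i::m] for i in range(m)]
    # 2) One Map task per partition; each keeps its own intermediate pairs.
    map_outputs = [[pair for line in part for pair in mapper(line)]
                   for part in partitions]
    # 3) Once every Map task is done, route each pair to one of R regions.
    regions = [defaultdict(list) for _ in range(r)]
    for output in map_outputs:
        for key, value in output:
            regions[zlib.crc32(key.encode("utf-8")) % r][key].append(value)
    # 4) One Reduce worker per region, visiting its keys in sorted order.
    # 5) Collect the output; a real job would write it to file(s).
    results = []
    for region in regions:
        for key in sorted(region):
            results.extend(reducer(key, region[key]))
    return results

lines = ["cat flower picture", "snow cat cat", "prince flower sun", "king queen AC"]
counts = dict(run_mapreduce(lines, map_words, reduce_counts, m=4, r=3))
print(counts)
```

Only the mapper and reducer arguments change between applications; the partitioning, routing, and collection code stays the same.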


Example – Word count

   ●   Assume the following input:

       cat flower picture
       snow cat cat
       prince flower sun
       king queen AC




Example – Word count
   ●   Step 1: Partition the input file into M partitions (here M = 4).

       Input file                   Partitions
       cat flower picture           1: cat flower picture
       snow cat cat                 2: snow cat cat
       prince flower sun            3: prince flower sun
       king queen AC                4: king queen AC
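
A minimal sketch of the partitioning step, assuming the input is split by whole lines into contiguous chunks. Real systems typically split by byte ranges or storage blocks instead; partition_input is a hypothetical helper.

```python
def partition_input(lines, m):
    """Split lines into m contiguous partitions of near-equal size."""
    base, extra = divmod(len(lines), m)
    partitions, start = [], 0
    for index in range(m):
        # The first `extra` partitions take one additional line each.
        size = base + (1 if index < extra else 0)
        partitions.append(lines[start:start + size])
        start += size
    return partitions

lines = ["cat flower picture", "snow cat cat", "prince flower sun", "king queen AC"]
partitions = partition_input(lines, 4)
for number, part in enumerate(partitions, start=1):
    print(number, part)
```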




Example – Word count
●    Step 2: Create M Map tasks that read the M partitions in parallel and
     emit intermediate <key, value> pairs, stored on local storage.

     cat flower picture      Mapper 1      <cat,1> <flower,1> <picture,1>
     snow cat cat            Mapper 2      <snow,1> <cat,1> <cat,1>
     prince flower sun       Mapper 3      <prince,1> <flower,1> <sun,1>
     king queen AC           Mapper 4      <king,1> <queen,1> <AC,1>
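
A sketch of the Map phase for this example. Threads stand in for separate worker machines, which is an assumption made only to keep the example self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def map_task(partition):
    """Emit <word, 1> for every word in one input partition."""
    return [(word, 1) for line in partition for word in line.split()]

partitions = [["cat flower picture"], ["snow cat cat"],
              ["prince flower sun"], ["king queen AC"]]

# One Map task per partition; pool.map returns results in partition order.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    intermediate = list(pool.map(map_task, partitions))

for number, pairs in enumerate(intermediate, start=1):
    print(f"Mapper {number}: {pairs}")
```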
Example – Word count
  ●   Step 3: Wait for all Map workers to finish, then sort and partition the
      intermediate <key, value> pairs into R regions.

      Mapper output                      Sorted per mapper
      <cat,1> <flower,1> <picture,1>     <cat,1> <flower,1> <picture,1>
      <snow,1> <cat,1> <cat,1>           <cat,1> <cat,1> <snow,1>
      <prince,1> <flower,1> <sun,1>      <flower,1> <prince,1> <sun,1>
      <king,1> <queen,1> <AC,1>          <AC,1> <king,1> <queen,1>

      Merged and sorted:
      <AC,1> <cat,1> <cat,1> <cat,1> <flower,1> <flower,1>
      <king,1> <picture,1> <prince,1> <queen,1> <snow,1> <sun,1>
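
A sketch of the sort-and-partition step on the pairs from Step 2. The partition function region_of (CRC32 of the key modulo R) is an assumption; any stable function that always sends the same key to the same region would do.

```python
import zlib

intermediate = [
    [("cat", 1), ("flower", 1), ("picture", 1)],
    [("snow", 1), ("cat", 1), ("cat", 1)],
    [("prince", 1), ("flower", 1), ("sun", 1)],
    [("king", 1), ("queen", 1), ("AC", 1)],
]

def region_of(key, r):
    """Stable partition function: a given key always maps to the same region."""
    return zlib.crc32(key.encode("utf-8")) % r

def shuffle(map_outputs, r):
    regions = [[] for _ in range(r)]
    for output in map_outputs:
        for key, value in output:
            regions[region_of(key, r)].append((key, value))
    # Sorting each region places equal keys next to each other for the reducer.
    return [sorted(region) for region in regions]

# The fully merged, sorted list shown on the slide.
merged = sorted(pair for output in intermediate for pair in output)
regions = shuffle(intermediate, r=3)
print(merged)
```

Uppercase "AC" sorts before the lowercase words because the comparison is by character code.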
Example – Word count
●   Step 4: Start R Reduce workers; each reads the list of intermediate
    pairs for a unique key from remote disks.

    <AC,1>                         Reducer 1      <AC,1>
    <cat,1> <cat,1> <cat,1>        Reducer 2      <cat,3>
    <flower,1> <flower,1>          Reducer 3      <flower,2>
    <king,1>
    <picture,1>
    <prince,1>                     ...
    <queen,1>
    <snow,1>
    <sun,1>                        Reducer 9      <sun,1>
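
A sketch of the Reduce phase on the sorted pairs from Step 3. It relies on the input already being sorted by key, since itertools.groupby only groups adjacent items:

```python
from itertools import groupby
from operator import itemgetter

sorted_pairs = [
    ("AC", 1), ("cat", 1), ("cat", 1), ("cat", 1), ("flower", 1), ("flower", 1),
    ("king", 1), ("picture", 1), ("prince", 1), ("queen", 1), ("snow", 1), ("sun", 1),
]

def reduce_task(key, values):
    """Sum all the values that were emitted for one key."""
    return key, sum(values)

# One reduce call per unique key; each call sees only that key's values.
output = [
    reduce_task(key, [value for _, value in group])
    for key, group in groupby(sorted_pairs, key=itemgetter(0))
]
print(output)
```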
Example – Word count
●   Step 5: Write the output of the Reduce workers to file(s).

    Reducer output      Output file
    <AC,1>              <AC,1>
    <cat,3>             <cat,3>
    <flower,2>          <flower,2>
    <king,1>            <king,1>
    <picture,1>         <picture,1>
    ...                 <prince,1>
                        <queen,1>
                        <snow,1>
    <sun,1>             <sun,1>
MapReduce framework




MapReduce Failure Recovery

   ●   The framework follows a master-worker paradigm.
   ●   The master keeps a record of the work done on each worker.
   ●   If a worker fails, the master assigns the same work to another
       worker.
   ●   If a worker is late, another copy of the same work is assigned
       to another worker.
   ●   If the master fails, a backup copy of the master can pick
       up and continue execution from the last checkpoint.
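
A toy sketch of the worker-failure rule only: the master re-queues a task whose worker failed and hands it to another worker. The failure model (a fixed set of workers that crash on any task) and the function name are hypothetical; late workers and master checkpoints are not modelled.

```python
from collections import deque

def run_with_reassignment(tasks, workers, failing):
    """Master loop: hand out tasks and re-queue any task whose worker fails.

    tasks   -- task identifiers
    workers -- worker names
    failing -- workers that crash on any task (hypothetical failure model)
    """
    pending = deque(tasks)
    healthy = deque(workers)
    done = {}  # the master's record: task -> worker that completed it
    while pending:
        if not healthy:
            raise RuntimeError("no healthy workers left")
        task = pending.popleft()
        worker = healthy.popleft()
        if worker in failing:
            pending.append(task)   # the same work goes to another worker
            continue               # the failed worker leaves the pool
        done[task] = worker
        healthy.append(worker)
    return done

done = run_with_reassignment(
    tasks=["map-0", "map-1", "map-2", "map-3"],
    workers=["w1", "w2", "w3"],
    failing={"w2"},
)
print(done)
```

Re-running a task is safe here because each Map or Reduce task depends only on its own input, not on state held by the failed worker.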


Advantages of MapReduce

   ●   Parallel IO: hides disk latency.
   ●   Parallel Processing:
               –   Map functions work independently in parallel, each
                    processing one unique partition.
               –   Reduce functions work independently in parallel, each
                    on a unique intermediate key.
   ●   Using large clusters of commodity machines gives better
       results than small expensive clusters.



Advantages of MapReduce

   ●   Parallel IO: hides disk latency.
   ●   Parallel Processing:
               –   Map functions work independently in parallel, each
                    processing one unique partition.
               –   Reduce functions work independently in parallel, each
                    on a unique intermediate key.
   ●   Using large clusters of commodity machines gives results
       comparable to those of small expensive clusters.



Hadoop vs. others
   ●   Algorithm: Sorting 100 TB data.


                      Hadoop                          DEMSort                         TritonSort
       Node count     3452                            195                             47
       Processor      2x Quad-core Xeons @ 2.5 GHz    2x Quad-core Xeons @ 2.6 GHz    2x Quad-core Xeons @ 2.27 GHz
       Memory         8 GB                            16 GB                           24 GB
       Network        1 Gigabit Ethernet              InfiniBand                      10 Gigabit Fiber
       Throughput     0.578 TB/min                    0.564 TB/min                    0.582 TB/min



MapReduce weak points

   ●   Overhead of MapReduce is huge.
   ●   Data-dependent applications may need multiple iterations of
       MapReduce, for example:
               –   K-means.
               –   PageRank.
   ●   Complex algorithms can be very hard to implement, for example:
               –   Range queries.
   ●   Sensitive to skew in the distribution of <key, value> pairs.
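
The skew problem can be illustrated with a small count of how many intermediate pairs each reducer receives. The data below is hypothetical (one hot key among many rare ones), and the CRC32-based partition function is an assumption:

```python
import zlib
from collections import Counter

def reducer_load(pairs, r):
    """Count how many intermediate pairs each of the r reducers receives."""
    load = Counter({index: 0 for index in range(r)})
    for key, _value in pairs:
        load[zlib.crc32(key.encode("utf-8")) % r] += 1
    return load

# Hypothetical skewed input: one hot key accounts for 900 of 1000 pairs.
pairs = [("the", 1)] * 900 + [(f"word{i}", 1) for i in range(100)]
load = reducer_load(pairs, r=4)
print(sorted(load.items()))
```

Because every pair with the same key must reach the same reducer, the reducer that owns the hot key gets at least 900 pairs however many reducers are added, and the job finishes only when that reducer does.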

Implementations of MapReduce

   ●   Hadoop in Java.
   ●   Mars in C++ and CUDA.
   ●   Skynet in Ruby.
   ●   Phoenix in C++.
   ●   Microsoft Dryad:
               –   Schedules multiple levels of “MapReduce”-like
                     operations.



MapReduce in Database



MapReduce in Database - Ex1

   ●   Select Name from Students where age = 23;


                     Students:
                      Name          ID               Age
                      Ahmed        1177              23
                       Bob         1131              20
                       Sara        1197              22
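
One possible answer, offered as a sketch rather than the intended solution: a selection with a projection needs no Reduce step at all, since each row can be filtered independently in the Mapper. The function name map_select is illustrative.

```python
# Rows of Students as (Name, ID, Age).
students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]

def map_select(row):
    """Map-only job: apply WHERE age = 23 and project the Name column."""
    name, _student_id, age = row
    if age == 23:
        yield name, None  # the key carries the result; no value is needed

result = [name for row in students for name, _ in map_select(row)]
print(result)
```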




MapReduce in Database - Ex2

   ●   Select COUNT(Name) from Students where age > 20 group
       by Name;

                    Students:
                      Name         ID               Age
                     Ahmed        1177              23
                      Bob         1131              20
                      Sara        1197              22
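
One possible answer, as a sketch: the Mapper applies the filter and uses the GROUP BY column as the key, and the Reducer counts the values for each key. The in-memory grouping stands in for the framework's shuffle.

```python
from collections import defaultdict

# Rows of Students as (Name, ID, Age).
students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]

def map_filter(row):
    name, _student_id, age = row
    if age > 20:            # WHERE age > 20
        yield name, 1       # GROUP BY Name: the name becomes the key

def reduce_count(name, ones):
    yield name, sum(ones)   # COUNT(Name) within the group

# Stand-in for the shuffle: collect the values of each key.
groups = defaultdict(list)
for row in students:
    for key, value in map_filter(row):
        groups[key].append(value)

counts = dict(pair for name in sorted(groups)
              for pair in reduce_count(name, groups[name]))
print(counts)
```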




MapReduce in Database - Ex3

   ●    Select Name, Term from Students, Enrolment where ID = SID
        and age != 20;

   Students:                         Enrolment:
       Name     ID     Age               CID        SID    Term
       Ahmed   1177     23               CS290      1177   042
       Bob     1131     20               CS260      1177   052
       Sara    1197     22               ME222      1131   051
                                         AMCS220    1197   051
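
One possible answer, as a sketch of a reduce-side equi-join: both tables are mapped to the join key (ID or SID), each value is tagged with its table of origin, and the Reducer pairs up the two sides. The tags "S" and "E" and the function names are illustrative.

```python
from collections import defaultdict

students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
enrolment = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]

def map_student(row):
    name, student_id, age = row
    if age != 20:                      # filter before the shuffle
        yield student_id, ("S", name)

def map_enrolment(row):
    _cid, sid, term = row
    yield sid, ("E", term)

def reduce_join(_key, tagged):
    names = [value for tag, value in tagged if tag == "S"]
    terms = [value for tag, value in tagged if tag == "E"]
    for name in names:                 # pair every student with every term
        for term in terms:
            yield name, term

# Stand-in for the shuffle: group tagged values by join key.
groups = defaultdict(list)
for row in students:
    for key, value in map_student(row):
        groups[key].append(value)
for row in enrolment:
    for key, value in map_enrolment(row):
        groups[key].append(value)

joined = [pair for key in sorted(groups) for pair in reduce_join(key, groups[key])]
print(joined)
```

Bob's enrolment row still reaches a reducer, but with no matching student record on the other side it produces no output.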




MapReduce in Database - Ex4

   ●    Select Name, Term from Students, Enrolment where ID !=
        SID;
   Students:                          Enrolment:
       Name      ID      Age                CID           SID    Term
       Ahmed    1177     23               CS290           1177   042
        Bob     1131     20               CS260           1177   052
       Sara     1197     22               ME222           1131   051
                                        AMCS220           1197   051

   ●    What if the condition is ID > SID?
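
One possible approach, as a sketch: with no equality to use as a key, a reduce-side join no longer applies directly. If Students is small enough to copy to every Mapper (an assumption), each Mapper can compare its Enrolment partition against the whole table. The predicate is a parameter, so the same code covers both ID != SID and ID > SID.

```python
import operator

students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
enrolment = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]

def map_theta_join(enrol_partition, all_students, predicate):
    """Map-only join: test every student against each enrolment row."""
    for _cid, sid, term in enrol_partition:
        for name, student_id, _age in all_students:
            if predicate(student_id, sid):
                yield name, term

not_equal = list(map_theta_join(enrolment, students, operator.ne))
greater = list(map_theta_join(enrolment, students, operator.gt))
print(len(not_equal), greater)
```

The cost is a comparison for every pair of rows, which is why such joins are considered hard to express efficiently in MapReduce.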

MapReduce in Database - Ex5

    ●   Select Name, Term from Students, Enrolment where ID = SID
        and Admission != Term;

   Students:                                      Enrolment:
       Name     ID     Age    Admission               CID        SID    Term
       Ahmed   1177     23    042                     CS290      1177   042
       Bob     1131     20    051                     CS260      1177   052
       Sara    1197     22    042                     ME222      1131   051
                                                      AMCS220    1197   051
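
One possible answer, as a sketch: join on ID = SID as a reduce-side join, and apply Admission != Term inside the Reducer, because that condition needs a column from each table. Tags and function names are illustrative.

```python
from collections import defaultdict

# Students as (Name, ID, Age, Admission); Enrolment as (CID, SID, Term).
students = [("Ahmed", 1177, 23, "042"), ("Bob", 1131, 20, "051"),
            ("Sara", 1197, 22, "042")]
enrolment = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]

def map_student(row):
    name, student_id, _age, admission = row
    yield student_id, ("S", (name, admission))  # carry Admission to the reducer

def map_enrolment(row):
    _cid, sid, term = row
    yield sid, ("E", term)

def reduce_join(_key, tagged):
    people = [value for tag, value in tagged if tag == "S"]
    terms = [value for tag, value in tagged if tag == "E"]
    for name, admission in people:
        for term in terms:
            if admission != term:      # residual condition across both tables
                yield name, term

groups = defaultdict(list)
for row in students:
    for key, value in map_student(row):
        groups[key].append(value)
for row in enrolment:
    for key, value in map_enrolment(row):
        groups[key].append(value)

joined = [pair for key in sorted(groups) for pair in reduce_join(key, groups[key])]
print(joined)
```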




MapReduce in Database - Ex6

   ●    Select y from R, S, T where R.x = S.x and T.a = S.a;


   R:                    S:                    T:
       x    y    z           a    b    x           m    n    a
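
One possible approach, as a sketch: a three-way join can be run as two chained MapReduce jobs, first R with S on x, then the result with T on a. The slide gives only the schemas, so the rows below are hypothetical, and reduce_side_join is an illustrative helper that simulates one job in memory.

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Join two lists of (key, payload) pairs on key.

    Yields (left_payload, right_payload) for every matching combination.
    """
    groups = defaultdict(lambda: ([], []))
    for key, payload in left:
        groups[key][0].append(payload)
    for key, payload in right:
        groups[key][1].append(payload)
    for key in sorted(groups):
        left_payloads, right_payloads = groups[key]
        for left_payload in left_payloads:
            for right_payload in right_payloads:
                yield left_payload, right_payload

# Hypothetical rows; only the column layout comes from the slide.
R = [(1, "y1", "z1"), (2, "y2", "z2")]              # (x, y, z)
S = [(10, "b1", 1), (20, "b2", 2), (30, "b3", 1)]   # (a, b, x)
T = [("m1", "n1", 10), ("m2", "n2", 30)]            # (m, n, a)

# Job 1: R joined with S on x, keeping y from R and a from S.
job1 = list(reduce_side_join([(x, y) for x, y, _z in R],
                             [(x, a) for a, _b, x in S]))
# Job 2: re-key job 1's output by a, join with T on a, and project y.
job2 = reduce_side_join([(a, y) for y, a in job1],
                        [(a, None) for _m, _n, a in T])
result = [y for y, _ in job2]
print(result)
```

The output of the first job has to be written out and read back in by the second, which is one source of the multi-iteration overhead listed among the weak points.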




MapReduce in Academic Papers
   ●   NIPS '07: Map-Reduce for Machine Learning on Multicore.
   ●   Escience '08: CloudBLAST: Combining MapReduce and Virtualization on
       Distributed Resources for Bioinformatics Applications.
   ●   KDD '09: Large-scale behavioral targeting.
   ●   GCC '09: Spatial Queries Evaluation with MapReduce.
   ●   SIGIR '09: On single-pass indexing with MapReduce.
   ●   MDAC '10: A novel approach to multiple sequence alignment using
       hadoop data grids.
   ●   VLDB Endowment '11: Social Content Matching in MapReduce.
   ●   VLDB '12: Building Wavelet Histograms on Large Data in MapReduce.
Links
●   http://code.google.com/edu/parallel/mapreduce-tutorial.html
●   http://hadoop.apache.org/mapreduce/
●   http://www.cse.ust.hk/gpuqp/Mars.html
●   http://skynet.rubyforge.org/
●   http://mapreduce.stanford.edu/
●   http://wiki.apache.org/hadoop/PoweredBy
●   http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/



MapReduce

  • 1. Introduction to MapReduce Zuhair Khayyat 3/11/2012
  • 2. What is MapReduce ● A programming model introduced by Google in OSDI '04 for processing large datasets efficiently. ● Features: – Automatic parallelization, no parallel experience required. – Data and process redundancy for failure recovery. – Auto scheduling and Load balancing. – Easy to program, based on two simple functions: ● Map ● Reduce. CS245 - 2012 Introduction to MapReduce 2
  • 3. Why MapReduce? ● For a cluster of: – 2000 machines. – Total 16 TB Ram (≈ 8 GB each). – Total 2 PB Disk space (≈ 1 TB each). ● Use the maximum capacity of the cluster to: – Implement a parallel word count for input size 100 TB. CS245 - 2012 Introduction to MapReduce 3
  • 4. Why MapReduce? ● For a cluster of: – 2000 machines. – Total 16 TB Ram (≈ 8 GB each). – Total 2 PB Disk space (≈ 1 TB each). ● Use the maximum capacity of the cluster to: – Implement a parallel word count for input size 100 TB. – Implement a parallel sort for the same input file. ● Can you use the same code for both applications? CS245 - 2012 Introduction to MapReduce 4
  • 5. How Fast is MapReduce (Hadoop) ● Sort Benchmark competition (http://sortbenchmark.org/): – 2009: 100 TB in 173 minutes using 3452 nodes: ● 2 x Quad core Xeons @ 2.5 GHz. ● 8 GB RAM. – 2008: 1 TB in 3.48 minutes using 910 nodes: ● 4 x Dual core Xeons @ 2.0 GHz. ● 8 GB RAM.
  • 6. Who uses MapReduce?
  • 7. Map & Reduce functions ● The Mapper (pick a key): – Input: read input from disk. – Output: create <key, value> pairs, known as intermediate pairs. – More input partitions == more parallel Mappers. ● The Reducer (process values): – Input: a list of <key, value> pairs that share a unique key. – Output: one or more <key, value> pairs. – More unique keys == more parallel Reducers.
  • 8. How MapReduce Works 1) Partition the input file into M partitions. 2) Create M map tasks that read the M partitions in parallel and emit intermediate <key, value> pairs, storing them on local storage. 3) Wait for all map workers to finish, then sort and partition the intermediate <key, value> pairs into R regions. 4) Start R reduce workers; each reads a list of intermediate pairs with a unique key from remote disks. 5) Write the output of the reduce workers to file(s).
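The five steps can be traced in a few lines of Python. This is a single-process sketch under stated assumptions, not how Hadoop or Google's implementation is written: `run_mapreduce`, `emit_length`, and `collect_words` are hypothetical names, partitions are formed round-robin, and regions are chosen by hashing the key.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_maps=4, num_reduces=3):
    """Single-process simulation of the five steps; nothing runs in parallel."""
    # Step 1: split the input into M partitions (round-robin, for simplicity).
    partitions = [records[i::num_maps] for i in range(num_maps)]

    # Step 2: one map task per partition emits intermediate (key, value) pairs.
    map_outputs = [
        [pair for record in partition for pair in map_fn(record)]
        for partition in partitions
    ]

    # Step 3: once every map task is done, route each pair to one of R regions
    # by hashing its key, grouping the values that share a key.
    regions = [defaultdict(list) for _ in range(num_reduces)]
    for output in map_outputs:
        for key, value in output:
            regions[hash(key) % num_reduces][key].append(value)

    # Step 4: one reduce worker per region, visiting its keys in sorted order.
    results = []
    for region in regions:
        for key in sorted(region):
            results.extend(reduce_fn(key, region[key]))

    # Step 5: a real framework writes output files; this sketch returns a list.
    return results


def emit_length(word):
    # Hypothetical map function: key each word by its length.
    yield len(word), word


def collect_words(length, words):
    # Hypothetical reduce function: gather the words of one length.
    yield length, sorted(words)


grouped = dict(run_mapreduce(["cat", "sun", "king", "snow", "AC"],
                             emit_length, collect_words))
```

Only `map_fn` and `reduce_fn` change from one application to the next; the partitioning, grouping, and scheduling logic stays the same.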
  • 9. Example – Word count ● Assume the following input:
        cat flower picture
        snow cat cat
        prince flower sun
        king queen AC
  • 10. Example – Word count ● Step 1: Partition the input file into M partitions.
        Partition 1: cat flower picture
        Partition 2: snow cat cat
        Partition 3: prince flower sun
        Partition 4: king queen AC
  • 11. Example – Word count ● Step 2: Create M map tasks that read the M partitions in parallel and emit intermediate <key, value> pairs, storing them on local storage.
        cat flower picture → Mapper 1 → <cat,1> <flower,1> <picture,1>
        snow cat cat       → Mapper 2 → <snow,1> <cat,1> <cat,1>
        prince flower sun  → Mapper 3 → <prince,1> <flower,1> <sun,1>
        king queen AC      → Mapper 4 → <king,1> <queen,1> <AC,1>
  • 12. Example – Word count ● Step 3: Wait for all map workers to finish, then sort and partition the intermediate <key, value> pairs into R regions.
        <AC,1>
        <cat,1> <cat,1> <cat,1>
        <flower,1> <flower,1>
        <king,1>
        <picture,1>
        <prince,1>
        <queen,1>
        <snow,1>
        <sun,1>
  • 13. Example – Word count ● Step 4: Start R reduce workers; each reads a list of intermediate pairs with a unique key from remote disks.
        <AC,1>                  → Reducer 1 → <AC,1>
        <cat,1> <cat,1> <cat,1> → Reducer 2 → <cat,3>
        <flower,1> <flower,1>   → Reducer 3 → <flower,2>
        <king,1> <picture,1> <prince,1> <queen,1> <snow,1> (one reducer per key, handled the same way)
        <sun,1>                 → Reducer 9 → <sun,1>
  • 14. Example – Word count ● Step 5: Write the output of the reduce workers to file(s).
        <AC,1>
        <cat,3>
        <flower,2>
        <king,1>
        <picture,1>
        <prince,1>
        <queen,1>
        <snow,1>
        <sun,1>
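The word count example can be reproduced with the slide's own four input lines. The sketch below is illustrative only: `map_words` and `reduce_counts` are hypothetical names, and a sort followed by `itertools.groupby` stands in for the framework's shuffle phase.

```python
from itertools import groupby
from operator import itemgetter

# The four input partitions from the example, one line per mapper.
LINES = ["cat flower picture", "snow cat cat", "prince flower sun", "king queen AC"]


def map_words(line):
    # Map: emit <word, 1> for every word in the partition.
    return [(word, 1) for word in line.split()]


def reduce_counts(word, ones):
    # Reduce: sum the 1s collected for a single word.
    return word, sum(ones)


# Steps 2 and 3: run every mapper, then sort the intermediate pairs by key.
intermediate = [pair for line in LINES for pair in map_words(line)]
intermediate.sort(key=itemgetter(0))

# Step 4: each group of equal keys goes to one reduce call.
counts = [
    reduce_counts(word, [one for _, one in group])
    for word, group in groupby(intermediate, key=itemgetter(0))
]

# Step 5: print the result instead of writing it to a file.
for word, total in counts:
    print(f"<{word},{total}>")
```

Python's default string ordering places uppercase letters first, so `AC` sorts ahead of `cat`, matching the order shown on the slides.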
  • 15. MapReduce framework
  • 16. MapReduce Failure Recovery ● The framework follows the master-worker paradigm. ● The master keeps a record of the work done on each worker. ● If a worker fails, the master assigns the same work to another worker. ● If a worker is late, another copy of the same work is assigned to another worker. ● If the master fails, a backup copy of the master can pick up and continue execution from the last checkpoint.
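The worker-failure rule can be illustrated with a toy bookkeeping function. It is a sketch under simplifying assumptions: `assign_tasks`, the task names, and the worker names are hypothetical, failures are known up front, and late workers and master checkpoints are not modelled.

```python
def assign_tasks(tasks, workers, failed):
    """Master-side bookkeeping: map every task to a worker that is still alive."""
    assignment = {}
    for index, task in enumerate(tasks):
        # Try the preferred worker first, then fall back to the others in turn.
        for offset in range(len(workers)):
            worker = workers[(index + offset) % len(workers)]
            if worker not in failed:
                assignment[task] = worker
                break
        else:
            raise RuntimeError(f"no live worker left for {task}")
    return assignment


tasks = ["map-0", "map-1", "map-2", "map-3"]
workers = ["w0", "w1", "w2"]

before = assign_tasks(tasks, workers, failed=set())
after = assign_tasks(tasks, workers, failed={"w1"})

# Only the work that was on the failed worker moves to another worker.
moved = [task for task in tasks if before[task] != after[task]]
```

Because map tasks are independent and their input partitions are stored redundantly, re-running a task elsewhere does not affect the other tasks.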
  • 17. Advantages of MapReduce ● Parallel IO: hides disk latency. ● Parallel processing: – Map functions work independently in parallel, each processing one unique partition. – Reduce functions work independently in parallel, each on a unique intermediate key. ● Using large clusters of commodity machines gives better results than small expensive clusters.
  • 18. Advantages of MapReduce ● Parallel IO: hides disk latency. ● Parallel processing: – Map functions work independently in parallel, each processing one unique partition. – Reduce functions work independently in parallel, each on a unique intermediate key. ● Using large clusters of commodity machines gives results comparable to those of small expensive clusters.
  • 19. Hadoop vs. others ● Algorithm: Sorting 100 TB data.
        System      Hadoop                         DEMSort                        TritonSort
        Node count  3452                           195                            47
        Processor   2x Quad-core Xeons @ 2.5 GHz   2x Quad-core Xeons @ 2.6 GHz   2x Quad-core Xeons @ 2.27 GHz
        Memory      8 GB                           16 GB                          24 GB
        Network     1 Gigabit Ethernet             InfiniBand                     10 Gigabit Fiber
        Throughput  0.578 TB/min                   0.564 TB/min                   0.582 TB/min
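The three systems reach nearly the same total throughput with very different node counts, so a per-node figure makes the comparison clearer. The short calculation below uses only the numbers in the table, assumes 1 TB = 1000 GB, and uses hypothetical variable names.

```python
# (node count, throughput in TB/min) taken from the comparison table.
SYSTEMS = {
    "Hadoop": (3452, 0.578),
    "DEMSort": (195, 0.564),
    "TritonSort": (47, 0.582),
}

# Sanity check: 100 TB sorted in 173 minutes gives Hadoop's figure.
hadoop_tb_per_min = 100 / 173

# Per-node throughput in GB/min, assuming 1 TB = 1000 GB.
per_node_gb = {
    name: tb_per_min * 1000 / nodes
    for name, (nodes, tb_per_min) in SYSTEMS.items()
}

for name, value in per_node_gb.items():
    print(f"{name:<10} {value:6.2f} GB/min per node")
```

Per node, TritonSort moves roughly 74 times as much data per minute as the Hadoop cluster. That matches the deck's point that a large commodity cluster reaches comparable total throughput, at the cost of far more machines.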
  • 20. MapReduce weak points ● The overhead of MapReduce is huge. ● Data-dependent applications may need multiple iterations of MapReduce, for example: – K-means. – PageRank. ● Complex algorithms can be very hard to implement, for example: – Range queries. ● Sensitive to skew in the distribution of <key, value> pairs.
  • 21. Implementations of MapReduce ● Hadoop in Java. ● Mars in C++ & CUDA. ● Skynet in Ruby. ● Phoenix in C++. ● Microsoft Dryad: – Schedules multiple levels of MapReduce-like operations.
  • 22. MapReduce in Database
  • 23. MapReduce in Database - Ex1 ● Select Name from Students where age = 23;
        Students:
        Name   ID    Age
        Ahmed  1177  23
        Bob    1131  20
        Sara   1197  22
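The slide leaves the MapReduce formulation open. One possible answer, offered as a sketch and not as the deck's official solution, is a map-only job: the filter needs no grouping, so no reduce phase is required. `map_select` is a hypothetical name.

```python
# Students(Name, ID, Age) as given on the slide.
STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]


def map_select(row):
    # Map: apply the WHERE clause and project the Name column.
    name, _student_id, age = row
    if age == 23:
        yield name


# No reducer: each mapper's output is already part of the final answer.
selected = [name for row in STUDENTS for name in map_select(row)]
print(selected)
```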
  • 24. MapReduce in Database - Ex2 ● Select COUNT(Name) from Students where age > 20 group by Name;
        Students:
        Name   ID    Age
        Ahmed  1177  23
        Bob    1131  20
        Sara   1197  22
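A grouped aggregate has the same shape as word count: the GROUP BY column becomes the intermediate key and the reducer computes the aggregate. The sketch below is one possible formulation with hypothetical function names; it returns each name alongside its count for readability, whereas the SQL query returns only the counts.

```python
from collections import defaultdict

# Students(Name, ID, Age) as given on the slide.
STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]


def map_count(row):
    name, _student_id, age = row
    if age > 20:        # WHERE age > 20
        yield name, 1   # key = the GROUP BY column


def reduce_count(name, ones):
    return name, sum(ones)  # COUNT(Name) for this group


# Shuffle: collect the values emitted for each key.
groups = defaultdict(list)
for row in STUDENTS:
    for name, one in map_count(row):
        groups[name].append(one)

counts = sorted(reduce_count(name, ones) for name, ones in groups.items())
print(counts)
```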
  • 25. MapReduce in Database - Ex3 ● Select Name, Term from Students, Enrolment where ID = SID and age != 20;
        Students:
        Name   ID    Age
        Ahmed  1177  23
        Bob    1131  20
        Sara   1197  22
        Enrolment:
        CID      SID   Term
        CS290    1177  042
        CS260    1177  052
        ME222    1131  051
        AMCS220  1197  051
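An equality join is commonly expressed as a reduce-side join: mappers key every row by the join attribute and tag it with its source table, and the reducer pairs rows from the two tables that share a key. The sketch below is one such formulation with hypothetical function names and tags; the single-table condition `age != 20` is applied in the mapper.

```python
from collections import defaultdict

STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
ENROLMENT = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]


def map_student(row):
    name, student_id, age = row
    if age != 20:                      # filter early, before the shuffle
        yield student_id, ("S", name)  # tag "S" marks a Students row


def map_enrolment(row):
    _cid, sid, term = row
    yield sid, ("E", term)             # tag "E" marks an Enrolment row


def reduce_join(_key, tagged):
    # All rows here share one ID/SID value, so every S-E pair matches.
    names = [value for tag, value in tagged if tag == "S"]
    terms = [value for tag, value in tagged if tag == "E"]
    for name in names:
        for term in terms:
            yield name, term


groups = defaultdict(list)
for row in STUDENTS:
    for key, value in map_student(row):
        groups[key].append(value)
for row in ENROLMENT:
    for key, value in map_enrolment(row):
        groups[key].append(value)

result = sorted(pair for key in groups for pair in reduce_join(key, groups[key]))
print(result)
```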
  • 26. MapReduce in Database - Ex4 ● Select Name, Term from Students, Enrolment where ID != SID;
        Students:
        Name   ID    Age
        Ahmed  1177  23
        Bob    1131  20
        Sara   1197  22
        Enrolment:
        CID      SID   Term
        CS290    1177  042
        CS260    1177  052
        ME222    1131  051
        AMCS220  1197  051
    ● What if the condition is ID > SID?
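With an inequality condition, matching rows no longer share a key, so grouping by ID does not help. One workaround, sketched below under the assumption that Students is small enough to copy to every mapper, is a map-only join that compares each Enrolment partition against the whole Students table. `map_theta_join` and the two-way split of Enrolment are hypothetical.

```python
import operator

STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
ENROLMENT = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]


def map_theta_join(enrolment_partition, students, predicate):
    """Map-only join: `students` is broadcast in full to every mapper."""
    for _cid, sid, term in enrolment_partition:
        for name, student_id, _age in students:
            if predicate(student_id, sid):
                yield name, term


# Two mappers, each holding half of Enrolment plus all of Students.
partitions = [ENROLMENT[:2], ENROLMENT[2:]]

not_equal = sorted(pair for part in partitions
                   for pair in map_theta_join(part, STUDENTS, operator.ne))
greater = sorted(pair for part in partitions
                 for pair in map_theta_join(part, STUDENTS, operator.gt))
print(len(not_equal), greater)
```

Switching from `ID != SID` to `ID > SID` only changes the predicate. The cost is that every mapper holds a full copy of the smaller table, which stops being practical once neither table fits in memory.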
  • 27. MapReduce in Database - Ex5 ● Select Name, Term from Students, Enrolment where ID = SID and Admission != Term;
        Students:
        Name   ID    Age  Admission
        Ahmed  1177  23   042
        Bob    1131  20   051
        Sara   1197  22   042
        Enrolment:
        CID      SID   Term
        CS290    1177  042
        CS260    1177  052
        ME222    1131  051
        AMCS220  1197  051
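Here the second condition compares columns from both tables, so it cannot be checked in a mapper, which sees only one row of one table. A possible formulation, sketched with hypothetical names, joins on `ID = SID` as usual and evaluates `Admission != Term` inside the reducer.

```python
from collections import defaultdict

# Students(Name, ID, Age, Admission) and Enrolment(CID, SID, Term) from the slide.
STUDENTS = [("Ahmed", 1177, 23, "042"), ("Bob", 1131, 20, "051"),
            ("Sara", 1197, 22, "042")]
ENROLMENT = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]

# Map + shuffle: key both tables by the join attribute.
groups = defaultdict(lambda: {"students": [], "terms": []})
for name, student_id, _age, admission in STUDENTS:
    groups[student_id]["students"].append((name, admission))
for _cid, sid, term in ENROLMENT:
    groups[sid]["terms"].append(term)


def reduce_join(group):
    # The cross-table predicate is only checkable once both sides meet.
    for name, admission in group["students"]:
        for term in group["terms"]:
            if admission != term:
                yield name, term


result = sorted(pair for group in groups.values() for pair in reduce_join(group))
print(result)
```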
  • 28. MapReduce in Database - Ex6 ● Select y from R, S, T where R.x = S.x and T.a = S.a;
        R: x  y  z
        S: a  b  x
        T: m  n  a
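A three-way join can be run as two chained jobs: first join R and S on x, then join that result with T on a. The sketch below assumes the schemas R(x, y, z), S(a, b, x), and T(m, n, a) as read from the slide; the row values are invented for illustration, and `join_job` is a hypothetical helper.

```python
from collections import defaultdict


def join_job(left, right, key_left, key_right):
    """One MapReduce-style equi-join job over lists of dict rows."""
    groups = defaultdict(lambda: ([], []))
    for row in left:                         # map: key and tag the left input
        groups[row[key_left]][0].append(row)
    for row in right:                        # map: key and tag the right input
        groups[row[key_right]][1].append(row)
    joined = []
    for lefts, rights in groups.values():    # reduce: pair rows sharing a key
        for left_row in lefts:
            for right_row in rights:
                joined.append({**left_row, **right_row})
    return joined


# Invented sample rows; the slide gives only the column names.
R = [{"x": 1, "y": "y1", "z": 0}, {"x": 2, "y": "y2", "z": 0}]
S = [{"a": 10, "b": 0, "x": 1}, {"a": 20, "b": 0, "x": 2}, {"a": 30, "b": 0, "x": 3}]
T = [{"m": 0, "n": 0, "a": 10}, {"m": 0, "n": 0, "a": 30}]

rs = join_job(R, S, "x", "x")        # job 1: R.x = S.x
rst = join_job(rs, T, "a", "a")      # job 2: T.a = S.a
ys = sorted(row["y"] for row in rst)
print(ys)
```

The intermediate result `rs` has to be written out and read back between the two jobs, which is one source of the overhead noted among MapReduce's weak points.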
  • 29. MapReduce in Academic Papers ● NIPS '07: Map-Reduce for Machine Learning on Multicore. ● eScience '08: CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications. ● KDD '09: Large-scale behavioral targeting. ● GCC '09: Spatial Queries Evaluation with MapReduce. ● SIGIR '09: On single-pass indexing with MapReduce. ● MDAC '10: A novel approach to multiple sequence alignment using hadoop data grids. ● VLDB Endowment '11: Social Content Matching in MapReduce. ● VLDB '12: Building Wavelet Histograms on Large Data in MapReduce.
  • 30. Links ● http://code.google.com/edu/parallel/mapreduce-tutorial.html ● http://hadoop.apache.org/mapreduce/ ● http://www.cse.ust.hk/gpuqp/Mars.html ● http://skynet.rubyforge.org/ ● http://mapreduce.stanford.edu/ ● http://wiki.apache.org/hadoop/PoweredBy ● http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/