An Introduction to MapReduce
             Presented by Frane Bandov
    at the Operating Complex IT-Systems seminar
                  Berlin, 1/26/2010
Outline
•  Introduction
•  Google MapReduce
    –  Idea
    –  Overview
    –  Fault Tolerance
    –  GFS: Google File System
    –  Job Example
•  Alternative Implementations
•  Reception and Criticism
•  Trends and Future Development
•  Conclusion
2/16/10             An Introduction to MapReduce   2
Introduction – Problem
Sometimes we have to deal with huge amounts of data
[Bar chart: data volume in TBytes (axis from 0 to 250) for You, Facebook, Yahoo! Groups, and the German Climate Computing Centre]
Introduction – Problem
    The data needs to be processed, but how?

     Can't process all of this data on one machine
     → Distribute the processing to many machines
Introduction – Approach
           Distributed computing is the solution
           "Let's write our own distributed computing
              software as a solution to our problem"
         Checklist
•  design protocols
•  design data structures
•  write the code
•  ensure fault tolerance

→ Development takes a long time
→ Expensive: Cost-benefit ratio?

   Build complex software for simple computations?
Google MapReduce – Idea
      A framework for distributed computing

  Don't care about protocols, fault tolerance, etc.

           Just write your simple computation
Google MapReduce – Idea
               MapReduce Paradigm
Map: Apply a function to all elements of a list
    square x = x * x;
    map square [1, 2, 3, 4, 5];
    → [1, 4, 9, 16, 25]

Reduce: Combine all elements of a list
    reduce (+) [1, 2, 3, 4, 5];
    → 15
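The two primitives can be tried out directly in Python. This is a minimal sketch using the built-in `map` and the standard-library `functools.reduce`; it is not code from the talk, and the variable names are chosen only for illustration.

```python
from functools import reduce
from operator import add


def square(x):
    """Return x multiplied by itself."""
    return x * x


numbers = [1, 2, 3, 4, 5]

# Map: apply square to every element of the list.
squares = list(map(square, numbers))

# Reduce: combine all elements of the list with +.
total = reduce(add, numbers)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```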
Google MapReduce – Idea
               Basic functioning

      Input → Map → Reduce → Output
Google MapReduce – Overview
[Diagram: a MapReduce-based user program with a master coordinating workers. The input file is stored on GFS as Splits 1 to 5. In the map phase, three workers read the splits and write Intermediate Files 1 to 3. In the reduce phase, two workers read the intermediate files and write output Files 1 and 2 back to GFS.]
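The data flow from splits through intermediate files to output files can be simulated in a single process. The sketch below assumes word count as the job (a common illustration, not necessarily the one used in the talk); `map_fn`, `partition`, `reduce_fn` and `run_job` are hypothetical names, and the partitioner is a toy stand-in for whatever a real framework uses.

```python
from collections import defaultdict


def map_fn(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def partition(pairs, num_reducers):
    """Group intermediate pairs by key and assign each key to one reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Toy deterministic partitioner (assumption): sum of character codes.
        index = sum(ord(c) for c in key) % num_reducers
        buckets[index][key].append(value)
    return buckets


def reduce_fn(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    """Run the map phase over all splits, then one reduce task per bucket."""
    intermediate = []
    for split in splits:  # map phase: one task per split
        intermediate.extend(map_fn(split))
    outputs = []
    for bucket in partition(intermediate, num_reducers):  # reduce phase
        outputs.append(dict(reduce_fn(k, v) for k, v in sorted(bucket.items())))
    return outputs  # one dict per "output file"


splits = ["the quick brown fox", "the lazy dog", "the fox"]
output_files = run_job(splits)
merged = {k: v for part in output_files for k, v in part.items()}
print(merged["the"])  # 3
```

Each key ends up in exactly one output file, which is why the per-reducer results can simply be merged afterwards.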
MapReduce – Fault Tolerance
•  Workers are periodically pinged by the master
•  No answer within a certain time → worker is considered failed

Mapper fails:
     –  Reset its map job to idle
     –  Even if the job was completed → its intermediate files are
        inaccessible, so the job must be re-executed
     –  Notify reducers where to get the new intermediate file
Reducer fails:
     –  Reset its job to idle
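The master's bookkeeping for failed workers can be sketched as follows. This is a simplified model under stated assumptions: `Task` and `Master` are hypothetical classes, time is passed in as plain numbers instead of real pings, and completed reduce tasks are left alone because, as described in the original MapReduce paper, their output already sits in the global file system.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None   # worker currently or last assigned


class Master:
    def __init__(self, tasks, timeout):
        self.tasks = tasks
        self.timeout = timeout
        self.last_reply = {}       # worker name -> time of last ping reply

    def record_reply(self, worker, now):
        """Remember when a worker last answered a ping."""
        self.last_reply[worker] = now

    def check_workers(self, now):
        """Mark silent workers as failed and reset their tasks to idle."""
        failed = [w for w, t in self.last_reply.items() if now - t > self.timeout]
        for worker in failed:
            del self.last_reply[worker]
            for task in self.tasks:
                if task.worker != worker:
                    continue
                # Map output lives on the failed machine, so even completed
                # map tasks are reset; reduce tasks only if still running.
                if task.kind == "map" or task.state == IN_PROGRESS:
                    task.state, task.worker = IDLE, None
        return failed


tasks = [
    Task("map", COMPLETED, "w1"),
    Task("map", IN_PROGRESS, "w2"),
    Task("reduce", IN_PROGRESS, "w1"),
    Task("reduce", COMPLETED, "w3"),
]
master = Master(tasks, timeout=10)
for name in ("w1", "w2", "w3"):
    master.record_reply(name, now=0)
master.record_reply("w2", now=8)
master.record_reply("w3", now=8)

failed = master.check_workers(now=12)
print(failed)                     # ['w1']
print([t.state for t in tasks])   # ['idle', 'in_progress', 'idle', 'completed']
```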
MapReduce – Fault Tolerance
Master fails:
     –  Master periodically writes checkpoints
     –  In case of failure, the MapReduce operation is aborted
     –  Operation can be restarted from the last checkpoint
Google MapReduce – GFS
               Google File System
•  In-house distributed file system at Google
•  Stores all input and output files
•  Stores files…
     – divided into 64 MB blocks
     – on at least 3 different machines
•  Machines running GFS also run MapReduce
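The block size and replication factor translate into simple arithmetic. The sketch below uses the two figures from the slide (64 MB blocks, 3 copies); the round-robin placement in `place_blocks` is a toy policy assumed for illustration and does not describe how GFS actually chooses machines.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, as stated on the slide
REPLICAS = 3                    # at least 3 machines, as stated on the slide


def block_count(file_size):
    """Number of blocks needed for a file of file_size bytes (rounded up)."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE


def place_blocks(file_size, machines):
    """Assign each block to REPLICAS distinct machines (toy round-robin policy)."""
    if len(machines) < REPLICAS:
        raise ValueError("need at least as many machines as replicas")
    placement = []
    for block in range(block_count(file_size)):
        placement.append(
            [machines[(block + r) % len(machines)] for r in range(REPLICAS)]
        )
    return placement


one_gb = 1024 ** 3
print(block_count(one_gb))  # 16

placement = place_blocks(200 * 1024 * 1024, ["m1", "m2", "m3", "m4"])
print(len(placement))       # 4
print(placement[0])         # ['m1', 'm2', 'm3']
```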
Google MapReduce – Job Example
[Slides 15 to 18: figures walking through a job example; the images are not reproduced in this text]
Alternative Implementations
Apache Hadoop

•    Open-source implementation in Java
•    Jobs can be written in C++, Java, Python, etc.
•    Used by Yahoo!, Facebook, Amazon and others
•    Most commonly used implementation
•    HDFS as an open-source implementation of GFS
•    Can also access data via Amazon S3, HTTP(S) or FTP
•    Extensions: Hive, Pig, HBase
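Jobs in languages other than Java are typically run through Hadoop Streaming, which hands records to the mapper and reducer as text lines and expects tab-separated key/value lines back. The sketch below shows that line format for an assumed word-count job; `mapper` and `reducer` are hypothetical names, and the `sorted` call in the dry run stands in for the sort the framework performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run on a single sample line.
sample = ["to be or not to be"]
result = list(reducer(sorted(mapper(sample))))
for line in result:
    print(line)
```

In a real streaming job the two functions would live in separate scripts that read standard input and write standard output.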
Alternative Implementations
Mars
     MapReduce implementation for NVIDIA GPUs using the CUDA framework

MapReduce-Cell
     Implementation for the Cell multi-core processor

Qizmt
     MySpace's implementation of MapReduce in C#
Alternative Implementations

     There are many other open- and closed-source implementations of MapReduce!
Reception and Criticism
•  Yahoo!: Hadoop on a 10,000-server cluster
•  Facebook analyses its daily log (25 TB) on
   a 1,000-server cluster
•  Amazon Elastic MapReduce: Hadoop
   clusters for rent on EC2 and S3
•  IBM and Google: Support university
   courses in distributed programming
•  UC Berkeley announced it would teach freshmen
   MapReduce programming
Reception and Criticism
•  Criticism mainly by RDBMS experts
   DeWitt and Stonebraker
•  MapReduce
     – is a step backwards in database access
     – is a poor implementation
     – is not novel
     – is missing features that are routinely provided
       by modern DBMSs
     – is incompatible with the DBMS tools
Reception and Criticism
               Response to criticism

              MapReduce is not an RDBMS

   It is well suited to processing and structuring huge
              amounts of unstructured data

      MapReduce's big innovation is that it enables
     distributing data processing across a network of
         cheap and possibly unreliable computers
Trends and Future Development
   Trend of using MapReduce/Hadoop as a parallel database

•  Hive: Query language for Hadoop
•  HBase: Column-oriented distributed database
   (modeled after Google's Bigtable)
•  Map-Reduce-Merge: Adding a merge step to the
   paradigm allows implementing features of
   relational algebra
Trends and Future Development
   Trend of using the MapReduce paradigm to
         better utilize multi-core CPUs

•  Qt Concurrent
     –  Simplified C++ version of MapReduce for distributing
        tasks between multiple processor cores
•  Mars
•  MapReduce-Cell
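The same pattern applies on a single machine: split the work into chunks, map over them with a pool of workers, then reduce the partial results. The sketch below uses Python's `concurrent.futures` as a stand-in and is not related to the Qt Concurrent API; `count_words` and `parallel_word_count` are hypothetical names. A thread pool is used here to keep the example simple; for CPU-bound work in CPython, a `ProcessPoolExecutor` (which offers the same `map` interface) would be needed to actually occupy several cores.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(chunk):
    """Map step: count the words in one chunk of text."""
    return len(chunk.split())


def parallel_word_count(chunks, workers=4):
    """Map count_words over all chunks with a worker pool, then sum the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Reduce step: combine the partial counts into one total.
    return reduce(lambda a, b: a + b, partial_counts, 0)


chunks = ["a b c", "d e", "f g h i"]
print(parallel_word_count(chunks))  # 9
```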
Conclusion
MapReduce
•  provides an easy solution for processing large amounts of data
•  brings a paradigm shift in programming
•  changed the world, i.e. made data processing more efficient and
   cheaper, and is the foundation of many other approaches and solutions
Questions?

Thank You!
