MapReduce: Simplified Data Processing on Large Clusters
by Jeffrey Dean and Sanjay Ghemawat
Communications of the ACM, 2008
Presented by: Abolfazl Asudeh
7/4/2013
MapReduce
• Patented by Google
• A parallel programming model and an associated implementation for processing and generating large datasets
• Users specify the computation in terms of a map function and a reduce function
• The system automatically parallelizes the computation across large-scale clusters
• Previously, users had to handle the parallelization of their programs over hundreds or thousands of machines themselves:
  • Distribute the data
  • Handle failures
  • Schedule inter-machine communication to make efficient use of resources
• Observation from experience at Google:
  • Most of their computations involved applying a map operation to produce intermediate key/value pairs
  • and then applying a reduce operation to combine/aggregate the pairs that share the same key
Programming Model
• Input: a set of input key/value pairs
• Output: a set of output key/value pairs
• Map (written by the user):
  • Takes an input pair and produces a set of intermediate key/value pairs.
• MapReduce library:
  • Groups all intermediate values associated with the same intermediate key and passes them to the reduce function.
• Reduce (written by the user):
  • Accepts an intermediate key and a set of values for that key.
  • Merges these values together to form a possibly smaller set of values.
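The model above can be sketched as two user-supplied functions plus a library step that groups by key. This is a minimal single-process illustration, not the paper's implementation; the names `run_mapreduce`, `group_by_key`, `map_parity`, and `reduce_sum` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
V3 = TypeVar("V3")

# Map: (k1, v1) -> set of intermediate (k2, v2) pairs
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# Reduce: (k2, [v2, ...]) -> possibly smaller set of values
ReduceFn = Callable[[K2, List[V2]], Iterable[V3]]


def group_by_key(pairs):
    """Library step: collect all intermediate values that share a key."""
    groups: Dict = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map to every input pair, group by key, then reduce each group."""
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    output = []
    for key, values in sorted(group_by_key(intermediate).items()):
        for result in reduce_fn(key, values):
            output.append((key, result))
    return output


# Hypothetical user functions: sum numbers by parity.
def map_parity(name, number):
    yield ("even" if number % 2 == 0 else "odd", number)


def reduce_sum(key, values):
    yield sum(values)


result = run_mapreduce([("x", 1), ("y", 2), ("z", 3)], map_parity, reduce_sum)
print(result)  # [('even', 2), ('odd', 4)]
```

The user writes only `map_parity` and `reduce_sum`; everything else stands in for the library.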
Basic Example: Word Counting
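[Figure: word-count data flow through four map tasks (M) and two reduce tasks (R)]
Input (one line per map task):
  How now
  Brown cow
  How does
  It work now
Map output:
  <How,1> <now,1>
  <brown,1> <cow,1>
  <How,1> <does,1>
  <it,1> <work,1> <now,1>
Grouped by key (input to Reduce):
  <How,1 1> <now,1 1> <brown,1> <cow,1> <does,1> <it,1> <work,1>
Output:
  brown 1
  cow 1
  does 1
  How 2
  it 1
  now 2
  work 1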
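The word-count example can be written out as a short sketch. It runs in a single process and assumes that every word is lowercased before counting (the figure keeps "How" capitalized); the function names and the `split-N` labels are hypothetical.

```python
from collections import defaultdict


def map_word_count(doc_name, text):
    """Map: emit <word, 1> for every word in the input split."""
    for word in text.split():
        yield (word.lower(), 1)  # assumption: counting is case-insensitive


def reduce_word_count(word, counts):
    """Reduce: add up all the 1s emitted for this word."""
    yield sum(counts)


def word_count(documents):
    """Stand-in for the library: run map, group by key, run reduce."""
    groups = defaultdict(list)
    for name, text in documents:
        for word, one in map_word_count(name, text):
            groups[word].append(one)
    return {
        word: next(reduce_word_count(word, counts))
        for word, counts in sorted(groups.items())
    }


splits = [
    ("split-0", "How now"),
    ("split-1", "Brown cow"),
    ("split-2", "How does"),
    ("split-3", "It work now"),
]
counts = word_count(splits)
print(counts)
# {'brown': 1, 'cow': 1, 'does': 1, 'how': 2, 'it': 1, 'now': 2, 'work': 1}
```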
Execution Overview
1. The MapReduce library splits the input files into M pieces, typically 16-64 MB each, and starts up many copies of the program on a cluster of machines.
2. One copy of the program becomes the master; the rest are workers. The master assigns the M map tasks and R reduce tasks, picking idle workers and giving each one a task.
3. A worker assigned a map task reads the corresponding input split, parses key/value pairs out of it, and passes each pair to the user-defined map function. The intermediate key/value pairs that the map function produces are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.
  • The locations of the buffered pairs on the local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers.
5. A reduce worker remotely reads the buffered data from the local disk of the corresponding map worker.
  • It then sorts the data by intermediate key so that all occurrences of the same key are grouped together.
6. The reduce worker passes each intermediate key and its set of values to the user's reduce function. The output is appended to the final output file for that reduce partition.
7. When all map and reduce tasks are done, the MapReduce call returns to the user program.
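The steps above can be simulated in a single process to show how M map tasks feed R reduce tasks. This is only a sketch: in-memory lists stand in for local disks and remote reads, there is no master, and `zlib.crc32(key) mod R` is an assumed stand-in for the partitioning function. All function names are hypothetical.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reduce_tasks):
    """Assumed partitioning function: a stable hash of the key, mod R."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def run_map_task(split, map_fn, num_reduce_tasks):
    """Steps 3-4: run map on one split and partition its output into R regions."""
    regions = [[] for _ in range(num_reduce_tasks)]
    for key, value in map_fn(split):
        regions[partition(key, num_reduce_tasks)].append((key, value))
    return regions


def run_reduce_task(region_index, all_map_outputs, reduce_fn):
    """Steps 5-6: read one region from every map task, sort, group, reduce."""
    pairs = []
    for regions in all_map_outputs:
        pairs.extend(regions[region_index])
    pairs.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        results.append((key, reduce_fn(key, [value for _, value in group])))
    return results


def run_job(splits, map_fn, reduce_fn, num_reduce_tasks):
    """Steps 1-7 without a master: M = len(splits), R = num_reduce_tasks."""
    map_outputs = [run_map_task(s, map_fn, num_reduce_tasks) for s in splits]
    return [
        run_reduce_task(r, map_outputs, reduce_fn)
        for r in range(num_reduce_tasks)
    ]


def emit_words(split):
    return [(word.lower(), 1) for word in split.split()]


def sum_counts(key, values):
    return sum(values)


outputs = run_job(
    ["How now", "Brown cow", "How does", "It work now"],
    emit_words,
    sum_counts,
    num_reduce_tasks=2,
)
for r, part in enumerate(outputs):
    print(f"reduce partition {r}: {part}")
```

Each reduce partition yields its own sorted output list, mirroring the one output file per reduce task; which keys land in which partition depends on the hash.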
Fault Tolerance
• Failure detection: the master pings every worker periodically; a worker that does not respond within a certain time is marked as failed.
• If a map worker fails:
  • Reschedule all of its map tasks, including completed ones, because their output is stored on the failed machine's local disk and is no longer accessible.
  • Notify the reduce workers of the re-execution: any reduce worker that has not yet read the data from the failed worker reads it from the new map worker instead.
• If a reduce worker fails:
  • Its in-progress reduce tasks are rescheduled, but completed reduce tasks do not need to be re-executed, since their output is stored in a global file system.
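The rescheduling rules above can be sketched as bookkeeping on the master. The `Task` record, the state names, and the timeout value are assumptions made for illustration; the slides do not specify how the master stores this state.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # worker currently or last assigned


def find_failed_workers(last_reply: Dict[str, float], now: float,
                        timeout: float) -> List[str]:
    """A worker is considered failed if its last ping reply is too old."""
    return [w for w, t in last_reply.items() if now - t > timeout]


def handle_worker_failure(tasks: List[Task], failed_worker: str) -> List[Task]:
    """Reset every task that must run again after failed_worker dies."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output lived on the failed machine's local disk.
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in_progress" or lost_map_output:
            task.state = "idle"
            task.worker = None
            rescheduled.append(task)
        # Completed reduce tasks are left alone: output is in the global FS.
    return rescheduled


tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("map", "completed", "w2"),
]
failed = find_failed_workers({"w1": 0.0, "w2": 95.0}, now=100.0, timeout=10.0)
rescheduled = handle_worker_failure(tasks, failed[0])
print(failed, len(rescheduled))  # ['w1'] 3
```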
Execution Optimization
• Locality
  • Network bandwidth is a relatively scarce resource
  • Schedule map tasks on machines that hold a local replica of the input split; the replicas are distributed by the underlying distributed file system (GFS in the paper; HDFS in Hadoop)
• Task granularity
  • Ideally, M and R should be much larger than the number of worker machines
  • Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails
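Both optimizations can be illustrated with a toy model. The sketch below assumes a scheduler that hands the next task to whichever worker becomes free first, and workers with fixed relative speeds; `pick_worker`, `makespan`, and all the numbers are hypothetical, not measurements from the paper.

```python
import heapq


def pick_worker(replica_hosts, idle_workers):
    """Locality: prefer an idle worker that already stores the input split."""
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    # No local replica available: fall back to any idle worker.
    return idle_workers[0] if idle_workers else None


def makespan(num_tasks, total_work, worker_speeds):
    """Granularity: finish time when equal-sized tasks go to the next free worker."""
    task_size = total_work / num_tasks
    free_at = [(0.0, i) for i in range(len(worker_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        start, i = heapq.heappop(free_at)          # earliest-free worker
        done = start + task_size / worker_speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


print(pick_worker({"w2", "w5"}, ["w1", "w2", "w3"]))  # w2

speeds = [1.0, 1.0, 1.0, 0.5]       # one worker is half as fast
coarse = makespan(4, 12.0, speeds)   # one task per worker
fine = makespan(24, 12.0, speeds)    # many small tasks per worker
print(coarse, fine)
```

With one task per worker, the job waits for the slow machine to finish its full share; with many small tasks, the faster workers absorb most of the work, so the job finishes sooner.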
Practice
• Write the map and reduce functions for the PageRank algorithm.
Thank you
