Introduction to MapReduce
Mohamed Baddar
Senior Data Engineer
Contents
Computation Models for Distributed Computing
MPI
MapReduce
Why MapReduce?
How MapReduce works
Simple example
References
Distributed Computing
Why?
Booming of big data generation (social media, e-commerce, banks, etc.)
Big data with machine learning, data mining, and AI have become bread and butter: better results come from analyzing larger datasets.
How does it work?
Data partitioning: divide the data across multiple tasks, each applying the same procedure (computation) to its own data segment at a given phase.
Task partitioning: assign different tasks to different computation units.
Hardware for distributed computing
Multiple processors (multi-core processors)
Metrics
How to judge a computational model's suitability?
Simplicity: level of developer experience required
Scalability: adding more compute nodes should increase throughput / improve response time
Fault tolerance: support for recovering computed results when a node goes down
Maintainability: how easy it is to fix bugs and add features
Cost: need for special hardware (multi-core processors, large RAM, InfiniBand) versus a common Ethernet cluster of commodity machines
No one size fits all
Sometimes it is better to use hybrid computational models.
MPI (Message Passing Interface)
● Workload is divided among different processes (each process may have multiple threads)
● Communication is via message passing
● Data exchange is via shared memory (physical / virtual)
● Pros
○ Flexibility: the programmer can customize messages and communication between nodes
○ Speed: relies on sharing data via memory
Source: https://computing.llnl.gov/tutorials/mpi/
MapReduce
Objective: design a scalable parallel programming framework that can be deployed on large clusters of commodity machines.
Data is divided into splits, each processed by map functions, whose outputs are processed by reduce functions.
Originated at Google, where the first practical implementation was built, in 2004.
MapReduce implementations
Apache Hadoop (computation)
MapReduce Execution (1)
Number of mappers: M = 3
Number of reducers: R = 2
MapReduce function signatures:
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)
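These signatures can be made concrete with a word-count sketch. This is an illustrative, single-process simulation in Python (the function names and the in-memory driver are invented for this example), not the actual framework API:

```python
# Minimal in-memory word count in the map/reduce style (illustrative sketch).
from collections import defaultdict

def map_fn(key, value):
    # key: document name (k1), value: document text (v1)
    # emits list(k2, v2) = [(word, 1), ...]
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key: word (k2), values: list of counts (list(v2))
    return [sum(values)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Group all intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

result = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn)
# result == {"a": [2], "b": [2], "c": [1]}
```

The real framework distributes the grouping step (the shuffle) across machines; the logic per key is the same.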
MapReduce - Execution (2)
Platform:
Nodes communicating over an Ethernet network using TCP/IP.
Two main types of processes:
Master: orchestrates the work
Worker: processes data
Units of work:
Job: a MapReduce job is a unit of work that the client wants performed; it consists of the input data, the MapReduce program, and configuration.
Task: can be a map task (processes input into intermediate data) or a reduce task (processes intermediate data into output). A job is divided into several map/reduce tasks.
MapReduce Execution (3)
1. A copy of the master process is created.
2. Input data is divided into M splits, each 16 to 64 MB (user configurable).
3. M map tasks are created and given unique IDs. Each parses the key/value pairs of its split, starts processing, and writes its output into a memory buffer.
4. Map output is partitioned into R partitions. When buffers fill up, they are spilled to local hard disks, and the master is notified of the saved buffer locations. All records with the same key are put in the same partition.
Note: map output is stored in the local worker file system, not the distributed file system, because it is intermediate data and to avoid complexity.
5. Shuffling: when a reduce worker receives a notification from the master that one of the map tasks has finished, it reads the intermediate data for its partition from that map worker's local disk.
MapReduce Execution (4)
6. When the reduce worker has received all of its intermediate input, it sorts it by key (sorting is needed because a reduce task may handle several keys). (1)
7. When sorting has finished, the reduce worker iterates over each key, passing the key and its list of values to the reduce function.
8. The output of the reduce function is appended to the output file corresponding to this reduce worker.
9. For each HDFS block of a reduce task's output, one replica is stored locally on the reduce worker and the other two (assuming a replication factor of 3) are replicated on off-rack nodes for reliability.
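Steps 6 and 7 (sort the intermediate pairs, then group values by key) can be sketched with standard-library tools. This is an in-memory illustration; a real reduce worker performs an external sort over data fetched from many map workers:

```python
from itertools import groupby
from operator import itemgetter

# Intermediate (key, value) pairs as a reduce worker might receive them,
# interleaved from several map workers.
intermediate = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("c", 1)]

# Step 6: sort by key so that equal keys become adjacent.
intermediate.sort(key=itemgetter(0))

def reduce_fn(key, values):
    return sum(values)

# Step 7: iterate over each key, passing the key and its list of values
# to the reduce function.
output = [(k, reduce_fn(k, [v for _, v in group]))
          for k, group in groupby(intermediate, key=itemgetter(0))]
# output == [("a", 2), ("b", 2), ("c", 1)]
```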
Master responsibilities
Find idle workers to assign map and reduce tasks to.
Monitor each task's status (idle, in-progress, finished).
Keep track of the locations of the R intermediate output partitions on each map worker machine.
Keep a record of worker IDs and other info (CPU, memory, disk size).
Continuously push information about intermediate map output to reduce workers.
Fault tolerance (1)
Objective: handle machine failures gracefully, i.e. the programmer does not need to handle them or be aware of the details.
Two types of failures:
Master failure
Worker failure
Two main activities:
Failure detection
Recovering lost (computed) data with the least recomputation
Fault tolerance (2)
Worker failure
Detection: timeout on the master's ping; mark the worker as failed.
Remove the worker from the list of available workers.
For all map tasks assigned to that worker:
mark these tasks as idle
these tasks become eligible for re-scheduling on other workers
completed map tasks are re-executed because their output is stored in the failed machine's local file system
all reduce workers are notified of the re-execution so they can fetch any intermediate data they have not read yet.
There is no need to re-execute completed reduce tasks, as their output is stored in the distributed file system.
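The detection and re-scheduling bookkeeping above can be sketched as follows. The class and field names are invented for illustration; this is not the real master's implementation:

```python
# Hypothetical sketch of the master's worker-failure handling.
class Master:
    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_ping = {}   # worker_id -> time of last ping response
        self.map_tasks = {}   # task_id -> {"worker": id, "state": str}

    def ping(self, worker_id, now):
        self.last_ping[worker_id] = now

    def check_failures(self, now):
        failed = [w for w, t in self.last_ping.items() if now - t > self.timeout]
        for w in failed:
            del self.last_ping[w]   # remove from available workers
            for task in self.map_tasks.values():
                # Map output lived on the failed worker's local disk, so
                # both in-progress AND completed map tasks go back to
                # idle, making them eligible for re-scheduling.
                if task["worker"] == w:
                    task["state"] = "idle"
                    task["worker"] = None
        return failed

m = Master(timeout=10.0)
m.ping("w1", now=0.0)
m.map_tasks["t1"] = {"worker": "w1", "state": "completed"}
m.check_failures(now=5.0)    # w1 still within its timeout
m.check_failures(now=20.0)   # w1 timed out -> t1 re-marked idle
```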
Semantics in the Presence of Failures
Deterministic and nondeterministic functions
Deterministic functions always return the same result any time they are called with a specific set of input values.
Nondeterministic functions may return different results each time they are called with a specific set of input values.
If the map and reduce functions are deterministic, the distributed implementation of the MapReduce framework must produce the same output as a non-faulting sequential execution of the program.
Several copies of the same map/reduce task might run on different nodes for the sake of reliability and fault tolerance.
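The distinction can be made concrete with a toy pair of map functions (illustrative only):

```python
import random

# Deterministic map: the same input always yields the same output, so
# redundant executions on different nodes agree.
def det_map(key, value):
    return [(key, len(value))]

# Nondeterministic map (illustrative): output depends on hidden state,
# so two redundant executions of the same task may disagree.
def nondet_map(key, value):
    return [(key, len(value) + random.randint(0, 1))]

assert det_map("k", "abc") == det_map("k", "abc")  # always equal
# nondet_map("k", "abc") may differ from one call to the next.
```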
Semantics in the Presence of Failures (2)
Mappers always write their output to temporary files (atomic commits).
When a map task finishes:
It renames the temporary file to the final output name.
It sends a message to the master, informing it of the filename.
If another copy of the same map task finished earlier, the master ignores the message; otherwise, it stores the filename.
Reducers do the same; if multiple copies of the same reduce task finish, the MapReduce framework relies on the atomic rename provided by the file system.
If a map task is non-deterministic and multiple copies of it run on different machines, a weak semantic condition can occur: two reducers may read output produced by different executions of the same map task.
Semantics in the Presence of Failures (3)
● Workers #1 and #2 run the same copy of map task M1.
● Reduce task R1 reads its input for M1 from worker #1.
● Reduce task R2 reads its input for M1 from worker #2, as worker #1 has failed by the time R2 starts.
● If M1's function is deterministic, we have complete consistency.
● If M1's function is not deterministic, R1 and R2 may receive different results from M1.
Task granularity
Load balancing: fine-grained is better; faster machines tend to take on more tasks than slower machines over time, which leads to shorter overall job execution time.
Failure recovery: less time to re-execute failed tasks.
Very fine-grained tasks may not be desirable: management overhead and too much data shuffling (consuming valuable bandwidth).
Optimal granularity: split size = HDFS block size (128 MB by default)
An HDFS block is guaranteed to reside on a single node.
We want to maximize the work each mapper does locally:
if split size < block size: we do not fully utilize local data processing
if split size > block size: data transfer may be needed to complete the map function
Data locality
Network bandwidth is a valuable resource.
We assume rack-server hardware.
The MapReduce scheduler works as follows:
1. Try to assign the map task to the node where the
corresponding split block(s) reside; if that node is
free, assign it, else go to step 2.
2. Try to find a free node in the same rack to assign
the map task to; if none is found, assign it to a free
off-rack node.
● More complex implementations use a network cost
model.
Backup tasks
Stragglers: machines that run their assigned
(map or reduce) tasks very slowly.
Slow running can have many causes:
a bad disk, a slow network, a low-speed CPU.
Other tasks scheduled on stragglers add more
load and further lengthen execution time.
Solution mechanism:
When a MapReduce job is close to finishing,
issue backup (speculative) copies of all in-progress tasks;
a task is marked complete as soon as either its primary
or its backup execution finishes.
Refinements
Partitioning function:
Partitions the output of map tasks into R partitions (one per reduce task).
A good partitioning function should make the partitions as equal as possible.
Default: hash(key) mod R
Usually works fine
A problem arises when specific keys have many more records
than the others.
Then you need to design a custom hash function or change the key.
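The default scheme can be sketched as follows. Since Python's built-in hash() is randomized per process for strings, this sketch substitutes a CRC32 checksum as a stand-in for a stable hash:

```python
import zlib

# Default partitioning scheme: hash(key) mod R (illustrative sketch).
def partition(key, R):
    # Stable checksum of the key, modulo the number of reducers, so the
    # same key always maps to the same partition across workers.
    return zlib.crc32(key.encode("utf-8")) % R

R = 2
pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
partitions = {r: [] for r in range(R)}
for k, v in pairs:
    partitions[partition(k, R)].append((k, v))
# Every record is assigned, and all records with the same key land in
# the same partition.
```

A skewed key (one with far more records than the others) overloads a single partition; that is the case where a custom partitioner pays off.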
Combiner function
Reduces the size of the map tasks' intermediate output.
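A combiner is a "mini-reduce" run on the map side so that less intermediate data crosses the network. For an associative, commutative reduce function such as a word-count sum, the reduce logic can double as the combiner; an illustrative in-memory sketch:

```python
from collections import Counter

# Raw intermediate output of one map task: one record per word occurrence.
map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

def combine(pairs):
    # Pre-aggregate counts per key locally on the map worker before the
    # data is written out and shuffled to the reducers.
    counts = Counter()
    for k, v in pairs:
        counts[k] += v
    return sorted(counts.items())

combined = combine(map_output)
# combined == [("a", 3), ("b", 1)] -- 4 records shrunk to 2
```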
Refinements (2)
Skipping bad records
A bug in a third-party library that can't be fixed may cause the code to crash on specific records.
Terminating a job that has been running for hours or days is more expensive than sacrificing a small percentage of accuracy (if the context allows it, e.g. statistical analysis of large data).
How does MapReduce handle this?
1. Each worker process installs a signal handler that catches segmentation violations, bus
errors, and other possible fatal errors.
2. Before a map/reduce task runs, the MapReduce library stores the record's key/value in a global
variable.
3. When the map/reduce task's code generates a signal, the worker sends a UDP packet containing the record's sequence number to the master; when the master has seen more than one failure on a particular record, it indicates that the record should be skipped on re-execution.
References
1. MapReduce: Simplified Data Processing on Large Clusters (OSDI '04)
http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
2. Hadoop: The Definitive Guide, Chapters 1, 2, and 3
3. https://dgleich.wordpress.com/2012/10/05/why-mapreduce-is-successful-its-the-io/
4. http://www.infoworld.com/article/2616904/business-intelligence/mapreduce.html
5. http://research.google.com/archive/mapreduce-osdi04-slides/

Contenu connexe

Tendances

Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersAbhishek Singh
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepSubhas Kumar Ghosh
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceBhupesh Chawda
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersCleverence Kombe
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
MapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large ClustersMapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large ClustersAbolfazl Asudeh
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduceHassan A-j
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...Adrian Florea
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusterskazuma_sato
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operationSubhas Kumar Ghosh
 

Tendances (20)

Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large Clusters
 
MapReduce
MapReduceMapReduce
MapReduce
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
MapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large ClustersMapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large Clusters
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
E031201032036
E031201032036E031201032036
E031201032036
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusters
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operation
 

En vedette

Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Hadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MGHadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MGPradeep MG
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceKrishna Sangeeth KS
 
Pyshark in Network Packet analysis
Pyshark in Network Packet analysisPyshark in Network Packet analysis
Pyshark in Network Packet analysisRengaraj D
 
Cyber terrorism by_Ali_Fahad
Cyber terrorism by_Ali_FahadCyber terrorism by_Ali_Fahad
Cyber terrorism by_Ali_Fahadaliuet
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014soujavajug
 
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesModern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesLorenzo Alberton
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
An Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDBAn Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDBRainforest QA
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsSkillspeed
 
cyber terrorism
cyber terrorismcyber terrorism
cyber terrorismAccenture
 

En vedette (20)

Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Mapreduce Osdi04
Mapreduce Osdi04Mapreduce Osdi04
Mapreduce Osdi04
 
Algorithms for Cloud Computing
Algorithms for Cloud ComputingAlgorithms for Cloud Computing
Algorithms for Cloud Computing
 
Hadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MGHadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MG
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and Mapreduce
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
Pyshark in Network Packet analysis
Pyshark in Network Packet analysisPyshark in Network Packet analysis
Pyshark in Network Packet analysis
 
Cyber terrorism by_Ali_Fahad
Cyber terrorism by_Ali_FahadCyber terrorism by_Ali_Fahad
Cyber terrorism by_Ali_Fahad
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesModern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
An Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDBAn Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDB
 
Ipv6
Ipv6Ipv6
Ipv6
 
IPV6 ppt
IPV6 pptIPV6 ppt
IPV6 ppt
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
Cyber terrorism
Cyber terrorismCyber terrorism
Cyber terrorism
 
cyber terrorism
cyber terrorismcyber terrorism
cyber terrorism
 
ipv6 ppt
ipv6 pptipv6 ppt
ipv6 ppt
 

Similaire à Introduction to map reduce

mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptxShimoFcis
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 
Mapreduce2008 cacm
Mapreduce2008 cacmMapreduce2008 cacm
Mapreduce2008 cacmlmphuong06
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Soumee Maschatak
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combinersijcsit
 
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
MapReduce: Ordering and  Large-Scale Indexing on Large ClustersMapReduce: Ordering and  Large-Scale Indexing on Large Clusters
MapReduce: Ordering and Large-Scale Indexing on Large ClustersIRJET Journal
 
MapReduce presentation
MapReduce presentationMapReduce presentation
MapReduce presentationVu Thi Trang
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 

Similaire à Introduction to map reduce (20)

mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
 
Map reduce
Map reduceMap reduce
Map reduce
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Mapreduce2008 cacm
Mapreduce2008 cacmMapreduce2008 cacm
Mapreduce2008 cacm
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Hadoop
HadoopHadoop
Hadoop
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
 
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
MapReduce: Ordering and  Large-Scale Indexing on Large ClustersMapReduce: Ordering and  Large-Scale Indexing on Large Clusters
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
MapReduce
MapReduceMapReduce
MapReduce
 
MapReduce presentation
MapReduce presentationMapReduce presentation
MapReduce presentation
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Hadoop
HadoopHadoop
Hadoop
 
MapReduce
MapReduceMapReduce
MapReduce
 

Dernier

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 

Dernier (20)

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...

Introduction to MapReduce

  • 1. Introduction to MapReduce Mohamed Baddar Senior Data Engineer
  • 2. Contents Computation Models for Distributed Computing MPI MapReduce Why MapReduce? How MapReduce works Simple example References 2
  • 3. Distributed Computing Why? Booming growth of big data generation (social media, e-commerce, banking, etc.). Big data has become the bread and butter of machine learning, data mining, and AI: better results come from analyzing larger data sets. How it works? Data partitioning: divide the data into multiple segments, each processed by the same procedure (computation) at a specific phase. Task partitioning: assign different tasks to different computation units. Hardware for distributed computing: multiple processors (multi-core processors) 3
  • 4. Metrics How to judge a computational model's suitability? Simplicity: how much developer experience is required. Scalability: does adding more computational nodes increase throughput / improve response time? Fault tolerance: can computed results be recovered when a node goes down? Maintainability: how easy is it to fix bugs and add features? Cost: is special hardware needed (multi-core processors, large RAM, InfiniBand), or can a common Ethernet cluster of commodity machines be used? No one size fits all; sometimes it is better to use hybrid computational models. 4
  • 5. MPI (Message Passing Interface) ● Workload is divided among different processes (each process may have multiple threads) ● Communication is via message passing ● Data exchange is via shared memory (physical / virtual) ● Pros ○ Flexibility: the programmer can customize messages and communication between nodes ○ Speed: relies on sharing data via memory Source: https://computing.llnl.gov/tutorials/mpi/ 5
  • 6. MapReduce Objective: design a scalable parallel programming framework that can be deployed on a large cluster of commodity machines. Data is divided into splits, each processed by map functions, whose output is processed by reduce functions. Originated and first practically implemented at Google Inc. in 2004. MapReduce implementations: Apache Hadoop (computation) 6
  • 7. MapReduce Execution (1) #Mappers (M=3), #Reducers (R=2). MapReduce functions: map(K1,V1) → list(K2,V2); reduce(K2,list(V2)) → list(V2) 7
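The two signatures above can be illustrated with a hypothetical word-count job (the function names here are illustrative, not part of any framework API):

```python
# Hypothetical word-count job illustrating the signatures above:
#   map(K1, V1) -> list(K2, V2)   and   reduce(K2, list(V2)) -> list(V2)

def wordcount_map(doc_id, text):
    # K1 = document id, V1 = document text; emit a (word, 1) pair per word
    return [(word, 1) for word in text.split()]

def wordcount_reduce(word, counts):
    # K2 = word, list(V2) = all the 1s emitted for that word; emit the total
    return [sum(counts)]
```

For example, `wordcount_map("d1", "a b a")` emits `[("a", 1), ("b", 1), ("a", 1)]`, and after grouping, `wordcount_reduce("a", [1, 1])` yields `[2]`.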
  • 8. MapReduce - Execution (2) Platform: nodes communicating over an Ethernet network using TCP/IP. Two main types of processes: Master: orchestrates the work. Worker: processes data. Units of work: Job: a MapReduce job is a unit of work that the client wants performed; it consists of the input data, the MapReduce program, and configuration. Task: can be a map task (processes input into intermediate data) or a reduce task (processes intermediate data into output). A job is divided into several map / reduce tasks. 8
  • 9. MapReduce Execution (3) 1. A copy of the master process is created. 2. Input data is divided into M splits, each of 16 to 64 MB (user configured). 3. M map tasks are created and given unique IDs; each parses the key/value pairs of its split, starts processing, and writes its output into a memory buffer. 4. Map output is partitioned into R partitions. When buffers are full, they are spilled to local hard disks, and the master is notified of the saved buffer locations. All records with the same key are put in the same partition. Note: map output is stored in the local worker file system, not the distributed file system, as it is intermediate data and to avoid complexity. 5. Shuffling: when a reduce worker receives a notification from the master that one of the map tasks has completed, it reads the corresponding partition from that map worker's local disk. 9
  • 10. MapReduce Execution (4) 6. When the reduce worker has received all its intermediate input, it sorts it by key (sorting is needed as a reduce task may handle several keys). (1) 7. When sorting finishes, the reduce worker iterates over each key, passing the key and its list of values to the reduce function. 8. The output of the reduce function is appended to the output file corresponding to this reduce worker. 9. For each HDFS block of the reduce task's output, one replica is stored locally on the reduce worker and the other two (assuming a replication factor of 3) are replicated on off-rack nodes for reliability. 10
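The execution steps on these two slides can be sketched as a toy single-process simulation (all names are hypothetical; a real framework runs each task on a separate worker and moves data over the network):

```python
from collections import defaultdict

def run_job(splits, map_fn, reduce_fn, R=2):
    """Toy simulation of the steps above: map each split, hash-partition
    the output into R buckets, sort each bucket by key, then reduce each
    key group. Illustrative only; no distribution or fault tolerance."""
    # Map phase: one "map task" per split, output partitioned into R buckets
    partitions = [defaultdict(list) for _ in range(R)]
    for split_id, split in enumerate(splits):
        for k2, v2 in map_fn(split_id, split):
            partitions[hash(k2) % R][k2].append(v2)
    # Reduce phase: each "reduce task" sorts its keys, then applies reduce_fn
    output = {}
    for part in partitions:
        for key in sorted(part):
            output[key] = reduce_fn(key, part[key])
    return output
```

With word-count map and reduce functions, `run_job(["a b a", "b c"], ...)` would produce `{"a": [2], "b": [2], "c": [1]}`.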
  • 11. Master responsibilities Find idle nodes (workers) to assign map and reduce tasks to. Monitor each task's status (idle, in-progress, finished). Keep track of the locations of the R intermediate output partitions on each map worker machine. Keep a record of worker IDs and other info (CPU, memory, disk size). Continuously push information about intermediate map output to reduce workers. 11
  • 12. Fault tolerance (1) Objective: handle machine failures gracefully, i.e. the programmer doesn't need to handle them or be aware of the details. Two types of failures: Master failure. Worker failure. Two main activities: Failure detection. Recovering lost (computed) data with the least recomputation possible. 12
  • 13. Fault tolerance (2) Worker failure Detection: timeout on the master's ping; mark the worker as failed. Remove the worker from the list of available workers. For all map tasks assigned to that worker: mark these tasks as idle, making them eligible for re-scheduling on other workers. Map tasks are re-executed because their output is stored in the local file system of the failed machine. All reduce workers are notified of the re-execution so they can fetch any intermediate data they haven't retrieved yet. There is no need to re-execute reduce tasks, as their output is stored in the distributed file system. 13
  • 14. Semantics in the Presence of Failures Deterministic and nondeterministic functions Deterministic functions always return the same result any time they are called with a specific set of input values. Nondeterministic functions may return different results each time they are called with a specific set of input values. If the map and reduce functions are deterministic, a distributed implementation of the MapReduce framework must produce the same output as a non-faulting sequential execution of the program. Several copies of the same map/reduce task might run on different nodes for the sake of reliability and fault tolerance. 14
  • 15. Semantics in the Presence of Failures (2) Mappers always write their output to tmp files (atomic commits). When a map task finishes: It renames the tmp file to the final output name. It sends a message to the master with the filename. If another copy of the same map task finished first, the master ignores the message; otherwise it stores the filename. Reducers do the same, and if multiple copies of the same reduce task finish, the MapReduce framework relies on the atomic rename provided by the file system. If map tasks are non-deterministic and multiple copies of the same map task run on different machines, a weaker semantic condition can arise: two reducers may read output produced by different executions of the same map task. 15
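The tmp-file-then-rename commit described above can be sketched as follows (a minimal sketch; real implementations also encode task and attempt IDs in the filename, and the master deduplicates completion messages):

```python
import os
import tempfile

def commit_task_output(data, final_path):
    """Write task output to a tmp file, then rename it into place.
    Readers never observe a partially written file, and if several
    copies of the same task commit, the last rename simply wins."""
    # Create the tmp file in the same directory so the rename stays
    # within one filesystem, where os.replace is atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp_path, final_path)  # atomic commit
```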
  • 16. Semantics in the Presence of Failures (3) 16 ● Workers #1 and #2 run copies of the same map task M1. ● Reduce task R1 reads its input for M1 from worker #1. ● Reduce task R2 reads its input for M1 from worker #2, as worker #1 has failed by the time R2 starts. ● If M1's function is deterministic, we have complete consistency. ● If M1's function is non-deterministic, R1 and R2 may receive different results from M1.
  • 17. Task granularity Load balancing: fine-grained is better; faster machines tend to take more tasks than slower machines over time, leading to shorter overall job execution time. Failure recovery: less time to re-execute failed tasks. Very fine-grained tasks may not be desirable: they add management overhead and too much data shuffling (consuming valuable bandwidth). Optimal granularity: split size = HDFS block size (128 MB by default). An HDFS block is guaranteed to be on a single node, and we want to maximize the work one mapper does locally: if split size < block size, we do not fully utilize local data processing; if split size > block size, data transfer may be needed for the map function to complete. 17
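As a worked example of the sizing rule above (split size = HDFS block size), assuming the 128 MB default:

```python
import math

HDFS_BLOCK_MB = 128  # default block size cited on the slide

def num_map_tasks(input_size_mb, split_size_mb=HDFS_BLOCK_MB):
    # One map task per split: M = ceil(input size / split size)
    return math.ceil(input_size_mb / split_size_mb)

# A 1 GB (1024 MB) input with 128 MB splits yields 8 map tasks,
# each of which can read its block from a local disk.
```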
  • 18. Data locality Network bandwidth is a valuable resource. We assume rack server hardware. The MapReduce scheduler works as follows: 1. Try to assign the map task to the node where the corresponding split block(s) reside; if it is free, assign the task, else go to step 2. 2. Try to find a free node in the same rack to assign the map task to; if none is available, assign it to a free off-rack node. ● More complex implementations use a network cost model. 18
  • 19. Backup tasks Stragglers: machines that run their assigned (MapReduce) tasks very slowly. Slow running can be due to many reasons: a bad disk, a slow network, a low-speed CPU. Other tasks scheduled on stragglers cause more load and longer execution times. Solution mechanism: when a MapReduce job is close to finishing, issue backup tasks for all the in-progress tasks. 19
  • 20. Refinements Partitioning function: partitions the output of the map tasks into R partitions (one per reduce task). A good function should make the partitions as equal as possible. Default: hash(key) mod R. This usually works fine; problems arise when specific keys have many more records than others. Then custom hash functions need to be designed, or the key changed. Combiner function: reduces the size of the intermediate map output. 20
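The default partitioner and a combiner for a word-count-style job might look like this (a hypothetical sketch; `hash` here is Python's built-in, which a real framework would replace with a stable hash function):

```python
def default_partition(key, R):
    # Default partitioning function from the slide: hash(key) mod R
    return hash(key) % R

def wordcount_combine(map_output):
    # Combiner: pre-sum the counts per key on the map side, shrinking
    # the intermediate output that must be shipped to the reducers
    combined = {}
    for key, count in map_output:
        combined[key] = combined.get(key, 0) + count
    return list(combined.items())
```

For example, the map output `[("a", 1), ("b", 1), ("a", 1)]` is combined to `[("a", 2), ("b", 1)]` before being written to disk, so the reducer for "a" receives one record instead of two.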
  • 21. Refinements (2) Skipping bad records A bug in a third-party library that can't be fixed causes the code to crash on specific records. Terminating a job that has been running for hours / days is more expensive than sacrificing a small percentage of accuracy (if the context allows, e.g. statistical analysis of large data). How does MapReduce handle that? 1. Each worker process installs a signal handler that catches segmentation violations, bus errors, and other possible fatal errors. 2. Before a map / reduce task runs, the MapReduce library stores the current record's identity in a global variable. 3. When the map / reduce function code generates a signal, the worker sends a UDP packet to the master identifying the offending record; if the master sees repeated failures on the same record, it instructs re-executions to skip it. 21
  • 22. References 1. MapReduce: Simplified Data Processing on Large Clusters, http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf 2. Hadoop: The Definitive Guide, Chapters 1-3 3. https://dgleich.wordpress.com/2012/10/05/why-mapreduce-is-successful-its-the-io/ 4. http://www.infoworld.com/article/2616904/business-intelligence/mapreduce.html 5. http://research.google.com/archive/mapreduce-osdi04-slides/ 22

Editor's notes

  1. image source http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/
  2. Source: the Google paper
  3. In the Google paper, sorting is mentioned as the reducer's responsibility; however, in Hadoop: The Definitive Guide, sorting is mentioned as the mapper's responsibility, with the reducer responsible for merging the sorted intermediate output.