SlideShare une entreprise Scribd logo
1  sur  21
MapReduce:
Simplified Data
Processing on
Large Clusters
Papers We Love
Bucharest Chapter
October 12, 2015
I am Adrian Florea
Architect Lead @ IBM Bucharest Software Lab
Hello!
What is?
◉MapReduce is a programming model and implementation for processing large data sets
◉Users specify a Map function that processes a key-value pair to generate a set of
intermediate key-value pairs and a Reduce function that aggregates all intermediate
values that share the same intermediate key in order to combine the derived data
appropriately
map:(k1, v1)->[(k2, v2)]
reduce:(k2, [v2])->[(k3, v3)]
Map
◉function (mathematics)
◉map (Java)
◉Select (.NET)
Where else have we seen this?
Reduce
◉fold (functional programming)
◉reduce (Java)
◉Aggregate (.NET)
Other analogies
"To draw an analogy to SQL, map is like
the group-by clause of an aggregate
query. Reduce is analogous to the
aggregate function that is computed
over all the rows with the same group-
by attribute"
D.J. DeWitt & M. Stonebraker
Divide-and-conquer algorithms
“recursively breaking down a problem into two
or more sub-problems of the same (or related)
type (divide), until these become simple enough
to be solved directly (conquer). The solutions to
the sub-problems are then combined to give a
solution to the original problem.”
Jeffrey Dean
Google Fellow
December 6, 2004 4PM
Sanjay Ghemawat
Google Fellow
History
◉April 1960: John McCarthy introduced the concept of “maplist”
◉September 4, 1998: Google founded
◉1998-2003: hundreds of special-purpose large data computation
programs in Google
◉February 2003: 1st version of MapReduce
◉August 2003: MapReduce significant enhancements
◉June 18, 2004: Patent US7650331 B1 filed
◉December 6, 2004: 1st MapReduce public presentation
◉2005: Hadoop implementation started in Java (Douglass R. Cutting &
Michael J. Cafarella)
◉September 4, 2007: Hadoop 0.14.1
◉January 19, 2010: Patent US7650331 B1 published
◉July 6, 2015: Hadoop 2.7.1
Distribution issues
◉Communication and routing
which nodes should be involved?
what transport protocol should be used?
threads/events/connections management
remote execution of your processing code?
◉Fault tolerance and fault detection
◉Load balancing / partitioning of data
heterogeneity of nodes
skew in data
network topology
◉Parallelization strategy
algorithmic issues of work splitting
“without having to deal with
failures, the rest of the support
code just isn’t so complicate”
S. Ghemawat
MapReduce model
Map Operation Reduce Operation
Input Data Intermediate Data Output Data
application-independent
Map Module
application-independent
Reduce Module
MapReduce system
application-specific
Map Operation
application-specific
Reduce OperationInput
Data
Intermediate
Data
Output
Data
Original article Hadoop wiki
Original article Hadoop wiki
MapReduce model in practice
map(String key, String value):
for each Word w in value:
EmitIntermediate(w, "1");
reduce(String key, Iterator values)
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(key, AsString(result));
map:(k1, v1)->[(k2, v2)]
reduce:(k2, [v2])->[(k3, v3)]
How long does it take to go
through 1 TB?
Sequentially: 3 hours
MapReduce: startup overhead 70 seconds + computation 80 seconds
Environment
◉ 1800 Linux dual-processorx86 machines, 2-4 GB memory
◉ Fast Ethernet/Giga Ethernet
◉ Inexpensive IDE disks and a distributed Google File System
Take a walk through a Google data center
Tianhe-2, #1 supercomputer: 3,120,000 cores, 1,5 PB total memory
Execution diagram
◉Master process is a task itself initiated by the WQM and is responsible for
assigning all other tasks to worker processes
◉Each worker invokes at least a map thread and a reduce thread
◉If a worker fails, its tasks are reassigned to another worker process
◉When WQM receives a job, it allocates the job to the master that calculates
and requires M+R+1 processes to be allocated to the job
◉WQM responds with the process allocation info (can result less processes) to
the master that will manage the performance of the job
◉Reduce tasks begin work when the master informs them that there are
intermediate files ready
◉Input data (files/DB/memory) are splitted in data blocks (16-64 MB)
automatically or configurable
◉The worker to which a map task has been assigned applies the map() operator
to the respective input data block
◉When the worker completes the task, it informs the master of the status
◉Master informs workers where to find intermediate data and schedules their
reading
◉Workers (3 & 4) sort the intermediate key-value pairs, then merge (by applying
reduce()) them and write to output
Workflow diagram
◉When a process completes a task it informs WQM
which updates the status tables
◉When WQM discovers one process failed, it assign its
tasks to a new process and updates the status tables
Task Status Table
◉TaskID
◉Status (InProgress, Waiting, Completed, Failed)
◉ProcessID
◉InputFiles (Input, Intermediate)
◉OutputFiles
Process Status Table
◉ProcessID
◉Status (Idle, Busy, Failed)
◉Location (CPU ID, etc.)
◉Current (TaskID, WQM)
Questions from the audience
@ original paper presentation
◉Q: Wanted to know of any task that could not be handled using MapReduce?
A: join operations could not be performed with the current model
◉Q: Wondered how MapReduce differs from parallel databases?
A: MapReduce is stored across a large number of machines as compared to parallel databases,
the abstractions are fairly simple to use in MapReduce, and
MapReduce also benefits greatly from locality optimizations
Bibliography
◉ J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI'04,
Dec. 6, 2004
◉ J. Dean, S. Ghemawat, “System and method for efficient large-scale data processing”, Patent
US 7650331 B1, Jan. 19, 2010
◉ S. Ghemawat, J. Dean, J. Zhao, M. Austern, A. Spector, "Google Technology RoundTable: Map
Reduce", Aug. 21, 2008 – Youtube
◉ P. Mahadevan, "OSDI'04 Conference Reports", ;LOGIN: Vol. 30, No. 2, Apr. 2005, p. 61
◉ R. Jacotin, “Lecture: The Google MapReduce”, SlideShare, October 3, 2014
Any questions ?
You can find me at
Thanks!

Contenu connexe

Tendances

Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operationSubhas Kumar Ghosh
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clusteringSubhas Kumar Ghosh
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...areej qasrawi
 
Map reduce in Hadoop
Map reduce in HadoopMap reduce in Hadoop
Map reduce in Hadoopishan0019
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorSubhas Kumar Ghosh
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsLeila panahi
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationateeq ateeq
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduceHassan A-j
 

Tendances (20)

Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Map reduce
Map reduceMap reduce
Map reduce
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operation
 
Hadoop map reduce v2
Hadoop map reduce v2Hadoop map reduce v2
Hadoop map reduce v2
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Map reduce in Hadoop
Map reduce in HadoopMap reduce in Hadoop
Map reduce in Hadoop
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
E031201032036
E031201032036E031201032036
E031201032036
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
MapReduce
MapReduceMapReduce
MapReduce
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Hadoop
HadoopHadoop
Hadoop
 

En vedette

Infortask | O sistema mais simples e prático para organizar atividades!
Infortask | O sistema mais simples e prático para organizar atividades!Infortask | O sistema mais simples e prático para organizar atividades!
Infortask | O sistema mais simples e prático para organizar atividades!Infortask
 
Application form FOR LINK'D IN
Application form FOR LINK'D INApplication form FOR LINK'D IN
Application form FOR LINK'D INLisa Haines
 
10 reasons for not booking your flight on go to gate
10 reasons for not booking your flight on go to gate10 reasons for not booking your flight on go to gate
10 reasons for not booking your flight on go to gateMenna-Allah Ashraf
 
Subject Selection - Industrial Arts
Subject Selection - Industrial ArtsSubject Selection - Industrial Arts
Subject Selection - Industrial Artsfairvalehigh
 
Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...
Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...
Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...Global Business Events
 
Failures in fpd/ orthodontics courses in india
Failures in fpd/ orthodontics courses in indiaFailures in fpd/ orthodontics courses in india
Failures in fpd/ orthodontics courses in indiaIndian dental academy
 
500 important spoken tamil situations into spoken english sentences sample
500 important spoken tamil situations into spoken english sentences   sample500 important spoken tamil situations into spoken english sentences   sample
500 important spoken tamil situations into spoken english sentences sampleJayakumar K S
 
Bachelors of Commerce Degree Final Certificate
Bachelors of Commerce Degree Final CertificateBachelors of Commerce Degree Final Certificate
Bachelors of Commerce Degree Final CertificateShadrack Mwenda
 
Real-Time Supply Chain Analytics with Machine Learning, Kafka, and Spark
Real-Time Supply Chain Analytics with Machine Learning, Kafka, and SparkReal-Time Supply Chain Analytics with Machine Learning, Kafka, and Spark
Real-Time Supply Chain Analytics with Machine Learning, Kafka, and SparkSingleStore
 
Basic components of computer system
Basic components  of computer systemBasic components  of computer system
Basic components of computer systemTECHNOHABIT
 

En vedette (18)

Mapas mentales
Mapas mentalesMapas mentales
Mapas mentales
 
Modern History
Modern HistoryModern History
Modern History
 
Infortask | O sistema mais simples e prático para organizar atividades!
Infortask | O sistema mais simples e prático para organizar atividades!Infortask | O sistema mais simples e prático para organizar atividades!
Infortask | O sistema mais simples e prático para organizar atividades!
 
Cv Pradipta
Cv PradiptaCv Pradipta
Cv Pradipta
 
Application form FOR LINK'D IN
Application form FOR LINK'D INApplication form FOR LINK'D IN
Application form FOR LINK'D IN
 
Veneers/ fixed orthodontics courses
Veneers/ fixed orthodontics coursesVeneers/ fixed orthodontics courses
Veneers/ fixed orthodontics courses
 
10 reasons for not booking your flight on go to gate
10 reasons for not booking your flight on go to gate10 reasons for not booking your flight on go to gate
10 reasons for not booking your flight on go to gate
 
BWS - Profile
BWS - ProfileBWS - Profile
BWS - Profile
 
Subject Selection - Industrial Arts
Subject Selection - Industrial ArtsSubject Selection - Industrial Arts
Subject Selection - Industrial Arts
 
Seguridad informatica
Seguridad informaticaSeguridad informatica
Seguridad informatica
 
Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...
Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...
Jim Hemmington, Head of Procurement at BBC - Developing and Implementing a St...
 
Failures in fpd/ orthodontics courses in india
Failures in fpd/ orthodontics courses in indiaFailures in fpd/ orthodontics courses in india
Failures in fpd/ orthodontics courses in india
 
500 important spoken tamil situations into spoken english sentences sample
500 important spoken tamil situations into spoken english sentences   sample500 important spoken tamil situations into spoken english sentences   sample
500 important spoken tamil situations into spoken english sentences sample
 
JKUAT Degree Cert
JKUAT Degree CertJKUAT Degree Cert
JKUAT Degree Cert
 
Bachelors of Commerce Degree Final Certificate
Bachelors of Commerce Degree Final CertificateBachelors of Commerce Degree Final Certificate
Bachelors of Commerce Degree Final Certificate
 
Real-Time Supply Chain Analytics with Machine Learning, Kafka, and Spark
Real-Time Supply Chain Analytics with Machine Learning, Kafka, and SparkReal-Time Supply Chain Analytics with Machine Learning, Kafka, and Spark
Real-Time Supply Chain Analytics with Machine Learning, Kafka, and Spark
 
Learning strategies
Learning strategiesLearning strategies
Learning strategies
 
Basic components of computer system
Basic components  of computer systemBasic components  of computer system
Basic components of computer system
 

Similaire à MapReduce Data Processing Large Clusters

Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingEd Kohlwey
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
An Introduction to MapReduce
An Introduction to MapReduce An Introduction to MapReduce
An Introduction to MapReduce Sina Ebrahimi
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingcoolmirza143
 
Map reduce
Map reduceMap reduce
Map reducexydii
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 

Similaire à MapReduce Data Processing Large Clusters (20)

MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Unit3 MapReduce
Unit3 MapReduceUnit3 MapReduce
Unit3 MapReduce
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
Mapreduce Osdi04
Mapreduce Osdi04Mapreduce Osdi04
Mapreduce Osdi04
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
An Introduction to MapReduce
An Introduction to MapReduce An Introduction to MapReduce
An Introduction to MapReduce
 
Lecture 1 mapreduce
Lecture 1  mapreduceLecture 1  mapreduce
Lecture 1 mapreduce
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreading
 
Map reduce
Map reduceMap reduce
Map reduce
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 

Dernier

Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 

Dernier (20)

Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

MapReduce Data Processing Large Clusters

  • 1. MapReduce: Simplified Data Processing on Large Clusters Papers We Love Bucharest Chapter October 12, 2015
  • 2. I am Adrian Florea Architect Lead @ IBM Bucharest Software Lab Hello!
  • 3. What is? ◉MapReduce is a programming model and implementation for processing large data sets ◉Users specify a Map function that processes a key-value pair to generate a set of intermediate key-value pairs and a Reduce function that aggregates all intermediate values that share the same intermediate key in order to combine the derived data appropriately map:(k1, v1)->[(k2, v2)] reduce:(k2, [v2])->[(k3, v3)]
  • 4. Map ◉function (mathematics) ◉map (Java) ◉Select (.NET) Where else have we seen this? Reduce ◉fold (functional programming) ◉reduce (Java) ◉Aggregate (.NET)
  • 5. Other analogies "To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function that is computed over all the rows with the same group- by attribute" D.J. DeWitt & M. Stonebraker Divide-and-conquer algorithms “recursively breaking down a problem into two or more sub-problems of the same (or related) type (divide), until these become simple enough to be solved directly (conquer). The solutions to the sub-problems are then combined to give a solution to the original problem.”
  • 6. Jeffrey Dean Google Fellow December 6, 2004 4PM Sanjay Ghemawat Google Fellow
  • 7. History ◉April 1960: John McCarthy introduced the concept of “maplist” ◉September 4, 1998: Google founded ◉1998-2003: hundreds of special-purpose large data computation programs in Google ◉February 2003: 1st version of MapReduce ◉August 2003: MapReduce significant enhancements ◉June 18, 2004: Patent US7650331 B1 filed ◉December 6, 2004: 1st MapReduce public presentation ◉2005: Hadoop implementation started in Java (Douglass R. Cutting & Michael J. Cafarella) ◉September 4, 2007: Hadoop 0.14.1 ◉January 19, 2010: Patent US7650331 B1 published ◉July 6, 2015: Hadoop 2.7.1
  • 8. Distribution issues ◉Communication and routing which nodes should be involved? what transport protocol should be used? threads/events/connections management remote execution of your processing code? ◉Fault tolerance and fault detection ◉Load balancing / partitioning of data heterogeneity of nodes skew in data network topology ◉Parallelization strategy algorithmic issues of work splitting “without having to deal with failures, the rest of the support code just isn’t so complicate” S. Ghemawat
  • 9. MapReduce model Map Operation Reduce Operation Input Data Intermediate Data Output Data
  • 10. application-independent Map Module application-independent Reduce Module MapReduce system application-specific Map Operation application-specific Reduce OperationInput Data Intermediate Data Output Data
  • 13. MapReduce model in practice map(String key, String value): for each Word w in value: EmitIntermediate(w, "1"); reduce(String key, Iterator values) int result = 0; for each v in values: result += ParseInt(v); Emit(key, AsString(result)); map:(k1, v1)->[(k2, v2)] reduce:(k2, [v2])->[(k3, v3)]
  • 14. How long does it take to go through 1 TB? Sequentially: 3 hours MapReduce: startup overhead 70 seconds + computation 80 seconds Environment ◉ 1800 Linux dual-processorx86 machines, 2-4 GB memory ◉ Fast Ethernet/Giga Ethernet ◉ Inexpensive IDE disks and a distributed Google File System
  • 15. Take a walk through a Google data center
  • 16. Tianhe-2, #1 supercomputer: 3,120,000 cores, 1,5 PB total memory
  • 17. Execution diagram ◉Master process is a task itself initiated by the WQM and is responsible for assigning all other tasks to worker processes ◉Each worker invokes at least a map thread and a reduce thread ◉If a worker fails, its tasks are reassigned to another worker process ◉When WQM receives a job, it allocates the job to the master that calculates and requires M+R+1 processes to be allocated to the job ◉WQM responds with the process allocation info (can result less processes) to the master that will manage the performance of the job ◉Reduce tasks begin work when the master informs them that there are intermediate files ready ◉Input data (files/DB/memory) are splitted in data blocks (16-64 MB) automatically or configurable ◉The worker to which a map task has been assigned applies the map() operator to the respective input data block ◉When the worker completes the task, it informs the master of the status ◉Master informs workers where to find intermediate data and schedules their reading ◉Workers (3 & 4) sort the intermediate key-value pairs, then merge (by applying reduce()) them and write to output
  • 18. Workflow diagram ◉When a process completes a task it informs WQM which updates the status tables ◉When WQM discovers one process failed, it assign its tasks to a new process and updates the status tables Task Status Table ◉TaskID ◉Status (InProgress, Waiting, Completed, Failed) ◉ProcessID ◉InputFiles (Input, Intermediate) ◉OutputFiles Process Status Table ◉ProcessID ◉Status (Idle, Busy, Failed) ◉Location (CPU ID, etc.) ◉Current (TaskID, WQM)
  • 19. Questions from the audience @ original paper presentation ◉Q: Wanted to know of any task that could not be handled using MapReduce? A: join operations could not be performed with the current model ◉Q: Wondered how MapReduce differs from parallel databases? A: MapReduce is stored across a large number of machines as compared to parallel databases, the abstractions are fairly simple to use in MapReduce, and MapReduce also benefits greatly from locality optimizations
  • 20. Bibliography ◉ J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI'04, Dec. 6, 2004 ◉ J. Dean, S. Ghemawat, “System and method for efficient large-scale data processing”, Patent US 7650331 B1, Jan. 19, 2010 ◉ S. Ghemawat, J. Dean, J. Zhao, M. Austern, A. Spector, "Google Technology RoundTable: Map Reduce", Aug. 21, 2008 – Youtube ◉ P. Mahadevan, "OSDI'04 Conference Reports", ;LOGIN: Vol. 30, No. 2, Apr. 2005, p. 61 ◉ R. Jacotin, “Lecture: The Google MapReduce”, SlideShare, October 3, 2014
  • 21. Any questions ? You can find me at Thanks!