MapReduce
 Ahmed Elmorsy
What is MapReduce?
● MapReduce is a programming model for
  processing and generating large data sets.

● Inspired by the map and reduce primitives
  present in Lisp and many other functional
  languages.

● Use of a functional model with user-specified
  map and reduce operations allows us to
  parallelize large computations easily.
Map function
Map, written by the user, takes an input pair
and produces a set of intermediate key/value
pairs. The MapReduce library groups together
all intermediate values associated with the
same intermediate key I and passes them to the
Reduce function.


   map (k1,v1) → list(k2,v2)
Reduce function
The Reduce function, also written by the user,
accepts an intermediate key I and a set of
values for that key. It merges together these
values to form a possibly smaller set of values.
Typically just zero or one output value is
produced per Reduce invocation.


reduce (k2,list(v2)) → list(v2)
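The two signatures above can be illustrated with a small sequential driver. This is only an in-memory sketch, not the distributed library: the class name `MiniMapReduce`, the use of `Map.Entry` as the pair type, and the even/odd sample job are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Sequential, in-memory sketch of the MapReduce model (illustrative names only).
final class MiniMapReduce {

    // mapFn:    (k1, v1)       -> list(k2, v2)
    // reduceFn: (k2, list(v2)) -> list(v2)
    static <K1, V1, K2 extends Comparable<K2>, V2> Map<K2, List<V2>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapFn,
            BiFunction<K2, List<V2>, List<V2>> reduceFn) {

        // Map phase: call mapFn on every input pair and group the
        // intermediate values by their intermediate key.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : mapFn.apply(in.getKey(), in.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }

        // Reduce phase: one reduceFn call per distinct intermediate key.
        Map<K2, List<V2>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduceFn.apply(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("c", 3), Map.entry("d", 4));

        // Sample job: sum the even and the odd numbers separately.
        Map<String, List<Integer>> sums = MiniMapReduce.<String, Integer, String, Integer>run(
                input,
                (name, n) -> List.of(Map.entry(n % 2 == 0 ? "even" : "odd", n)),
                (parity, values) -> List.of(values.stream().mapToInt(Integer::intValue).sum()));

        System.out.println(sums); // prints {even=[6], odd=[4]}
    }
}
```

A real implementation runs the map and reduce calls on many machines; here both phases run in one loop so that the grouping step between them is easy to see.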
Example (Word Count)
Problem: counting the number of occurrences of each word in a
large collection of documents.
Example (Word Count)
Map function:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
Example (Word Count)
Reduce function:

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
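For readers who want to run the word count, the pseudocode can be translated into plain Java roughly as follows. This is a sketch under assumptions: `WordCountSketch` and `countWords` are hypothetical names, words are split on whitespace, and `countWords` stands in for the grouping that the MapReduce library would perform.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process translation of the word count pseudocode (illustrative only).
final class WordCountSketch {

    // map: (document name, document contents) -> list of (word, "1")
    static List<Map.Entry<String, String>> map(String key, String value) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String w : value.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(Map.entry(w, "1"));
            }
        }
        return out;
    }

    // reduce: (word, list of counts) -> total count as a string
    static String reduce(String key, Iterator<String> values) {
        int result = 0;
        while (values.hasNext()) {
            result += Integer.parseInt(values.next());
        }
        return Integer.toString(result);
    }

    // Stand-in for the library: maps every document, groups by word, then reduces.
    static Map<String, String> countWords(Map<String, String> documents) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (Map.Entry<String, String> kv : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, String> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue().iterator()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
                "doc1", "the quick brown fox",
                "doc2", "the lazy dog and the fox");
        // prints {and=1, brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(countWords(docs));
    }
}
```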
Execution Overview
● The Map invocations are distributed across
  multiple machines by automatically
  partitioning the input data into a set of M splits.
● Reduce invocations are distributed by
  partitioning the intermediate key space into R
  pieces using a partitioning function.
● The number of partitions (R) and the
  partitioning function are specified by the user.
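A common choice for the partitioning function is a hash of the key modulo R. The sketch below assumes string keys and Java's `hashCode()`; the names `PartitionSketch`, `partition`, and `split` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hash partitioning of intermediate keys into R reduce partitions (illustrative only).
final class PartitionSketch {

    // Maps an intermediate key to one of R reduce partitions, numbered 0 .. R-1.
    static int partition(String key, int numReduceTasks) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numReduceTasks);
    }

    // Splits a list of intermediate keys into R buckets, one per reduce task.
    static List<List<String>> split(List<String> keys, int numReduceTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String k : keys) {
            buckets.get(partition(k, numReduceTasks)).add(k);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Both occurrences of "apple" always land in the same bucket.
        System.out.println(split(List.of("apple", "banana", "cherry", "apple"), 3));
    }
}
```

Because the partition depends only on the key, every value for a given key reaches the same reduce task, which is what lets each reduce call see the complete list of values for its key.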
How does the master work?
● The master picks idle workers and assigns
  each one a map task or a reduce task.

● For each map task and reduce task, it stores
  the state (idle, in-progress, or
  completed), and the identity of the worker
  machine (for non-idle tasks).
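The bookkeeping described above could be modelled as follows. This is a minimal sketch with hypothetical names (`MasterStateSketch`, `assign`, `markCompleted`); it ignores scheduling concerns such as data locality and simply hands out the first idle task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Sketch of the master's per-task bookkeeping (all names are illustrative).
final class MasterStateSketch {

    enum TaskKind { MAP, REDUCE }
    enum TaskState { IDLE, IN_PROGRESS, COMPLETED }

    static final class Task {
        final TaskKind kind;
        final int id;
        TaskState state = TaskState.IDLE;
        String worker; // identity of the worker machine; null while the task is idle

        Task(TaskKind kind, int id) {
            this.kind = kind;
            this.id = id;
        }
    }

    private final List<Task> tasks = new ArrayList<>();

    // Creates M map tasks and R reduce tasks, all initially idle.
    MasterStateSketch(int m, int r) {
        for (int i = 0; i < m; i++) tasks.add(new Task(TaskKind.MAP, i));
        for (int i = 0; i < r; i++) tasks.add(new Task(TaskKind.REDUCE, i));
    }

    // Hands the next idle task to an idle worker, or returns empty if none is left.
    Optional<Task> assign(String worker) {
        for (Task t : tasks) {
            if (t.state == TaskState.IDLE) {
                t.state = TaskState.IN_PROGRESS;
                t.worker = worker;
                return Optional.of(t);
            }
        }
        return Optional.empty();
    }

    // Records that a task has finished.
    void markCompleted(Task t) {
        t.state = TaskState.COMPLETED;
    }

    public static void main(String[] args) {
        MasterStateSketch master = new MasterStateSketch(2, 1);
        Task t = master.assign("worker-1").get();
        System.out.println(t.kind + " " + t.id + " " + t.state + " on " + t.worker);
        master.markCompleted(t);
        System.out.println(t.kind + " " + t.id + " " + t.state);
    }
}
```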
Fault Tolerance

● Worker Failure

● Master Failure
Worker Failure
● The master pings every worker periodically.

● If no response is received from a worker,
  the master marks the worker as failed.

● Any map task or reduce task in progress on a
  failed worker is reset to idle and
  becomes eligible for rescheduling.
Worker Failure
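● Any map tasks completed by the worker are
  reset back to their initial idle state, and
  therefore become eligible for scheduling on
  other workers, because their output is stored
  on the local disk of the failed machine and
  is no longer accessible.

● Completed reduce tasks do not need to be
  re-executed, because their output is stored in
  a global file system.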
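The rescheduling rules for a failed worker can be sketched as a single pass over the master's task list. The names here (`WorkerFailureSketch`, `handleFailure`) are hypothetical, and the ping mechanism itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how the master might react to a failed worker (illustrative only).
final class WorkerFailureSketch {

    enum Kind { MAP, REDUCE }
    enum State { IDLE, IN_PROGRESS, COMPLETED }

    static final class Task {
        final Kind kind;
        State state;
        String worker; // null while the task is idle

        Task(Kind kind, State state, String worker) {
            this.kind = kind;
            this.state = state;
            this.worker = worker;
        }
    }

    // Resets the tasks that must be rerun and returns how many were reset.
    static int handleFailure(List<Task> tasks, String failedWorker) {
        int reset = 0;
        for (Task t : tasks) {
            if (!failedWorker.equals(t.worker)) {
                continue; // task belongs to another worker or is already idle
            }
            boolean inProgress = t.state == State.IN_PROGRESS;
            // Completed map output lived on the failed machine, so it must be redone.
            boolean completedMap = t.kind == Kind.MAP && t.state == State.COMPLETED;
            if (inProgress || completedMap) {
                t.state = State.IDLE;
                t.worker = null;
                reset++;
            }
            // Completed reduce tasks are left untouched.
        }
        return reset;
    }

    public static void main(String[] args) {
        List<Task> tasks = new ArrayList<>();
        tasks.add(new Task(Kind.MAP, State.COMPLETED, "w1"));
        tasks.add(new Task(Kind.REDUCE, State.COMPLETED, "w1"));
        tasks.add(new Task(Kind.REDUCE, State.IN_PROGRESS, "w1"));
        tasks.add(new Task(Kind.MAP, State.IN_PROGRESS, "w2"));
        System.out.println(handleFailure(tasks, "w1")); // prints 2
    }
}
```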
Master Failure
There are two options:
1. Make the master write periodic checkpoints
   of the master data structures described
   above. If the master task dies, a new copy
   can be started from the last checkpointed
   state.

2. Abort the MapReduce computation if the
   master fails.
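The checkpoint option could look roughly like the sketch below, which writes the task records to a file and reads them back. The record format ("task,state,worker"), the file name, and the class name `CheckpointSketch` are assumptions made for illustration, not part of any real implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Sketch of checkpointing and restoring the master's task table (illustrative only).
final class CheckpointSketch {

    // Writes one line per task, e.g. "map-0,COMPLETED,worker-1" (a made-up record format).
    static void checkpoint(Path file, List<String> taskRecords) {
        try {
            Files.write(file, taskRecords);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the records back so that a new master can resume from the last checkpoint.
    static List<String> restore(Path file) {
        try {
            return Files.readAllLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = Paths.get(System.getProperty("java.io.tmpdir"), "master-checkpoint.txt");
        checkpoint(file, Arrays.asList("map-0,COMPLETED,worker-1", "reduce-0,IDLE,"));
        System.out.println(restore(file));
    }
}
```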
Backup Tasks
● A common cause that lengthens the total time
  taken for a MapReduce operation is a
  “straggler”: a machine that takes an unusually
  long time to complete one of the last few tasks.
● When a MapReduce operation is close to
  completion, the master schedules backup
  executions of the remaining in-progress
  tasks.
● The task is marked as completed whenever
  either the primary or the backup execution
  completes.
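The "first completion wins" rule can be sketched with an atomic compare-and-set, so that only one of the two executions is recorded as the finisher. `BackupTaskSketch` and its methods are hypothetical names, and the executions are identified by plain strings for simplicity.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a task that accepts the first reported completion only (illustrative).
final class BackupTaskSketch {

    // Name of whichever execution finished first; null while the task is still running.
    private final AtomicReference<String> completedBy = new AtomicReference<>();

    // Called by the primary or the backup execution when it finishes.
    // Returns true only for the first caller; later completions are ignored.
    boolean reportCompletion(String execution) {
        return completedBy.compareAndSet(null, execution);
    }

    boolean isCompleted() {
        return completedBy.get() != null;
    }

    String winner() {
        return completedBy.get();
    }

    public static void main(String[] args) {
        BackupTaskSketch task = new BackupTaskSketch();
        System.out.println(task.reportCompletion("backup"));  // prints true
        System.out.println(task.reportCompletion("primary")); // prints false
        System.out.println(task.winner());                    // prints backup
    }
}
```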
Refinements
1.   Partitioning Function
2.   Combiner Function
3.   Input and Output Types
4.   Skipping Bad Records
5.   Status Information
6.   Counters
More Examples
● Distributed Grep
● Count of URL Access Frequency
● Reverse Web-Link Graph
● Inverted Index
● Distributed Sort
Apache Hadoop
Open Source Implementation of MapReduce
Hadoop Modules
● Hadoop Common

● Hadoop Distributed File System (HDFS™)

● Hadoop YARN

● Hadoop MapReduce
Projects related to Hadoop
● Apache Hive
Developed by Facebook and used by Netflix.

● Apache Pig
Developed at Yahoo! and used by Twitter.

● Apache Cassandra
Developed by Facebook.
Template Hadoop Program
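import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Map function
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Reduce function
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MyJob.class);

        // Input and output paths come from the command line.
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MyJob");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);

        // Each input line is split into a key and a value at the first comma.
        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.set("key.value.separator.in.input.line", ",");

        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }
}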



To run it, first build the JAR file, then use the command:

bin/hadoop jar playground/MyJob.jar MyJob input/cite75_99.txt output
Readings

Chapter 4 in Lam, Chuck. Hadoop in Action.
Manning Publications Co., 2010.
References
[1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.

[2] Chuck Lam. Hadoop in Action. Manning Publications Co., 2010.

[3] http://hadoop.apache.org/
