www.edureka.co/big-data-and-hadoop
Hadoop: The Ultimate Data Storage and Processing, Together
Slide 2 www.edureka.co/big-data-and-hadoop
Objectives
At the end of this module, you will be able to:
» Analyze different use cases where MapReduce is used
» Differentiate between the traditional way and the MapReduce way
» Learn about the Hadoop 2.x MapReduce architecture and components
» Understand the execution flow of a YARN MapReduce application
» Implement basic MapReduce concepts
» Run a MapReduce program
Slide 3 www.edureka.co/big-data-and-hadoop
Where Is MapReduce Used?
• Weather Forecasting
Problem Statement:
» Finding the maximum temperature recorded in a year.
• HealthCare
Problem Statement:
» De-identify personal health information.
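The weather problem statement maps naturally onto a map step and a reduce step. The following is a minimal single-process sketch in Python, not Hadoop code: the `RECORDS` sample and its "year,temperature" layout are hypothetical and only there to make the example runnable.

```python
from collections import defaultdict

# Hypothetical sample records in "year,temperature" form (not from the slides).
RECORDS = ["1949,111", "1950,0", "1950,22", "1949,78", "1950,-11"]

def map_record(line):
    """Map: parse one record and emit a (year, temperature) pair."""
    year, temp = line.split(",")
    yield year, int(temp)

def reduce_max(year, temps):
    """Reduce: keep the highest temperature seen for one year."""
    return year, max(temps)

def max_temperature(records):
    grouped = defaultdict(list)          # stands in for the shuffle: group by key
    for line in records:
        for year, temp in map_record(line):
            grouped[year].append(temp)
    return dict(reduce_max(year, temps) for year, temps in grouped.items())

print(max_temperature(RECORDS))
```

In a real job the grouping would be done by the framework between the map and reduce tasks; here a dictionary plays that role.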
Slide 4 www.edureka.co/big-data-and-hadoop
Where Is MapReduce Used?
[Diagram: MapReduce mind map]
» Features: Large Scale Distributed Model, Parallel Programming
» A Program Model: Function (Map, Reduce), Design Pattern
» Used in: Classification, Analytics, Recommendation, Index and Search
» Examples: Classification (Eg: Top N records), Analytics (Eg: Join, Selection), Recommendation (Eg: Sort), Summarization (Eg: Inverted Index)
» Implemented: Google, Apache Hadoop
» For: HDFS, Pig, Hive, HBase
Slide 5 www.edureka.co/big-data-and-hadoop
The Traditional Way
[Diagram: Very Big Data is cut into pieces of Split Data; a separate grep runs on each piece and produces matches; cat concatenates them into All matches.]
Slide 6 www.edureka.co/big-data-and-hadoop
MapReduce Way
[Diagram: Very Big Data is cut into pieces of Split Data; the MapReduce Framework runs MAP on each piece and REDUCE on the results to produce All matches.]
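The grep picture can be imitated on one machine. This is only a sketch under stated assumptions: threads stand in for cluster nodes, and the function names `grep_split` and `parallel_grep` are made up for illustration. It shows the division of labour the two slides contrast: the per-split search is the map step, and joining the partial results is the reduce step.

```python
from concurrent.futures import ThreadPoolExecutor

def grep_split(args):
    """Map step: return the lines of one split that contain the pattern."""
    pattern, split = args
    return [line for line in split if pattern in line]

def parallel_grep(pattern, lines, num_splits=4):
    # Cut the input into roughly equal splits of at least one line each.
    size = max(1, -(-len(lines) // num_splits))   # ceiling division
    splits = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor() as pool:
        # pool.map keeps the splits in order, so matches stay in input order.
        partial = list(pool.map(grep_split, [(pattern, s) for s in splits]))
    # Reduce step: concatenate the partial matches, like `cat` in the diagram.
    return [line for matches in partial for line in matches]

LOG = ["error a", "ok", "error b", "fine", "error c"]
print(parallel_grep("error", LOG, num_splits=2))
```

The caller only supplies the pattern and the data; splitting, scheduling, and collecting are hidden inside `parallel_grep`, which is the role the framework plays on the slide.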
Slide 7 www.edureka.co/big-data-and-hadoop
MapReduce Paradigm
The Overall MapReduce Word Count Process
Input:
  Deer Bear River
  Car Car River
  Deer Car Bear
Splitting (K1,V1):
  Split 1: Deer Bear River
  Split 2: Car Car River
  Split 3: Deer Car Bear
Mapping, List(K2,V2):
  Split 1: Deer, 1 | Bear, 1 | River, 1
  Split 2: Car, 1 | Car, 1 | River, 1
  Split 3: Deer, 1 | Car, 1 | Bear, 1
Shuffling, K2,List(V2):
  Bear, (1,1)
  Car, (1,1,1)
  Deer, (1,1)
  River, (1,1)
Reducing, List(K3,V3):
  Bear, 2
  Car, 3
  Deer, 2
  River, 2
Final Result:
  Bear, 2
  Car, 3
  Deer, 2
  River, 2
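The word count stages can be traced in a few lines of plain Python. This is a single-process sketch, not the Hadoop API; the function names are chosen for illustration, and the input is the three lines from the slide.

```python
from collections import defaultdict

def map_words(line):
    """Mapping: emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffling: group the values by key, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_counts(word, ones):
    """Reducing: sum the list of ones for one word."""
    return word, sum(ones)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, ones) for word, ones in shuffle(pairs).items())

LINES = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
print(word_count(LINES))
```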
Slide 8 www.edureka.co/big-data-and-hadoop
Anatomy of a MapReduce Program
MapReduce (K = Key, V = Value)
Map: (K1, V1) → List(K2, V2)
Reduce: (K2, List(V2)) → List(K3, V3)
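The two signatures can be written down as type hints. The driver below is a hypothetical helper (`run_map_reduce` is not a Hadoop function) that assumes the intermediate keys can be sorted. It is exercised with an inverted index, one of the use cases named earlier in the deck, over two made-up documents.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")

def run_map_reduce(
    records: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    grouped: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:                    # Map: (K1, V1) -> List(K2, V2)
        for k2, v2 in mapper(k1, v1):
            grouped[k2].append(v2)
    output: List[Tuple[K3, V3]] = []
    for k2 in sorted(grouped):                # Reduce: (K2, List(V2)) -> List(K3, V3)
        output.extend(reducer(k2, grouped[k2]))
    return output

def index_mapper(doc_id, text):
    return [(word.lower(), doc_id) for word in text.split()]

def index_reducer(word, doc_ids):
    return [(word, sorted(set(doc_ids)))]

# Hypothetical documents, used only to exercise the driver.
DOCS = [("d1", "Hadoop stores data"), ("d2", "MapReduce processes data")]
print(run_map_reduce(DOCS, index_mapper, index_reducer))
```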
Slide 9 www.edureka.co/big-data-and-hadoop
Why MapReduce?
Two biggest Advantages:
» Taking processing to the data
» Processing data in parallel
[Diagram: Map Tasks and HDFS Blocks placed on Nodes within the Racks of a Data Center; three placements are labelled a, b, and c.]
Slide 10 www.edureka.co/big-data-and-hadoop
Hadoop 2.x MapReduce Components
• ApplicationMaster
» One per application
» Short life
» Coordinates and manages MapReduce jobs
» Negotiates with Resource Manager to schedule tasks
» The tasks are started by NodeManager(s)
• Job History Server
» Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates
• Client
» Submits a MapReduce job
• Resource Manager
» Cluster-level resource manager
» Long life, high-quality hardware
• Node Manager
» One per Data Node
» Monitors resources on Data Node
• Container
» Created by NM when requested
» Allocates a certain amount of resources (memory, CPU, etc.) on a slave node
Slide 11 www.edureka.co/big-data-and-hadoop
YARN – Moving beyond MapReduce
» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search) (Weave..)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
Slide 12 www.edureka.co/big-data-and-hadoop
MapReduce Application Execution
Executing MapReduce Application on YARN
Slide 13 www.edureka.co/big-data-and-hadoop
YARN MR Application Execution Flow
MapReduce Job Execution
» Job Submission
» Job Initialization
» Tasks Assignment
» Memory Assignment
» Status Updates
» Failure Recovery
Slide 14 www.edureka.co/big-data-and-hadoop
YARN MR Application Execution Flow
[Diagram: the Client (Client JVM, Application Job Object, Run Job), the Resource Manager on the Management Node, HDFS, and Node Managers on the Data Nodes, which host the App Master and the Map containers next to their blocks on Data node-1 and Data node-2.]
1. Notify Start Application
2. Get New Application ID
3. Prepare the Application submit context
   3.1 App Jar
   3.2 Job Resources (Block locations)
   3.3 User Information
4. Submit Application Context
5. Start AppMaster container / Allocate Context for AppMaster
6. Allocate Container for AppMaster
7. Request Resources
8. Notify with resources Availability
9. Start Container in the worker node
10. NM allocates Container
Slide 17 www.edureka.co/big-data-and-hadoop
YARN MR Application Execution Flow
11. The tasks get executed.
12. If the job has any reducers, the AppMaster again requests a Node Manager to allocate and start a container for them.
13. The output of all the maps is given to the reducers, and the reducers are executed.
14. Once the job is finished, the Application Master notifies the Resource Manager and the client library.
15. The Application Master shuts down.
Slide 18 www.edureka.co/big-data-and-hadoop
Hadoop 2.x : YARN Workflow
[Diagram: the Resource Manager (Scheduler and Applications Manager (AsM)) oversees twelve Node Managers; App Master 1 with Containers 1.1 and 1.2, and App Master 2 with Containers 2.1, 2.2, and 2.3, are spread across the Node Managers.]
Slide 19 www.edureka.co/big-data-and-hadoop
Summary: Application Workflow
Execution Sequence:
1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM requests containers from RM
5. AM notifies NM to launch containers
6. Application code is executed in container
7. Client contacts RM/AM to monitor application’s status
8. AM unregisters with RM
[Diagram: numbered arrows 1 to 8 between Client, RM, NM, and AM.]
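The eight-step sequence can be replayed as an ordered event trace. The class below is a toy model written for illustration only: `ToyYarn` and its methods are hypothetical and have nothing to do with the real YARN client or protocol classes, and the "containers" are ordinary Python callables run in a loop.

```python
class ToyYarn:
    """Toy model of the Client/RM/NM/AM roles. Not the YARN API."""

    def __init__(self):
        self.events = []             # ordered trace of what happened
        self.registered_ams = set()  # applications whose AM is registered

    def run(self, app_id, tasks):
        log = self.events.append
        log(f"1 Client->RM: submit {app_id}")
        log(f"2 RM->NM: allocate container to start AM for {app_id}")
        self.registered_ams.add(app_id)
        log(f"3 AM->RM: register {app_id}")
        log(f"4 AM->RM: request {len(tasks)} containers")
        results = []
        for n, task in enumerate(tasks, start=1):
            log(f"5 AM->NM: launch container {n}")
            results.append(task())   # step 6: application code runs here
            log(f"6 container {n}: application code executed")
        log(f"7 Client->RM/AM: status of {app_id} is FINISHED")
        self.registered_ams.discard(app_id)
        log(f"8 AM->RM: unregister {app_id}")
        return results

yarn = ToyYarn()
print(yarn.run("app_0001", [lambda: 2 + 2, lambda: "done"]))
for event in yarn.events:
    print(event)
```

On a real cluster the status polling of step 7 overlaps with steps 5 and 6; the toy logs it once at the end to keep the trace short.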
Slide 27 www.edureka.co/big-data-and-hadoop
Input Splits
INPUT DATA
» Physical Division: HDFS Blocks
» Logical Division: Input Splits
Slide 28 www.edureka.co/big-data-and-hadoop
Relation Between Input Splits and HDFS Blocks
[Diagram: a File of Lines 1 to 11 laid out across Block Boundaries and grouped into three Splits.]
• Logical records do not fit neatly into the HDFS blocks.
• Logical records are lines that cross the boundary of the blocks.
• The first split contains line 5 although it spans across blocks.
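One common way for a line-oriented reader to cope with lines that straddle a boundary is sketched below. It is modelled loosely on that convention rather than copied from Hadoop's reader, and `read_split` and `all_records` are hypothetical names. Every split except the first discards its first (possibly partial) line, and every split keeps reading until it finishes the last line that begins at or before its end offset. Keys are byte offsets, as with Text Input Format.

```python
def read_split(data: bytes, start: int, end: int):
    """Yield (byte_offset, line) records owned by the split [start, end)."""
    pos = start
    if start != 0:
        # Skip the first, possibly partial, line: the previous split reads it.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    # A line belongs to this split if it begins at or before `end`,
    # even when its last bytes lie beyond the boundary.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        yield pos, data[pos:stop].decode()
        pos = stop + 1

def all_records(data: bytes, split_size: int):
    """Read every split in turn and return one list of records per split."""
    return [
        list(read_split(data, start, min(start + split_size, len(data))))
        for start in range(0, len(data), split_size)
    ]

DATA = b"aaaa\nbbbbbb\ncc\ndddd\n"   # hypothetical 20-byte file
print(all_records(DATA, 8))
```

With a split size of 8 the second line crosses the first boundary, yet it is returned whole by the first split and skipped by the second, so each line is read exactly once.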
Slide 29 www.edureka.co/big-data-and-hadoop
MapReduce Job Submission Flow
Input data is distributed to nodes
Each map task works on a “split” of data
Mapper outputs intermediate data
Data exchange between nodes in a “shuffle” process
Intermediate data of the same key goes to the same reducer
Reducer output is stored
[Diagram: INPUT DATA flows to Map tasks on Node 1 and Node 2, then through the shuffle to Reduce tasks on Node 1 and Node 2.]
Slide 35 www.edureka.co/big-data-and-hadoop
Combiner
Block1 → Mapper → Combiner
  Input: B C D E D B
  Mapper output: (B,1) (C,1) (D,1) (E,1) (D,1) (B,1)
  Combiner output: (B,2) (C,1) (D,2) (E,1)
Block2 → Mapper → Combiner
  Input: D A A C B D
  Mapper output: (D,1) (A,1) (A,1) (C,1) (B,1) (D,1)
  Combiner output: (D,2) (A,2) (C,1) (B,1)
Shuffle
  (A, [2]) (B, [2,1]) (C, [1,1]) (D, [2,2]) (E, [1])
Reducer
  (A,2) (B,3) (C,2) (D,4) (E,1)
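The combiner example can be reproduced with the letters from Block1 and Block2 on this slide. The sketch below is plain Python, not Hadoop code, and the function names are illustrative. The point it shows is that the combiner applies the reducer's summing logic to one mapper's output before the shuffle, so fewer pairs have to travel.

```python
from collections import Counter, defaultdict

BLOCK1 = ["B", "C", "D", "E", "D", "B"]
BLOCK2 = ["D", "A", "A", "C", "B", "D"]

def mapper(block):
    return [(key, 1) for key in block]

def combiner(pairs):
    """Pre-aggregate one mapper's output locally, before the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def shuffle(*mapper_outputs):
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reducer(grouped):
    return {key: sum(values) for key, values in grouped.items()}

combined1 = combiner(mapper(BLOCK1))
combined2 = combiner(mapper(BLOCK2))
shuffled = shuffle(combined1, combined2)
print(shuffled)
print(reducer(shuffled))
```

Without the combiners the shuffle would carry twelve pairs; with them it carries eight, and the reducer output is unchanged because addition can be applied in stages.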
Slide 36 www.edureka.co/big-data-and-hadoop
Partitioner – Redirecting Output from Mapper
[Diagram: three Map tasks, each followed by its own Partitioner, which routes the map output to one of three Reducers.]
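A partitioner decides which reducer receives each intermediate key. Hadoop's default partitioner takes the key's hash code modulo the number of reduce tasks; the sketch below imitates that idea in Python. Using `zlib.crc32` as the hash is an assumption made here because Python's built-in string hash changes between runs, and `partition` and `route` are hypothetical names.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index; the same key always maps to the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def route(pairs, num_reducers):
    """Distribute one mapper's (key, value) pairs into per-reducer buckets."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

PAIRS = [("Car", 1), ("Deer", 1), ("Car", 1), ("Bear", 1), ("River", 1)]
for index, bucket in enumerate(route(PAIRS, 3)):
    print(index, bucket)
```

Because the index depends only on the key, every mapper sends a given key to the same reducer, which is what lets that reducer see all the values for it.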
Slide 37 www.edureka.co/big-data-and-hadoop
Getting Data to the Mapper
Input File Input File
Input split Input split Input split Input split
RecordReader RecordReader RecordReader RecordReader
Mapper Mapper Mapper Mapper
(intermediates) (intermediates) (intermediates) (intermediates)
Slide 38 www.edureka.co/big-data-and-hadoop
Partition and Shuffle
Mapper Mapper Mapper Mapper
(intermediates) (intermediates) (intermediates) (intermediates)
Partitioner Partitioner Partitioner Partitioner
(intermediates) (intermediates) (intermediates)
Reducer Reducer Reducer
Slide 39 www.edureka.co/big-data-and-hadoop
Demo of Word Count Program
To illustrate Default Input Format
(Text Input Format)
Demo
Slide 40 www.edureka.co/big-data-and-hadoop
Input Format
[Diagram: the InputFormat cuts each Input file into Input Splits; each Input Split is read by a Record Reader, which feeds a Mapper that emits (Intermediates).]
Slide 41 www.edureka.co/big-data-and-hadoop
Input Format – Class Hierarchy
org.apache.hadoop.mapreduce
InputFormat<K,V>
  FileInputFormat<K,V>
    CombineFileInputFormat<K,V>
    TextInputFormat
    KeyValueTextInputFormat
    NLineInputFormat
    SequenceFileInputFormat<K,V>
      SequenceFileAsBinaryInputFormat
      SequenceFileAsTextInputFormat
      SequenceFileInputFilter<K,V>
  DBInputFormat<T>
  <<interface>> ComposableInputFormat<K,V>
    CompositeInputFormat<K,V>
Slide 42 www.edureka.co/big-data-and-hadoop
Output Format
[Diagram: each Reducer writes through a RecordWriter, supplied by the OutputFormat, to its own Output file.]
Slide 43 www.edureka.co/big-data-and-hadoop
Output Format – Class Hierarchy
org.apache.hadoop.mapreduce
OutputFormat<K,V>
  FileOutputFormat<K,V>
    TextOutputFormat<K,V>
    SequenceFileOutputFormat<K,V>
      SequenceFileAsBinaryOutputFormat
  DBOutputFormat<K,V>
  NullOutputFormat<K,V>
  FilterOutputFormat<K,V>
    LazyOutputFormat<K,V>
Slide 44 www.edureka.co/big-data-and-hadoop
Demo
Demo: Custom Input Format
XML Parsing with Map Reduce
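The demo itself is not captured in this transcript. As a rough idea of what a custom input format for XML has to do, the sketch below cuts a text stream into records delimited by a start tag and an end tag, then parses each record in a map function. Everything here is an assumption for illustration: the function names, the `property` tag, and the sample document are hypothetical, and the scanner ignores start tags with attributes and nested tags of the same name.

```python
import xml.etree.ElementTree as ET

def xml_records(text, tag):
    """Record-reader step: yield each complete <tag>...</tag> block in text."""
    start_tag, end_tag = f"<{tag}>", f"</{tag}>"
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            return
        end = text.find(end_tag, start)
        if end == -1:
            return                     # incomplete trailing record: drop it
        end += len(end_tag)
        yield text[start:end]
        pos = end

def map_record(record):
    """Map step: parse one XML record into a (name, value) pair."""
    element = ET.fromstring(record)
    return element.findtext("name"), element.findtext("value")

# Hypothetical sample document, used only to exercise the sketch.
SAMPLE = (
    "<configuration>"
    "<property><name>dfs.replication</name><value>3</value></property>"
    "<property><name>dfs.blocksize</name><value>134217728</value></property>"
    "</configuration>"
)
print([map_record(record) for record in xml_records(SAMPLE, "property")])
```

Treating one element as one record keeps each map call independent, which is the property a line-oriented input format cannot give for multi-line XML.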

DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

XML Parsing with Map Reduce

  • 1. www.edureka.co/big-data-and-hadoop Hadoop: the ultimate data storage and processing, together
  • 2. Objectives. At the end of this module, you will be able to: analyze different use cases where MapReduce is used; differentiate between the traditional way and the MapReduce way; learn about the Hadoop 2.x MapReduce architecture and components; understand the execution flow of a YARN MapReduce application; implement basic MapReduce concepts; run a MapReduce program.
  • 3. Where Is MapReduce Used? Weather forecasting. Problem statement: find the maximum temperature recorded in a year. Healthcare. Problem statement: de-identify personal health information.
  • 4. Where Is MapReduce Used? (Diagram.) MapReduce. Features: large-scale distributed model; parallel programming; a programming model. Used in: classification, analytics, recommendation, index and search. Functions: Map, Reduce. Design patterns: classification (e.g. top N records); analytics (e.g. join, selection); recommendation (e.g. sort); summarization (e.g. inverted index). Implemented by: Google, Apache Hadoop. For: HDFS, Pig, Hive, HBase.
  • 5. The Traditional Way. (Diagram.) Very big data is divided into pieces of split data; grep runs on each piece and produces matches; cat concatenates the per-piece matches into all matches.
  • 6. MapReduce Way. (Diagram.) Very big data is divided into pieces of split data; the MapReduce framework runs MAP and REDUCE over the pieces to produce all matches.
  • 7. MapReduce Paradigm: The Overall MapReduce Word Count Process. Stages: input, splitting, mapping, shuffling, reducing, final result. Input: "Deer Bear River / Car Car River / Deer Car Bear". Splitting (K1,V1): "Deer Bear River"; "Car Car River"; "Deer Car Bear". Mapping, List(K2,V2): (Deer,1) (Bear,1) (River,1); (Car,1) (Car,1) (River,1); (Deer,1) (Car,1) (Bear,1). Shuffling, K2,List(V2): Bear,(1,1); Car,(1,1,1); Deer,(1,1); River,(1,1). Reducing, List(K3,V3): Bear,2; Car,3; Deer,2; River,2. Final result: Bear,2; Car,3; Deer,2; River,2.
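The word count stages on this slide can be imitated in a few lines of plain Python. This is only an in-memory sketch of the idea, not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` are made up for illustration, and everything runs in a single process.

```python
from collections import defaultdict

# The three input lines used on the slide.
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group the values of identical keys, giving K2 -> List(V2)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(word_counts)
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; the sketch keeps only the data flow.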
  • 8. Anatomy of a MapReduce Program. Map: (K1, V1) → List(K2, V2). Reduce: (K2, List(V2)) → List(K3, V3). K denotes a key and V a value.
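The two signatures above can be written down as a tiny generic driver. The sketch below is an assumption-laden illustration in plain Python: `run_mapreduce`, `max_temp_mapper` and `max_temp_reducer` are hypothetical names, and the `readings` records are invented sample data in an assumed "year,temperature" format, chosen to echo the maximum-temperature problem statement from slide 3.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable

Pair = tuple[Any, Any]

def run_mapreduce(records: Iterable[Pair],
                  mapper: Callable[[Any, Any], Iterable[Pair]],
                  reducer: Callable[[Any, list], Iterable[Pair]]) -> list[Pair]:
    """Apply mapper and reducer with the (K1,V1) -> (K2,V2) -> (K3,V3) shape."""
    intermediate = defaultdict(list)
    for k1, v1 in records:
        # Map: (K1, V1) -> List(K2, V2)
        for k2, v2 in mapper(k1, v1):
            intermediate[k2].append(v2)  # group by key: K2 -> List(V2)
    output = []
    for k2 in sorted(intermediate):
        # Reduce: (K2, List(V2)) -> List(K3, V3)
        output.extend(reducer(k2, intermediate[k2]))
    return output

# Hypothetical sample records: (byte offset of the line, "year,temperature").
readings = [(0, "1949,111"), (9, "1950,0"), (16, "1950,22"), (24, "1949,78")]

def max_temp_mapper(_offset, line):
    year, temperature = line.split(",")
    yield year, int(temperature)

def max_temp_reducer(year, temperatures):
    yield year, max(temperatures)

result = run_mapreduce(readings, max_temp_mapper, max_temp_reducer)
print(result)
```

Only the mapper and reducer change from one job to another; the driver stays the same, which is the point of the key/value contract.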
  • 9. Why MapReduce? Two biggest advantages: taking processing to the data; processing data in parallel. (Diagram labels: Map Task a, b, c; HDFS Block; Data Center; Rack; Node.)
  • 10. Hadoop 2.x MapReduce Components. Client: submits a MapReduce job. Resource Manager: cluster-level resource manager; long life, high-quality hardware. Node Manager: one per data node; monitors resources on the data node. Job History Server: maintains information about submitted MapReduce jobs after their ApplicationMaster terminates. ApplicationMaster: one per application; short life; coordinates and manages MapReduce jobs; negotiates with the Resource Manager to schedule tasks; the tasks are started by NodeManager(s). Container: created by the NM when requested; allocates a certain amount of resources (memory, CPU, etc.) on a slave node.
  • 11. YARN: Moving beyond MapReduce. Workloads shown on YARN: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), OTHER (Search) (Weave..). Source: http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
  • 12. MapReduce Application Execution: Executing a MapReduce Application on YARN
  • 13. YARN MR Application Execution Flow. MapReduce job execution: job submission; job initialization; tasks assignment; memory assignment; status updates; failure recovery.
  • 14. YARN MR Application Execution Flow. (Diagram: Client with Client JVM and Application Job Object (Run Job); HDFS; Resource Manager on the Management Node.) Steps: 1. Notify Start Application. 2. Get New Application ID. 3. Prepare the application submit context: 3.1 App Jar; 3.2 Job Resources (block locations); 3.3 User Information. 4. Submit Application Context.
  • 15. YARN MR Application Execution Flow. (Diagram: Client with Client JVM and Application Job Object (Run Job); HDFS; Resource Manager on the Management Node; Node Manager and App Master on a Data Node.) Steps: 1. Notify Start Application. 2. Get New Application ID. 3. Prepare the application submit context: 3.1 App Jar; 3.2 Job Resources (block locations); 3.3 User Information. 4. Submit Application Context. 5. Start AppMaster container / allocate context for AppMaster. 6. Allocate container for AppMaster. 7. Request resources. 8. Notify with resources availability.
  • 16. YARN MR Application Execution Flow. (Diagram: Client; HDFS; Resource Manager on the Management Node; Node Manager and App Master on a Data Node; Data node-1 and Data node-2, each with a Node Manager, a Map task and a Block.) Steps: 1. Notify Start Application. 2. Get New Application ID. 3. Prepare the application submit context: 3.1 App Jar; 3.2 Job Resources (block locations); 3.3 User Information. 4. Submit Application Context. 5. Start AppMaster container / allocate context for AppMaster. 6. Allocate container for AppMaster. 7. Request resources. 8. Notify with resources availability. 9. Start container in the worker node (on each data node). 10. NM allocates container (on each data node).
  • 17. YARN MR Application Execution Flow. 11. The tasks get executed. 12. If the job has any reducers, the AppMaster again requests the Node Manager to start and allocate a container. 13. The output of all the maps is given to the reducer, and the reducer gets executed. 14. Once the job has finished, the Application Master notifies the Resource Manager and the client library. 15. The Application Master closes.
  • 18. Hadoop 2.x: YARN Workflow. (Diagram: the Resource Manager, containing the Scheduler and the Applications Manager (AsM); 12 Node Managers; App Master 1 with Container 1.1 and Container 1.2; App Master 2 with Container 2.1, Container 2.2 and Container 2.3.)
  • 19. Summary: Application Workflow. (Diagram: Client, RM, NM, AM.) Execution sequence: 1. Client submits an application.
  • 20. Summary: Application Workflow. (Diagram: Client, RM, NM, AM.) Execution sequence: 1. Client submits an application. 2. RM allocates a container to start AM.
  • 21. Summary: Application Workflow. (Diagram: Client, RM, NM, AM.) Execution sequence: 1. Client submits an application. 2. RM allocates a container to start AM. 3. AM registers with RM.
  • 22. Summary: Application Workflow. (Diagram: Client, RM, NM, AM.) Execution sequence: 1. Client submits an application. 2. RM allocates a container to start AM. 3. AM registers with RM. 4. AM requests containers from RM.
  • 23. Summary: Application Workflow. (Diagram: Client, RM, NM, AM.) Execution sequence: 1. Client submits an application. 2. RM allocates a container to start AM. 3. AM registers with RM. 4. AM requests containers from RM. 5. AM notifies NM to launch containers.
  • 24. Summary: Application Workflow. (Diagram: Client, RM, NM, AM.) Execution sequence: 1. Client submits an application. 2. RM allocates a container to start AM. 3. AM registers with RM. 4. AM requests containers from RM. 5. AM notifies NM to launch containers. 6. Application code is executed in the container.
  • 25. Summary: Application Workflow. (Diagram: Client, RM, NM, AM.) Execution sequence: 1. Client submits an application. 2. RM allocates a container to start AM. 3. AM registers with RM. 4. AM requests containers from RM. 5. AM notifies NM to launch containers. 6. Application code is executed in the container. 7. Client contacts RM/AM to monitor the application's status.
  • 26. Summary: Application Workflow. (Diagram: Client, RM, NM, AM.) Execution sequence: 1. Client submits an application. 2. RM allocates a container to start AM. 3. AM registers with RM. 4. AM requests containers from RM. 5. AM notifies NM to launch containers. 6. Application code is executed in the container. 7. Client contacts RM/AM to monitor the application's status. 8. AM unregisters from RM.
  • 27. Input Splits. Input data is divided in two ways: a physical division into HDFS blocks and a logical division into input splits.
  • 28. Relation Between Input Splits and HDFS Blocks. (Diagram: a file of lines 1 to 11 laid over block boundaries and grouped into three splits.) Logical records do not fit neatly into the HDFS blocks. Here the logical records are lines, and some lines cross block boundaries. The first split contains line 5 even though that line spans two blocks.
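One way to see how a line that crosses a boundary ends up in exactly one split is the skip-and-read-past rule sketched below. It is a simplified, in-memory imitation of how a line-oriented record reader can behave, not Hadoop's actual implementation; `make_splits`, `read_split`, the sample `data` and the 10-byte split size are all made up for illustration.

```python
def make_splits(total_size, split_size):
    """Cut the byte range [0, total_size) into fixed-size (start, end) splits."""
    return [(start, min(start + split_size, total_size))
            for start in range(0, total_size, split_size)]

def read_split(data, start, end):
    """Return the (offset, line) records that belong to one split."""
    pos = start
    if start != 0:
        # Skip the first (possibly partial) line: the previous split reads it.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # A line is owned by the split in which it starts, so the last line may
    # run past `end` into the next block.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append((pos, data[pos:line_end].decode()))
        pos = line_end + 1
    return records

data = b"alpha\nbravo\ncharlie\ndelta\necho\n"  # hypothetical 31-byte file
splits = make_splits(len(data), 10)            # pretend blocks are 10 bytes
records_per_split = [read_split(data, start, end) for start, end in splits]
for split, records in zip(splits, records_per_split):
    print(split, records)
```

With these sample bytes, "bravo" starts in the first 10-byte range and ends in the second, yet it is read once, by the first split. Each record is an (offset, line) pair, the key/value shape that the default text input format hands to the mapper.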
  • 29. MapReduce Job Submission Flow. (Diagram: input data, Node 1, Node 2.) Input data is distributed to nodes.
  • 30. MapReduce Job Submission Flow. (Diagram: input data; a Map task on Node 1 and on Node 2.) Input data is distributed to nodes. Each map task works on a "split" of data.
  • 31. MapReduce Job Submission Flow. (Diagram: input data; a Map task on Node 1 and on Node 2.) Input data is distributed to nodes. Each map task works on a "split" of data. Mapper outputs intermediate data.
  • 32. MapReduce Job Submission Flow. (Diagram: input data; a Map task on Node 1 and on Node 2, exchanging data between the nodes.) Input data is distributed to nodes. Each map task works on a "split" of data. Mapper outputs intermediate data. Data is exchanged between nodes in a "shuffle" process.
  • 33. MapReduce Job Submission Flow. (Diagram: input data; a Map task and a Reduce task on Node 1 and on Node 2.) Input data is distributed to nodes. Each map task works on a "split" of data. Mapper outputs intermediate data. Data is exchanged between nodes in a "shuffle" process. Intermediate data of the same key goes to the same reducer.
  • 34. MapReduce Job Submission Flow. (Diagram: input data; a Map task and a Reduce task on Node 1 and on Node 2.) Input data is distributed to nodes. Each map task works on a "split" of data. Mapper outputs intermediate data. Data is exchanged between nodes in a "shuffle" process. Intermediate data of the same key goes to the same reducer. Reducer output is stored.
  • 35. Combiner. Block1: B C D E D B. Block2: D A A C B D. Mapper output for Block1: (B,1) (C,1) (D,1) (E,1) (D,1) (B,1). Mapper output for Block2: (D,1) (A,1) (A,1) (C,1) (B,1) (D,1). Combiner output for Block1: (B,2) (C,1) (D,2) (E,1). Combiner output for Block2: (D,2) (A,2) (C,1) (B,1). Shuffle: (A, [2]) (B, [2,1]) (C, [1,1]) (D, [2,2]) (E, [1]). Reducer output: (A,2) (B,3) (C,2) (D,4) (E,1).
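The combiner example on this slide can be reproduced with a short single-process Python sketch. The names `mapper` and `combiner` are illustrative only; the point is that each mapper's output is pre-aggregated locally before the shuffle, so fewer pairs have to be moved between nodes.

```python
from collections import Counter, defaultdict

# The two blocks of letters shown on the slide.
blocks = {"Block1": "B C D E D B".split(), "Block2": "D A A C B D".split()}

def mapper(words):
    """Emit (word, 1) for every word in one block."""
    return [(word, 1) for word in words]

def combiner(pairs):
    """Locally sum the counts of one mapper's output before the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

mapped = {name: mapper(words) for name, words in blocks.items()}
combined = {name: combiner(pairs) for name, pairs in mapped.items()}

# Shuffle: collect the partial counts of each key from every combiner.
shuffled = defaultdict(list)
for name in blocks:
    for key, value in combined[name]:
        shuffled[key].append(value)

# Reduce: sum the partial counts.
reduced = {key: sum(values) for key, values in sorted(shuffled.items())}

pairs_without_combiner = sum(len(pairs) for pairs in mapped.values())
pairs_with_combiner = sum(len(pairs) for pairs in combined.values())
print(pairs_without_combiner, pairs_with_combiner, reduced)
```

Reusing the reduce logic as a combiner is safe here because addition is associative and commutative; a reducer that computes, say, an average could not be reused this way without changes.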
  • 36. Partitioner: Redirecting Output from Mapper. (Diagram: three Map tasks, each with its own Partitioner, sending output to three Reducers.)
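A partitioner decides which reducer receives each intermediate key. A common approach, and the idea behind hash partitioning in general, is to hash the key and take the result modulo the number of reducers. The sketch below is illustrative Python rather than Hadoop code: `partition` is a hypothetical helper, CRC32 is just a convenient stable hash chosen for the example, and the mapper outputs reuse the word count data from slide 7.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer index from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

NUM_REDUCERS = 3

# Intermediate output of three mappers (word count data from slide 7).
mapper_outputs = [
    [("Deer", 1), ("Bear", 1), ("River", 1)],
    [("Car", 1), ("Car", 1), ("River", 1)],
    [("Deer", 1), ("Car", 1), ("Bear", 1)],
]

# Route every pair to the reducer chosen by the partitioner.
reducer_inputs = [[] for _ in range(NUM_REDUCERS)]
for output in mapper_outputs:
    for key, value in output:
        reducer_inputs[partition(key, NUM_REDUCERS)].append((key, value))

for index, pairs in enumerate(reducer_inputs):
    print(index, pairs)
```

Because the index depends only on the key, every pair with the same key lands at the same reducer regardless of which mapper produced it.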
  • 37. Getting Data to the Mapper. (Diagram: two input files are divided into four input splits; each input split is read by its own RecordReader, which feeds its own Mapper; each Mapper emits intermediates.)
  • 38. Partition and Shuffle. (Diagram: four Mappers emit intermediates; each Mapper's intermediates pass through a Partitioner; the partitioned intermediates are delivered to three Reducers.)
  • 39. Demo of Word Count Program, to illustrate the default input format (Text Input Format).
  • 40. Input Format. (Diagram: the InputFormat divides an input file into Input Splits; each Input Split has a Record Reader that feeds a Mapper; each Mapper emits intermediates.)
  • 41. Input Format: Class Hierarchy (org.apache.hadoop.mapreduce). InputFormat<K,V> is the root. FileInputFormat<K,V>, with CombineFileInputFormat<K,V>, TextInputFormat, KeyValueTextInputFormat, NLineInputFormat and SequenceFileInputFormat<K,V> beneath it. SequenceFileInputFormat<K,V>, with SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat and SequenceFileInputFilter<K,V> beneath it. ComposableInputFormat<K,V> (marked <<interface>>), with CompositeInputFormat<K,V>. DBInputFormat<T>.
  • 42. Output Format. (Diagram: within the OutputFormat, each of three Reducers writes through its own RecordWriter to its own output file.)
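The reducer-to-RecordWriter-to-file path can be imitated with in-memory buffers. This is a hedged sketch, not Hadoop's API: `TextRecordWriter` is a hypothetical class, the reducer outputs are the word count results from slide 7 split arbitrarily across two reducers, and the tab separator and `part-r-NNNNN` names are modelled on the conventions of Hadoop's text output.

```python
import io

class TextRecordWriter:
    """Write each (key, value) record as one 'key<separator>value' line."""

    def __init__(self, stream, separator="\t"):
        self.stream = stream
        self.separator = separator

    def write(self, key, value):
        self.stream.write(f"{key}{self.separator}{value}\n")

# Output of two reducers (word count results, split arbitrarily).
reducer_outputs = [
    [("Bear", 2), ("Car", 3)],
    [("Deer", 2), ("River", 2)],
]

# One writer and one output "file" per reducer.
files = {}
for index, records in enumerate(reducer_outputs):
    buffer = io.StringIO()
    writer = TextRecordWriter(buffer)
    for key, value in records:
        writer.write(key, value)
    files[f"part-r-{index:05d}"] = buffer.getvalue()

for name, content in files.items():
    print(name, repr(content))
```

Each reducer owns its output file, so no coordination between reducers is needed while writing.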
  • 43. Output Format: Class Hierarchy (org.apache.hadoop.mapreduce). OutputFormat<K,V> is the root. FileOutputFormat<K,V>, with TextOutputFormat<K,V> and SequenceFileOutputFormat<K,V> beneath it. SequenceFileOutputFormat<K,V>, with SequenceFileAsBinaryOutputFormat beneath it. DBOutputFormat<K,V>. NullOutputFormat<K,V>. FilterOutputFormat<K,V>, with LazyOutputFormat<K,V> beneath it.