Mapreduce script
K.Haripritha
II-MSC (IT)
Bigdata analysis.
Nadar saraswathi college of arts and science.
INTRODUCTION
• MapReduce is a programming model and an associated
implementation for processing and generating big data sets
with a parallel, distributed algorithm on a cluster.
• A MapReduce program is composed of a map procedure (or
method), which performs filtering and sorting, and
a reduce method, which performs a summary operation.
• The "MapReduce System" (also called "infrastructure" or
"framework") orchestrates the processing by marshalling the
distributed servers, running the various tasks in parallel,
managing all communications and data transfers between the
various parts of the system, and providing
for redundancy and fault tolerance.
OVERVIEW
• MapReduce is a framework for
processing parallelizable problems across large
datasets using a large number of computers
(nodes), collectively referred to as a cluster (if
all nodes are on the same local network and
use similar hardware) or a grid (if the nodes
are shared across geographically and
administratively distributed systems, and use
more heterogeneous hardware).
• A MapReduce framework is usually composed of
three operations:
• Map: each worker node applies the map function
to the local data and writes the output to
temporary storage. A master node ensures that
only one copy of redundant input data is
processed.
• Shuffle: worker nodes redistribute data based on
the output keys, such that all data belonging to
one key is located on the same worker node.
• Reduce: worker nodes now process each group of
output data, per key, in parallel.
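The three operations can be sketched in a single process. This is a minimal illustration, not a real cluster: the `worker_inputs` data, the `map_words` function and the character-code rule used to pick a destination worker are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical local data held by three simulated worker nodes.
worker_inputs = [
    ["a b a"],
    ["b c"],
    ["a c c"],
]

def map_words(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

# Map: each worker applies the map function to its local data
# and keeps the output in its own temporary storage (a list here).
map_outputs = [
    [pair for line in lines for pair in map_words(line)]
    for lines in worker_inputs
]

# Shuffle: redistribute pairs so that every pair with the same key
# lands on the same worker. The destination rule is an assumption:
# the sum of the key's character codes, modulo the worker count.
num_workers = len(worker_inputs)
shuffled = [defaultdict(list) for _ in range(num_workers)]
for pairs in map_outputs:
    for key, value in pairs:
        dest = sum(ord(ch) for ch in key) % num_workers
        shuffled[dest][key].append(value)

# Reduce: each worker processes its own groups, one key at a time.
reduce_outputs = [
    {key: sum(values) for key, values in groups.items()}
    for groups in shuffled
]

totals = {key: count for out in reduce_outputs for key, count in out.items()}
print(totals)
```

After the shuffle, each word is held by exactly one simulated worker, which is what lets the reduce step run per key without further coordination.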
• The Map and Reduce functions of MapReduce are both
defined with respect to data structured in (key, value)
pairs. Map takes one pair of data with a type in one data
domain, and returns a list of pairs in a different domain:
• Map(k1,v1) → list(k2,v2)
• The Map function is applied in parallel to every pair (keyed
by k1) in the input dataset. This produces a list of pairs (keyed
by k2) for each call. After that, the MapReduce framework
collects all pairs with the same key (k2) from all lists and
groups them together, creating one group for each key.
• The Reduce function is then applied in parallel to each group,
which in turn produces a collection of values in the same
domain:
• Reduce(k2, list(v2)) → list(v3)
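The two signatures above can be written down as Python type hints. The sketch below is only an illustration under stated assumptions: `run_mapreduce`, `map_fn`, `reduce_fn` and the temperature records are hypothetical names and data, and the grouping is done in one process rather than by a distributed framework.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
V3 = TypeVar("V3")

def run_mapreduce(
    inputs: Iterable[Tuple[K1, V1]],
    map_fn: Callable[[K1, V1], List[Tuple[K2, V2]]],
    reduce_fn: Callable[[K2, List[V2]], List[V3]],
) -> Dict[K2, List[V3]]:
    """Apply Map to every input pair, group by k2, then apply Reduce."""
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):      # Map(k1, v1) -> list(k2, v2)
            groups[k2].append(v2)
    # Reduce(k2, list(v2)) -> list(v3)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

def map_fn(line_id: str, record: str) -> List[Tuple[int, int]]:
    """Turn a 'year temperature' record into a (year, temperature) pair."""
    year, temperature = record.split()
    return [(int(year), int(temperature))]

def reduce_fn(year: int, temperatures: List[int]) -> List[int]:
    """Keep only the highest temperature seen for the year."""
    return [max(temperatures)]

# Hypothetical input: (line id, "year temperature") pairs.
records = [("line0", "1950 22"), ("line1", "1950 31"), ("line2", "1951 18")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

Here the input domain is (line id, text), the intermediate domain is (year, temperature), and the output is a list of temperatures per year, which mirrors the k1/v1, k2/v2 and v3 roles in the signatures.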
DATA FLOW
• The software framework architecture adheres to the open-closed
principle where code is effectively divided into
unmodifiable frozen spots and extensible hot spots. The
frozen spot of the MapReduce framework is a large
distributed sort. The hot spots, which the application
defines, are:
• an input reader
• a Map function
• a partition function
• a comparison function
• a Reduce function
• an output writer
• Input reader:
• The input reader divides the input into appropriately sized
'splits' and the framework assigns one split to
each Map function. The input reader reads data from stable
storage and generates key/value pairs.
• Map function:
• The Map function takes a series of key/value pairs,
processes each, and generates zero or more output key/value
pairs. The input and output types of the map can be different
from each other.
• Partition function:
• Each Map function output is allocated to a
particular reducer by the application's partition function
for sharding purposes. The partition function is given the
key and the number of reducers and returns the index of the
desired reducer.
• Comparison function:
• The input for each Reduce is pulled from the machine where
the Map ran and sorted using the
application's comparison function.
• Reduce function:
• The framework calls the application's Reduce function once
for each unique key in sorted order. The Reduce function can
iterate through the values that are associated with that key
and produce zero or more outputs.
• In the word count example, the Reduce function takes the
input values, sums them and generates a single output of the
word and the final sum.
• Output writer:
• The output writer writes the output of the Reduce function to
stable storage.
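The six hot spots described above can be wired together in a small single-process sketch. Every function name below (`input_reader`, `partition_fn`, `compare_fn`, `output_writer`, `run_job`) is a hypothetical stand-in chosen for this example, and an in-memory `StringIO` plays the role of stable storage; a real framework would perform the sort and data movement across machines.

```python
import io
from functools import cmp_to_key
from itertools import groupby

def input_reader(text, lines_per_split=2):
    """Divide the input into splits of (line_number, line) pairs."""
    lines = text.splitlines()
    for start in range(0, len(lines), lines_per_split):
        chunk = lines[start:start + lines_per_split]
        yield [(start + i, line) for i, line in enumerate(chunk)]

def map_fn(_line_number, line):
    """Emit zero or more (word, 1) pairs for one input pair."""
    return [(word.lower(), 1) for word in line.split()]

def partition_fn(key, num_reducers):
    """Return the index of the reducer responsible for this key."""
    return sum(ord(ch) for ch in key) % num_reducers

def compare_fn(a, b):
    """Order keys alphabetically: negative, zero or positive."""
    return (a > b) - (a < b)

def reduce_fn(key, values):
    """Word count: sum the values and emit a single (word, total)."""
    return [(key, sum(values))]

def output_writer(records, sink):
    """Write reduce output as tab-separated lines."""
    for key, value in records:
        sink.write(f"{key}\t{value}\n")

def run_job(text, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for split in input_reader(text):
        for k1, v1 in split:
            for k2, v2 in map_fn(k1, v1):
                buckets[partition_fn(k2, num_reducers)].append((k2, v2))
    sink = io.StringIO()  # stands in for stable storage
    for bucket in buckets:
        # Sort each reducer's input with the comparison function.
        bucket.sort(key=cmp_to_key(lambda x, y: compare_fn(x[0], y[0])))
        # Call Reduce once per unique key, in sorted order.
        for key, group in groupby(bucket, key=lambda pair: pair[0]):
            output_writer(reduce_fn(key, [v for _, v in group]), sink)
    return sink.getvalue()

report = run_job("the quick fox\nthe lazy dog\nThe end")
print(report)
```

Only the six small functions are application code here; `run_job` plays the part of the frozen framework, which is the division of labour the open-closed description refers to.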
Performance considerations
• MapReduce programs are not guaranteed to be fast. The
main benefit of this programming model is that it exploits
the optimized shuffle operation of the platform, so the
author only has to write the Map and Reduce parts of the
program.
• In practice, however, the author of a MapReduce program
has to take the shuffle step into consideration;
in particular, the partition function and the amount of
data written by the Map function can have a large
impact on performance and scalability.
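The effect of the partition function can be seen by counting how many keys each reducer would receive. The sketch below is an illustration only: the key set, the two partition functions and the choice of four reducers are assumptions, and `zlib.crc32` is used simply as a stable hash.

```python
import zlib
from collections import Counter

def first_char_partition(key, num_reducers):
    """Poor choice for these keys: they all share the same first character."""
    return ord(key[0]) % num_reducers

def crc_partition(key, num_reducers):
    """Hash the whole key so that similar keys spread across reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_load(keys, partition_fn, num_reducers):
    """Count how many keys each reducer index receives."""
    counts = Counter(partition_fn(key, num_reducers) for key in keys)
    return [counts.get(i, 0) for i in range(num_reducers)]

# Hypothetical keys emitted by the Map function.
keys = [f"user{i}" for i in range(1000)]

skewed = reducer_load(keys, first_char_partition, 4)
balanced = reducer_load(keys, crc_partition, 4)
print("first character:", skewed)
print("crc32 of key:   ", balanced)
```

With the first-character rule a single reducer receives every key while the others sit idle, so the job runs no faster than one machine; hashing the whole key spreads the same work across reducers.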
Distribution and reliability
• MapReduce achieves reliability by parceling out a
number of operations on the set of data to each node in
the network. Each node is expected to report back
periodically with completed work and status updates.
• If a node falls silent for longer than the expected reporting interval, the
master node records the node as dead and sends out the
node's assigned work to other nodes.
• Individual operations name their output files atomically,
as a check to ensure that no conflicting threads are
running in parallel.
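The failure-handling rule above can be sketched as a small bookkeeping class. This is a toy model under stated assumptions: the `Master` class, its method names, the round-robin reassignment and the explicit `now` timestamps are all hypothetical, and no real networking takes place.

```python
class Master:
    """Toy master node that tracks reports and reassigns work."""

    def __init__(self, timeout):
        self.timeout = timeout   # longest allowed silence
        self.last_seen = {}      # node -> time of last report
        self.assigned = {}       # node -> list of task ids
        self.dead = set()
        self.pending = []        # tasks with no live node to run them

    def assign(self, node, task, now):
        self.last_seen.setdefault(node, now)
        self.assigned.setdefault(node, []).append(task)

    def heartbeat(self, node, now):
        """Record a periodic status report from a node."""
        if node not in self.dead:
            self.last_seen[node] = now

    def check(self, now):
        """Mark silent nodes dead and hand their tasks to live nodes."""
        for node, seen in list(self.last_seen.items()):
            if node in self.dead or now - seen <= self.timeout:
                continue
            self.dead.add(node)
            orphaned = self.assigned.pop(node, [])
            live = sorted(n for n in self.last_seen if n not in self.dead)
            for i, task in enumerate(orphaned):
                if live:
                    target = live[i % len(live)]
                    self.assigned.setdefault(target, []).append(task)
                else:
                    self.pending.append(task)

master = Master(timeout=10)
master.assign("node-a", "map-0", now=0)
master.assign("node-b", "map-1", now=0)
master.heartbeat("node-a", now=8)   # node-a reports; node-b stays silent
master.check(now=12)
print(master.dead, master.assigned)
```

In this run node-b has been silent for 12 time units against a timeout of 10, so it is recorded as dead and its task is handed to node-a.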
Uses
• MapReduce is useful in a wide range of
applications, including distributed pattern-based
searching, distributed sorting, web link-graph
reversal, singular value decomposition, web access
log statistics, inverted index construction, document
clustering, machine learning, and statistical
machine translation.
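As one concrete case from the list, inverted index construction fits the model directly: Map emits (word, document id) pairs and Reduce collects the ids for each word. The sketch below is a single-process illustration; the function names and the three sample documents are assumptions.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit a (word, doc_id) pair for every word in the document."""
    return [(word.lower(), doc_id) for word in text.split()]

def reduce_fn(word, doc_ids):
    """Collapse the ids for one word into a sorted list without repeats."""
    return sorted(set(doc_ids))

def build_inverted_index(documents):
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_fn(doc_id, text):
            groups[word].append(emitted_id)   # stands in for the shuffle
    return {word: reduce_fn(word, ids) for word, ids in groups.items()}

# Hypothetical documents keyed by id.
docs = {
    "d1": "map reduce model",
    "d2": "reduce phase",
    "d3": "map phase map",
}
index = build_inverted_index(docs)
print(index)
```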
