MapReduce with HADOOP



      Vitalie Scurtu
What is Hadoop?
Hadoop is a set of open source frameworks for
  parallel and distributed computing:
• HDFS: Distributed file system
• MapReduce: A technique and a framework for
  parallel computation on a cluster.
• ZooKeeper: A configuration service.
• and others: Hive, HBase, Mahout, Pig.
• One of Yahoo's Hadoop clusters was used to sort 1 terabyte of data in 209
  seconds in the Terabyte Sorting Competition.
Why distributed computing?
• Reduced costs. Many ordinary computers are cheaper
  than one more powerful computer.
• Scalability. We can add new computers to the
  cluster at any time.
• High aggregate computing power and speed.
• Distributed algorithms.
• Stability
• Robust frameworks.
Configuring Hadoop
• It is written in Java and uses XML files for configuration.
• Installation is very simple.
• Every computer can become a part of the cluster.
• To try a demo we need only 30 minutes.
• Uses an advanced configuration system named
  ZooKeeper
• cat /usr/local/hadoop/conf/slaves
         hadoop-master
         hadoop-slave01
         hadoop-slave02
         hadoop-slave03
         hadoop-slave06
HDFS
          Hadoop Distributed File System
•   Distributed file system
•   Support for huge files (gigabytes to terabytes)
•   Tolerant of hardware failure, through replication
•   File access model is “Write-once-read-many”
•   Cross-platform (java)
MapReduce
• A unique model for distributed computation; the main algorithm is divided in
  two
    – Map
        • Accepts key-value pairs (a dictionary) as input
        • Records must be independent (Key A does not depend on Key B)
        • It does the intermediate computations and prepares the data for the Reduce stage.
    – Reduce
        • Accepts as input collections of key-value pairs with intermediate results.
        • Parallel sorting and grouping functions.
        • Returns the final result.
    – Map -> Reduce
        • It is not only a distributed framework but also a development methodology thanks to its
          unique formula. The algorithm's constraints make it possible for the developer to think
          about the implementation and not to focus on the parallel computation. Once a problem is
          transformed into a MapReduce algorithm, the framework is applicable.
    – Computation time (if all tasks run in parallel): max(time_of_each_map) + max(time_of_each_reduce)
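The Map -> Reduce model and the computation time formula above can be tried out in a single process. This is only a minimal sketch, not Hadoop's implementation: the names map_reduce, estimated_time, count_mapper and count_reducer, the sample records and the sample task times are all assumptions made for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        # Each record is mapped independently of all the others.
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Iterating over sorted keys mimics the sort-and-group step before Reduce.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def estimated_time(map_times, reduce_times):
    """max(time_of_each_map) + max(time_of_each_reduce), all tasks in parallel."""
    return max(map_times) + max(reduce_times)


def count_mapper(doc_id, text):
    # Emit (token, 1) for every token in the document.
    for token in text.split():
        yield token, 1


def count_reducer(token, counts):
    # Sum the intermediate counts collected for one token.
    return sum(counts)


# Illustrative input: (document id, text) pairs.
records = [(1, "a b a"), (2, "b c")]
result = map_reduce(records, count_mapper, count_reducer)
print(result)  # {'a': 2, 'b': 2, 'c': 1}

# Illustrative task times in seconds: slowest map (5.0) + slowest reduce (2.0).
print(estimated_time([3.0, 5.0, 4.0], [2.0, 1.0]))  # 7.0
```

The developer writes only count_mapper and count_reducer; the grouping in between belongs to the framework, which is what makes the model a methodology as well as a tool.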
MapReduce

          +--> Map1 --+
          +--> Map2 --+
Input ----+           +--> Reduce --> Output
          +--> Map3 --+
          +--> Map4 --+
Example of Applications
• Problem: Extract all the texts from a database
   with 1 million posts and compute the number of
   occurrences of each token.
   mapper.py <- Takes an id as input
             -> Prints each token with its number of occurrences
   reducer.py <- Takes as input a list of tokens with
                 their occurrence counts
              -> Sums the occurrences of each token
                 and outputs the final result.
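The slides do not show the two scripts, so the following is only a sketch of what they could look like. It assumes the Hadoop Streaming convention of tab-separated "key<TAB>value" lines, with the reducer input sorted by key. The POSTS dictionary and fetch_text are hypothetical stand-ins for the database lookup, and the final lines simulate the sort step locally instead of running on a cluster.

```python
from collections import Counter
from itertools import groupby

# Hypothetical stand-in for the posts database (not part of the original slides).
POSTS = {"1": "hadoop is fun", "2": "hadoop is java"}


def fetch_text(post_id):
    """Return the text of a post, or an empty string for an unknown id."""
    return POSTS.get(post_id, "")


def mapper(lines, fetch=fetch_text):
    """For each post id, emit 'token<TAB>count' for every token of that post."""
    for line in lines:
        post_id = line.strip()
        if not post_id:
            continue
        counts = Counter(fetch(post_id).lower().split())
        for token, count in counts.items():
            yield f"{token}\t{count}"


def reducer(lines):
    """Sum the counts per token; the input lines must already be sorted by token."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for token, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{token}\t{sum(int(count) for _, count in group)}"


# Local simulation: map, sort (done by the framework on a cluster), then reduce.
mapped = sorted(mapper(["1\n", "2\n"]))
totals = list(reducer(mapped))
print(totals)  # ['fun\t1', 'hadoop\t2', 'is\t2', 'java\t1']
```

On a cluster, mapper.py and reducer.py would each read sys.stdin and print their output lines, and the streaming job would be given the two scripts through its -mapper and -reducer options.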
Experiment 1, 100K docs, 5 slaves
•   Time without MapReduce
     –   906.63user
     –   4.18system
     –   0:14:32 elapsed
     –   104%CPU (0avgtext+0avgdata 0maxresident)k
•   Time with MapReduce
     –   3.79user
     –   0.40system
     –   0:21:00 elapsed
     –   0%CPU (0avgtext+0avgdata 0maxresident)k

     –   10/10/25 11:10:36 INFO streaming.StreamJob:   map 0% reduce 0%
     –   10/10/25 11:10:50 INFO streaming.StreamJob:   map 16% reduce 0%
     –   10/10/25 11:11:48 INFO streaming.StreamJob:   map 33% reduce 0%
     –   10/10/25 11:12:10 INFO streaming.StreamJob:   map 49% reduce 0%
     –   10/10/25 11:14:09 INFO streaming.StreamJob:   map 66% reduce 0%
     –   10/10/25 11:14:37 INFO streaming.StreamJob:   map 82% reduce 0%
     –   10/10/25 11:16:26 INFO streaming.StreamJob:   map 83% reduce 0%
     –   10/10/25 11:18:12 INFO streaming.StreamJob:   map 83% reduce 17%
     –   10/10/25 11:20:18 INFO streaming.StreamJob:   map 99% reduce 17%
Experiment 2, 1M docs, 5 slaves
•   Time without MapReduce
     –   6892.08user
     –   25.03system
     –   1:56:37 elapsed
     –   98%CPU (0avgtext+0avgdata 0maxresident)k
•   Time with MapReduce
     –   6.30user
     –   0.98system
     –   3:26:18 elapsed
     –   0%CPU (0avgtext+0avgdata 0maxresident)k

     –   10/10/26 15:04:36 INFO streaming.StreamJob:   map 100% reduce 14%
     –   10/10/26 15:04:37 INFO streaming.StreamJob:   map 100% reduce 16%
     –   10/10/26 15:04:39 INFO streaming.StreamJob:   map 100% reduce 25%
     –   10/10/26 15:04:40 INFO streaming.StreamJob:   map 100% reduce 27%
     –   10/10/26 15:04:42 INFO streaming.StreamJob:   map 100% reduce 30%
     –   10/10/26 15:04:44 INFO streaming.StreamJob:   map 100% reduce 32%
     –   10/10/26 15:04:45 INFO streaming.StreamJob:   map 100% reduce 34%
     –   10/10/26 15:04:48 INFO streaming.StreamJob:   map 100% reduce 35%
     –   10/10/26 15:07:29 INFO streaming.StreamJob:   map 83% reduce 35%
     –   10/10/26 15:07:35 INFO streaming.StreamJob:   map 100% reduce 35%
     –   10/10/26 15:09:57 INFO streaming.StreamJob:   map 100% reduce 36%
     –   10/10/26 15:09:59 INFO streaming.StreamJob:   map 100% reduce 37%
Experiment 3, 1M docs, 3 slaves
•   Time without MapReduce
     –   6892.08user
     –   25.03system
     –   1:56:37 elapsed
     –   98%CPU (0avgtext+0avgdata 0maxresident)k
•   Time with MapReduce
     –   5.50user
     –   0.97system
     –   00:53:20 elapsed
     –   0%CPU (0avgtext+0avgdata 0maxresident)k
     –   10/10/26 15:04:36 INFO streaming.StreamJob:   map 100% reduce 14%
     –   10/10/26 15:04:37 INFO streaming.StreamJob:   map 100% reduce 16%
     –   10/10/26 15:04:39 INFO streaming.StreamJob:   map 100% reduce 25%
     –   10/10/26 15:04:40 INFO streaming.StreamJob:   map 100% reduce 27%
     –   10/10/26 15:04:42 INFO streaming.StreamJob:   map 100% reduce 30%
     –   10/10/26 15:04:44 INFO streaming.StreamJob:   map 100% reduce 32%
     –   10/10/26 15:04:45 INFO streaming.StreamJob:   map 100% reduce 34%
     –   10/10/26 15:04:48 INFO streaming.StreamJob:   map 100% reduce 35%
     –   10/10/26 15:07:29 INFO streaming.StreamJob:   map 83% reduce 35%
     –   10/10/26 15:07:35 INFO streaming.StreamJob:   map 100% reduce 35%
     –   10/10/26 15:09:57 INFO streaming.StreamJob:   map 100% reduce 36%
     –   10/10/26 15:09:59 INFO streaming.StreamJob:   map 100% reduce 37%
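To compare the three experiments, the elapsed times reported above can be turned into a ratio of time without MapReduce to time with MapReduce. This is a small sketch using only the elapsed values from the slides; the helper to_seconds and the dictionary layout are assumptions made for illustration. A ratio above 1 means the MapReduce run finished sooner.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' elapsed time string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (elapsed without MapReduce, elapsed with MapReduce), as reported in the slides.
experiments = {
    "Experiment 1 (100K docs, 5 slaves)": ("0:14:32", "0:21:00"),
    "Experiment 2 (1M docs, 5 slaves)": ("1:56:37", "3:26:18"),
    "Experiment 3 (1M docs, 3 slaves)": ("1:56:37", "00:53:20"),
}

ratios = {}
for name, (without_mr, with_mr) in experiments.items():
    ratios[name] = to_seconds(without_mr) / to_seconds(with_mr)
    print(f"{name}: {ratios[name]:.2f}x")
# Experiment 1 (100K docs, 5 slaves): 0.69x
# Experiment 2 (1M docs, 5 slaves): 0.57x
# Experiment 3 (1M docs, 3 slaves): 2.19x
```

Only elapsed time is compared here, because the user and system figures of the MapReduce runs cover just the local process that submitted the job, not the work done on the cluster.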
What’s next?
• MapReduce can be applied to many problems
  and natural language processing applications.
  Examples:
  – Sentiment analysis.
  – Computing probabilities over huge data sets.
  – Retrieval problems.
  – Statistics and analysis of huge data sets.
  – MapReduce is not only a framework; it is also a
    distributed computing methodology.

More Related Content

What's hot

Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...Rob Emanuele
 
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachAutomatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachSpark Summit
 
Deep Learning on Aerial Imagery: What does it look like on a map?
Deep Learning on Aerial Imagery: What does it look like on a map?Deep Learning on Aerial Imagery: What does it look like on a map?
Deep Learning on Aerial Imagery: What does it look like on a map?Rob Emanuele
 
Processing Big Data in Realtime
Processing Big Data in RealtimeProcessing Big Data in Realtime
Processing Big Data in RealtimeTikal Knowledge
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster InnardsMartin Dvorak
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceBhupesh Chawda
 
Scaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rushScaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rushDaniel Ben-Zvi
 
Harnessing Big Data with Spark
Harnessing Big Data with SparkHarnessing Big Data with Spark
Harnessing Big Data with SparkAlpine Data
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
The next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineThe next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineG. Bruce Berriman
 
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - HortonworksAvery Ching
 

What's hot (20)

Giraph
GiraphGiraph
Giraph
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Doom in SpaceX
Doom in SpaceXDoom in SpaceX
Doom in SpaceX
 
Kafka short
Kafka shortKafka short
Kafka short
 
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachAutomatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
 
Deep Learning on Aerial Imagery: What does it look like on a map?
Deep Learning on Aerial Imagery: What does it look like on a map?Deep Learning on Aerial Imagery: What does it look like on a map?
Deep Learning on Aerial Imagery: What does it look like on a map?
 
Heatmap
HeatmapHeatmap
Heatmap
 
Processing Big Data in Realtime
Processing Big Data in RealtimeProcessing Big Data in Realtime
Processing Big Data in Realtime
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Scaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rushScaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rush
 
R user group 2011 09
R user group 2011 09R user group 2011 09
R user group 2011 09
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Harnessing Big Data with Spark
Harnessing Big Data with SparkHarnessing Big Data with Spark
Harnessing Big Data with Spark
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
The next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineThe next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engine
 
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks
 

Viewers also liked (17)

Question1
Question1Question1
Question1
 
Quantità dell'informazione
Quantità dell'informazioneQuantità dell'informazione
Quantità dell'informazione
 
Ecostystems
EcostystemsEcostystems
Ecostystems
 
Question1
Question1Question1
Question1
 
Medium Information Quantity
Medium Information QuantityMedium Information Quantity
Medium Information Quantity
 
Cartoon on teaching of sense organs
Cartoon on teaching of sense organsCartoon on teaching of sense organs
Cartoon on teaching of sense organs
 
Lost boy draft 1
Lost boy draft 1Lost boy draft 1
Lost boy draft 1
 
Misfortune 2nd Draft
Misfortune 2nd DraftMisfortune 2nd Draft
Misfortune 2nd Draft
 
Misfortune
MisfortuneMisfortune
Misfortune
 
For the love of family
For the love of family For the love of family
For the love of family
 
Persona non grata first draft
Persona non grata first draftPersona non grata first draft
Persona non grata first draft
 
Food and health
Food and healthFood and health
Food and health
 
Jung
JungJung
Jung
 
Script working title
Script working titleScript working title
Script working title
 
Script working title katie's killer
Script working title   katie's killerScript working title   katie's killer
Script working title katie's killer
 
See no evil
See no evilSee no evil
See no evil
 
Script working title
Script working title Script working title
Script working title
 

Similar to MapReduce with Hadoop

A Lightweight Infrastructure for Graph Analytics
A Lightweight Infrastructure for Graph AnalyticsA Lightweight Infrastructure for Graph Analytics
A Lightweight Infrastructure for Graph AnalyticsDonald Nguyen
 
Exascale Deep Learning for Climate Analytics
Exascale Deep Learning for Climate AnalyticsExascale Deep Learning for Climate Analytics
Exascale Deep Learning for Climate Analyticsinside-BigData.com
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Ontico
 
Blue Waters and Resource Management - Now and in the Future
 Blue Waters and Resource Management - Now and in the Future Blue Waters and Resource Management - Now and in the Future
Blue Waters and Resource Management - Now and in the Futureinside-BigData.com
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceSpiros Economakis
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceSpiros Oikonomakis
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAlbert Bifet
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APImcsrivas
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezDataWorks Summit
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
On Extending MapReduce - Survey and Experiments
On Extending MapReduce - Survey and ExperimentsOn Extending MapReduce - Survey and Experiments
On Extending MapReduce - Survey and ExperimentsYu Liu
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
ASE2010
ASE2010ASE2010
ASE2010swy351
 
Partitioning SKA Dataflows for Optimal Graph Execution
Partitioning SKA Dataflows for Optimal Graph ExecutionPartitioning SKA Dataflows for Optimal Graph Execution
Partitioning SKA Dataflows for Optimal Graph Execution Chen Wu
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascaleinside-BigData.com
 

Similar to MapReduce with Hadoop (20)

A Lightweight Infrastructure for Graph Analytics
A Lightweight Infrastructure for Graph AnalyticsA Lightweight Infrastructure for Graph Analytics
A Lightweight Infrastructure for Graph Analytics
 
Exascale Deep Learning for Climate Analytics
Exascale Deep Learning for Climate AnalyticsExascale Deep Learning for Climate Analytics
Exascale Deep Learning for Climate Analytics
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
 
Dasia 2022
Dasia 2022Dasia 2022
Dasia 2022
 
Blue Waters and Resource Management - Now and in the Future
 Blue Waters and Resource Management - Now and in the Future Blue Waters and Resource Management - Now and in the Future
Blue Waters and Resource Management - Now and in the Future
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
On Extending MapReduce - Survey and Experiments
On Extending MapReduce - Survey and ExperimentsOn Extending MapReduce - Survey and Experiments
On Extending MapReduce - Survey and Experiments
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
ASE2010
ASE2010ASE2010
ASE2010
 
Partitioning SKA Dataflows for Optimal Graph Execution
Partitioning SKA Dataflows for Optimal Graph ExecutionPartitioning SKA Dataflows for Optimal Graph Execution
Partitioning SKA Dataflows for Optimal Graph Execution
 
IAC 2020
IAC 2020IAC 2020
IAC 2020
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
 

Recently uploaded

It will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayIt will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayNZSG
 
Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communicationskarancommunications
 
The Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case studyThe Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case studyEthan lee
 
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756dollysharma2066
 
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...rajveerescorts2022
 
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptxB.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptxpriyanshujha201
 
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...Any kyc Account
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Roland Driesen
 
0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdf0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdfRenandantas16
 
A DAY IN THE LIFE OF A SALESMAN / WOMAN
A DAY IN THE LIFE OF A  SALESMAN / WOMANA DAY IN THE LIFE OF A  SALESMAN / WOMAN
A DAY IN THE LIFE OF A SALESMAN / WOMANIlamathiKannappan
 
Regression analysis: Simple Linear Regression Multiple Linear Regression
Regression analysis:  Simple Linear Regression Multiple Linear RegressionRegression analysis:  Simple Linear Regression Multiple Linear Regression
Regression analysis: Simple Linear Regression Multiple Linear RegressionRavindra Nath Shukla
 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLSeo
 
Cracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptxCracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptxWorkforce Group
 
Monthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptxMonthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptxAndy Lambert
 
7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...Paul Menig
 
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒anilsa9823
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Dave Litwiller
 
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfDr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfAdmir Softic
 
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...amitlee9823
 

Recently uploaded (20)

It will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayIt will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 May
 
Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communications
 
The Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case studyThe Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case study
 
Mifty kit IN Salmiya (+918133066128) Abortion pills IN Salmiyah Cytotec pills
Mifty kit IN Salmiya (+918133066128) Abortion pills IN Salmiyah Cytotec pillsMifty kit IN Salmiya (+918133066128) Abortion pills IN Salmiyah Cytotec pills
Mifty kit IN Salmiya (+918133066128) Abortion pills IN Salmiyah Cytotec pills
 
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
 
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
 
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptxB.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
 
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...
 
0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdf0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdf
 
A DAY IN THE LIFE OF A SALESMAN / WOMAN
A DAY IN THE LIFE OF A  SALESMAN / WOMANA DAY IN THE LIFE OF A  SALESMAN / WOMAN
A DAY IN THE LIFE OF A SALESMAN / WOMAN
 
Regression analysis: Simple Linear Regression Multiple Linear Regression
Regression analysis:  Simple Linear Regression Multiple Linear RegressionRegression analysis:  Simple Linear Regression Multiple Linear Regression
Regression analysis: Simple Linear Regression Multiple Linear Regression
 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
 
Cracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptxCracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptx
 
Monthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptxMonthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptx
 
7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...
 
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
 
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfDr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
 
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
 

MapReduce with Hadoop

  • 1. MapReduce with HADOOP Vitalie Scurtu
  • 2. What is hadoop? Hadoop is a set of open source frameworks for parallel and distributive computing: • HDFS: Distributed file system • MapReduce: A technique and a framework for parallel computation in cluster. • ZooKeeper: A configuration service. • and others: Hive ,HBase ,Mahout, Pig. • Yahoo's Hadoop clusters was used to sort 1 terabyte of data in 209 seconds in Terabyte Sorting Competition.
  • 3. Why distributed computing? • Reduced costs. More computers are cheaper then more powerful computer. • Scalability. We can add new computer to the cluster anytime. • Super power and super speed. • Distributed algorithms. • Stability • Robust frameworks.
  • 4. Configuring Hadoop • It is java and it uses xml file for configuration. • Installation is very simple. • Every computer can become a part of the cluster. • To try a demo we need only 30 minutes. • Uses an advanced configuration system named ZooKeeper • cat /usr/local/hadoop/conf/slaves hadoop-master hadoop-slave01 hadoop-slave02 hadoop-slave03 hadoop-slave06
  • 5. HDFS: Hadoop Distributed File System • Distributed file system • Support for huge files (gigabytes to terabytes) • Safe against hardware failure, thanks to replication • File access model is “write-once-read-many” • Cross-platform (Java)
  • 6. MapReduce • A unique model for distributed computation; the main algorithm is divided into two stages – Map • Accepts key-value pairs (a dictionary) as input. • Records must be independent (key A does not depend on key B). • Performs the intermediate computations and prepares the data for the Reduce stage. – Reduce • Accepts collections of key-value pairs holding intermediate results as input. • Parallel sorting and grouping functions. • Returns the final result. – Map -> Reduce • It is not only a distributed framework but also a development methodology, thanks to its unique formula. The algorithm's constraints let the developer think about the implementation rather than the parallel computation. Once a problem is transformed into a MapReduce algorithm, the framework is applicable. – Computation time: max(time_of_each_map) + max(time_of_each_reduce)
  • 7. MapReduce (diagram) Input -> Map1, Map2, Map3, Map4 -> Reduce -> Output
  • 8. Example of Applications • Problem: Extract all the texts from a database with 1 million posts and compute the number of occurrences of each token. mapper.py <- Takes an id as input -> Prints each token with its occurrence count. reducer.py <- Takes as input a list of tokens with their occurrence counts -> Sums the occurrences of each token and outputs the final result.
  • 9. Experiment 1, 100K docs, 5 slaves • Time without MapReduce – 906.63user – 4.18system – 0:14:32 elapsed – 104%CPU (0avgtext+0avgdata 0maxresident)k • Time with MapReduce – 3.79user – 0.40system – 0:21:00 elapsed – 0%CPU (0avgtext+0avgdata 0maxresident)k – 10/10/25 11:10:36 INFO streaming.StreamJob: map 0% reduce 0% – 10/10/25 11:10:50 INFO streaming.StreamJob: map 16% reduce 0% – 10/10/25 11:11:48 INFO streaming.StreamJob: map 33% reduce 0% – 10/10/25 11:12:10 INFO streaming.StreamJob: map 49% reduce 0% – 10/10/25 11:14:09 INFO streaming.StreamJob: map 66% reduce 0% – 10/10/25 11:14:37 INFO streaming.StreamJob: map 82% reduce 0% – 10/10/25 11:16:26 INFO streaming.StreamJob: map 83% reduce 0% – 10/10/25 11:18:12 INFO streaming.StreamJob: map 83% reduce 17% – 10/10/25 11:20:18 INFO streaming.StreamJob: map 99% reduce 17%
  • 10. Experiment 2, 1M docs, 5 slaves • Time without MapReduce – 6892.08user – 25.03system – 1:56:37 elapsed – 98%CPU (0avgtext+0avgdata 0maxresident)k • Time with MapReduce – 6.30user – 0.98system – 3:26:18 elapsed – 0%CPU (0avgtext+0avgdata 0maxresident)k – 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14% – 10/10/26 15:04:37 INFO streaming.StreamJob: map 100% reduce 16% – 10/10/26 15:04:39 INFO streaming.StreamJob: map 100% reduce 25% – 10/10/26 15:04:40 INFO streaming.StreamJob: map 100% reduce 27% – 10/10/26 15:04:42 INFO streaming.StreamJob: map 100% reduce 30% – 10/10/26 15:04:44 INFO streaming.StreamJob: map 100% reduce 32% – 10/10/26 15:04:45 INFO streaming.StreamJob: map 100% reduce 34% – 10/10/26 15:04:48 INFO streaming.StreamJob: map 100% reduce 35% – 10/10/26 15:07:29 INFO streaming.StreamJob: map 83% reduce 35% – 10/10/26 15:07:35 INFO streaming.StreamJob: map 100% reduce 35% – 10/10/26 15:09:57 INFO streaming.StreamJob: map 100% reduce 36% – 10/10/26 15:09:59 INFO streaming.StreamJob: map 100% reduce 37%
  • 11. Experiment 3, 1M docs, 3 slaves • Time without MapReduce – 6892.08user – 25.03system – 1:56:37 elapsed – 98%CPU (0avgtext+0avgdata 0maxresident)k • Time with MapReduce – 5.50user – 0.97system – 00:53:20 elapsed – 0%CPU (0avgtext+0avgdata 0maxresident)k – 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14% – 10/10/26 15:04:37 INFO streaming.StreamJob: map 100% reduce 16% – 10/10/26 15:04:39 INFO streaming.StreamJob: map 100% reduce 25% – 10/10/26 15:04:40 INFO streaming.StreamJob: map 100% reduce 27% – 10/10/26 15:04:42 INFO streaming.StreamJob: map 100% reduce 30% – 10/10/26 15:04:44 INFO streaming.StreamJob: map 100% reduce 32% – 10/10/26 15:04:45 INFO streaming.StreamJob: map 100% reduce 34% – 10/10/26 15:04:48 INFO streaming.StreamJob: map 100% reduce 35% – 10/10/26 15:07:29 INFO streaming.StreamJob: map 83% reduce 35% – 10/10/26 15:07:35 INFO streaming.StreamJob: map 100% reduce 35% – 10/10/26 15:09:57 INFO streaming.StreamJob: map 100% reduce 36% – 10/10/26 15:09:59 INFO streaming.StreamJob: map 100% reduce 37%
  • 12. What’s next? • MapReduce can be applied to many problems and natural language processing applications. Examples – Sentiment analysis. – Computing probabilities over huge data sets. – Retrieval problems. – Statistics and analysis of huge data sets. – MapReduce is not only a framework; it is also a distributed computing methodology.
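
The token-counting example on slide 8 can be sketched in plain Python as below. This is a minimal single-process illustration, not the original mapper.py and reducer.py, which are not shown in the slides. The names `tokenize`, `map_post`, `reduce_counts` and `word_count` are hypothetical, and an in-memory dictionary of posts stands in for the database. The explicit sort stands in for the sorting and grouping that the framework performs between the Map and Reduce stages.

```python
import re
from itertools import groupby


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def map_post(post_id, text):
    """Map step: emit one (token, 1) pair per token occurrence in a post."""
    for token in tokenize(text):
        yield token, 1


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts of each token.

    The pairs must already be sorted by token, because groupby only
    merges adjacent items with the same key.
    """
    for token, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield token, sum(count for _, count in group)


def word_count(posts):
    """Run map, sort and reduce in one process over a dict of id -> text."""
    mapped = [
        pair
        for post_id, text in posts.items()
        for pair in map_post(post_id, text)
    ]
    # Stand-in for the framework's sort/group phase between Map and Reduce.
    mapped.sort(key=lambda pair: pair[0])
    return dict(reduce_counts(mapped))


if __name__ == "__main__":
    # Hypothetical sample data standing in for the posts database.
    sample_posts = {1: "Hadoop runs MapReduce", 2: "MapReduce runs on Hadoop"}
    for token, count in sorted(word_count(sample_posts).items()):
        print(token, count)
```

Because each post is mapped independently of the others, the map calls could be distributed across machines without changing the result, which is the independence requirement stated on slide 6.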
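
The computation-time formula on slide 6, max(time_of_each_map) + max(time_of_each_reduce), can be illustrated with a small calculation. The sketch below assumes that all map tasks run fully in parallel, that all reduce tasks start only after the slowest map has finished, and that scheduling and data transfer cost nothing. The task durations and function names are hypothetical, chosen only to show the arithmetic.

```python
def estimated_parallel_time(map_times, reduce_times):
    """Slide 6 formula: slowest map task plus slowest reduce task."""
    return max(map_times) + max(reduce_times)


def sequential_time(map_times, reduce_times):
    """Time if every task ran one after another on a single machine."""
    return sum(map_times) + sum(reduce_times)


if __name__ == "__main__":
    # Hypothetical task durations in seconds (four maps, one reduce).
    map_times = [40.0, 55.0, 35.0, 50.0]
    reduce_times = [20.0]
    print("parallel estimate:", estimated_parallel_time(map_times, reduce_times))
    print("sequential total:", sequential_time(map_times, reduce_times))
```

The formula is an idealized lower bound. Real jobs also pay for startup, scheduling and moving data between stages, which the measurements on slides 9 to 11 suggest can be significant.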
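
The elapsed times reported on slides 9 to 11 can be compared directly. The sketch below converts each h:mm:ss value to seconds and divides the time without MapReduce by the time with it, so a ratio below 1 means the MapReduce run was slower. The helper names are hypothetical; the times are copied from the slides.

```python
def to_seconds(elapsed):
    """Convert an 'h:mm:ss' or 'mm:ss' string to a number of seconds."""
    seconds = 0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def speedup(without_mapreduce, with_mapreduce):
    """Ratio of baseline elapsed time to MapReduce elapsed time."""
    return to_seconds(without_mapreduce) / to_seconds(with_mapreduce)


# Elapsed times as reported on slides 9, 10 and 11.
EXPERIMENTS = {
    "Experiment 1 (100K docs, 5 slaves)": ("0:14:32", "0:21:00"),
    "Experiment 2 (1M docs, 5 slaves)": ("1:56:37", "3:26:18"),
    "Experiment 3 (1M docs, 3 slaves)": ("1:56:37", "00:53:20"),
}

if __name__ == "__main__":
    for name, (baseline, mapreduce) in EXPERIMENTS.items():
        print(f"{name}: speedup {speedup(baseline, mapreduce):.2f}x")
```

By these figures only Experiment 3 finishes faster with MapReduce, at roughly 2.2 times the baseline speed. Experiments 1 and 2 take longer than the single-machine runs.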