Big Data with Hadoop & Spark Training: http://bit.ly/2sh5b3E
This CloudxLab Hadoop Streaming tutorial helps you to understand Hadoop Streaming in detail. Below are the topics covered in this tutorial:
1) Hadoop Streaming and Why Do We Need it?
2) Writing Streaming Jobs
3) Testing Streaming jobs and Hands-on on CloudxLab
2. MapReduce
Why Hadoop Streaming?
Why?
● Java MapReduce code is cumbersome
● Reuse legacy code as mapper or reducer
● Many non-Java programmers
It is a Hadoop Library which makes it possible to use any binary as
mapper or reducer
3. MapReduce
What Hadoop Streaming is not:
● Real-time data processing
● Continuously running a process
(Unbounded Data Processing)
4. MapReduce
MAP / REDUCE
A Hadoop Library which makes it possible to use any binary as
mapper or reducer
Mapper - gives out key<tab>value per line
Streaming Job
5. MapReduce
MAP / REDUCE
A Hadoop Library which makes it possible to use any binary as
mapper or reducer
Mapper - gives out key<tab>value per line
Reducer - gets sorted but ungrouped key<tab>value lines
Streaming Job
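The mapper contract above (one key<tab>value pair per output line) can be sketched as a one-line wordcount-style mapper. The awk one-liner here is an illustration, not part of the tutorial's job:

```shell
# Sketch of a streaming mapper honoring the key<TAB>value contract:
# emit "word<TAB>1" for every word read from stdin.
awk '{ for (i = 1; i <= NF; i++) printf "%s\t1\n", $i }'
```

Piping a line such as "sa re sa" into it emits one "word<TAB>1" line per word.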
6. MapReduce
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar
-input /data/mr/wordcount/input -output wordcount_output_unix
-mapper 'sed "s/ /\n/g"' -reducer "uniq -c"
MAP / REDUCE
A Hadoop Library which makes it possible to use any binary as
mapper or reducer
Mapper - gives out key<tab>value per line
Reducer - gets sorted but ungrouped key<tab>value lines
Streaming Job
'sed "s/ /\n/g"' is our mapper; "uniq -c" is the reducer
8. MapReduce
MAP / REDUCE Streaming Job

Mapper: replace each space with a new line - sed 's/ /\n/g'
Reducer: count duplicates - uniq -c

Input line:      sa re sa ga
Mapper output:   sa / re / sa / ga   (one word per line)
After sort:      ga / re / sa / sa
Reducer output:  1 ga / 1 re / 2 sa
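The flow above can be dry-run locally with plain Unix tools, no Hadoop needed (GNU sed assumed, since it interprets \n in the replacement):

```shell
# Local dry run of the streaming wordcount: mapper | sort | reducer
printf 'sa re sa ga\n' | sed 's/ /\n/g' | sort | uniq -c
```

This prints "1 ga", "1 re", "2 sa" (uniq -c left-pads the counts).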
9. MapReduce
MAP / REDUCE Streaming Job - More Reducers

Mapper: replace each space with a new line - sed 's/ /\n/g'

Input line:      sa re sa ga
Mapper output:   sa / re / sa / ga   (one word per line)
Partitioned by key across two reducers, each running uniq -c:
  Reducer 1 gets: ga / re  ->  1 ga / 1 re
  Reducer 2 gets: sa / sa  ->  2 sa
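A rough local simulation of two reducers: mapper output is partitioned by key, and each partition is fed to its own uniq -c. The a-m / n-z split below is a toy stand-in, not Hadoop's real default (the HashPartitioner, key.hashCode() % numReduceTasks):

```shell
# Toy two-reducer simulation: partition sorted mapper output by key range,
# then run the reducer (uniq -c) on each partition separately.
printf 'sa re sa ga\n' | sed 's/ /\n/g' | sort |
  awk '{ if ($0 < "n") print > "part-0"; else print > "part-1" }'
uniq -c part-0    # reducer 0's share
uniq -c part-1    # reducer 1's share
rm -f part-0 part-1
```

Each reducer produces its own part file, just as in the real job's output directory.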
10. MapReduce
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar
-D mapred.reduce.tasks=2
-input /data/mr/wordcount/input/
-output wordcount_clean_unix
-mapper ./mycmd.sh
-reducer 'uniq -c'
-file mycmd.sh
MAP / REDUCE Streaming Job
#!/bin/bash
# mycmd.sh - clean up further
sed -r 's/[ \t]+/\n/g' | sed 's/[^a-zA-Z0-9]//g' | tr 'A-Z' 'a-z'
Ship a script
Multiple Reducer Argument: -D mapred.reduce.tasks=2
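Before shipping the script, its pipeline can be checked on one messy sample line (GNU sed assumed for -r and the \t / \n escapes):

```shell
# mycmd.sh's cleanup chain, run locally: split on whitespace,
# strip non-alphanumeric characters, lowercase.
printf 'Sa, Re!\tsa\n' | sed -r 's/[ \t]+/\n/g' | sed 's/[^a-zA-Z0-9]//g' | tr 'A-Z' 'a-z'
```

The sample line comes out as three clean tokens: sa, re, sa.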
12. MapReduce
MAP / REDUCE
Notes
• OK to have no reducer
• hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
  -input sgiri/wordcount/input/ -output sgiri/wordcount/output21fe32/
  -mapper ./mycmd.sh -file mycmd.sh
• If there is no reducer and you don't want sorting
  • use -D mapred.reduce.tasks=0
  • The number of map tasks then decides the number of output files
• Number of Maps
  • A function of the number of InputSplits
  • Hint via conf.setNumMapTasks(int num) or -D mapred.map.tasks=1
13. MapReduce
MAP / REDUCE
Number of Reduces
More reducers mean
• Faster completion through better load balancing
• More framework overhead
• Lower cost of each failure
Rule of thumb: (0.95 to 1.75) * (Max Tasks)
• Max Tasks =
  No. of nodes * max reduce tasks run simultaneously per tasktracker
  (mapreduce.tasktracker.reduce.tasks.maximum, e.g. = 2)
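Plugging numbers into the rule of thumb, for a hypothetical 10-node cluster with 2 reduce slots per tasktracker:

```shell
# Rule of thumb: reducers = (0.95 to 1.75) * nodes * reduce slots per tasktracker.
# The 10-node, 2-slot cluster below is hypothetical.
nodes=10
slots=2
max_tasks=$((nodes * slots))
low=$(awk -v m="$max_tasks" 'BEGIN { printf "%d", 0.95 * m }')
high=$(awk -v m="$max_tasks" 'BEGIN { printf "%d", 1.75 * m }')
echo "aim for roughly $low to $high reducers"
```

So such a cluster would be configured with somewhere between 19 and 35 reducers.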
14. MapReduce
● First test on very small data
○ Random Sample data
● Separately Test Mapper and Reducer
● Streaming jobs can be tested with a simple Unix pipeline:
○ cat inputfile | mymapper | sort | myreducer > outputfile
MAP / REDUCE
Testing
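Instantiating that test pipeline for the wordcount job, with mycmd.sh's cleanup chain as the mapper and uniq -c as the reducer (GNU sed assumed):

```shell
# cat inputfile | mymapper | sort | myreducer > outputfile, with real commands
printf 'Sa re sa Ga\n' > inputfile
cat inputfile \
  | sed -r 's/[ \t]+/\n/g' | sed 's/[^a-zA-Z0-9]//g' | tr 'A-Z' 'a-z' \
  | sort | uniq -c > outputfile
cat outputfile
rm -f inputfile outputfile
```

If the local run looks right, the same mapper and reducer can be handed to hadoop-streaming.jar unchanged.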