Introduction to Hadoop (part 2)
Vienna Technical University - IT Services (TU.it)
Dr. Giovanna Roda
Some advanced topics
▶ the YARN resource manager
▶ fault tolerance and HDFS Erasure Coding
▶ the mrjob library
▶ HDFS I/O benchmarking
▶ MapReduce spilling
YARN
Hadoop jobs are managed by YARN (Yet Another Resource Negotiator), which is responsible
for allocating resources and managing job scheduling.
Basic resource types are:
▶ memory (memory-mb)
▶ virtual cores (vcores)
YARN supports an extensible resource model that lets you define arbitrary countable resources. A
countable resource is a resource that is consumed while a container is running and released
afterwards. Such a resource can be, for instance:
▶ GPU (gpu)
YARN
Each job submitted to YARN is assigned:
▶ a container, an abstract entity that bundles resources such as memory, CPU,
disk, network, etc. Container resources are allocated by the Scheduler.
▶ an ApplicationMaster service, assigned by the Applications Manager, which monitors
the progress of the job and restarts tasks if needed.
YARN
The main idea of YARN is to have two distinct daemons for job monitoring and scheduling,
one global and one local to each application:
▶ the Resource Manager, the global job manager, consisting of:
▶ Scheduler → responsible for allocating resources across all applications
▶ Applications Manager → accepts job submissions, restarts Application Masters on
failure
▶ an Application Master for each application, a daemon responsible for negotiating
resources, monitoring the status of the job, and restarting failed tasks.
YARN architecture
Source: Apache Software Foundation
YARN Schedulers
YARN supports two types of schedulers by default:
▶ Capacity Scheduler
shares resources providing capacity guarantees → allows scheduler determinism
▶ Fair Scheduler
all apps get, on average, an equal share of resources over time. Fair sharing allows one
app to use the entire cluster if it’s the only one running.
By default, fairness is based on memory, but this can be configured to include other
resources as well.
YARN Dynamic Resource Pools
YARN supports Dynamic Resource Pools (or dynamic queues).
Each job is assigned to a queue (the queue can be determined, for instance, by the user’s Unix
primary group).
Queues are assigned a weight, and resources are split among queues in proportion to their weights.
For instance, if you have one queue with weight 20 and three other queues with weight 10, the
first queue will get 40%, and each of the other queues 20%, of the available resources because:
1 · 20x + 3 · 10x = 100, hence x = 2, giving shares of 40% and 20%.
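The weight arithmetic above can be checked with a short Python sketch (pool_shares is a hypothetical helper written for illustration, not part of YARN):

```python
def pool_shares(weights):
    """Split 100% of cluster resources among queues in proportion to their weights."""
    total = sum(weights)
    return [100 * w / total for w in weights]

# One queue with weight 20 and three queues with weight 10, as in the example.
shares = pool_shares([20, 10, 10, 10])
print(shares)  # [40.0, 20.0, 20.0, 20.0]
```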
Testing HDFS I/O throughput with TestDFSIO
TestDFSIO is a tool included in the Hadoop software distribution for measuring read and write
performance of the HDFS filesystem.
It’s a useful tool for assessing the performance of your Hadoop filesystem, identifying performance
bottlenecks, and supporting decisions when tuning your HDFS configuration.
TestDFSIO uses MapReduce to write and read files on the HDFS filesystem; the reducer is used
to collect and summarize test data.
Testing HDFS I/O throughput with TestDFSIO
Note: you won’t be able to run TestDFSIO on a cluster unless you’re the cluster’s
administrator. It is, of course, recommended to run TestDFSIO tests when the Hadoop
cluster is not otherwise in use.
To run tests as a non-superuser you need read/write access to
/benchmarks/TestDFSIO on HDFS, or else you must specify another directory (on HDFS) where to
write and read data with the option -D test.build.data=myDir.
Testing HDFS I/O throughput with TestDFSIO
TestDFSIO writes and reads files on HDFS, spawning exactly one mapper per file.
These are the main options for running stress tests:
▶ -write to run write tests
▶ -read to run read tests
▶ -nrFiles the number of files (set to be equal to the number of mappers)
▶ -fileSize size of files (followed by B|KB|MB|GB|TB)
TestDFSIO: run a write test
Run a test with 80 files (remember: this is the number of mappers), each of size 10GB.

JARFILE=/opt/pkg/software/Hadoop/2.10.0-GCCcore-8.3.0-native/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.10.0-tests.jar

yarn jar $JARFILE TestDFSIO -write -nrFiles 80 -fileSize 10GB

The jar file needed to run TestDFSIO is hadoop-mapreduce-client-jobclient*tests.jar; its location depends on your Hadoop installation.
TestDFSIO: examine test results
Test results are appended by default to the file TestDFSIO_results.log. The log file can
be changed with the option -resFile resultFileName.
The main measurements reported by TestDFSIO are:
▶ throughput in mb/sec
▶ average IO rate in mb/sec
▶ standard deviation of the IO rate
▶ test execution time
TestDFSIO: examine test results
Main units of measure:
▶ throughput or data transfer rate measures the amount of data read from or written to
the filesystem per unit time (expressed in megabytes per second, MB/s).
▶ IO rate, also abbreviated IOPS, measures IO operations per second, i.e. the number of
read or write operations that can be performed in one second.
Throughput and IO rate are related by:
average IO size × IOPS = throughput in MB/s
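The relation can be illustrated numerically; the figures below are made up for illustration only:

```python
def throughput_mb_s(avg_io_size_mb, iops):
    """Throughput (MB/s) = average IO size (MB) x IO operations per second."""
    return avg_io_size_mb * iops

# e.g. a 4 MB average IO size at 25 operations per second gives 100 MB/s
print(throughput_mb_s(4, 25))  # 100
```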
TestDFSIO: examine test results
Sample output of a test write run:
----- TestDFSIO ----- : write
Date & time: Sun Sep 13 16:14:39 CEST 2020
Number of files: 80
Total MBytes processed: 819200
Throughput mb/sec: 30.64
Average IO rate mb/sec: 34.44
IO rate std deviation: 17.45
Test exec time sec: 429.73
TestDFSIO: examine test results
Concurrent versus overall throughput
The reported throughput is the average throughput across all map tasks. To get an
approximate overall throughput on the cluster, divide the total MBytes by the test
execution time in seconds.
In our test the overall write throughput on the cluster is:
819200 / 429.73 ≈ 1906 MB/s
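The same computation in Python, using the numbers from the sample run above:

```python
total_mbytes = 819200   # "Total MBytes processed" from the sample output
exec_time_s = 429.73    # "Test exec time sec" from the sample output

# Approximate overall cluster write throughput in MB/s
overall_throughput = total_mbytes / exec_time_s
print(round(overall_throughput))  # 1906
```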
TestDFSIO: some things to try
▶ run a read test using the same command used for the write test, with the option
-read in place of -write.
▶ use the option -D dfs.replication=1 to measure the effect of replication on I/O
performance:

yarn jar $JARFILE TestDFSIO -D dfs.replication=1 \
    -write -nrFiles 80 -fileSize 10GB
TestDFSIO: remove data after completing tests
When done with testing, remove the temporary files with:

yarn jar $JARFILE TestDFSIO -clean

If you used a custom myDir output directory, use:

yarn jar $JARFILE TestDFSIO -clean -D test.build.data=myDir

Further reading: the TestDFSIO documentation and the source code in TestDFSIO.java.
The mrjob library
A typical data processing pipeline consists of multiple steps that need to run in sequence.
mrjob is a Python library that offers a convenient framework for writing multi-step
MapReduce jobs in pure Python, testing them on your local machine, and running them on a cluster.
The mrjob library: how to define a MapReduce job
A job is defined by a class that inherits from MRJob. This class contains methods that define
the steps of your job.
A “step” consists of a mapper, a combiner, and a reducer.
Step-by-step tutorials and documentation can be found at https://mrjob.readthedocs.io/
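The mapper/combiner/reducer semantics of a single step can be sketched in plain Python (no Hadoop involved; run_step is a hypothetical helper that mimics the shuffle-and-sort phase between map and reduce):

```python
from itertools import groupby

def run_step(records, mapper, reducer):
    """Simulate one MapReduce step: map each record, group by key, reduce each group."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=lambda kv: kv[0])  # shuffle & sort: bring equal keys together
    return [reducer(k, [v for _, v in grp])
            for k, grp in groupby(mapped, key=lambda kv: kv[0])]

# Word count on two "lines" of input
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    return (key, sum(values))

print(run_step(["a b a", "b c"], mapper, reducer))  # [('a', 2), ('b', 2), ('c', 1)]
```

A combiner would run the same reduce logic on each mapper's local output before the shuffle, cutting down the data transferred between phases.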
The mrjob library: a sample MapReduce job
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
The mrjob library: a sample MapReduce job
Save the script as mr_wordcount.py and run it with:
python mr_wordcount.py file.txt
This command will return the frequencies of characters, words, and lines in file.txt.
To get aggregated frequency counts for multiple files use:
python mr_wordcount.py file1.txt file2.txt …
The mrjob library: a sample MapReduce pipeline
To define a succession of MapReduce jobs, override the steps() method to return a list of steps (MRStep is imported from mrjob.step):

def steps(self):
    return [
        MRStep(mapper=self.mapper_get_words,
               combiner=self.combiner_count_words,
               reducer=self.reducer_count_words),
        MRStep(reducer=self.reducer_find_max_word)
    ]
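What the two steps compute can be traced in plain Python (this is a simulation for illustration, not mrjob code; the function names mirror the hypothetical methods referenced in the steps above):

```python
from collections import Counter

def step1_count_words(lines):
    """Step 1 (mapper_get_words + combiner/reducer_count_words): (word, count) pairs."""
    counts = Counter(word for line in lines for word in line.split())
    return list(counts.items())

def step2_find_max_word(word_counts):
    """Step 2 (reducer_find_max_word): the word with the highest count."""
    return max(word_counts, key=lambda wc: wc[1])

pairs = step1_count_words(["a b a", "b a c"])
print(step2_find_max_word(pairs))  # ('a', 3)
```

In mrjob, step 1's output pairs become step 2's input, exactly as the intermediate list does here.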
The mrjob library: run offline and on a cluster
Note that up to now we ran everything locally, on our own machine. To run the script on a
cluster, use the option -r hadoop. This sets Hadoop as the target runner.
Alternatively:
▶ -r emr for AWS EMR clusters
▶ -r dataproc for Google Cloud Dataproc
THANK YOU FOR YOUR ATTENTION
www.prace-ri.eu