Benchmarking 
Hadoop & Big Data benchmarking 
Dr. ir. ing. Bart Vandewoestyne 
Sizing Servers Lab, Howest, Kortrijk 
IWT TETRA User Group Meeting - November 28, 2014 
Outline 
1 Intro: Hadoop essentials 
2 Cloudera demo 
3 Benchmarks 
Micro Benchmarks 
BigBench 
4 Conclusions 
Intro: Hadoop essentials 
Hadoop 
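Hadoop is VMware, but the other way around: VMware makes one physical machine look like many, whereas Hadoop makes many machines work together as one system.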
Hadoop 1.0 
Source: Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014.
MapReduce and HDFS are the core components, while other components are built around the core.
Hadoop 2.0 
Source: Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014.
YARN adds a more general interface to run non-MapReduce jobs within the Hadoop framework.
HDFS 
Hadoop Distributed File System 
Source: http://www.cac.cornell.edu/vw/MapReduce/dfs.aspx 
MapReduce 
MapReduce = Programming Model 
WordCount example: 
Source: Optimizing Hadoop for MapReduce, Khaled Tannir 
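The WordCount idea can be sketched in plain Python to make the three phases of the programming model (map, shuffle, reduce) visible. This is only an in-memory illustration and not Hadoop code; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values of each key to obtain the final counts."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in sequence on a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


print(word_count(["Hello Hadoop", "Hello HDFS"]))
```

On a real cluster the map and reduce functions run as distributed tasks and the shuffle is done by the framework; here everything happens in one process.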
Hadoop distributions 
Cloudera demo 
HDFS 
NameNode and DataNodes 
Hosts and their roles 
NameNode WebUI 
NameNode WebUI address 
http://sandy-quad-1.sslab.lan:50070/ 
Replication factor 
HDFS Blocks 
Hue: file upload
Hadoop jobs: counters/metrics 
Benchmarks 
Why benchmark? 
My three reasons for using benchmarks:
1 Evaluating the effect of a hardware/software upgrade:
  OS, Java VM, ...
  Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
2 Debugging:
  Compare with other clusters or published results.
3 Performance tuning:
  E.g. Cloudera CDH default config is defensive, not optimal.
Micro Benchmarks 
Hadoop: Available tests
Running the test jar without arguments prints the list of available test programs:
$ hadoop jar /some/path/to/hadoop-*test*.jar
TestDFSIO 
Read and write test for HDFS.
Helpful for
  getting an idea of how fast your cluster is in terms of I/O,
  stress testing HDFS,
  discovering network performance bottlenecks,
  shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes).
TestDFSIO: write test 
Generate 10 files of size 1 GB for a total of 10 GB:
$ hadoop jar hadoop-*test*.jar \
    TestDFSIO -write -nrFiles 10 -fileSize 1000
TestDFSIO is designed to use 1 map task per file
(1:1 mapping from files to map tasks).
TestDFSIO: write test output 
Typical output of write test 
----- TestDFSIO ----- : write 
Date & time: Mon Oct 06 10:21:28 CEST 2014
Number of files: 10 
Total MBytes processed: 10000.0 
Throughput mb/sec: 12.874702111579893 
Average IO rate mb/sec: 13.013071060180664 
IO rate std deviation: 1.4416050051562712 
Test exec time sec: 114.346 
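As a rough aid for reading such a report, the sketch below parses the numeric lines and compares the reported throughput with a simple wall-clock rate (total MB divided by the test execution time). The helper name `parse_testdfsio` and the dictionary keys are assumptions made for this example; they are not part of Hadoop.

```python
REPORT = """\
----- TestDFSIO ----- : write
Number of files: 10
Total MBytes processed: 10000.0
Throughput mb/sec: 12.874702111579893
Average IO rate mb/sec: 13.013071060180664
IO rate std deviation: 1.4416050051562712
Test exec time sec: 114.346
"""

# Report labels mapped to short, hypothetical key names.
WANTED = {
    "Number of files": "files",
    "Total MBytes processed": "total_mb",
    "Throughput mb/sec": "throughput",
    "Average IO rate mb/sec": "avg_io_rate",
    "IO rate std deviation": "io_rate_std",
    "Test exec time sec": "exec_time_s",
}


def parse_testdfsio(report):
    """Turn the 'label: value' lines of a TestDFSIO report into floats."""
    result = {}
    for line in report.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() in WANTED:
            result[WANTED[label.strip()]] = float(value)
    return result


stats = parse_testdfsio(REPORT)
# Wall-clock rate of the whole job: all data divided by the elapsed time.
wall_clock_rate = stats["total_mb"] / stats["exec_time_s"]
print(f"reported throughput: {stats['throughput']:.2f} MB/s")
print(f"wall-clock rate:     {wall_clock_rate:.2f} MB/s")
```

The two numbers measure different things: the reported throughput is computed from per-task I/O times, while the wall-clock rate covers the whole job including its start-up overhead, so they should not be expected to match.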
Interpreting TestDFSIO results 
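Definition (Throughput)
\[
\mathrm{Throughput}(N) = \frac{\sum_{i=1}^{N} \mathrm{filesize}_i}{\sum_{i=1}^{N} \mathrm{time}_i}
\]
Definition (Average IO rate)
\[
\mathrm{Average\ IO\ rate}(N) = \frac{\sum_{i=1}^{N} \mathrm{rate}_i}{N} = \frac{\sum_{i=1}^{N} \frac{\mathrm{filesize}_i}{\mathrm{time}_i}}{N}
\]
Here, N is the number of map tasks; filesize_i and time_i are the file size and the I/O time of map task i.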
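The difference between the two definitions is easiest to see with numbers. The sketch below implements both; the per-task sizes and times are invented for illustration and do not come from a real run.

```python
def throughput(file_sizes_mb, times_s):
    """Total data divided by the total of the per-task I/O times."""
    return sum(file_sizes_mb) / sum(times_s)


def average_io_rate(file_sizes_mb, times_s):
    """Mean of the individual per-task rates (filesize_i / time_i)."""
    rates = [size / t for size, t in zip(file_sizes_mb, times_s)]
    return sum(rates) / len(rates)


# Hypothetical run: three map tasks, 1000 MB each, with uneven I/O times.
sizes = [1000.0, 1000.0, 1000.0]
times = [50.0, 100.0, 200.0]

print(f"throughput:      {throughput(sizes, times):.2f} MB/s")
print(f"average IO rate: {average_io_rate(sizes, times):.2f} MB/s")
```

When all tasks take the same time the two values coincide; the more the per-task times differ, the further the average IO rate moves away from the throughput, which is why the report also prints the standard deviation of the IO rate.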
TestDFSIO: read test 
Read 10 input files, each of size 1 GB:
$ hadoop jar hadoop-*test*.jar \
    TestDFSIO -read -nrFiles 10 -fileSize 1000
TestDFSIO: read test output 
Typical output of read test 
----- TestDFSIO ----- : read 
Date & time: Mon Oct 06 10:56:15 CEST 2014
Number of files: 10 
Total MBytes processed: 10000.0 
Throughput mb/sec: 402.4306813151435 
Average IO rate mb/sec: 492.8257751464844 
IO rate std deviation: 196.51233829270575 
Test exec time sec: 33.206 
Influence of HDFS replication factor
When interpreting TestDFSIO results, keep in mind:
The HDFS replication factor plays an important role!
A higher replication factor leads to slower writes.
For three identical TestDFSIO write runs (units are MB/s):

                   HDFS replication factor
                   1          2        3
Throughput         190        25       13
Average IO-rate    190 ± 10   25 ± 3   13 ± 1
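To put the throughput row in perspective, the short sketch below expresses each run as a slowdown factor relative to replication factor 1. The throughput values are the ones from the table; the variable names are only illustrative.

```python
# Write throughput in MB/s per HDFS replication factor (from the table).
throughput_mb_s = {1: 190, 2: 25, 3: 13}

baseline = throughput_mb_s[1]
# Slowdown relative to a single replica: baseline / measured throughput.
slowdown = {factor: baseline / value for factor, value in throughput_mb_s.items()}

for factor, ratio in sorted(slowdown.items()):
    print(f"replication {factor}: {ratio:.1f}x slower than replication 1")
```

On this cluster the write slowdown grows much faster than the number of replicas, so write results are only comparable between runs that use the same replication factor.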
TeraSort 
Goal
Sort 1 TB of data (or any other amount of data) as fast as possible.
Probably the best-known Hadoop benchmark.
Combines testing the HDFS and MapReduce layers of a Hadoop cluster.
Typical areas where TeraSort is helpful
Iron out your Hadoop configuration after your cluster has first passed a convincing TestDFSIO benchmark.
Determine whether your MapReduce-related parameters are set to proper values.
TeraSort: workflow
TeraGen -> /user/bart/terasort-input -> TeraSort -> /user/bart/terasort-output -> TeraValidate -> /user/bart/terasort-validate
hadoop jar hadoop-mapreduce-examples.jar \
    teragen 10000000000 /user/bart/input
4 hours on our 4-node cluster.
hadoop jar hadoop-mapreduce-examples.jar \
    terasort /user/bart/input /user/bart/output
5 hours on our 4-node cluster.
hadoop jar hadoop-mapreduce-examples.jar \
    teravalidate /user/bart/output /user/bart/validate
If something went wrong, TeraValidate's output contains the problem report.
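The first argument of teragen is a number of rows, not a size in bytes. TeraGen writes fixed-size rows of 100 bytes, so 10000000000 rows correspond to 1 TB. The sketch below does that conversion in both directions; the helper names `teragen_rows` and `dataset_bytes` are made up for this example.

```python
ROW_SIZE_BYTES = 100  # TeraGen generates fixed-size 100-byte rows


def teragen_rows(target_bytes):
    """Number of rows to pass to teragen for a data set of target_bytes."""
    return target_bytes // ROW_SIZE_BYTES


def dataset_bytes(rows):
    """Size in bytes of the data set produced by teragen for `rows` rows."""
    return rows * ROW_SIZE_BYTES


print(dataset_bytes(10_000_000_000))  # size of the run shown in the slides
print(teragen_rows(10**9))            # rows for a quick 1 GB trial run
```

Starting with a much smaller row count is a cheap way to check paths and permissions before committing to a multi-hour run.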
TeraSort: duration 
TeraSort: counters 
NNBench 
Goal 
Load test the NameNode hardware and software. 
Generates a lot of HDFS-related requests with normally very 
small payloads. 
Purpose: put a high HDFS management stress on the 
NameNode. 
Can simulate requests for creating, reading, renaming and deleting files on HDFS.
NNBench: example 
Create 1000 files using 12 maps and 6 reducers:
$ hadoop jar hadoop-*test*.jar nnbench \
    -operation create_write \
    -maps 12 \
    -reduces 6 \
    -blockSize 1 \
    -bytesToWrite 0 \
    -numberOfFiles 1000 \
    -replicationFactorPerFile 3 \
    -readFileAfterOpen true \
    -baseDir /user/bart/NNBench-`hostname -s`
MRBench 
Goal 
Loop a small job a number of times.
  checks whether small job runs are responsive and running efficiently on the cluster
  complementary to TeraSort
  puts its focus on the MapReduce layer
  impact on the HDFS layer is very limited
MRBench: example 
Run a loop of 50 small test jobs:
$ hadoop jar hadoop-*test*.jar \
    mrbench -baseDir /user/bart/MRBench \
    -numRuns 50
Example output:
DataLines  Maps  Reduces  AvgTime (milliseconds)
1          2     1        28822
→ The average finish time of the executed jobs was about 28.8 seconds.
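When MRBench is run repeatedly, for example before and after a configuration change, it helps to pull the average time out of the result lines automatically. The sketch below parses the two-line summary shown above; `parse_mrbench` is a hypothetical helper, and it assumes the summary has exactly this header and one row of integer values.

```python
MRBENCH_OUTPUT = """\
DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 28822
"""


def parse_mrbench(output):
    """Map the four column names of the MRBench summary to integer values."""
    header_line, value_line = output.strip().splitlines()[:2]
    names = header_line.split()[:4]  # drop the "(milliseconds)" unit token
    values = [int(v) for v in value_line.split()]
    return dict(zip(names, values))


result = parse_mrbench(MRBENCH_OUTPUT)
avg_seconds = result["AvgTime"] / 1000
print(f"average job time: {avg_seconds:.1f} s over runs with "
      f"{result['Maps']} maps and {result['Reduces']} reduces")
```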
BigBench 
Source: http://mhpersonaltrainer.mhpersonaltrainer.com/mhpersonaltrainer/56616/index 
Big Data benchmark based on TPC-DS. 
Focus is mostly on MapReduce engines. 
Collaboration between industry and academia. 
https://github.com/intel-hadoop/Big-Bench/ 
History 
Launched at First Workshop on Big Data Benchmarking 
(May 8-9, 2012). 
Full kit at Fifth Workshop on Big Data Benchmarking 
(August 5-6, 2014). 
BigBench data model 
Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013. 
BigBench: Data Model - 3 V's 
Variety
BigBench data is
  structured,
  semi-structured,
  unstructured.
Velocity
Periodic refreshes for all data.
Different velocity for different areas:
$V_{\mathrm{structured}} < V_{\mathrm{unstructured}} < V_{\mathrm{semistructured}}$
Volume
TPC-DS: discrete scale factors
(100, 300, 1000, 3000, 10000, 30000 and 100000).
BigBench: continuous scale factor.
BigBench: Workload 
Workload queries 
30 queries 
Speci

Contenu connexe

Tendances

Tendances (20)

Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Apache Hive - Introduction
Apache Hive - IntroductionApache Hive - Introduction
Apache Hive - Introduction
 
ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei Milovidov
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
HDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once SemanticsHDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once Semantics
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architecture
 

En vedette

Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
DataWorks Summit
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
datasalt
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 

En vedette (17)

Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
Introduction to Apache Hadoop in Persian - آشنایی با هدوپ
Introduction to Apache Hadoop in Persian - آشنایی با هدوپIntroduction to Apache Hadoop in Persian - آشنایی با هدوپ
Introduction to Apache Hadoop in Persian - آشنایی با هدوپ
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Hadoop - Lessons Learned
Hadoop - Lessons LearnedHadoop - Lessons Learned
Hadoop - Lessons Learned
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Hadoop
HadoopHadoop
Hadoop
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 

Similaire à Hadoop & Big Data benchmarking

Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
Cloudera, Inc.
 
Hadoop tutorial hand-outs
Hadoop tutorial hand-outsHadoop tutorial hand-outs
Hadoop tutorial hand-outs
pardhavi reddy
 
HadoopDB a major step towards a dead end
HadoopDB a major step towards a dead endHadoopDB a major step towards a dead end
HadoopDB a major step towards a dead end
thkoch
 
Measurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNetMeasurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNet
Vasyl Senko
 
Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Dev...
Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Dev...Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Dev...
Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Dev...
inovex GmbH
 

Similaire à Hadoop & Big Data benchmarking (20)

Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 
LAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96BoardsLAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96Boards
 
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
H04502048051
H04502048051H04502048051
H04502048051
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2
 
Hadoop Internals
Hadoop InternalsHadoop Internals
Hadoop Internals
 
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop ClustersWBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Spark to DocumentDB connector
Spark to DocumentDB connectorSpark to DocumentDB connector
Spark to DocumentDB connector
 
Hadoop tutorial hand-outs
Hadoop tutorial hand-outsHadoop tutorial hand-outs
Hadoop tutorial hand-outs
 
HadoopDB a major step towards a dead end
HadoopDB a major step towards a dead endHadoopDB a major step towards a dead end
HadoopDB a major step towards a dead end
 
BDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBenchBDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBench
 
2014 hadoop wrocław jug
2014 hadoop   wrocław jug2014 hadoop   wrocław jug
2014 hadoop wrocław jug
 
Challenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineChallenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop Engine
 
Measurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNetMeasurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNet
 
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
 
Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Dev...
Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Dev...Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Dev...
Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Dev...
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
 

Dernier

Dernier (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Hadoop & Big Data benchmarking

  • 1. Benchmarking Hadoop & Big Data benchmarking Dr. ir. ing. Bart Vandewoestyne Sizing Servers Lab, Howest, Kortrijk IWT TETRA User Group Meeting - November 28, 2014 1 / 62
  • 2. Benchmarking Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 2 / 62
  • 3. Benchmarking Intro: Hadoop essentials Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 3 / 62
  • 4. Benchmarking Intro: Hadoop essentials Hadoop Hadoop is VMware, but the other way around. 4 / 62
  • 5. Benchmarking Intro: Hadoop essentials Hadoop 1.0 Source: Apache Hadoop YARN : moving beyond MapReduce and batch processing with Apache Hadoop 2, Hortonworks, 2014) MapReduce and HDFS are the core components, while other components are built around the core. 5 / 62
  • 6. Benchmarking Intro: Hadoop essentials Hadoop 2.0 Source: Apache Hadoop YARN : moving beyond MapReduce and batch processing with Apache Hadoop 2, Hortonworks, 2014) YARN adds a more general interface to run non-MapReduce jobs within the Hadoop framework. 6 / 62
  • 7. Benchmarking Intro: Hadoop essentials HDFS Hadoop Distributed File System Source: http://www.cac.cornell.edu/vw/MapReduce/dfs.aspx 7 / 62
  • 8. Benchmarking Intro: Hadoop essentials MapReduce MapReduce = Programming Model WordCount example: Source: Optimizing Hadoop for MapReduce, Khaled Tannir 8 / 62
  • 9. Benchmarking Intro: Hadoop essentials Hadoop distributions 9 / 62
  • 10. Benchmarking Cloudera demo Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 10 / 62
  • 12. Benchmarking Cloudera demo NameNode and DataNodes 12 / 62
  • 13. Benchmarking Cloudera demo Hosts and their roles 13 / 62
  • 14. Benchmarking Cloudera demo NameNode WebUI NameNode WebUI address http://sandy-quad-1.sslab.lan:50070/ 14 / 62
  • 15. Benchmarking Cloudera demo Replication factor 15 / 62
  • 16. Benchmarking Cloudera demo HDFS Blocks 16 / 62
  • 18. le upload 17 / 62
  • 19. Benchmarking Cloudera demo Hadoop jobs: counters/metrics 18 / 62
  • 20. Benchmarking Benchmarks Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 19 / 62
  • 21. Benchmarking Benchmarks Why benchmark? My three reasons for using benchmarks: 1 Evaluating the eect of a hardware/software upgrade: OS, Java VM,. . . Hadoop, Cloudera CDH, Pig, Hive, Impala,. . . 2 Debugging: Compare with other clusters or published results. 3 Performance tuning: E.g. Cloudera CDH default con
  • 22. g is defensive, not optimal. 20 / 62
  • 23. Benchmarking Benchmarks Micro Benchmarks Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 21 / 62
  • 24. Benchmarking Benchmarks Micro Benchmarks Hadoop: Available tests hadoop jar /some/path/to/hadoop-*test*.jar 22 / 62
  • 25. Benchmarking Benchmarks Micro Benchmarks TestDFSIO Read and write test for HDFS. Helpful for getting an idea of how fast your cluster is in terms of I/O, stress testing HDFS, discover network performance bottlenecks, shake out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes). 23 / 62
  • 26. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: write test Generate 10
  • 27. les of size 1 GB for a total of 10 GB: $ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 TestDFSIO is designed to use 1 map task per
  • 29. les to map tasks) 24 / 62
  • 30. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: write test output Typical output of write test ----- TestDFSIO ----- : write Date time: Mon Oct 06 10:21:28 CEST 2014 Number of files: 10 Total MBytes processed: 10000.0 Throughput mb/sec: 12.874702111579893 Average IO rate mb/sec: 13.013071060180664 IO rate std deviation: 1.4416050051562712 Test exec time sec: 114.346 25 / 62
  • 31. Benchmarking Benchmarks Micro Benchmarks Interpreting TestDFSIO results De
  • 33. lesizei PN i=0 timei De
  • 34. nition (Average IO rate) Average IO rate(N) = PN i=0 ratei N = PN
  • 35. lesizei timei N i=0 Here, N is the number of map tasks. 26 / 62
  • 36. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: read test Read 10 input
  • 37. les, each of size 1 GB: $ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000 27 / 62
  • 38. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: read test output Typical output of read test ----- TestDFSIO ----- : read Date time: Mon Oct 06 10:56:15 CEST 2014 Number of files: 10 Total MBytes processed: 10000.0 Throughput mb/sec: 402.4306813151435 Average IO rate mb/sec: 492.8257751464844 IO rate std deviation: 196.51233829270575 Test exec time sec: 33.206 28 / 62
  • 39. Benchmarking Benchmarks Micro Benchmarks In uence of HDFS replication factor When interpreting TestDFSIO results, keep in mind: The HDFS replication factor plays an important role! A higher replication factor leads to slower writes. For three identical TestDFSIO write runs (units are MB/s): HDFS replication factor 1 2 3 Throughput 190 25 13 Average IO-rate 190 10 25 3 13 1 29 / 62
  • 40. TeraSort
    Goal: sort 1 TB of data (or any other amount of data) as fast as possible.
    Probably the best-known Hadoop benchmark.
    Combines testing the HDFS and MapReduce layers of a Hadoop cluster.
    Typical areas where TeraSort is helpful:
    Iron out your Hadoop configuration after your cluster has first passed a convincing TestDFSIO benchmark.
    Determine whether your MapReduce-related parameters are set to proper values.
  • 43. TeraSort: workflow. TeraGen → /user/bart/terasort-input → TeraSort → /user/bart/terasort-output → TeraValidate → /user/bart/terasort-validate
  • 46. TeraSort: workflow
    $ hadoop jar hadoop-mapreduce-examples.jar teragen 10000000000 /user/bart/input
    (4 hours on our 4-node cluster)
    $ hadoop jar hadoop-mapreduce-examples.jar terasort /user/bart/input /user/bart/output
    (5 hours on our 4-node cluster)
    $ hadoop jar hadoop-mapreduce-examples.jar teravalidate /user/bart/output /user/bart/validate
    If something went wrong, TeraValidate's output contains the problem report.
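The first argument to teragen is a row count, not a byte count. TeraGen writes fixed-size rows of 100 bytes, so 10000000000 rows correspond to 1 TB. The sketch below shows that arithmetic and the average data rate implied by the durations quoted above; the function names are hypothetical.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_bytes):
    """Number of rows to request from teragen for a given data size."""
    return target_bytes // ROW_BYTES


def mb_per_second(total_bytes, hours):
    """Average data rate in MB/s (1 MB = 10**6 bytes) over a run."""
    return total_bytes / 1e6 / (hours * 3600)


one_tb = 10**12
print(teragen_rows(one_tb))       # 10000000000, the value used above
print(mb_per_second(one_tb, 4))   # TeraGen: about 69.4 MB/s
print(mb_per_second(one_tb, 5))   # TeraSort: about 55.6 MB/s
```

These rates are cluster-wide averages over the whole job, so they are not directly comparable to the per-task TestDFSIO figures.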
  • 47. [Figure: TeraSort: duration]
  • 48. [Figure: TeraSort: counters]
  • 49. NNBench
    Goal: load test the NameNode hardware and software.
    Generates a lot of HDFS-related requests with normally very small payloads.
    Purpose: put a high HDFS management stress on the NameNode.
    Can simulate requests for creating, reading, renaming and deleting files on HDFS.
  • 51. NNBench: example. Create 1000 files using 12 maps and 6 reducers:
    $ hadoop jar hadoop-*test*.jar nnbench -operation create_write \
        -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 \
        -numberOfFiles 1000 -replicationFactorPerFile 3 \
        -readFileAfterOpen true \
        -baseDir /user/bart/NNBench-`hostname -s`
  • 53. MRBench
    Goal: loop a small job a number of times.
    Checks whether small job runs are responsive and running efficiently on the cluster.
    Complementary to TeraSort.
    Puts its focus on the MapReduce layer.
    Impact on the HDFS layer is very limited.
  • 55. MRBench: example. Run a loop of 50 small test jobs:
    $ hadoop jar hadoop-*test*.jar mrbench -baseDir /user/bart/MRBench -numRuns 50
    Example output:
    DataLines   Maps   Reduces   AvgTime (milliseconds)
    1           2      1         28822
    → The average finish time of the executed jobs was about 28.8 seconds.
  • 57. Benchmarks: BigBench. Outline: 1 Intro: Hadoop essentials; 2 Cloudera demo; 3 Benchmarks (Micro Benchmarks, BigBench); 4 Conclusions
  • 58. BigBench. Source: http://mhpersonaltrainer.mhpersonaltrainer.com/mhpersonaltrainer/56616/index
  • 59. BigBench
    Big Data benchmark based on TPC-DS.
    Focus is mostly on MapReduce engines.
    Collaboration between industry and academia.
    https://github.com/intel-hadoop/Big-Bench/
    History:
    Launched at the First Workshop on Big Data Benchmarking (May 8-9, 2012).
    Full kit at the Fifth Workshop on Big Data Benchmarking (August 5-6, 2014).
  • 60. BigBench data model. Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013.
  • 61. BigBench: Data Model - 3 V's
    Variety: BigBench data is structured, semi-structured and unstructured.
    Velocity: periodic refreshes for all data. Different velocity for different areas: V_structured < V_unstructured < V_semistructured.
    Volume: TPC-DS uses discrete scale factors (100, 300, 1000, 3000, 10000, 30000 and 100000). BigBench uses a continuous scale factor.
  • 62. BigBench: Workload
    Workload queries:
    30 queries
    Specified in English (sort of)
    No required syntax (first implementation in Aster SQL MR)
    Kit implemented in Hive, Hadoop MR, Mahout, OpenNLP
    Business functions (McKinsey):
    Marketing
    Merchandising
    Operations
    Supply chain
    Reporting (customers and products)
  • 65. BigBench: Workload - Technical Aspects
    Data sources          Number of queries   Percentage
    Structured            18                  60 %
    Semi-structured       7                   23 %
    Unstructured          5                   17 %

    Analytic techniques   Number of queries   Percentage
    Statistics analysis   6                   20 %
    Data mining           17                  57 %
    Reporting             8                   27 %
  • 67. BigBench: Workload - Technical Aspects
    Query types    Number of queries   Percentage
    Pure HiveQL    14                  46 %
    Mahout         5                   17 %
    OpenNLP        5                   17 %
    Custom MR      6                   20 %
    Note that your implementation may vary!
  • 68. BigBench: Benchmark Process. Source: http://www.tele-task.de/archive/video/flash/24896/
  • 69. BigBench: Metric
    Number of queries run: 30 · (2 · S + 1)
    Measured times:
    T_L: loading process
    T_P: power test
    T_{TT_1}: first throughput test
    T_{T_{DM}}: data maintenance task
    T_{TT_2}: second throughput test
    Definition (BigBench queries per hour):
    \[ \mathrm{BBQpH} = \frac{30 \cdot 3 \cdot S \cdot 3600}{S \cdot T_L + S \cdot T_P + T_{TT_1} + S \cdot T_{T_{DM}} + T_{TT_2}} \]
    Similar to the TPC-DS metric.
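The metric above is straightforward to evaluate once the five times have been measured. The sketch below assumes all times are in seconds (suggested by the factor 3600) and treats S as a plain integer parameter, since the slide does not define it further. The function names and the example timings are hypothetical.

```python
def queries_run(s):
    """Number of queries run, 30 * (2 * S + 1), as stated on the slide."""
    return 30 * (2 * s + 1)


def bbqph(s, t_load, t_power, t_tt1, t_dm, t_tt2):
    """BigBench queries per hour; all times are assumed to be in seconds."""
    numerator = 30 * 3 * s * 3600
    denominator = s * t_load + s * t_power + t_tt1 + s * t_dm + t_tt2
    return numerator / denominator


# Hypothetical timings, for illustration only (not measured results).
print(queries_run(2))  # 150
print(bbqph(2, t_load=1000, t_power=2000, t_tt1=3000, t_dm=500, t_tt2=3000))
# 648000 / 13000, about 49.85
```

Because the load, power and data maintenance times are multiplied by S in the denominator, a slow load phase lowers the score more as S grows.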
  • 72. [Figure: BigBench: results]
  • 73. [Figure: BigBench: monitoring]
  • 77. BigBench: in progress. Source: The Hortonworks Blog
  • 78. Conclusions. Outline: 1 Intro: Hadoop essentials; 2 Cloudera demo; 3 Benchmarks (Micro Benchmarks, BigBench); 4 Conclusions
  • 80. Conclusions
    Use Hadoop distributions!
    Hadoop cluster administration → Cloudera Manager.
    Micro-benchmarks ↔ BigBench.
    Your best benchmark is your own application!
  • 81. Questions? Source: https://gigaom.com/2011/12/19/my-hadoop-is-bigger-than-yours/