@BDOOP_BCN
Benchmarking Hadoop
by Nicolas Poggi @ni_po
June 2, 2015
About Nicolas Poggi @ni_po
What is BDOOP about?
● A group to share on Data
● Scalability
● Performance
● Configurations
● Cluster design
● Benchmarking
● …a/couple of beer/s!
• Having sysadmins in mind
● Also POs
● Not a group to learn
• Java
• Mapreduce programming
• Hadoop base concepts
BDOOP Group Objectives
● Create a local community to
● Learn Big Data
● performance and scalability
● Share
● day-to-day problems and solutions
● Present your work and findings
● Have talks from renowned experts
● > Your objective here <
Benchmarking Motivation and Intro
Hadoop design
 Hadoop designed to solve complex data
 Structured and non structured
 With [close to] linear scalability
 Simplifying the programming model
 From MPI, OpenMP, CUDA, …
 Operates as a blackbox for data analysts
Image source: Hadoop, the definitive guide
Hadoop attributes
 Fault tolerant
 on commodity hardware
 Built in redundancy
 via replication
 Automatically scales out / down
 With [almost] linear scalability
 Move computation to data
 minimize communication
 Share nothing architecture
Hadoop highly-scalable but…
 Not a high-performance solution!
 Requires
 Design,
 Clusters, topology
 Setup,
 OS, Hadoop config
 and tuning required
 Iterative approach
 Time consuming
 And extensive benchmarking!
Hadoop parameters
 > 100+ tunable parameters
 mapred.map/reduce.tasks.speculative.execution
 obscure and interrelated
 io.sort.mb 100 (300)
 io.sort.record.percent 5% (15%)
 io.sort.spill.percent 80% (95 – 100%)
 Number of Mappers and Reducers
 Rule of thumb 0.5 - 2 per CPU core
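The rule of thumb above can be turned into a starting range for tuning. A minimal sketch: the function name and the rounding choice are assumptions for illustration, not a Hadoop API, and the result is only a starting point for benchmarking.

```python
import os


def task_slot_range(cpu_cores, low=0.5, high=2.0):
    """Return (min, max) concurrent tasks per node from the 0.5 - 2 per-core rule of thumb."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be >= 1")
    # Round down, but never suggest fewer than one task per node.
    return max(1, int(cpu_cores * low)), max(1, int(cpu_cores * high))


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    lo, hi = task_slot_range(cores)
    print(f"{cores} cores -> try between {lo} and {hi} concurrent tasks per node")
```

Each value in the range is a candidate configuration to benchmark, not a recommendation.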
Hadoop ecosystem
 Large and spread
 Dominated by big players
 Custom patches
 Default values not ideal
 Product claims
 Cloud vs. On-premise
 IaaS
 PaaS
 EMR, HDInsight
 Needs standardization and auditing!
DATA
Product claims
 Need auditing!
Workload (jobs)
 All jobs are different!
 Different requirements
 CPU bound
 Memory bound
 I/O bound
 … a bit of all
 Different tuning for each
 Needs benchmarking!
[Figure: sample mappers and reducers for 3 popular benchmarks: Terasort, K-means, Wordcount]
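For reference, the shape of a Wordcount mapper and reducer can be sketched in plain Python. This is an illustrative in-memory simulation of the map, shuffle, and reduce steps, not the code shown on the slide and not Hadoop's Java implementation; all function names are hypothetical.

```python
from collections import defaultdict


def wc_map(line):
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def wc_reduce(word, counts):
    """Reducer: sum all counts emitted for one word."""
    return word, sum(counts)


def run_wordcount(lines):
    """Simulate the shuffle phase by grouping mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in wc_map(line):
            groups[key].append(value)
    return dict(wc_reduce(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_wordcount(["big data", "Big benchmarks"]))
```

The mapper does little CPU work per record and the reducer only sums, so the job's profile differs from a sort-heavy job such as Terasort.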
One for all config?
Vertical line: Average performance for this workload across configurations
Values to the right: above average
Values to the left: below average
Is there one software configuration iteration that fits everybody?
[Figure: relative performance per configuration. Annotations: two configurations are good for Terasort but bad for Wordcount; one is good for Wordcount but very bad for Terasort.]
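The comparison against the average can be computed as below. A minimal sketch: the configuration names and times in the usage example are made up for illustration, and "above average" is taken to mean faster than the mean execution time.

```python
from statistics import mean


def relative_to_average(times_by_config):
    """Map each configuration to avg_time / its_time for one workload.

    A value above 1.0 means the configuration is faster than the average
    across configurations; below 1.0 means slower.
    """
    avg = mean(times_by_config.values())
    return {cfg: avg / t for cfg, t in times_by_config.items()}


if __name__ == "__main__":
    # Hypothetical execution times in seconds for one workload.
    terasort = {"config_a": 100.0, "config_b": 300.0}
    for cfg, score in relative_to_average(terasort).items():
        print(cfg, round(score, 2))
```

Running this per workload and comparing the scores for the same configuration shows whether one configuration is above average everywhere.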
Example of SSD impact on execution time
 Impact of SSDs on the running time of Terasort
[Figure: Terasort execution time per configuration, SSDs vs. HDDs (SATA)]
Too many choices?
[Figure: configuration choices placed on axes from - to +: remote volumes, rotational HDDs, JBODs, RAID, large VMs, small VMs, GbEthernet, InfiniBand, on-premise, cloud, cost, performance, high availability, replication]
And where is my system configuration positioned on each of these axes?
Benchmarks
Why benchmark?
 Validate assumptions
 Reproduce bad behavior
 Debugging
 Measure performance and scale
 Simulate higher load
 Find bottlenecks/ limits
 Plan for growth
 Test different
 SW and HW
Source: Based on High Performance MySQL, benchmarking MySQL chapter
Benchmarking stakeholders and use cases
 End-user / consumer
 Compare products
 Developer
 Profiling
 CI / QA
 Sysadmin / architect
 Cluster sizing
 SW and HW vendors
 Product claims
 Marketing
 Researcher
 …
Big Data Vs
 Volume
 Velocity
 Variety
 Structured, semi, unstructured data
 Different types of data (genres)
 Veracity
 Value
Sample scale factor from TPCx-HS
Data generation
 Real vs. Synthetic
 Random data vs. repeatable
 Data generation time
 Parallel
 Data distribution
 Flat or uniformly distributed
 Gaussian (normal distribution, skew)
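Repeatable synthetic data with either distribution can be sketched with a seeded generator. The function name, key space, and the mean and spread of the Gaussian are assumptions chosen for illustration.

```python
import random


def generate_keys(n, distribution="uniform", seed=42, key_space=1000):
    """Generate n integer keys in [0, key_space); the same seed gives the same data."""
    rng = random.Random(seed)  # private generator, so runs are repeatable
    if distribution == "uniform":
        return [rng.randrange(key_space) for _ in range(n)]
    if distribution == "gaussian":
        mu, sigma = key_space / 2, key_space / 10
        # Clamp to the key space so skewed keys stay valid.
        return [min(key_space - 1, max(0, int(rng.gauss(mu, sigma)))) for _ in range(n)]
    raise ValueError(f"unknown distribution: {distribution}")


if __name__ == "__main__":
    print(generate_keys(5, "uniform"))
    print(generate_keys(5, "gaussian"))
```

Gaussian keys concentrate around the middle of the key space, which loads some reducers more than others; uniform keys spread the work evenly.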
Issues Benchmarking Big Data
 Big Scale
 Single node vs Multiple nodes
 10MB vs 10TB
 On-metal vs. virtualized vs. cloud
 Non-deterministic / randomness
 Need to average multiple runs
 How long to benchmark
 System warm-up
 Distributed systems
 Failures?
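Averaging multiple runs while discarding warm-up can be sketched as follows. The number of warm-up runs and the sample times are assumptions for illustration; the coefficient of variation is included as one way to judge whether more runs are needed.

```python
from statistics import mean, stdev


def summarize_runs(times_s, warmup=1):
    """Drop the first `warmup` runs, then report mean, standard deviation, and CV."""
    measured = times_s[warmup:]
    if len(measured) < 2:
        raise ValueError("need at least two measured runs after warm-up")
    m = mean(measured)
    s = stdev(measured)
    return {"runs": len(measured), "mean_s": m, "stdev_s": s, "cv": s / m}


if __name__ == "__main__":
    # Hypothetical execution times in seconds; the first run includes warm-up.
    print(summarize_runs([500.0, 100.0, 110.0, 90.0]))
```

A high coefficient of variation suggests the runs are too noisy to compare configurations yet.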
Types of benchmarks and Standards
 Micro benchmarks
 HDFSIO
 Functional
 Terasort, ETL
 Genre-specific
 Graph 500
 Application level
 BigBench
 TPC (implementation) vs SPEC (reference)
TPC vs. SPEC models
TPC
 Specification based
 Performance, price, energy in one benchmark
 End-to-end
 Multiple tests (ACID, load)
 Independent review
 Full disclosure
 TPC Technology Conference
SPEC
 Kit based
 Performance and energy in separate benchmarks
 Server-centric
 Single test
 Peer review
 Summary disclosure
 SPEC Research Group, ICPE
Source: From presentation by Meikel Poess, 1st WBDB, May 2012
Data Benchmarks
Classical SQL OLAP DB Big Data
 First there was TPC-H
 Classical SQL OLAP benchmark
 MRBench for M/R
 On top of Hive or Impala for Hadoop
 Then sorting
 Terasort
 Unofficial standard
 Now part of TPCx-HS
 Hadoop samples
 Wordcount, grep, terasort, DFSIO
 YCSB
 From Yahoo!
 For NoSQL, HBASE implementation
 GridMix
 CALDA
 HiBench
 SWIM
 BigBench
 based on TPC-DS + ML
 30 queries
 BigDataBench
 33 workloads
 TPCx-HS
Comparison of popular Hadoop benchmarks
Benchmark      | Spec[1] | App domains | Workload types | Workloads       | Scalable data sets[2] | Diverse implem[3] | Multi-tenancy[4] | Subset[5] | Simulator[6]
BigDataBench   | Y       | Five        | Four[7]        | Thirty-three[8] | Eight[9]              | Y                 | Y                | Y         | Y
BigBench       | Y       | One         | Three          | Ten             | Three                 | N                 | N                | N         | N
CloudSuite     | N       | N/A         | Two            | Eight           | Three                 | N                 | N                | N         | Y
HiBench        | N       | N/A         | Two            | Ten             | Three                 | N                 | N                | N         | N
CALDA          | Y       | N/A         | One            | Five            | N/A                   | Y                 | N                | N         | N
YCSB           | Y       | N/A         | One            | Six             | N/A                   | Y                 | N                | N         | N
LinkBench      | Y       | N/A         | One            | Ten             | N/A                   | Y                 | N                | N         | N
AMP Benchmarks | Y       | N/A         | One            | Four            | N/A                   | Y                 | N                | N         | N
The Differences of BigDataBench from Other Benchmarks Suites.
Source: BigDataBench homepage
What to measure and metrics
 Job execution time
 Throughput
 Units / time
 Framework overhead
 # of spills
 Scalability
 Concurrency
 Abstract metrics
 CPU
 MEM
 DISK
 IOPS, latency, bandwidth
 NET
 Latency, bandwidth
 TPCx-HS performance metric (HSph@SF)
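Two of these metrics reduce to simple arithmetic. A minimal sketch, assuming HSph@SF is defined as SF / (T / 3600), with SF the scale factor and T the total elapsed time in seconds; that definition and the function names are assumptions here and should be checked against the TPCx-HS specification before reporting results.

```python
def throughput(units, elapsed_s):
    """Units of work (records, MB, jobs...) completed per second."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return units / elapsed_s


def hsph_at_sf(scale_factor, elapsed_s):
    """Assumed HSph@SF formula: scale factor divided by elapsed hours."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return scale_factor / (elapsed_s / 3600.0)


if __name__ == "__main__":
    # Hypothetical run: scale factor 10 completed in half an hour.
    print(hsph_at_sf(10, 1800))
    print(throughput(100, 50))
```

Under this assumed formula, results are only comparable at the same scale factor, which is why the scale factor is part of the metric's name.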
Benchmarking
Project ALOJA online repository
 Entry point to explore the results collected from the executions
 Provides insights on the obtained results through continuously evolving data views
 Online results at: http://hadoop.bsc.es
ALOJA Platform: Evolution and status
 Benchmarking, Repository, and Analytics tools for Big Data
 Composed of open-source
 Benchmarking, provisioning and orchestration tools,
 high-level system performance metric collection,
 low-level Hadoop instrumentation based on BSC Tools
 and Web based data analytics tools
 And recommendations
 Online Big Data Benchmark repository of:
 20,000+ runs (from HiBench)
 Sharable, comparable, repeatable, verifiable executions
 Abstracting and leveraging tools for BD benchmarking
 Not reinventing the wheel but,
 most current BD tools designed for production, not for benchmarking
 leverages current compatible tools and projects
 Dev VM toolset and sandbox
 via Vagrant
[Figure: Big Data Benchmarking, Online Repository, Analytics]
Workflow in ALOJA
Cluster(s) definition
• VM sizes
• # nodes
• OS, disks
• Capabilities
Execution plan
• Start cluster
• Exec Benchmarks
• Gather results
• Cleanup
Import data
• Convert perf metric
• Parse logs
• Import into DB
Evaluate data
• Data views in Vagrant VM
• Or http://hadoop.bsc.es
PA and KD
• Predictive Analytics
• Knowledge Discovery
Historic Repo
Benchmarks Execution comparisons
 You can compare, side by side, all execution parameters:
 CPU, Memory, Network, Disk, Hadoop parameters….
Sample:
http://hadoop.bsc.es/perfcharts?execs[]=91144
HiBench suite
HiBench: A Benchmark Suite for Hadoop
A Comprehensive & Realistic Benchmark Suite
Micro Benchmarks: Sort, WordCount, TeraSort
Web Search: Nutch Indexing, Page Rank
Machine Learning: Bayesian Classification, K-Means Clustering
HDFS: Enhanced DFSIO
Code at: https://github.com/intel-hadoop/HiBench
Job resource requirements 1/2
Source: Intel HiBench
Job resource requirements 2/2
Source: Intel HiBench
Impact of SW configurations in Speedup
[Figure: speedup (higher is better) by number of mappers (4m, 6m, 8m, 10m) and by compression algorithm (no compression, ZLIB, BZIP2, snappy)]
Results using: http://hadoop.bsc.es/configimprovement
Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
Impact of HW configurations in Speedup
[Figure: speedup (higher is better) by disks and network (HDD-ETH, HDD-IB, SSD-ETH, SSD-IB) and by cloud remote volumes (local only; 1, 2, or 3 remotes; 1, 2, or 3 remotes with /tmp local)]
Results using: http://hadoop.bsc.es/configimprovement
Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
Cost/Performance Scalability
 Terasort (100GB)
Sample from: http://hadoop.bsc.es/nodeseval
[Figure: execution time and execution cost, price vs. performance, for:
 InfiniBand + SSD (LOCAL)
 GbE + SSD (LOCAL)
 InfiniBand + SATA disks (LOCAL)
 GbE + SATA disks (LOCAL)
 CLOUD (local disk /tmp and HDFS)
 CLOUD (/tmp in local disk, HDFS in Blob storage, 1-3 devices)
 CLOUD (/tmp and HDFS in Blob storage, 1-3 devices)]
Cost-effectiveness (On-premise vs. Cloud)
Details at: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
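Execution cost can be derived from execution time with a simple model. A minimal sketch, assuming cost = hours x nodes x price per node-hour; this cost model, the function names, and all prices below are assumptions for illustration and not necessarily how the ALOJA results were computed.

```python
def execution_cost(elapsed_s, nodes, price_per_node_hour):
    """Cost of one run under the assumed model: hours * nodes * hourly node price."""
    return elapsed_s / 3600.0 * nodes * price_per_node_hour


def rank_by_cost(results):
    """Sort {name: (elapsed_s, nodes, price_per_node_hour)} from cheapest to most expensive."""
    costs = {name: execution_cost(*args) for name, args in results.items()}
    return sorted(costs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical Terasort runs: (seconds, nodes, price per node-hour).
    runs = {
        "on_premise_ssd": (1800, 4, 0.5),
        "cloud_remote_volumes": (3600, 4, 0.5),
    }
    for name, cost in rank_by_cost(runs):
        print(name, cost)
```

The fastest setup is not always the cheapest per run, which is why time and cost are shown side by side.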
Common Benchmarking pitfalls
 Scalability
 Assuming near-linear scalability
 Compare apples to apples
 if benchmarking HW, change HW but leave SW the same
 Terasort in v1 != Terasort in v2
 Test for Big Data use large data
 stress the system
 If results seem too good to be true, they probably are
 Don’t believe in miracles
 Expect vendor lies
Source: adapted from Benchmarking Big Data Systems by Yanpei Chen and Gwen Shapira at Big Data Spain
Resources
 ALOJA Benchmarking platform and online repository
 http://hadoop.bsc.es/
 Big Data Benchmarking Community (BDBC) mailing list
 (~200 members from ~80 organizations)
 http://clds.sdsc.edu/bdbc/community
 Workshop Big Data Benchmarking (WBDB)
 Next: http://clds.sdsc.edu/wbdb2015.ca
 SPEC Research Big Data working group
 http://research.spec.org/working-groups/big-data-working-group.html
 Slides and video:
 Michael Frank on Big Data benchmarking
 http://www.tele-task.de/archive/podcast/20430/
 Tilmann Rabl Big Data Benchmarking Tutorial
 http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
@BDOOP_BCN
Benchmarking Hadoop
by Nicolas Poggi @ni_po
June 2, 2015

More Related Content

What's hot

The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudNicolas Poggi
 
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDK
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDKBig Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDK
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDKPrincipled Technologies
 
Introducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big DataIntroducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big Datainside-BigData.com
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRDouglas Bernardini
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Nicolas Poggi
 
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Sumeet Singh
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Sumeet Singh
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)Nicolas Poggi
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Sumeet Singh
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopGiovanna Roda
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04Ted Dunning
 

What's hot (20)

The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDK
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDKBig Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDK
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDK
 
Introducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big DataIntroducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big Data
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)
 
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04
 

Viewers also liked

TestDFSIO
TestDFSIOTestDFSIO
TestDFSIOhhyin
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Keeping Pressure Vessels Safe with the Sharck™  Probe
Keeping Pressure Vessels Safe with the Sharck™  ProbeKeeping Pressure Vessels Safe with the Sharck™  Probe
Keeping Pressure Vessels Safe with the Sharck™  ProbeEddyfi
 
Assigment 6
Assigment 6Assigment 6
Assigment 6fuzuli41
 
что такое Smm в 2013 году на примере
что такое Smm в 2013 году на примеречто такое Smm в 2013 году на примере
что такое Smm в 2013 году на примереАнтон Чернятин
 
Engineering Mechanics Statics design problem # 5.4 concrete chut by Kehali...
Engineering Mechanics Statics  design problem  # 5.4  concrete chut by Kehali...Engineering Mechanics Statics  design problem  # 5.4  concrete chut by Kehali...
Engineering Mechanics Statics design problem # 5.4 concrete chut by Kehali...kehali Haileselassie
 
Bipolar junction transistor characterstics biassing and amplification, lab 9
Bipolar junction transistor characterstics biassing and amplification, lab 9Bipolar junction transistor characterstics biassing and amplification, lab 9
Bipolar junction transistor characterstics biassing and amplification, lab 9kehali Haileselassie
 
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array Probe
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array ProbeInspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array Probe
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array ProbeEddyfi
 
The avanti group sharp turn for electronics company
The avanti group sharp turn for electronics companyThe avanti group sharp turn for electronics company
The avanti group sharp turn for electronics companyApplecherr McDougal
 
Bahasa indonesia teks laporan hasil observasi
Bahasa indonesia teks laporan hasil observasiBahasa indonesia teks laporan hasil observasi
Bahasa indonesia teks laporan hasil observasiSri Utanti
 
High-Speed Remote-Field Testing in Carbon Steel Tubing
High-Speed Remote-Field Testing in Carbon Steel TubingHigh-Speed Remote-Field Testing in Carbon Steel Tubing
High-Speed Remote-Field Testing in Carbon Steel TubingEddyfi
 
Texture powerpoint final
Texture powerpoint finalTexture powerpoint final
Texture powerpoint finalkphan22
 
Inspecting In-Service Storage Tank Annular Rings for Corrosion
Inspecting In-Service Storage Tank Annular Rings for CorrosionInspecting In-Service Storage Tank Annular Rings for Corrosion
Inspecting In-Service Storage Tank Annular Rings for CorrosionEddyfi
 
MASALAH EKONOMI
MASALAH EKONOMIMASALAH EKONOMI
MASALAH EKONOMISri Utanti
 
Defect Detection & Prevention in Cast Turbine Wheels
Defect Detection & Prevention in Cast Turbine WheelsDefect Detection & Prevention in Cast Turbine Wheels
Defect Detection & Prevention in Cast Turbine WheelsEddyfi
 
JLL JF 100 Excercise Bike Manual
JLL JF 100 Excercise Bike ManualJLL JF 100 Excercise Bike Manual
JLL JF 100 Excercise Bike ManualJLL Fitness
 
JLL Electronics Treadmills Magzine
JLL Electronics Treadmills MagzineJLL Electronics Treadmills Magzine
JLL Electronics Treadmills MagzineJLL Fitness
 

Viewers also liked (19)

TeraSort
TeraSortTeraSort
TeraSort
 
TestDFSIO
TestDFSIOTestDFSIO
TestDFSIO
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Keeping Pressure Vessels Safe with the Sharck™  Probe
Keeping Pressure Vessels Safe with the Sharck™  ProbeKeeping Pressure Vessels Safe with the Sharck™  Probe
Keeping Pressure Vessels Safe with the Sharck™  Probe
 
Assigment 6
Assigment 6Assigment 6
Assigment 6
 
что такое Smm в 2013 году на примере
что такое Smm в 2013 году на примеречто такое Smm в 2013 году на примере
что такое Smm в 2013 году на примере
 
Engineering Mechanics Statics design problem # 5.4 concrete chut by Kehali...
Engineering Mechanics Statics  design problem  # 5.4  concrete chut by Kehali...Engineering Mechanics Statics  design problem  # 5.4  concrete chut by Kehali...
Engineering Mechanics Statics design problem # 5.4 concrete chut by Kehali...
 
Bipolar junction transistor characterstics biassing and amplification, lab 9
Bipolar junction transistor characterstics biassing and amplification, lab 9Bipolar junction transistor characterstics biassing and amplification, lab 9
Bipolar junction transistor characterstics biassing and amplification, lab 9
 
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array Probe
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array ProbeInspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array Probe
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array Probe
 
The avanti group sharp turn for electronics company
The avanti group sharp turn for electronics companyThe avanti group sharp turn for electronics company
The avanti group sharp turn for electronics company
 
Bahasa indonesia teks laporan hasil observasi
Bahasa indonesia teks laporan hasil observasiBahasa indonesia teks laporan hasil observasi
Bahasa indonesia teks laporan hasil observasi
 
High-Speed Remote-Field Testing in Carbon Steel Tubing
High-Speed Remote-Field Testing in Carbon Steel TubingHigh-Speed Remote-Field Testing in Carbon Steel Tubing
High-Speed Remote-Field Testing in Carbon Steel Tubing
 
Texture powerpoint final
Texture powerpoint finalTexture powerpoint final
Texture powerpoint final
 
Inspecting In-Service Storage Tank Annular Rings for Corrosion
Inspecting In-Service Storage Tank Annular Rings for CorrosionInspecting In-Service Storage Tank Annular Rings for Corrosion
Inspecting In-Service Storage Tank Annular Rings for Corrosion
 
OBA.BY
OBA.BYOBA.BY
OBA.BY
 
MASALAH EKONOMI
MASALAH EKONOMIMASALAH EKONOMI
MASALAH EKONOMI
 
Defect Detection & Prevention in Cast Turbine Wheels
Defect Detection & Prevention in Cast Turbine WheelsDefect Detection & Prevention in Cast Turbine Wheels
Defect Detection & Prevention in Cast Turbine Wheels
 
JLL JF 100 Excercise Bike Manual
JLL JF 100 Excercise Bike ManualJLL JF 100 Excercise Bike Manual
JLL JF 100 Excercise Bike Manual
 
JLL Electronics Treadmills Magzine
JLL Electronics Treadmills MagzineJLL Electronics Treadmills Magzine
JLL Electronics Treadmills Magzine
 

Similar to Benchmarking Hadoop and Big Data

LAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96BoardsLAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96BoardsLinaro
 
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016Ganesh Raju
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...DevOps.com
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsDirecti Group
 
Hp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar SlidesHp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar SlidesHortonworks
 
Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Amazon Web Services
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8MongoDB
 
Bodo Value Guide.pdf
Bodo Value Guide.pdfBodo Value Guide.pdf
Bodo Value Guide.pdfGregHanchin1
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceHortonworks
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BIPrasad Prabhu (PP)
 
1 extreme performance - part i
1   extreme performance - part i1   extreme performance - part i
1 extreme performance - part isqlserver.co.il
 
Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Community
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics PlatformSantanu Dey
 

Similar to Benchmarking Hadoop and Big Data (20)

LAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96BoardsLAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96Boards
 
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...

Benchmarking Hadoop and Big Data

  • 1. @BDOOP_BCN Benchmarking Hadoop by Nicolas Poggi @ni_po June 2, 2015
  • 3. What is BDOOP about? ● A group to share knowledge on data ● Scalability ● Performance ● Configurations ● Cluster design ● Benchmarking ● …a/couple of beer/s! • With sysadmins in mind ● Also POs ● Not a group to learn • Java • MapReduce programming • Hadoop base concepts
  • 4. BDOOP Group Objectives ● Create a local community to ● Learn Big Data ● performance and scalability ● Share ● day-to-day problems and solutions ● Present your work and findings ● Have talks from renowned experts ● > Your objective here <
  • 6. Hadoop design  Hadoop is designed to process complex data  Structured and unstructured  With [close to] linear scalability  Simplifying the programming model  From MPI, OpenMP, CUDA, …  Operates as a black box for data analysts Image source: Hadoop: The Definitive Guide
  • 7. Hadoop attributes  Fault tolerant  on commodity hardware  Built-in redundancy  via replication  Automatically scales out / down  With [almost] linear scalability  Moves computation to data  minimizes communication  Shared-nothing architecture
  • 8. Hadoop highly-scalable but…  Not a high-performance solution!  Requires  Design,  Clusters, topology clusters  Setup,  OS, Hadoop config  and tuning required  Iterative approach  Time consuming  And extensive benchmarking!
  • 9. Hadoop parameters  100+ tunable parameters  mapred.map/reduce.tasks.speculative.execution  obscure and interrelated  io.sort.mb 100 (300)  io.sort.record.percent 5% (15%)  io.sort.spill.percent 80% (95 – 100%)  Number of Mappers and Reducers  Rule of thumb 0.5 - 2 per CPU core
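
The rule of thumb on slide 9 (0.5 to 2 mappers or reducers per CPU core) can be turned into a quick sizing helper. This is a minimal sketch only: the function name `task_slot_range` and the rounding choice (round down, never below one slot) are assumptions for illustration, not part of Hadoop or of the talk.

```python
def task_slot_range(cpu_cores, low=0.5, high=2.0):
    """Return (min_slots, max_slots) of concurrent tasks for one node.

    Applies the 0.5 - 2 tasks per CPU core rule of thumb.
    Rounding down and the floor of one slot are illustrative assumptions.
    """
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    min_slots = max(1, int(cpu_cores * low))
    max_slots = max(1, int(cpu_cores * high))
    return min_slots, max_slots
```

Calling `task_slot_range(8)` gives the range of slots to explore on an 8-core node; the best value inside that range still has to be found by benchmarking each workload.
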
  • 10. Hadoop ecosystem  Large and spread  Dominated by big players  Custom patches  Default values not ideal  Product claims  Cloud vs. On-premise  IaaS  PaaS  EMR, HDInsight  Needs standardization and auditing! DATA
  • 12. Workload (jobs)  All jobs are different!  Different requirements  CPU bound  Memory bound  I/O bound  … a bit of all  Different tuning for each  Needs benchmarking! [Figure: sample mappers and reducers for 3 popular benchmarks: Terasort, K-means, Wordcount]
  • 13. One for all config? Is there one software configuration iteration that fits everybody? [Chart: configurations per workload. Vertical line: average performance for this workload across configurations. Values to the right: above average. Values to the left: below average. Annotations: Good for Terasort but bad for Wordcount; Good for Terasort but bad for Wordcount; Good for Wordcount but very bad for Terasort]
  • 14. Example of SSD impact on execution time  Impact of SSDs on the running time of Terasort [Chart: execution time per configuration, SSDs vs. SATA HDDs]
  • 15. Too many choices? And where is my system configuration positioned on each of these axes? [Diagram axes: Remote volumes, Rotational HDDs, JBODs, Large VMs, Small VMs, Gb Ethernet, InfiniBand, RAID, Cost, Performance, On-premise, Cloud, High availability, Replication]
  • 17. Why benchmark?  Validate assumptions  Reproduce bad behavior  Debugging  Measure performance and scale  Simulate higher load  Find bottlenecks/ limits  Plan for growth  Test different  SW and HW Source: Based on High Performance MySQL, benchmarking MySQL chapter
  • 18. Benchmarking stakeholders and use cases  End-user / consumer  Compare products  Developer  Profiling  CI / QA  Sysadmin / architect  Cluster sizing  SW and HW vendors  Product claims  Marketing  Researcher  …
  • 19. Big Data Vs  Volume  Velocity  Variety  Structured, semi, unstructured data  Different types of data (genres)  Veracity  Value Sample scale factor from TPCx-HS
  • 20. Data generation  Real vs. Synthetic  Random data vs. repeatable  Data generation time  Parallel  Data distribution  Flat or uniformly distributed  Gaussian (normal distribution, skew)
  • 21. Issues Benchmarking Big Data  Big Scale  Single node vs Multiple nodes  10MB vs 10TB  On-metal vs. virtualized vs. cloud  Non-deterministic / Randomness  Need to average multiple runs  How long to benchmark  System warm-up  Distributed systems  Failures?
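
Slide 20 contrasts random with repeatable data and uniform with Gaussian distributions. A minimal sketch of a repeatable synthetic generator, using only Python's standard `random` module: the function name `generate`, the default seed, and the Gaussian parameters (mean 0.5, standard deviation 0.15) are assumptions chosen for illustration.

```python
import random


def generate(n, distribution="uniform", seed=42):
    """Generate n synthetic values that are repeatable for a given seed.

    A private Random instance is used so the global generator state
    does not affect (and is not affected by) the data set.
    """
    rng = random.Random(seed)
    if distribution == "uniform":
        # Flat distribution over [0, 1]
        return [rng.uniform(0.0, 1.0) for _ in range(n)]
    if distribution == "gaussian":
        # Normal distribution; mean and stdev are illustrative assumptions
        return [rng.gauss(0.5, 0.15) for _ in range(n)]
    raise ValueError("unknown distribution: " + distribution)
```

Fixing the seed makes two runs operate on identical input, so differences in execution time come from the system under test rather than from the data.
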
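
The advice on slide 21 to average multiple runs and account for system warm-up can be sketched as a small timing harness. This is an illustrative sketch, not ALOJA's or HiBench's method: `benchmark` and `run_once` are hypothetical names, and `run_once` stands for any callable that submits one job and blocks until it finishes.

```python
import statistics
import time


def benchmark(run_once, runs=5, warmup=1):
    """Time run_once() several times and return (mean, stdev) in seconds.

    Warm-up executions are run first and their timings are discarded.
    """
    if runs < 2:
        raise ValueError("need at least 2 measured runs for a standard deviation")
    for _ in range(warmup):
        run_once()  # discarded: lets caches and JVMs warm up
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)
```

Reporting the standard deviation next to the mean shows how non-deterministic the system is; a large spread suggests more runs are needed before comparing configurations.
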
  • 22. Types of benchmarks and Standards  Micro benchmarks  HDFSIO  Functional  Terasort, ETL  Genre-specific  Graph 500  Application level  BigBench  TPC (implementation) vs SPEC (reference)
  • 23. TPC vs. SPEC models  TPC:  Specification based  Performance, price, energy in one benchmark  End-to-end  Multiple tests (ACID, load)  Independent review  Full disclosure  TPC Technology Conference  SPEC:  Kit based  Performance and energy in separate benchmarks  Server-centric  Single test  Peer review  Summary disclosure  SPEC Research Group, ICPE Source: From presentation by Meikel Poess, 1st WBDB, May 2012
  • 24. Data Benchmarks Classical SQL OLAP DB Big Data  First there was TPC-H  Classical SQL OLAP benchmark  MRBench for M/R  On top of Hive or Impala for Hadoop  Then sorting  Terasort  Unofficial standard  Now part of TPCx-HS  Hadoop samples  Wordcount, grep, terasort, DFSIO  YCSB  From Yahoo!  For NoSQL, HBase implementation  GridMix  CALDA  HiBench  SWIM  BigBench  based on TPC-DS + ML  30 queries  BigDataBench  33 workloads  TPCx-HS
  • 25. Comparison of popular Hadoop benchmarks. The Differences of BigDataBench from Other Benchmark Suites. Source: BigDataBench homepage
      Benchmark      | Spec[1] | App domains | Workload types | Workloads       | Scalable data sets[2] | Diverse implem[3] | Multi-tenancy[4] | Subset[5] | Simulator[6]
      BigDataBench   | Y       | Five        | Four[7]        | Thirty-three[8] | Eight[9]              | Y                 | Y                | Y         | Y
      BigBench       | Y       | One         | Three          | Ten             | Three                 | N                 | N                | N         | N
      CloudSuite     | N       | N/A         | Two            | Eight           | Three                 | N                 | N                | N         | Y
      HiBench        | N       | N/A         | Two            | Ten             | Three                 | N                 | N                | N         | N
      CALDA          | Y       | N/A         | One            | Five            | N/A                   | Y                 | N                | N         | N
      YCSB           | Y       | N/A         | One            | Six             | N/A                   | Y                 | N                | N         | N
      LinkBench      | Y       | N/A         | One            | Ten             | N/A                   | Y                 | N                | N         | N
      AMP Benchmarks | Y       | N/A         | One            | Four            | N/A                   | Y                 | N                | N         | N
  • 26. What to measure and metrics  Job execution time  Throughput  Units / time  Framework overhead  # of spills  Scalability  Concurrency  Abstract metrics  CPU  MEM  DISK  IOPS, latency, bandwidth  NET  Latency, bandwidth  TPCx-HS performance metric (HSph@SF)
  • 28. Project ALOJA online repository  Entry point to explore the results collected from the executions  Provides insights into the obtained results through continuously evolving data views  Online results at: http://hadoop.bsc.es
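
Two of the metrics on slide 26 reduce to simple arithmetic. The sketch below computes throughput as units per time and the TPCx-HS performance metric. The HSph@SF formula used here, SF / (T / 3600) with T the total elapsed time of the performance run in seconds, is stated from memory of the TPCx-HS specification and should be checked against the current spec; the function names are hypothetical.

```python
def throughput(units, seconds):
    """Units of work completed per second (e.g. records/s or MB/s)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return units / seconds


def hsph_at_sf(scale_factor, elapsed_seconds):
    """TPCx-HS style metric: HSph@SF = SF / (T / 3600).

    Assumption: scale_factor is the data set size (SF) and
    elapsed_seconds is the total run time T in seconds.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return scale_factor / (elapsed_seconds / 3600.0)
```

A run that completes in half the time at the same scale factor doubles the metric, which is why results are only comparable at the same SF.
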
  • 29. ALOJA Platform: Evolution and status  Benchmarking, Repository, and Analytics tools for Big Data  Composed of open-source  Benchmarking, provisioning and orchestration tools,  high-level system performance metric collection,  low-level Hadoop instrumentation based on BSC Tools  and Web based data analytics tools  And recommendations  Online Big Data Benchmark repository of:  20,000+ runs (from HiBench)  Sharable, comparable, repeatable, verifiable executions  Abstracting and leveraging tools for BD benchmarking  Not reinventing the wheel but,  most current BD tools designed for production, not for benchmarking  leverages current compatible tools and projects  Dev VM toolset and sandbox  via Vagrant Big Data Benchmarking Online Repository Analytics
  • 30. Workflow in ALOJA Cluster(s) definition • VM sizes • # nodes • OS, disks • Capabilities Execution plan • Start cluster • Exec Benchmarks • Gather results • Cleanup Import data • Convert perf metric • Parse logs • Import into DB Evaluate data • Data views in Vagrant VM • Or http://hadoop.bsc.es PA and KD • Predictive Analytics • Knowledge Discovery Historic Repo
  • 31. 34 Benchmarks Execution comparisons  You can compare, side by side, all execution parameters:  CPU, Memory, Network, Disk, Hadoop parameters…. Sample: http://hadoop.bsc.es/perfcharts?execs[]=91144
  • 32. HiBench suite  HiBench: A Benchmark Suite for Hadoop  HiBench: A Comprehensive & Realistic Benchmark Suite  Micro Benchmarks: Sort, WordCount, TeraSort  Web Search: Nutch Indexing, Page Rank  Machine Learning: Bayesian Classification, K-Means Clustering  HDFS: Enhanced DFSIO  Code at: https://github.com/intel-hadoop/HiBench
  • 33. Job resource requirements 1/2 Source: Intel HiBench
  • 34. Job resource requirements 2/2 Source: Intel HiBench
  • 35. Impact of SW configurations in Speedup [Charts: speedup (higher is better). Number of mappers: 4m, 6m, 8m, 10m. Compression algorithm: No comp., ZLIB, BZIP2, snappy] Results using: http://hadoop.bsc.es/configimprovement Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
  • 36. Impact of HW configurations in Speedup [Charts: speedup (higher is better). Disks and network: HDD-ETH, HDD-IB, SSD-ETH, SSD-IB. Cloud remote volumes: Local only, 1 Remote, 2 Remotes, 3 Remotes, 3 Remotes /tmp local, 2 Remotes /tmp local, 1 Remotes /tmp local] Results using: http://hadoop.bsc.es/configimprovement Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
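
Speedup, as plotted on slides 35 and 36, is conventionally the baseline execution time divided by the execution time of each configuration. A minimal sketch under that assumption: the function name `speedups` and the configuration labels in the usage note are hypothetical, and the exact definition used by the ALOJA charts is not confirmed here.

```python
def speedups(exec_times, baseline):
    """Map each configuration to baseline_time / config_time.

    exec_times: dict of configuration name -> execution time in seconds.
    baseline:   name of the configuration used as the reference (speedup 1.0).
    Values above 1.0 mean the configuration is faster than the baseline.
    """
    base = exec_times[baseline]
    if base <= 0 or any(t <= 0 for t in exec_times.values()):
        raise ValueError("execution times must be positive")
    return {name: base / t for name, t in exec_times.items()}
```

For example, `speedups({"no_comp": t0, "snappy": t1}, "no_comp")` with measured times `t0` and `t1` yields 1.0 for the uncompressed run and `t0 / t1` for snappy.
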
  • 37. Cost/Performance Scalability  Terasort (100GB) Sample from: http://hadoop.bsc.es/nodeseval Execution time Execution cost
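
The execution cost axis on slide 37 can be approximated from execution time, cluster size, and an hourly node price. This is a sketch under a simple assumed cost model (all nodes billed at the same hourly rate, pro-rated by the second); it is not necessarily the model ALOJA uses, and `execution_cost` and `price_per_node_hour` are hypothetical names.

```python
def execution_cost(exec_seconds, nodes, price_per_node_hour):
    """Cost of one run = hours elapsed * number of nodes * hourly node price.

    Assumes per-second pro-rated billing and identical nodes.
    """
    if exec_seconds < 0 or nodes < 1 or price_per_node_hour < 0:
        raise ValueError("invalid input")
    return exec_seconds / 3600.0 * nodes * price_per_node_hour
```

Under this model, doubling the node count only lowers the cost if it more than halves the execution time, which is what the cost/performance scalability chart is meant to reveal.
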
  • 38. Cost-effectiveness (On-premise vs. Cloud) [Chart: price vs. performance. InfiniBand + SSD (LOCAL); GbE + SSD (LOCAL); CLOUD (local disk /tmp and HDFS); CLOUD (/tmp in local disk, HDFS in Blob storage 1-3 devices); CLOUD (/tmp and HDFS in Blob storage 1-3 devices); InfiniBand + SATA disks (LOCAL); GbE + SATA disks (LOCAL)] Details at: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
  • 39. Common Benchmarking pitfalls  Scalability  Assuming near-linear scalability  Compare apples to apples  if benchmarking HW change HW  but leave SW the same  Terasort in v1 != Terasort in v2  Test for Big Data use large data  stress the system  If results are too good to be true, they probably aren't  Don’t believe in miracles  Expect vendor lies Source: adapted from Benchmarking Big Data Systems by Yanpei Chen and Gwen Shapira at Big Data Spain
  • 40. Resources  ALOJA Benchmarking platform and online repository  http://hadoop.bsc.es/  Big Data Benchmarking Community (BDBC) mailing list  (~200 members from ~80 organizations)  http://clds.sdsc.edu/bdbc/community  Workshop Big Data Benchmarking (WBDB)  Next: http://clds.sdsc.edu/wbdb2015.ca  SPEC Research Big Data working group  http://research.spec.org/working-groups/big-data-working-group.html  Slides and video:  Michael Frank on Big Data benchmarking  http://www.tele-task.de/archive/podcast/20430/  Tilmann Rabl Big Data Benchmarking Tutorial  http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
  • 41. @BDOOP_BCN Benchmarking Hadoop by Nicolas Poggi @ni_po June 2, 2015