SlideShare une entreprise Scribd logo
1  sur  22
Télécharger pour lire hors ligne
Hadoop and HBase on the Cloud:
A Case Study on Performance and Isolation.
Konstantin V. Shvachko Jagane Sundar
February 26, 2013
 Founders of AltoStor and AltoScale
 Jagane: WANdisco, CTO and VP Engineering of Big Data
– Director of Hadoop Performance and Operability at Yahoo!
– Big data, cloud, virtualization, and networking experience
 Konstantin: WANdisco, Chief Architect
– Hadoop, HDFS at Yahoo! & eBay
– Efficient data structures and algorithms for large-scale distributed storage systems
– Giraffa - file system with distributed metadata & data utilizing HDFS and HBase.
Hosted on Apache Extra
/ page 2
Authors
 The Hadoop Distributed File System
(HDFS)
– Reliable storage layer
– NameNode – namespace and block
management
– DataNodes – block replica container
 MapReduce – distributed computation
framework
– Simple computational model
– JobTracker – job scheduling, resource
management, lifecycle coordination
– TaskTracker – task execution module
 Analysis and transformation of very large
amounts of data using commodity
servers
/ page 3
A reliable, scalable, high performance distributed computing system
What is Apache Hadoop
NameNode
DataNode
TaskTracker
JobTracker
DataNode
TaskTracker
DataNode
TaskTracker
Block
Task
 Table: big, sparse, loosely structured
– Collection of rows, sorted by row keys
– Rows can have arbitrary number of
columns
 Table is split Horizontally into Regions
– Dynamic Table partitioning
– Region Servers serve regions to
applications
 Columns grouped into Column families
– Vertical partition of tables
 Distributed Cache: Regions are loaded
in nodes’ RAM
– Real-time access to data
/ page 4
A distributed key-value storage for real-time access to semi-structured data
What is Apache HBase
DataNode DataNode DataNode
NameNode JobTracker
RegionServer RegionServerRegionServer
TaskTracker TaskTrackerTaskTracker
HBase
Master
/ page 5
A Parallel Computational Model and Distributed Framework
What is MapReduce
dogs C, 3
like
cats
V, 1
C, 2 V, 2
C, 3 V, 1
C, 8
V, 4
 I/O utilization
– Can run with the speed of spinning drives
– Examples: DFSIO, Terasort (well tuned)
 Network utilization – optimized by design
– Data locality. Tasks executed on nodes where input data resides.
No massive transfers
– Block replication of 3 requires two data transfers
– Map writes transient output locally
– Shuffle requires cross-node transfers
 CPU utilization
1. IO bound workloads preclude from using more cpu time
2. Cluster provisioning:
peak-load performance vs. average utilization tradeoff
/ page 6
Low Average CPU Utilization on Hadoop Clusters
What is the Problem
 Computation of Pi
– pure CPU workload, no input or output data
– Enormous amount of FFTs computing amazingly large numbers
– Record Pi run over-heated the datacenter
 Well tuned Terasort is CPU intensive
 Compression – marginal utilization gain
 Production clusters run cold
1. IO bound workloads
2. Conservative provisioning of cluster resources to meet strict SLAs
/ page 7
CPU Load
Two quadrillionth (1015)
digit of π is 0
 72 GB - total RAM / node
– 4 GB – DataNode
– 2 GB – TaskTracker
– 16 GB – RegionServer
– 2 GB – per individual task: 25 task slots (17 maps and 8 reduces)
 Average utilization vs peak-load performance
– Oversubscription (28 task slots)
– Better average utilization
– MR Tasks can starve HBase RegionServers
 Better Isolation of resources → Aggressive resource allocation
/ page 8
Rule of thumb
Cluster Provisioning Dilemma
 Goal: Eliminate disk IO contention
 Faster non-volatile storage devices improve IO performance
– Advantage in random reads
– Similar performance for sequential IOs
 More RAM: HBase caching
/ page 9
With non-spinning storage
Increasing IO Rate
 DFSIO benchmark measures average throughput for IO operations
– Write
– Read (sequential)
– Append
– Random Read (new)
 MapReduce job
– Map: same operation write or read for all mappers. Measures throughput
– Single reducer: aggregates the performance results
 Random Reads (MAPREDUCE-4651)
– Random Read DFSIO randomly chooses an offset
– Backward Read DFSIO reads files in reverse order
– Skip Read DFSIO reads seeks ahead after every portion read
– Avoid read-ahead buffering
– Similar results for all three random read modifications
/ page 10
Standard Hadoop Benchmark measuring HDFS performance
What is DFSIO
 Four node cluster: Hadoop 1.0.3 HBase 0.92.1
– 1 master-node: NameNode, JobTracker
– 3 slave node: DataNode, TaskTracker
 Node configuraiton
– Intel 8 core processor with hyper-threading
– 24 GB RAM
– Four 1TB 7200 rpm SATA drives
– 1 Gbps network interfaces
 DFSIO dataset
– 72 files of size 10 GB each
– Total data read: 7GB
– Single read size: 1 MB
– Concurrent readers: from 3 to 72
/ page 11
DFSIO
Benchmarking Environment
0
200
400
600
800
1000
1200
1400
1600
1800
3 12 24 48 72
AggregateThroughput(MB/sec)
Concurrent Readers
Disks
Flash
/ page 12
Increasing Load with Random Reads
Random Reads
 YCSB allows to define a mix of read / write operations,
measure latency and throughput
– Compares different database: relational and no-SQL
– Data is represented as a table of records with number of fixed fields
– Unique key identifies each record
 Main operations
– Insert: Insert a new record
– Read: Read a record
– Update: Update a record by replacing the value of one field
– Scan: Scan a random number of consequent records, starting at a random record
key
/ page 13
Yahoo! Cloud Serving Benchmark
What is YCSB
 Four node cluster
– 1 master-node: NameNode, JobTracker, HBase Master, Zookeeper
– 3 slave node: DataNode, TaskTracker, RegionServer
– Physical master node
– 2 to 4 VMs on a slave node. Max 12 VMs
 YCSB datasets of two different sizes: 10 and 30 million records
– dstat collects system resource metrics: CPU, memory usage, disk and network stats
/ page 14
YCSB
Benchmarking Environment
/ page 15
YCSB Workloads
Workloads Insert % Read % Update % Scan %
Data Load 100
Reads with heavy insert load 55 45
Short range scans: workload E 5 95
/ page 16
Random reads and Scans substantially faster with flash
Average Workloads Throughput
0
10
20
30
40
50
60
70
80
90
100
Data Load
Reads with
Inserts Short range
Scans
Throughput(%Ops/sec)
Disks
Flash
0
500
1000
1500
2000
2500
3000
64 256 512 1024
Throughput(Ops/sec)
Concurrent Threads
Physical nodes
6 Vms
9 Vms
12 Vms
/ page 17
Adding one VM per node increases overall performance 20% on average
Short range Scans: Throughput
/ page 18
Latency grows linearly with number of threads on physical nodes
Short range Scans: Latency
0
200
400
600
800
1000
1200
64 256 512 1024
AverageLatency(ms)
Concurrent Threads
Physical nodes
6 Vms
9 Vms
12 Vms
/ page 19
Virtualized Cluster drastically increases CPU utilization
CPU Utilization comparison
4%3% 1%
92%
CPU Physical nodes
user system wait idle
55%23%
1%
21%
CPU Virtualized cluster
user system wait idle
 Physical node cluster generates
very light CPU load – 92% idle
 With VMs the CPU can be drawn close to
100% at peaks
/ page 20
Latency of reads on mixed workload: 45% reads and 55% inserts
Reads with Inserts
0
100
200
300
400
500
600
700
10 million records
30 million records
ReadLatency(ms)
Disks
Flash
 HDFS
– Sequential IO is handled well by the disk storage
– Flash substantially outperforms disks on workloads with random reads
 HBase write-only workload provides marginal improvement for flash
 Using multiple VMs / node provides 100% peak utilization of HW resources
– CPU utilization on physical-node clusters is a fraction of its capacity
 Combination of Flash Storage and Virtualization implies
high performance of HBase for
Random Read and Reads Mixed with writes workloads
 Virtualization serves two main functions:
– Resource utilization by running more server processes per node
– Resource isolation by designating certain percentage of resources to each server
and not letting them starve each other
/ page 21
VMs allow to utilize Random Read advantage of flash for Hadoop
Conclusions
Thank you
Konstantin V. Shvachko Jagane Sundar

Contenu connexe

Tendances

Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
A Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationA Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationSameer Tiwari
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 

Tendances (20)

Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Hadoop
HadoopHadoop
Hadoop
 
A Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationA Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animation
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
HDFS
HDFSHDFS
HDFS
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Hadoop
HadoopHadoop
Hadoop
 

Similaire à 02.28.13 WANdisco ApacheCon 2013

[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfsNAVER D2
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsJason Shao
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Yahoo Developer Network
 
Red Hat Storage Server Administration Deep Dive
Red Hat Storage Server Administration Deep DiveRed Hat Storage Server Administration Deep Dive
Red Hat Storage Server Administration Deep DiveRed_Hat_Storage
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreducesenthil0809
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base InstallCloudera, Inc.
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudEdureka!
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesappaji intelhunt
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hivesrikanthhadoop
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedDouglas Bernardini
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated ArchitectureDatabricks
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 

Similaire à 02.28.13 WANdisco ApacheCon 2013 (20)

Hadoop Research
Hadoop Research Hadoop Research
Hadoop Research
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
 
Red Hat Storage Server Administration Deep Dive
Red Hat Storage Server Administration Deep DiveRed Hat Storage Server Administration Deep Dive
Red Hat Storage Server Administration Deep Dive
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
 
Cassandra as Memcache
Cassandra as MemcacheCassandra as Memcache
Cassandra as Memcache
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 

Plus de WANdisco Plc

Forrester On Using Subversion to Optimize Globally Distributed Development
Forrester On Using Subversion to Optimize Globally Distributed DevelopmentForrester On Using Subversion to Optimize Globally Distributed Development
Forrester On Using Subversion to Optimize Globally Distributed DevelopmentWANdisco Plc
 
03.13.13 WANDisco SVN Training: Advanced Branching & Merging
03.13.13 WANDisco SVN Training: Advanced Branching & Merging03.13.13 WANDisco SVN Training: Advanced Branching & Merging
03.13.13 WANDisco SVN Training: Advanced Branching & MergingWANdisco Plc
 
02.28.13 WANDisco SVN Training: Getting Info Out of SVN
02.28.13 WANDisco SVN Training: Getting Info Out of SVN02.28.13 WANDisco SVN Training: Getting Info Out of SVN
02.28.13 WANDisco SVN Training: Getting Info Out of SVNWANdisco Plc
 
02.19.13 WANDisco SVN Training: Branching Options for Development
02.19.13 WANDisco SVN Training: Branching Options for Development02.19.13 WANDisco SVN Training: Branching Options for Development
02.19.13 WANDisco SVN Training: Branching Options for DevelopmentWANdisco Plc
 
uberSVN introduction by WANdisco
uberSVN introduction by WANdiscouberSVN introduction by WANdisco
uberSVN introduction by WANdiscoWANdisco Plc
 
WANdisco Subversion Support Services
WANdisco Subversion Support ServicesWANdisco Subversion Support Services
WANdisco Subversion Support ServicesWANdisco Plc
 
Make Subversion Agile
Make Subversion AgileMake Subversion Agile
Make Subversion AgileWANdisco Plc
 
Subversion in 2010 and Beyond
Subversion in 2010 and BeyondSubversion in 2010 and Beyond
Subversion in 2010 and BeyondWANdisco Plc
 
Forrester Research on Optimizing Globally Distributed Software Development Us...
Forrester Research on Optimizing Globally Distributed Software Development Us...Forrester Research on Optimizing Globally Distributed Software Development Us...
Forrester Research on Optimizing Globally Distributed Software Development Us...WANdisco Plc
 
Forrester Research on Globally Distributed Development Using Subversion
Forrester Research on Globally Distributed Development Using SubversionForrester Research on Globally Distributed Development Using Subversion
Forrester Research on Globally Distributed Development Using SubversionWANdisco Plc
 

Plus de WANdisco Plc (12)

Forrester On Using Subversion to Optimize Globally Distributed Development
Forrester On Using Subversion to Optimize Globally Distributed DevelopmentForrester On Using Subversion to Optimize Globally Distributed Development
Forrester On Using Subversion to Optimize Globally Distributed Development
 
03.13.13 WANDisco SVN Training: Advanced Branching & Merging
03.13.13 WANDisco SVN Training: Advanced Branching & Merging03.13.13 WANDisco SVN Training: Advanced Branching & Merging
03.13.13 WANDisco SVN Training: Advanced Branching & Merging
 
02.28.13 WANDisco SVN Training: Getting Info Out of SVN
02.28.13 WANDisco SVN Training: Getting Info Out of SVN02.28.13 WANDisco SVN Training: Getting Info Out of SVN
02.28.13 WANDisco SVN Training: Getting Info Out of SVN
 
02.19.13 WANDisco SVN Training: Branching Options for Development
02.19.13 WANDisco SVN Training: Branching Options for Development02.19.13 WANDisco SVN Training: Branching Options for Development
02.19.13 WANDisco SVN Training: Branching Options for Development
 
uberSVN introduction by WANdisco
uberSVN introduction by WANdiscouberSVN introduction by WANdisco
uberSVN introduction by WANdisco
 
Subversion Zen
Subversion ZenSubversion Zen
Subversion Zen
 
WANdisco Subversion Support Services
WANdisco Subversion Support ServicesWANdisco Subversion Support Services
WANdisco Subversion Support Services
 
Make Subversion Agile
Make Subversion AgileMake Subversion Agile
Make Subversion Agile
 
Why Svn
Why SvnWhy Svn
Why Svn
 
Subversion in 2010 and Beyond
Subversion in 2010 and BeyondSubversion in 2010 and Beyond
Subversion in 2010 and Beyond
 
Forrester Research on Optimizing Globally Distributed Software Development Us...
Forrester Research on Optimizing Globally Distributed Software Development Us...Forrester Research on Optimizing Globally Distributed Software Development Us...
Forrester Research on Optimizing Globally Distributed Software Development Us...
 
Forrester Research on Globally Distributed Development Using Subversion
Forrester Research on Globally Distributed Development Using SubversionForrester Research on Globally Distributed Development Using Subversion
Forrester Research on Globally Distributed Development Using Subversion
 

Dernier

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 

Dernier (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

02.28.13 WANdisco ApacheCon 2013

  • 1. Hadoop and HBase on the Cloud: A Case Study on Performance and Isolation. Konstantin V. Shvachko Jagane Sundar February 26, 2013
  • 2.  Founders of AltoStor and AltoScale  Jagane: WANdisco, CTO and VP Engineering of Big Data – Director of Hadoop Performance and Operability at Yahoo! – Big data, cloud, virtualization, and networking experience  Konstantin: WANdisco, Chief Architect – Hadoop, HDFS at Yahoo! & eBay – Efficient data structures and algorithms for large-scale distributed storage systems – Giraffa - file system with distributed metadata & data utilizing HDFS and HBase. Hosted on Apache Extra / page 2 Authors
  • 3.  The Hadoop Distributed File System (HDFS) – Reliable storage layer – NameNode – namespace and block management – DataNodes – block replica container  MapReduce – distributed computation framework – Simple computational model – JobTracker – job scheduling, resource management, lifecycle coordination – TaskTracker – task execution module  Analysis and transformation of very large amounts of data using commodity servers / page 3 A reliable, scalable, high performance distributed computing system What is Apache Hadoop NameNode DataNode TaskTracker JobTracker DataNode TaskTracker DataNode TaskTracker Block Task
  • 4.  Table: big, sparse, loosely structured – Collection of rows, sorted by row keys – Rows can have arbitrary number of columns  Table is split Horizontally into Regions – Dynamic Table partitioning – Region Servers serve regions to applications  Columns grouped into Column families – Vertical partition of tables  Distributed Cache: Regions are loaded in nodes’ RAM – Real-time access to data / page 4 A distributed key-value storage for real-time access to semi-structured data What is Apache HBase DataNode DataNode DataNode NameNode JobTracker RegionServer RegionServerRegionServer TaskTracker TaskTrackerTaskTracker HBase Master
  • 5. / page 5 A Parallel Computational Model and Distributed Framework What is MapReduce dogs C, 3 like cats V, 1 C, 2 V, 2 C, 3 V, 1 C, 8 V, 4
  • 6.  I/O utilization – Can run with the speed of spinning drives – Examples: DFSIO, Terasort (well tuned)  Network utilization – optimized by design – Data locality. Tasks executed on nodes where input data resides. No massive transfers – Block replication of 3 requires two data transfers – Map writes transient output locally – Shuffle requires cross-node transfers  CPU utilization 1. IO bound workloads preclude from using more cpu time 2. Cluster provisioning: peak-load performance vs. average utilization tradeoff / page 6 Low Average CPU Utilization on Hadoop Clusters What is the Problem
  • 7.  Computation of Pi – pure CPU workload, no input or output data – Enormous amount of FFTs computing amazingly large numbers – Record Pi run over-heated the datacenter  Well tuned Terasort is CPU intensive  Compression – marginal utilization gain  Production clusters run cold 1. IO bound workloads 2. Conservative provisioning of cluster resources to meet strict SLAs / page 7 CPU Load Two quadrillionth (1015) digit of π is 0
  • 8.  72 GB - total RAM / node – 4 GB – DataNode – 2 GB – TaskTracker – 16 GB – RegionServer – 2 GB – per individual task: 25 task slots (17 maps and 8 reduces)  Average utilization vs peak-load performance – Oversubscription (28 task slots) – Better average utilization – MR Tasks can starve HBase RegionServers  Better Isolation of resources → Aggressive resource allocation / page 8 Rule of thumb Cluster Provisioning Dilemma
  • 9.  Goal: Eliminate disk IO contention  Faster non-volatile storage devices improve IO performance – Advantage in random reads – Similar performance for sequential IOs  More RAM: HBase caching / page 9 With non-spinning storage Increasing IO Rate
  • 10.  DFSIO benchmark measures average throughput for IO operations – Write – Read (sequential) – Append – Random Read (new)  MapReduce job – Map: same operation write or read for all mappers. Measures throughput – Single reducer: aggregates the performance results  Random Reads (MAPREDUCE-4651) – Random Read DFSIO randomly chooses an offset – Backward Read DFSIO reads files in reverse order – Skip Read DFSIO reads seeks ahead after every portion read – Avoid read-ahead buffering – Similar results for all three random read modifications / page 10 Standard Hadoop Benchmark measuring HDFS performance What is DFSIO
  • 11.  Four node cluster: Hadoop 1.0.3 HBase 0.92.1 – 1 master-node: NameNode, JobTracker – 3 slave node: DataNode, TaskTracker  Node configuraiton – Intel 8 core processor with hyper-threading – 24 GB RAM – Four 1TB 7200 rpm SATA drives – 1 Gbps network interfaces  DFSIO dataset – 72 files of size 10 GB each – Total data read: 7GB – Single read size: 1 MB – Concurrent readers: from 3 to 72 / page 11 DFSIO Benchmarking Environment
  • 12. 0 200 400 600 800 1000 1200 1400 1600 1800 3 12 24 48 72 AggregateThroughput(MB/sec) Concurrent Readers Disks Flash / page 12 Increasing Load with Random Reads Random Reads
  • 13.  YCSB allows to define a mix of read / write operations, measure latency and throughput – Compares different database: relational and no-SQL – Data is represented as a table of records with number of fixed fields – Unique key identifies each record  Main operations – Insert: Insert a new record – Read: Read a record – Update: Update a record by replacing the value of one field – Scan: Scan a random number of consequent records, starting at a random record key / page 13 Yahoo! Cloud Serving Benchmark What is YCSB
  • 14.  Four node cluster – 1 master-node: NameNode, JobTracker, HBase Master, Zookeeper – 3 slave node: DataNode, TaskTracker, RegionServer – Physical master node – 2 to 4 VMs on a slave node. Max 12 VMs  YCSB datasets of two different sizes: 10 and 30 million records – dstat collects system resource metrics: CPU, memory usage, disk and network stats / page 14 YCSB Benchmarking Environment
  • 15. / page 15 YCSB Workloads Workloads Insert % Read % Update % Scan % Data Load 100 Reads with heavy insert load 55 45 Short range scans: workload E 5 95
  • 16. / page 16 Random reads and Scans substantially faster with flash Average Workloads Throughput 0 10 20 30 40 50 60 70 80 90 100 Data Load Reads with Inserts Short range Scans Throughput(%Ops/sec) Disks Flash
  • 17. 0 500 1000 1500 2000 2500 3000 64 256 512 1024 Throughput(Ops/sec) Concurrent Threads Physical nodes 6 Vms 9 Vms 12 Vms / page 17 Adding one VM per node increases overall performance 20% on average Short range Scans: Throughput
  • 18. / page 18 Latency grows linearly with number of threads on physical nodes Short range Scans: Latency 0 200 400 600 800 1000 1200 64 256 512 1024 AverageLatency(ms) Concurrent Threads Physical nodes 6 Vms 9 Vms 12 Vms
  • 19. / page 19 Virtualized Cluster drastically increases CPU utilization CPU Utilization comparison 4%3% 1% 92% CPU Physical nodes user system wait idle 55%23% 1% 21% CPU Virtualized cluster user system wait idle  Physical node cluster generates very light CPU load – 92% idle  With VMs the CPU can be drawn close to 100% at peaks
  • 20. / page 20 Latency of reads on mixed workload: 45% reads and 55% inserts Reads with Inserts 0 100 200 300 400 500 600 700 10 million records 30 million records ReadLatency(ms) Disks Flash
  • 21.  HDFS – Sequential IO is handled well by the disk storage – Flash substantially outperforms disks on workloads with random reads  HBase write-only workload provides marginal improvement for flash  Using multiple VMs / node provides 100% peak utilization of HW resources – CPU utilization on physical-node clusters is a fraction of its capacity  Combination of Flash Storage and Virtualization implies high performance of HBase for Random Read and Reads Mixed with writes workloads  Virtualization serves two main functions: – Resource utilization by running more server processes per node – Resource isolation by designating certain percentage of resources to each server and not letting them starve each other / page 21 VMs allow to utilize Random Read advantage of flash for Hadoop Conclusions
  • 22. Thank you Konstantin V. Shvachko Jagane Sundar