Capacity Planning in Multi-tenant
Hadoop, HBase and Storm Deployments
PRESENTED BY Amrit Lal and Sumeet Singh ⎪ April 02, 2014
2014 Hadoop Summit, Amsterdam, Netherlands
Introduction
Sumeet Singh
Senior Director, Product Management
Hadoop and Big Data Platforms
Cloud Engineering Group
Amrit Lal
Product Manager
Hadoop and Big Data Platforms
Cloud Engineering Group
§  Product Manager at Yahoo engaged in building
high class and robust Hadoop infrastructure
services
§  Eight years of experience across HSBC, Oracle
and Google in developing products and
platforms for high growth enterprises
§  M.B.A. from Carnegie Mellon University
701 First Avenue,
Sunnyvale, CA 94089 USA
@amritasshwar
§  Manages Hadoop products team at Yahoo!
§  Responsible for Product Management, Strategy
and Customer Engagements
§  Managed Cloud Services products team and
headed Strategy functions for the Cloud
Platform Group at Yahoo
§  M.B.A. from UCLA and M.S. from
Rensselaer (RPI)
701 First Avenue,
Sunnyvale, CA 94089 USA
@sumeetksingh
Agenda
1  The Need for Capacity Planning
2  Big Data Platform Deployment Models
3  Resource Drivers and Data Sources
4  Capacity Models and Tools
5  SLA Management
[Chart: number of servers (DataNode, axis 0 to 45,000) and raw HDFS storage (in PB, axis 0 to 500) by year, 2006 to 2014]
Multi-tenant Apache Hadoop Platform Evolution
§  Yahoo! Commits to Scaling Hadoop for Production Use
§  Research Workloads in Search and Advertising
§  Production (Modeling) with machine learning & WebMap
§  Revenue Systems with Security, Multi-tenancy, and SLAs
§  Open Sourced with Apache
§  Hortonworks Spinoff for Enterprise hardening
§  Nextgen Hadoop (H 0.23 YARN)
§  New Services (HBase, Storm, Hive etc.)
§  Increased User-base with partitioned namespaces
§  Apache H2.x (Low latency, Util, HA etc.)
Hosted Apps Growth on Apache Hadoop
New Customer Apps On-boarded (number of projects, running total by quarter)
Q1-11 272
Q2-11 288
Q3-11 306
Q4-11 330
Q1-12 336
Q2-12 357
Q3-12 368
Q4-12 382
Q1-13 407
Q2-13 449
Q3-13 460
Q4-13 495
67 projects in 2011
52 projects in 2012
113 projects in 2013
Multi-tenant Apache HBase Growth at Yahoo
Zero to “20” Use Cases (60,000 Regions) in a Year
[Chart: number of servers (RegionServer) and data stored (in PB) by quarter, Q1-13 to Q1-14; data labels: 1,140 Region Servers and 33.6 PB]
Multi-tenant Apache Storm Growth at Yahoo
Zero to “175” Production Topologies in a Year
[Chart: number of servers (Supervisor) and number of topologies by quarter, Q1-13 to Q1-14; data labels: 760 Supervisors and 175 topologies; annotation: Multi-tenancy Release]
Where Does Capacity Planning Fit
Phased Environment
Production
On-boarding
Capacity Planning
Architecture Validation
Technology Choice
Project Lifecycle Support
Big Data Platform Technology Stack at Yahoo
Compute
Services
Storage
Infrastructure
Services
Hive, Pig, Oozie, HDFS Proxy, GDM
YARN MapReduce
HDFS HBase
Zookeeper
Support
Shop
Monitoring Starling
Messaging
Service
HCatalog
Storm, Spark, Tez
Relevant for Capacity Planning
Deployment Model
DataNode NodeManager
NameNode RM
DataNode RegionServer
NameNode HBase Master Nimbus
Supervisor
Administration, Management and Monitoring
ZooKeeper
Pools
HTTP/HDFS/GDM
Load Proxies
Applications and Data
Data
Feeds
Data
Stores
Oozie
Server
HS2/
HCat
Relevant for Capacity Planning
Capacity Drivers That Matter
Drivers: Measure
Data (Storage): Volume of data to be stored and processed
Memory: Container for direct and faster access to stored data
CPU: Cores (and threads) available for processing
Throughput: Number of transactions per second
Latency: Time taken to complete a request or operation (includes processing, disk and network I/O time)
Apache Hadoop Resources
Drivers and measures, in the order of importance for Hadoop:
Data (Storage): Data stored in HDFS (disk)
§ Freq., size, retention, # files
§ Rep. factor
Memory: Map and Reduce containers (in H 0.23/ 2.0)
§ Map memory
§ Reduce memory
CPU: YARN-2 for Capacity Scheduler, Yahoo is not using it yet
§ N/A
Throughput: Data processed/ second with concurrent Mappers and Reducers
§ Total data processed
§ Maps and Reduces to run (simple or complex DAGs)
Latency: Time taken for the jobs to complete
§ Individual job run times
§ Time to finish all jobs (when run in parallel) – peak usage
Working Through a Use Case
Pig Mail needs to process 30 TB of data
every day in about 6 hours so that it can
develop algorithms that detect spam
more effectively. A Pig script will parse the
data in sequential phases to finally
materialize the features of a mail that
decide whether the mail is spam.
[Diagram: illustrative Pig DAG with three stages: Stage 1 (job 1), Stage 2 (jobs 2-L and 2-R), Stage 3 (job 3)]
Data (Storage)
Step 1: Pig Mail Project Info – User Input
Data upload frequency Once daily
Data added per upload 1 TB / day
Data retention (Input) 30 days
Data output 50 GB
Data retention 1 day
Anticipated growth in data volume (3-6 months) 20%
Step 2: # Servers Based on Storage (default values at hdfs-site.xml)
HDFS replication factor
dfs.replication
Default: <3>
HDFS required (30 + 0.05) x 1.2 x 3 = 108 TB
Suggested server config (based on total cost) C-xxx/48/4000 (four 4 TB disks)
Storage available per server
12 TB out of 16 TB (rest for OS, temp, swap etc.)
dfs.datanode.du.reserved, <107374182400> (100 GB in bytes; a 1 TB reservation would be <1099511627776>)
Servers required 108 / 12 = 9 servers
Step 3: Namespace Needed (default values at hdfs-site.xml)
HDFS block size
dfs.blocksize
Default: <134217728> 128 MB
Average file size 1.5 X 128 MB = 200 MB (assumed)
Namespace for files 108 TB / 200 MB = 540,000
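The storage arithmetic in Steps 2 and 3 can be sketched in Python as follows. The function and parameter names are illustrative assumptions, not part of any Hadoop API. The inputs are the Pig Mail figures from the slide, and decimal units (1 TB = 1,000,000 MB) are assumed for the namespace estimate.

```python
import math

def hdfs_storage_tb(input_tb, output_tb, growth, replication):
    """Raw HDFS capacity in TB: (input + output) x growth headroom x replicas."""
    return (input_tb + output_tb) * (1 + growth) * replication

def servers_for_storage(required_tb, usable_tb_per_server):
    """Whole servers needed, rounding up."""
    return math.ceil(required_tb / usable_tb_per_server)

def namespace_files(required_tb, avg_file_mb):
    """Approximate file count, assuming 1 TB = 1,000,000 MB."""
    return int(required_tb * 1_000_000 // avg_file_mb)

# 30 days x 1 TB/day of input, 50 GB of output, 20% growth, replication 3.
# The result is rounded to whole TB, as the slide does.
required = round(hdfs_storage_tb(30, 0.05, 0.20, 3))
servers = servers_for_storage(required, usable_tb_per_server=12)
files = namespace_files(required, avg_file_mb=200)
print(required, servers, files)
```

Rounding up to whole servers keeps the estimate conservative when the division is not exact.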
Memory
Step 1: Cluster/ Node Level Info (configured values at yarn-site.xml) – Admins Only
Max memory on the node for containers
yarn.nodemanager.resource.memory-mb
Conf: <45056> (44G out of 48G, rest for the OS)
Virtual to physical memory
yarn.nodemanager.vmem-pmem-ratio
Default: <2.1> (virtual memory may exceed physical memory by this ratio)
Min allocable memory for containers
yarn.scheduler.minimum-allocation-mb
Default: <512> (0.5G)
Max allocable memory for containers
yarn.scheduler.maximum-allocation-mb
Default: <8192> (8G)
Step 2: Container Level Info (default values at mapred-site.xml)
Map task container size
mapreduce.map.memory.mb
Default: <1536> (1.5G)
Reduce task container size
mapreduce.reduce.memory.mb
Default: <2048> (2G)
MR AppMaster memory size
yarn.app.mapreduce.am.resource.mb
Default: <1536> (1.5G)
Map task JVM heap size
mapreduce.map.java.opts
Default: -Xmx1024m
Reduce task JVM heap size
mapreduce.reduce.java.opts
Default: -Xmx1536m
Map and Reduce container sizes are determined by users developing the app based on memory needs of the tasks
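As a rough illustration of how these settings bound concurrency, the sketch below divides the memory a node offers to containers by the map and reduce container sizes listed above. The helper names are assumptions for illustration. It ignores the AppMaster container and any scheduler rounding of requests.

```python
NODE_CONTAINER_MB = 45056    # yarn.nodemanager.resource.memory-mb on the slide
MAP_CONTAINER_MB = 1536      # mapreduce.map.memory.mb
REDUCE_CONTAINER_MB = 2048   # mapreduce.reduce.memory.mb
MAP_HEAP_MB = 1024           # from -Xmx1024m
REDUCE_HEAP_MB = 1536        # from -Xmx1536m

def containers_per_node(node_mb, container_mb):
    """How many containers of one size fit on a node (hypothetical helper)."""
    return node_mb // container_mb

def heap_fraction(heap_mb, container_mb):
    """Share of the container taken by the JVM heap; the rest is headroom."""
    return heap_mb / container_mb

max_maps = containers_per_node(NODE_CONTAINER_MB, MAP_CONTAINER_MB)
max_reduces = containers_per_node(NODE_CONTAINER_MB, REDUCE_CONTAINER_MB)
print(max_maps, max_reduces)
print(heap_fraction(MAP_HEAP_MB, MAP_CONTAINER_MB),
      heap_fraction(REDUCE_HEAP_MB, REDUCE_CONTAINER_MB))
```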
Throughput
Step 1: Estimating Number of Mappers
Upper bound on input splits mapreduce.input.fileinputformat.split.maxsize
Lower bound on input splits mapreduce.input.fileinputformat.split.minsize
Number of mappers
Number of input splits
(e.g. 8,192 maps = 1 TB of data / 128M split size)
Step 2 A: Estimating Number of Reducers
Limit on the input size to reducers
mapreduce.reduce.input.limit
Default: <10737418240> (10G)
Fixed number of reducers mapreduce.job.reduces
Number of reducers Min (fixed reducers, total input size / reducer size)
Step 2 B: Estimating Number of Reducers (Pig and Hive)
Pig
Min (fixed reducers, pig.exec.reducers.max,
total input size / pig.exec.reducers.bytes.per.reducer)
Default: <max 999, reducer bytes 1GB>
Hive
Min (fixed reducers, hive.exec.reducers.max ,
total input size / hive.exec.reducers.bytes.per.reducer)
Default: < max 999, reducer bytes 1GB>
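The estimates on this slide can be sketched as below. This follows the slide's Min(...) rule literally; the function names are illustrative assumptions, and binary units are used so that 1 TB over 128 MB splits gives the 8,192 maps quoted above.

```python
MB = 2 ** 20
GB = 2 ** 30
TB = 2 ** 40

def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def num_mappers(input_bytes, split_bytes):
    """One mapper per input split."""
    return ceil_div(input_bytes, split_bytes)

def num_reducers(input_bytes, bytes_per_reducer, max_reducers, fixed_reducers=None):
    """Min(fixed reducers, max reducers, input size / bytes per reducer)."""
    candidates = [max(1, ceil_div(input_bytes, bytes_per_reducer)), max_reducers]
    if fixed_reducers is not None:
        candidates.append(fixed_reducers)
    return min(candidates)

maps = num_mappers(1 * TB, 128 * MB)
# Pig/Hive defaults from the slide: at most 999 reducers, 1 GB per reducer.
reducers = num_reducers(1 * TB, 1 * GB, max_reducers=999)
reducers_fixed = num_reducers(1 * TB, 1 * GB, max_reducers=999, fixed_reducers=200)
print(maps, reducers, reducers_fixed)
```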
Throughput and Latency
Step 1: Sample Run (with a tenth of data on a sandbox cluster)
Stages | # Map | Map Size | Map Time | # Reduce | Reduce Size | Reduce Time
Stage 1 | 100 | 1.5 GB | 10 Min | 50 | 2 GB | 5 Min
Stage 2 - L | 50 | 1.5 GB | 10 Min | 20 | 2 GB | 10 Min
Stage 2 - R | 30 | 1.5 GB | 5 Min | 10 | 2 GB | 5 Min
Stage 3 | 70 | 1.5 GB | 5 Min | 30 | 2 GB | 5 Min
Notes:
§  SLOT_MILLIS_MAPS and SLOT_MILLIS_REDUCES from Job Counters give the time spent
§  TOTAL_LAUNCHED_MAPS and TOTAL_LAUNCHED_REDUCES from Job Counters give # Map and # Reduce
§  Reduce time includes the Sort and Shuffle time. Shuffle time is data per reducer / est. 4 MB/s (bandwidth for
data transfer from Map to Reduce)
§  Add 10% for speculative execution (failed/killed task attempts)
Step 2: Mappers and Reducers for SLA and Full Dataset
Stages | Mins | SLA Share | # Map | # Reduce | Map Total | Reduce Total | Total Mem. | # Servers
Stage 1 | 15 / 45 Min | 120 / 360 Min | 138 = (100 x 11) / 8 | 69 = (50 x 11) / 8 | 207 GB | 138 GB | 345 GB | 8
Project Pig Mail Capacity Ask = MAX (Compute <8 Servers>, Storage <9 Servers>) = 9 Servers
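One way to read the Stage 1 row is sketched below, under these assumptions: the factor 11 is the 10x data scale-up plus 10% for speculative execution, the factor 8 is Stage 1's SLA share (120 min) over its sample time (15 min), and 44 GB per server is usable for containers. All names are illustrative; the numbers come from the sample run and the 6-hour SLA.

```python
import math

# Sample run: stage -> (maps, map minutes, reduces, reduce minutes)
SAMPLE = {
    "1": (100, 10, 50, 5),
    "2-L": (50, 10, 20, 10),
    "2-R": (30, 5, 10, 5),
    "3": (70, 5, 30, 5),
}

def stage_minutes(stage):
    _, map_min, _, reduce_min = SAMPLE[stage]
    return map_min + reduce_min

# Stages 2-L and 2-R run in parallel, so only the slower one counts.
critical_path = (stage_minutes("1")
                 + max(stage_minutes("2-L"), stage_minutes("2-R"))
                 + stage_minutes("3"))

SLA_MIN = 360
scale = 10 * (100 + 10) / 100                                # 10x data, plus 10%
sla_share = SLA_MIN * stage_minutes("1") / critical_path     # Stage 1 share of SLA
stretch = sla_share / stage_minutes("1")                     # extra time per task wave

def scaled_tasks(sample_tasks):
    return math.ceil(sample_tasks * scale / stretch)

maps = scaled_tasks(SAMPLE["1"][0])
reduces = scaled_tasks(SAMPLE["1"][2])
total_gb = maps * 1.5 + reduces * 2      # 1.5 GB map and 2 GB reduce containers
compute_servers = math.ceil(total_gb / 44)
ask = max(compute_servers, 9)            # 9 servers from the storage estimate
print(critical_path, maps, reduces, total_gb, compute_servers, ask)
```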
Capacity Calculations Tools
Apache HBase Resources
Drivers and measures, in the order of importance for HBase:
Throughput: Supported frequency of data read or written in a second (for a given record size)
§ Number of reads, writes or scans per second per server
Latency: Time taken for the read, write or scan operations to complete
§ Read or write time in ms (typically) per record
Memory: BlockCache; data that needs to be served through cache
§ % of data read from cache
§ MemStore / BlockCache ratio, RegionServer heap
Data (Storage): Total data stored in HDFS (disk)
§ Avg. record size x avg. number of records stored
CPU: N/A
§ N/A
Working Through a Use Case
Awesome eCommerce needs to
process about 200 M records daily
between 6:00 and 10:00 AM to
update product information. About 50%
of the data relates to existing
products, where the price may need to be
updated by comparing the current price with the
new offer price. The remaining 50% of the
offers are new products and will be written
without price comparison.
There are three separate tables for
product, price and offers with 3 KB avg.
record size. Writes are in the order of
500 Million records and reads 250
Million across each of the three tables.
ILLUSTRATIVE
Throughput & Latency
Step 1: Project Info – User Input
Active reads/writes per day 4 Hrs.
Avg. writes / day (all three tables) 1,500 M
Avg. reads / day (all three tables) 750 M
Average record size 3 KB
Records cached / warmed on start 50%
Step 2: # Servers Based on Write Throughput
Peak concurrent writes required 1,500 M x 3 KB / (4 x 3,600 sec) = ~ 300 MB / sec
Peak write throughput per RegionServer 45 MB / sec (based on performance benchmarks)
Servers required 300 / 45 = 7 RegionServer
Step 3: # Servers Based on Read Throughput
Peak concurrent reads required 750 M x 3 KB / (4x3600 sec) = ~160 MB / sec
Peak cold random read throughput 10 MB / sec (based on performance benchmarks)
Peak hot random read throughput 200 MB / sec (based on performance benchmarks)
RegionServer for cold reads 160 x 50% / 10 = 8
RegionServer for hot read 160 x 50% / 200 = 1
Servers required Max (8,1) = 8 RegionServer
Performance benchmarks were conducted by simulating HBase workloads through YCSB on dedicated servers
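The write and read sizing above can be sketched as follows. The helper name is an illustrative assumption, decimal units (1 MB = 1,000 KB) are assumed, and the per-server rates are the benchmark figures quoted on the slide. The slide rounds the peak rates to about 300 and 160 MB/sec; the unrounded rates lead to the same server counts.

```python
import math

def peak_mb_per_sec(records, record_kb, window_hours):
    """Average rate needed to move all records within the active window."""
    return records * record_kb / 1000 / (window_hours * 3600)

write_rate = peak_mb_per_sec(1_500_000_000, 3, 4)   # all three tables
read_rate = peak_mb_per_sec(750_000_000, 3, 4)

WRITE_MB_S = 45        # per RegionServer
COLD_READ_MB_S = 10    # random reads that go to disk
HOT_READ_MB_S = 200    # random reads served from cache
CACHED_SHARE = 0.5     # records cached / warmed on start

write_servers = math.ceil(write_rate / WRITE_MB_S)
cold_servers = math.ceil(read_rate * (1 - CACHED_SHARE) / COLD_READ_MB_S)
hot_servers = math.ceil(read_rate * CACHED_SHARE / HOT_READ_MB_S)
read_servers = max(cold_servers, hot_servers)
print(write_rate, read_rate, write_servers, read_servers)
```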
Memory
Step 1: RegionServer Info (configured values at hbase-site.xml and hbase-env.sh) – Admins Only
Max memory available per Region Server
C-xxx/64/4000
<64 GB>
Heap size of the Region Server JVM
export HBASE_HEAPSIZE=59392 (58 GB)
Default: <1000> (1000 MB)
Memory allocated to BlockCache
hfile.block.cache.size = 0.8 (80%)
Default: <0.4> (40% of Heap)
Memory allocated to Memstore
hbase.regionserver.global.memstore.size = 0.2 (20%)
Default: <0.4> (40% of Heap)
Step 2: Servers required to serve from block cache
Total records 200 M
Average record size 3 KB
Total data served 200 M x 3 KB = 0.55 TB
Total data served through BlockCache 0.55 TB x 50% = 0.28 TB
Loading factor in the (LRU) BlockCache (in HBase 0.94) 85 %
Total BlockCache available per RegionServer 58 GB x 0.8 x 85% = 40 GB
Servers required 0.28 TB / 40 GB = 7 RegionServer
Block cache allocation depends on the mix of read and write access patterns. The remainder of the LRU cache is used by
other resident users such as catalog tables, HFile indexes and bloom filters.
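The BlockCache estimate in Step 2 can be sketched as below, with illustrative function names and decimal units (1 TB = 1,000 GB) assumed. The first call uses the slide's rounded inputs (0.28 TB to cache, 40 GB of cache per server). The second uses unrounded inputs and rounds up, which gives one more server; the overall ask of 8 RegionServers is the same either way.

```python
import math

def blockcache_gb_per_server(heap_gb, cache_fraction, load_factor):
    """Usable BlockCache: heap x hfile.block.cache.size x LRU loading factor."""
    return heap_gb * cache_fraction * load_factor

def servers_for_cache(cached_gb, cache_gb_per_server):
    """Whole servers needed to hold the cached data, rounding up."""
    return math.ceil(cached_gb / cache_gb_per_server)

per_server_gb = blockcache_gb_per_server(58, 0.8, 0.85)   # about 39.4 GB

# Slide figures: 0.28 TB served from cache, 40 GB of cache per server.
slide_servers = servers_for_cache(280, 40)

# Unrounded: 200 M records x 3 KB = 600 GB, half of it cached.
cached_gb = 200_000_000 * 3 / 1_000_000 * 0.5
exact_servers = servers_for_cache(cached_gb, per_server_gb)
print(round(per_server_gb, 2), slide_servers, exact_servers)
```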
Data
Step 1: RegionServer Info (configured values at hbase-site.xml & hbase-env.sh) – Admins Only
Max memory available per RegionServer C-xxx/64/4000 (four 4 TB disks) = 64 GB
Heap size of the RegionServer JVM
export HBASE_HEAPSIZE=59392 (58 GB)
Default: <1000> (1000 MB)
Region size
hbase.hregion.max.filesize = 10737418240
Default: <10737418240> (10 GB)
Memory allocated to MemStore
hbase.regionserver.global.memstore.size = 0.2 (20%)
Default: <0.4> (40% of Heap)
Memstore flush size
hbase.hregion.memstore.flush.size = 134217728
Default: <134217728> (128 MB)
HDFS replication factor
dfs.replication = 3
Default: <3>
Step 2: # Servers Based on data served
Raw disk space to JVM heap / RegionServer 10 GB / 128 MB x 3 x 0.2 = 48
Raw disk space available / RegionServer 48 x 58 GB x 0.2 = 0.56 TB
Total data served through tables 0.55 TB
Total raw data served 0.55 TB x 3 = 1.65 TB
Servers required 1.65 / 0.56 = 3 servers
Project Awesome eCommerce Ask = MAX (Write <7 RS>, Read <8 RS>, Cached<7 RS >, Data <3 RS>) = 8 RS
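The sketch below reproduces the data-based estimate exactly as the slide writes it, including the 0.2 MemStore fraction appearing in both the ratio and the per-server disk figure, and then takes the maximum across the four estimates. Names are illustrative assumptions; decimal units (1 TB = 1,000 GB) are assumed for the per-server figure.

```python
import math

def disk_to_heap_ratio(region_mb, flush_mb, replication, memstore_fraction):
    """Region size / MemStore flush size x replication x MemStore fraction."""
    return region_mb / flush_mb * replication * memstore_fraction

ratio = disk_to_heap_ratio(10 * 1024, 128, 3, 0.2)    # about 48
disk_tb_per_server = ratio * 58 * 0.2 / 1000          # about 0.56 TB, as on the slide
raw_tb = 0.55 * 3                                     # table data x replication
data_servers = math.ceil(raw_tb / disk_tb_per_server)

# Estimates from the earlier slides: write, read and cache sizing.
ask = max(7, 8, 7, data_servers)
print(round(ratio), round(disk_tb_per_server, 2), data_servers, ask)
```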
Capacity Calculations Tools
Apache Storm Resources
Drivers and measures, in the order of importance for Storm:
Throughput: Events processed per second or parallel workers
§ # events, # messages / sec
§ Tuples / sec
Memory: Worker/ Slot memory for spouts and bolts
§ Spout and bolt JVM size
§ Message and Tuple size
CPU: CPU threads needed for workers/ executors
§ Cores for spout and bolt processes, inter and intra
§ Inter and Intra worker comm.
Latency: Time taken for processing the input stream of events
§ Execute / complete latency
Data (Storage): N/A
§ N/A
Working Through a Use Case
Wonder Search wants to index editorial
content in near real-time for users to be able
to search content. The editorial content is
available in Apache HBase.
Spout: Scans HBase since the last scan till
current time to get the editorial content.
Bolt 1: Build the index and store it back in
HBase.
Bolt 2: Push the index for serving.
ILLUSTRATIVE
Throughput and Latency
Step 1: Supervisor Level Info (configured values at storm.yaml or multitenant-scheduler.yaml) – Admins Only
Incoming (worker) messages queue size topology.receiver.buffer.size, Default: <8>
Outgoing (worker) messages queue size topology.transfer.buffer.size, Default: <1024>
Incoming (executor) tuple queue size topology.executor.receive.buffer.size, Default: <1024>
Outgoing (executor) tuple queue size topology.executor.send.buffer.size, Default: <1024>
Slots available per supervisor
supervisor.slots.ports
<24>, hyper-threaded cores for dual hex-core machines
Multi-tenant scheduler (user isolation scheduler)
multitenant.scheduler.user.pools: <users> : <# nodes>,
topology.isolate.machines: <Number of Nodes>
Step 2: # Servers Based on Throughput
Events processed with single spout per worker 1,000 messages / sec
Target throughput required 8,000 messages / sec
Number of spout executors required 8,000 / 1,000 = 8 (across 8 slots)
Number of tuples executed across 1st bolt (5 executors) 10,000 tuples / sec
Total executors required for 1st bolt 8 x 5 = 40 (across 40 slots)
Number of tuples executed across 2nd bolt (5 executors) 15,000 tuples / sec
Total executors required for 2nd Bolt 8 x 5 = 40 (across 40 slots)
Total slots based on executors 8 + 40 + 40 = 88 Slots
Number of supervisors required 88 / 24 = 4 servers
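The slot count in Step 2 can be sketched as follows, assuming (as the slide does) one executor per slot and five executors of each bolt per spout executor. The names are illustrative and are not Storm API calls.

```python
import math

SLOTS_PER_SUPERVISOR = 24      # supervisor.slots.ports on the slide
BOLT_EXECUTORS_PER_SPOUT = 5   # measured with a single spout

def spout_executors(target_msgs_per_sec, msgs_per_sec_per_spout):
    """Spout executors needed to reach the target rate, rounding up."""
    return math.ceil(target_msgs_per_sec / msgs_per_sec_per_spout)

spouts = spout_executors(8_000, 1_000)
bolt1 = spouts * BOLT_EXECUTORS_PER_SPOUT
bolt2 = spouts * BOLT_EXECUTORS_PER_SPOUT
slots = spouts + bolt1 + bolt2
supervisors = math.ceil(slots / SLOTS_PER_SUPERVISOR)
print(spouts, bolt1, bolt2, slots, supervisors)
```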
CPU vs. Throughput
Step 1: Track CPU usage with JVM tools (jmap/ jstack)
Max CPU cores per supervisor C-xxx/48/4000 (12 physical cores)
CPU usage for 1000 messages / sec
4 physical cores (32.12%)
Includes 1 spout and 5 bolt executors each for bolts 1
and 2, and CPU usage for inter-messaging (ZeroMQ or
Netty)
Equal CPU division between spout and bolt executor
(assumed)
Executor CPU needs = 4 / (1+5+5) = 4/11 cores
Total workers
TOPOLOGY_WORKERS
Config#setNumWorkers
Tasks per component
TOPOLOGY_TASKS
ComponentConfigurationDeclarer#setNumTasks()
Step 2: Extrapolate for Target Throughput (linear increase)
Target spout executors 8, TopologyBuilder#setSpout()
Target bolt executors 40, TopologyBuilder#setBolt()
CPU needed for spout executors 8 x 4/11 = 3 cores
CPU needed for 1st bolt executors 40 x 4/11 = 15 cores
CPU needed for 2nd bolt executors 40 x 4/11 = 15 cores
CPU need for the topology 3 + 15 + 15 = 33 cores
Total supervisors needed 33 /12 = 3 servers
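The linear extrapolation above can be sketched with exact fractions so that the per-component rounding matches the slide. Names are illustrative assumptions; the equal split of CPU across executors is the slide's own assumption.

```python
import math
from fractions import Fraction

CORES_PER_SUPERVISOR = 12
# 4 cores measured for 1 spout executor plus 5 executors for each of two bolts.
CORES_PER_EXECUTOR = Fraction(4, 1 + 5 + 5)

def cores_needed(executors):
    """Cores for one component, rounded up to whole cores."""
    return math.ceil(executors * CORES_PER_EXECUTOR)

spout_cores = cores_needed(8)
bolt1_cores = cores_needed(40)
bolt2_cores = cores_needed(40)
total_cores = spout_cores + bolt1_cores + bolt2_cores
cpu_supervisors = math.ceil(total_cores / CORES_PER_SUPERVISOR)
print(spout_cores, bolt1_cores, bolt2_cores, total_cores, cpu_supervisors)
```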
Memory vs. Throughput
Step 1: Supervisor Level Info
Max memory available per supervisor node
C-xxx/48/4000 <48 GB>
(Usable 42G out of 48G, rest for the OS)
Step 2: # Servers Based on Memory needs
Events processed across spout executors 8,000 messages / sec
Avg. event or message size 3 MB
Data processed per second across spout executors 8,000 x 3 MB = 24 GB / sec
Events processed per second across 1st bolt executors 10,000 x 8 = 80,000 tuples / sec
Average tuple size 100 KB
Data processed per second across 1st bolt executors 80,000 tuples / sec x 100 KB = 8 GB / sec
Data processed per second across 2nd bolt executors 15,000 x 8 tuples / sec x 100 KB = 12 GB / sec
Total data processed 24 GB / sec + 8 GB / sec + 12 GB / sec = 44 GB / sec
Number of Supervisors required to process data 44 / 42 = 2 servers
Project Wonder Search Ask = MAX (Throughput <4 Servers>, CPU <3 Servers>, Memory <2 Servers>) = 4 Servers
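The memory estimate and the final ask can be sketched as below. It follows the slide in comparing the data handled per second with the usable memory per supervisor; decimal units (1 GB = 1,000,000 KB) are assumed and the names are illustrative.

```python
import math

KB_PER_GB = 1_000_000
USABLE_GB_PER_SUPERVISOR = 42

def gb_per_sec(events_per_sec, size_kb):
    """Data volume handled per second by one component."""
    return events_per_sec * size_kb / KB_PER_GB

spout_gb = gb_per_sec(8_000, 3_000)        # 3 MB messages
bolt1_gb = gb_per_sec(10_000 * 8, 100)     # 100 KB tuples
bolt2_gb = gb_per_sec(15_000 * 8, 100)
total_gb = spout_gb + bolt1_gb + bolt2_gb
memory_servers = math.ceil(total_gb / USABLE_GB_PER_SUPERVISOR)

# Throughput and CPU estimates come from the two preceding slides.
ask = max(4, 3, memory_servers)
print(total_gb, memory_servers, ask)
```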
Capacity Calculation Tools
On-going SLA Management
[Screenshot: SLA Dashboard on Hadoop Analytics Warehouse, showing queue 1 through queue 11]
Growing with YARN
HDFS (File System)
YARN (Resource Manager)
MapReduce
(Batch)
Spark
(Iterative)
Storm
(Stream)
HBase
Giraph
R, OpenMPI,
Indexing etc.
Coming soon
on YARN
Available
today
…
New Services on YARN
Tez
(DAGs)
Near Future for Capacity Planning
Hadoop
§  CPU as a resource
§  Container reuse
§  Long-running jobs
§  Other potential resources such as disk, network, GPUs etc.
§  Tez as the execution engine
§  Spark-on-YARN etc.
HBase
§  BlockCache implementations (LRU, Slab, Bucket)
§  Short circuit reads
§  Bloom filters and co-processors
§  HBase-on-YARN
Storm
§  Storm-on-YARN
§  More experience with multi-tenancy
Acknowledgement
Hadoop Capacity Planning
Nathan Roberts Hadoop Core Architect
Koji Noguchi Software Engineer
Viraj Bhat Software Engineer
Ryota Egashiri Software Engineer
Balaji Narayan Service Engineer
Anish Matthew Service Engineer
Rajiv Chittajallu SE Architect
HBase Capacity Planning
Francis Liu Software Engineer
Dheeraj Kapur Service Engineer
Storm Capacity Planning
Bobby Evans Software Engineer
Dheeraj Kapur Service Engineer
Thank You
@sumeetksingh

Contenu connexe

Tendances

Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopGiovanna Roda
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranMapR Technologies
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesYahoo Developer Network
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuningAnil Reddy
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7Ted Dunning
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 

Tendances (20)

February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
HW09 Hadoop Vaidya
HW09 Hadoop VaidyaHW09 Hadoop Vaidya
HW09 Hadoop Vaidya
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuning
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
10c introduction
10c introduction10c introduction
10c introduction
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 

En vedette

Hive - Apache hadoop Bigdata training by Desing Pathshala
Hive - Apache hadoop Bigdata training by Desing PathshalaHive - Apache hadoop Bigdata training by Desing Pathshala
Hive - Apache hadoop Bigdata training by Desing PathshalaDesing Pathshala
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in SparkDatabricks
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingDatabricks
 
Dynamic Resource Allocation Spark on YARN
Dynamic Resource Allocation Spark on YARNDynamic Resource Allocation Spark on YARN
Dynamic Resource Allocation Spark on YARNTsuyoshi OZAWA
 
Building Robust, Adaptive Streaming Apps with Spark Streaming
Building Robust, Adaptive Streaming Apps with Spark StreamingBuilding Robust, Adaptive Streaming Apps with Spark Streaming
Building Robust, Adaptive Streaming Apps with Spark StreamingDatabricks
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitSpark Summit
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...Databricks
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Benchmark and Metrics
Benchmark and MetricsBenchmark and Metrics
Benchmark and MetricsYuta Imai
 
Dynamic Resource Allocation in Apache Spark
Dynamic Resource Allocation in Apache SparkDynamic Resource Allocation in Apache Spark
Dynamic Resource Allocation in Apache SparkYuta Imai
 
Spark at Scale
Spark at ScaleSpark at Scale
Spark at ScaleYuta Imai
 
Deep Learning On Apache Spark
Deep Learning On Apache SparkDeep Learning On Apache Spark
Deep Learning On Apache SparkYuta Imai
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideIBM
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 

En vedette (20)

Hive - Apache hadoop Bigdata training by Desing Pathshala
Hive - Apache hadoop Bigdata training by Desing PathshalaHive - Apache hadoop Bigdata training by Desing Pathshala
Hive - Apache hadoop Bigdata training by Desing Pathshala
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in Spark
 
Structuring Spark: DataFrames, Datasets, and Streaming
Capacity Planning for Multi-tenant Hadoop, HBase and Storm Deployments

  • 1. Capacity Planning in Multi-tenant Hadoop, HBase and Storm Deployments
      Presented by Amrit Lal and Sumeet Singh, April 02, 2014
      2014 Hadoop Summit, Amsterdam, Netherlands
  • 2. Introduction
      Sumeet Singh, Senior Director, Product Management, Hadoop and Big Data Platforms, Cloud Engineering Group
      §  Manages Hadoop products team at Yahoo!
      §  Responsible for Product Management, Strategy and Customer Engagements
      §  Managed Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
      §  M.B.A. from UCLA and M.S. from Rensselaer (RPI)
      701 First Avenue, Sunnyvale, CA 94089 USA, @sumeetksingh
      Amrit Lal, Product Manager, Hadoop and Big Data Platforms, Cloud Engineering Group
      §  Product Manager at Yahoo engaged in building high class and robust Hadoop infrastructure services
      §  Eight years of experience across HSBC, Oracle and Google in developing products and platforms for high growth enterprises
      §  M.B.A. from Carnegie Mellon University
      701 First Avenue, Sunnyvale, CA 94089 USA, @amritasshwar
  • 3. Agenda
      1. The Need for Capacity Planning
      2. Big Data Platform Deployment Models
      3. Resource Drivers and Data Sources
      4. Capacity Models and Tools
      5. SLA Management
  • 4. Multi-tenant Apache Hadoop Platform Evolution
      [Chart: Number of Servers (DataNode) and Raw HDFS Storage (in PB) by Year, 2006 to 2014; series: Servers, Storage]
      Chart annotations:
      §  Yahoo! Commits to Scaling Hadoop for Production Use
      §  Research Workloads in Search and Advertising
      §  Production (Modeling) with machine learning & WebMap
      §  Revenue Systems with Security, Multi-tenancy, and SLAs
      §  Open Sourced with Apache
      §  Hortonworks Spinoff for Enterprise hardening
      §  Nextgen Hadoop (H 0.23 YARN)
      §  New Services (HBase, Storm, Hive etc.)
      §  Increased User-base with partitioned namespaces
      §  Apache H2.x (Low latency, Util, HA etc.)
  • 5. Hosted Apps Growth on Apache Hadoop
      [Chart: Number of New Projects by quarter, Q1-11 to Q4-13]
      Q1-11: 272, Q2-11: 288, Q3-11: 306, Q4-11: 330
      Q1-12: 336, Q2-12: 357, Q3-12: 368, Q4-12: 382
      Q1-13: 407, Q2-13: 449, Q3-13: 460, Q4-13: 495
      New Customer Apps On-boarded: 67 projects in 2011, 52 projects in 2012, 113 projects in 2013
  • 6. Multi-tenant Apache HBase Growth at Yahoo
      Zero to “20” Use Cases (60,000 Regions) in a Year
      [Chart: Number of Servers (RegionServer) and Data Stored (in PB) by quarter, Q1-13 to Q1-14; series: Region Servers, Storage; data labels: 1,140 servers, 33.6 PB]
  • 7. Multi-tenant Apache Storm Growth at Yahoo
      Zero to “175” Production Topologies in a Year
      [Chart: Number of Servers (Supervisor) and Number of Topologies by quarter, Q1-13 to Q1-14; series: Supervisor, Topologies; data labels: 760 servers, 175 topologies; annotation: Multi-tenancy Release]
  • 8. Where Does Capacity Planning Fit?
      [Diagram labels: Phased Environment Production; On-boarding; Capacity Planning; Architecture Validation; Technology Choice; Project Lifecycle Support]
  • 9. Big Data Platform Technology Stack at Yahoo
      [Diagram; layer labels: Compute, Services, Storage, Infrastructure Services]
      Components: Hive, Pig, Oozie, HDFS Proxy, GDM, YARN, MapReduce, HDFS, HBase, Zookeeper, Support Shop, Monitoring, Starling, Messaging Service, HCatalog, Storm, Spark, Tez
      Legend: Relevant for Capacity Planning
  • 10. Deployment Model
      [Diagram labels: DataNode, NodeManager, NameNode, RM; DataNode, RegionServer, NameNode, HBase Master; Nimbus, Supervisor; Administration, Management and Monitoring; ZooKeeper Pools; HTTP/HDFS/GDM Load Proxies; Applications and Data; Data Feeds; Data Stores; Oozie Server; HS2/HCat]
      Legend: Relevant for Capacity Planning
  • 11. Capacity Drivers That Matter
      Drivers and what they measure:
      §  Data (Storage): Volume of data to be stored and processed
      §  Memory: Container for direct and faster access to stored data
      §  CPU: Cores (and threads) available for processing
      §  Throughput: Number of transactions per second
      §  Latency: Time taken to complete a request or operation (includes processing, disk and network I/O time)
  • 12. Apache Hadoop Resources
      Drivers and measures, in the order of importance for Hadoop:
      §  Data (Storage): Data stored in HDFS (disk). Measures: freq., size, retention, # files; rep. factor
      §  Memory: Map and Reduce containers (in H 0.23/2.0). Measures: Map memory; Reduce memory
      §  CPU: YARN-2 for Capacity Scheduler, Yahoo is not using it yet. Measures: N/A
      §  Throughput: Data processed/second with concurrent Mappers and Reducers. Measures: total data processed; Maps and Reduces to run (simple or complex DAGs)
      §  Latency: Time taken for the jobs to complete. Measures: individual job run times; time to finish all jobs (when run in parallel), i.e. peak usage
  • 13. Working Through a Use Case
      Pig Mail needs to process 30 TB of data every day in about 6 hours so that it can develop algorithms that detect spam more effectively. A Pig script will parse the data in sequential phases to finally materialize the features of a mail that decide whether the mail is spam.
      [Pig DAG, illustrative: Stage 1 (node 1), Stage 2 (nodes 2-L and 2-R), Stage 3 (node 3)]
  • 14. Data (Storage)
      Step 1: Pig Mail Project Info (User Input)
      §  Data upload frequency: Once daily
      §  Data added per upload: 1 TB / day
      §  Data retention (Input): 30 days
      §  Data output: 50 GB
      §  Data retention: 1 day
      §  Anticipated growth in data volume (3-6 months): 20%
      Step 2: # Servers Based on Storage (default values at hdfs-site.xml)
      §  HDFS replication factor: dfs.replication, Default: <3>
      §  HDFS required: (30 + 0.05) x 1.2 x 3 ≈ 108 TB
      §  Suggested server config (based on total cost): C-xxx/48/4000 (four 4 TB disks)
      §  Storage available per server: 12 TB out of 16 TB (rest for OS, temp, swap etc.); dfs.datanode.du.reserved, <107374182400> 1 TB
      §  Servers required: 108 / 12 = 9 servers
      Step 3: Namespace Needed (default values at hdfs-site.xml)
      §  HDFS block size: dfs.blocksize, Default: <134217728> 128 MB
      §  Average file size: 1.5 x 128 MB ≈ 200 MB (assumed)
      §  Namespace for files: 108 TB / 200 MB = 540,000
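The storage arithmetic on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' tool: the function name and parameters are hypothetical, and rounding the HDFS requirement to a whole terabyte before dividing is an assumption made to match the slide's 108 TB figure.

```python
import math


def storage_sizing(input_tb_per_day=1.0, input_retention_days=30,
                   output_tb=0.05, growth=0.20, replication=3,
                   usable_tb_per_server=12, avg_file_mb=200):
    """Return (HDFS TB required, servers required, files in namespace)."""
    # Data held at any time: retained input plus one day of output (50 GB).
    raw_tb = input_tb_per_day * input_retention_days + output_tb
    # Apply anticipated growth and the HDFS replication factor.
    # Assumption: round to a whole TB, as the slide does (108.18 -> 108).
    hdfs_tb = round(raw_tb * (1 + growth) * replication)
    # Whole servers needed at the usable capacity per server.
    servers = math.ceil(hdfs_tb / usable_tb_per_server)
    # Namespace estimate, using decimal units (1 TB = 1,000,000 MB).
    files = hdfs_tb * 1_000_000 // avg_file_mb
    return hdfs_tb, servers, files


hdfs_tb, servers, files = storage_sizing()
print(hdfs_tb, servers, files)
```

Changing `growth` or `usable_tb_per_server` shows how sensitive the server count is to the growth forecast and to the chosen server configuration.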
  • 15. Memory
      Step 1: Cluster/Node Level Info (configured values at yarn-site.xml), Admins Only
      §  Max memory on the node for containers: yarn.nodemanager.resource.memory-mb, Conf: <45056> (44G out of 48G, rest for the OS)
      §  Virtual to physical memory: yarn.nodemanager.vmem-pmem-ratio, Default: <2.1> (ratio by which virtual memory may exceed physical)
      §  Min allocable memory for containers: yarn.scheduler.minimum-allocation-mb, Default: <512> (0.5G)
      §  Max allocable memory for containers: yarn.scheduler.maximum-allocation-mb, Default: <8192> (8G)
      Step 2: Container Level Info (default values at mapred-site.xml)
      §  Map task container size: mapreduce.map.memory.mb, Default: <1536> (1.5G)
      §  Reduce task container size: mapreduce.reduce.memory.mb, Default: <2048> (2G)
      §  MR AppMaster memory size: yarn.app.mapreduce.am.resource.mb, Default: <1536> (1.5G)
      §  Map task JVM heap size: mapreduce.map.java.opts, Default: -Xmx1024m
      §  Reduce task JVM heap size: mapreduce.reduce.java.opts, Default: -Xmx1536m
      Map and Reduce container sizes are determined by the users developing the app, based on the memory needs of the tasks.
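To see what these settings imply per node, the sketch below divides the node's container memory by the container sizes listed on the slide. The helper names are hypothetical, and the calculation ignores the MR AppMaster container and any mixing of map and reduce containers on the same node.

```python
# Values as listed on the slide (MB).
NODE_CONTAINER_MB = 45056      # yarn.nodemanager.resource.memory-mb
MAP_CONTAINER_MB = 1536        # mapreduce.map.memory.mb
REDUCE_CONTAINER_MB = 2048     # mapreduce.reduce.memory.mb
MAP_HEAP_MB = 1024             # -Xmx in mapreduce.map.java.opts
REDUCE_HEAP_MB = 1536          # -Xmx in mapreduce.reduce.java.opts


def containers_per_node(container_mb, node_mb=NODE_CONTAINER_MB):
    """Whole containers of one size that fit in a node's container memory."""
    return node_mb // container_mb


def heap_fraction(heap_mb, container_mb):
    """Share of the container taken by the JVM heap; the rest is headroom."""
    return heap_mb / container_mb


print(containers_per_node(MAP_CONTAINER_MB))       # map-only node
print(containers_per_node(REDUCE_CONTAINER_MB))    # reduce-only node
print(heap_fraction(MAP_HEAP_MB, MAP_CONTAINER_MB))
print(heap_fraction(REDUCE_HEAP_MB, REDUCE_CONTAINER_MB))
```

The heap fractions show that the listed JVM heaps leave a quarter to a third of each container for non-heap memory.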
  • 16. Throughput
    Step 1: Estimating Number of Mappers
    §  Upper bound on input splits: mapreduce.input.fileinputformat.split.maxsize
    §  Lower bound on input splits: mapreduce.input.fileinputformat.split.minsize
    §  Number of mappers: number of input splits (e.g. 8,192 maps = 1 TB of data / 128M split size)
    Step 2 A: Estimating Number of Reducers
    §  Limit on the input size to reducers: mapreduce.reduce.input.limit, Default: <10737418240> (10G)
    §  Fixed number of reducers: mapreduce.job.reduces
    §  Number of reducers: Min (fixed reducers, total input size / reducer size)
    Step 2 B: Estimating Number of Reducers (Pig and Hive)
    §  Pig: Min (fixed reducers, pig.exec.reducers.max, total input size / pig.exec.reducers.bytes.per.reducer), Default: <max 999, reducer bytes 1GB>
    §  Hive: Min (fixed reducers, hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer), Default: <max 999, reducer bytes 1GB>
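As a rough illustration of the two estimates on this slide, the sketch below applies the slide's rules directly: mappers equal the number of input splits, and reducers are the minimum of the candidates listed. It is not the actual Pig or Hive logic; the helper names and the fixed reducer count of 2,000 are hypothetical, and binary units are assumed so that 1 TB / 128M gives the slide's 8,192.

```python
def ceil_div(a, b):
    """Integer ceiling division without floating point."""
    return -(-a // b)


def estimate_mappers(input_bytes, split_bytes):
    # One mapper per input split.
    return ceil_div(input_bytes, split_bytes)


def estimate_reducers(input_bytes, bytes_per_reducer, fixed_reducers,
                      max_reducers=None):
    # Slide rule: Min(fixed reducers, optional cap, input size / reducer size).
    candidates = [fixed_reducers, ceil_div(input_bytes, bytes_per_reducer)]
    if max_reducers is not None:
        candidates.append(max_reducers)
    return min(candidates)


MIB, GIB, TIB = 1024 ** 2, 1024 ** 3, 1024 ** 4

mappers = estimate_mappers(TIB, 128 * MIB)
# Hypothetical job: 2,000 fixed reducers, Pig/Hive-style cap of 999.
reducers = estimate_reducers(TIB, GIB, fixed_reducers=2000, max_reducers=999)
print(mappers, reducers)
```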
  • 17. Throughput and Latency
    Step 1: Sample Run (with a tenth of data on a sandbox cluster)
    Stages       # Map   Map Size   Map Time   # Reduce   Reduce Size   Reduce Time
    Stage 1      100     1.5 GB     10 Min     50         2 GB          5 Min
    Stage 2 - L  50      1.5 GB     10 Min     20         2 GB          10 Min
    Stage 2 - R  30      1.5 GB     5 Min      10         2 GB          5 Min
    Stage 3      70      1.5 GB     5 Min      30         2 GB          5 Min
    Notes:
    §  SLOT_MILLIS_MAPS and SLOT_MILLIS_REDUCES from Job Counters give the time spent
    §  TOTAL_LAUNCHED_MAPS and TOTAL_LAUNCHED_REDUCES from Job Counters give # Map and # Reduce
    §  Reduce time includes the Sort and Shuffle time. Shuffle time is data per reducer / est. 4 MB/s (bandwidth for data transfer from Map to Reduce)
    §  Add 10% for speculative execution (failed/killed task attempts)
    Step 2: Mappers and Reducers for SLA and Full Dataset
    §  Stage: Stage 1
    §  Mins: 15 / 45 Min
    §  SLA Share: 120 / 360 Min
    §  # Map: (100 x 11) / 8 = 138
    §  # Reduce: (50 x 11) / 8 = 69
    §  Map Total: 207 GB
    §  Reduce Total: 138 GB
    §  Total Mem.: 345 GB
    §  # Servers: 8
    Project Pig Mail Capacity Ask = MAX (Compute <8 Servers>, Storage <9 Servers>) = 9 Servers
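The Stage 1 row of Step 2 can be reproduced with the sketch below. It rests on a reading of the slide that is an assumption here: the factor 11 is taken as 10x for the full dataset plus 10% for speculative execution, the divisor 8 as the ratio of the SLA share (120 min) to the sample stage time (15 min), and the server count as total container memory divided by the 44 GB per node configured on the Memory slide. The function name is hypothetical.

```python
import math


def compute_servers(sample_maps, sample_reduces, scale, speedup,
                    map_gb, reduce_gb, node_container_gb):
    """Scale sample-run task counts to the full dataset and SLA window."""
    maps = math.ceil(sample_maps * scale / speedup)
    reduces = math.ceil(sample_reduces * scale / speedup)
    # Memory needed if all scaled tasks run concurrently.
    total_gb = maps * map_gb + reduces * reduce_gb
    servers = math.ceil(total_gb / node_container_gb)
    return maps, reduces, total_gb, servers


maps, reduces, total_gb, compute = compute_servers(
    sample_maps=100, sample_reduces=50, scale=11, speedup=8,
    map_gb=1.5, reduce_gb=2, node_container_gb=44)

storage = 9  # servers from the storage calculation on the Data (Storage) slide
ask = max(compute, storage)
print(maps, reduces, total_gb, compute, ask)
```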
  • 18. Capacity Calculations Tools
  • 19. Apache HBase Resources (in the order of importance for HBase)
    §  Throughput
        Driver: supported frequency of data read or written in a second (for a given record size)
        Measure: number of reads, writes or scans per second per server
    §  Latency
        Driver: time taken for the read, write or scan operations to complete
        Measure: read or write time in ms (typically) per record
    §  Memory
        Driver: BlockCache; data that needs to be served through cache
        Measure: % of data read from cache; MemStore / BlockCache ratio, RegionServer heap
    §  Data (Storage)
        Driver: total data stored in HDFS (disk)
        Measure: avg. record size x avg. number of records stored
    §  CPU
        Driver: N/A
        Measure: N/A
  • 20. Working Through a Use Case (ILLUSTRATIVE)
    Awesome eCommerce needs to process about 200 M records daily, between 6:00 and 10:00 AM, to update product information. About 50% of the data relates to existing products, where the price may need to be updated by comparing the current price with the new offer price. The remaining 50% of the offers are for new products and will be written without a price comparison. There are three separate tables for product, price and offers, with a 3 KB avg. record size. Writes are on the order of 500 million records and reads 250 million records for each of the three tables.
  • 21. Throughput & Latency
    Step 1: Project Info – User Input
    §  Active reads/writes per day: 4 Hrs.
    §  Avg. writes / day (all three tables): 1,500 M
    §  Avg. reads / day (all three tables): 750 M
    §  Average record size: 3 KB
    §  Records cached / warmed on start: 50%
    Step 2: # Servers Based on Write Throughput
    §  Peak concurrent writes required: 1,500 M x 3 KB / (4 x 3,600 sec) = ~300 MB / sec
    §  Peak write throughput per RegionServer: 45 MB / sec (based on performance benchmarks)
    §  Servers required: 300 / 45 = 7 RegionServers
    Step 3: # Servers Based on Read Throughput
    §  Peak concurrent reads required: 750 M x 3 KB / (4 x 3,600 sec) = ~160 MB / sec
    §  Peak cold random read throughput: 10 MB / sec (based on performance benchmarks)
    §  Peak hot random read throughput: 200 MB / sec (based on performance benchmarks)
    §  RegionServers for cold reads: 160 x 50% / 10 = 8
    §  RegionServers for hot reads: 160 x 50% / 200 = 1
    §  Servers required: Max (8, 1) = 8 RegionServers
    Performance benchmarks were conducted by simulating HBase workloads through YCSB on dedicated servers.
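A minimal sketch of the write and read sizing on this slide follows. The function names are hypothetical and decimal units (1 MB = 1,000 KB) are assumed. The unrounded rates (312.5 and 156.25 MB / sec) differ slightly from the slide's rounded ~300 and ~160, but lead to the same server counts.

```python
import math


def peak_mb_per_sec(records, record_kb, window_hours):
    """Average rate needed to move all records within the active window."""
    return records * record_kb / 1000 / (window_hours * 3600)


def throughput_servers(writes, reads, record_kb, window_hours,
                       write_mb_s, cold_read_mb_s, hot_read_mb_s,
                       cached_fraction):
    """Return (RegionServers for writes, RegionServers for reads)."""
    write_rate = peak_mb_per_sec(writes, record_kb, window_hours)
    read_rate = peak_mb_per_sec(reads, record_kb, window_hours)
    write_rs = math.ceil(write_rate / write_mb_s)
    # Reads are split between cache misses (cold) and cache hits (hot).
    cold_rs = math.ceil(read_rate * (1 - cached_fraction) / cold_read_mb_s)
    hot_rs = math.ceil(read_rate * cached_fraction / hot_read_mb_s)
    return write_rs, max(cold_rs, hot_rs)


write_rs, read_rs = throughput_servers(
    writes=1_500_000_000, reads=750_000_000, record_kb=3, window_hours=4,
    write_mb_s=45, cold_read_mb_s=10, hot_read_mb_s=200, cached_fraction=0.5)
print(write_rs, read_rs)
```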
  • 22. Memory
    Step 1: RegionServer Info (configured values at hbase-site.xml and hbase-env.sh) – Admins Only
    §  Max memory available per RegionServer: C-xxx/64/4000 <64 GB>
    §  Heap size of the RegionServer JVM: export HBASE_HEAPSIZE = 59392 (58 GB), Default: <1000> (1000 MB)
    §  Memory allocated to BlockCache: hfile.block.cache.size = 0.8 (80%), Default: <0.4> (40% of Heap)
    §  Memory allocated to MemStore: hbase.regionserver.global.memstore.size = 0.2 (20%), Default: <0.4> (40% of Heap)
    Step 2: Servers Required to Serve from BlockCache
    §  Total records: 200 M
    §  Average record size: 3 KB
    §  Total data served: 200 M x 3 KB = 0.55 TB
    §  Total data served through BlockCache: 0.55 TB x 50% = 0.28 TB
    §  Loading factor in the (LRU) BlockCache (in HBase 0.94): 85%
    §  Total BlockCache available per RegionServer: 58 GB x 0.8 x 85% = 40 GB
    §  Servers required: 0.28 TB / 40 GB = 7 RegionServers
    BlockCache allocation depends on the mix of read and write access patterns. The remainder of the LRU cache is used by other resident users such as catalog tables, HFile indexes and bloom filters.
  • 23. Data
    Step 1: RegionServer Info (configured values at hbase-site.xml & hbase-env.sh) – Admins Only
    §  Max memory available per RegionServer: C-xxx/64/4000 (four 4 TB disks) = 64 GB
    §  Heap size of the RegionServer JVM: export HBASE_HEAPSIZE = 59392 (58 GB), Default: <1000> (1000 MB)
    §  Region size: hbase.hregion.max.filesize = 10737418240, Default: <10737418240> (10 GB)
    §  Memory allocated to MemStore: hbase.regionserver.global.memstore.size = 0.2 (20%), Default: <0.4> (40% of Heap)
    §  MemStore flush size: hbase.hregion.memstore.flush.size = 134217728, Default: <134217728> (128 MB)
    §  HDFS replication factor: dfs.replication = 3, Default: <3>
    Step 2: # Servers Based on Data Served
    §  Raw disk space to JVM heap / RegionServer: 10 GB / 128 MB x 3 x 0.2 = 48
    §  Raw disk space available / RegionServer: 48 x 58 GB x 0.2 = 0.56 TB
    §  Total data served through tables: 0.55 TB
    §  Total raw data served: 0.55 TB x 3 = 1.65 TB
    §  Servers required: 1.65 / 0.56 = 3 servers
    Project Awesome eCommerce Ask = MAX (Write <7 RS>, Read <8 RS>, Cached <7 RS>, Data <3 RS>) = 8 RS
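The data-based sizing and the final ask can be sketched as follows. The code reproduces the slide's arithmetic as written, including the second multiplication by the 0.2 MemStore fraction when converting the disk-to-heap ratio into raw disk space per RegionServer. The function name is hypothetical, and the GB to TB conversion assumes decimal units.

```python
import math


def data_servers(table_data_tb, region_gb, flush_mb, replication,
                 memstore_fraction, heap_gb):
    """Return (disk-to-heap ratio, raw TB per RegionServer, servers)."""
    # Ratio of raw disk space to JVM heap, e.g. 10 GB / 128 MB x 3 x 0.2.
    ratio = (region_gb * 1024 / flush_mb) * replication * memstore_fraction
    # As on the slide: ratio x heap x MemStore fraction, converted to TB.
    raw_per_rs_tb = ratio * heap_gb * memstore_fraction / 1000
    raw_total_tb = table_data_tb * replication
    return ratio, raw_per_rs_tb, math.ceil(raw_total_tb / raw_per_rs_tb)


ratio, raw_per_rs_tb, data_rs = data_servers(
    table_data_tb=0.55, region_gb=10, flush_mb=128, replication=3,
    memstore_fraction=0.2, heap_gb=58)

# Final ask: the largest of the write, read, cache and data estimates.
ask = max(7, 8, 7, data_rs)
print(round(ratio), round(raw_per_rs_tb, 2), data_rs, ask)
```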
  • 24. Capacity Calculations Tools
  • 25. Apache Storm Resources (in the order of importance for Storm)
    §  Throughput
        Driver: events processed per second or parallel workers
        Measure: # events, # messages / sec; tuples / sec
    §  Memory
        Driver: worker / slot memory for spouts and bolts
        Measure: spout and bolt JVM size; message and tuple size
    §  CPU
        Driver: CPU threads needed for workers / executors
        Measure: cores for spout and bolt processes; inter- and intra-worker communication
    §  Latency
        Driver: time taken for processing the input stream of events
        Measure: execute / complete latency
    §  Data (Storage)
        Driver: N/A
        Measure: N/A
  • 26. Working Through a Use Case (ILLUSTRATIVE)
    Wonder Search wants to index editorial content in near real-time so that users can search it. The editorial content is available in Apache HBase.
    §  Spout: scans HBase from the time of the last scan to the current time to get the editorial content
    §  Bolt 1: builds the index and stores it back in HBase
    §  Bolt 2: pushes the index for serving
  • 27. Throughput and Latency
    Step 1: Supervisor Level Info (configured values at storm.yaml or multitenant-scheduler.yaml) – Admins Only
    §  Incoming (worker) messages queue size: topology.receiver.buffer.size, Default: <8>
    §  Outgoing (worker) messages queue size: topology.transfer.buffer.size, Default: <1024>
    §  Incoming (executor) tuple queue size: topology.executor.receive.buffer.size, Default: <1024>
    §  Outgoing (executor) tuple queue size: topology.executor.send.buffer.size, Default: <1024>
    §  Slots available per supervisor: supervisor.slots.ports <24>, hyper-threaded cores for dual hex-core machines
    §  Multi-tenant scheduler (user isolation scheduler): multitenant.scheduler.user.pools: <users> : <# nodes>, topology.isolate.machines: <Number of Nodes>
    Step 2: # Servers Based on Throughput
    §  Events processed with single spout per worker: 1,000 messages / sec
    §  Target throughput required: 8,000 messages / sec
    §  Number of spout executors required: 8,000 / 1,000 = 8 (across 8 slots)
    §  Number of tuples executed across 1st bolt (5 executors): 10,000 tuples / sec
    §  Total executors required for 1st bolt: 8 x 5 = 40 (across 40 slots)
    §  Number of tuples executed across 2nd bolt (5 executors): 15,000 tuples / sec
    §  Total executors required for 2nd bolt: 8 x 5 = 40 (across 40 slots)
    §  Total slots based on executors: 8 + 40 + 40 = 88 slots
    §  Number of supervisors required: 88 / 24 = 4 servers
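The slot calculation in Step 2 can be illustrated with a short sketch. It assumes, as the slide does, one slot per executor and linear scaling of bolt executors with the number of spout executors. The function name and parameter names are hypothetical.

```python
import math


def storm_slots(target_msgs_per_sec, msgs_per_spout_per_sec,
                bolt_executors_per_spout, slots_per_supervisor):
    """Return (spout executors, bolt executors per bolt, slots, supervisors)."""
    spouts = math.ceil(target_msgs_per_sec / msgs_per_spout_per_sec)
    # Each bolt needs its measured executor count for every spout executor.
    bolts = [spouts * n for n in bolt_executors_per_spout]
    slots = spouts + sum(bolts)
    return spouts, bolts, slots, math.ceil(slots / slots_per_supervisor)


spouts, bolts, slots, supervisors = storm_slots(
    target_msgs_per_sec=8000, msgs_per_spout_per_sec=1000,
    bolt_executors_per_spout=[5, 5], slots_per_supervisor=24)
print(spouts, bolts, slots, supervisors)
```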
  • 28. CPU vs. Throughput
    Step 1: Track CPU Usage with JVM Tools (jmap / jstack)
    §  Max CPU cores per supervisor: C-xxx/48/4000 (12 physical cores)
    §  CPU usage for 1,000 messages / sec: 4 physical cores (32.12%). Includes 1 spout and 5 bolt executors each for bolts 1 and 2, and CPU usage for inter-messaging (ZeroMQ or Netty)
    §  Equal CPU division between spout and bolt executors (assumed): executor CPU needs = 4 / (1 + 5 + 5) = 4/11 cores
    §  Total workers: TOPOLOGY_WORKERS, Config#setNumWorkers
    §  Tasks per component: TOPOLOGY_TASKS, ComponentConfigurationDeclarer#setNumTasks()
    Step 2: Extrapolate for Target Throughput (linear increase)
    §  Target spout executors: 8, TopologyBuilder#setSpout()
    §  Target bolt executors: 40, TopologyBuilder#setBolt()
    §  CPU needed for spout executors: 8 x 4/11 = 3 cores
    §  CPU needed for 1st bolt executors: 40 x 4/11 = 15 cores
    §  CPU needed for 2nd bolt executors: 40 x 4/11 = 15 cores
    §  CPU need for the topology: 3 + 15 + 15 = 33 cores
    §  Total supervisors needed: 33 / 12 = 3 servers
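The CPU extrapolation can be sketched as below, under the slide's stated assumptions of equal CPU per executor and a linear increase with throughput. Exact fractions are used so that 4/11 is not distorted by floating point, and each component's need is rounded up to whole cores as on the slide. The function name is hypothetical.

```python
import math
from fractions import Fraction


def cpu_supervisors(measured_cores, measured_executors, target_executors,
                    cores_per_supervisor):
    """Return (cores per executor, cores per component, supervisors)."""
    # Equal CPU share per executor, e.g. 4 cores / (1 + 5 + 5) executors.
    per_executor = Fraction(measured_cores, measured_executors)
    # Round each component up to whole cores before summing.
    cores = [math.ceil(n * per_executor) for n in target_executors]
    supervisors = math.ceil(Fraction(sum(cores), cores_per_supervisor))
    return per_executor, cores, supervisors


per_executor, cores, supervisors = cpu_supervisors(
    measured_cores=4, measured_executors=1 + 5 + 5,
    target_executors=[8, 40, 40], cores_per_supervisor=12)
print(per_executor, cores, sum(cores), supervisors)
```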
  • 29. Memory vs. Throughput
    Step 1: Supervisor Level Info
    §  Max memory available per supervisor node: C-xxx/48/4000 <48 GB> (usable 42G out of 48G, rest for the OS)
    Step 2: # Servers Based on Memory Needs
    §  Events processed across spout executors: 8,000 messages / sec
    §  Avg. event or message size: 3 MB
    §  Data processed per second across spout executors: 8,000 x 3 MB = 24 GB / sec
    §  Events processed per second across 1st bolt executors: 10,000 x 8 = 80,000 tuples / sec
    §  Average tuple size: 100 KB
    §  Data processed per second across 1st bolt executors: 80,000 tuples / sec x 100 KB = 8 GB / sec
    §  Data processed per second across 2nd bolt executors: 15,000 x 8 tuples / sec x 100 KB = 12 GB / sec
    §  Total data processed: 24 GB / sec + 8 GB / sec + 12 GB / sec = 44 GB / sec
    §  Number of supervisors required to process data: 44 / 42 = 2 servers
    Project Wonder Search Ask = MAX (Throughput <4 Servers>, CPU <3 Servers>, Memory <2 Servers>) = 4 Servers
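A small sketch of the memory estimate and the final ask is shown below. It follows the slide in treating one second's worth of data as the memory footprint and assumes decimal units (1 GB = 1,000,000 KB). The function name and the stream tuples are illustrative, not part of any Storm API.

```python
import math


def memory_supervisors(streams, usable_gb_per_supervisor):
    """streams: list of (items per second, item size in KB) per component."""
    gb_per_sec = [rate * size_kb / 1_000_000 for rate, size_kb in streams]
    total_gb = sum(gb_per_sec)
    return gb_per_sec, total_gb, math.ceil(total_gb / usable_gb_per_supervisor)


streams = [
    (8_000, 3_000),      # spout: 8,000 messages / sec at 3 MB each
    (10_000 * 8, 100),   # 1st bolt: 80,000 tuples / sec at 100 KB each
    (15_000 * 8, 100),   # 2nd bolt: 120,000 tuples / sec at 100 KB each
]
gb_per_sec, total_gb, memory_servers = memory_supervisors(streams, 42)

# Final ask: the largest of the throughput, CPU and memory estimates.
ask = max(4, 3, memory_servers)
print(gb_per_sec, total_gb, memory_servers, ask)
```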
  • 30. Capacity Calculation Tools
  • 31. On-going SLA Management
    [Figure: SLA Dashboard on Hadoop Analytics Warehouse, with panels for queue 1 through queue 11]
  • 32. Growing with YARN
    [Figure: New Services on YARN. Legend: Available today / Coming soon on YARN]
    §  HDFS (File System)
    §  YARN (Resource Manager)
    §  MapReduce (Batch)
    §  Tez (DAGs)
    §  Spark (Iterative)
    §  Storm (Stream)
    §  HBase
    §  Giraph
    §  R, OpenMPI, Indexing etc.
    §  …
  • 33. Near Future for Capacity Planning
    Hadoop
    §  CPU as a resource
    §  Container reuse
    §  Long-running jobs
    §  Other potential resources such as disk, network, GPUs etc.
    §  Tez as the execution engine
    §  Spark-on-YARN etc.
    HBase
    §  BlockCache implementations: LRU, Slab, Bucket
    §  Short circuit reads
    §  Bloom filters and co-processors
    §  HBase-on-YARN
    Storm
    §  Storm-on-YARN
    §  More experience with multi-tenancy
  • 34. Acknowledgement
    Hadoop Capacity Planning
    §  Nathan Roberts, Hadoop Core Architect
    §  Koji Noguchi, Software Engineer
    §  Viraj Bhat, Software Engineer
    §  Ryota Egashiri, Software Engineer
    §  Balaji Narayan, Service Engineer
    §  Anish Matthew, Service Engineer
    §  Rajiv Chittajallu, SE Architect
    HBase Capacity Planning
    §  Francis Liu, Software Engineer
    §  Dheeraj Kapur, Service Engineer
    Storm Capacity Planning
    §  Bobby Evans, Software Engineer
    §  Dheeraj Kapur, Service Engineer
  • 35. Thank You @sumeetksingh