Operating multi-tenant clusters requires careful capacity planning so that big data projects and applications launch on time, within the expected budget, and with appropriate SLA guarantees. Making such guarantees on a set of standard hardware configurations is key to operating big data platforms as a hosted service for your organization.
This talk highlights the tools, techniques, and methodology applied on a per-project or per-user basis across three primary multi-tenant deployments in the Apache Hadoop ecosystem: MapReduce/YARN and HDFS, HBase, and Storm. These three were chosen because of the significance of the capital investment, as scale increases, in data nodes, region servers, and supervisor nodes, respectively. We will demo the estimation tools developed for these deployments, which can be used for capital planning and forecasting and for cluster resource and SLA management, including making latency and throughput guarantees to individual users and projects.
As we discuss the tools, we will share the considerations that were incorporated to arrive at the most appropriate calculations for these three deployments. We will cover the data sources for the calculations, the resource drivers for different use cases, and how to plan optimal capacity allocation per project given a set of standard hardware configurations.
Capacity Planning for Multi-tenant Hadoop, HBase and Storm Deployments

1. Capacity Planning in Multi-tenant Hadoop, HBase and Storm Deployments
PRESENTED BY Amrit Lal and Sumeet Singh | April 02, 2014
2014 Hadoop Summit, Amsterdam, Netherlands
2. Introduction

Sumeet Singh (@sumeetksingh)
Senior Director, Product Management
Hadoop and Big Data Platforms, Cloud Engineering Group
§ Manages the Hadoop products team at Yahoo
§ Responsible for product management, strategy, and customer engagements
§ Managed the Cloud Services products team and headed strategy functions for the Cloud Platform Group at Yahoo
§ M.B.A. from UCLA and M.S. from Rensselaer (RPI)

Amrit Lal (@amritasshwar)
Product Manager
Hadoop and Big Data Platforms, Cloud Engineering Group
§ Product Manager at Yahoo engaged in building high-quality, robust Hadoop infrastructure services
§ Eight years of experience across HSBC, Oracle, and Google developing products and platforms for high-growth enterprises
§ M.B.A. from Carnegie Mellon University

701 First Avenue, Sunnyvale, CA 94089 USA
3. Agenda
1. The Need for Capacity Planning
2. Big Data Platform Deployment Models
3. Resource Drivers and Data Sources
4. Capacity Models and Tools
5. SLA Management
4. Multi-tenant Apache Hadoop Platform Evolution

[Chart: number of servers (DataNodes) vs. raw HDFS storage (in PB) by year, 2006-2014; the axes run to 45,000 servers and 500 PB. Milestones annotated along the timeline:]
§ Yahoo! commits to scaling Hadoop for production use
§ Research workloads in Search and Advertising
§ Production (modeling) with machine learning and WebMap
§ Revenue systems with security, multi-tenancy, and SLAs
§ Open sourced with Apache
§ Hortonworks spinoff for enterprise hardening
§ Nextgen Hadoop (H 0.23 YARN)
§ New services (HBase, Storm, Hive, etc.)
§ Increased user base with partitioned namespaces
§ Apache H2.x (low latency, utilization, HA, etc.)
6. Multi-tenant Apache HBase Growth at Yahoo

Zero to "20" use cases (60,000 regions) in a year.

[Chart: number of servers (RegionServers) vs. data stored (in PB) by quarter, Q1-13 through Q1-14, reaching 1,140 RegionServers and 33.6 PB.]
7. Multi-tenant Apache Storm Growth at Yahoo

Zero to "175" production topologies in a year.

[Chart: number of servers (Supervisors) vs. number of topologies by quarter, Q1-13 through Q1-14, reaching 760 supervisors and 175 topologies, with an inflection at the multi-tenancy release.]
8. Where Does Capacity Planning Fit

Project lifecycle support: Technology Choice → Architecture Validation → Capacity Planning → On-boarding → Phased Environment → Production
9. Big Data Platform Technology Stack at Yahoo

§ Compute Services: YARN, MapReduce, Tez, Spark, Storm, Pig, Hive, Oozie, HCatalog
§ Storage: HDFS, HBase, HDFS Proxy, GDM
§ Infrastructure Services: Zookeeper, Messaging Service
§ Support: Support Shop, Monitoring, Starling

(The slide highlights the components relevant for capacity planning.)
10. Deployment Model

§ Hadoop pools: NameNode and ResourceManager (RM) masters with DataNode + NodeManager workers
§ HBase pools: NameNode and HBase Master with DataNode + RegionServer workers
§ Storm pools: Nimbus master with Supervisor workers
§ Shared infrastructure: ZooKeeper; HTTP/HDFS/GDM load proxies; Oozie Server; HS2/HCat; applications and data (data feeds, data stores); administration, management, and monitoring

(The worker pools are the components relevant for capacity planning.)
11. Capacity Drivers That Matter

Driver         | Measure
Data (Storage) | Volume of data to be stored and processed
Memory         | Containers for direct and faster access to stored data
CPU            | Cores (and threads) available for processing
Throughput     | Number of transactions per second
Latency        | Time taken to complete a request or operation (includes processing, disk, and network I/O time)
12. Apache Hadoop Resources (in order of importance for Hadoop)

§ Data (Storage), the data stored in HDFS (disk). Measures: frequency, size, retention, and number of files; replication factor.
§ Memory, the Map and Reduce containers (in H 0.23/2.0). Measures: map memory; reduce memory.
§ CPU, via YARN-2 for the Capacity Scheduler; Yahoo is not using it yet. Measure: N/A.
§ Throughput, the data processed per second with concurrent Mappers and Reducers. Measures: total data processed; Maps and Reduces to run (simple or complex DAGs).
§ Latency, the time taken for the jobs to complete. Measures: individual job run times; time to finish all jobs (when run in parallel) at peak usage.
13. Working Through a Use Case (ILLUSTRATIVE)

Pig Mail needs to process 30 TB of data every day in about 6 hours so that it can develop algorithms that detect spam more effectively. A Pig script parses the data in sequential phases to finally materialize the features of a mail that decide whether it is spam.

[Pig DAG: Stage 1 (node 1) feeds Stage 2 (parallel nodes 2-L and 2-R), which feed Stage 3 (node 3).]
14. Data (Storage)

Step 1: Pig Mail Project Info (user input)
§ Data upload frequency: once daily
§ Data added per upload: 1 TB/day
§ Data retention (input): 30 days
§ Data output: 50 GB
§ Data retention (output): 1 day
§ Anticipated growth in data volume (3-6 months): 20%

Step 2: # Servers Based on Storage (default values in hdfs-site.xml)
§ HDFS replication factor: dfs.replication, default <3>
§ HDFS required: (30 + 0.05) x 1.2 x 3 = ~108 TB
§ Suggested server config (based on total cost): C-xxx/48/4000 (four 4 TB disks)
§ Storage available per server: 12 TB out of 16 TB (rest for OS, temp, swap, etc.; dfs.datanode.du.reserved <107374182400>, ~1 TB reserved)
§ Servers required: 108 / 12 = 9 servers

Step 3: Namespace Needed (default values in hdfs-site.xml)
§ HDFS block size: dfs.blocksize, default <134217728> (128 MB)
§ Average file size: 1.5 x 128 MB = ~200 MB (assumed)
§ Namespace for files: 108 TB / 200 MB = 540,000 files
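A minimal Python sketch of the arithmetic in Steps 2 and 3; the function and parameter names are ours for illustration, not from the talk's actual estimation tool:

```python
import math

# Sketch of the storage sizing above (illustrative names, slide's figures).
def hdfs_raw_tb(input_tb, output_tb, growth=1.2, replication=3):
    """Raw HDFS needed: (input + output) x growth headroom x replication."""
    return (input_tb + output_tb) * growth * replication

def datanodes_needed(raw_tb, usable_tb_per_node=12.0):
    return math.ceil(raw_tb / usable_tb_per_node)

raw = hdfs_raw_tb(30, 0.05)              # ~108 TB, as on the slide
print(datanodes_needed(round(raw)))      # 108 / 12 -> 9 DataNodes
print(round(raw) * 1_000_000 // 200)     # namespace: ~540,000 files at 200 MB each
```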
15. Memory

Step 1: Cluster/Node Level Info (configured values in yarn-site.xml; Admins only)
§ Max memory on the node for containers: yarn.nodemanager.resource.memory-mb, conf <45056> (44 GB out of 48 GB, rest for the OS)
§ Virtual-to-physical memory ratio: yarn.nodemanager.vmem-pmem-ratio, default <2.1> (virtual may exceed physical by 2.1:1)
§ Min allocatable memory for containers: yarn.scheduler.minimum-allocation-mb, default <512> (0.5 GB)
§ Max allocatable memory for containers: yarn.scheduler.maximum-allocation-mb, default <8192> (8 GB)

Step 2: Container Level Info (default values in mapred-site.xml)
§ Map task container size: mapreduce.map.memory.mb, default <1536> (1.5 GB)
§ Reduce task container size: mapreduce.reduce.memory.mb, default <2048> (2 GB)
§ MR AppMaster memory size: yarn.app.mapreduce.am.resource.mb, default <1536> (1.5 GB)
§ Map task JVM heap size: mapreduce.map.java.opts, default -Xmx1024m
§ Reduce task JVM heap size: mapreduce.reduce.java.opts, default -Xmx1536m

Map and Reduce container sizes are set by the users developing the app, based on the memory needs of the tasks.
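These node- and container-level settings bound how many containers can run concurrently per node, which is what drives the compute sizing later. A hedged sketch, assuming the configured values above (this concurrency bound is our inference, not stated on the slide):

```python
# How many map/reduce containers fit on one NodeManager at these settings.
NODE_MB   = 45056   # yarn.nodemanager.resource.memory-mb (44 GB usable)
MAP_MB    = 1536    # mapreduce.map.memory.mb (1.5 GB per map container)
REDUCE_MB = 2048    # mapreduce.reduce.memory.mb (2 GB per reduce container)

print(NODE_MB // MAP_MB)     # at most 29 concurrent map containers per node
print(NODE_MB // REDUCE_MB)  # at most 22 concurrent reduce containers per node
```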
16. Throughput

Step 1: Estimating the Number of Mappers
§ Upper bound on input splits: mapreduce.input.fileinputformat.split.maxsize
§ Lower bound on input splits: mapreduce.input.fileinputformat.split.minsize
§ Number of mappers = number of input splits (e.g., 8,192 maps = 1 TB of data / 128 MB split size)

Step 2A: Estimating the Number of Reducers
§ Limit on the input size to reducers: mapreduce.reduce.input.limit, default <10737418240> (10 GB)
§ Fixed number of reducers: mapreduce.job.reduces
§ Number of reducers = min(fixed reducers, total input size / reducer input size)

Step 2B: Estimating the Number of Reducers (Pig and Hive)
§ Pig: min(fixed reducers, pig.exec.reducers.max, total input size / pig.exec.reducers.bytes.per.reducer); defaults <max 999, 1 GB per reducer>
§ Hive: min(fixed reducers, hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer); defaults <max 999, 1 GB per reducer>
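An illustrative translation of these formulas into code, ignoring the min/max split bounds and using the quoted Pig/Hive-style defaults (999 reducers max, 1 GB per reducer); helper names are ours:

```python
import math

def num_mappers(input_bytes, split_bytes=128 * 1024**2):
    """One mapper per input split."""
    return math.ceil(input_bytes / split_bytes)

def num_reducers(input_bytes, fixed=None,
                 max_reducers=999, bytes_per_reducer=1024**3):
    """min(fixed, cap, input size / bytes-per-reducer), skipping unset terms."""
    by_size = math.ceil(input_bytes / bytes_per_reducer)
    return min(c for c in (fixed, max_reducers, by_size) if c is not None)

one_tb = 1024**4
print(num_mappers(one_tb))    # 8,192 maps = 1 TB / 128 MB splits
print(num_reducers(one_tb))   # min(999, 1,024) = 999 reducers
```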
17. Throughput and Latency

Step 1: Sample Run (with a tenth of the data on a sandbox cluster)

Stage     | # Map | Map Size | Map Time | # Reduce | Reduce Size | Reduce Time
Stage 1   | 100   | 1.5 GB   | 10 min   | 50       | 2 GB        | 5 min
Stage 2-L | 50    | 1.5 GB   | 10 min   | 20       | 2 GB        | 10 min
Stage 2-R | 30    | 1.5 GB   | 5 min    | 10       | 2 GB        | 5 min
Stage 3   | 70    | 1.5 GB   | 5 min    | 30       | 2 GB        | 5 min

Notes:
§ SLOT_MILLIS_MAPS and SLOT_MILLIS_REDUCES from Job Counters give the time spent
§ TOTAL_LAUNCHED_MAPS and TOTAL_LAUNCHED_REDUCES from Job Counters give # Map and # Reduce
§ Reduce time includes the sort and shuffle time; shuffle time is data per reducer / est. 4 MB/s (bandwidth for data transfer from Map to Reduce)
§ Add 10% for speculative execution (failed/killed task attempts)

Step 2: Mappers and Reducers for the SLA and the Full Dataset

Stage   | Sample Time  | SLA Share     | # Map              | # Reduce         | Map Mem. | Reduce Mem. | Total Mem. | # Servers
Stage 1 | 15 / 45 min  | 120 / 360 min | 138 ((100 x 11)/8) | 69 ((50 x 11)/8) | 207 GB   | 138 GB      | 345 GB     | 8

Project Pig Mail capacity ask = MAX(compute <8 servers>, storage <9 servers>) = 9 servers
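A sketch of the Stage 1 scaling: the sample ran on a tenth of the data (scale by 10, plus 10% for speculative execution), and the stage must fit its SLA share (120 min) given its sample runtime (15 min), i.e., 8 sequential task waves. Names and helpers are illustrative:

```python
import math

def tasks_for_sla(sample_tasks, scale=10, speculative=1.1,
                  sample_min=15, sla_min=120):
    """Tasks needed to finish the scaled-up work within the SLA share."""
    waves = sla_min / sample_min                 # 8 sequential task waves
    return math.ceil(sample_tasks * scale * speculative / waves)

maps    = tasks_for_sla(100)                     # (100 x 11) / 8 -> 138
reduces = tasks_for_sla(50)                      # (50 x 11) / 8  -> 69
total_gb = maps * 1.5 + reduces * 2.0            # 207 + 138 = 345 GB
servers  = math.ceil(total_gb / 44)              # 44 GB per node -> 8 servers
print(max(servers, 9))                           # final ask: max(8, 9) = 9
```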
19. Apache HBase Resources (in order of importance for HBase)

§ Throughput, the supported frequency of data read or written per second (for a given record size). Measure: number of reads, writes, or scans per second per server.
§ Latency, the time taken for read, write, or scan operations to complete. Measure: read or write time in ms (typically) per record.
§ Memory, the BlockCache; data that needs to be served through cache. Measures: % of data read from cache; MemStore/BlockCache ratio; RegionServer heap.
§ Data (Storage), the total data stored in HDFS (disk). Measure: avg. record size x avg. number of records stored.
§ CPU: N/A.
20. Working Through a Use Case (ILLUSTRATIVE)

Awesome eCommerce needs to process about 200 M records daily, somewhere between 6:00 and 10:00 AM, to update product information. About 50% of the records relate to existing products, whose price may need to be updated by comparing the current price with the new offer price. The remaining 50% are new products, written without a price comparison.

There are three separate tables for products, prices, and offers, with a 3 KB average record size. Writes are on the order of 500 million records and reads 250 million across each of the three tables.
21. Throughput & Latency

Step 1: Project Info (user input)
§ Active read/write window per day: 4 hrs.
§ Avg. writes/day (all three tables): 1,500 M
§ Avg. reads/day (all three tables): 750 M
§ Average record size: 3 KB
§ Records cached/warmed on start: 50%

Step 2: # Servers Based on Write Throughput
§ Peak concurrent writes required: 1,500 M x 3 KB / (4 x 3,600 sec) = ~300 MB/sec
§ Peak write throughput per RegionServer: 45 MB/sec (based on performance benchmarks)
§ Servers required: 300 / 45 = 7 RegionServers

Step 3: # Servers Based on Read Throughput
§ Peak concurrent reads required: 750 M x 3 KB / (4 x 3,600 sec) = ~160 MB/sec
§ Peak cold random-read throughput: 10 MB/sec (based on performance benchmarks)
§ Peak hot random-read throughput: 200 MB/sec (based on performance benchmarks)
§ RegionServers for cold reads: 160 x 50% / 10 = 8
§ RegionServers for hot reads: 160 x 50% / 200 = 1
§ Servers required: max(8, 1) = 8 RegionServers

Performance benchmarks were conducted by simulating HBase workloads through YCSB on dedicated servers.
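A short sketch of Steps 2 and 3; the 45, 10, and 200 MB/s rates are the slide's YCSB benchmark figures for this hardware, not general constants:

```python
import math

def peak_mb_per_sec(records, record_kb=3, window_hours=4):
    """Sustained MB/s needed to move the daily records in the active window."""
    return records * record_kb / (window_hours * 3600) / 1024

write_mbps = peak_mb_per_sec(1_500e6)            # ~300 MB/s sustained
read_mbps  = peak_mb_per_sec(750e6)              # ~160 MB/s sustained

write_rs = math.ceil(write_mbps / 45)            # 7 RS for writes
cold_rs  = math.ceil(read_mbps * 0.5 / 10)       # 8 RS for cold reads
hot_rs   = math.ceil(read_mbps * 0.5 / 200)      # 1 RS for hot reads
print(write_rs, max(cold_rs, hot_rs))            # -> 7, 8
```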
22. Memory

Step 1: RegionServer Info (configured values in hbase-site.xml and hbase-env.sh; Admins only)
§ Max memory available per RegionServer: C-xxx/64/4000 <64 GB>
§ Heap size of the RegionServer JVM: export HBASE_HEAPSIZE=59392 (58 GB); default <1000> (1000 MB)
§ Memory allocated to BlockCache: hfile.block.cache.size = 0.8 (80%); default <0.4> (40% of heap)
§ Memory allocated to MemStore: hbase.regionserver.global.memstore.size = 0.2 (20%); default <0.4> (40% of heap)

Step 2: Servers Required to Serve from BlockCache
§ Total records: 200 M
§ Average record size: 3 KB
§ Total data served: 200 M x 3 KB = ~0.55 TB
§ Total data served through BlockCache: 0.55 TB x 50% = ~0.28 TB
§ Loading factor in the (LRU) BlockCache (in HBase 0.94): 85%
§ Total BlockCache available per RegionServer: 58 GB x 0.8 x 85% = ~40 GB
§ Servers required: 0.28 TB / 40 GB = 7 RegionServers

BlockCache allocation depends on the mix of read and write access patterns. The remainder of the LRU cache is used by other resident users such as catalog tables, HFile indexes, and Bloom filters.
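The same arithmetic as a minimal sketch, using the slide's rounded figures; a conservative planner might take the ceiling rather than rounding:

```python
# BlockCache sizing per the step above (slide's rounded figures).
cache_gb  = 58 * 0.8 * 0.85          # ~40 GB of usable LRU cache per RS
cached_gb = 0.55 * 0.5 * 1024        # ~282 GB to be served from cache
print(round(cached_gb / cache_gb))   # ~7 RegionServers
```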
23. Data

Step 1: RegionServer Info (configured values in hbase-site.xml and hbase-env.sh; Admins only)
§ Max memory available per RegionServer: C-xxx/64/4000 (four 4 TB disks) = 64 GB
§ Heap size of the RegionServer JVM: export HBASE_HEAPSIZE=59392 (58 GB); default <1000> (1000 MB)
§ Region size: hbase.hregion.max.filesize = 10737418240; default <10737418240> (10 GB)
§ Memory allocated to MemStore: hbase.regionserver.global.memstore.size = 0.2 (20%); default <0.4> (40% of heap)
§ MemStore flush size: hbase.hregion.memstore.flush.size = 134217728; default <134217728> (128 MB)
§ HDFS replication factor: dfs.replication = 3; default <3>

Step 2: # Servers Based on Data Served
§ Raw-disk-to-JVM-heap ratio per RegionServer: 10 GB / 128 MB x 3 x 0.2 = 48
§ Raw disk space available per RegionServer: 48 x 58 GB x 0.2 = ~0.56 TB
§ Total data served through tables: 0.55 TB
§ Total raw data served: 0.55 TB x 3 = 1.65 TB
§ Servers required: 1.65 / 0.56 = 3 servers

Project Awesome eCommerce ask = MAX(write <7 RS>, read <8 RS>, cached <7 RS>, data <3 RS>) = 8 RegionServers
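The final ask reduces to a max across the four sizing dimensions, since the same RegionServers serve writes, reads, cache, and disk:

```python
# Final capacity ask: the binding constraint across all four dimensions.
asks = {"write": 7, "read": 8, "cached": 7, "data": 3}
print(max(asks.values()))   # 8 RegionServers for Awesome eCommerce
```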
25. Apache Storm Resources (in order of importance for Storm)

§ Throughput, the events processed per second across parallel workers. Measures: # events, # messages/sec; tuples/sec.
§ Memory, the worker/slot memory for spouts and bolts. Measures: spout and bolt JVM size; message and tuple size.
§ CPU, the CPU threads needed for workers/executors. Measures: cores for spout and bolt processes; inter- and intra-worker communication.
§ Latency, the time taken to process the input stream of events. Measure: execute/complete latency.
§ Data (Storage): N/A.
26. Working Through a Use Case (ILLUSTRATIVE)

Wonder Search wants to index editorial content in near real time so that users can search it. The editorial content is available in Apache HBase.

Spout: scans HBase from the last scan time to the current time to fetch the editorial content.
Bolt 1: builds the index and stores it back in HBase.
Bolt 2: pushes the index for serving.
27. Throughput and Latency

Step 1: Supervisor Level Info (configured values in storm.yaml or multitenant-scheduler.yaml; Admins only)
§ Incoming (worker) message queue size: topology.receiver.buffer.size, default <8>
§ Outgoing (worker) message queue size: topology.transfer.buffer.size, default <1024>
§ Incoming (executor) tuple queue size: topology.executor.receive.buffer.size, default <1024>
§ Outgoing (executor) tuple queue size: topology.executor.send.buffer.size, default <1024>
§ Slots available per supervisor: supervisor.slots.ports <24> (hyper-threaded cores on dual hex-core machines)
§ Multi-tenant scheduler (user isolation scheduler): multitenant.scheduler.user.pools: <users>: <# nodes>; topology.isolate.machines: <number of nodes>

Step 2: # Servers Based on Throughput
§ Events processed with a single spout per worker: 1,000 messages/sec
§ Target throughput required: 8,000 messages/sec
§ Number of spout executors required: 8,000 / 1,000 = 8 (across 8 slots)
§ Tuples executed across the 1st bolt (5 executors per spout): 10,000 tuples/sec
§ Total executors required for the 1st bolt: 8 x 5 = 40 (across 40 slots)
§ Tuples executed across the 2nd bolt (5 executors per spout): 15,000 tuples/sec
§ Total executors required for the 2nd bolt: 8 x 5 = 40 (across 40 slots)
§ Total slots based on executors: 8 + 40 + 40 = 88 slots
§ Number of supervisors required: 88 / 24 = 4 servers
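A sketch of the slot arithmetic above; the 1,000 msg/s per spout executor and 5 bolt executors per spout are the measured inputs from the slide:

```python
import math

spouts = math.ceil(8000 / 1000)       # 8 spout executors for target rate
bolt1  = spouts * 5                   # 40 executors for bolt 1
bolt2  = spouts * 5                   # 40 executors for bolt 2
slots  = spouts + bolt1 + bolt2       # 88 slots (one executor per slot)
print(math.ceil(slots / 24))          # 24 slots/supervisor -> 4 servers
```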
28. CPU vs. Throughput

Step 1: Track CPU Usage with JVM Tools (jmap/jstack)
§ Max CPU cores per supervisor: C-xxx/48/4000 (12 physical cores)
§ CPU usage for 1,000 messages/sec: 4 physical cores (32.12%); includes 1 spout and 5 bolt executors each for bolts 1 and 2, plus CPU usage for inter-worker messaging (ZeroMQ or Netty)
§ Equal CPU division between spout and bolt executors (assumed): executor CPU need = 4 / (1 + 5 + 5) = 4/11 cores
§ Total workers: TOPOLOGY_WORKERS, Config#setNumWorkers
§ Tasks per component: TOPOLOGY_TASKS, ComponentConfigurationDeclarer#setNumTasks()

Step 2: Extrapolate for Target Throughput (assuming a linear increase)
§ Target spout executors: 8, TopologyBuilder#setSpout()
§ Target bolt executors: 40 each, TopologyBuilder#setBolt()
§ CPU needed for spout executors: 8 x 4/11 = ~3 cores
§ CPU needed for 1st bolt executors: 40 x 4/11 = ~15 cores
§ CPU needed for 2nd bolt executors: 40 x 4/11 = ~15 cores
§ CPU needed for the topology: 3 + 15 + 15 = 33 cores
§ Total supervisors needed: 33 / 12 = 3 servers
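The same linear extrapolation as a sketch, rounding each component up as the slide does; the 4-cores-per-1,000-msg/s figure and the 11-executor split are the measured inputs:

```python
import math

core_per_executor = 4 / 11                        # equal split (assumed)
spout_cores = math.ceil(8 * core_per_executor)    # ~3 cores
bolt_cores  = math.ceil(40 * core_per_executor)   # ~15 cores per bolt
total_cores = spout_cores + 2 * bolt_cores        # 33 cores for the topology
print(math.ceil(total_cores / 12))                # 12 cores/node -> 3 servers
```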
29. Memory vs. Throughput

Step 1: Supervisor Level Info
§ Max memory available per supervisor node: C-xxx/48/4000 <48 GB> (42 GB usable, rest for the OS)

Step 2: # Servers Based on Memory Needs
§ Events processed across spout executors: 8,000 messages/sec
§ Avg. event or message size: 3 MB
§ Data processed per second across spout executors: 8,000 x 3 MB = ~24 GB/sec
§ Tuples processed per second across 1st-bolt executors: 10,000 x 8 = 80,000 tuples/sec
§ Average tuple size: 100 KB
§ Data processed per second across 1st-bolt executors: 80,000 tuples/sec x 100 KB = ~8 GB/sec
§ Data processed per second across 2nd-bolt executors: 15,000 x 8 tuples/sec x 100 KB = ~12 GB/sec
§ Total data processed: 24 + 8 + 12 = ~44 GB/sec
§ Number of supervisors required to process the data: 44 / 42 = 2 servers

Project Wonder Search ask = MAX(throughput <4 servers>, CPU <3 servers>, memory <2 servers>) = 4 servers
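A sketch of the memory-side sizing plus the final max across the three dimensions, with the slide's figures (3 MB events, 100 KB tuples, 42 GB usable per node):

```python
import math

spout_gbps = 8000 * 3 / 1024                # ~24 GB/s through spouts
bolt1_gbps = 80_000 * 100 / 1024**2         # ~8 GB/s through bolt 1
bolt2_gbps = 120_000 * 100 / 1024**2        # ~12 GB/s through bolt 2
mem_servers = math.ceil((spout_gbps + bolt1_gbps + bolt2_gbps) / 42)  # -> 2

asks = {"throughput": 4, "cpu": 3, "memory": mem_servers}
print(max(asks.values()))                   # Wonder Search ask: 4 servers
```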
32. Growing with YARN

HDFS (file system) and YARN (resource manager) form the base, with services layered on top: MapReduce (batch), Tez (DAGs), Spark (iterative), Storm (stream), HBase, Giraph, and R, OpenMPI, indexing, etc. Some are available today, and others are coming soon to YARN as new services.
33. Near Future for Capacity Planning

Hadoop:
§ CPU as a resource
§ Container reuse
§ Long-running jobs
§ Other potential resources such as disk, network, GPUs, etc.
§ Tez as the execution engine
§ Spark-on-YARN, etc.

HBase:
§ BlockCache implementations: LRU, Slab, Bucket
§ Short-circuit reads
§ Bloom filters and coprocessors
§ HBase-on-YARN

Storm:
§ Storm-on-YARN
§ More experience with multi-tenancy
34. Acknowledgements

Hadoop Capacity Planning:
§ Nathan Roberts, Hadoop Core Architect
§ Koji Noguchi, Software Engineer
§ Viraj Bhat, Software Engineer
§ Ryota Egashiri, Software Engineer
§ Balaji Narayan, Service Engineer
§ Anish Matthew, Service Engineer
§ Rajiv Chittajallu, SE Architect

HBase Capacity Planning:
§ Francis Liu, Software Engineer
§ Dheeraj Kapur, Service Engineer

Storm Capacity Planning:
§ Bobby Evans, Software Engineer
§ Dheeraj Kapur, Service Engineer