4. Need to Scale
▪ Scale to Petabytes for Near Real-time
▪ Multi-tenant clusters still growing
▪ Web Crawl Cache
› ~2.3PB Table
› Batch Processing workload
› 80GB regions -> 20GB regions
5. Region
▪ Subset of a table’s key space
▪ Unit of work
▪ Load distribution
▪ Availability
6. Unit of Work
▪ MapReduce split per region
› Parallelism
› Compute/Recovery Time
› Skew
▪ Filters & Coprocessors
› Region boundaries
› Sparse filters -> scan timeouts
• 30 mins to scan an 80GB region
▪ Custom Applications
› Storm grouping, etc.
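The scan-timeout figure above is simple throughput arithmetic; a minimal sketch, where the ~45 MB/s effective scan throughput is an assumption chosen to match the quoted 30 minutes:

```python
# Estimate how long a full scan of one region takes at a given
# effective throughput. The 45 MB/s figure is an assumption used to
# illustrate the "~30 mins for an 80GB region" number above.
def scan_minutes(region_gb: float, throughput_mb_s: float = 45.0) -> float:
    region_mb = region_gb * 1024
    return region_mb / throughput_mb_s / 60

print(round(scan_minutes(80)))  # → 30 (an 80GB region)
print(round(scan_minutes(20)))  # → 8  (a 20GB region, ~4x faster)
```

This is why shrinking regions from 80GB to 20GB also shrinks the worst-case time a sparse filter can spend inside a single region before the scanner RPC times out.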
7. Load Distribution
▪ Load balancing granularity
▪ Only as fast as the slowest region server
▪ Tasks per region server (e.g. MapReduce)
› Limit running tasks (MAPREDUCE-5583)
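The per-server cap on running tasks can be sketched as a simple admission check (the class and method names below are hypothetical; MAPREDUCE-5583 itself adds job-level limits on concurrently running tasks):

```python
from collections import defaultdict

# Hypothetical admission check: admit a new task for a region server
# only while that server is below a concurrent-task cap, so one hot
# server does not become the bottleneck for the whole job.
class TaskLimiter:
    def __init__(self, max_tasks_per_server: int):
        self.max_tasks = max_tasks_per_server
        self.running = defaultdict(int)

    def try_start(self, server: str) -> bool:
        if self.running[server] >= self.max_tasks:
            return False
        self.running[server] += 1
        return True

    def finish(self, server: str) -> None:
        self.running[server] -= 1

limiter = TaskLimiter(max_tasks_per_server=2)
assert limiter.try_start("rs1") and limiter.try_start("rs1")
assert not limiter.try_start("rs1")  # third concurrent task rejected
limiter.finish("rs1")
assert limiter.try_start("rs1")      # slot freed, admitted again
```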
8. Compaction
▪ Optimization for reads
▪ The fewer files to read, the better
▪ Contend for I/O
▪ Cache Misses
▪ Write amplification
▪ Too many store files
› Blocked flushes (90 secs)
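The flush-blocking behaviour above can be sketched as a simplified model (in HBase the relevant settings are `hbase.hstore.blockingStoreFiles` and the 90-second `hbase.hstore.blockingWaitTime`; the threshold of 16 and the compaction policy below are illustrative):

```python
# Simplified model: when a store accumulates more files than the
# blocking threshold, further memstore flushes wait (up to the
# blocking wait time, 90s by default) until compaction reduces the
# file count. Compaction rewrites bytes: write amplification.
BLOCKING_STORE_FILES = 16  # cf. hbase.hstore.blockingStoreFiles

def flush_blocked(store_file_count: int) -> bool:
    return store_file_count >= BLOCKING_STORE_FILES

def compact(store_files: list[int], max_files: int = 10) -> list[int]:
    # Merge the smallest files into one until at most max_files
    # remain, at the cost of rewriting their bytes.
    if len(store_files) <= max_files:
        return store_files
    store_files = sorted(store_files)
    cut = len(store_files) - max_files + 1
    return [sum(store_files[:cut])] + store_files[cut:]

files = [64] * 20                     # 20 store files of 64MB each
assert flush_blocked(len(files))      # flushes would block here
files = compact(files)
assert not flush_blocked(len(files))  # compaction unblocks flushes
```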
9. Regions and Compaction
11. What size then?
▪ As a general rule, keep regions small-ish
▪ HDFS block size? (not there yet)
12. Scaling Region Count
▪ Master Region Management
› Creation, assignment, balancing, etc.
› Meta table
▪ Metadata
› HDFS scalability
› Zookeeper
› Region Server density
13. ZK Region Assignment
▪ Master orchestrates region assignment
▪ Region mapping tracked in master memory, the meta table, and ZooKeeper znodes
[Diagram: Master, ZooKeeper, the meta table, and region servers hosting regions]
14. Region transition example
1. Master tries to assign the region
2. RS transitions the region to open
3. Master updates its in-memory state
4. RS persists the region state to META
[Diagram: the four steps flowing between Master, RS, ZooKeeper, and Meta]
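The four-step transition above can be sketched as a tiny model (heavily simplified; the real assignment manager handles many more states and failure paths):

```python
# Simplified model of the region transition: the master asks a region
# server to open the region, updates its in-memory map, and the final
# state is persisted to the META table.
class Cluster:
    def __init__(self):
        self.master_state = {}  # master's in-memory region -> server map
        self.meta = {}          # persisted region state (META table)

    def assign(self, region: str, server: str):
        state = self.rs_open(region)                 # 1-2: master asks, RS opens
        self.master_state[region] = (server, state)  # 3: master memory updated
        self.meta[region] = (server, state)          # 4: state persisted to META

    @staticmethod
    def rs_open(region: str) -> str:
        return "OPEN"

c = Cluster()
c.assign("region-1", "rs1")
assert c.master_state["region-1"] == ("rs1", "OPEN")
assert c.meta["region-1"] == ("rs1", "OPEN")
```

With ZooKeeper in the loop, the same state also has to be kept consistent in znodes, which is the three-way communication the next slide calls out.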
15. Observations with 1M regions
▪ Complex
› 3 way communication
› Split brain problem
▪ Zookeeper
› More storage
› Operations like listing a znode are not efficient
[Diagram: Master, ZooKeeper, the meta table, and region servers hosting regions]
16. Enhancements - Assignment
▪ Assignment
› ZK-less assignment (HBASE-11059)
› No involvement of ZK
› Region assignment is controlled by the Master
› Better APIs, e.g. scanning meta vs. ls on a znode
▪ Unlock region states (HBASE-11290)
› Reduce CPU utilization
[Diagram: Master and meta region with region servers hosting regions; ZooKeeper no longer in the assignment path]
18. Single HOT meta
▪ Assignment info is persisted to meta
▪ ~7GB in size for 1M regions
▪ Meta cannot split
▪ Large compactions
▪ Longer failover times
[Diagram: a single meta region on one RS, with the Master and region servers hosting user regions]
19. Enhancements - Split Meta
▪ Split meta (HBASE-11288)
› Distributed I/O load
› Distributed caching
› Shorter scan time
› Distributed compaction
[Diagram: meta regions spread across multiple region servers alongside user regions]
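With meta split, a client can locate the right meta region the same way it locates a user region: by start-key boundaries. A minimal sketch (the boundaries and server names below are hypothetical):

```python
import bisect

# Hypothetical split of meta across servers: each meta region covers a
# range of user-region keys, so lookups, caching, and compactions are
# spread over many servers instead of one hot meta region.
meta_start_keys = ["", "g", "n", "t"]           # sorted start keys
meta_servers    = ["rs1", "rs2", "rs3", "rs4"]  # hosting server per range

def meta_region_for(row_key: str) -> str:
    # Find the last meta region whose start key <= row_key.
    idx = bisect.bisect_right(meta_start_keys, row_key) - 1
    return meta_servers[idx]

assert meta_region_for("apple") == "rs1"
assert meta_region_for("horse") == "rs2"
assert meta_region_for("zebra") == "rs4"
```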
20. Performance comparison
▪ Split size: 200 MB
▪ Meta split across 10 servers, 5 meta regions per server
▪ Assignment time for 3M regions:
› Single meta: 18 mins
› Split meta: 10 mins
21. Scaling NameNode operations
▪ Longer time to create all region dirs under a single table dir
▪ NameNode limitation: a single directory can hold at most ~6.3 million files
[Diagram: TableDir containing RegionDir1, RegionDir2, RegionDirN...]
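The "4k buckets" layout measured below avoids one flat table directory by hashing region directories into intermediate bucket directories; a minimal sketch (the hash choice and path layout are assumptions, only the 4096 bucket count comes from the slides):

```python
import hashlib
from collections import Counter

# Hypothetical bucketing scheme: hash each region into one of 4096
# intermediate bucket directories so that no single directory under
# the table dir accumulates millions of entries.
NUM_BUCKETS = 4096

def region_path(table: str, region: str) -> str:
    digest = hashlib.md5(region.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{table}/{bucket:03x}/{region}"

# 100k regions spread over 4096 buckets: ~25 entries per directory
# instead of 100,000 entries in one flat table directory.
counts = Counter(region_path("t1", f"region-{i}").split("/")[1]
                 for i in range(100_000))
assert len(counts) == NUM_BUCKETS
assert max(counts.values()) < 100
```

Keeping every directory small sidesteps both the per-directory entry limit and the cost of creating millions of children under one parent.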
25. Performance results: region dir creation time (4k buckets)

                  1M regions        5M regions        10M regions
normal table      20 mins           4 hours 23 mins   doesn't finish
humongous table   15 mins 48 secs   1 hour 27 mins    2 hrs 53 mins