This document provides an overview of important concepts for operating HBase, including:
- HBase stores data in column families, which are kept as files on disk; writes go to memory before being flushed to disk.
- Manual and automatic splitting of regions is covered, as well as challenges of improper splitting.
- Tools for monitoring, debugging, and visualizing HBase operations are discussed.
- Key lessons focus on proper data modeling, extensive monitoring, and understanding the whole Hadoop ecosystem.
2. Outline
● HBase internals
● Overview of HBase utilities
● HBase split visualisation with Hannibal
● Challenges & lessons learned
● Resources to get started
3. About me
● Software Architect @ Sentric
● Founder and organizer of the Swiss Big Data User Group
http://www.bigdata-usergroup.ch
● Contact:
christian.guegi@sentric.ch
http://www.sentric.ch
@chrisgugi
5. Data Model
● A sparse, multi-dimensional, sorted map
● A table consists of rows, each with a row key
● Each row may have any number of columns
● Rows are sorted lexicographically based on row key
● Column = Column Family : Column Qualifier
– Cell: {row key, column, timestamp} → value
[Bigtable: A Distributed Storage System for Structured Data]
● Region: contiguous set of sorted rows
● Region: unit of distribution and availability
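The data model above can be pictured as nested maps. The following is a minimal sketch in plain Python, not the HBase API: the class name `SparseTable` and its methods are hypothetical, and string keys stand in for the byte arrays HBase actually uses.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse, sorted, multi-dimensional map (not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> timestamp -> value
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the newest value stored for a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield row keys in lexicographic order within [start_row, stop_row)."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row
```

A region then corresponds to one contiguous `[start_row, stop_row)` slice of the sorted row keys, which is what `scan` walks over.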
6. Physical Data Organization
[Figure: a Region with two column families (content, anchor). Each column family is a Store with one Memstore and one or more HFiles on HDFS. The HLog (WAL) is also on HDFS.]
● Column families are stored separately on disk
– Unit of access control; families can have different access patterns
● Writes are held (sorted) in memory until flush
● Sorted on disk in predictable order
– By row key, column key, descending timestamp
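The sort order named above can be expressed as a single sort key. This is a sketch only: the `KeyValue` tuple and the function names are hypothetical stand-ins, not HBase classes.

```python
from typing import NamedTuple


class KeyValue(NamedTuple):
    row: bytes
    column: bytes  # "family:qualifier"
    timestamp: int
    value: bytes


def storage_order(kv):
    # Ascending row key and column key, descending timestamp (newest first).
    return (kv.row, kv.column, -kv.timestamp)


def sort_for_flush(kvs):
    """Return the in-memory entries in the order they would be written out."""
    return sorted(kvs, key=storage_order)
```

Because the newest version of a cell sorts first, a reader that wants the current value can stop at the first entry it finds for that row and column.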
7. Flushes and Compaction
● Flushing/compaction per Region
– One thread (CompactSplitThread) per region server
● Minor compaction
– Merges two or more HFiles into one
● Major compaction
– Picks up all HFiles in the region, merges them and removes deleted key/values
● Regions are split when they grow too large
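The difference between the two compaction types can be sketched as a merge of sorted files. This is a simplified illustration, not HBase's implementation: entries are plain `(row, column, timestamp, value)` tuples, and a `None` value is a hypothetical delete marker that masks all versions of the same cell at or below its timestamp. Real HBase has several delete marker types and also applies TTL and version limits.

```python
import heapq

TOMBSTONE = None  # hypothetical delete marker used only in this sketch


def _order(entry):
    row, column, timestamp, _value = entry
    return (row, column, -timestamp)


def minor_compact(hfiles):
    """Merge already-sorted files into one sorted file, keeping every entry."""
    return list(heapq.merge(*hfiles, key=_order))


def major_compact(hfiles):
    """Merge all files and drop delete markers plus the cells they mask."""
    result = []
    deleted_at = {}  # (row, column) -> timestamp of the newest delete marker
    for row, column, timestamp, value in heapq.merge(*hfiles, key=_order):
        cell = (row, column)
        if value is TOMBSTONE:
            # Timestamps descend per cell, so the first marker seen is the newest.
            deleted_at.setdefault(cell, timestamp)
            continue
        if cell in deleted_at and timestamp <= deleted_at[cell]:
            continue  # masked by a delete marker
        result.append((row, column, timestamp, value))
    return result
```

In this sketch the minor compaction only reduces the number of files, while the major compaction is the step that actually reclaims space held by deleted data.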
8. System Architecture
[Figure: system architecture. Components shown: HBase API, Master, RegionServer (HFile, Memstore, Write-Ahead Log), HDFS, ZooKeeper.]
[HBase: The Definitive Guide]
9. Key Design & Distribution
● Bad idea: continuous number or timestamp (sequential row keys)
– RegionServer hot-spotting
● Better: use hash function and/or composite key
– Distribute keys over random regions
– Uniform reads/writes across key space
● Proper key design is essential
– E.g. reversed URL (Bigtable paper)
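Both key-design ideas above can be sketched in a few lines. The function names, the bucket count, and the `NN-` prefix format are assumptions made for illustration; the reversed-host form follows the example in the Bigtable paper.

```python
import hashlib


def salted_key(row_key: str, buckets: int = 16) -> str:
    """Prefix a sequential key with a hash-derived bucket to spread writes."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{row_key}"


def reversed_domain_key(host: str, path: str = "") -> str:
    """Reverse the host name so pages of one domain sort next to each other."""
    return ".".join(reversed(host.split("."))) + path
```

The two approaches pull in different directions: the hash prefix scatters neighbouring keys across regions, which helps write throughput but makes range scans over the original key order harder, while the reversed host keeps related rows contiguous for scans.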
11. Useful Tools
● hbck – checks and fixes table integrity and region consistency
● HFile – examine contents of HFile
● HLog – examine contents of HLog file
● OfflineMetaRepair – rebuild meta table from file system
● HBase web interfaces
– Master
– RegionServer
12. Monitoring Tools
● Ganglia
● Nagios
● OpenTSDB
● …
All tools use metrics provided through JMX
13. Manual Splitting
● Via master web interface
– Split
● HBase shell split command
● RegionSplitter
– Create table with pre-split regions
– Rolling split of all regions on existing table
– ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter
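To illustrate what pre-splitting means, the sketch below computes evenly spaced split keys. It assumes row keys are fixed-width lowercase hex strings that are spread uniformly (for example hashed keys); the function name and parameters are hypothetical, and it does not claim to reproduce the algorithms RegionSplitter ships with.

```python
def uniform_hex_splits(num_regions: int, key_width: int = 8) -> list:
    """Return num_regions - 1 split keys evenly spaced over a hex key space."""
    if num_regions < 2:
        return []
    space = 16 ** key_width
    return [
        format(i * space // num_regions, f"0{key_width}x")
        for i in range(1, num_regions)
    ]
```

A table created with these boundaries starts with `num_regions` regions instead of one, so initial load is spread over the cluster from the first write.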
14. Disable Automatic Splitting
● Automatic splitting is determined by hbase.hregion.max.filesize
● Set it to a large value (max. 100 GB) so that automatic splits are effectively disabled
● OK, but:
– How do I monitor my region growth?
– Where do I split when I have irregular data growth?
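One way to approach the first question is to compare region sizes against an operator-chosen manual split size. The sketch below assumes the sizes have already been collected (for example from the metrics the region servers expose); `manual_split_bytes`, `warn_ratio`, and the function name are hypothetical.

```python
GIB = 1024 ** 3


def regions_to_watch(region_sizes, manual_split_bytes, warn_ratio=0.8):
    """Return names of regions at or above warn_ratio of the manual split size.

    region_sizes maps region name -> store file size in bytes.
    The result is ordered largest region first.
    """
    threshold = warn_ratio * manual_split_bytes
    flagged = [
        (size, name) for name, size in region_sizes.items() if size >= threshold
    ]
    return [name for _size, name in sorted(flagged, reverse=True)]
```

For example, with a manual split size of 10 GiB, regions of 9 GiB and 12 GiB would be flagged while a 1 GiB region would not. The second question, where to split a region that grows irregularly, is not answered by size alone.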
16. Hannibal
● Open-source project on GitHub
– https://github.com/sentric/hannibal
● Web based
● Implemented in Scala
● Compatible with HBase 0.90
● Support for versions > 0.92 to be added soon
● Check it out!
17. How well are regions balanced over the cluster?
18. How well are the regions split for the table?
20. Future Plans
● HBase 0.92 client API changes allow querying the compaction state of regions through HBaseAdmin → differentiate major from minor compactions
● Add tool to find best region-key for irregular data growth
● Expose metrics through JMX
22. Challenges
● Everyone is still learning
● Some issues only appear at scale
– At scale, nothing works as advertised
● Production cluster configuration
– Hardware issues
– Tuning cluster configuration to our workloads
● HBase stability
● Monitoring health of HBase
23. Lessons Learned
● Schema & key design
– What’s queried together should be stored together
● Monitoring/Operational tooling is most important
● Forget “emergency actions”; everything takes some time
● You need DevOps in production
● Steep learning curve: you need to know the whole ecosystem
– Hadoop, HDFS, MapReduce, ZooKeeper
24. Resources to get started
● https://github.com/sentric/hannibal
● http://hbase.apache.org/book.html
● https://github.com/jmhsieh/hbase-repair-scripts
● http://www.sentric.ch/blog/best-practice-why-monitoring-hbase-is-important
● HBase: The Definitive Guide