This document provides an overview of HBase architecture and advanced usage topics. It discusses course credit requirements, then HBase architecture components such as storage, the write path, the read path, files, region splits, and more. It also covers advanced topics including secondary indexes, search integration, transactions, and Bloom filters. The document emphasizes that HBase uses log-structured merge trees for efficient data handling, operating at the disk transfer level rather than the disk seek level for performance, and it details the classes involved in write-ahead logging.
3. COURSE CREDIT
• Attendance: 30 points
• Asking questions: 5 points each
• Quiz: 40 points; please see TCExam
• 70 points are required to pass this course
• Course credit is calculated once for each finished course
• The course credit will be sent to you and your supervisor by mail
4. ARCHITECTURE
• Seek V.S. Transfer
• Storage
• Write Path
• Files
• Region Splits
• Compactions
• HFile Format
• KeyValue Format
• Write-Ahead Log
• Read Path
• Regions Lookup
• Region Life Cycle
• Replication
5. SEEK V.S. TRANSFER
• HBase uses the Log-Structured Merge Tree (LSM-Tree) data structure as its underlying store file mechanism
• Derived from B+ Trees
• Easy to handle data with an optimized layout
• WAL Log
• MemStore
• Operates at the disk transfer level
• B+ Trees
• Many RDBMSs use B+ Trees
• Use an OPTIMIZATION process periodically
• Operate at the disk seek level
6. SEEK V.S. TRANSFER
• Disk Transfer
• Moving data between the disk surface and the host system
• CPU, RAM, and disk size double every 18–24 months
• Disk Seek
• Measures the time it takes the head assembly on the actuator arm to travel to the track of the disk where the data will be read or written
• Seek time remains nearly constant, improving only around 5% per year
• Conclusion
• At scale, seek is inefficient compared to transfer
https://www.research.ibm.com/haifa/Workshops/ir2005/papers/DougCutting-Haifa05.pdf
10. WRITE PATH
1. A write request arrives at a region server
2. The edit is written to the WAL log
3. The edit is written to the corresponding MemStore after the WAL log is persisted
4. A new HFile is flushed if the MemStore size reaches the threshold
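From the client side, the whole path is hidden behind a single put call; a minimal sketch of a write that travels steps 1–4 above (table and column names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"),
        Bytes.toBytes("val1"));
    // The region server appends the edit to the WAL first, then applies
    // it to the MemStore; a later flush writes a new HFile.
    table.put(put);
    table.close();
  }
}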
11. FILES
• Root-Level files
• Table-Level files
• Region-Level files
• A txt file for reference
12. REGION SPLITS
• Splits one region into two half-size regions
• Triggered when
• hbase.hregion.max.filesize is reached; the default is 256MB
• the HBase shell split command or HBaseAdmin.split(…) is called (see the sketch below)
• The region server then takes the following steps…
• Create a folder called “split” under the parent region folder
• Close the parent region, so it can no longer serve any requests
• Prepare the two new daughter regions (with multiple threads) inside the split folder, including…
• the region folder structure, reference HFiles, etc.
• Move the two daughter regions into the table folder once the above steps are complete
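A minimal sketch of requesting a split manually through the client API (0.92-era API; the table name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Ask the servers to split the regions of "testtable"; without an
    // explicit split point, the region midpoint is used.
    admin.split("testtable");
  }
}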
13. REGION SPLITS
• Here is an example of how this looks in the .META. table
row: testtable,row-500,1309812163930.d9ffc3a5cd016ae58e23d7a6cb937949.
column=info:regioninfo, timestamp=1309872211559, value=REGION => {NAME => 'testtable,row-500,1309812163930.d9ffc3a5cd016ae58e23d7a6cb937949. TableName => 'testtable', STARTKEY => 'row-500', ENDKEY => 'row-700', ENCODED => d9ffc3a5cd016ae58e23d7a6cb937949, OFFLINE => true, SPLIT => true,}
column=info:splitA, timestamp=1309872211559, value=REGION => {NAME => 'testtable,row-500,1309872211320.d5a127167c6e2dc5106f066cc84506f8. TableName => 'testtable', STARTKEY => 'row-500', ENDKEY => 'row-550', ENCODED => d5a127167c6e2dc5106f066cc84506f8,}
column=info:splitB, timestamp=1309872211559, value=REGION => {NAME => 'testtable,row-550,1309872211320.de27e14ffc1f3fff65ce424fcf14ae42. TableName => [B@62892cc5', STARTKEY => 'row-550', ENDKEY => 'row-700', ENCODED => de27e14ffc1f3fff65ce424fcf14ae42,}
14. REGION SPLITS
• The name of the reference file is another random number, with the hash of the referenced region as a postfix
/hbase/testtable/d5a127167c6e2dc5106f066cc84506f8/colfam1/6630747383202842155.d9ffc3a5cd016ae58e23d7a6cb937949
15. COMPACTIONS
• The store files are monitored by a background
thread
• The flushes of memstores slowly build up an
increasing number of on-disk files
• The compaction process combines them into a few larger files
• This goes on until
• The largest of these files exceeds the configured maximum store file size and triggers a region split
• Types
• Minor
• Major
16. COMPACTIONS
• A compaction check is triggered when…
• A memstore has been flushed to disk
• The compact or major_compact shell commands/API calls are issued
• A background thread fires
• Called CompactionChecker
• Each region server runs a single instance
• Its check interval is hbase.server.thread.wakefrequency × hbase.server.thread.wakefrequency.multiplier (default set to 1000)
• This runs it less often than the other thread-based tasks
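As a worked example, assuming the default hbase.server.thread.wakefrequency of 10 seconds: 10 s × 1000 = 10,000 s, so the CompactionChecker wakes up roughly every 2.8 hours.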
17. COMPACTIONS - MINOR
• Rewrites the last few files into one larger one
• The minimum number of files is set with the hbase.hstore.compaction.min property
• Default is 3
• Must be at least 2
• A number too large…
• Would delay minor compactions
• Would also require more resources and take longer
• The maximum number of files is set with the hbase.hstore.compaction.max property
• Default is 10
18. COMPACTIONS - MINOR
• All files under the size limit are included, up to the total number of files per compaction allowed
• hbase.hstore.compaction.min.size property
• Any file larger than the maximum compaction size is always excluded
• hbase.hstore.compaction.max.size property
• Default is Long.MAX_VALUE
19. COMPACTIONS - MAJOR
• Compacts all files into a single file
• Also drops KeyValues matching delete predicates, such as those covered by
• A Delete action
• The version limit
• The TTL
• Triggered when…
• The major_compact shell command/majorCompact() API call is issued
• The hbase.hregion.majorcompaction interval elapses
• Default is 24 hours
• hbase.hregion.majorcompaction.jitter property
• Default is 0.2
• Without the jitter, all stores would run a major compaction at the same time, every 24 hours
• Minor compactions might be promoted to major compactions
• This happens when a minor compaction ends up including all store files, since it only excludes store files larger than the configured maximum compaction size
20. HFILE FORMAT
• The actual storage files are implemented by the HFile class
• They store HBase’s data efficiently
• Blocks
• Fixed size: Trailer and File Info blocks
• All other block types are variable size
21. HFILE FORMAT
• The default block size is 64KB
• Some recommendations from the API docs
• A block size between 8KB and 1MB for general usage
• A larger block size is preferred for sequential access use cases
• A smaller block size is preferred for random access use cases
• But it requires more memory to hold the block index
• And may be slower to create (leads to more filesystem I/O flushes)
• The smallest possible block size is around 20KB–30KB
• Each block contains
• A magic header
• A number of serialized KeyValue instances
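The block size is configured per column family; a minimal sketch using HColumnDescriptor (table and family names are illustrative):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

HTableDescriptor desc = new HTableDescriptor("testtable");
HColumnDescriptor colfam = new HColumnDescriptor("colfam1");
// Favor sequential scans with a block size above the 64KB default;
// a random-read-heavy table would go smaller instead.
colfam.setBlocksize(128 * 1024);
desc.addFamily(colfam);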
22. HFILE FORMAT
• Each block is about as large as the configured block size
• In practice, it is not an exact science
• If you store a KeyValue that is larger than the block size, the writer has to accept it
• Even with smaller values, the block size check is done after the last value was written
• So the majority of blocks will be slightly larger
• When using a compression algorithm
• You will not have much control over block size
• The final store file contains the same number of blocks, but the total size will be smaller since each block is smaller
23. HFILE FORMAT –
HFILE BLOCK SIZE V.S. HDFS BLOCK SIZE
• The default HDFS block size is 64MB
• Which is 1,024 times the HFile default block size (64KB)
• HBase stores its files transparently in a filesystem
• There is no correlation between these two block types
• It is just a coincidence
• HDFS also does not know what HBase stores
24. HFILE FORMAT –
HFILE CLASS
• Access an HFile directly
• hadoop fs -cat <hfile>
• hbase org.apache.hadoop.hbase.io.hfile.HFile -f <hfile> -m -v -p
• Shows the actual data stored as serialized KeyValue instances
• HFile.Reader properties and the trailer block details
• File info block values
25. KEYVALUE FORMAT
• Each KeyValue in the HFile is a low-level byte array
• Fixed-length numbers
• Key length
• Value length
• If you deal with small values
• Try to keep the key small
• Choose a short row key and short column keys
• e.g. a family name of a single byte, and an equally short qualifier
• Compression should help mitigate the problem of overwhelming key sizes
• The sorting of all KeyValues in the store file helps to keep similar keys close together
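For illustration, a put that follows this advice, using a one-byte family name and an equally short qualifier (all names are made up):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// "d" as the family and "q" as the qualifier keep the key portion
// repeated in every stored KeyValue as small as possible.
Put put = new Put(Bytes.toBytes("u1"));
put.add(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes("value"));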
26. WRITE-AHEAD LOG
• Region servers keep data in memory until enough is collected to warrant a flush to disk
• This avoids the creation of too many very small files
• But while the data resides in memory it is volatile, not persistent
• Write-Ahead Logging
• A common approach to solving this issue, used in most RDBMSs as well
• Each update (edit) is written to a log first, then to the real persistent data store
• The server then has the liberty to batch or aggregate the data in memory as needed
27. WRITE-AHEAD LOG
• The WAL is the lifeline that is needed when disaster strikes
• The WAL records all changes to the data
• If the server crashes
• The WAL can effectively be replayed to bring everything up to where the server should have been just before the crash
• If writing the record to the WAL fails
• The whole operation must be considered a failure
• The actual WAL resides on HDFS
• HDFS is a replicated filesystem
• So any other server can open the log and start replaying the edits
30. WRITE-AHEAD LOG –
OTHER CLASSES
• LogSyncer Class
• HTableDescriptor.setDeferredLogFlush(boolean)
• Default is false
• Every update to the WAL is synced to the filesystem immediately
• Set to true
• A background process syncs instead
• hbase.regionserver.optionallogflushinterval property
• Default is 1 second
• There is a chance of data loss in case of a server failure
• Only applies to user tables, not catalog tables (-ROOT-, .META.)
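A minimal sketch of enabling deferred log flush on a table descriptor (the table name is illustrative):

import org.apache.hadoop.hbase.HTableDescriptor;

HTableDescriptor desc = new HTableDescriptor("testtable");
// Trade a small data-loss window for write throughput: WAL edits are
// synced by the background LogSyncer thread instead of on every update.
desc.setDeferredLogFlush(true);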
31. WRITE-AHEAD LOG –
OTHER CLASSES
• LogRoller Class
• Takes care of rolling logfiles at certain intervals
• hbase.regionserver.logroll.period property
• Default is 1 hour
• Other parameters
• hbase.regionserver.hlog.blocksize property
• Default is 32MB
• hbase.regionserver.logroll.multiplier property
• Default is 0.95
• Rotate logs when they are at 95% of the block size
• So logs are rotated when
• A certain amount of time has passed, or
• They are considered full
• Whichever comes first
33. WRITE-AHEAD LOG –
DURABILITY
• WAL Log
• Sync it for every edit
• You can set the log flush times as low as you want
• Durability is still dependent on the underlying filesystem
• Especially on HDFS
• Use Hadoop 0.21.0 or later
• Or a special 0.20.x release with the append-support patches
• I used 0.20.203 before
• Otherwise, you can very well face data loss!
34. READ PATH
(Diagram: the read path merges data from the MemStore and the store files; some store files, such as ColFam2’s, are skipped entirely due to the timestamp and Bloom filter exclusion process.)
35. REGION LOOKUPS
• Catalog Tables
• -ROOT-
• Refers to all regions in the .META. table
• .META.
• Refers to all regions in all user tables
• A three-level, B+ tree-like lookup scheme
1. A node stored in ZooKeeper contains the location of the root table’s region
2. Lookup of the matching meta region in the -ROOT- table
3. Retrieval of the user table region from the .META. table
38. ZOOKEEPER
• ZooKeeper serves as HBase’s distributed coordination service
• Use the HBase shell
• hbase zkcli
• /hbase/hbaseid - Cluster ID, as stored in the hbase.id file on HDFS
• /hbase/master - Holds the master server name
• /hbase/replication - Contains replication details
• /hbase/root-region-server - Server name of the region server hosting the -ROOT- region
39. ZOOKEEPER
• /hbase/rs - The root node for all region servers to list themselves when they start; used to track server failures
• /hbase/shutdown - Used to track the cluster state; contains the time when the cluster was started, and is empty when it was shut down
• /hbase/splitlog - All log-splitting-related coordination; states include unassigned, owned, and RESCAN
• /hbase/table - Disabled tables are added to this znode
• /hbase/unassigned - Used by the AssignmentManager to track region states across the entire cluster; contains znodes for those regions that are not open, but are in a transitional state
40. REPLICATION
• A way to copy data between HBase deployments
• It can serve as
• A disaster recovery solution
• A way to provide higher availability at the HBase layer
• (HBase cluster) Master-push
• One master cluster can replicate to any number of slave clusters; each region server participates by replicating its own stream of edits
• Eventual consistency
43. KEY DESIGN
• Two fundamental key structures
• Row Key
• Column Key
• A column family name + a column qualifier
• Use these keys
• to solve commonly found problems when designing storage
solutions
• Logical V.S. Physical layout
46. KEY DESIGN –
TALL-NARROW V.S. FLAT-WIDE TABLES
• Tall-narrow table layout
• A table with few columns but many rows
• Flat-wide table layout
• Has fewer rows but many columns
• The tall-narrow table layout is recommended
• Because under a flat-wide design, a single row could outgrow the maximum file/region size and work against the region split facility
47. KEY DESIGN –
TALL-NARROW V.S. FLAT-WIDE TABLES
• An email system as an example
• Flat-wide layout
<userId> : <colfam> : <messageId> : <timestamp> : <email-message>
12345 : data : 5fc38314-e290-ae5da5fc375d : 1307097848 : "Hi Lars, ..."
12345 : data : 725aae5f-d72e-f90f3f070419 : 1307099848 : "Welcome, and ..."
12345 : data : cc6775b3-f249-c6dd2b1a7467 : 1307101848 : "To Whom It ..."
12345 : data : dcbee495-6d5e-6ed48124632c : 1307103848 : "Hi, how are ..."
• Tall-narrow
<userId>-<messageId> : <colfam> : <qualifier> : <timestamp> : <email-message>
12345-5fc38314-e290-ae5da5fc375d : data : : 1307097848 : "Hi Lars, ..."
12345-725aae5f-d72e-f90f3f070419 : data : : 1307099848 : "Welcome, and ..."
12345-cc6775b3-f249-c6dd2b1a7467 : data : : 1307101848 : "To Whom It ..."
12345-dcbee495-6d5e-6ed48124632c : data : : 1307103848 : "Hi, how are ..."
Empty qualifier!!
48. PARTIAL KEY SCANS
• Make sure to pad the value of each field in a composite row key, to ensure the sorting order you expect
49. PARTIAL KEY SCANS
• Set startRow and stopRow
• Set startRow with the exact user ID
• Scan.setStartRow(…)
• Set stopRow with the user ID + 1
• Scan.setStopRow(…)
• Control the sorting order
• e.g. store Long.MAX_VALUE - <date-as-long> to sort the newest entries first
String s = "Hello,";
for (int i = 0; i < s.length(); i++) {
  // invert each character's bits to reverse the sort order
  System.out.print(Integer.toString(s.charAt(i) ^ 0xFF, 16) + " ");
}
// prints: b7 9a 93 93 90 d3
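A minimal sketch of such a partial key scan over the tall-narrow email table shown earlier (the table name and the stop-row convention are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "emails");
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("12345-")); // exact user ID
scan.setStopRow(Bytes.toBytes("12346"));   // user ID + 1, exclusive
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
  // each row is one email of user 12345
}
scanner.close();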
50. PAGINATION
• Use Filters
• PageFilter
• ColumnPaginationFilter
• Steps
1. Open a scanner at the start row
2. Skip offset rows
3. Read the next limit rows and return them to the caller
4. Close the scanner
• Use case
• A web-based email client
• Read emails 1–50 first, then 51–100, etc.
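A minimal sketch of these steps using PageFilter (the table name and start-row handling are assumptions; note the filter is applied per region server, so a client may see slightly more rows than the page size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "emails");
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("12345-")); // step 1: open at the start row
scan.setFilter(new PageFilter(50));        // step 3: limit the page size
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
  // steps 2-3: skip offset rows, return up to 50 rows to the caller
}
scanner.close();                           // step 4: close the scanner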
51. TIME SERIES DATA
• Dealing with stream processing of events
• Most common use case is time series data
• Data could be coming from
• A sensor in a power grid
• A stock exchange
• A monitoring system for computer systems
• The row key represents the event time
• The sequential, monotonically increasing nature of time series data
• Causes all incoming data to be written to the same region
• Hot spot issue
52. TIME SERIES DATA
• Overcome this problem
• By prefixing the row key with a nonsequential prefix
• Common choices
• Salting
• Field swap/promotion
• Randomization
53. TIME SERIES DATA - SALTING
• Use a salting prefix to the key that guarantees a
spread of all rows across all region servers
byte prefix = (byte) (Long.hashCode(timestamp) %
<number of regionservers>);
byte[] rowkey = Bytes.add(Bytes.toBytes(prefix),
Bytes.toBytes(timestamp));
• Which results in row keys such as
0myrowkey-1
0myrowkey-4
1myrowkey-2
1myrowkey-5
...
54. TIME SERIES DATA - SALTING
• Access to a range of rows must be fanned out
• Read with <number of region servers> get or scan calls
• Is this good or not?
• You can use multiple threads to read this data from the distinct servers
• This needs further study of the access pattern, plus trial runs
55. TIME SERIES DATA –
SALTING USECASE
• An open source crash reporter named Socorro from the Mozilla organization
• For Firefox and Thunderbird
• Reports are subsequently read and analyzed by the Mozilla development team
• Technologies
• Python-based client code
• Communicates with the HBase cluster using Thrift
Mozilla wiki for Socorro - https://wiki.mozilla.org/Socorro
56. TIME SERIES DATA –
SALTING USECASE
• How the client merges the previously salted, sequential keys when doing a scan operation
for salt in '0123456789abcdef':
salted_prefix = "%s%s" % (salt,prefix)
scanner = self.client.scannerOpenWithPrefix(table, salted_prefix, columns)
iterators.append(salted_scanner_iterable(self.logger,self.client,
self._make_row_nice,salted_prefix,scanner))
57. TIME SERIES DATA –
FIELD SWAP/PROMOTION
• Uses the composite row key concept
• Move the timestamp to a secondary position in the row key
• If you already have a row key with more than one field
• Swap them
• If you have only the timestamp as the current row key
• Promote another field from the column keys into the row key
• Or even promote the value
• Caveat: you can then only access data, especially time ranges, for a given swapped or promoted field
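For illustration, building a composite row key that promotes a metric ID ahead of the timestamp, in the spirit of the OpenTSDB layout on the next slides (the field types and values are assumptions):

import org.apache.hadoop.hbase.util.Bytes;

int metricId = 1;                  // hypothetical fixed-width metric ID
long baseTimestamp = 1307097848L;  // hypothetical base timestamp
// <metric-id><base-timestamp>: the metric leads, so all rows of one
// metric are contiguous and time ranges within it remain scannable.
byte[] rowkey = Bytes.add(Bytes.toBytes(metricId),
    Bytes.toBytes(baseTimestamp));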
58. TIME SERIES DATA –
FIELD SWAP/PROMOTION USECASE
• OpenTSDB
• A time series database
• Store metrics about servers and
services, gathered by external
collection agents
• All of the data is stored in HBase
• Its UI enables users to query various metrics, combining and/or downsampling them, all in real time
• The schema promotes the
metric ID into the row key
• <metric-id><base-timestamp>...
http://opentsdb.net/
59. TIME SERIES DATA –
FIELD SWAP/PROMOTION USECASE
• Example
(Figure: OpenTSDB Schema - http://opentsdb.net/schema.html)
61. TIME-ORDERED RELATIONS
• You can also store related, time-ordered data
• By using the columns of a table
• Since all of the columns are sorted per column family
• You can treat this sorting as a replacement for a secondary index
• For a small number of indexes, you can create a column family for them
• For a large number of indexes, consider the secondary-index approaches later in this deck
• HBase currently (0.95) does not do well with anything above two or three column families
• Because flushing and compactions are done on a per-region basis
• This can create a lot of needless I/O load
http://hbase.apache.org/book/number.of.cfs.html
62. TIME-ORDERED RELATIONS – EXAMPLE
• Column name = <indexId> + “-” + <value>
• Column value
• The key in the data column family
• Or redundant values from the data column family, for performance
• Denormalization
… //data
12345 : data : 5fc38314-e290-ae5da5fc375d : 1307097848 : "Hi Lars, ..."
12345 : data : 725aae5f-d72e-f90f3f070419 : 1307099848 : "Welcome, and ..."
12345 : data : cc6775b3-f249-c6dd2b1a7467 : 1307101848 : "To Whom It ..."
12345 : data : dcbee495-6d5e-6ed48124632c : 1307103848 : "Hi, how are ..."
... //ascending index for from email address
12345 : index : idx-from-asc-mary@foobar.com : 1307099848 : 725aae5f-d72e...
12345 : index : idx-from-asc-paul@foobar.com : 1307103848 : dcbee495-6d5e...
12345 : index : idx-from-asc-pete@foobar.com : 1307097848 : 5fc38314-e290...
12345 : index : idx-from-asc-sales@ignore.me : 1307101848 : cc6775b3-f249...
... // descending index for email subjects
12345 : index : idx-subject-desc-\xa8\x90\x8d\x93\x9b\xde : 1307103848 : dcbee495-6d5e-6ed48124632c
12345 : index : idx-subject-desc-\xb7\x9a\x93\x93\x90\xd3 : 1307099848 : 725aae5f-d72e-f90f3f070419
63. SECONDARY INDEXES
• HBase has no native support for secondary indexes
• But there are use cases that need them
• Look up a cell not just by the primary coordinates
• The row key, column family name, and qualifier
• But also by an alternative coordinate
• Scan a range of rows from the main table, but ordered by the
secondary index
• Secondary indexes store a mapping between the
new coordinates and the existing ones
64. SECONDARY INDEXES -
CLIENT-MANAGED
• Moves the responsibility into the application layer
• Combines a data table and one (or more) lookup/mapping tables
• Writing data
• Write into the data table, and also update the lookup tables
• Reading data
• Either a direct lookup in the main table
• Or a lookup in a secondary index table, then retrieving the data from the main table
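A minimal sketch of the write side, keeping a hypothetical lookup table in step with the data table (the table names, index layout, and "ref" family are all assumptions; the write ordering matches the next slide):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable data = new HTable(conf, "emails");
HTable index = new HTable(conf, "emails-by-from");

// Index row: the alternative coordinate; the value points back to the
// main-table row key. Written first, per the atomicity discussion below.
Put idx = new Put(Bytes.toBytes("pete@foobar.com"));
idx.add(Bytes.toBytes("ref"), Bytes.toBytes("key"),
    Bytes.toBytes("12345-5fc38314-e290-ae5da5fc375d"));
index.put(idx);

// Main row, written at the end of the operation.
Put main = new Put(Bytes.toBytes("12345-5fc38314-e290-ae5da5fc375d"));
main.add(Bytes.toBytes("data"), Bytes.toBytes(""),
    Bytes.toBytes("Hi Lars, ..."));
data.put(main);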
65. SECONDARY INDEXES -
CLIENT-MANAGED
• Atomicity
• There is no cross-row atomicity
• Write to the secondary index tables first, then write to the data table at the end of the operation
• Use asynchronous, regular pruning jobs to clean up stale index entries
• The drawback is that it is hardcoded in your application
• It needs to evolve with overall schema changes and new requirements
66. SECONDARY INDEXES -
INDEXED-TRANSACTIONAL HBASE
• Indexed-Transactional HBase (ITHBase) project
• It extends HBase by adding special implementations of the
client and server-side classes
• Extension
• The core extension is the addition of transactions
• It guarantees that all secondary index updates are consistent
• Most client and server classes are replaced by ones that
handle indexing support
• Drawbacks
• May not support the latest version of HBase available
• Adds a considerable amount of synchronization overhead
that results in decreased performance
https://github.com/hbase-trx/hbase-transactional-tableindexed
67. SECONDARY INDEXES -
INDEXED HBASE
• Indexed HBase (IHBase)
• Forfeits the use of separate tables for each index; maintains them purely in memory
• This approach is much faster than the previous one
• Index details
• The indexes are generated when
• A region is opened for the first time
• A memstore is flushed to disk
• The index is never out of sync, and no explicit transactional control is necessary
• Drawbacks
• It is quite intrusive: it requires an additional JAR and a config file
• It needs extra resources: it trades extra memory use for reduced I/O
• It may not be available for the latest version of HBase
• It may not be available for the latest version of HBase
https://github.com/ykulbak/ihbase
68. SECONDARY INDEXES -
COPROCESSOR
• Implement an indexing solution based on
coprocessors
• Using the server-side hooks, e.g. RegionObserver
• Use a coprocessor to load the indexing layer for every region, which would subsequently handle the maintenance of the indexes
• Use the scanner hooks to transparently iterate over a normal data table, or an index-backed view of the same
• Currently in development
• JIRA ticket
• https://issues.apache.org/jira/browse/HBASE-2038
69. SEARCH INTEGRATION
• Using indexes
• You are still confined to the available, user-predefined keys
• Search-based lookup
• Uses the arbitrary nature of keys
• Often backed by full search engine integration
• The following are a few possible approaches
70. SEARCH INTEGRATION -
CLIENT-MANAGED
• Example: Facebook inbox search
• The schema is built roughly like this
• Every row is a single inbox; that is, every user has a single row in the search table
• The columns are the terms indexed from the
messages
• The versions are the message IDs
• The values contain additional information, such as
the position of the term in the document
<inbox>:<COL_FAM_1>:<term>:<messageId>:<additionalInfo>
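For illustration, one way such a term entry could be written, using the message ID as the cell version (the numeric message ID, row key, family, and value layout are all assumptions):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

long messageId = 5823L; // hypothetical numeric message ID, used as the version
Put put = new Put(Bytes.toBytes("inbox-12345"));
// row = inbox, column = indexed term, version = message ID,
// value = additional information such as the term position
put.add(Bytes.toBytes("terms"), Bytes.toBytes("hbase"), messageId,
    Bytes.toBytes("pos:3"));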
71. SEARCH INTEGRATION -
LUCENE
• Apache Lucene
• Lucene Core
• Provides Java-based indexing and search technology
• Solr
• High performance search server built using Lucene Core
• Steps
1. HBase only stores the data
2. The BuildTableIndex class scans an entire data table and builds the Lucene indexes
3. They end up as directories/files on HDFS
4. These indexes can be downloaded to a Lucene-based server for local use
5. A search performed via Lucene returns row keys, used for subsequent random lookups into the data table for the specific values
72. SEARCH INTEGRATION -
COPROCESSORS
• Currently in development
• Similar to the use of Coprocessors to build
secondary indexes
• Complement a data table with Lucene-based
search functionality
• Ticket in JIRA
• https://issues.apache.org/jira/browse/HBASE-3529
73. TRANSACTION
• Transactions are an immature aspect of HBase
• A consequence of the trade-offs the CAP theorem imposes
• Here are two possible solutions
• Transactional HBase
• Comes with the aforementioned ITHBase
• ZooKeeper
• Comes with a lock recipe that can be used to implement a two-phase commit protocol
• http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_twoPhasedCommit
74. BLOOM FILTERS
• Problem
• Cell count
• A 1GB store file with a 64KB block size holds 16,384 blocks
• A 1GB store file with 200-byte cells holds roughly 5 million cells
• But the block index only indexes the start row key of each block
• Store files
• There can be a number of store files within one column family, all of which may need checking
• Bloom filters allow you to improve lookup times; since they add overhead in terms of storage and memory, they are turned off by default
76. BLOOM FILTERS –
DO WE NEED IT ?
• If possible, you should try to use the row-level Bloom filter
• It provides a good balance between the additional space requirements and the gain in performance
• Only resort to the more costly row+column Bloom filter
• When you would gain no advantage from using the row-level one
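A minimal sketch of enabling the row-level Bloom filter on a column family (0.92-era API; the family name is illustrative):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.regionserver.StoreFile;

HColumnDescriptor colfam = new HColumnDescriptor("colfam1");
// ROW answers "can this store file contain this row at all?" per lookup;
// ROWCOL also covers column qualifiers, at a higher space cost.
colfam.setBloomFilterType(StoreFile.BloomType.ROW);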