1. HBase Consistency and Performance Improvements
June 13, 2012
Esteban Gutierrez, Gregory Chanan
{esteban, gchanan}@cloudera.com
2. HBase Consistency
• ACID guarantees within a single row
• “Any row returned by the scan will be a
consistent view (i.e. that version of the
complete row existed at some point in
time)”[1]
[1] http://hbase.apache.org/acid-semantics.html
©2012 Cloudera, Inc. All Rights Reserved.
3. HBase Consistency Issues
• Write Consistency Issues
• Read Consistency Issues
5. Write Consistency
HBASE-4552
• Importing HFiles for multiple CFs was not an atomic operation
6. Write Consistency
HBASE-4552
HRegion.bulkLoadHFile() (< HBase 0.90.5)
Row 1
         HFile1:    HFile2:    HFile3:    HFile4:
         fam1:col1  fam2:col2  fam3:col3  fam4:col4
T1 Scan  val1
T2 Scan  val1       val2
T3 Scan  val1       val2       val3
T4 Scan  val1       val2       val3       val4
9. Write Consistency
HBASE-4552
HRegion.bulkLoadHFiles() (≥ HBase 0.90.5)
    public void bulkLoadHFiles(List<Pair<byte[], String>> familyPaths) {
      ...
      startRegionOperation();        // ← lock.writeLock().lock()
      try {
        ...
      } finally {
        closeBulkRegionOperation();  // ← lock.writeLock().unlock()
      }
      ...
    }
Row 1
         HFile1:    HFile2:    HFile3:    HFile4:
         fam1:col1  fam2:col2  fam3:col3  fam4:col4
T1 Scan
T2 Scan
T3 Scan
T4 Scan  val1       val2       val3       val4
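The effect of taking the write lock around the whole multi-family load can be sketched outside HBase. The following is a minimal illustration only, assuming a hypothetical `ToyRegion` class (not HBase's `HRegion`) that holds one row as a map from column family to value. Loads take the write lock once for all families and scans take the read lock, so a scan observes either none or all of the loaded families.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a region; not HBase's HRegion.
class ToyRegion {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // column family -> value for a single row, sorted for stable output
    private final Map<String, String> row = new TreeMap<>();

    // Loads every family under one write lock, so scans cannot interleave.
    void bulkLoad(Map<String, String> familyValues) {
        lock.writeLock().lock();
        try {
            for (Map.Entry<String, String> e : familyValues.entrySet()) {
                row.put(e.getKey(), e.getValue());
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Returns a snapshot of the row taken under the read lock.
    Map<String, String> scan() {
        lock.readLock().lock();
        try {
            return new TreeMap<>(row);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The trade-off is that scans wait while a load holds the write lock, which matches the empty T1 to T3 scans in the diagram.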
10. Read Consistency
HBASE-2856
• Seen only twice in the wild
• Hard to detect if application monitoring is not implemented
11. Read Consistency
HBASE-2856
• Table size ≈ 50 M records
• Large number of CFs
• New records are continuously added to
the table
• Concurrent MR Jobs on the same table
• Cluster has to meet strict SLAs
16. Read Consistency
HBASE-2856
Symptoms
                      Run 1    Run 2    Run 3
… …                   …        …        …
SPLIT_RAW_FILES       …        …        …
Map-Reduce Framework
Map output records    500,000  499,997  500,001
Example rows, some with column families missing:
cf1:col1  cf2:col2  cf3:col3
cf1:col1
          cf2:col2  cf3:col3
cf1:col1
Scale testing shows between 0.5% and 2% inconsistent results between runs
18. Read Consistency
HBASE-2856
Impact
• Result is used to update user facing
records
• Customer is not happy
— “Where is my data?”
21. Read Consistency
HBASE-2856
Workarounds
• Re-try scan if not all CFs are present
• Re-submit job if any inconsistency is found
• Sometimes that is not possible: SLAs!
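The first workaround, re-reading a row until all expected column families are present, might look roughly like the sketch below. It is an illustration under stated assumptions: `RetryingRowReader` and `readComplete` are hypothetical names, and the `Supplier` stands in for whatever client call actually fetches the row.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical client-side helper; the Supplier stands in for a real row read.
class RetryingRowReader {
    // Re-reads the row until every expected column family is present.
    // Returns null once maxAttempts reads have all come back incomplete.
    static Map<String, String> readComplete(Supplier<Map<String, String>> readRow,
                                            Set<String> expectedFamilies,
                                            int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Map<String, String> row = readRow.get();
            if (row.keySet().containsAll(expectedFamilies)) {
                return row;
            }
        }
        return null; // caller decides: re-submit the job or fail
    }
}
```

Each retry adds latency, which is why this approach can conflict with strict SLAs.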
23. MVCC
• HBase maintains ACID semantics using
Multiversion Concurrency Control
• Instead of overwriting state, create a new
version of object with timestamp
Timestamp Row fam1:col1 fam2:col2
t2 row1 val2 val2
t1 row1 val1 val1
• Reads never have to block
• Note this timestamp is not externally visible!
Internally called “memStoreTs”
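The idea of versioned writes plus a reader snapshot can be sketched as follows. This is a single-threaded toy under stated assumptions, not HBase's internal implementation: `ToyMvccStore`, `writeNumber` and `readPoint` are illustrative names, with `writeNumber` playing the role of the internal "memStoreTs".

```java
import java.util.ArrayList;
import java.util.List;

// Toy MVCC store for one cell; names are illustrative, not HBase's classes.
class ToyMvccStore {
    private static final class Version {
        final long writeNumber; // plays the role of the internal "memStoreTs"
        final String value;
        Version(long writeNumber, String value) {
            this.writeNumber = writeNumber;
            this.value = value;
        }
    }

    private final List<Version> versions = new ArrayList<>();
    private long nextWriteNumber = 1;
    private long readPoint = 0; // highest write number visible to new readers

    // Appends a new version instead of overwriting; returns its write number.
    long put(String value) {
        long w = nextWriteNumber++;
        versions.add(new Version(w, value));
        return w;
    }

    // Makes versions up to writeNumber visible to readers that start later.
    void commit(long writeNumber) {
        readPoint = Math.max(readPoint, writeNumber);
    }

    // A reader captures the read point once, when it starts.
    long beginRead() {
        return readPoint;
    }

    // Returns the newest value whose write number is <= the reader's snapshot.
    String get(long snapshot) {
        String result = null;
        for (Version v : versions) {
            if (v.writeNumber <= snapshot) {
                result = v.value; // versions are appended in increasing order
            }
        }
        return result;
    }
}
```

A reader that captured its snapshot before a later `put` keeps seeing the older value, without waiting for the writer.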
24. HBase Write Path
1. Write to WAL (per RegionServer)
2. Write to In-Memory Sorted Map (MemStore)
(per Region+ColumnFamily)
3. Flush MemStore to disk as HFile when
MemStore hits configurable
hbase.hregion.memstore.flush.size
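The three steps above can be modeled in a few lines. This is a toy sketch, not HBase code: `ToyWritePath` is a hypothetical class, its size accounting is a rough character count, and `flushSizeBytes` merely stands in for `hbase.hregion.memstore.flush.size`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of the three write-path steps; not HBase code.
class ToyWritePath {
    final List<String> wal = new ArrayList<>();                      // 1. write-ahead log
    final TreeMap<String, String> memStore = new TreeMap<>();        // 2. sorted in-memory map
    final List<TreeMap<String, String>> hFiles = new ArrayList<>();  // 3. flushed files
    private final long flushSizeBytes; // stands in for hbase.hregion.memstore.flush.size
    private long memStoreBytes = 0;

    ToyWritePath(long flushSizeBytes) {
        this.flushSizeBytes = flushSizeBytes;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);                      // log first, for durability
        memStore.put(key, value);                        // then apply to the sorted map
        memStoreBytes += key.length() + value.length();  // rough size estimate only
        if (memStoreBytes >= flushSizeBytes) {
            flush();
        }
    }

    // Writes the MemStore contents out as a new sorted "file" and resets it.
    private void flush() {
        hFiles.add(new TreeMap<>(memStore));
        memStore.clear();
        memStoreBytes = 0;
    }
}
```

Note that a flush can be triggered by a put while a scan is in progress, which is the situation explored in the following slides.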
25. Internals / Bug
Now that we know the internals – what
could go wrong?
29. Putting it together
Let’s go back to the beginning…
MemStore
Timestamp  Row   fam1:col1  fam2:col2
t2         row1  val2       val2
t1         row1  val1       val1
And start a scan.
And concurrently put.
Which causes a flush.
HFile
Row   fam2:col2
row1  val2
row1  val1
32. Putting it together
Now, scan needs to make sense of this…
MemStore
Ts  Row   fam1:col1
t2  row1  val2
t1  row1  val1
HFile
Row   fam2:col2
row1  val2
row1  val1
But HFile has no timestamp!
Inconsistent Result
Row   fam1:col1  fam2:col2
row1  val1       val2
33. Solution
Store the timestamp in the HFile
MemStore
Ts  Row   fam1:col1
t2  row1  val2
t1  row1  val1
HFile
Ts  Row   fam2:col2
t2  row1  val2
t1  row1  val1
Now we have all the information we need
Correct Result
Row   fam1:col1  fam2:col2
row1  val1       val1
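The difference between the two cases can be reproduced with a toy visibility check. This is only a sketch under stated assumptions: `ToyScanMerge` and `Cell` are hypothetical names, and a missing timestamp in the flushed file is modeled here as 0, so that such cells always pass the read-point filter.

```java
import java.util.List;

// Toy model of the HBASE-2856 scenario; class and method names are illustrative.
class ToyScanMerge {
    static final class Cell {
        final long ts;      // MVCC timestamp; 0 models "not stored in the file"
        final String value;
        Cell(long ts, String value) {
            this.ts = ts;
            this.value = value;
        }
    }

    // Returns the first (newest) value visible at readPoint.
    // The list must be ordered newest first.
    static String visible(List<Cell> cellsNewestFirst, long readPoint) {
        for (Cell c : cellsNewestFirst) {
            if (c.ts <= readPoint) {
                return c.value;
            }
        }
        return null;
    }
}
```

For a scan with read point 1, MemStore cells (2, val2) and (1, val1) yield val1. The same two values flushed without timestamps, modeled as (0, val2) and (0, val1), yield val2, which is the mixed row. With the timestamps kept in the file, the scan gets val1 for both families.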
34. Consistency
• These are only some of the consistency issues in 0.90
– e.g. HBASE-5121: MajorCompaction may
affect scan's correctness
• Solution: Upgrade to 0.92 or 0.94
35. HBase 0.94
“Performance Release”
36. Performance Improvements in 0.94
• HBASE-5047 Support checksums in HBase block cache
• HBASE-5199 Delete out of TTL store files before
compaction selection
• HBASE-4608 HLog Compression
• HBASE-4465 Lazy-seek optimization for StoreFile
scanners
38. HBASE-5047
• HDFS stores checksums in a separate file
HFile   Checksum
• So each file read actually requires two disk I/O operations
• HBase is often bottlenecked by random disk IOPS
39. HBASE-5047 Solution
• Solution: Store checksum in HFile block
HFile → HFile Block: Chksum + Data
• On by default (“hbase.regionserver.checksum.verify”)
• Bytes per checksum (“hbase.hstore.bytes.per.checksum”) – default is 16K
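Keeping checksums next to the data they cover can be illustrated with a small block type. This is a sketch under stated assumptions, not the real HFile block layout: `ToyChecksummedBlock` is a hypothetical class, CRC32 is chosen only because it ships with the JDK, and `bytesPerChecksum` mirrors the role of the setting named above.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Toy block that carries one CRC32 per chunk alongside its data.
// Not the real HFile block layout.
class ToyChecksummedBlock {
    final byte[] data;
    final long[] checksums;     // one CRC32 value per bytesPerChecksum-sized chunk
    final int bytesPerChecksum;

    ToyChecksummedBlock(byte[] data, int bytesPerChecksum) {
        this.data = data.clone();
        this.bytesPerChecksum = bytesPerChecksum;
        this.checksums = compute(this.data, bytesPerChecksum);
    }

    // Computes one checksum per chunk; the last chunk may be shorter.
    static long[] compute(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int off = i * bytesPerChecksum;
            int len = Math.min(bytesPerChecksum, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Verifies the block using only what was read with it; no second file is consulted.
    boolean verify() {
        return Arrays.equals(checksums, compute(data, bytesPerChecksum));
    }
}
```

A larger chunk size means fewer checksums to store and compute, at the cost of coarser error localization.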
41. HBASE-5199
• User can specify TTL per column family
• If all values in the HFile are expired, delete HFile rather
than compact
• Off by default, turn on via (“hbase.store.delete.expired.storefile”)
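The selection rule, dropping a file outright when every value in it has expired, can be sketched as below. This is an illustration under stated assumptions: `ToyExpiredFileSelector` and `StoreFileInfo` are hypothetical names, and the check relies on knowing each file's newest cell timestamp.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of picking fully expired store files; names are illustrative only.
class ToyExpiredFileSelector {
    static final class StoreFileInfo {
        final String name;
        final long maxTimestampMs; // newest cell timestamp in the file
        StoreFileInfo(String name, long maxTimestampMs) {
            this.name = name;
            this.maxTimestampMs = maxTimestampMs;
        }
    }

    // A file is fully expired when even its newest cell is older than now - ttl.
    static List<String> expired(List<StoreFileInfo> files, long ttlMs, long nowMs) {
        List<String> result = new ArrayList<>();
        for (StoreFileInfo f : files) {
            if (f.maxTimestampMs < nowMs - ttlMs) {
                result.add(f.name);
            }
        }
        return result;
    }
}
```

Files selected this way can be deleted without reading or rewriting their contents, which is where the saving over a compaction comes from.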
42. Conclusion
• Most consistency issues fixed in 0.92/
CDH4
• Performance improvements in 0.94
• 0.94 is wire compatible with 0.92, so will
be in a CDH4 update
43. References
• HBase ACID Semantics. http://hbase.apache.org/acid-semantics.html
• Apache HBase Meetup @ SU, Michael Stack. http://files.meetup.com/1350427/20120327hbase_meetup.pdf
• HBase Internals, Lars Hofhansl. http://www.cloudera.com/resource/hbasecon-2012-learning-hbase-internals/