2. WHO WE ARE
• We’ve built Continuuity Reactor: the world’s first scale-out
application server for Hadoop
• Fast, easy development, deployment and management of
Hadoop and HBase apps
• Continuuity team has years of experience in using and contributing
to Open Source, and we intend to continue doing so.
3. AGENDA
• Transactions over HBase: Why? What?
• Implementation: How?
• The approach
• Transaction Manager
• Transaction in batch jobs
4. THE REACTOR
• Continuuity Reactor is an app platform built on Hadoop and HBase
• Collect, Process, Store, and Query data.
• A Flow is a real-time processor with exactly-once guarantee
• A flow is composed of flowlets, connected via queues
• All processing happens with ACID guarantees in transactions
8. TRANSACTIONS: WHAT?
• Atomic - Entire transaction is committed as one
• Consistent - No partial state change due to failure
• Isolated - No dirty reads, transaction is only visible
after commit
• Durable - Once committed, data is persisted reliably
9. HBASE: QUICK “WHAT?”
• Open-source non-relational distributed column-
oriented database modeled after Google’s BigTable
• Data is stored in named Tables of Rows
• Row consists of Cells:
(row key, column family, column, timestamp) -> value
• Table is split into Regions, each served by a single node
in HBase cluster
10. WHAT ABOUT HBASE?
• Atomic operations on cell value:
checkAndPut, checkAndDelete, increment, append
• Atomic batch of operations on rows within region
• No cross region atomic operations support
• No cross table atomic operations support
• No multi-RPC atomic operations support
11. TRANSACTIONS: WHY ELSE?
• Complex data access patterns
• Secondary Indexes
• Design for greater performance
• Efficient batching of data operations
• Client-side buffer reliability fix
12. IMPLEMENTATION
• Multi-Version Concurrency Control
• Cell version (timestamp) = transaction ID
• All writes in the same transaction use the transaction ID as timestamp
• Reads exclude other, uncommitted transactions (for isolation)
• Optimistic Concurrency Control
• Conflict detection at commit of transaction
• Write Conflict: two overlapping transactions write the same row
• Rollback of one transaction in case of conflict (whichever commits later)
13. OPTIMISTIC CONCURRENCY
CONTROL
• Avoids cost of locking rows and tables
• No deadlocks or lock escalations
• Cost of conflict detection and possible rollback
is higher
• Good if conflicts are rare: short transaction,
disjoint partitioning of work
15. TRANSACTION LIFE CYCLE
time
out
try abort
failed
roll back
in HBase
write
to
HBase
do work
Client Tx Manager
none
complete V
abortsucceeded
in progress
start tx
start
start tx
commit
try commit check conflicts
RPC API
invalid X
invalidate
failed
42. TRANSACTION MANAGER
• Create new transactions
• Provides monotonically increasing write pointers
• Maintains all in-progress, committed, and invalid transactions
• Detect conflicts
• Transaction =
Write Pointer: Timestamp for HBase writes
Read pointer: Upper bound timestamp for reads
Excludes: List of timestamps to exclude from reads
43. TRANSACTION MANAGER
• Simple & Fast
• All required state is in-memory
• Single point of failure?
• Persist all state to a write-ahead log
• Secondary Tx Manager watches for failure of Primary
• Failover can happen quickly
46. TRANSACTION MANAGER
Tx Manager
Current State
in progress (-)
committed (+)
invalid
read point
write point
commit()
Tx Log
start, <write pt>
commit, <write pt>
HDFS
47. TRANSACTION SNAPSHOTS
• Write-ahead log provides persistence
• Guarantees point-in-time recovery
• Longer the log grows, longer recovery takes
• Periodically write snapshot of full transaction state
• Snapshot + all new logs provides full state
50. TRANSACTION SNAPSHOTS
Tx Log ATx Manager
in progress
committed
invalid
read point
write point
Current State
State Snapshot
in progress
committed
invalid
read point
write point
Tx Log B2
HDFS
51. TRANSACTION SNAPSHOTS
Tx Log ATx Manager
in progress
committed
invalid
read point
write point
Current State
State Snapshot
in progress
committed
invalid
read point
write point
Tx Log B
Tx Snapshot
in progress
committed
invalid
read point
write point
3
HDFS
52. TRANSACTION SNAPSHOTS
Tx Log ATx Manager
in progress
committed
invalid
read point
write point
Current State
State Snapshot
in progress
committed
invalid
read point
write point
Tx Log B
Tx Snapshot
in progress
committed
invalid
read point
write point
4
HDFS
54. TRANSACTION CLEANUP:
DATA JANITOR
• RegionObserver coprocessor
• Maintains in-memory snapshot of recent invalid &
in-progress sets
• Periodically updates from transaction snapshot in
HDFS
• Purges data from invalid transactions and older
versions on flush & compaction
55. HBase
TRANSACTION CLEANUP:
DATA JANITOR
Tx Snapshot
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
refresh
Data Janitor
(RegionObserver)
MemStore
preFlush()
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
Cell TS Value
row1:col1 1004 12
1003 11
1002 11
Custom RegionScanner
HFile
Cell TS Value
56. HBase
TRANSACTION CLEANUP:
DATA JANITOR
Data Janitor
(RegionObserver)
HFile
Cell TS Value
Custom RegionScanner
MemStore
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
Tx Snapshot
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
preFlush()
Cell TS Value
row1:col1 1004 12
1003 11
1002 11
57. HBase
TRANSACTION CLEANUP:
DATA JANITOR
Data Janitor
(RegionObserver)
HFile
Cell TS Value
row1:col1 1004 12Custom RegionScanner
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
Tx Snapshot
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
MemStore
preFlush()
Cell TS Value
row1:col1 1004 12
1003 11
1002 11
58. HBase
TRANSACTION CLEANUP:
DATA JANITOR
Data Janitor
(RegionObserver)
Custom RegionScanner
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
Tx Snapshot
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
MemStore
preFlush()
Cell TS Value
row1:col1 1004 12
1003 11
1002 11
HFile
Cell TS Value
row1:col1 1004 12
59. HBase
TRANSACTION CLEANUP:
DATA JANITOR
Data Janitor
(RegionObserver)
Custom RegionScanner
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
Tx Snapshot
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
MemStore
preFlush()
Cell TS Value
row1:col1 1004 12
1003 11
1002 11
HFile
Cell TS Value
row1:col1 1004 12
1003 11
60. HBase
TRANSACTION CLEANUP:
DATA JANITOR
Data Janitor
(RegionObserver)
Custom RegionScanner
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
Tx Snapshot
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
MemStore
preFlush()
Cell TS Value
row1:col1 1004 12
1003 11
1002 11
HFile
Cell TS Value
row1:col1 1004 12
1003 11
61. HBase
TRANSACTION CLEANUP:
DATA JANITOR
Data Janitor
(RegionObserver)
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
Tx Snapshot
read point = 1003
write point = 1005
in progress = [1004]
committed = []
invalid = [1002]
Custom RegionScanner
preFlush()
MemStore
Cell TS Value
row1:col1 1004 12
1003 11
1002 11
HFile
Cell TS Value
row1:col1 1004 12
1003 11
62. REDUNDANT VERSIONS CLEANUP
• Every write is new version of a cell
• Heavily updated values (e.g. counters) result
in lots of versions
• Cleanup: keep at most one version of a
cell visible to all transactions
• Plug-in cleanup into existing Data Janitor CP
63. TTL
• Timestamp of a value is overwritten
with tx id
• Breaks support for “native TTL”
feature in HBase
• Tx ids and timestamps cannot be
aligned: more than 1K tx/sec wanted
64. TTL
• Encode timestamp in tx id
• tx id = (current ts) * 1,000,000 + counter-
within-ms
• 1B txs/sec max
• long value overflow in ~200 years
• Plug-in cleanup into existing Data Janitor CP
• Apply TTL logic on reads
65. TRANSACTIONS IN BATCH
• Batch jobs are long running
• No changes should be visible before job is done
• Batch jobs are multi-step
• Individual step can be restarted
• All steps succeed or all fail semantics needed
• Every step retry should be isolated
• Uncommitted changes should be visible within same
step but not to other steps
66. TRANSACTIONS IN BATCH
• Batch jobs are long-running
• No fine-grained conflict detection: last started
among successfully committed wins
• Delayed rollback execution
• Batch jobs are multi-step
• For now: batch job uses single transaction, every
step re-uses it
• For future: nested transactions for steps
67. TX OVERHEAD
• 2 RPC calls, each faster than Put in HBase
• Can be improved with batching
• To avoid overhead of transactions where
appropriate the ACID guarantees can be relaxed
• No need to change data format
• When reading: read latest version of a cell
• When writing: use ts = (current ts) * 1M + counter
68. WHAT’S NEXT?
• Open Source
• Continue Scaling Tx Manager
• Transaction Groups?
• Integration across other transactional stores
69. QS?
Looking for the chance to work with a team that is defining
a new category within Big Data?
!
We are hiring!
http://continuuity.com/careers
careers[at]continuuity.com
!
!
Alex Baranau @abaranau alexb[at]continuuity.com