Transactions Over Apache HBase

Alex Baranau @abaranau
Gary Helmling @gario
Continuuity

WHO WE ARE
• We’ve built Continuuity Reactor: the world’s first scale-out
application server for Hadoop
• Fast, easy development, deployment and management of
Hadoop and HBase apps
• Continuuity team has years of experience in using and contributing
to Open Source, and we intend to continue doing so.

AGENDA
• Transactions over HBase: Why? What?
• Implementation: How?
• The approach
• Transaction Manager
• Transaction in batch jobs

THE REACTOR
• Continuuity Reactor is an app platform built on Hadoop and HBase
• Collect, Process, Store, and Query data.
• A Flow is a real-time processor with exactly-once guarantee
• A flow is composed of flowlets, connected via queues
• All processing happens with ACID guarantees in transactions

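PROCESSING IN A FLOW
[Diagram: queue -> flowlet -> queue, with the flowlet also accessing an HBase table.]

WRAP IN TRANSACTION!
[Diagram: the same queue, flowlet, and HBase table, with the flowlet's processing wrapped in a transaction.]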
TRANSACTIONS: WHAT?
• Atomic - Entire transaction is committed as one
• Consistent - No partial state change due to failure
• Isolated - No dirty reads; a transaction's changes are visible
only after commit
• Durable - Once committed, data is persisted reliably

HBASE: QUICK “WHAT?”
• Open-source non-relational distributed column-oriented
database modeled after Google’s BigTable
• Data is stored in named Tables of Rows
• Row consists of Cells: 
(row key, column family, column, timestamp) -> value
• Table is split into Regions, each served by a single node
in HBase cluster
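
To make the cell model above concrete, here is a minimal sketch that treats a table as a map from (row key, column family, column, timestamp) to value. It is a toy in-memory model, not the HBase client API: `ToyTable`, `put`, and `get` are names invented for this illustration.

```python
from typing import Dict, Optional, Tuple

# (row key, column family, column, timestamp)
CellKey = Tuple[bytes, bytes, bytes, int]


class ToyTable:
    """Toy in-memory stand-in for a table (hypothetical, not the HBase API)."""

    def __init__(self) -> None:
        self.cells: Dict[CellKey, bytes] = {}

    def put(self, row: bytes, family: bytes, column: bytes, ts: int, value: bytes) -> None:
        # Each write adds a new version; older versions stay in place.
        self.cells[(row, family, column, ts)] = value

    def get(self, row: bytes, family: bytes, column: bytes) -> Optional[bytes]:
        # A plain read returns the version with the highest timestamp.
        versions = [(k[3], v) for k, v in self.cells.items() if k[:3] == (row, family, column)]
        return max(versions)[1] if versions else None


table = ToyTable()
table.put(b"row1", b"cf", b"col1", 1001, b"10")
table.put(b"row1", b"cf", b"col1", 1002, b"11")
latest = table.get(b"row1", b"cf", b"col1")
```

Keeping every version under its own timestamp is what the transaction scheme later in the deck builds on.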

WHAT ABOUT HBASE?
• Atomic operations on cell value:  
checkAndPut, checkAndDelete, increment, append
• Atomic batch of operations on rows within region
• No cross region atomic operations support
• No cross table atomic operations support
• No multi-RPC atomic operations support

TRANSACTIONS: WHY ELSE?
• Complex data access patterns
• Secondary Indexes
• Design for greater performance
• Efficient batching of data operations
• Client-side buffer reliability fix

IMPLEMENTATION
• Multi-Version Concurrency Control
• Cell version (timestamp) = transaction ID
• All writes in the same transaction use the transaction ID as timestamp
• Reads exclude other, uncommitted transactions (for isolation)
• Optimistic Concurrency Control
• Conflict detection at commit of transaction
• Write Conflict: two overlapping transactions write the same row
• Rollback of one transaction in case of conflict (whichever commits later)
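
The conflict rule above (two overlapping transactions write the same row, and the later committer loses) can be sketched as follows. This is an illustration under assumptions, not the Reactor implementation: `ConflictDetector`, its logical clock, and the row-level change sets are hypothetical.

```python
from typing import Dict, List, Set, Tuple


class ConflictDetector:
    """Optimistic conflict check at commit time (all names are illustrative)."""

    def __init__(self) -> None:
        self.clock = 0  # logical clock, advanced on every start and commit
        self.start_time: Dict[int, int] = {}  # tx id -> clock value at start
        self.committed: List[Tuple[int, Set[str]]] = []  # (commit time, rows written)

    def start(self, tx_id: int) -> None:
        self.clock += 1
        self.start_time[tx_id] = self.clock

    def try_commit(self, tx_id: int, rows_written: Set[str]) -> bool:
        started = self.start_time.pop(tx_id)
        for commit_time, rows in self.committed:
            # Overlap: the other transaction committed after this one started.
            if commit_time > started and rows & rows_written:
                return False  # caller must roll back its writes
        self.clock += 1
        self.committed.append((self.clock, set(rows_written)))
        return True


detector = ConflictDetector()
detector.start(1002)
detector.start(1003)
first = detector.try_commit(1003, {"row1"})   # commits first, succeeds
second = detector.try_commit(1002, {"row1"})  # overlaps with 1003 on row1
detector.start(1004)
third = detector.try_commit(1004, {"row1"})   # started after 1003 committed
```

No locks are taken while the transactions run; the only check happens in `try_commit`.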

OPTIMISTIC CONCURRENCY CONTROL
• Avoids cost of locking rows and tables
• No deadlocks or lock escalations
• Cost of conflict detection and possible rollback
is higher
• Good if conflicts are rare: short transactions,
disjoint partitioning of work

TRANSACTIONS IN CONTEXT
[Diagram: Clients 1, 2, ..., N; a Tx Manager (active) and a Tx Manager (standby); ZooKeeper; an HBase cluster with Master 1, Master 2, and region servers RS 1 to RS 4.]
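TRANSACTION LIFE CYCLE
[Diagram: client actions on one side, Tx Manager states on the other, connected through the RPC API. Transitions are listed as far as they can be read from the diagram labels.]
• Client: start tx, do work (write to HBase), try commit; if the commit fails, roll back in HBase and try abort
• Tx Manager states: none, in progress, complete (V), invalid (X)
• start: none -> in progress
• commit: check conflicts; succeeded -> complete
• abort: succeeded -> complete; failed -> invalidate -> invalid
• time out: in progress -> invalid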
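
The life cycle can also be written down as a small state machine. The sketch below is one reading of the diagram labels (none, in progress, complete, invalid), not a confirmed specification: the event names and the transition table are assumptions made for illustration.

```python
from enum import Enum


class TxState(Enum):
    NONE = "none"
    IN_PROGRESS = "in progress"
    COMPLETE = "complete"
    INVALID = "invalid"


# (current state, event) -> next state.
# Assumed reading of the life-cycle diagram; event names are hypothetical.
TRANSITIONS = {
    (TxState.NONE, "start"): TxState.IN_PROGRESS,
    (TxState.IN_PROGRESS, "commit_succeeded"): TxState.COMPLETE,
    (TxState.IN_PROGRESS, "abort_succeeded"): TxState.COMPLETE,
    (TxState.IN_PROGRESS, "invalidate"): TxState.INVALID,
    (TxState.IN_PROGRESS, "time_out"): TxState.INVALID,
}


def next_state(state: TxState, event: str) -> TxState:
    """Return the state after `event`, or raise if the transition is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} in state {state.value!r}") from None


state = next_state(TxState.NONE, "start")
state_after_commit = next_state(state, "commit_succeeded")
state_after_failed_abort = next_state(state, "invalidate")
```

A failed commit is not a state change in this sketch: the transaction stays in progress until the client's abort succeeds or the transaction is invalidated.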

CLIENT SIDE: TX AWARE
[Animated diagram, summarized. Clients talk to the Tx Manager and to HBase. The cell row1:col1 starts with one version: TS 1001, value 10.]
• Initial Tx Manager state: write = 1002, read = 1001
• A client starts a transaction and receives write = 1002, read = 1001; the Tx Manager moves to write = 1003, inprogress=[1002]
• Transaction 1002 increments row1:col1: HBase now also holds TS 1002, value 11
• A second transaction starts and receives write = 1003, read = 1001, excluded=[1002]; the Tx Manager moves to write = 1004, inprogress=[1002, 1003]
• Transaction 1003 increments row1:col1: TS 1002 is excluded, so it reads 10 and writes TS 1003, value 11
• Transaction 1003 commits; the Tx Manager is left with inprogress=[1002]
• Transaction 1002 tries to commit: conflict!
• Rollback: the TS 1002 version is deleted from HBase, then the transaction is aborted; the Tx Manager moves to inprogress=[], read = 1003
• A new transaction starts with write = 1004, read = 1003 (Tx Manager: write = 1005) and reads the version at TS 1003

If the rollback fails:
• The TS 1002 version stays in HBase
• The client invalidates the transaction; the Tx Manager moves to inprogress=[], invalid=[1002], read = 1003
• A new transaction starts with write = 1004, read = 1003, exclude = [1002] (Tx Manager: write = 1005)
• Its read skips TS 1002: the leftover version is invisible!
TRANSACTION MANAGER
• Creates new transactions
• Provides monotonically increasing write pointers
• Maintains all in-progress, committed, and invalid transactions
• Detects conflicts
• Transaction =
Write pointer: Timestamp for HBase writes
Read pointer: Upper bound timestamp for reads
Excludes: List of timestamps to exclude from reads
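
A transaction as defined above is enough to decide which cell versions a read may return. The following is a minimal sketch: the `Transaction` dataclass and `read_cell` helper are hypothetical, and letting a transaction see its own writes is an assumption made here, not something the slide states.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Transaction:
    write_pointer: int         # timestamp used for this transaction's writes
    read_pointer: int          # upper bound timestamp for reads
    excludes: FrozenSet[int]   # timestamps to skip (in-progress or invalid)

    def can_see(self, ts: int) -> bool:
        # Assumption: a transaction sees its own writes.
        if ts == self.write_pointer:
            return True
        return ts <= self.read_pointer and ts not in self.excludes


def read_cell(versions: Dict[int, int], tx: Transaction) -> Optional[int]:
    """Return the newest visible value from {timestamp: value}, or None."""
    visible = [ts for ts in versions if tx.can_see(ts)]
    return versions[max(visible)] if visible else None


# Versions of row1:col1 after a failed rollback left TS 1002 behind.
cell = {1001: 10, 1002: 11, 1003: 11}
tx = Transaction(write_pointer=1004, read_pointer=1003, excludes=frozenset({1002}))
value = read_cell(cell, tx)

# An earlier transaction that started while 1002 was still in progress.
early = Transaction(write_pointer=1003, read_pointer=1001, excludes=frozenset({1002}))
early_value = read_cell({1001: 10, 1002: 11}, early)
```

The filter runs entirely on the read path, so the stored data needs nothing beyond the transaction ID in the timestamp field.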

TRANSACTION MANAGER
• Simple & Fast
• All required state is in-memory
• Single point of failure?
• Persist all state to a write-ahead log
• Secondary Tx Manager watches for failure of Primary
• Failover can happen quickly

TRANSACTION MANAGER
[Diagram: the Tx Manager's current state consists of in progress, committed, invalid, read point, and write point; state changes are appended to a Tx Log in HDFS.]
• start(): write point ++, transaction added to in progress (+); log entry "started, <write pt>"
• commit(): transaction removed from in progress (-) and added to committed (+); log entry "commit, <write pt>"
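
The start() and commit() state changes can be sketched with an in-memory manager that appends to a log before touching its state. This is an illustration only: `ToyTxManager` is hypothetical, a Python list stands in for the Tx Log on HDFS, and conflict detection is left out.

```python
from typing import List, Set, Tuple


class ToyTxManager:
    """In-memory transaction state plus an append-only log (illustrative)."""

    def __init__(self, log: List[Tuple[str, int]], write_point: int = 1001) -> None:
        self.log = log  # stands in for the Tx Log on HDFS
        self.in_progress: Set[int] = set()
        self.committed: Set[int] = set()
        self.write_point = write_point

    def start(self) -> int:
        tx_id = self.write_point + 1
        self.log.append(("started", tx_id))  # log first, then update memory
        self.write_point = tx_id
        self.in_progress.add(tx_id)
        return tx_id

    def commit(self, tx_id: int) -> None:
        if tx_id not in self.in_progress:
            raise ValueError(f"transaction {tx_id} is not in progress")
        self.log.append(("commit", tx_id))
        self.in_progress.remove(tx_id)
        self.committed.add(tx_id)


tx_log: List[Tuple[str, int]] = []
manager = ToyTxManager(tx_log)
tx_id = manager.start()
manager.commit(tx_id)
```

Writing the log entry before mutating memory means a restarted manager can rebuild the same state by replaying the log.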

TRANSACTION SNAPSHOTS
• Write-ahead log provides persistence
• Guarantees point-in-time recovery
• The longer the log grows, the longer recovery takes
• Periodically write snapshot of full transaction state
• Snapshot + all new logs provide full state
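
Recovery from "snapshot + all new logs" amounts to loading the snapshot and replaying the log entries written after it. A minimal sketch, assuming a snapshot is a dictionary of sets and log entries are `(operation, tx id)` pairs; both formats and the `recover` function are hypothetical.

```python
import copy
from typing import Any, Dict, Iterable, Tuple


def recover(snapshot: Dict[str, Any], log_entries: Iterable[Tuple[str, int]]) -> Dict[str, Any]:
    """Rebuild transaction state from a snapshot plus the log written after it."""
    state = copy.deepcopy(snapshot)  # leave the snapshot itself untouched
    for op, tx_id in log_entries:
        if op == "started":
            state["in_progress"].add(tx_id)
            state["write_point"] = max(state["write_point"], tx_id)
        elif op == "commit":
            state["in_progress"].discard(tx_id)
            state["committed"].add(tx_id)
        else:
            raise ValueError(f"unknown log entry: {op!r}")
    return state


snapshot = {"in_progress": {1002}, "committed": set(), "write_point": 1002}
new_log = [("commit", 1002), ("started", 1003)]
state = recover(snapshot, new_log)
```

Only entries newer than the snapshot are replayed, which is why periodic snapshots bound recovery time.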

[Diagram in four numbered steps. The Tx Manager's current state and each snapshot hold: in progress, committed, invalid, read point, write point.]
1. The Tx Manager, which has been writing to Tx Log A in HDFS, opens a new Tx Log B
2. It copies its current state into an in-memory State Snapshot
3. The State Snapshot is written to HDFS as a Tx Snapshot
4. Final frame: the Tx Snapshot, Tx Log A, and Tx Log B are shown in HDFS

TRANSACTION CLEANUP
[Diagram: the state after a failed rollback. HBase holds row1:col1 at TS 1001 = 10, TS 1002 = 11, and TS 1003 = 11; the TS 1002 version written by the failed transaction is still there. Tx Manager: write = 1004, read = 1001, inprogress=[1002].]

TRANSACTION CLEANUP: DATA JANITOR
• RegionObserver coprocessor
• Maintains in-memory snapshot of recent invalid &
in-progress sets
• Periodically updates from transaction snapshot in
HDFS
• Purges data from invalid transactions and older
versions on flush & compaction

DATA JANITOR
[Animated diagram, summarized.]
• The Data Janitor (RegionObserver) refreshes its state from the Tx Snapshot: read point = 1003, write point = 1005, in progress = [1004], committed = [], invalid = [1002]
• MemStore holds three versions of row1:col1: TS 1004 = 12, TS 1003 = 11, TS 1002 = 11
• preFlush() routes the flush through a custom RegionScanner
• The HFile receives TS 1004 = 12 and TS 1003 = 11; TS 1002, written by an invalid transaction, is dropped
REDUNDANT VERSIONS CLEANUP
• Every write is a new version of a cell
• Heavily updated values (e.g. counters) result
in lots of versions
• Cleanup: keep at most one version of a
cell visible to all transactions
• Plug cleanup into existing Data Janitor CP
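
One way to read "keep at most one version visible to all transactions" is sketched below: versions newer than a bound are all kept, and of the versions at or below the bound only the newest survives. The bound (the highest timestamp that every in-progress transaction can already see) and the `prune_versions` helper are assumptions made for illustration, and versions from invalid transactions are assumed to be removed beforehand.

```python
from typing import Iterable, List


def prune_versions(timestamps: Iterable[int], visible_to_all_bound: int) -> List[int]:
    """Return the version timestamps to keep, newest first."""
    all_ts = list(timestamps)
    # Versions above the bound may still matter to some in-progress transaction.
    newer = [ts for ts in all_ts if ts > visible_to_all_bound]
    # At or below the bound, every transaction would read the newest one anyway.
    older = [ts for ts in all_ts if ts <= visible_to_all_bound]
    keep = newer + ([max(older)] if older else [])
    return sorted(keep, reverse=True)


kept = prune_versions([1001, 1003, 1004, 1007], visible_to_all_bound=1004)
```

For a heavily updated counter this collapses a long history into a single base version plus the few recent ones.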

TTL
• Timestamp of a value is overwritten
with tx id
• Breaks support for “native TTL”
feature in HBase
• Tx ids cannot simply equal millisecond timestamps:
more than 1K tx/sec is wanted

TTL
• Encode timestamp in tx id
• tx id = (current ts) * 1,000,000 + counter-within-ms
• 1B txs/sec max
• long value overflow in ~200 years
• Plug cleanup into existing Data Janitor CP
• Apply TTL logic on reads
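
The ID encoding and the read-side TTL check can be sketched in a few lines. The formula comes from the slide; the function names, the epoch-millisecond clock, and the exact expiry rule are assumptions made for this example.

```python
MAX_TX_PER_MS = 1_000_000
LONG_MAX = 2**63 - 1  # largest signed 64-bit value


def make_tx_id(ts_ms: int, counter: int) -> int:
    """tx id = (current ts) * 1,000,000 + counter-within-ms."""
    if not 0 <= counter < MAX_TX_PER_MS:
        raise ValueError("counter must fit in one millisecond's ID range")
    return ts_ms * MAX_TX_PER_MS + counter


def tx_id_to_ts_ms(tx_id: int) -> int:
    """Recover the millisecond timestamp encoded in a tx id."""
    return tx_id // MAX_TX_PER_MS


def is_expired(tx_id: int, ttl_ms: int, now_ms: int) -> bool:
    """TTL check applied on reads (assumed rule: write time + TTL < now)."""
    return tx_id_to_ts_ms(tx_id) + ttl_ms < now_ms


tx_id = make_tx_id(1_400_000_000_000, 7)
max_tx_per_sec = MAX_TX_PER_MS * 1000
# Years of epoch-millisecond timestamps that fit before a signed long overflows.
overflow_years = LONG_MAX // MAX_TX_PER_MS / 1000 / 86400 / 365.25
```

With epoch milliseconds the encoded value overflows roughly 292 years after 1970, which is in line with the slide's rough estimate.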

TRANSACTIONS IN BATCH
• Batch jobs are long running
• No changes should be visible before job is done
• Batch jobs are multi-step
• Individual step can be restarted
• All steps succeed or all fail semantics needed
• Every step retry should be isolated
• Uncommitted changes should be visible within same
step but not to other steps

TRANSACTIONS IN BATCH
• Batch jobs are long-running
• No fine-grained conflict detection: last started
among successfully committed wins
• Delayed rollback execution
• Batch jobs are multi-step
• For now: batch job uses single transaction, every
step re-uses it
• For future: nested transactions for steps

TX OVERHEAD
• 2 RPC calls, each faster than Put in HBase
• Can be improved with batching
• To avoid transaction overhead where appropriate,
the ACID guarantees can be relaxed
• No need to change data format
• When reading: read latest version of a cell
• When writing: use ts = (current ts) * 1M + counter

WHAT’S NEXT?
• Open Source
• Continue Scaling Tx Manager
• Transaction Groups?
• Integration across other transactional stores

QS?
Looking for the chance to work with a team that is defining
a new category within Big Data?
We are hiring!
http://continuuity.com/careers
careers[at]continuuity.com
Alex Baranau @abaranau alexb[at]continuuity.com
