2. Outline
• Why Castle?
• A [quick] tour of Castle
• Cassandra on Castle
• An aside into Memcache
• Cross-cluster snapshots and clones
Saturday, 24 September 2011
3. Before the Flood
1990
Small databases
BTree indexes
BTree File systems
RAID
Old hardware
4. Two Revolutions
2010
Distributed, shared-nothing databases
Write-optimised indexes
BTree file systems
RAID
New hardware
(stack repeated on each node)
5. Bridging the Gap
2011
Distributed, shared-nothing databases
Castle
New hardware
(stack repeated on each node)
6. [Diagram: Castle (Acunu Kernel) architecture stack, top to bottom]
Userspace and in-kernel workloads
Shared memory interface (userspace interface): async shared memory ring; keys and values in shared buffers
Streaming interface (kernelspace interface): key insert, buffered value insert, key get, buffered value get, range queries
Doubling Arrays (doubling array mapping layer): insert queues, Bloom filters, arrays management, merges; key insert, key get, range queries
Arrays (modlist btree mapping layer): version tree, btree, value arrays; key insert, key get, range queries
Cache (block mapping & caching layer): "Extent" layer with extent allocator & mapper, freespace manager, extent block cache, prefetcher, flusher, page cache
Linux Kernel (Linux's block & MM layers): block layer, memory manager
7. Castle
• Like ZFS+BDB for Big Data
• Open source (GPLv2, MIT for user libraries)
• http://bitbucket.org/acunu
• Loadable Kernel Module, targeting CentOS’s 2.6.18
• http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
[Diagram: Castle architecture stack]
8. The Interface
[Diagram: upper part of the Castle stack: shared memory interface, streaming interface, Doubling Arrays]
castle_{back,objects}.c
9. The Interface
• Create, snapshot, clone
• Attach/detach
• Keys: any number of dimensions
• Values: any size
• Simple get, put, delete
• Iterator, slice interfaces
• Streaming interface
[Diagram: tree of versions with an attachment; levels: v0; v1, v3; v12, v13, v15; v16, v24]
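The interplay of snapshot, clone, get, put, and delete on a tree of versions can be illustrated with a toy in-memory model. This is only a sketch under assumptions: the class name VersionTree, its method signatures, and the rule that a snapshot freezes the version it is taken from are all hypothetical and are not Castle's actual API. Attach/detach, iterators, and the streaming interface are not modelled.

```python
_DELETED = object()  # tombstone marker for deleted keys


class VersionTree:
    """Toy in-memory store with a tree of versions (hypothetical API)."""

    def __init__(self):
        self.parent = {0: None}   # version id -> parent id; v0 is the root
        self.frozen = set()       # versions that no longer accept writes
        self.mods = {}            # key -> {version id: value or _DELETED}
        self._next_id = 1

    def _child_of(self, v):
        child = self._next_id
        self._next_id += 1
        self.parent[child] = v
        return child

    def snapshot(self, v):
        """Freeze v and return a new writable version that continues from it."""
        self.frozen.add(v)
        return self._child_of(v)

    def clone(self, v):
        """Branch a new writable version from an already frozen version."""
        if v not in self.frozen:
            raise ValueError("clone expects a snapshotted (frozen) version")
        return self._child_of(v)

    def put(self, v, key, value):
        if v in self.frozen:
            raise ValueError("version %d is read-only" % v)
        self.mods.setdefault(key, {})[v] = value

    def delete(self, v, key):
        self.put(v, key, _DELETED)

    def get(self, v, key):
        """Return the value visible at v: the nearest ancestor's write wins."""
        history = self.mods.get(key, {})
        while v is not None:
            if v in history:
                value = history[v]
                return None if value is _DELETED else value
            v = self.parent[v]
        return None


store = VersionTree()
key = ("row1", "colA")      # keys may have several dimensions
store.put(0, key, "x")
v1 = store.snapshot(0)      # v0 becomes read-only, v1 continues from it
v2 = store.clone(0)         # a second writable branch from v0
store.put(v1, key, "y")     # visible only in v1
store.delete(v2, key)       # hidden only in v2
```

Reads walk from the requested version towards the root, so a clone costs no copying up front; each version only records what it changed.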
11. Doubling Array
[Diagram: Castle stack from the streaming interface down to the Arrays layer, including the Doubling Arrays layer]
castle_{da,bloom}.c
12. Doubling Array
Inserts
[Diagram: example arrays holding keys 2 and 9]
Buffer arrays in memory until we have > B of them
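A minimal sketch of the insert path of a doubling array follows, assuming a binary-counter style of merging: inserts collect in an in-memory buffer, and once more than B entries have accumulated the buffer is written out as a sorted array and merged upwards with any array already occupying its level. The class name DoublingArray, the buffer_size parameter (standing in for B), and the merge policy are illustrative assumptions, not taken from castle_da.c.

```python
import bisect


def _merge(newer, older):
    """Merge two sorted (key, value) lists; on equal keys the newer entry wins."""
    out = []
    i = j = 0
    while i < len(newer) and j < len(older):
        if newer[i][0] < older[j][0]:
            out.append(newer[i])
            i += 1
        elif newer[i][0] > older[j][0]:
            out.append(older[j])
            j += 1
        else:
            out.append(newer[i])
            i += 1
            j += 1
    out.extend(newer[i:])
    out.extend(older[j:])
    return out


class DoublingArray:
    def __init__(self, buffer_size=4):
        self.buffer_size = buffer_size  # stands in for B (assumption)
        self.buffer = {}                # in-memory inserts
        self.levels = []                # levels[i] is None or a sorted list

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) > self.buffer_size:
            self._flush()

    def _flush(self):
        carry = sorted(self.buffer.items())
        self.buffer = {}
        level = 0
        # Like carrying in a binary counter: merge while the level is occupied.
        while level < len(self.levels) and self.levels[level] is not None:
            carry = _merge(carry, self.levels[level])
            self.levels[level] = None
            level += 1
        if level == len(self.levels):
            self.levels.append(None)
        self.levels[level] = carry

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for arr in self.levels:         # lower levels hold newer data
            if arr is None:
                continue
            i = bisect.bisect_left(arr, (key,))
            if i < len(arr) and arr[i][0] == key:
                return arr[i][1]
        return None


da = DoublingArray(buffer_size=2)
for k, v in [(2, "a"), (9, "b"), (5, "c"), (1, "d"), (7, "e"), (2, "f")]:
    da.insert(k, v)
```

Every write to an array is a sequential merge, which is why updates need no random IO; a lookup may have to probe several arrays, which is the cost Bloom filters are meant to reduce.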
16. “Mod-list” B-Tree
[Diagram: Castle stack, Doubling Arrays, Arrays, and Cache layers]
So how to do snapshots and clones?
castle_{btree,versions}.c
17. Copy-on-Write BTree
Idea:
• Apply path-copying [DSST] to the B-tree
Problems:
• Space blowup: each update may rewrite an entire path
• Slow updates: as above
A log file system makes updates sequential, but relies on random access and garbage collection (Achilles heel!)
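Path copying can be sketched in a few lines. To keep the example short it uses a plain binary search tree rather than a B-tree, which is a deliberate simplification; the Node type and the insert and get functions are illustrative names. The point it shows is the one made on the slide: every update writes a fresh copy of each node on the path from the root to the change, while untouched subtrees are shared between versions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Node:
    key: Any
    value: Any
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return (new_root, nodes_written); the old root remains a valid version."""
    if root is None:
        return Node(key, value), 1
    if key < root.key:
        child, written = insert(root.left, key, value)
        return Node(root.key, root.value, child, root.right), written + 1
    if key > root.key:
        child, written = insert(root.right, key, value)
        return Node(root.key, root.value, root.left, child), written + 1
    return Node(key, value, root.left, root.right), 1


def get(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


v1 = None
for k in (5, 3, 8):
    v1, _ = insert(v1, k, str(k))

# One more insert yields a new version; v1 is untouched.
v2, written = insert(v1, 1, "x")
```

Here the single insert rewrites the nodes for keys 5 and 3 and adds the node for key 1, so three nodes are written for one logical change. In a B-tree each of those nodes is a whole block, which is where the space blowup and slow updates come from.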
18. [Table: update, range query, and space costs]
Structure                      | Update                      | Range query           | Space
CoW B-Tree                     | O(log_B N_v) random IOs     | O(Z/B) random IOs     | O(N B log_B N_v)
“BigTable” style DA (LevelDB)  | O((log N)/B) sequential IOs | O(Z/B) sequential IOs | O(VN)
“Mod-list” in a DA (Castle)    | O((log N)/B) sequential IOs | O(Z/B) sequential IOs | O(N)
N_v = #keys live (accessible) at version v
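To get a feel for the gap between the two update bounds, they can be evaluated for one set of sizes. The numbers below are made up for illustration, the logarithm in the doubling-array bound is assumed to be base 2, B is read as entries per block, and the constants hidden by the O-notation are ignored, so this is a rough comparison only.

```python
import math


def cow_update_ios(n_live, block_entries):
    """CoW B-tree update bound O(log_B N_v), constants dropped."""
    return math.log(n_live) / math.log(block_entries)


def da_update_ios(n_total, block_entries):
    """Doubling-array update bound O((log N)/B), log base 2 assumed."""
    return math.log2(n_total) / block_entries


N = 2 ** 30   # hypothetical number of keys
B = 1024      # hypothetical entries per block
cow = cow_update_ios(N, B)
da = da_update_ios(N, B)
```

With these sizes the CoW bound works out to about 3 random IOs per update, against roughly 0.03 sequential IOs per update (amortised) for the array-based structures.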
19. Stratified B-Trees
• Retires Copy-On-Write B-Trees, the bedrock of
modern storage (Sun ZFS, NetApp WAFL, ...)
• Patent-pending, next-generation data structure
• Theoretically optimal, yet highly practical
[Paper excerpt, shown as a two-column page image]
Copy-on-write B-tree finally beaten.
Andy Twigg*, Andrew Byde*, Grzegorz Miłoś*, Tim Moreton*, John Wilkes†*, and Tom Wilkie*
*Acunu, †Google
firstname@acunu.com
http://goo.gl/INTb1 http://goo.gl/gzihe
Abstract
A classic versioned data structure in storage and computer science is the copy-on-write (CoW) B-tree – it underlies many of today’s file systems and databases, including WAFL, ZFS, Btrfs and more. Unfortunately, it doesn’t inherit the B-tree’s optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO to scale. Yet, nothing better has been developed since. We describe the ‘stratified B-tree’, which beats the CoW B-tree in every way. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance. Therefore, we believe there is no longer a good reason to use CoW B-trees for versioned data stores.
1 Introduction
The B-tree was presented in 1972 [1], and it survives [...]
[Right-hand column]
This paper presents some recent results on new constructions for B-trees that go beyond copy-on-write, that we call ‘stratified B-trees’. They solve two open problems: Firstly, they offer a fully-versioned B-tree with optimal space and the same lookup time as the CoW B-tree. Secondly, they are the first to offer other points on the Pareto optimal query/update tradeoff curve, and in particular, our structures offer fully-versioned updates in o(1) IOs, while using linear space. Experimental results indicate 100,000s updates/s on a large SATA disk, two orders of magnitude faster than a CoW B-tree.
Since stratified B-trees subsume CoW B-trees (and indeed all other known versioned external-memory dictionaries), we believe there is no longer a good reason to use them for versioned data stores. Acunu is developing a commercial in-kernel implementation of stratified B-trees, which we hope to release soon.
20. “Mod-list” B-Tree
[Diagram: Castle stack, Doubling Arrays, Arrays, and Cache layers]
castle_{btree,versions}.c
21. Disk Layout: RDA
[Diagram: Castle stack, Arrays, Cache, and Linux Kernel layers]
castle_{cache,extent,freespace,rebuild}.c
23. SSD tiering [taster]
• Why? Key to random reads on data sets larger than cache
• v1: SSD for metadata structures
• Redundancy provided by disk
• SSD for selected collection data (CFs)
• 10x the write rate on SSDs compared with regular FSs
25. Cassandra on Castle
• Eliminate all ‘storage heavy lifting’
• Extend ColumnFamilyStore
• Efficient JNI bindings to libcastle C library
• row, col, value, t: (row, col) -> (t,value)
• row, a|b|c|d, value, t:
(row, a, b, c, d, col) -> (t,value)
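The two mappings listed on this slide can be written as one small function. This is a sketch under assumptions: castle_entry is a hypothetical helper, not part of libcastle or of Cassandra, and the slide does not spell out how the a|b|c|d part relates to the column, so the sketch treats it as a "|"-delimited path whose components become extra key dimensions placed between the row and the column.

```python
def castle_entry(row, col, value, t, path=None):
    """Map one Cassandra column to a (key, value) pair for a multi-dimensional store.

    path is an optional "|"-delimited string such as "a|b|c|d" (assumed format).
    """
    dims = path.split("|") if path else []
    key = (row, *dims, col)     # one tuple element per key dimension
    return key, (t, value)      # the timestamp travels with the value


simple = castle_entry("user1", "email", "a@example.com", 17)
nested = castle_entry("user1", "name", "x", 18, path="a|b|c|d")
```

Keeping each component as its own key dimension means a range query over a prefix such as (row, a, b) can select a contiguous slice of the key space.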
26. Small random inserts
Inserting 3 billion rows
[Chart: Acunu-powered Cassandra vs. ‘standard’ Cassandra]
27. Insert latency
While inserting 3 billion rows
[Chart: Acunu-powered Cassandra (x) vs. ‘standard’ Cassandra (+)]
28. Small random range queries
Performed immediately after inserts
[Chart: Acunu-powered Cassandra vs. ‘standard’ Cassandra]
29. Memcache + Cassandra
[Diagram: a Cass client (get/insert) and a memcached client (get/put) reach the same data. Each node stacks replication logic, Cassandra and memcache, Castle, and H/W.]
Same data!
100k random inserts/sec!
30. v2: Cross-cluster versions
• Eventually consistent
• Spans data centers
• Tolerates node failure,
network partition
• High performance,
no space overhead
• Dev/Test/Staging on Prod
clusters
31. So...
• Castle = ZFS + BDB for Big Data
• Cassandra on Castle runs apps unmodified
• Up to 100x throughput under load
• No GC pauses: very predictable latencies
• v2: Cross-cluster snapshot and clone
• SSD optimisation
33. Questions?
Tim Moreton // @timmoreton
http://goo.gl/INTb1 http://goo.gl/gzihe
Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and
elephant logos are trademarks of the Apache Software Foundation.