Cassandra & the Acunu Data Platform
1. Cassandra & the Acunu Data Platform
Tom Wilkie
Founder & VP Engineering
@tom_wilkie
2. Before the Flood
1990
Small databases
BTree indexes
BTree File systems
RAID
Old hardware
3. Two Revolutions
2010
Distributed, shared-nothing databases
Write-optimised indexes
BTree file systems
RAID
New hardware
4. Bridging the Gap
2011
Distributed, shared-nothing databases
Castle
New hardware
17. [Architecture diagram: Castle layers, with the Doubling Arrays layer and castle_{da,bloom}.c called out]
Userspace interface: in-kernel workloads; async, shared memory ring; shared buffers; streaming interface
Kernelspace interface: key insert, key get, buffered value insert, buffered value get, range queries
Doubling Arrays: doubling array mapping layer, insert queues, Bloom filters, arrays management, merges
Arrays: mapping layer, modlist btree, version tree
18. B-Tree
• Each node holds up to B entries, so the tree height is log_B N
• If node is full, split and insert new node into parent (recurse)
• For random inserts, nodes placed randomly on disk
19.
            Update                   Range Query (Size Z)
B-Tree      O(log_B N) random IOs    O(Z/B) random IOs
B = “block size”, say 8KB at 100 bytes/entry ~= 80 entries (on the order of 100)
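To make the table concrete, here is a small worked calculation of the two B-Tree costs. The block and entry sizes come from the slide; the key count (10^9) and range size (10,000) are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import math

BLOCK_BYTES = 8 * 1024            # block size quoted on the slide
ENTRY_BYTES = 100                 # entry size quoted on the slide
B = BLOCK_BYTES // ENTRY_BYTES    # entries that fit in one block


def btree_update_ios(n_keys, b=B):
    """Approximate random IOs per update: the tree height, ceil(log_b n)."""
    return math.ceil(math.log(n_keys) / math.log(b))


def range_query_ios(z, b=B):
    """Approximate IOs to read z consecutive entries: ceil(z / b)."""
    return math.ceil(z / b)


if __name__ == "__main__":
    # Assumed workload: one billion keys, range queries of 10,000 entries.
    print(B, btree_update_ios(10**9), range_query_ios(10_000))
```

The counts are the same order as the asymptotic bounds in the table; the point the slide makes is that for a B-Tree each of these IOs is a random one.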
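20. Doubling Array
Inserts
[Diagram: inserting 2 and then 9 produces the sorted array (2, 9)]
Buffer arrays in memory
until we have > B of them
21. Doubling Array
Inserts
[Diagram: inserting 11 and then 8 produces (8, 11), which is merged with (2, 9) into (2, 8, 9, 11)]
etc...
Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...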
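The insert sequence on these two slides can be sketched in a few lines. This is a minimal in-memory illustration, not Castle's implementation: the class name `DoublingArray` and its methods are hypothetical, it stores keys only, and it ignores on-disk layout, Bloom filters and the in-memory buffering mentioned on the slide.

```python
import bisect
from heapq import merge


class DoublingArray:
    """Level i is either empty (None) or a sorted list of 2**i keys."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        carry = [key]
        for i, level in enumerate(self.levels):
            if level is None:
                # Free slot: the carried array settles here.
                self.levels[i] = carry
                return
            # Slot occupied: merge the two sorted arrays and carry upwards.
            carry = list(merge(level, carry))
            self.levels[i] = None
        self.levels.append(carry)

    def contains(self, key):
        # Binary-search every non-empty level.
        for level in self.levels:
            if level is not None:
                j = bisect.bisect_left(level, key)
                if j < len(level) and level[j] == key:
                    return True
        return False


if __name__ == "__main__":
    da = DoublingArray()
    for k in (2, 9, 11, 8):      # the keys used on the slides
        da.insert(k)
    print(da.levels)
```

Each merge reads and writes whole sorted arrays, which is why the update cost in the later tables is counted in sequential rather than random IOs.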
27. [Architecture diagram, repeated from slide 17]
28. [Architecture diagram: Castle layers, with the “Mod-list” B-Tree layer and castle_{btree,versions}.c called out]
Doubling Arrays: doubling array mapping layer, insert queues, Bloom filters, arrays management, merges
Arrays (“Mod-list” B-Tree): mapping layer, modlist btree, version tree
Cache: block mapping & caching layer, with prefetcher, extent block cache, flusher, page cache
"Extent" layer: freespace manager, extent allocator & mapper
29. Copy-on-Write BTree
Idea:
• Apply path-copying [DSST] to
the B-tree
Problems:
• Space blowup: Each update may
rewrite an entire path
• Slow updates: as above
A log-structured file system makes updates sequential, but relies on
random access and garbage collection (its Achilles heel!)
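Path-copying is easiest to see on a small tree. The sketch below applies it to a plain binary search tree rather than a B-tree, purely to keep the example short; the `Node`, `insert` and `get` names are hypothetical. Every update returns a new root and copies only the nodes on the path to the changed key, which is exactly the "rewrite an entire path" cost listed above.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return the root of a new version; the old version stays intact."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        # Copy this node, pointing at a new left subtree.
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)


def get(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


if __name__ == "__main__":
    v1 = insert(insert(insert(None, 5, "a"), 2, "b"), 8, "c")
    v2 = insert(v1, 2, "B")              # new version, v1 still readable
    print(get(v1, 2), get(v2, 2))
    print(v2.right is v1.right)          # untouched subtree is shared
```

In a B-tree each copied node is a whole block, so one small update rewrites roughly log_B N blocks.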
30.
              Update                    Range Query          Space
CoW B-Tree    O(log_B Nv) random IOs    O(Z/B) random IOs    O(N B log_B Nv)
Nv = #keys live (accessible) at version v
32. “BigTable” snapshots
[Diagram: versions v1 and v2 sharing arrays a, b and c, each array labelled with its reference count]
• Inserts produce arrays
• Snapshots increment ref counts on arrays
• Merges produce more arrays, decrement ref count on old arrays
33. “BigTable” snapshots
[Diagram: after a merge, the old arrays a and b sit alongside the merged array (a b c), shared between v1 and v2]
34. “BigTable” snapshots
• Space blowup
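The three bullets and the resulting space blowup can be sketched as follows. This is a toy model under stated assumptions, not BigTable's or LevelDB's code: `Array`, `Store` and their methods are hypothetical names, keys stand in for whole records, and a merge simply combines every live array.

```python
from heapq import merge


class Array:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.refs = 1                    # held by the live version


class Store:
    def __init__(self):
        self.arrays = []                 # every array ever created
        self.live = []                   # arrays in the current version
        self.snapshots = {}              # snapshot name -> arrays it pins

    def insert(self, keys):
        a = Array(keys)                  # inserts produce arrays
        self.arrays.append(a)
        self.live.append(a)

    def snapshot(self, name):
        for a in self.live:              # snapshots increment ref counts
            a.refs += 1
        self.snapshots[name] = list(self.live)

    def merge_live(self):
        merged = Array(merge(*(a.keys for a in self.live)))
        for a in self.live:              # merges decrement old ref counts
            a.refs -= 1
        self.arrays.append(merged)
        self.live = [merged]

    def space(self):
        # Keys stored in arrays that something still references.
        return sum(len(a.keys) for a in self.arrays if a.refs > 0)


if __name__ == "__main__":
    s = Store()
    s.insert(["a"])
    s.insert(["b"])
    s.snapshot("v1")
    s.insert(["c"])
    s.merge_live()
    print(s.space())                     # more than the 3 distinct keys
```

Arrays pinned by a snapshot survive the merge, so the same keys are stored once per version that can still see them.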
35.
                       Update                         Range Query              Space
CoW B-Tree             O(log_B Nv) random IOs         O(Z/B) random IOs        O(N B log_B Nv)
“BigTable”-style DA    O((log N)/B) sequential IOs    O(Z/B) sequential IOs    O(VN)
Nv = #keys live (accessible) at version v
36. “Mod-list” BTree
Idea:
• Apply fat-nodes [DSST] to the
B-tree
• i.e. insert (key, version, value)
tuples, with special operations
Problems:
• Similar performance to a BTree
If you limit the number of versions, it can be constructed
sequentially and embedded into a DA
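A minimal sketch of the (key, version, value) idea: every write is stored once, tagged with its version, and a read walks up the version tree until it finds an entry. The names (`VersionedStore`, `snapshot`, `get`) are hypothetical, the slide's "special operations" are not modelled, and the sketch assumes a version is no longer written to once it has been snapshotted.

```python
import bisect


class VersionedStore:
    def __init__(self):
        self.parent = {0: None}      # version tree: version -> parent
        self.entries = []            # sorted (key, version, value) tuples

    def snapshot(self, base):
        """Create a new version whose parent is `base`."""
        new = len(self.parent)
        self.parent[new] = base
        return new

    def insert(self, key, version, value):
        i = bisect.bisect_left(self.entries, (key, version))
        if i < len(self.entries) and self.entries[i][:2] == (key, version):
            self.entries[i] = (key, version, value)    # overwrite in place
        else:
            self.entries.insert(i, (key, version, value))

    def get(self, key, version):
        v = version
        while v is not None:         # nearest ancestor with an entry wins
            i = bisect.bisect_left(self.entries, (key, v))
            if i < len(self.entries) and self.entries[i][:2] == (key, v):
                return self.entries[i][2]
            v = self.parent[v]
        return None


if __name__ == "__main__":
    s = VersionedStore()
    s.insert("k", 0, "old")
    v1 = s.snapshot(0)
    v2 = s.snapshot(0)
    s.insert("k", v1, "new")
    print(s.get("k", v1), s.get("k", v2), len(s.entries))
```

Because the tuples are kept in sorted order, a batch of them can be written out as one sorted array, which is what makes it possible to embed the structure in a doubling array.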
37.
                                 Update                         Range Query              Space
CoW B-Tree                       O(log_B Nv) random IOs         O(Z/B) random IOs        O(N B log_B Nv)
“BigTable”-style DA (LevelDB)    O((log N)/B) sequential IOs    O(Z/B) sequential IOs    O(VN)
“Mod-list” in a DA (CASTLE)      O((log N)/B) sequential IOs    O(Z/B) sequential IOs    O(N)
Nv = #keys live (accessible) at version v
45. • Castle: like BDB, but for Big Data
• 2 orders of magnitude better
performance and predictability
• Part of the Acunu Data Platform
46. Questions?
Tom Wilkie
@tom_wilkie
tom@acunu.com
http://bitbucket.org/acunu
http://github.com/acunu
http://www.acunu.com/download
http://www.acunu.com/insights
47. References
[LSM] The Log-Structured Merge-Tree (LSM-Tree), Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil
http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf
[COLA] Cache-Oblivious Streaming B-trees, Michael A. Bender et al
http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf
[DSST] Making Data Structures Persistent, J. R. Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan, Journal of Computer and System Sciences, Vol. 38, No. 1, 1989
http://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf
Stratified B-trees and versioned dictionaries, Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes, Tom Wilkie, HotStorage’11
http://www.usenix.org/event/hotstorage11/tech/final_files/Twigg.pdf
[RDA] Random duplicate storage strategies for load balancing in multimedia servers, Joep Aerts, Jan Korst, Sebastian Egner, 2000
http://www.win.tue.nl/~joep/IPL.ps
Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.