1. CASSANDRA DOES WHAT?
CODE MANIA 2012
Aaron Morton, Apache Cassandra Committer
@aaronmorton
www.thelastpickle.com
Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
13. Today.
Cluster
Data Model
Node
15. Store ‘foo’.
Node 1 - 'foo'
Node 4 - 'foo' Node 2 - 'foo'
Node 3 - 'foo'
19. Store ‘foo’ with Replication Factor 3.
Node 1 - 'foo'
Node 4 Node 2 - 'foo'
Node 3 - 'foo'
24. Partitioner...
RandomPartitioner
transforms Keys to Tokens
using MD5.
(Default Partitioner, there are others.)
26. 128 Bit Unsigned Integer Token.
170,141,183,460,469,231,731,687,303,715,884,105,728
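A minimal sketch of how a key could be turned into a token. It assumes the partitioner reads the MD5 digest as a signed 128-bit integer and takes its absolute value, which would explain why the maximum shown above is 2^127. The name `key_to_token` is hypothetical and not part of any Cassandra API.

```python
import hashlib

MAX_TOKEN = 2 ** 127  # 170,141,183,460,469,231,731,687,303,715,884,105,728


def key_to_token(key: bytes) -> int:
    """Hash a row key with MD5 and fold the result into 0..2**127."""
    digest = hashlib.md5(key).digest()
    # Assumption: the digest is read as a signed big-endian integer,
    # then made non-negative with abs().
    return abs(int.from_bytes(digest, byteorder="big", signed=True))


token = key_to_token(b"foo")
print(0 <= token <= MAX_TOKEN)
```

The same key always hashes to the same token, so any node can work out where a row lives without a lookup table.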
29. Token Ranges.
Node 1
token: 0
76-0 1-25
Node 4 Node 2
token: 75 token: 25
Node 3
token: 50
30. Token Ranges.
Node  Token  Range From  Range To
1     0      76          0
2     25     1           25
3     50     26          50
4     75     51          75
31. Locate Token Range.
Node 1
token: 0
'foo'
token: 90
Node 4 Node 2
token: 75 token: 25
Node 3
token: 50
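Locating the token range can be sketched as a search over the sorted node tokens. This uses the simplified ring from the slides (tokens 0, 25, 50, 75) rather than real 127-bit tokens, and `locate_node` is a hypothetical helper name.

```python
from bisect import bisect_left

# Node tokens from the slides, sorted by token.
RING = [(0, "Node 1"), (25, "Node 2"), (50, "Node 3"), (75, "Node 4")]


def locate_node(token: int) -> str:
    """Return the node whose token range contains the given token."""
    tokens = [node_token for node_token, _ in RING]
    # The owner is the first node whose token is >= the key's token.
    # Past the last node, wrap around to the start of the ring.
    index = bisect_left(tokens, token) % len(RING)
    return RING[index][1]


print(locate_node(90))
```

With 'foo' at token 90 the search runs off the end of the list and wraps, so the row lands in Node 1's range (76 to 0).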
34. SimpleStrategy with RF 3.
Node 1
token: 0
'foo'
token: 90
Node 4 Node 2
token: 75 token: 25
Node 3
token: 50
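A rough sketch of SimpleStrategy placement: find the node that owns the token, then keep walking clockwise until RF replicas are collected. The ring is the simplified one from the slides and `simple_strategy_replicas` is a hypothetical name.

```python
from bisect import bisect_left

# Node tokens from the slides, sorted by token.
RING = [(0, "Node 1"), (25, "Node 2"), (50, "Node 3"), (75, "Node 4")]


def simple_strategy_replicas(token: int, replication_factor: int) -> list:
    """Return the owner of the token plus the next nodes clockwise."""
    tokens = [node_token for node_token, _ in RING]
    start = bisect_left(tokens, token) % len(RING)
    # Never return more replicas than there are nodes.
    count = min(replication_factor, len(RING))
    return [RING[(start + step) % len(RING)][1] for step in range(count)]


print(simple_strategy_replicas(90, 3))
```

For 'foo' at token 90 with RF 3 this selects Node 1, Node 2 and Node 3, leaving Node 4 without a copy.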
37. Multi DC Replication with RF 3 and RF 2.
Node 1 Node 1
token: 0 token: 1
'foo'
token: 90
Node 4 West DC Node 2 Node 4 East DC Node 2
token: 75 token: 25 token: 76 token: 26
Node 3 Node 3
token: 50 token: 51
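Multi DC placement can be approximated by repeating the clockwise walk once per data centre, each with its own replication factor. This is a simplified sketch that ignores racks; the `DATACENTERS` layout mirrors the diagram and `multi_dc_replicas` is a hypothetical name.

```python
from bisect import bisect_left

# Tokens per data centre, as drawn in the diagram.
DATACENTERS = {
    "West": [(0, "Node 1"), (25, "Node 2"), (50, "Node 3"), (75, "Node 4")],
    "East": [(1, "Node 1"), (26, "Node 2"), (51, "Node 3"), (76, "Node 4")],
}


def multi_dc_replicas(token: int, rf_per_dc: dict) -> dict:
    """Pick replicas in each data centre by walking that DC's nodes clockwise."""
    replicas = {}
    for dc, replication_factor in rf_per_dc.items():
        ring = DATACENTERS[dc]
        tokens = [node_token for node_token, _ in ring]
        start = bisect_left(tokens, token) % len(ring)
        count = min(replication_factor, len(ring))
        replicas[dc] = [ring[(start + step) % len(ring)][1] for step in range(count)]
    return replicas


print(multi_dc_replicas(90, {"West": 3, "East": 2}))
```

For token 90 this gives three replicas in the West DC (Node 1, 2, 3) and two in the East DC (Node 1, 2).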
40. PropertyFileSnitch.
Places nodes in multiple
DCs and racks using
configuration.
(There are others.)
41. EC2Snitch.
Places nodes in a DC using
the AWS Region and a rack
using Availability Zone.
(There are others.)
45. The Client and the Coordinator.
Node 1
token: 0
'foo'
token: 90
Node 4 Node 2
token: 75 token: 25
Node 3
Client
token: 50
51. Node Down.
Node 1
token: 0
'foo'
token: 90
Node 4 Node 2
token: 75 token: 25
Node 3
Client
token: 50
59. Node Down with Hinted Handoff.
Node 1
'foo'
'foo'
token: 90
Node 4 Node 2
'foo' for #3 'foo'
Node 3
Client
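Hinted handoff can be pictured with a toy coordinator that stores a hint when a replica is down and replays it when the replica returns. This is only an illustration: the `Coordinator` class is hypothetical, and real hints also carry timestamps and expire, which is not modelled here.

```python
class Coordinator:
    """Toy coordinator that keeps hints for replicas that are down."""

    def __init__(self, replica_names):
        self.replicas = {name: {} for name in replica_names}  # name -> stored rows
        self.down = set()   # names of replicas currently unreachable
        self.hints = []     # (target replica, key, value) waiting for delivery

    def write(self, key, value):
        """Write to every live replica; store a hint for each down replica."""
        acks = 0
        for name, store in self.replicas.items():
            if name in self.down:
                self.hints.append((name, key, value))
            else:
                store[key] = value
                acks += 1
        return acks

    def node_up(self, name):
        """Mark a replica as live again and replay the hints kept for it."""
        self.down.discard(name)
        remaining = []
        for target, key, value in self.hints:
            if target == name:
                self.replicas[target][key] = value
            else:
                remaining.append((target, key, value))
        self.hints = remaining


coordinator = Coordinator(["Node 1", "Node 2", "Node 3"])
coordinator.down.add("Node 3")
acks = coordinator.write("foo", "bar")   # two live replicas acknowledge
coordinator.node_up("Node 3")            # the hint for Node 3 is delivered
```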
61. Read ‘foo’.
Node 1
token: 0
'foo'
token: 90
Node 4 Node 2
token: 75 token: 25
Node 3
Client
token: 50
63. Read ‘foo’ at QUORUM.
Node 1
'foo'
'foo'
token: 90
Node 4 Node 2
'foo'
Node 3
Client
67. Differences in the ‘foo’ row.
Column      Node 1                    Node 2                    Node 3
purple      cromulent (timestamp 10)  cromulent (timestamp 10)  <missing>
monkey      embiggens (timestamp 10)  embiggens (timestamp 10)  debigulator (timestamp 5)
dishwasher  tomato (timestamp 10)     tomato (timestamp 10)     tomacco (timestamp 15)
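The differences above can be resolved column by column, keeping the value with the highest timestamp. The sketch below uses the values from the table; `reconcile` and `stale_replicas` are hypothetical helper names, not Cassandra functions.

```python
# Responses for the 'foo' row: column name -> (value, timestamp).
RESPONSES = {
    "Node 1": {"purple": ("cromulent", 10), "monkey": ("embiggens", 10),
               "dishwasher": ("tomato", 10)},
    "Node 2": {"purple": ("cromulent", 10), "monkey": ("embiggens", 10),
               "dishwasher": ("tomato", 10)},
    "Node 3": {"monkey": ("debigulator", 5), "dishwasher": ("tomacco", 15)},
}


def reconcile(responses: dict) -> dict:
    """Merge replica responses, keeping the newest value for each column."""
    merged = {}
    for columns in responses.values():
        for name, (value, timestamp) in columns.items():
            if name not in merged or timestamp > merged[name][1]:
                merged[name] = (value, timestamp)
    return merged


def stale_replicas(responses: dict, merged: dict) -> list:
    """List replicas whose response differs from the merged row."""
    return sorted(node for node, columns in responses.items() if columns != merged)


merged_row = reconcile(RESPONSES)
print(merged_row)
print(stale_replicas(RESPONSES, merged_row))
```

In this example every replica is out of date in some column, so each one would need the merged row written back to it.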
68. Consistent Read.
Node 1 Node 1
cromulent
cromulent
Node 4 Node 2 Node 4 Node 2
<empty> cromulent
cromulent
Client Client
Node 3 Node 3
70. QUORUM with and without Read Repair.
Node 1 Node 1
Node 4 Node 2 Node 4 Node 2
Node 3 Node 3
Client Client
71. I can haz Consistency?
R + W > N
(#Read Nodes + #Write Nodes > Replication Factor)
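The rule can be checked with a few lines of arithmetic. The function names below are hypothetical, and QUORUM is assumed to mean a majority of the replicas (RF // 2 + 1).

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1


def is_consistent(read_nodes: int, write_nodes: int, replication_factor: int) -> bool:
    """True when every read overlaps at least one replica that saw the write."""
    return read_nodes + write_nodes > replication_factor


rf = 3
print(is_consistent(quorum(rf), quorum(rf), rf))  # QUORUM reads and QUORUM writes
print(is_consistent(1, 1, rf))                    # ONE read and ONE write
print(is_consistent(1, rf, rf))                   # ONE read and ALL write
```

With RF 3, QUORUM is 2, so QUORUM reads plus QUORUM writes give 2 + 2 > 3, while reading and writing at ONE gives 1 + 1, which is not greater than 3.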
75. Today.
Cluster
Data Model
Node
76. Data Model so far.
Row Key: Column Column Column
(Incomplete.)
77. Data Model.
Keyspace
Column Family Column Family Column Family
Column Column Column
Row Key: Column Column Column
Column Column Column
(Excludes Super Columns.)
81. API...
Mutate
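# pycassa - Python
>>> import pycassa
>>> pool = pycassa.ConnectionPool('Keyspace1')  # 'Keyspace1' is a placeholder keyspace name
>>> col_fam = pycassa.ColumnFamily(pool, 'ColumnFamily1')
>>> col_fam.insert('row_key', {'col_name': 'col_val'})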
82. API...
Mutate
# Cassandra Query Language (CQL)
INSERT INTO ColumnFamily1 (KEY, col_name)
VALUES ('row_key', 'col_value');
83. API...
Delete
# pycassa - Python
>>> col_fam.remove('row_key')
>>> col_fam.remove('row_key', ['col_name'])
84. API...
Delete
# Cassandra Query Language (CQL)
DELETE FROM ColumnFamily1 WHERE key IN
('row_key');
DELETE col_name FROM ColumnFamily1 WHERE
key = 'row_key';
86. API...
Get, Multi-Get
# pycassa - Python
>>> col_fam.get('row_key')
{'col_name': 'col_val', 'col_name2': 'col_val2'}
>>> col_fam.multi_get(['row_key'], ['col_name'])
{'row_key': {'col_name': 'col_val'}}
87. API...
Get, Multi-Get
# Cassandra Query Language (CQL)
SELECT * FROM ColumnFamily1;
SELECT col_name FROM ColumnFamily1 WHERE
KEY IN ('row_key');
88. API...
Get Range*
# pycassa - Python
>>> col_fam.get_range(start='row_key')
{
'row_key' : {'col_name': 'col_val'},
'row_key50': {'col_name': 'col_val'},
'row_key2': {'col_name': 'col_val'}
}
89. API...
Get Range*
# Cassandra Query Language (CQL)
SELECT * FROM ColumnFamily1 WHERE KEY >=
'row_key';
91. Today.
Cluster
Data Model
Node
93. Write path...
Append to Write
Ahead Log.
(fsync every 10s by default, other options available)
94. Write path...
Merge Columns
into Memtable.
(Lock free, always in memory.)
97. (Later.)
Asynchronously flush
Memtable to new files.
(May be 10s or 100s of MB in size.)
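The write path steps can be sketched as a toy in-memory node. This is a simplification under stated assumptions: the `Node` class is hypothetical, the flush is synchronous and triggered by a row count rather than asynchronous and based on memory use, and commit log recycling is not modelled.

```python
class Node:
    """Toy node: commit log append, memtable merge, flush to an SSTable."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # row key -> {column: (value, timestamp)}
        self.sstables = []     # flushed, immutable snapshots
        self.flush_threshold = flush_threshold  # rows held before flushing

    def write(self, row_key, column, value, timestamp):
        # 1. Append to the write ahead log.
        self.commit_log.append((row_key, column, value, timestamp))
        # 2. Merge the column into the memtable, keeping the newest timestamp.
        row = self.memtable.setdefault(row_key, {})
        current = row.get(column)
        if current is None or timestamp > current[1]:
            row[column] = (value, timestamp)
        # 3. Later: flush once the memtable is "full".
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memtable out as a new SSTable sorted by row key."""
        if not self.memtable:
            return
        self.sstables.append({key: self.memtable[key] for key in sorted(self.memtable)})
        self.memtable = {}


node = Node()
node.write("foo", "purple", "cromulent", 10)
node.write("foo", "purple", "older value", 5)   # ignored, timestamp is older
node.write("bar", "monkey", "embiggens", 10)    # second row triggers a flush
```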
99. SSTable files.
*-Data.db
*-Index.db
*-Filter.db
(Also *-Statistics.db and *-Digest.sha1)
100. SSTables.
SSTable 1            SSTable 2         SSTable 3  SSTable 4            SSTable 5
foo:                 foo:                         foo:
dishwasher (ts 10):  frink (ts 20):               dishwasher (ts 15):
tomato               flayven                      tomacco
purple (ts 10):      monkey (ts 10):
cromulent            embiggens
101. Read Path...
Read columns from each
SSTable, then merge results.
(Roughly speaking.)
102. Read Path...
Use Bloom Filter to
determine if a row key does
not exist in a SSTable.
(In memory)
103. Bloom Filter says if a key is
definitely not present, or
present with a certain
probability.
(Default false positive rate is 0.0744%)
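A minimal Bloom filter sketch showing the "definitely not present, or probably present" behaviour. The class, its sizes and the use of MD5 as the hash are illustrative assumptions and not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by the bits of a Python integer."""

    def __init__(self, size_bits=1024, hash_count=3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = 0

    def _positions(self, key: bytes):
        # Derive several bit positions by salting the key with a counter.
        for salt in range(self.hash_count):
            digest = hashlib.md5(bytes([salt]) + key).digest()
            yield int.from_bytes(digest, "big") % self.size_bits

    def add(self, key: bytes):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key: bytes) -> bool:
        # False means the key was never added; True may be a false positive.
        return all((self.bits >> position) & 1 for position in self._positions(key))


bloom = BloomFilter()
bloom.add(b"foo")
print(bloom.might_contain(b"foo"))
```

A key that was added always returns True. A False answer lets the read skip that SSTable without touching the disk.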
104. Read Path...
Search for prior key in
*-Index.db sample.
(In memory)
105. Read Path...
Scan *-Index.db from prior
key to find the search key and
its *-Data.db offset.
(On disk.)
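The two index steps can be sketched with a sorted index and a sparse sample of it. All keys, offsets and the sampling interval below are made up for illustration, and plain string order stands in for the on-disk ordering.

```python
from bisect import bisect_right

# Stand-in for *-Index.db: sorted (row key, offset into *-Data.db).
INDEX = [("a", 0), ("c", 120), ("foo", 340), ("k", 560), ("m", 700), ("z", 910)]

# Stand-in for the in-memory sample: every 2nd entry as (key, position in INDEX).
SAMPLE_EVERY = 2
SAMPLE = [(key, position) for position, (key, _) in enumerate(INDEX)
          if position % SAMPLE_EVERY == 0]


def find_data_offset(search_key: str):
    """Return the data file offset for search_key, or None if it is absent."""
    sample_keys = [key for key, _ in SAMPLE]
    # Step 1 (in memory): find the last sampled key <= search_key.
    sample_index = bisect_right(sample_keys, search_key) - 1
    if sample_index < 0:
        return None  # sorts before the first key in this SSTable
    start = SAMPLE[sample_index][1]
    # Step 2 (on disk): scan the index forward from that position.
    for key, offset in INDEX[start:]:
        if key == search_key:
            return offset
        if key > search_key:
            return None  # passed the spot where the key would be
    return None


print(find_data_offset("k"))
```

Sampling trades memory for a short scan: a sparser sample uses less memory but reads more index entries per lookup.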
107. Read purple, monkey, dishwasher.
Bloom Filter Bloom Filter Bloom Filter Bloom Filter Bloom Filter
Memory Index Sample Index Sample Index Sample Index Sample Index Sample
Disk
SSTable 1-Index.db SSTable 2-Index.db SSTable 3-Index.db SSTable 4-Index.db SSTable 5-Index.db
SSTable 1-Data.db SSTable 2-Data.db SSTable 3-Data.db SSTable 4-Data.db SSTable 5-Data.db
foo: foo: foo:
dishwasher (ts 10): frink (ts 20): dishwasher (ts 15):
tomato flayven tomacco
purple (ts 10): monkey (ts 10):
cromulent embiggens
108. Merge SSTables.
Column      SSTable 1                 SSTable 2                 SSTable 4
purple      cromulent (timestamp 10)
monkey                                embiggens (timestamp 10)
dishwasher  tomato (timestamp 10)                               tomacco (timestamp 15)
109. Key Cache caches row key
position in *-Data.db file.
(Removes up to 1 disk seek per SSTable.)
110. Read with Key Cache.
Bloom Filter Bloom Filter Bloom Filter Bloom Filter Bloom Filter
Key Cache Key Cache Key Cache Key Cache Key Cache
Memory Index Sample Index Sample Index Sample Index Sample Index Sample
Disk
SSTable 1-Index.db SSTable 2-Index.db SSTable 3-Index.db SSTable 4-Index.db SSTable 5-Index.db
SSTable 1-Data.db SSTable 2-Data.db SSTable 3-Data.db SSTable 4-Data.db SSTable 5-Data.db
foo: foo: foo:
dishwasher (ts 10): frink (ts 20): dishwasher (ts 15):
tomato flayven tomacco
purple (ts 10): monkey (ts 10):
cromulent embiggens
112. Read with Row Cache.
Row Cache
Bloom Filter Bloom Filter Bloom Filter Bloom Filter Bloom Filter
Key Cache Key Cache Key Cache Key Cache Key Cache
Memory Index Sample Index Sample Index Sample Index Sample Index Sample
Disk
SSTable 1-Index.db SSTable 2-Index.db SSTable 3-Index.db SSTable 4-Index.db SSTable 5-Index.db
SSTable 1-Data.db SSTable 2-Data.db SSTable 3-Data.db SSTable 4-Data.db SSTable 5-Data.db
foo: foo: foo:
dishwasher (ts 10): frink (ts 20): dishwasher (ts 15):
tomato flayven tomacco
purple (ts 10): monkey (ts 10):
cromulent embiggens
115. Merge SSTables with Tombstones.
Column SSTable 1 SSTable 2 SSTable 4
cromulent <tombstone>
purple
(timestamp 10) (timestamp 15)
embiggens
monkey
(timestamp 10)
tomato tomacco
dishwasher
(timestamp 10) (timestamp 15)
116. Merge node response with Tombstones.
Column      Node 1                    Node 2                    Node 3
purple      cromulent (timestamp 10)  cromulent (timestamp 10)  <tombstone> (timestamp 15)
monkey      embiggens (timestamp 10)  embiggens (timestamp 10)  debigulator (timestamp 5)
dishwasher  tomato (timestamp 10)     tomato (timestamp 10)     tomacco (timestamp 15)
117. Compaction merges truth from
multiple SSTables into one
SSTable with the same truth.
(Manual and continuous background process.)
118. Compaction.
Column SSTable 1 SSTable 2 SSTable 4 New
cromulent <tombstone> <tombstone>
purple
(timestamp 10) (timestamp 15) (timestamp 15)
embiggens embiggens
monkey
(timestamp 10) (timestamp 10)
tomato tomacco tomacco
dishwasher
(timestamp 10) (timestamp 15) (timestamp 15)
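Compaction of the 'foo' row can be sketched as a timestamp merge that keeps tombstones in the output. The helper names are hypothetical, and which input SSTable holds the tombstone is an assumption that does not change the result. Purging old tombstones (after gc_grace_seconds) is not modelled.

```python
TOMBSTONE = object()  # marker for a deleted column

# The 'foo' row in each input SSTable: column -> (value, timestamp).
SSTABLES = [
    {"purple": ("cromulent", 10), "dishwasher": ("tomato", 10)},
    {"monkey": ("embiggens", 10)},
    # Assumption: the tombstone is placed here only for illustration.
    {"purple": (TOMBSTONE, 15), "dishwasher": ("tomacco", 15)},
]


def compact(sstables: list) -> dict:
    """Merge rows from several SSTables, keeping the newest cell per column."""
    merged = {}
    for sstable in sstables:
        for column, (value, timestamp) in sstable.items():
            if column not in merged or timestamp > merged[column][1]:
                merged[column] = (value, timestamp)
    return merged


def visible_columns(row: dict) -> dict:
    """What a read returns: live columns only, tombstones hidden."""
    return {column: value for column, (value, _) in row.items()
            if value is not TOMBSTONE}


new_sstable = compact(SSTABLES)
print(visible_columns(new_sstable))
```

The new SSTable still carries the tombstone for purple, so a read of it returns only monkey and dishwasher, the same answer as reading the three inputs together.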
119. Today.
Cluster
Data Model
Node
120. Papers.
• Cassandra - A Decentralized Structured Storage System (Lakshman et al.).
• Bigtable: A Distributed Storage System for Structured Data (Chang et al.).
• Dynamo: Amazon’s Highly Available Key-value Store (DeCandia et al.).
• Eventually Consistent (Werner Vogels).
• Epidemic algorithms for replicated database maintenance (Demers et al.).
• Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services (Gilbert et al.).
• Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web (Karger et al.).
• The φ Accrual Failure Detector (Hayashibara et al.).