9. Data Model
● A keyspace is like a database in an RDBMS
● A column family is a table
● Each row has a unique row key, like a primary key
● Each column has a name, a value, and a timestamp
[Figure: a keyspace contains column families; each column family holds rows identified by a row key; each row holds columns]
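The nesting described above can be pictured as nested maps. This is only a minimal in-memory sketch; the names `keyspace` and `put` are hypothetical and are not part of Cassandra.

```python
import time

# Hypothetical model: keyspace -> column family -> row key -> columns.
keyspace = {"profiles": {}}  # one column family named "profiles"

def put(cf, row_key, name, value, ts=None):
    """Store a column as (value, timestamp) under the given row key."""
    ts = time.time() if ts is None else ts
    keyspace[cf].setdefault(row_key, {})[name] = (value, ts)

put("profiles", "11485603", "first_name", "tianlun", ts=1)
put("profiles", "11485603", "age", 23, ts=2)
row = keyspace["profiles"]["11485603"]
```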
10. Agenda
● Data model
● Cluster membership
● Partition
● Replication
● Client request
● CQL
● Secondary index
11. Cassandra is a Distributed Hash Table using
consistent hashing
12. First, we have an empty token ring with 2^64 positions, from -2^63 to 2^63 - 1
A token represents a position on the ring
13. We add two nodes (B and D) and their tokens determine their positions on the ring
● Nodes mean machines here
● Tokens could be assigned manually or generated randomly
[Figure: token ring from -2^63 to 2^63 - 1 with nodes B and D]
14. A node is responsible for the range between its predecessor and itself
[Figure: token ring with nodes B and D, showing B's range and D's range]
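The range rule above can be sketched with a sorted list of tokens and a binary search. The token values for B and D are hypothetical, and treating a node's own token as the inclusive end of its range is an assumption of this sketch.

```python
from bisect import bisect_left

# Hypothetical tokens for the two nodes; real tokens are assigned or generated.
ring = sorted([(-2**62, "B"), (2**62, "D")])
tokens = [t for t, _ in ring]

def owner(token):
    """Return the node whose range (predecessor token, own token] holds token."""
    i = bisect_left(tokens, token)
    return ring[i % len(ring)][1]  # wrap around past the last token
```

A token larger than D's wraps around to B, which is what makes the token space a ring.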
15. D has a list of seed nodes that includes B, so D knows the IP address of B and can talk to B
[Figure: token ring with nodes B and D exchanging messages]
16. When D hasn't received a reply from B for a while, it suspects that B is down
[Figure: token ring with nodes B and D; D gets no reply from B]
17. Then we add more nodes (A and C)
Parts of B's and D's ranges are taken by A and C
[Figure: token ring with nodes A, B, C, D and tick marks at -2^63, -2^62, 0, 2^62, 2^63 - 1]
18. Nodes A and C have D as their seed node so that they can talk to D
[Figure: token ring with nodes A, B, C, D; A and C each exchange messages with D]
20. The way A and C learn about other nodes is called gossip
● Gossip is a peer-to-peer communication protocol for exchanging location and state information between nodes
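One way to picture gossip is as two nodes merging their views of the cluster and keeping the newer entry for each endpoint. The sketch below is a simplification under assumed names (`merge`, integer version numbers, "UP" states); it does not reproduce Cassandra's actual message format.

```python
def merge(local, remote):
    """Keep, for every endpoint, the entry with the higher version number."""
    for endpoint, (version, state) in remote.items():
        if endpoint not in local or local[endpoint][0] < version:
            local[endpoint] = (version, state)
    return local

# Hypothetical views: A only knows itself and its seed D; D already knows B.
view_a = {"A": (1, "UP"), "D": (3, "UP")}
view_d = {"D": (4, "UP"), "B": (7, "UP")}
merge(view_a, view_d)  # A learns about B from D
merge(view_d, view_a)  # D learns about A
```

After one exchange in each direction both views agree, which is how A can find out about B without having B as a seed.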
23. A row key also gets a token (a position on the ring)
[Figure: row key mapped to a token]
24. The row is stored on the node that is responsible for the range containing its token
e.g. johnny's token falls in the range of A and is hence stored there
[Figure: token ring with nodes A, B, C, D and the tokens of row keys johnny, jim, suzy, carol]
25. A partitioner assigns tokens
partitioner | function | range
Murmur3Partitioner | MurmurHash function | [-2^63, 2^63 - 1]
RandomPartitioner | MD5 hash value | [0, 2^127 - 1]
ByteOrderedPartitioner | Orders rows lexically by key bytes | Platform's default charset (e.g. 32 bit for utf8)
One cluster, one partitioner!
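As an illustration of how a hash-based partitioner turns a row key into a token, the sketch below maps an MD5 digest into the range [0, 2^127 - 1] listed for RandomPartitioner. The function name `md5_token` and the modulo step are assumptions of this sketch; the result is not guaranteed to match the tokens Cassandra itself computes.

```python
import hashlib

def md5_token(row_key: str) -> int:
    """Map a row key to a token in [0, 2^127 - 1] using an MD5 digest."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    # Assumption: fold the 128-bit digest into the range with a modulo.
    return int.from_bytes(digest, "big") % 2**127

token = md5_token("johnny")
```

The same key always yields the same token, so any node can compute where a row belongs without asking another node.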
27. Drawbacks of ByteOrderedPartitioner
✗ Sequential writes can cause hot spots
✗ More administrative overhead to load balance
the cluster
✗ Uneven load balancing for multiple column
families
29. Data could be lost when nodes fail; we need a
replication strategy
30. The first replica is determined by the partitioner and additional replicas are placed on the next nodes clockwise in the ring (SimpleStrategy)
Suppose we store 3 replicas
[Figure: token ring with nodes A, B, C, D and the token of row key johnny]
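The SimpleStrategy placement can be sketched by finding the owner of a token and then taking the following nodes clockwise. The token values assigned to A, B, C and D below are hypothetical, and `replicas` is a name made up for this sketch.

```python
from bisect import bisect_left

# Hypothetical token assignment for the four nodes.
ring = sorted([(-2**62, "A"), (0, "B"), (2**62, "C"), (2**63 - 1, "D")])
tokens = [t for t, _ in ring]

def replicas(token, rf=3):
    """Return rf nodes: the owner of the token, then the next ones clockwise."""
    i = bisect_left(tokens, token)
    return [ring[(i + k) % len(ring)][1] for k in range(rf)]
```

With these tokens, a key whose token falls in A's range is stored on A, B and C.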
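32. Cassandra can replicate data across racks and data centers
● West data center: suppose A and B are on rack1 and C and D are on rack2
● East data center: suppose E, F and G are on rack1 and H is on rack2
[Figure: two token rings, one per data center, with nodes A, B, C, D and E, F, G, H; the row key johnny is placed on both]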
33. This is called NetworkTopologyStrategy
● Use for multiple racks in a data center and
multiple data centers
● Specify how many replicas you want in each
data center
● Places replicas in the same data center by
walking down the ring clockwise until reaching
the first node in another rack
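The rack-aware walk described above can be approximated as follows for a single data center. This is a simplified sketch, not Cassandra's implementation: the function name `nts_replicas`, the ring order of the nodes, and the fallback to skipped nodes when there are fewer racks than replicas are all assumptions.

```python
def nts_replicas(dc_ring, start, replicas_wanted):
    """Pick replicas within one data center.

    dc_ring: list of (node, rack) pairs in clockwise ring order.
    start: index of the node that owns the row's token.
    """
    order = dc_ring[start:] + dc_ring[:start]
    chosen, racks_used, skipped = [], set(), []
    for node, rack in order:
        if len(chosen) == replicas_wanted:
            break
        if rack in racks_used:
            skipped.append(node)  # same rack as an earlier replica
        else:
            chosen.append(node)
            racks_used.add(rack)
    # Fewer racks than replicas: fall back to skipped nodes in ring order.
    chosen += skipped[:replicas_wanted - len(chosen)]
    return chosen

# Rack layout taken from the example; the ring order is assumed.
west = [("A", "rack1"), ("B", "rack1"), ("C", "rack2"), ("D", "rack2")]
east = [("E", "rack1"), ("F", "rack1"), ("G", "rack1"), ("H", "rack2")]
```

Calling the function once per data center, each with its own replica count, mirrors the idea of specifying how many replicas you want in each data center.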
35. A write or read request could go to any node
which serves as a coordinator
36. A write request where D serves as the coordinator and replicas are stored on A, B, C
Using the partitioner and the replication strategy, the coordinator determines which nodes receive the request
[Figure: the client sends insert 'johnny' to coordinator D, which forwards the write to A, B, and C]
37. When does a coordinator return an acknowledgement to the client?
● When the write succeeds on as many replicas as the consistency level requires
✔ Consistency is the synchronization of data on replicas in a cluster
✔ Consistency level is a client setting that defines a successful write or read by the number of replicas that acknowledge the write or respond to the read request, respectively
38. insert 'johnny' with consistency level = one
[Figure: coordinator D sends the write to the replicas; two of the writes are lost, one replica returns an ACK, and the coordinator returns an ACK to the client]
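39. insert 'johnny' with consistency level = quorum
Quorum means majority: (replicas / 2) + 1, with the division rounded down
[Figure: coordinator D sends the write to the replicas; one write is lost, the replicas that received it return ACKs, and the coordinator returns an ACK to the client]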
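The quorum formula and the acknowledgement rule from the last two examples can be written down directly. The helper names `quorum` and `write_succeeds` are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Majority of the replicas: (replicas / 2) + 1 with integer division."""
    return replication_factor // 2 + 1

def write_succeeds(acks: int, required: int) -> bool:
    """The coordinator can acknowledge once enough replicas have acknowledged."""
    return acks >= required
```

With 3 replicas, consistency level one needs a single ACK and quorum needs two, so a write with one lost replica still succeeds at quorum while a write with two lost replicas only succeeds at one.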
40. get 'johnny' with consistency level = quorum
The coordinator returns the most recent data, determined by timestamp
[Figure: the replicas hold johnny v2, johnny v1, and johnny v2]
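Picking the most recent value among the replica responses amounts to taking the maximum by timestamp. The response tuples and the name `resolve` below are hypothetical.

```python
def resolve(responses):
    """Return the value carried by the response with the highest timestamp."""
    value, _timestamp = max(responses, key=lambda r: r[1])
    return value

# Hypothetical (value, timestamp) pairs from two replicas of a quorum read.
responses = [("v1", 10), ("v2", 20)]
latest = resolve(responses)
```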
41. What if I want strong consistency?
● Write CL + Read CL > Replicas
e.g. write one, read all
write all, read one
write quorum, read quorum
[Figure: three panels, each with a client and replicas A, B, C, showing which replicas are written and which are read]
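The inequality guarantees that the set of replicas read always overlaps the set of replicas written. A small check, assuming 3 replicas and a hypothetical helper name:

```python
def is_strongly_consistent(write_cl: int, read_cl: int, replicas: int) -> bool:
    """True when every read set must overlap every write set."""
    return write_cl + read_cl > replicas

N = 3  # assumed number of replicas
combos = {
    "write one, read all": is_strongly_consistent(1, 3, N),
    "write all, read one": is_strongly_consistent(3, 1, N),
    "write quorum, read quorum": is_strongly_consistent(2, 2, N),
    "write one, read one": is_strongly_consistent(1, 1, N),
}
```

Only the last combination fails the check, because a read of one replica may miss the single replica that was written.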
45. What are they?
● Memtable
an in-memory sorted map from row key to columns
● SSTable
an immutable data file to which Cassandra writes memtables periodically
● Commit log
a redo log to which Cassandra appends data for recovery in the event of a hardware failure
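How the three structures interact on a write can be sketched with plain Python containers. The class name, the flush threshold, and clearing the commit log after a flush are assumptions of this sketch, not a description of Cassandra's internals.

```python
class ColumnFamilyStore:
    """Toy write path: commit log append, memtable update, flush to SSTable."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []  # append-only redo log
        self.memtable = {}    # row key -> {column name: value}
        self.sstables = []    # immutable, sorted snapshots
        self.flush_threshold = flush_threshold

    def write(self, row_key, name, value):
        self.commit_log.append((row_key, name, value))
        self.memtable.setdefault(row_key, {})[name] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted, immutable tuple of rows.
        snapshot = tuple(
            sorted((k, tuple(sorted(v.items()))) for k, v in self.memtable.items())
        )
        self.sstables.append(snapshot)
        self.memtable = {}
        self.commit_log.clear()  # assumption: flushed data needs no replay

store = ColumnFamilyStore(flush_threshold=2)
store.write("johnny", "age", 25)
store.write("jim", "age", 30)  # second row triggers a flush
```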
46. More updates and flush
[Figure: memtable and commit log]
They belong to the same column family
48. SSTables are immutable, so how about deletes?
● A tombstone is written to indicate a deleted column
● Columns marked with a tombstone exist for the configured gc_grace_seconds, after which compaction permanently deletes the column
49. Compaction
● In the background, Cassandra periodically
merges SSTables together into larger
SSTables
● Compaction merges row fragments, removes
expired tombstones, and rebuilds indexes.
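Merging SSTables and dropping expired tombstones can be sketched as follows. Each SSTable is modelled here as a dict, and `TOMBSTONE`, `compact` and the integer timestamps are hypothetical; index rebuilding is left out.

```python
TOMBSTONE = object()  # marker value for a deleted column

def compact(sstables, now, gc_grace_seconds):
    """Merge SSTables, keeping the newest version of every column.

    sstables: list of dicts {(row_key, column): (value, timestamp)}.
    Tombstones older than gc_grace_seconds are dropped for good.
    """
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            if key not in merged or merged[key][1] < ts:
                merged[key] = (value, ts)
    return {
        key: (value, ts)
        for key, (value, ts) in merged.items()
        if not (value is TOMBSTONE and now - ts > gc_grace_seconds)
    }

old = {("johnny", "age"): (25, 100), ("jim", "age"): (30, 100)}
new = {("johnny", "age"): (TOMBSTONE, 200)}
within_grace = compact([old, new], now=250, gc_grace_seconds=100)
after_grace = compact([old, new], now=400, gc_grace_seconds=100)
```

Within the grace period the tombstone survives compaction and shadows the old value; after it, both the tombstone and the deleted column are gone.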
51. CQL
● Cassandra Query Language (CQL) is a SQL-like language for querying Cassandra (we refer to CQL3 here)
● CQL doesn't support joins; Cassandra encourages denormalization
Joins require expensive random reads, which need to be merged across the network
52. CQL3 structure
[Figure: on the client side, cqlsh and the Java / .NET drivers; the transport is Thrift RPC or the CQL binary protocol; on the server, the query processor calls the internal write / read API, which follows a local path or a remote path]
53. CQL3 queries
CREATE TABLE profiles (
    id text PRIMARY KEY,
    first_name text,
    last_name text,
    age int
);
INSERT INTO profiles (id, first_name, last_name, age)
VALUES ('11485603', 'tianlun', 'zhang', 23);
SELECT * FROM profiles;

id | first_name | last_name | age
11485603 | tianlun | zhang | 23

Table means column family here
54. CQL3 hides internal storage from users
CQL3 view:
id | first_name | last_name | age
11485603 | tianlun | zhang | 23
Internal storage (columns are sorted by column name):
Row key | Column name: Column value
11485603 | age: 23
11485603 | first_name: tianlun
11485603 | last_name: zhang
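The mapping from a CQL3 row to the internal layout can be illustrated by splitting off the row key and sorting the remaining columns by name. `to_internal` is a hypothetical helper, and the sketch ignores any extra bookkeeping columns Cassandra may store.

```python
def to_internal(row, key_column="id"):
    """Split a CQL3-style row into (row key, columns sorted by column name)."""
    row_key = row[key_column]
    columns = sorted((name, value) for name, value in row.items() if name != key_column)
    return row_key, columns

cql_row = {"id": "11485603", "first_name": "tianlun", "last_name": "zhang", "age": 23}
row_key, columns = to_internal(cql_row)
```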
55. Compound primary key in CQL3
CREATE TABLE comments (
    article_id uuid,
    posted_at timestamp,
    author text,
    content text,
    PRIMARY KEY (article_id, posted_at)
);
● The first component (article_id) is the row key
● The remaining component (posted_at) ensures that the columns in a row are stored in ascending order on disk
56. Columns are sorted first by posted_at and then by column name
article_id | posted_at | author | content
550e8400-.. | 1970-01-17 00:08:19+0900 | yukim | blah, blah, blah
550e8400-.. | 1970-01-17 05:08:19+0900 | yukim | well, well, well
57. Since the columns of a row are sorted by time, we can efficiently get the comments on an article posted after a certain time
SELECT * FROM comments WHERE
    article_id = '550e8400-..' AND
    posted_at >= '1970-01-17 03:08:19+0900';

article_id | posted_at | author | content
550e8400-.. | 1970-01-17 05:08:19+0900 | yukim | well, well, well
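Because the entries of one row are kept sorted by posted_at, a range query can jump to its starting point with a binary search instead of scanning. In the sketch below the row is modelled as a sorted list, timestamps are written as epoch milliseconds, and `comments_since` is a hypothetical name.

```python
from bisect import bisect_left

# One row (a single article_id), kept sorted by posted_at in epoch milliseconds.
row = [
    (1350499616, "yukim", "blah, blah, blah"),  # 1970-01-17 00:08:19+0900
    (1368499616, "yukim", "well, well, well"),  # 1970-01-17 05:08:19+0900
]

def comments_since(row, posted_at):
    """Return all comments with a posted_at value >= the given timestamp."""
    i = bisect_left(row, (posted_at,))
    return row[i:]

# 1970-01-17 03:08:19+0900 expressed in epoch milliseconds.
recent = comments_since(row, 1361299616)
```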
58. How about querying on a value?
SELECT * FROM comments WHERE author = 'yukim';
Bad Request: No indexed columns present in by-columns clause with Equal operator
A secondary index enables us to query on a value
60. Secondary index
● Index on column values (should not be the primary key or part of a compound primary key)
● Cassandra implements secondary indexes as a hidden column family (invisible to the client), separate from the column family that contains the values being indexed
61. CREATE INDEX c_author ON comments (author);
Index column family:
Row key (column value) | Column name (row key + column name)
yukim | [550e8400-.., 1350499616, author]
yukim | [550e8400-.., 1368499616, author]
Base CF and index CF are flushed to disk at the same time
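The hidden index column family can be thought of as an inverted map from a column value to the places where that value occurs. The sketch below keeps a base table and an index side by side; the names `base`, `index`, `insert` and `select_by_author`, and the article identifier, are all hypothetical.

```python
from collections import defaultdict

base = {}                  # (article_id, posted_at) -> {"author": ..., "content": ...}
index = defaultdict(list)  # author value -> list of (article_id, posted_at)

def insert(article_id, posted_at, author, content):
    """Write to the base table and update the index in the same step."""
    base[(article_id, posted_at)] = {"author": author, "content": content}
    index[author].append((article_id, posted_at))

def select_by_author(author):
    """Look up matching keys in the index, then fetch the rows from the base table."""
    return [base[key] for key in index.get(author, [])]

insert("article-1", 1350499616, "yukim", "blah, blah, blah")
insert("article-1", 1368499616, "yukim", "well, well, well")
found = select_by_author("yukim")
```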
62. SELECT * FROM comments WHERE author = 'yukim';
● The index column family is stored on the same node as the base column family
● No single node holds the index information for all values of a column, so the query still needs to be sent to all nodes
63. Using multiple secondary indexes
● If 'bob' is less frequent than 'smith', Cassandra
will process users_fname = 'bob' first for
efficiency
64. DELETE FROM comments WHERE author = 'yukim';
● This is not allowed
● Deleting an indexed column won't update the index
65. Secondary index updates
● Cassandra appends data to the commit log,
updates the memtable, and updates the
secondary index
● If a read sees a stale index entry before
compaction purges it, the reader thread
invalidates it
66. Secondary index overhead
● Built on existing data in the background
automatically, without blocking reads or writes
(the CREATE clause)
● Updating indexes blocks reads or writes at row
level
(the INSERT clause)