Cassandra does what? Code Mania 2012

CASSANDRA DOES WHAT?
CODE MANIA 2012
Aaron Morton, Apache Cassandra Committer
@aaronmorton
www.thelastpickle.com

Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License

Cassandra is...

Scalable

Cassandra is...

Distributed

Cassandra is...

Highly Available

Cassandra uses...

Column Families

Cassandra is...

Fast

Cassandra is...

Fun
(Really.)

Why Cassandra?

Scale

Why Cassandra?

Operations

Why Cassandra?

Data Model

Today.

Cluster
Data Model
Node

Cluster.

Store the ‘foo’ row.

Store ‘foo’.
(Ring diagram: Node 1, Node 2, Node 3 and Node 4 each store 'foo'.)

Cluster Capacity?

Limited.

Replication Factor (RF) specifies the
number of row replicas.

Everything is a copy.

No Master Slave
Replication.

Store ‘foo’ with Replication Factor 3.
(Ring diagram: Node 1, Node 2 and Node 3 store 'foo'; Node 4 does not.)

Cluster Capacity?
(Node Capacity × Number of Nodes) / Replication Factor
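
The capacity formula can be read as a one-line calculation. A minimal sketch, assuming every node has the same capacity; the function name and the example figures are illustrative and not from the talk.

```python
def cluster_capacity(node_capacity_gb, node_count, replication_factor):
    """Usable capacity: raw capacity divided by the number of copies kept."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return node_capacity_gb * node_count / replication_factor


# Four hypothetical 300 GB nodes at RF 3 hold 400 GB of unique data.
print(cluster_capacity(300, 4, 3))
```

Adding nodes raises the numerator, which is why capacity scales with cluster size at a fixed Replication Factor.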

Scalable Capacity?

✓

Consistent Hashing...

Evenly map keys to
nodes.

Consistent Hashing...

Minimise key
movements when
nodes join or leave.

Partitioner...
RandomPartitioner
transforms Keys to Tokens
using MD5.
(Default Partitioner, there are others.)
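
A rough sketch of the key-to-token step, assuming the token is the absolute value of the MD5 digest read as a signed big-endian integer. The function name is hypothetical; treat this as an illustration of the idea rather than a reference implementation of RandomPartitioner.

```python
import hashlib


def random_partitioner_token(key: bytes) -> int:
    """MD5 the key, read the digest as a signed integer, take abs()."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, byteorder="big", signed=True))


# Similar keys land on unrelated tokens, which spreads rows around the ring.
for key in (b"foo", b"fop"):
    print(key, random_partitioner_token(key))
```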

Keys and Tokens?
(Number line of tokens from 0 to 99: key 'fop' maps to token 10, key 'foo' maps to token 90.)

128 Bit Unsigned Integer Token.

170,141,183,460,469,231,731,687,303,715,884,105,728

Token Ring.
(Ring diagram: tokens run from 0 to 99 and wrap around; 'foo' sits at token 90, 'fop' at token 10.)

Partitioning...

Assign a Token to
each node.
(initial_token)

Token Ranges.
(Ring diagram: Node 1 at token 0, Node 2 at token 25, Node 3 at token 50, Node 4 at token 75; ranges 76-0 and 1-25 are labelled.)

Token Ranges.
Node   Token   Range From   Range To
1      0       76           0
2      25      1            25
3      50      26           50
4      75      51           75

Locate Token Range.
(Ring diagram: 'foo' with token 90 sits between Node 4 at token 75 and Node 1 at token 0.)
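
Locating the range for a token is a sorted search with wrap-around. A minimal sketch using the toy 0 to 99 ring from the slides; `RING` and `owner_of` are names made up for this example.

```python
from bisect import bisect_left

# Toy ring from the slides: node name keyed by its assigned token.
RING = {0: "Node 1", 25: "Node 2", 50: "Node 3", 75: "Node 4"}


def owner_of(token, ring=RING):
    """Return the node whose range (previous token, own token] holds `token`."""
    tokens = sorted(ring)
    index = bisect_left(tokens, token)
    # Past the highest token the range wraps around to the lowest one.
    return ring[tokens[index % len(tokens)]]


print(owner_of(90))  # 'foo' falls in range 76-0, owned by Node 1
print(owner_of(10))  # 'fop' falls in range 1-25, owned by Node 2
```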

Replication Strategy selects
Replication Factor number of
nodes for a row.

SimpleStrategy selects
nodes by Token Order.
(Non default, there are others.)

SimpleStrategy with RF 3.
(Ring diagram: same four-node ring, with 'foo' at token 90.)
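
Selecting replicas by Token Order can be sketched as "find the owner, then keep walking the ring". This assumes the toy ring from the slides; `simple_strategy_replicas` is a hypothetical name, not Cassandra's API.

```python
from bisect import bisect_left

# Toy ring from the slides: node name keyed by its assigned token.
RING = {0: "Node 1", 25: "Node 2", 50: "Node 3", 75: "Node 4"}


def simple_strategy_replicas(token, replication_factor, ring=RING):
    """Pick the owning node, then the next nodes clockwise, RF nodes in total."""
    tokens = sorted(ring)
    start = bisect_left(tokens, token) % len(tokens)
    count = min(replication_factor, len(tokens))
    return [ring[tokens[(start + step) % len(tokens)]] for step in range(count)]


print(simple_strategy_replicas(90, 3))  # ['Node 1', 'Node 2', 'Node 3']
```

With RF 3 on this ring, 'foo' is stored on Nodes 1, 2 and 3, and Node 4 holds no copy.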

NetworkTopologyStrategy uses a
Replication Factor per Data
Centre.
(Default.)

NetworkTopologyStrategy...

Stripes replicas
across racks.

Multi DC Replication with RF 3 and RF 2.
(Two ring diagrams. West DC: Node 1 at token 0, Node 2 at token 25, Node 3 at token 50, Node 4 at token 75. East DC: Node 1 at token 1, Node 2 at token 26, Node 3 at token 51, Node 4 at token 76. 'foo' has token 90.)

The Snitch knows which Data
Centre and rack contains a
Node.

SimpleSnitch.
Places all nodes in the same
DC and rack.
(Default, there are others.)

PropertyFileSnitch.
Places nodes in multiple
DCs and racks using
configuration.
(There are others.)

Ec2Snitch.
Places nodes in a DC using
the AWS Region and a rack
using Availability Zone.
(There are others.)

DynamicSnitch.
Re-orders nodes according to
their observed performance.
(Wraps other snitch.)

Clients connect to
any node in the
cluster.

Coordinator handles
a request for a
client.

The Client and the Coordinator.
(Ring diagram: Nodes 1 to 4 at tokens 0, 25, 50 and 75, 'foo' at token 90, and a Client beside Node 3.)

Nodes Gossip about
other nodes.

Gossip?
Nodes share information with
a small number of neighbours.
Who share information with...

Scalable Throughput?

✓

Node Down.
(Ring diagram: the same ring and Client, with a node down.)

Client specified
Consistency Level.

Consistency Level...

Any*, One, Two,
Three,

Consistency Level...
QUORUM,
LOCAL_QUORUM,
EACH_QUORUM*

Quorum?
floor(RF / 2) + 1

QUORUM at Replication Factor...
Replication Factor   2 or 3   4 or 5   6 or 7
QUORUM               2        3        4
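
The quorum formula is small enough to check directly. A minimal sketch; the function name is illustrative.

```python
def quorum(replication_factor):
    """floor(RF / 2) + 1, using integer division for the floor."""
    return replication_factor // 2 + 1


for rf in range(2, 8):
    print(rf, quorum(rf))
```

The loop reproduces the table: RF 2 or 3 needs 2 nodes, RF 4 or 5 needs 3, RF 6 or 7 needs 4.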

Node Down with Hinted Handoff.
(Ring diagram: Node 1 and Node 2 store 'foo'; Node 4 holds 'foo' as a hint for Node 3.)

Cluster.

Read the ‘foo’ row.

Read ‘foo’.
(Ring diagram: Nodes 1 to 4 at tokens 0, 25, 50 and 75, 'foo' at token 90, and a Client beside Node 3.)

Consistency Level
nodes must
respond.

Read ‘foo’ at QUORUM.
(Ring diagram: two replicas of 'foo' are read for the Client.)

Consistency Level
nodes must agree.

Digests used to
detect differences.

Timestamps used to
resolve differences.

Differences in the ‘foo’ row.
Column       Node 1                     Node 2                     Node 3
purple       cromulent (timestamp 10)   cromulent (timestamp 10)   <missing>
monkey       embiggens (timestamp 10)   embiggens (timestamp 10)   debigulator (timestamp 5)
dishwasher   tomato                     tomato                     tomacco
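
Resolving differences by timestamp amounts to keeping, for each column, the value with the highest timestamp. A simplified sketch using the purple and monkey columns from the table (the dishwasher timestamps are not shown there, so that column is left out). `reconcile` is a made-up name, and on equal timestamps this sketch simply keeps the first value seen.

```python
def reconcile(*replica_rows):
    """Merge rows of {column: (value, timestamp)}; the highest timestamp wins."""
    merged = {}
    for row in replica_rows:
        for column, (value, timestamp) in row.items():
            if column not in merged or timestamp > merged[column][1]:
                merged[column] = (value, timestamp)
    return merged


node_1 = {"purple": ("cromulent", 10), "monkey": ("embiggens", 10)}
node_2 = {"purple": ("cromulent", 10), "monkey": ("embiggens", 10)}
node_3 = {"monkey": ("debigulator", 5)}  # 'purple' is missing on this replica

print(reconcile(node_1, node_2, node_3))
```

The merged row is the same whichever order the replicas are passed in, as long as no two different values share a timestamp.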

Consistent Read.
(Two ring diagrams showing replica responses of cromulent and <empty> on the way to the Client.)

Read Repair is active
on a fraction of
requests.
(10% by default)

QUORUM with and without Read Repair.
(Two ring diagrams of Nodes 1 to 4 and a Client, one with Read Repair and one without.)

I can haz Consistency?

R + W > N
(#Read Nodes + #Write Nodes > Replication Factor)
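
The inequality can be tried against a few Consistency Level pairs. A minimal sketch; `read_overlaps_write` is a hypothetical helper name.

```python
def read_overlaps_write(read_nodes, write_nodes, replication_factor):
    """True when every read set must share at least one node with every write set."""
    return read_nodes + write_nodes > replication_factor


RF = 3
QUORUM = RF // 2 + 1  # 2 nodes at RF 3

print(read_overlaps_write(QUORUM, QUORUM, RF))  # True: 2 + 2 > 3
print(read_overlaps_write(1, 1, RF))            # False: 1 + 1 is not > 3
print(read_overlaps_write(1, RF, RF))           # True: 1 + 3 > 3
```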

Anti Entropy...

Hash key ranges on
each node using
Merkle Trees.

Anti Entropy...

Stream differences
between nodes.
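
The idea behind hashing key ranges can be sketched with a tiny hash tree: equal roots mean the replicas agree, and differing leaves point at the ranges to stream. This is a toy model, assuming 0 to 99 tokens, four fixed buckets and SHA-256; all names are made up, and a real tree would walk down from the root instead of comparing every leaf.

```python
import hashlib


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hashes(rows, bucket_count=4):
    """Hash {token: value} rows into one leaf per token range (toy tokens 0-99)."""
    buckets = [[] for _ in range(bucket_count)]
    for token, value in sorted(rows.items()):
        buckets[token * bucket_count // 100].append(f"{token}={value}")
    return [_hash("|".join(bucket).encode()) for bucket in buckets]


def merkle_root(leaves):
    """Hash pairs of nodes together, level by level, until one root remains."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def differing_ranges(leaves_a, leaves_b):
    """Indexes of the token ranges whose leaf hashes do not match."""
    return [i for i, (a, b) in enumerate(zip(leaves_a, leaves_b)) if a != b]


node_a = leaf_hashes({10: "fop-v1", 90: "foo-v1"})
node_b = leaf_hashes({10: "fop-v1", 90: "foo-v2"})

print(merkle_root(node_a) == merkle_root(node_b))  # False: the replicas differ
print(differing_ranges(node_a, node_b))            # [3]: only the 75-99 range
```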

Highly Available?

✓

Today.
Cluster
Data Model
Node

Data Model so far.

Row Key: Column Column Column

(Incomplete.)

Data Model.
(Diagram: a Keyspace contains Column Families; within a Column Family, each Row Key maps to a set of Columns.)

(Excludes Super Columns.)

Rows are the unit
of replication.

The Column Family
is the unit of
storage.

Row and Column
Family are the unit
of querying.

API...
Mutate
# pycassa - Python

>>> col_fam = pycassa.ColumnFamily(pool, 'ColumnFamily1')

>>> col_fam.insert('row_key', {'col_name': 'col_val'})

API...
Mutate
# Cassandra Query Language (CQL)

INSERT INTO ColumnFamily1 (KEY, col_name)
VALUES ('row_key', 'col_value');

API...
Delete
# pycassa - Python

>>> col_fam.remove('row_key')

>>> col_fam.remove('row_key', ['col_name'])

API...
Delete

DELETE FROM ColumnFamily1 WHERE key IN
('row_key');

DELETE col_name FROM ColumnFamily1 WHERE
key = 'row_key';

Batch Mutate saves
on round trips.
(It’s not a transaction.)

API...
Get, Multi-Get
# pycassa - Python

>>> col_fam.get('row_key')
{'col_name': 'col_val', 'col_name2': 'col_val2'}

>>> col_fam.multi_get(['row_key'], ['col_name'])
{'row_key': {'col_name': 'col_val'}}

API...
Get, Multi-Get

SELECT * FROM ColumnFamily1;

SELECT col_name FROM ColumnFamily1 WHERE
KEY IN ('row_key');

API...
Get Range*
# pycassa - Python

>>> col_fam.get_range(start='row_key')
{
'row_key' : {'col_name': 'col_val'},
'row_key50': {'col_name': 'col_val'},
'row_key2': {'col_name': 'col_val'}
}

API...
Get Range*

SELECT * FROM ColumnFamily1 WHERE KEY >=
'row_key';

Column Families?

✓

Write path...
Append to Write
Ahead Log.
(fsync every 10s by default, other options available)

Write path...
Merge Columns
into Memtable.
(Lock free, always in memory.)

Fast for writes?

✓

(Later.)
Asynchronously flush
Memtable to new files.
(May be 10s or 100s of MB in size.)

Data is stored in
immutable SSTables.
(Sorted String Table.)
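
The write path so far (log, Memtable, flush to an immutable SSTable) can be modelled in a few lines. This is a toy sketch: `ToyNode` and its row-count flush threshold are assumptions for illustration, and the flush here runs inline, whereas the slides describe it as asynchronous.

```python
class ToyNode:
    """Toy write path: append to a log, merge into a memtable, flush to SSTables."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []  # stand-in for the on-disk Write Ahead Log
        self.memtable = {}    # row key -> {column: (value, timestamp)}
        self.sstables = []    # each flush appends one immutable, sorted table
        self.flush_threshold = flush_threshold  # rows held before flushing

    def write(self, row_key, column, value, timestamp):
        self.commit_log.append((row_key, column, value, timestamp))
        row = self.memtable.setdefault(row_key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)  # newer write replaces older
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Rows are written in sorted key order into a tuple that is never edited.
        sstable = tuple((key, dict(self.memtable[key])) for key in sorted(self.memtable))
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log = []  # flushed writes no longer need replaying


node = ToyNode(flush_threshold=2)
node.write("foo", "purple", "cromulent", 10)
node.write("foo", "monkey", "embiggens", 10)
node.write("bar", "purple", "cromulent", 11)  # second row triggers a flush
print(len(node.sstables), node.memtable)
```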

SSTable files.
*-Data.db
*-Index.db
*-Filter.db
(Also *-Statistics.db and *-Digest.sha1)

SSTables.
SSTable 1   foo: dishwasher (ts 10): tomato; purple (ts 10): cromulent
SSTable 2   foo: frink (ts 20): flayven; monkey (ts 10): embiggens
SSTable 3   (no 'foo' row)
SSTable 4   foo: dishwasher (ts 15): tomacco
SSTable 5   (no 'foo' row)

Read Path...
Read columns from each
SSTable, then merge results.
(Roughly speaking.)

Read Path...
Use Bloom Filter to
determine if a row key does
not exist in a SSTable.
(In memory)

Bloom Filter says if a key is
definitely not present, or
present with a certain
probability.
(Default false positive rate is 0.0744%)
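
A minimal Bloom filter sketch shows why the answer is either "definitely not present" or "probably present". The sizes, the MD5-with-seed hashing and the class itself are assumptions chosen for readability, not how Cassandra builds its *-Filter.db files.

```python
import hashlib


class BloomFilter:
    def __init__(self, bit_count=1024, hash_count=3):
        self.bit_count = bit_count
        self.hash_count = hash_count
        self.bits = bytearray(bit_count)  # one byte per bit, for readability

    def _positions(self, key: bytes):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.hash_count):
            digest = hashlib.md5(bytes([seed]) + key).digest()
            yield int.from_bytes(digest, "big") % self.bit_count

    def add(self, key: bytes):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: bytes) -> bool:
        # Any unset bit proves the key was never added; all set means "maybe".
        return all(self.bits[position] for position in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add(b"foo")
print(sstable_filter.might_contain(b"foo"))  # True: added keys are never missed
```

A "maybe" can be a false positive, which costs an unnecessary disk read; a "no" lets the read skip the SSTable entirely.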

Read Path...
Search for prior key in
*-Index.db sample.
(In memory)

Read Path...
Scan *-Index.db from prior
key to find the search key and
its *-Data.db offset.
(On disk.)
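
The two index steps (search the in-memory sample, then scan forward on disk) can be sketched with a sorted list. The index contents, offsets and sampling interval below are invented for the example, and rows are ordered by plain key for simplicity.

```python
from bisect import bisect_right

# Hypothetical *-Index.db contents: (row key, offset into *-Data.db), sorted by key.
INDEX = [("bar", 0), ("baz", 120), ("foo", 310), ("fop", 480), ("qux", 650)]
SAMPLE_EVERY = 2
# In-memory sample: every second index entry, with its position in the index.
SAMPLE = [(key, pos) for pos, (key, _) in enumerate(INDEX) if pos % SAMPLE_EVERY == 0]


def data_offset(search_key):
    """Find the prior sampled key, then scan the index forward from there."""
    sample_keys = [key for key, _ in SAMPLE]
    slot = bisect_right(sample_keys, search_key) - 1
    if slot < 0:
        return None  # search key sorts before the first indexed row
    start = SAMPLE[slot][1]
    for key, offset in INDEX[start:]:
        if key == search_key:
            return offset
        if key > search_key:
            break  # passed the spot where the key would be
    return None


print(data_offset("fop"))  # 480: sample hit on 'foo', then one step forward
print(data_offset("bat"))  # None: not in this SSTable
```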

Read Path...
Read *-Data.db from offset, all
columns or specific pages.
(Default 64KB page size.)

Read purple, monkey, dishwasher.
(Diagram: for each of SSTables 1 to 5, a Bloom Filter and an Index Sample in memory, and the *-Index.db and *-Data.db files on disk; three of the Data files contain a 'foo' row.)

Merge SSTables.
Column       SSTable 1                  SSTable 2                  SSTable 4
purple       cromulent (timestamp 10)
monkey                                  embiggens (timestamp 10)
dishwasher   tomato                                                tomacco

Key Cache caches row key
position in *-Data.db file.
(Removes up to 1 disk seek per SSTable.)

Read with Key Cache.
(Diagram: in memory, a Key Cache for each of the five SSTables; on disk, the Data files, three of which contain a 'foo' row.)

Row Cache caches entire row.
(Removes all disk IO.)

Read with Row Cache.
(Diagram: in memory, a Row Cache above a Key Cache for each of the five SSTables; on disk, the Data files, three of which contain a 'foo' row.)

Fast for reads?

✓

Tombstones ensure all replicas
see a delete.
(Purged after 10 days, configurable.)

Merge SSTables with Tombstones.
(Table: values found for each column across SSTable 1, SSTable 2 and SSTable 4.)
purple: cromulent, <tombstone>
monkey: embiggens (timestamp 10)
dishwasher: tomato, tomacco

Merge node response with Tombstones.
Column       Node 1      Node 2      Node 3
purple       cromulent   cromulent   <tombstone>
monkey       embiggens   embiggens   debigulator
dishwasher   tomato      tomato      tomacco

Compaction merges truth from
multiple SSTables into one
SSTable with the same truth.
(Manual and continuous background process.)
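
Compaction "with the same truth" can be sketched as a timestamp merge in which a tombstone is just another value that can win. The dishwasher timestamps come from the SSTables slide; the tombstone's timestamp and the SSTable holding it are assumptions, and all names are invented for this example.

```python
TOMBSTONE = object()  # marker standing in for a deleted column


def compact(sstables):
    """Merge rows of {column: (value, timestamp)} into one; newest timestamp wins."""
    merged = {}
    for sstable in sstables:
        for column, (value, timestamp) in sstable.items():
            if column not in merged or timestamp > merged[column][1]:
                merged[column] = (value, timestamp)
    return merged


def visible_columns(row):
    """What a read returns: tombstoned columns are hidden from the client."""
    return {column: value for column, (value, _) in row.items() if value is not TOMBSTONE}


sstable_1 = {"purple": ("cromulent", 10), "dishwasher": ("tomato", 10)}
sstable_2 = {"monkey": ("embiggens", 10), "purple": (TOMBSTONE, 20)}  # assumed placement
sstable_4 = {"dishwasher": ("tomacco", 15)}

new_sstable = compact([sstable_1, sstable_2, sstable_4])
print(visible_columns(new_sstable))  # {'dishwasher': 'tomacco', 'monkey': 'embiggens'}
```

The new SSTable still carries the tombstone for purple, so a replica that missed the delete can learn about it until the tombstone is purged.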

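Compaction.
(Table: values found for each column across SSTable 1, SSTable 2 and SSTable 4, then the value written to the New SSTable.)
purple: cromulent, <tombstone>; New: <tombstone>
monkey: embiggens; New: embiggens
dishwasher: tomato, tomacco; New: tomacco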

Papers.
• Cassandra - A Decentralized Structured Storage System (Lakshman et al.).
• Bigtable: A Distributed Storage System for Structured Data (Chang et al.).
• Dynamo: Amazon’s Highly Available Key-value Store (DeCandia et al.).
• Eventually Consistent (Werner Vogels).
• Epidemic algorithms for replicated database maintenance (Demers et al.).
• Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services (Gilbert et al.).
• Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web (Karger et al.).
• The φ Accrual Failure Detector (Hayashibara et al.).

Aaron Morton
@aaronmorton
www.thelastpickle.com

Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
