Cassandra for the ops: the do's and the don'ts
1. Cassandra for the ops, the do's and the don'ts
DuyHai DOAN, Technical Advocate
@doanduyhai
2. Shameless self-promotion!
Duy Hai DOAN
Cassandra technical advocate
• talks, meetups, conferences
• open-source development (Achilles, …)
• OSS Cassandra point of contact ☞ duy_hai.doan@datastax.com
• production troubleshooting
3. Datastax!
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 200+ employees
• Headquarters in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
4. Agenda!
Cassandra Architecture
• cluster
• replication
• consistency
Data model 101
• last write wins
• read path
• write path
5. Agenda!
Advanced Architecture
• failure handling
• multi-datacenter
Hardware
• CPU
• memory
• storage
8. Cassandra history!
NoSQL database
• created at Facebook
• open-sourced since 2008
• current version = 2.1
• column-oriented ☞ distributed table
27. Data Streaming!
Production is priority n°1
Play with nodetool setstreamthroughput
Add multiple nodes at the same time
You can also run nodetool disableautocompaction <ks> <table> while streaming
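A hedged sketch of these commands on one node (the 200 value and the keyspace/table names are illustrative):
nodetool setstreamthroughput 200                      # throttle streaming, in megabits/s
nodetool disableautocompaction my_keyspace my_table
nodetool enableautocompaction my_keyspace my_table    # once streaming is done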
28. Failure tolerance!
Replication Factor (RF) = 3
[Ring diagram: 8 nodes n1…n8; with RF = 3 each node stores 3 token ranges, e.g. {B, A, H}, {C, B, A}, {D, C, B}]
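For illustration, the replication factor is declared per keyspace (keyspace name and single-DC strategy are illustrative):
CREATE KEYSPACE my_ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};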
29. Coordinator node!
Incoming requests (read/write)
Coordinator node handles the request
Every node can be the coordinator ☞ masterless
30. Token Aware Load Balancing!
[Ring diagram: the client computes the token and sends the request straight to a replica, skipping the extra coordinator hop]
The driver embeds the hash function and knows the ring topology.
34. Consistency!
Tunable at runtime
• ONE
• QUORUM (strict majority w.r.t. RF)
• ALL
Applies to both reads & writes
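For example, in cqlsh the consistency level can be switched per session:
CONSISTENCY QUORUM;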
35. Write consistency!
Write ONE
• write request sent to all replicas in parallel
36. Write consistency!
Write ONE
• write request sent to all replicas in parallel
• wait for ONE ack before returning to the client
[Ring diagram: the first ack comes back after 5 μs and the coordinator answers the client]
37. Write consistency!
Write ONE
• write request sent to all replicas in parallel
• wait for ONE ack before returning to the client
• other acks arrive later, asynchronously
[Ring diagram: the three acks come back after 5 μs, 10 μs and 120 μs; only the first one was awaited]
38. Write consistency!
Write QUORUM
• write request sent to all replicas in parallel
• wait for QUORUM acks before returning to the client
• other acks arrive later, asynchronously
[Ring diagram: the coordinator waits for the 5 μs and 10 μs acks (QUORUM); the 120 μs ack arrives later]
39. Read consistency!
Read ONE
• read from one node among all replicas
40. Read consistency!
Read ONE
• read from one node among all replicas
• contact the fastest node (based on latency stats)
41. Read consistency!
Read QUORUM
• read data from the fastest replica
42. Read consistency!
Read QUORUM
• read data from the fastest replica
• AND request a digest from the other replicas to reach QUORUM
43. Read consistency!
Read QUORUM
• read data from the fastest replica
• AND request a digest from the other replicas to reach QUORUM
• return the most up-to-date data to the client
44. Read consistency!
Read QUORUM
• read data from the fastest replica
• AND request a digest from the other replicas to reach QUORUM
• return the most up-to-date data to the client
• repair stale replicas if the digests mismatch
53. Last Write Wins (LWW)!
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
#partition jdoe ☞ [ age = 33 | name = John DOE ]
54. Last Write Wins (LWW)!
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
#partition jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
t1 = auto-generated write timestamp
55. Last Write Wins (LWW)!
UPDATE users SET age = 34 WHERE login = 'jdoe';
SSTable1: jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
SSTable2: jdoe ☞ [ age = 34 (t2) ]
56. Last Write Wins (LWW)!
DELETE age FROM users WHERE login = 'jdoe';
SSTable1: jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
SSTable2: jdoe ☞ [ age = 34 (t2) ]
SSTable3: jdoe ☞ [ age = ✕ tombstone (t3) ]
57. Last Write Wins (LWW)!
SELECT age FROM users WHERE login = 'jdoe';
? ? ? which copy of age is the right one?
SSTable1: jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
SSTable2: jdoe ☞ [ age = 34 (t2) ]
SSTable3: jdoe ☞ [ age = ✕ tombstone (t3) ]
58. Last Write Wins (LWW)!
SELECT age FROM users WHERE login = 'jdoe';
✕ ✕ ✓ the cell with the highest timestamp (t3) wins
SSTable1 ✕: jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
SSTable2 ✕: jdoe ☞ [ age = 34 (t2) ]
SSTable3 ✓: jdoe ☞ [ age = ✕ tombstone (t3) ]
59. Last Write Wins (LWW)!
Timestamps are heavily used
NTP synchronization of all nodes is mandatory
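For illustration, a client can also set the write timestamp explicitly instead of relying on the auto-generated one (the microsecond value below is made up):
UPDATE users USING TIMESTAMP 1429000000000000 SET age = 35 WHERE login = 'jdoe';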
60. Compaction!
SSTable1: jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
SSTable2: jdoe ☞ [ age = 34 (t2) ]
SSTable3: jdoe ☞ [ age = ✕ tombstone (t3) ]
☞ compacted into a single new SSTable:
New SSTable: jdoe ☞ [ age = ✕ tombstone (t3) | name = John DOE (t1) ]
61. Compaction strategies!
SizeTiered
• withstand heavy write load
• group SSTables of similar size (no more than ≈50% difference)
• I/O friendly
Leveled
• compacts more frequently
• for heavy update/delete scenarios
• I/O-unfriendly (SSD recommended)
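The strategy is set (or changed) per table with a CQL statement like this (table name illustrative):
ALTER TABLE users WITH compaction = {'class': 'LeveledCompactionStrategy'};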
68. Cassandra Write Path!
Commit logs on dedicated disk partition
• if possible on SSD
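In cassandra.yaml this is just a matter of pointing the commit log at its own device (paths are illustrative):
commitlog_directory: /mnt/ssd1/cassandra/commitlog
data_file_directories:
    - /mnt/data1/cassandra/data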
69. Cassandra Read Path!
SELECT … FROM … WHERE #partition = …;
[Diagram: step 1, the Row Cache (off heap), is checked before touching SSTable1 / SSTable2 / SSTable3]
70. Row cache!
Before C* 2.1
• loads the entire partition into memory
• problematic with fat partitions (2 GB)
From C* 2.1
• caches only the first rows of the partition
• ☞ rows_per_partition
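For example (table name and row count are illustrative, Cassandra 2.1 syntax; the row cache itself is sized via row_cache_size_in_mb in cassandra.yaml):
ALTER TABLE users WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};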
73. Bloom Filter!
Tunable per table using bloom_filter_fp_chance
• written often, rarely read ☞ increase bloom_filter_fp_chance to shrink the filters and save memory
1–2 GB of memory per 10⁹ partitions
• tiny partitions ☞ a lot of memory overhead; reduce bloom_filter_fp_chance if you have a good compaction rate
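Example (table name and value are illustrative):
ALTER TABLE users WITH bloom_filter_fp_chance = 0.1;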
77. Partition Summary!
[Diagram: an SSTable containing partitions #partition001, #partition128, #partition350, …]
Partition summary = in-memory sample of the partition index:
#partition001 ☞ offset 0x0
#partition128 ☞ offset 0x4500
#partition256 ☞ offset 0x851513
#partition512 ☞ offset 0x5464321
78. Partition Summary!
Tunable per table using index_interval
• written often, rarely read ☞ increase it to save memory
• 128 – 512 is a good range of values
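For example (table name and value are illustrative; note that in Cassandra 2.1 the property is split into min_index_interval / max_index_interval):
ALTER TABLE users WITH min_index_interval = 256;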
79. Cassandra Read Path!
SELECT … FROM … WHERE #partition = …;
[Diagram: full read path for a partition, in order]
1. Row Cache (off heap)
2. Bloom Filter (one per SSTable)
3. Partition Key Cache
4. Key Index Sample
5. Compression Offset ☞ read the data from the SSTable
81. Data compression!
Tunable per table using compression
• use LZ4Compressor
• enabled by default
• ×2 – ×5 gain on disk space
Do not deactivate compression
• unless you’re very short on CPU
• CPU vs memory is always a good trade
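For example (table name illustrative; 'sstable_compression' is the pre-3.0 property name):
ALTER TABLE users WITH compression = {'sstable_compression': 'LZ4Compressor'};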
91. Hinted Handoff!
What if a node is dead for too long?
☞ max_hint_window_in_ms
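In cassandra.yaml (the value shown is the usual default, i.e. 3 hours):
max_hint_window_in_ms: 10800000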
92. Hinted Handoff!
Node back online after a long outage
☞ full node repair from the replicas
93. Consistent read!
Read at QUORUM or more
• read data from the fastest replica
• AND request a digest from the other replicas to reach QUORUM
• return the most up-to-date data to the client
• repair stale replicas if the digests mismatch
94. Read Repair!
Read at CL < QUORUM
• read from the least-loaded node
• every x reads, request digests from the other replicas
• compare the digests & repair asynchronously
• read_repair_chance (10%)
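The probability is tuned per table, e.g. (table name and value illustrative):
ALTER TABLE users WITH read_repair_chance = 0.1;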
95. Manual repair!
Should be part of cluster operations
• scheduled
• I/O intensive
Use the incremental repair introduced in 2.1
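A hedged sketch of a scheduled repair (keyspace name is illustrative; exact flags depend on the Cassandra version):
nodetool repair -pr my_keyspace    # primary range only, run on every node in turn
nodetool repair -inc my_keyspace   # incremental repair (2.1+)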
100. Client consistency level!
DC-aware consistency levels
• LOCAL_ONE
• LOCAL_QUORUM
Applies to both reads & writes
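For example, in cqlsh:
CONSISTENCY LOCAL_QUORUM;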
101. Read Repair!
To avoid cross-DC read repair
☞ read_repair_chance = 0
☞ dclocal_read_repair_chance = …
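For example (table name illustrative; 0.1 is an illustrative value, not a recommendation):
ALTER TABLE users WITH read_repair_chance = 0 AND dclocal_read_repair_chance = 0.1;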
102. Multi-DC usages!
Data locality, disaster recovery
[Diagram: two rings, New York (DC1) with 8 nodes and London (DC2) with 5 nodes]
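Multi-DC replication is declared per keyspace; a minimal sketch, assuming the snitch reports the data centers as DC1 and DC2 (names and replication factors are illustrative):
CREATE KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};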
103. Multi-DC usages!
Workload segregation/virtual DC
[Diagram: two logical rings inside the same physical DC: Production (live traffic) and Analytics (Spark/Hadoop)]
104. Multi-DC usages!
Prod data copy for testing/benchmarking
[Diagram: the production ring streams a data copy to a tiny test cluster; production uses LOCAL consistency so the test DC is never read back]
106. Storage!
Use SSD whenever possible
• better I/O
• seek time in 1/10 ms
Hard drive type ☞ seek time
7200 RPM ☞ 12 ms
10k RPM ☞ 7 ms
15k RPM ☞ 5 ms
SSD ☞ 0.04 ms
107. Storage!
Do NOT use shared storage/SAN
• defeats all of Cassandra's disk optimizations
• bad disk latency & seek time
108. Storage!
JBOD vs RAID
• RAID0 for performance
• rely on Cassandra replication for resilience if RF ≥ 3
• RAID5/RAID1 are overkill
109. Storage!
Disk space
• SizeTiered compaction requires ×2 disk space temporarily
• for capacity planning, also take into account:
• snapshots
• secondary indices
111. CPU !
Recommendations
• Intel Xeon or AMD (64-bit)
• 4 to 8 cores, ideally 8
• intense write workload is CPU-bound
• data compression costs CPU
• encryption (disk or protocol) costs CPU
114. On the cloud!
Internode latency
• may be higher
• set phi_convict_threshold to 10–12
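In cassandra.yaml (12 is an illustrative value inside the suggested range):
phi_convict_threshold: 12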
115. Using VMs for production!
VirtualBox, VMware, …, Docker images
• fine for testing
• for production ☞ ensure proper resource segregation
117. System!
JVM
• understand the memory model
• long GC pauses ☞ often a consequence, not the cause
• new heap size (-Xmn)
Master Linux system commands (iostat, vmstat, …)
File handle limits
Swapping
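A minimal sketch of the usual knobs (all values are illustrative and must be sized to the machine; the heap settings live in cassandra-env.sh):
# cassandra-env.sh
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"       # the -Xmn new generation size
# OS level
ulimit -n 100000          # raise the file handle limit for the cassandra user
sysctl -w vm.swappiness=1 # discourage swapping (or disable swap entirely)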
118. Cassandra!
Configuration
• do not remove comments in the cassandra.yaml
• beware of multi-threaded compaction
• cluster name ?