Cassandra for the ops: the do's and the don'ts
1. Cassandra for the ops, the do's and the don'ts
DuyHai DOAN, Technical Advocate
@doanduyhai
2. Shameless self-promotion!
Duy Hai DOAN
Cassandra technical advocate
• talks, meetups, conferences
• open-source development (Achilles, …)
• OSS Cassandra point of contact ☞ duy_hai.doan@datastax.com
• production troubleshooting
3. Datastax!
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 200+ employees
• Headquarters in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
4. Agenda!
Cassandra Architecture
• cluster
• replication
• consistency
Data model 101
• last write wins
• read path
• write path
5. Agenda!
Advanced Architecture
• failure handling
• multi-datacenter
Hardware
• CPU
• memory
• storage
8. Cassandra history!
NoSQL database
• created at Facebook
• open-sourced since 2008
• current version = 2.1
• column-oriented ☞ distributed table
27. Data Streaming!
Production is priority n°1
Play with nodetool setstreamthroughput
Add multiple nodes at the same time
You can also run nodetool disableautocompaction <ks> <table> while streaming
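A hedged sketch of these commands on one node (the 200 value and the keyspace/table names are illustrative):
nodetool setstreamthroughput 200                      # throttle streaming, in megabits/s
nodetool disableautocompaction my_keyspace my_table
nodetool enableautocompaction my_keyspace my_table    # once streaming is done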
28. Failure tolerance!
Replication Factor (RF) = 3
[Ring diagram: 8 nodes n1…n8; with RF = 3 each node stores 3 token ranges, e.g. {B, A, H}, {C, B, A}, {D, C, B}]
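For illustration, the replication factor is declared per keyspace (keyspace name and single-DC strategy are illustrative):
CREATE KEYSPACE my_ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};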
29. Coordinator node!
Incoming requests (read/write)
Coordinator node handles the request
Every node can be the coordinator ☞ masterless
30. Token Aware Load Balancing!
[Ring diagram: the client computes the token and sends the request straight to a replica, skipping the extra coordinator hop]
The driver embeds the hash function and knows the ring topology.
34. Consistency!
Tunable at runtime
• ONE
• QUORUM (strict majority w.r.t. RF)
• ALL
Applies to both reads & writes
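For example, in cqlsh the consistency level can be switched per session:
CONSISTENCY QUORUM;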
35. Write consistency!
Write ONE
• write request sent to all replicas in parallel
36. Write consistency!
Write ONE
• write request sent to all replicas in parallel
• wait for ONE ack before returning to the client
[Ring diagram: the first ack comes back after 5 μs and the coordinator answers the client]
37. Write consistency!
Write ONE
• write request sent to all replicas in parallel
• wait for ONE ack before returning to the client
• other acks arrive later, asynchronously
[Ring diagram: the three acks come back after 5 μs, 10 μs and 120 μs; only the first one was awaited]
38. Write consistency!
Write QUORUM
• write request sent to all replicas in parallel
• wait for QUORUM acks before returning to the client
• other acks arrive later, asynchronously
[Ring diagram: the coordinator waits for the 5 μs and 10 μs acks (QUORUM); the 120 μs ack arrives later]
39. Read consistency!
Read ONE
• read from one node among all replicas
40. Read consistency!
Read ONE
• read from one node among all replicas
• contact the fastest node (based on latency stats)
41. Read consistency!
Read QUORUM
• read data from the fastest replica
42. Read consistency!
Read QUORUM
• read data from the fastest replica
• AND request a digest from the other replicas to reach QUORUM
43. Read consistency!
Read QUORUM
• read data from the fastest replica
• AND request a digest from the other replicas to reach QUORUM
• return the most up-to-date data to the client
44. Read consistency!
Read QUORUM
• read data from the fastest replica
• AND request a digest from the other replicas to reach QUORUM
• return the most up-to-date data to the client
• repair stale replicas if the digests mismatch
53. Last Write Wins (LWW)!
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
#partition jdoe ☞ [ age = 33 | name = John DOE ]
54. Last Write Wins (LWW)!
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
#partition jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
t1 = auto-generated write timestamp
55. Last Write Wins (LWW)!
UPDATE users SET age = 34 WHERE login = 'jdoe';
SSTable1: jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
SSTable2: jdoe ☞ [ age = 34 (t2) ]
56. Last Write Wins (LWW)!
DELETE age FROM users WHERE login = 'jdoe';
SSTable1: jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
SSTable2: jdoe ☞ [ age = 34 (t2) ]
SSTable3: jdoe ☞ [ age = ✕ tombstone (t3) ]
57. Last Write Wins (LWW)!
SELECT age FROM users WHERE login = 'jdoe';
? ? ? which copy of age is the right one?
SSTable1: jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
SSTable2: jdoe ☞ [ age = 34 (t2) ]
SSTable3: jdoe ☞ [ age = ✕ tombstone (t3) ]
58. Last Write Wins (LWW)!
SELECT age FROM users WHERE login = 'jdoe';
✕ ✕ ✓ the cell with the highest timestamp (t3) wins
SSTable1 ✕: jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
SSTable2 ✕: jdoe ☞ [ age = 34 (t2) ]
SSTable3 ✓: jdoe ☞ [ age = ✕ tombstone (t3) ]
59. Last Write Wins (LWW)!
Timestamps are heavily used
NTP synchronization of all nodes is mandatory
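For illustration, a client can also set the write timestamp explicitly instead of relying on the auto-generated one (the microsecond value below is made up):
UPDATE users USING TIMESTAMP 1429000000000000 SET age = 35 WHERE login = 'jdoe';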
60. Compaction!
SSTable1: jdoe ☞ [ age = 33 (t1) | name = John DOE (t1) ]
SSTable2: jdoe ☞ [ age = 34 (t2) ]
SSTable3: jdoe ☞ [ age = ✕ tombstone (t3) ]
☞ compacted into a single new SSTable:
New SSTable: jdoe ☞ [ age = ✕ tombstone (t3) | name = John DOE (t1) ]
61. Compaction strategies!
SizeTiered
• withstand heavy write load
• group SSTables of similar size (no more than ≈50% difference)
• I/O friendly
Leveled
• compacts more frequently
• for heavy update/delete scenarios
• I/O-unfriendly (SSD recommended)
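The strategy is set (or changed) per table with a CQL statement like this (table name illustrative):
ALTER TABLE users WITH compaction = {'class': 'LeveledCompactionStrategy'};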
68. Cassandra Write Path!
Commit logs on dedicated disk partition
• if possible on SSD
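In cassandra.yaml this is just a matter of pointing the commit log at its own device (paths are illustrative):
commitlog_directory: /mnt/ssd1/cassandra/commitlog
data_file_directories:
    - /mnt/data1/cassandra/data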
69. Cassandra Read Path!
SELECT … FROM … WHERE #partition = …;
[Diagram: step 1, the Row Cache (off heap), is checked before touching SSTable1 / SSTable2 / SSTable3]
70. Row cache!
Before C* 2.1
• loads the entire partition into memory
• problematic with fat partitions (2 GB)
From C* 2.1
• caches only the first rows of the partition
• ☞ rows_per_partition
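For example (table name and row count are illustrative, Cassandra 2.1 syntax; the row cache itself is sized via row_cache_size_in_mb in cassandra.yaml):
ALTER TABLE users WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};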
73. Bloom Filter!
Tunable per table using bloom_filter_fp_chance
• written often, rarely read ☞ increase bloom_filter_fp_chance to shrink the filters and save memory
1–2 GB of memory per 10⁹ partitions
• tiny partitions ☞ a lot of memory overhead; reduce bloom_filter_fp_chance if you have a good compaction rate
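Example (table name and value are illustrative):
ALTER TABLE users WITH bloom_filter_fp_chance = 0.1;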
77. Partition Summary!
[Diagram: an SSTable containing partitions #partition001, #partition128, #partition350, …]
Partition summary = in-memory sample of the partition index:
#partition001 ☞ offset 0x0
#partition128 ☞ offset 0x4500
#partition256 ☞ offset 0x851513
#partition512 ☞ offset 0x5464321
78. Partition Summary!
Tunable per table using index_interval
• written often, rarely read ☞ increase it to save memory
• 128 – 512 is a good range of values
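For example (table name and value are illustrative; note that in Cassandra 2.1 the property is split into min_index_interval / max_index_interval):
ALTER TABLE users WITH min_index_interval = 256;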
79. Cassandra Read Path!
SELECT … FROM … WHERE #partition = …;
[Diagram: full read path for a partition, in order]
1. Row Cache (off heap)
2. Bloom Filter (one per SSTable)
3. Partition Key Cache
4. Key Index Sample
5. Compression Offset ☞ read the data from the SSTable
81. Data compression!
Tunable per table using compression
• use LZ4Compressor
• enabled by default
• ×2 – ×5 gain on disk space
Do not deactivate compression
• unless you’re very short on CPU
• CPU vs memory is always a good trade
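For example (table name illustrative; 'sstable_compression' is the pre-3.0 property name):
ALTER TABLE users WITH compression = {'sstable_compression': 'LZ4Compressor'};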
91. Hinted Handoff!
What if a node is dead for too long?
☞ max_hint_window_in_ms
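In cassandra.yaml (the value shown is the usual default, i.e. 3 hours):
max_hint_window_in_ms: 10800000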
92. Hinted Handoff!
Node back online after a long outage
☞ full node repair from the replicas
93. Consistent read!
Read at QUORUM or more
• read data from the fastest replica
• AND request a digest from the other replicas to reach QUORUM
• return the most up-to-date data to the client
• repair stale replicas if the digests mismatch
94. Read Repair!
Read at CL < QUORUM
• read from the least-loaded node
• every x reads, request digests from the other replicas
• compare the digests & repair asynchronously
• read_repair_chance (10%)
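The probability is tuned per table, e.g. (table name and value illustrative):
ALTER TABLE users WITH read_repair_chance = 0.1;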
95. Manual repair!
Should be part of cluster operations
• scheduled
• I/O intensive
Use the incremental repair introduced in 2.1
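A hedged sketch of a scheduled repair (keyspace name is illustrative; exact flags depend on the Cassandra version):
nodetool repair -pr my_keyspace    # primary range only, run on every node in turn
nodetool repair -inc my_keyspace   # incremental repair (2.1+)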
100. Client consistency level!
DC-aware consistency levels
• LOCAL_ONE
• LOCAL_QUORUM
Applies to both reads & writes
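For example, in cqlsh:
CONSISTENCY LOCAL_QUORUM;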
101. Read Repair!
To avoid cross-DC read repair
☞ read_repair_chance = 0
☞ dclocal_read_repair_chance = …
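For example (table name illustrative; 0.1 is an illustrative value, not a recommendation):
ALTER TABLE users WITH read_repair_chance = 0 AND dclocal_read_repair_chance = 0.1;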
102. Multi-DC usages!
Data locality, disaster recovery
[Diagram: two rings, New York (DC1) with 8 nodes and London (DC2) with 5 nodes]
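Multi-DC replication is declared per keyspace; a minimal sketch, assuming the snitch reports the data centers as DC1 and DC2 (names and replication factors are illustrative):
CREATE KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};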
103. Multi-DC usages!
Workload segregation/virtual DC
[Diagram: two logical rings inside the same physical DC: Production (live traffic) and Analytics (Spark/Hadoop)]
104. Multi-DC usages!
Prod data copy for testing/benchmarking
[Diagram: the production ring streams a data copy to a tiny test cluster; production uses LOCAL consistency so the test DC is never read back]
106. Storage!
Use SSD whenever possible
• better I/O
• seek time in 1/10 ms
Hard drive type ☞ seek time
7200 RPM ☞ 12 ms
10k RPM ☞ 7 ms
15k RPM ☞ 5 ms
SSD ☞ 0.04 ms
107. Storage!
Do NOT use shared storage/SAN
• defeats all of Cassandra's disk optimizations
• bad disk latency & seek time
108. Storage!
JBOD vs RAID
• RAID0 for performance
• rely on Cassandra replication for resilience if RF ≥ 3
• RAID5/RAID1 are overkill
109. Storage!
Disk space
• SizeTiered compaction requires ×2 disk space temporarily
• for capacity planning, also take into account:
• snapshots
• secondary indices
111. CPU !
Recommendations
• Intel Xeon or AMD (64-bit)
• 4 to 8 cores, ideally 8
• intense write workload is CPU-bound
• data compression costs CPU
• encryption (disk or protocol) costs CPU
114. On the cloud!
Internode latency
• may be higher
• set phi_convict_threshold to 10–12
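In cassandra.yaml (12 is an illustrative value inside the suggested range):
phi_convict_threshold: 12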
115. Using VMs for production!
VirtualBox, VMware, …, Docker images
• fine for testing
• for production ☞ ensure proper resource segregation
117. System!
JVM
• understand the memory model
• long GC pauses ☞ often a consequence, not the cause
• new heap size (-Xmn)
Master Linux system commands (iostat, vmstat, …)
File handle limits
Swapping
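A minimal sketch of the usual knobs (all values are illustrative and must be sized to the machine; the heap settings live in cassandra-env.sh):
# cassandra-env.sh
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"       # the -Xmn new generation size
# OS level
ulimit -n 100000          # raise the file handle limit for the cassandra user
sysctl -w vm.swappiness=1 # discourage swapping (or disable swap entirely)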
118. Cassandra!
Configuration
• do not remove comments in the cassandra.yaml
• beware of multi-threaded compaction
• cluster name ?