2. 1-minute recap on Cassandra distribution
• Nodes are clustered in a “ring”
• Each node has a token in [0, 2^127 − 1], e.g.:
• 0
• 42535295865117307932921825928971026432
• 85070591730234615865843651857942052864
• 127605887595351923798765477786913079296
• Keys are hashed using MD5 (now Murmur3)
• Each node owns a share of the key-space
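The hash-and-walk-the-ring rule above can be sketched in a few lines of Python. This is illustrative only: the node names are made up, the MD5-to-token reduction is a simplification of RandomPartitioner, and replication is ignored.

```python
import bisect
import hashlib

# Hypothetical 4-node ring using the evenly spaced tokens from the slide.
RING = [
    (0, "node1"),
    (42535295865117307932921825928971026432, "node2"),
    (85070591730234615865843651857942052864, "node3"),
    (127605887595351923798765477786913079296, "node4"),
]
TOKENS = [t for t, _ in RING]

def key_token(key: bytes) -> int:
    # Simplified RandomPartitioner: MD5 the key, reduce into [0, 2**127 - 1].
    return int.from_bytes(hashlib.md5(key).digest(), "big") % (2**127)

def owner_of_token(token: int) -> str:
    # Walking clockwise, the first node whose token is >= the key's token
    # owns it; wrap around past the last token.
    i = bisect.bisect_left(TOKENS, token) % len(TOKENS)
    return RING[i][1]
```

With four evenly spaced tokens, each node ends up owning exactly a quarter of the key-space.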
Thursday, 24 October 13
3. Cassandra distribution limitations
• Operational complexity
• Rebuild cost for capacity-bound clusters
• Impact on maintenance operations
• Impact on topology changes
• No native support for heterogeneous hardware
4. Adding a node to an existing cluster
8. Add/remove node
• Need to rebalance ranges between nodes
• Move more data than is optimal
• (optimal would be 1/N)
• Impacts at most RF nodes
• (prefer to spread load across cluster)
• Manual, tedious, error-prone, painful...
9. Removing a node
• nodetool removetoken (renamed removenode in 1.2)
• Dead host's token removed from ring
• Next host in ring assumes range
• Replica count restored
• Involves at most 2 * RF - 1 nodes
• If we can make it faster, we can store more data!
11. Virtual nodes in Cassandra 1.2+
• More than one token per node
• Random token assignment
• Incremental cluster resize, one node at a time
• Streaming to/from all nodes, not just neighbors
• Only random partitioners are supported
• Multi-DC support still works in the same way
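The random token assignment can be illustrated with a toy sketch (plain Python, not Cassandra code; `TOKEN_SPACE` mirrors the range of a random partitioner):

```python
import random

TOKEN_SPACE = 2**127  # token range of a random partitioner

def assign_vnode_tokens(num_tokens=256, rng=None):
    # With vnodes, a joining node simply draws num_tokens random ring
    # positions instead of one carefully hand-placed token.
    rng = rng or random.Random()
    return sorted(rng.randrange(TOKEN_SPACE) for _ in range(num_tokens))
```

The 256 random positions interleave with every existing node's tokens, which is why streaming involves the whole cluster rather than just ring neighbours.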
12. Different virtual nodes strategies

                            Number of partitions    Partition size
  Random (Cassandra 1.2+)   O(N)                    O(B/N)
  Fixed (Riak)              O(1)                    O(B)
  Auto-sharding (MongoDB)   O(B)                    O(1)

N = number of nodes, B = size of dataset
(read more at http://bit.ly/virtualnodes)
13. Virtual Nodes!
• New in 1.2, enabled by default in 2.0
• Set num_tokens: 256 in cassandra.yaml
14. Adding nodes to a cluster
• From a single node...
• Multiple tokens
• Ranges of different sizes
15. Adding nodes to a cluster
• We add a second node
• “Steals” ranges from the existing node
16. Adding nodes to a cluster
• And a third one...
• “Steals” ranges from the existing nodes
• Distribution is close to 1/3 each
19. Bootstrap
• Assign a new host T random tokens (T=256)
• New tokens split ranges from existing nodes
• Each existing node contributes to the bootstrap
• Optimal data movement
• No need to rebalance, or double cluster size
• No need to calculate tokens
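A back-of-the-envelope simulation (illustrative Python, not Cassandra code) shows why the data movement is near-optimal: a fifth node bootstrapped with 256 random tokens ends up owning close to 1/5 of the ring, taken in small slices from the existing nodes.

```python
import random

SPACE = 2**127

def ownership(tokens_by_node):
    # Fraction of the ring each node owns; a node owns the range
    # (previous token, its token] for each of its tokens.
    ring = sorted((t, n) for n, toks in tokens_by_node.items() for t in toks)
    owned = dict.fromkeys(tokens_by_node, 0)
    prev = ring[-1][0] - SPACE  # the last token wraps around before the first
    for t, n in ring:
        owned[n] += t - prev
        prev = t
    return {n: o / SPACE for n, o in owned.items()}

rng = random.Random(1)
cluster = {f"n{i}": [rng.randrange(SPACE) for _ in range(256)] for i in range(4)}
before = ownership(cluster)
cluster["new"] = [rng.randrange(SPACE) for _ in range(256)]  # bootstrap: 256 random tokens
after = ownership(cluster)
# after["new"] (the share that must stream to the new node) lands near the
# optimal 1/5, and existing nodes can only shrink, never grow.
```

Because the new tokens only split existing ranges, no data ever moves between the old nodes, which is what makes the movement optimal.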
20. Removing nodes from a cluster
Removing the node with blue ranges:
22. Removing a node
• nodetool removetoken becomes removenode
• nodetool removenode <host_id>
• Dead host's tokens removed from ring
• Ranges recalculated & data moved
• All nodes participate!
23. nodetool ring becomes useless...
$ nodetool ring
Datacenter: datacenter1
==========
Address        Rack   Status  State   Load      Owns     Token
                                                         908086307850
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -92138833317
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -91449505236
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -89961709812
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -89833237466
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -89829145910
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -88349645925
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -87940053784
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -87315744643
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -86833403935
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -86172092729
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -85207698040
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -85134888150
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -85110178049
192.168.100.2  rack1  Up      Normal  48.18 KB  19.46%   -84965473082
...
24. nodetool status
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns    Host ID
UN  192.168.100.2  48.18 KB  256     19.5%   bb84e34e-b929-41c2-a5
UN  192.168.100.3  48.21 KB  256     19.9%   50e8c1b1-a28f-431a-85
UN  192.168.100.1  46.19 KB  256     21.3%   67bdd989-1b34-4bbd-a5
UN  192.168.100.5  48.13 KB  256     18.9%   1a88e040-84fd-4461-80
UN  192.168.100.4  48.15 KB  256     20.5%   3be40484-8225-467c-82
26. Modelled with simulated token assignment
Virtual Nodes: Operational Aspirin
[Histogram: frequency vs. range size (arbitrary units), with the mean range size marked]
27. How does this lead to balanced load?!
• Each host has the same distribution of range sizes
• So each will assume roughly equal portions of the key-space
• Modelled with simulated data inserted into ranges...
28. Normalised data load
[Chart: normalised data load per virtual node, by location in key-space]
29. How balanced is balanced?
[Histogram: frequency vs. normalised load (arbitrary units)]
30. A balanced cluster
• Keys are randomly distributed
• Each v-node partition assumes load proportional to its size
• Load tends towards balance as the number of nodes increases
• 2 nodes: 48.4% and 51.6%
• 3 nodes: 34.3%, 33.0%, 32.7%
• 4 nodes: 24.3%, 25.2%, 24.9%, 25.6%
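Figures like those above can be reproduced with a small simulation (illustrative Python; random draws stand in for hashed keys, and the exact shares depend on the seed):

```python
import bisect
import random

def simulate_load(num_nodes, vnodes=256, keys=100_000, seed=0):
    # Give every node `vnodes` random tokens, then throw simulated key
    # tokens at the ring and count the fraction each node receives.
    rng = random.Random(seed)
    space = 2**127
    ring = sorted((rng.randrange(space), n)
                  for n in range(num_nodes) for _ in range(vnodes))
    tokens = [t for t, _ in ring]
    counts = [0] * num_nodes
    for _ in range(keys):
        i = bisect.bisect_left(tokens, rng.randrange(space)) % len(ring)
        counts[ring[i][1]] += 1
    return [c / keys for c in counts]
```

With 256 vnodes per node, each node's share stays within a few percent of 1/N, consistent with the measured splits on this slide.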
31. Performance testing
• 17-node cluster of EC2 m1.large instances
• Inserted 460 million keys
• at RF=3
• Timed removenode and then bootstrap
• Results at http://bit.ly/vnodesperf
33. Migration path for a non-vnode cluster
• Several techniques to migrate to vnodes
• The “simplest” is to rebuild your cluster
• With downtime: restore from backup
• Without downtime: twice the hardware
• “shuffle” is the proposed alternative
• Migrate all nodes to vnodes, shuffle ranges
• Very few success stories
34. Conclusion
• You should already be using virtual nodes!
• Token management is a thing of the past
• Embrace the randomness
• Scale up and down without pain
36. We’re hiring!
• Acunu suggested and developed virtual nodes
• Patches by @samoverton and @jericevans
• Eric Evans also contributed much of CQL
• We are looking for developers to work on Apache Cassandra, contributing features and enhancements to the open-source project