Beyond the immediate schema changes supported in Scylla Open Source 5.0, learn how the Raft consensus infrastructure will enable radical new capabilities: more dynamic topology changes, tablets, immediate consistency, better and faster elasticity, and simpler repair operations.
To watch all of the recordings from Scylla Summit 2022, visit our website: https://www.scylladb.com/summit.
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
1. The Future of Consensus in ScyllaDB 5.0 and Beyond
Tomasz Grabiec, Distinguished Software Engineer
2. Tomasz Grabiec, Distinguished Software Engineer
■ Core engineer and maintainer at ScyllaDB for the past 8 years
■ Started coding when the Commodore 64 was still a thing
■ Lives in Cracow, Poland
9. What is topology?
Topology is defined as all of the following:
■ the set of nodes in the cluster,
■ the location of those nodes in DCs and racks,
■ the assignment of data ownership to nodes.
10. Triggers for topology changes in Scylla
■ Node operations:
• Bootstrapping a new node
• Replacing a node
• Decommissioning a node
• Removing a node
■ Changing replication strategy of a keyspace
12. Token metadata
■ Members, data partitioning and distribution
■ Where does each key live in the cluster?
13. Token partitioning
■ token = hash(partition key)
■ token ring: the space of all tokens, i.e. of all partition keys
■ token range: a contiguous slice of the ring, i.e. a set of partition keys (see the sketch below)
[Diagram: the token ring, with tokens delimiting token ranges]
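As a rough illustration, here is how partition keys map onto the ring in Python (a stand-in hash is used; Scylla's real partitioner is Murmur3, and the helper names are made up):

import bisect
import hashlib

def token(partition_key: bytes) -> int:
    # Stand-in hash; Scylla actually uses the Murmur3 partitioner.
    h = int.from_bytes(hashlib.md5(partition_key).digest()[:8], "big")
    return h - 2**63  # shift into the signed 64-bit token space

# The token ring is the circular space of all tokens; sorted boundary
# tokens split it into token ranges.
ring = sorted(token(k) for k in (b"a", b"b", b"c", b"d"))

def owning_range(partition_key: bytes) -> int:
    # Index of the token range owning this key: the first boundary
    # token >= the key's token, wrapping around the ring.
    return bisect.bisect_left(ring, token(partition_key)) % len(ring)

print(owning_range(b"user:42"))  # some range index in [0, 4)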
14. Token metadata
■ Each node has a set of tokens assigned during bootstrap (vnodes)
■ Tokens combined determine primary owning replicas for key ranges
[Diagram: token metadata: ring segments assigned to owning nodes A, B, and C]
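A minimal sketch of a token-metadata lookup in Python, assuming simplified single-owner semantics (no replication strategy applied):

import bisect

class TokenMetadata:
    """Sorted (token, node) pairs; each node owns several vnode tokens."""
    def __init__(self, tokens_by_node):
        self.ring = sorted(
            (t, node) for node, tokens in tokens_by_node.items() for t in tokens
        )
        self.boundaries = [t for t, _ in self.ring]

    def primary_replica(self, token):
        # Primary replica: the node owning the first token at or after
        # the key's token, wrapping around the ring.
        i = bisect.bisect_left(self.boundaries, token) % len(self.ring)
        return self.ring[i][1]

tm = TokenMetadata({"A": [100, 500], "B": [300, 700], "C": [200, 900]})
assert tm.primary_replica(250) == "B"  # first boundary >= 250 is 300 (B)
assert tm.primary_replica(950) == "A"  # wraps around to token 100 (A)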
16. Token metadata replication
■ Every node has its local view of topology
• token metadata
• replication strategy (schema)
• used by coordinators to route requests
■ Token metadata changes propagate through the gossip protocol
• Each node advertises its tokens
• Eventually consistent propagation
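To see why this propagation is only eventually consistent, here is a toy model in Python (the versioned-entry shape is a simplification of gossip's generation/version numbers; node names and values are made up):

def merge(local, remote):
    # Gossip exchange: adopt any entry the peer has with a newer version.
    merged = dict(local)
    for node, (version, tokens) in remote.items():
        if node not in merged or merged[node][0] < version:
            merged[node] = (version, tokens)
    return merged

view_a = {"A": (3, {100, 500}), "D": (1, {250})}  # A has seen D bootstrap
view_b = {"A": (3, {100, 500})}                   # B has not heard of D yet
# Until a gossip round reaches B, coordinators on A and B route differently.
view_b = merge(view_b, view_a)
assert "D" in view_b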
18. Eventually (in)consistent topology
■ To ensure data consistency, all coordinators need to agree on topology
■ Eventually consistent propagation -> stale topology
26. Eventually (in)consistent topology
[Diagram: node D bootstrapping; nodes A, B, and C each hold a different local view of the token metadata while the change is still propagating through gossip]
28. Eventually (in)consistent topology
“Cannot” happen:
“Before adding the new node,
check the node’s status in the cluster using nodetool status command.
You cannot add new nodes to the cluster if any of the nodes are down.” [1]
[1] https://docs.scylladb.com/operating-scylla/procedures/cluster-management/add-node-to-cluster/
30. Eventually (in)consistent topology
… or can it?
■ Admins are humans, and humans are emotional beings
■ They may not always do all the checks
■ They may cut corners when working under stress
■ Checks that seem irrelevant every single time are the ones we start ignoring
31. Eventually (in)consistent topology
The plan:
■ Make the database responsible for consistency under all conditions
Why:
■ Gives a reliable safety net for admins
■ Reduces stress
■ Increases confidence
■ Simplifies procedures
32. Elasticity
Prerequisite for automatic topology changes:
■ Auto-scaling
Changing cluster capacity based on current demand by adding and removing nodes
■ Dynamic data partitioning / auto rebalancing
Optimizing data location to handle changing workloads (e.g. hotspot elimination)
33. Elasticity
Prerequisite for automatic topology changes:
■ The database itself is the admin making decisions
■ Shifts responsibility back to the database
34. Elasticity
Prerequisite for automatic topology changes:
■ Topology changes will be concurrent with other events:
• Node restarts
• Manual topology changes
■ Smaller increments (sub-node granularity)
■ More frequent
■ Need to keep the system consistent
■ Need to be fault-tolerant to ensure liveness (no admin)
■ Need to be fast
35. Moving token metadata to RAFT
system.token_metadata
Strongly consistent fault-tolerant storage
for topology information
36. Moving token metadata to RAFT
system.token_metadata
■ Have a RAFT group which includes all cluster members (raft_group0)
■ Token metadata becomes the state machine that is replicated by RAFT
■ Changes of token metadata are raft commands
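A minimal sketch of that idea in Python (the command shapes are invented for illustration):

class TokenMetadataStateMachine:
    """Token metadata as the raft_group0 state machine: every member
    applies the same committed commands in the same order, so all
    replicas of the metadata converge deterministically."""

    def __init__(self):
        self.tokens_by_node = {}

    def apply(self, command):
        # Commands are committed Raft log entries; apply() must be
        # deterministic, as it runs independently on every member.
        if command["op"] == "add_node":
            self.tokens_by_node[command["node"]] = command["tokens"]
        elif command["op"] == "remove_node":
            self.tokens_by_node.pop(command["node"], None)

# A topology change submits a command to raft_group0; once the log
# entry commits, every node applies it to its local replica.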
37. Moving token metadata to RAFT
system.token_metadata
■ Topology changes can do linearizable reads and writes of token metadata
■ No stale topology
38. Replacing gossip with RAFT
■ RAFT eagerly replicates to every node
■ Like RF=ALL tables with auto-repair
■ Request coordinators still use the local view on topology
• No extra coordination when executing user requests
■ Topology changes use linearizable access for learning and modification
• No need for sleep(30s)
■ Faster topology changes
40. Serializing topology changes
■ Topology changes cannot be made concurrently
■ Currently, this is the admin's responsibility
■ We will make the database take care of it
41. Serializing topology changes
■ Lock using linearizable CAS on topology lock register
■ Lock acquired before starting any topology change
■ Blocks until acquired
■ Released when topology change completes or aborts
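In sketch form (Python; cas() stands in for a hypothetical linearizable compare-and-swap on the lock register, implemented on top of raft_group0):

import time
import uuid

def acquire_topology_lock(cas):
    # cas(expected, new) -> bool is assumed to be a linearizable
    # compare-and-swap on the topology lock register, backed by Raft.
    me = str(uuid.uuid4())
    while not cas(expected=None, new=me):  # blocks until acquired
        time.sleep(1)                      # back off and retry
    return me

def release_topology_lock(cas, me):
    # Called when the topology change completes or aborts.
    cas(expected=me, new=None)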
45. Automatic transaction failover
[Diagram: a topology change as a saga: lock, ..., send data, unlock]
■ Automatic topology changes (e.g. load balancing) cannot wait for admin intervention
■ We need fault-tolerant orchestration
■ We need transactions to resume or abort automatically
■ Whatever the decision, we must have a new orchestrator for the transaction
46. Automatic transaction failover
■ Keep saga state in fault-tolerant, linearizable storage
■ The orchestrator runs where the RAFT leader of raft_group0 runs
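A sketch of the mechanism in Python (the storage API and step names are hypothetical): every completed step is recorded in linearizable storage, so the orchestrator on a new RAFT leader resumes from the last committed step:

SAGA_STEPS = ["lock", "send_data", "unlock"]

def run_saga(storage, actions):
    # storage: linearizable, replicated key/value state (e.g. group0).
    # actions: step name -> idempotent callable performing that step.
    done = storage.get("saga_progress", -1)
    for i, step in enumerate(SAGA_STEPS):
        if i <= done:
            continue                      # committed by a previous orchestrator
        actions[step]()                   # steps must be idempotent
        storage.put("saga_progress", i)   # persist before moving on

# If the RAFT leader fails mid-saga, the orchestrator on the new leader
# calls run_saga() again and resumes from the last persisted step.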
53. Bring RAFT to user tables!
CREATE TABLE foo WITH raft = TRUE;
Strongly-consistent tables
■ Do not have any databases before Scylla.
■ Always run nodetool cleanup after bootstrapping a new node.
■ Run repair within gc_grace_seconds.
■ Do not bootstrap nodes concurrently, or make any other topology change.
■ Do not use SimpleStrategy in a multi-DC setup.
54. Strongly-consistent tables
LWT (Paxos-based):
■ Slow
■ 3 rounds to replicas per user request
■ Concurrent conflicting requests -> retries -> negative scaling
RAFT:
■ Fast
■ 1 round to replicas (on leader)
■ ...or less, due to request batching
■ Pipelining all the way down to each CPU -> high throughput
■ No retries
55. Strongly-consistent tables
RAFT:
■ Adds latency on leader failure (1s + election time)
■ Adds 1 hop when not on the leader (make drivers leader-aware)
■ Needs many RAFT groups to distribute load among shards (tablets to the rescue)
LWT (Paxos-based):
■ No latency on leader failure
■ Load easy to distribute
59. RAFT tables
[Diagram: keys key1 and key2 routed to different RAFT groups]
■ Good load distribution requires lots of RAFT groups
■ We can use more tokens
■ Too many -> explosion of metadata and management overhead
■ Too few -> load imbalance