Beyond the immediate schema changes supported in Scylla Open Source 5.0, learn how the Raft consensus infrastructure will enable radical new capabilities: more dynamic topology changes, tablets, immediate consistency, better and faster elasticity, and simpler repair operations.
To watch all of the recordings from Scylla Summit 2022, visit our website: https://www.scylladb.com/summit.
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
1. The Future of Consensus in ScyllaDB 5.0 and Beyond
Tomasz Grabiec, Distinguished Software Engineer
2. Tomasz Grabiec, Distinguished Software Engineer
■ Core engineer and maintainer at ScyllaDB for the past 8 years
■ Started coding when the Commodore 64 was still a thing
■ Lives in Cracow, Poland
9. What is topology?
Topology is defined as all of the following:
■ the set of nodes in the cluster,
■ the location of those nodes in DCs and racks,
■ the assignment of data ownership to nodes.
10. Triggers for topology changes in Scylla
■ Node operations:
• Bootstrapping a new node
• Replacing a node
• Decommissioning a node
• Removing a node
■ Changing replication strategy of a keyspace
12. Token metadata
■ Members, data partitioning and distribution
■ Where does each key live in the cluster?
13. Token partitioning
■ token = hash(partition key)
■ token ring: the space of all tokens, i.e. of all partition keys
■ token range: a contiguous slice of the ring, i.e. a set of partition keys (see the sketch below)
[Diagram: the token ring, with tokens delimiting token ranges]
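As a rough illustration, here is how partition keys map onto the ring in Python (a stand-in hash is used; Scylla's real partitioner is Murmur3, and the helper names are made up):

import bisect
import hashlib

def token(partition_key: bytes) -> int:
    # Stand-in hash; Scylla actually uses the Murmur3 partitioner.
    h = int.from_bytes(hashlib.md5(partition_key).digest()[:8], "big")
    return h - 2**63  # shift into the signed 64-bit token space

# The token ring is the circular space of all tokens; sorted boundary
# tokens split it into token ranges.
ring = sorted(token(k) for k in (b"a", b"b", b"c", b"d"))

def owning_range(partition_key: bytes) -> int:
    # Index of the token range owning this key: the first boundary
    # token >= the key's token, wrapping around the ring.
    return bisect.bisect_left(ring, token(partition_key)) % len(ring)

print(owning_range(b"user:42"))  # some range index in [0, 4)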
14. Token metadata
■ Each node has a set of tokens assigned during bootstrap (vnodes)
■ Tokens combined determine primary owning replicas for key ranges
[Diagram: token metadata: ring segments assigned to owning nodes A, B, and C]
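A minimal sketch of a token-metadata lookup in Python, assuming simplified single-owner semantics (no replication strategy applied):

import bisect

class TokenMetadata:
    """Sorted (token, node) pairs; each node owns several vnode tokens."""
    def __init__(self, tokens_by_node):
        self.ring = sorted(
            (t, node) for node, tokens in tokens_by_node.items() for t in tokens
        )
        self.boundaries = [t for t, _ in self.ring]

    def primary_replica(self, token):
        # Primary replica: the node owning the first token at or after
        # the key's token, wrapping around the ring.
        i = bisect.bisect_left(self.boundaries, token) % len(self.ring)
        return self.ring[i][1]

tm = TokenMetadata({"A": [100, 500], "B": [300, 700], "C": [200, 900]})
assert tm.primary_replica(250) == "B"  # first boundary >= 250 is 300 (B)
assert tm.primary_replica(950) == "A"  # wraps around to token 100 (A)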
16. Token metadata replication
■ Every node has its local view of topology
• token metadata
• replication strategy (schema)
• used by coordinators to route requests
■ Token metadata changes propagate through the gossip protocol
• Each node advertises its tokens
• Eventually consistent propagation
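To see why this propagation is only eventually consistent, here is a toy model in Python (the versioned-entry shape is a simplification of gossip's generation/version numbers; node names and values are made up):

def merge(local, remote):
    # Gossip exchange: adopt any entry the peer has with a newer version.
    merged = dict(local)
    for node, (version, tokens) in remote.items():
        if node not in merged or merged[node][0] < version:
            merged[node] = (version, tokens)
    return merged

view_a = {"A": (3, {100, 500}), "D": (1, {250})}  # A has seen D bootstrap
view_b = {"A": (3, {100, 500})}                   # B has not heard of D yet
# Until a gossip round reaches B, coordinators on A and B route differently.
view_b = merge(view_b, view_a)
assert "D" in view_b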
18. Eventually (in)consistent topology
■ To ensure data consistency, all coordinators need to agree on topology
■ Eventually consistent propagation -> stale topology
26. Eventually (in)consistent topology
[Diagram: node D bootstrapping; nodes A, B, and C each hold a different local view of the token metadata while the change is still propagating through gossip]
28. Eventually (in)consistent topology
“Cannot” happen:
“Before adding the new node,
check the node’s status in the cluster using nodetool status command.
You cannot add new nodes to the cluster if any of the nodes are down.” [1]
[1] https://docs.scylladb.com/operating-scylla/procedures/cluster-management/add-node-to-cluster/
30. Eventually (in)consistent topology
… or can it?
■ Admins are humans, and humans are emotional beings
■ They may not always do all the checks
■ They may cut corners when working under stress
■ Checks that seem irrelevant every single time are the ones we start ignoring
31. Eventually (in)consistent topology
The plan:
■ Make the database responsible for consistency under all conditions
Why:
■ Gives a reliable safety net for admins
■ Reduces stress
■ Increases confidence
■ Simplifies procedures
32. Elasticity
Prerequisite for automatic topology changes:
■ Auto-scaling
Changing cluster capacity based on current demand by adding and removing nodes
■ Dynamic data partitioning / auto rebalancing
Optimizing data location to handle changing workloads (e.g. hotspot elimination)
33. Elasticity
Prerequisite for automatic topology changes:
■ The database itself is the admin making decisions
■ Shifts responsibility back to the database
34. Elasticity
Prerequisite for automatic topology changes:
■ Topology changes will be concurrent with other events:
• Node restarts
• Manual topology changes
■ Smaller increments (sub-node granularity)
■ More frequent
■ Need to keep the system consistent
■ Need to be fault-tolerant to ensure liveness (no admin)
■ Need to be fast
35. Moving token metadata to RAFT
system.token_metadata
Strongly consistent fault-tolerant storage
for topology information
36. Moving token metadata to RAFT
system.token_metadata
■ Have a RAFT group which includes all cluster members (raft_group0)
■ Token metadata becomes the state machine that is replicated by RAFT
■ Changes of token metadata are raft commands
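A minimal sketch of that idea in Python (the command shapes are invented for illustration):

class TokenMetadataStateMachine:
    """Token metadata as the raft_group0 state machine: every member
    applies the same committed commands in the same order, so all
    replicas of the metadata converge deterministically."""

    def __init__(self):
        self.tokens_by_node = {}

    def apply(self, command):
        # Commands are committed Raft log entries; apply() must be
        # deterministic, as it runs independently on every member.
        if command["op"] == "add_node":
            self.tokens_by_node[command["node"]] = command["tokens"]
        elif command["op"] == "remove_node":
            self.tokens_by_node.pop(command["node"], None)

# A topology change submits a command to raft_group0; once the log
# entry commits, every node applies it to its local replica.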
37. Moving token metadata to RAFT
system.token_metadata
■ Topology changes can do linearizable reads and writes of token metadata
■ No stale topology
38. Replacing gossip with RAFT
■ RAFT eagerly replicates to every node
■ Like RF=ALL tables with auto-repair
■ Request coordinators still use the local view on topology
• No extra coordination when executing user requests
■ Topology changes use linearizable access for learning and modification
• No need for sleep(30s)
■ Faster topology changes
40. Serializing topology changes
■ Topology changes cannot be made concurrently
■ Currently, this is the admin's responsibility
■ We will make the database take care of it
41. Serializing topology changes
■ Lock using linearizable CAS on topology lock register
■ Lock acquired before starting any topology change
■ Blocks until acquired
■ Released when topology change completes or aborts
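In sketch form (Python; cas() stands in for a hypothetical linearizable compare-and-swap on the lock register, implemented on top of raft_group0):

import time
import uuid

def acquire_topology_lock(cas):
    # cas(expected, new) -> bool is assumed to be a linearizable
    # compare-and-swap on the topology lock register, backed by Raft.
    me = str(uuid.uuid4())
    while not cas(expected=None, new=me):  # blocks until acquired
        time.sleep(1)                      # back off and retry
    return me

def release_topology_lock(cas, me):
    # Called when the topology change completes or aborts.
    cas(expected=me, new=None)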
45. Automatic transaction failover
[Diagram: a topology change as a saga: lock, ..., send data, unlock]
■ Automatic topology changes (e.g. load balancing) cannot wait for admin intervention
■ We need fault-tolerant orchestration
■ We need transactions to resume or abort automatically
■ Whatever the decision, we must have a new orchestrator for the transaction
46. Automatic transaction failover
■ Keep saga state in fault-tolerant, linearizable storage
■ The orchestrator runs where the RAFT leader of raft_group0 runs
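A sketch of the mechanism in Python (the storage API and step names are hypothetical): every completed step is recorded in linearizable storage, so the orchestrator on a new RAFT leader resumes from the last committed step:

SAGA_STEPS = ["lock", "send_data", "unlock"]

def run_saga(storage, actions):
    # storage: linearizable, replicated key/value state (e.g. group0).
    # actions: step name -> idempotent callable performing that step.
    done = storage.get("saga_progress", -1)
    for i, step in enumerate(SAGA_STEPS):
        if i <= done:
            continue                      # committed by a previous orchestrator
        actions[step]()                   # steps must be idempotent
        storage.put("saga_progress", i)   # persist before moving on

# If the RAFT leader fails mid-saga, the orchestrator on the new leader
# calls run_saga() again and resumes from the last persisted step.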
53. Bring RAFT to user tables!
CREATE TABLE foo WITH raft = TRUE;
Strongly-consistent tables
■ Do not have any databases before Scylla.
■ Always run nodetool cleanup after bootstrapping a new node.
■ Run repair within gc_grace_seconds.
■ Do not bootstrap nodes concurrently, or make any other topology change.
■ Do not use SimpleStrategy in a multi-DC setup.
54. Strongly-consistent tables
LWT (Paxos-based):
■ Slow
■ 3 rounds to replicas per user request
■ Concurrent conflicting requests -> retries -> negative scaling
RAFT:
■ Fast
■ 1 round to replicas (on leader)
■ ...or less, due to request batching
■ Pipelining all the way down to each CPU -> high throughput
■ No retries
55. Strongly-consistent tables
RAFT:
■ Adds latency on leader failure (1s + election time)
■ Adds 1 hop when not on the leader (make drivers leader-aware)
■ Needs many RAFT groups to distribute load among shards (tablets to the rescue)
LWT (Paxos-based):
■ No latency on leader failure
■ Load easy to distribute
59. RAFT tables
[Diagram: keys key1 and key2 routed to different RAFT groups]
■ Good load distribution requires lots of RAFT groups
■ We can use more tokens
■ Too many -> explosion of metadata and management overhead
■ Too few -> load imbalance