Invalidation-Based Protocols
for Replicated Datastores
Antonios Katsarakis
Doctor of Philosophy
The University of Edinburgh
Distributed datastores
Backbone of online services and cloud applications
Data: in-memory, sharded across servers within a datacenter (DC)
Offer a single-object read/write or a multi-object transaction API
Must provide:
- High performance
- Fault tolerance
→ Mandate data replication
Replication 101
Performance: a single node may not keep up with the load
Fault tolerance: data remain available despite failures
Typically 3 to 7 replicas
Consistency
- Weak: performance but nasty surprises
- Strong: intuitive, supports the broadest spectrum of apps
Replication protocols
- Provide strong consistency even under faults – if fault tolerant
- Define the actions that execute reads/writes or transactions (txs)
→ determine the datastore’s performance
Can strongly consistent protocols offer fault tolerance and high performance?
Strongly consistent replication
Datastores: replication protocols
- data replicated across nodes
- Fault tolerance
- Performance:
  - reads/writes: sacrifice concurrency or speed
  - txs: cannot exploit locality
Multiprocessor: coherence / HTM
- data replicated across caches
- Fault tolerance
- Performance via Invalidations (low-latency interconnect):
  - reads/writes: concurrency & speed
  - txs: fully exploit locality
Replication protocols inside a DC
- Network: fast, remote direct memory access (RDMA)
- Faults are rare within a replica group: a server fails at most twice a year
  → fault-free operation >> operation under faults
The common operation of replication protocols resembles the multiprocessor!
Thesis overview
Adapting multiprocessor-inspired invalidating protocols to intra-DC replicated
datastores enables: strong consistency, fault tolerance, high performance
Primary contributions
4 invalidating protocols → 3 most common replication uses in datastores
- Scale-out ccNUMA [Eurosys’18]: Galene protocol – performant read/write replication (skew) [1-slide summary]
- Hermes [ASPLOS’20]: Hermes protocol – fast fault-tolerant read/write replication [N slides]
- Zeus [Eurosys’21]: Zeus ownership, Zeus reliable commit – replicated fault-tolerant distributed txs [1-slide summary]
Performant read/write replication for skew
Many workloads exhibit skewed data accesses
→ a few servers are overloaded, most are underutilized
State-of-the-art skew mitigation
- distributes accesses across all servers & uses RDMA
- No locality: most requests need a remote access
  → increased latency, bottlenecked by network bandwidth
Symmetric caching (and Galene)
- all servers cache the same hottest objects
- Throughput scales with the number of servers
- Less network bandwidth: most requests are served locally
Challenge: efficiently keep the caches consistent
- Existing protocols serialize writes at a physical point = hotspot
- Galene protocol: invalidations + logical timestamps = fully distributed writes
100s of millions of ops/sec & up to 3x the state of the art!
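A minimal sketch of the write path described above, with hypothetical names: each server keeps a small cache of the hottest objects, and a writer invalidates its peers with a per-object logical timestamp instead of funneling writes through a single serialization point. This is not Galene’s actual implementation (which runs over RDMA); acknowledgements are modeled as synchronous calls.

```python
# Sketch: symmetric caches kept consistent with invalidations + logical timestamps.
from dataclasses import dataclass

VALID, INVALID = "V", "I"

@dataclass
class CacheLine:
    value: object
    ts: tuple      # (logical counter, writer node id) orders racing writes
    state: str     # VALID or INVALID

class SymmetricCache:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers          # other servers caching the same hot objects
        self.lines = {}             # key -> CacheLine

    def write(self, key, value):
        """Any node can write: no central serialization point."""
        old = self.lines.get(key)
        ts = ((old.ts[0] if old else 0) + 1, self.node_id)
        for p in self.peers:                    # 1. invalidate the other caches
            p.on_invalidation(key, ts)          #    (acks elided in this sketch)
        self.lines[key] = CacheLine(value, ts, VALID)
        for p in self.peers:                    # 2. validate them with the new value
            p.on_validation(key, value, ts)

    def on_invalidation(self, key, ts):
        line = self.lines.get(key)
        if line is None or ts > line.ts:        # lexicographic (counter, node id) order
            self.lines[key] = CacheLine(None, ts, INVALID)

    def on_validation(self, key, value, ts):
        line = self.lines.get(key)
        if line is not None and line.ts == ts:
            self.lines[key] = CacheLine(value, ts, VALID)

    def read(self, key):
        line = self.lines.get(key)
        return line.value if line and line.state == VALID else None  # else fetch remotely

n1, n2 = SymmetricCache(1, []), SymmetricCache(2, [])
n1.peers, n2.peers = [n2], [n1]
n2.write("hot_key", "v1")        # any server can write a cached hot object
print(n1.read("hot_key"))        # -> "v1", served from the local cache
```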
Hmmm …
Invalidating protocols give good read/write performance when replicating under skew.
Can they maintain high read/write performance while providing fault tolerance?
(reliable = strongly consistent + fault tolerant)
2nd primary contribution: Hermes!
What is the issue with existing reliable protocols?
Paxos
The gold standard: strong consistency and fault tolerance
Low performance
- reads → inter-replica communication
- writes → multiple RTTs over the network
- Common-case performance (i.e., no faults) is as bad as the worst case (under faults)
State-of-the-art replication protocols exploit failure-free operation for performance
Performance of state-of-the-art protocols
(figures: ZAB with a leader and replicas; CRAQ as a head-to-tail chain; legend: write, read, bcast, ucast)
ZAB
- Local reads from all replicas → fast
- Writes serialize on the leader → low throughput
CRAQ
- Local reads from all replicas → fast
- Writes traverse the length of the chain → high latency
Fast reads but poor write performance
Key protocol features for high performance
Goal: low latency + high throughput
Reads
- Local from all replicas
Writes
- Fast: minimize network hops (avoid the chain’s long latencies)
- Decentralized: no serialization points (avoid write serialization at the leader)
- Fully concurrent: any replica can service a write
Local reads from all replicas + fast, decentralized, fully concurrent writes
→ Existing replication protocols are deficient
Enter Hermes
Broadcast-based, invalidating replication protocol
Inspired by multiprocessor cache-coherence protocols
States of A: Valid, Invalid
Fault-free operation, for a write(A=3):
1. The Coordinator broadcasts Invalidations
   - the Coordinator is the replica servicing the write
   - from this point, no stale reads can be served → strong consistency!
2. Followers Acknowledge the Invalidation
3. The Coordinator commits and broadcasts Validations
   - all replicas can now serve reads for this object
Strongest consistency: linearizability
Local reads from all replicas → Valid objects hold the latest value
What about concurrent writes?
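A minimal, single-threaded sketch of this fault-free write flow. Names are illustrative and message passing is modeled as direct method calls; the real protocol is asynchronous, runs over RDMA, and adds timestamps and fault handling in the slides that follow.

```python
# Sketch: Hermes fault-free write path -> Invalidations, Acks, Validations.
VALID, INVALID = "V", "I"

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}                 # key -> {"value": v, "state": VALID/INVALID}
        self.peers = []                 # the other replicas in this key's group

    def read(self, key):
        obj = self.store[key]
        # A real implementation stalls while Invalid; a Valid object holds the latest value.
        assert obj["state"] == VALID, "a write to this object is in flight"
        return obj["value"]             # local read from any replica

    def write(self, key, value):
        """The replica servicing a write acts as its coordinator."""
        acks = 0
        for p in self.peers:            # 1. broadcast Invalidations
            p.on_invalidation(key)
            acks += 1                   # 2. followers acknowledge (modeled synchronously)
        assert acks == len(self.peers)
        self.store[key] = {"value": value, "state": VALID}   # commit locally
        for p in self.peers:            # 3. broadcast Validations
            p.on_validation(key, value)

    def on_invalidation(self, key):
        obj = self.store.setdefault(key, {"value": None})
        obj["state"] = INVALID          # no stale reads can be served from now on

    def on_validation(self, key, value):
        self.store[key] = {"value": value, "state": VALID}

a, b, c = Replica("A"), Replica("B"), Replica("C")
for r, others in ((a, [b, c]), (b, [a, c]), (c, [a, b])):
    r.peers = others
b.write("x", 3)        # any replica can coordinate a write
print(a.read("x"))     # -> 3, served locally
```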
Concurrent writes = challenge
Challenge: how to efficiently order concurrent writes to an object?
Solution: store a logical timestamp (TS) along with each object (see the sketch below)
- Upon a write: the coordinator increments the TS and sends it with the Invalidations
- Upon receiving an Invalidation: a follower updates the object’s TS
- When two writes to the same object race: use the node ID to order them
(figure: write(A=3) with Inv(TS1) racing write(A=1) with Inv(TS4))
Broadcast + Invalidations + TS → high-performance writes
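One way to picture the ordering rule, as a sketch rather than the paper’s exact pseudocode: the per-object timestamp is a Lamport-style (version, node_id) pair compared lexicographically, so two racing Invalidations are ordered the same way at every replica.

```python
# Sketch: logical timestamps ordering concurrent writes to the same object.
from typing import NamedTuple

class TS(NamedTuple):
    version: int
    node_id: int          # breaks ties between writes racing at the same version

def next_ts(current: TS, coordinator_id: int) -> TS:
    """Coordinator bumps the object's version and stamps its own node id."""
    return TS(current.version + 1, coordinator_id)

def apply_invalidation(obj, inv_ts: TS):
    """Follower: accept an Invalidation only if it carries a higher timestamp."""
    if inv_ts > obj["ts"]:            # lexicographic (version, node_id) comparison
        obj["ts"] = inv_ts
        obj["state"] = "I"
        return True
    return False                      # a newer write already touched this object

# Two writes race from nodes 1 and 4 on an object currently at version 7:
obj = {"ts": TS(7, 0), "state": "V", "value": 0}
inv_a = next_ts(TS(7, 0), coordinator_id=1)   # TS(8, 1)
inv_b = next_ts(TS(7, 0), coordinator_id=4)   # TS(8, 4)
apply_invalidation(obj, inv_a)
apply_invalidation(obj, inv_b)                # wins the tie at every replica
print(obj["ts"])                              # TS(version=8, node_id=4)
```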
Writes in Hermes
Broadcast + Invalidations + TS
1. Decentralized
   - Fully distributed write ordering at the endpoints
2. Fully concurrent
   - Any replica can coordinate a write
   - Writes to different objects proceed in parallel
3. Fast
   - Commit in 1 RTT
   - Never abort
Awesome! But what about fault tolerance?
Handling faults in Hermes
Problem
- A failure in the middle of a write can permanently leave a replica in the Invalid state
  (figure: the Coordinator sends Inv(3,TS) for write(A=3) and fails; an Invalid follower cannot serve read(A))
Idea
- Allow any Invalidated replica to replay the write and unblock. How?
Insight: to replay a write, a replica needs
- the write’s original TS (for ordering)
- the write value
The TS is sent with the Invalidation, but the write value is not.
Solution: send the write value with the Invalidation → early value propagation
(figure: a follower replays the write by re-sending Inv(3,TS); replicas become Valid,
the write completes, and read(A) is served)
Early value propagation enables write replays
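The mechanics of a replay, sketched with illustrative names: because each Invalidation already carries both the value and the TS, any Invalidated follower that suspects the coordinator has failed can finish the write itself by re-issuing the same Invalidation/Validation round with the saved (value, TS). The membership and failure-detection machinery of the real protocol is elided.

```python
# Sketch: write replay via early value propagation.
from types import SimpleNamespace

def make_replica():
    return SimpleNamespace(store={})      # key -> {"state", "value", "ts"}

def on_invalidation(replica, key, value, ts):
    """Invalidations carry the value, so any follower can later replay the write."""
    obj = replica.store.setdefault(key, {"state": "V", "value": None, "ts": (0, 0)})
    if ts > obj["ts"]:                    # older/duplicate writes are ignored -> replay is idempotent
        obj.update(state="I", value=value, ts=ts)

def replay_write(follower, key, group):
    """An Invalidated follower finishes the blocked write with its saved value and TS."""
    obj = follower.store[key]
    assert obj["state"] == "I"
    value, ts = obj["value"], obj["ts"]
    for peer in group:                    # re-broadcast the same Invalidation (same TS)
        on_invalidation(peer, key, value, ts)
    for peer in group + [follower]:       # after the acks, Validate everywhere
        peer.store[key].update(state="V", value=value)

# The coordinator sent Inv(3, TS) to one follower and then crashed:
f1, f2 = make_replica(), make_replica()
on_invalidation(f1, "A", 3, (8, 1))      # f1 is stuck Invalid, but holds the value and TS
replay_write(f1, "A", [f2])              # f1 replays; both followers end up Valid with A=3
print(f1.store["A"], f2.store["A"])
```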
Evaluation
Evaluated protocols: ZAB, CRAQ, Hermes
State-of-the-art hardware testbed
- 5 servers
- 56 Gb/s InfiniBand NICs
- 2x 10-core Intel Xeon E5-2630v4 per server
KVS workload
- Uniform access distribution
- A million key-value pairs: 8 B keys, 32 B values
Performance
(figures: throughput in million requests/sec and write latency normalized to Hermes, both vs. % write ratio,
for “high-perf. writes + local reads”, “conc. writes + local reads”, and “local reads”;
annotations: 4x and 40% at a 5% write ratio, and 6x for write latency)
Write performance matters even at low write ratios
Hermes: highest throughput & lowest latency
Hermes recap
Broadcast + Invalidations + TS + early value propagation
Strong consistency
- through multiprocessor-inspired Invalidations
Fault tolerance
- write replays via early value propagation
High performance
- Local reads at all replicas
- High-performance writes: fast, decentralized, fully concurrent
(figure: write(A=3) at the Coordinator; Inv(3,TS) to the Followers; commit; all replicas Valid)
What about reliable txs? … 3rd primary contribution (1 slide)!
Reliable replicated transactions
Many tx workloads exhibit locality in their accesses
State-of-the-art datastores rely on static sharding
- Reliable txs regardless of access pattern
- Objects are randomly sharded onto fixed nodes
  - remote accesses to execute (e.g., tx: if (p) b++; adapted from FaSST [OSDI’16])
  - expensive distributed commit
→ costly txs that cannot exploit locality
Zeus – locality-aware reliable txs
- Each object has a node owner = data + exclusive write access; ownership changes dynamically
- The coordinator becomes the owner of all of a tx’s objects → single-node commit
- Ownership stays with the coordinator → future txs = local accesses
- Reliable ownership (1.5 RTT): alters replica placement and access levels
- Reliable commit
  - read-only txs: local from all replicas
  - fast write txs: pipelined, 1 RTT to commit
10s of millions of txs/sec & up to 2x the state of the art!
Two invalidating protocols!
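A rough sketch of the control flow this implies, with hypothetical helper names: the coordinator first pulls ownership of any object it does not own, and from then on the transaction, and future transactions touching the same objects, execute and commit locally. Zeus’s reliable ownership (1.5 RTT) and reliable commit (1 RTT, replicated) protocols are abstracted behind single calls here.

```python
# Sketch: locality-aware transactions -> acquire ownership once, then commit locally.
class Node:
    def __init__(self, name):
        self.name = name
        self.owned = {}                  # objects this node owns: key -> value

    def acquire_ownership(self, key, current_owner):
        """Stands in for the reliable ownership protocol (~1.5 RTT):
        moves the data and exclusive write access to this node."""
        self.owned[key] = current_owner.owned.pop(key)

    def run_tx(self, keys, tx_fn, owners):
        # Pull ownership of every object the tx touches that we do not already own.
        for k in keys:
            if k not in self.owned:
                self.acquire_ownership(k, owners[k])
                owners[k] = self
        # All objects are now local: execute and commit on a single node
        # (the commit itself is replicated to the followers, ~1 RTT, not shown).
        tx_fn(self.owned)

# The same node runs two txs on the same objects; only the first pays for
# ownership transfers, the second is entirely local.
a, b = Node("A"), Node("B")
b.owned.update(p=True, cnt=0)
owners = {"p": b, "cnt": b}

def bump_if_p(objs):                     # tx: if (p) cnt++
    if objs["p"]:
        objs["cnt"] += 1

a.run_tx(["p", "cnt"], bump_if_p, owners)
a.run_tx(["p", "cnt"], bump_if_p, owners)   # locality: no remote accesses this time
print(a.owned["cnt"])                       # -> 2
```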
Thesis summary
Replicated datastores powered by multiprocessor-inspired invalidating
protocols can deliver: strong consistency, fault tolerance, high performance
4 invalidating protocols → 3 most common replication uses in datastores
- High performance (10s–100s M ops/sec)
- Strong consistency under concurrency & faults (formally verified in TLA+)
Scale-out ccNUMA [Eurosys’18]: Galene protocol – performant read/write replication for skew
Hermes [ASPLOS’20]: Hermes protocol – fast reliable read/write replication
Zeus [Eurosys’21]: Zeus ownership, Zeus reliable commit – locality-aware reliable txs with dynamic sharding
Is this the end ??
Follow-up research
• The L2AW theorem [to be submitted]
• Hardware offloading
• Replication across datacenters
• Single-shot reliable writes from external clients
• Non-blocking reconfiguration on node crashes
…
Thank you! Questions?
Contenu connexe

Similaire à Invalidation-Based Protocols for Replicated Datastores

Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
Renato Lucindo
 
Socket programming with php
Socket programming with phpSocket programming with php
Socket programming with php
Elizabeth Smith
 
HTTP at your local BigCo
HTTP at your local BigCoHTTP at your local BigCo
HTTP at your local BigCo
pgriess
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
NAVER D2
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Cal Henderson
 
Clustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And AvailabilityClustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And Availability
ConSanFrancisco123
 
Network and distributed systems
Network and distributed systemsNetwork and distributed systems
Network and distributed systems
Sri Prasanna
 
Creating customized openSUSE versions with SUSE Studio
Creating customized openSUSE versions with SUSE StudioCreating customized openSUSE versions with SUSE Studio
Creating customized openSUSE versions with SUSE Studio
elliando dias
 

Similaire à Invalidation-Based Protocols for Replicated Datastores (20)

Knowledge share about scalable application architecture
Knowledge share about scalable application architectureKnowledge share about scalable application architecture
Knowledge share about scalable application architecture
 
What a Modern Database Enables_Srini Srinivasan.pdf
What a Modern Database Enables_Srini Srinivasan.pdfWhat a Modern Database Enables_Srini Srinivasan.pdf
What a Modern Database Enables_Srini Srinivasan.pdf
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
 
Sharing High-Performance Interconnects Across Multiple Virtual Machines
Sharing High-Performance Interconnects Across Multiple Virtual MachinesSharing High-Performance Interconnects Across Multiple Virtual Machines
Sharing High-Performance Interconnects Across Multiple Virtual Machines
 
Kafka & Hadoop in Rakuten
Kafka & Hadoop in RakutenKafka & Hadoop in Rakuten
Kafka & Hadoop in Rakuten
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
Socket programming with php
Socket programming with phpSocket programming with php
Socket programming with php
 
HTTP at your local BigCo
HTTP at your local BigCoHTTP at your local BigCo
HTTP at your local BigCo
 
Data center disaster recovery.ppt
Data center disaster recovery.ppt Data center disaster recovery.ppt
Data center disaster recovery.ppt
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
 
Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
 
Clustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And AvailabilityClustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And Availability
 
Network and distributed systems
Network and distributed systemsNetwork and distributed systems
Network and distributed systems
 
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
 
Tech Talk Series, Part 4: How do you achieve high availability in a MySQL env...
Tech Talk Series, Part 4: How do you achieve high availability in a MySQL env...Tech Talk Series, Part 4: How do you achieve high availability in a MySQL env...
Tech Talk Series, Part 4: How do you achieve high availability in a MySQL env...
 
Creating customized openSUSE versions with SUSE Studio
Creating customized openSUSE versions with SUSE StudioCreating customized openSUSE versions with SUSE Studio
Creating customized openSUSE versions with SUSE Studio
 
Scaling RDBMS on AWS- ClustrixDB @AWS Meetup 20160711
Scaling RDBMS on AWS- ClustrixDB @AWS Meetup 20160711Scaling RDBMS on AWS- ClustrixDB @AWS Meetup 20160711
Scaling RDBMS on AWS- ClustrixDB @AWS Meetup 20160711
 
MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011
 

Plus de Antonios Katsarakis

Plus de Antonios Katsarakis (7)

The L2AW theorem
The L2AW theoremThe L2AW theorem
The L2AW theorem
 
Zeus: Locality-aware Distributed Transactions [Eurosys '21 presentation]
Zeus: Locality-aware Distributed Transactions [Eurosys '21 presentation]Zeus: Locality-aware Distributed Transactions [Eurosys '21 presentation]
Zeus: Locality-aware Distributed Transactions [Eurosys '21 presentation]
 
Hermes Reliable Replication Protocol - Poster
Hermes Reliable Replication Protocol - Poster Hermes Reliable Replication Protocol - Poster
Hermes Reliable Replication Protocol - Poster
 
Scale-out ccNUMA - Eurosys'18
Scale-out ccNUMA - Eurosys'18Scale-out ccNUMA - Eurosys'18
Scale-out ccNUMA - Eurosys'18
 
Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)
 
Distributed Processing Frameworks
Distributed Processing FrameworksDistributed Processing Frameworks
Distributed Processing Frameworks
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 

Dernier

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 

Dernier (20)

To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
UiPath New York Community Day in-person event
UiPath New York Community Day in-person eventUiPath New York Community Day in-person event
UiPath New York Community Day in-person event
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 

Invalidation-Based Protocols for Replicated Datastores

  • 1. Invalidation-Based Protocols for Replicated Datastores Antonios Katsarakis Doctor of Philosophy T H E U N I V E R S I T Y O F E D I N B U R G H
  • 2. Data: in-memory, sharded across servers within a datacenter (DC) Offer a single-object read/write or multi-object transactions API Backbone of online services and cloud applications Must provide: High performance Fault tolerance Distributed datastores 2 distributed datastore
  • 3. Data: in-memory, sharded across servers within a datacenter (DC) Offer a single-object read/write or multi-object transactions API Backbone of online services and cloud applications Must provide: High performance Fault tolerance Distributed datastores 2 distributed datastore
  • 4. Data: in-memory, sharded across servers within a datacenter (DC) Offer a single-object read/write or multi-object transactions API Backbone of online services and cloud applications Must provide: High performance Fault tolerance Distributed datastores 2 distributed datastore Mandate data replication
  • 5. Performance a single node may not keep up with load Fault tolerance data remain available despite failures Typically 3 to 7 replicas Consistency Weak: performance but nasty surprises Strong: intuitive, broadest spectrum of apps Replication protocols - Strong consistency even under faults – if fault tolerant - Define actions to execute reads/writes or transactions (txs) à determine the datastore’s performance Replication 101 3 … … … … replication protocol
  • 6. Performance a single node may not keep up with load Fault tolerance data remain available despite failures Typically 3 to 7 replicas Consistency Weak: performance but nasty surprises Strong: intuitive, broadest spectrum of apps Replication protocols - Strong consistency even under faults – if fault tolerant - Define actions to execute reads/writes or transactions (txs) à determine the datastore’s performance Replication 101 3 … … … … replication protocol
  • 7. Performance a single node may not keep up with load Fault tolerance data remain available despite failures Typically 3 to 7 replicas Consistency Weak: performance but nasty surprises Strong: intuitive, broadest spectrum of apps Replication protocols - Strong consistency even under faults – if fault tolerant - Define actions to execute reads/writes or transactions (txs) à determine the datastore’s performance Replication 101 3 Can strongly consistent protocols offer fault tolerance and high performance? … … … … replication protocol
  • 8. Multiprocessor: coherence / HTM data replicated across caches Fault tolerance Performance via Invalidations (low-latency interconnect) - reads/writes: concurrency & speed - txs: fully exploit locality Strongly consistent replication 4 Datastores: replication protocols data replicated across nodes Fault tolerance Performance - reads/writes: sacrifice concurrency or speed - txs: cannot exploit locality Multiprocessor: coherence / HTM data replicated across caches Fault tolerance Performance via Invalidations (low-latency interconnect) - reads/writes: concurrency & speed - txs: fully exploit locality
  • 9. Multiprocessor: coherence / HTM data replicated across caches Fault tolerance Performance via Invalidations (low-latency interconnect) - reads/writes: concurrency & speed - txs: fully exploit locality Replication protocols inside a DC - Network: fast, remote direct memory access (RDMA) - Faults are rare within a replica group A server fails at most twice a year fault-free operation >> operation under faults Strongly consistent replication 4 Datastores: replication protocols data replicated across nodes Fault tolerance Performance - reads/writes: sacrifice concurrency or speed - txs: cannot exploit locality Multiprocessor: coherence / HTM data replicated across caches Fault tolerance Performance via Invalidations (low-latency interconnect) - reads/writes: concurrency & speed - txs: fully exploit locality
  • 10. Multiprocessor: coherence / HTM data replicated across caches Fault tolerance Performance via Invalidations (low-latency interconnect) - reads/writes: concurrency & speed - txs: fully exploit locality Replication protocols inside a DC - Network: fast, remote direct memory access (RDMA) - Faults are rare within a replica group A server fails at most twice a year fault-free operation >> operation under faults Strongly consistent replication 4 Datastores: replication protocols data replicated across nodes Fault tolerance Performance - reads/writes: sacrifice concurrency or speed - txs: cannot exploit locality Multiprocessor: coherence / HTM data replicated across caches Fault tolerance Performance via Invalidations (low-latency interconnect) - reads/writes: concurrency & speed - txs: fully exploit locality The common operation of replication protocols resembles the multiprocessor!
  • 11. Thesis overview 5 Adapting multiprocessor-inspired invalidating protocols to intra-DC replicated datastores enables: strong consistency, fault tolerance, high performance Primary contributions 4 invalidating protocols à 3 most common replication uses in datastores 1-slide summary N-slides 1-slide summary Scale-out ccNUMA [Eurosys’18] Galene protocol Performant read/write replication (skew) Zeus [Eurosys’21] Zeus ownership, Zeus reliable commit Replicated fault-tolerant distributed txs Hermes [ASPLOS’20] Hermes protocol Fast fault-tolerant read/write replication
  • 12. Performant read/write replication for skew 12 Many workloads exhibit skewed data accesses a few servers are overloaded, most are underutilized State-of-the-art skew mitigation distributes accesses across all servers & uses RDMA No locality: most requests need remote access à increased latency, bottlenecked by network b/w Symmetric caching all servers have a cache same hottest objects Throughput scales with numbers of servers Less network b/w: most requests served locally Challenge: efficiently keep caches consistent Existing protocols serialize writes @ physical point = hotspot Galene protocol invalidations + logical timestamps = fully distributed writes RDMA
  • 13. Performant read/write replication for skew 13 Many workloads exhibit skewed data accesses a few servers are overloaded, most are underutilized State-of-the-art skew mitigation distributes accesses across all servers & uses RDMA No locality: most requests need remote access à increased latency, bottlenecked by network b/w Symmetric caching all servers have a cache same hottest objects Throughput scales with numbers of servers Less network b/w: most requests served locally Challenge: efficiently keep caches consistent Existing protocols serialize writes @ physical point = hotspot Galene protocol invalidations + logical timestamps = fully distributed writes RDMA RDMA Symmetric caching and Galene
  • 14. Performant read/write replication for skew 14 Many workloads exhibit skewed data accesses a few servers are overloaded, most are underutilized State-of-the-art skew mitigation distributes accesses across all servers & uses RDMA No locality: most requests need remote access à increased latency, bottlenecked by network b/w Symmetric caching all servers have a cache same hottest objects Throughput scales with numbers of servers Less network b/w: most requests served locally Challenge: efficiently keep caches consistent Existing protocols serialize writes @ physical point = hotspot Galene protocol invalidations + logical timestamps = fully distributed writes RDMA RDMA Symmetric caching and Galene 100s millions ops/sec & up to 3x state-of-the-art!
  • 15. 15 Hmmm … Invalidating protocols good read/write performance when replicating under skew can maintain high read/write performance while providing fault tolerance? reliable = strongly consistent + fault tolerant 2nd primary contribution: Hermes!
  • 16. 16 Hmmm … Invalidating protocols good read/write performance when replicating under skew can maintain high read/write performance while providing fault tolerance? reliable = strongly consistent + fault tolerant 2nd primary contribution: Hermes! What is the issue of existing reliable protocols?
  • 17. Golden standard strong consistency and fault tolerance Low performance reads à inter-replica communication writes à multiple RTTs over the network Common-case performance (i.e., no faults) as bad as worst-case (under faults) 17 Paxos
  • 18. Golden standard strong consistency and fault tolerance Low performance reads à inter-replica communication writes à multiple RTTs over the network Common-case performance (i.e., no faults) as bad as worst-case (under faults) 18 Paxos State-of-the-art replication protocols exploit failure-free operation for performance
  • 19. 11 Performance of state-of-the-art protocols Leader ZAB replicas
  • 20. 20 Performance of state-of-the-art protocols Leader ZAB Leader Writes serialize on the leader à Low throughput Head Tail CRAQ Head Tail Writes traverse length of the chain à High latency write read bcast ucast Local reads form all replicas à Fast Local reads form all replicas à Fast
  • 21. 21 Performance of state-of-the-art protocols Leader ZAB Leader Writes serialize on the leader à Low throughput Head Tail CRAQ Head Tail Writes traverse length of the chain à High latency write read bcast ucast Fast reads but poor write performance Local reads form all replicas à Fast Local reads form all replicas à Fast
  • 22. 13 Goal: low latency + high throughput Reads Local from all replicas Writes Fast - Minimize network hops Decentralized - No serialization points Fully concurrent - Any replica can service a write Key protocol features for high performance Local reads from all replicas Head Tail Avoid long latencies
  • 23. 23 Goal: low-latency + high-throughput Reads Local from all replicas Writes Fast - Minimize network hops Decentralized - No serialization points Fully concurrent - Any replica can service Leader Avoid write serialization Key protocol features for high performance Local reads from all replicas
  • 24. 24 Goal: low-latency + high-throughput Reads Local from all replicas Writes Fast - Minimize network hops Decentralized - No serialization points Fully concurrent - Any replica can service a write Key protocol features for high performance Local reads from all replicas Fast, decentralized, fully concurrent writes
  • 25. 25 Goal: low-latency + high-throughput Reads Local from all replicas Writes Fast - Minimize network hops Decentralized - No serialization points Fully concurrent - Any replica can service a write Key protocol features for high performance Local reads from all replicas Fast, decentralized, fully concurrent writes Existing replication protocols are deficient
  • 26. Broadcast-based, invalidating replication protocol Inspired by multiprocessor cache-coherence protocols Fault-free operation: 1. Coordinator broadcasts Invalidations - Coordinator is a replica servicing a write Enter Hermes 26 States of A: Valid, Invalid write(A=3) Coordinator Followers I Invalidation I
  • 27. Broadcast-based, invalidating replication protocol Inspired by multiprocessor cache-coherence protocols Fault-free operation: 1. Coordinator broadcasts Invalidations - Coordinator is a replica servicing a write Enter Hermes 27 States of A: Valid, Invalid write(A=3) Coordinator Followers At this point, no stale reads can be served Strong consistency! I Invalidation I
  • 28. Broadcast-based, invalidating replication protocol Inspired by multiprocessor cache-coherence protocols Fault-free operation: 1. Coordinator broadcasts Invalidations 2. Followers Acknowledge invalidation 3. Coordinator broadcasts Validations - All replicas can now serve reads for this object Strongest consistency Linearizability Local reads from all replicas à valid objects = latest value Enter Hermes 28 States of A: Valid, Invalid write(A=3) Coordinator Followers V Validation V Ack Ack I Invalidation I V commit
  • 29. Broadcast-based, invalidating replication protocol Inspired by multiprocessor cache-coherence protocols Fault-free operation: 1. Coordinator broadcasts Invalidations 2. Followers Acknowledge invalidation 3. Coordinator broadcasts Validations - All replicas can now serve reads for this object Strongest consistency Linearizability Local reads from all replicas à valid objects = latest value Enter Hermes 29 States of A: Valid, Invalid write(A=3) Coordinator Followers What about concurrent writes? V Validation V Ack Ack I Invalidation I V commit
  • 30. Challenge How to efficiently order concurrent writes to an object? Solution Store a logical timestamp (TS) along with each object - Upon a write: coordinator increments TS and sends it with Invalidations - Upon receiving Invalidation: a follower updates the object’s TS - When two writes to the same object race: use node ID to order them Concurrent writes = challenge 30 write(A=3) write(A=1) Inv(TS1) Inv(TS4)
  • 31. Challenge How to efficiently order concurrent writes to an object? Solution Store a logical timestamp (TS) along with each object - Upon a write: coordinator increments TS and sends it with Invalidations - Upon receiving Invalidation: a follower updates the object’s TS - When two writes to the same object race: use node ID to order them Concurrent writes = challenge 31 write(A=3) write(A=1) Inv(TS1) Inv(TS4) Broadcast + Invalidations + TS à high performance writes
  • 32. 1. Decentralized Fully distributed write ordering at endpoints 2. Fully concurrent Any replica can coordinate a write Writes to different objects proceed in parallel 3. Fast Commit in 1 RTT Never abort Writes in Hermes 32 Broadcast + Invalidations + TS
  • 33. 1. Decentralized Fully distributed write ordering at endpoints 2. Fully concurrent Any replica can coordinate a write Writes to different objects proceed in parallel 3. Fast Commit in 1 RTT Never abort Writes in Hermes 33 Awesome! But what about fault tolerance? Broadcast + Invalidations + TS
  • 34. Problem A failure in the middle of a write can permanently leave a replica in Invalid state Idea Allow any Invalidated replica to replay the write and unblock. How? Insight: to replay a write need - Write’s original TS (for ordering) - Write value TS sent with Invalidation, but write value is not Solution: send write value with Invalidation à Early value propagation write(A=3) Coordinator Followers 34 Handling faults in Hermes read(A) Inv(TS) Coordinator fails I I
  • 35. Problem A failure in the middle of a write can permanently leave a replica in Invalid state Idea Allow any Invalidated replica to replay the write and unblock. How? Insight: to replay a write need - Write’s original TS (for ordering) - Write value TS sent with Invalidation, but write value is not Solution: send write value with Invalidation à early value propagation Handling faults in Hermes 35 Inv(3,TS) write(A=3) read(A) Coordinator fails I I Coordinator Followers
  • 36. Problem A failure in the middle of a write can permanently leave a replica in Invalid state Idea Allow any Invalidated replica to replay the write and unblock. How? Insight: to replay a write need - Write’s original TS (for ordering) - Write value TS sent with Invalidation, but write value is not Solution: send write value with Invalidation à early value propagation V V Inv(3,TS) completion write replay read(A) Handling faults in Hermes 36 Inv(3,TS) write(A=3) Coordinator fails I I Coordinator Followers
  • 37. Problem A failure in the middle of a write can permanently leave a replica in Invalid state Idea Allow any Invalidated replica to replay the write and unblock. How? Insight: to replay a write need - Write’s original TS (for ordering) - Write value TS sent with Invalidation, but write value is not Solution: send write value with Invalidation à early value propagation V V Inv(3,TS) completion write replay read(A) Handling faults in Hermes 37 Inv(3,TS) write(A=3) Early value propagation enables write replays Coordinator fails I I Coordinator Followers
  • 38. Evaluation 38 Evaluated protocols: - ZAB - CRAQ - Hermes State-of-the-art hardware testbed - 5 servers - 56 Gb/s InfiniBand NICs - 2x 10 core Intel Xeon E5-2630v4 per server KVS Workload - Uniform access distribution - Million key-value pairs: <8B keys, 32B values>
  • 39. Performance 39 Throughput high-perf. writes + local reads conc. writes + local reads local reads 4x 40% 5% Write Ratio Write Latency (normalized to Hermes) Million requests / sec Write performance matters even at low write ratios 6x % Write Ratio
  • 40. Performance 40 Throughput high-perf. writes + local reads conc. writes + local reads local reads 4x 40% 5% Write Ratio Write Latency (normalized to Hermes) Million requests / sec Write performance matters even at low write ratios 6x Hermes: highest throughput & lowest latency % Write Ratio
  • 41. Strong Consistency through multiprocessor-inspired Invalidations Fault-tolerance write replays via early value propagation High Performance Local reads at all replicas High performance writes Fast Decentralized Fully concurrent Hermes recap 41 V I write(A=3) commit Coordinator Followers Inv(3,TS) V I V Broadcast + Invalidations + TS + early value propagation
  • 42. Strong Consistency through multiprocessor-inspired Invalidations Fault-tolerance write replays via early value propagation High Performance Local reads at all replicas High performance writes Fast Decentralized Fully concurrent Hermes recap 42 V I write(A=3) commit Coordinator Followers Inv(3,TS) V I V Broadcast + Invalidations + TS + early value propagation What about reliable txs? … 3rd primary contribution (1-slide)!
• 49. Reliable replicated transactions
  Many tx workloads exhibit locality in their accesses, but state-of-the-art datastores rely on static sharding: objects are randomly sharded onto fixed nodes, so txs incur
  - remote accesses to execute
  - an expensive distributed commit
  [Example adapted from FaSST [OSDI'16]: tx: if (p) b++; under static sharding even this tx needs remote accesses and a distributed commit.]
  → costly txs that cannot exploit locality
  Zeus: locality-aware reliable txs, regardless of access pattern
  - Each object has a node owner = data + exclusive write access; ownership changes dynamically
  - The coordinator becomes the owner of all the tx's objects → single-node commit
  - Ownership stays with the coordinator → future txs turn into local accesses
  - Reliable ownership (1.5 RTT) alters replica placement and access levels
  - Reliable commit: read-only txs read locally at all replicas; fast write txs are pipelined, 1 RTT to commit
  10s of millions of txs/sec and up to 2x the state of the art!
  Two Invalidating protocols (Zeus ownership and Zeus reliable commit)!
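As a rough sketch of that flow (illustrative Python, with the ownership directory, reliable ownership, and reliable commit steps stubbed out as assumed helpers rather than Zeus' actual API):

```python
class ZeusCoordinator:
    def __init__(self, node_id, directory, local_store, net):
        self.id = node_id
        self.dir = directory      # object -> current owner (assumed lookup service)
        self.store = local_store  # objects this node owns: data + exclusive write access
        self.net = net

    def run_tx(self, tx_fn, objects):
        # 1. Reliable ownership (~1.5 RTT per remote object in Zeus): become the owner
        #    of every object the tx touches, altering replica placement / access levels.
        for obj in objects:
            if obj not in self.store:
                self.store[obj] = self.net.acquire_ownership(self.dir[obj], obj)
                self.dir[obj] = self.id
        # 2. Execute entirely locally: no remote reads or writes during execution.
        writes = tx_fn({obj: self.store[obj] for obj in objects})
        # 3. Reliable commit (pipelined, ~1 RTT): replicate the write set to backups
        #    from a single node instead of running a distributed commit across shards.
        self.net.replicate_commit(writes)
        self.store.update(writes)
        # Ownership stays here, so future txs on these objects execute as local accesses.
```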
• 51. Thesis summary
  Replicated datastores powered by multiprocessor-inspired invalidating protocols can deliver strong consistency, fault tolerance, and high performance.
  4 invalidating protocols → the 3 most common replication uses in datastores:
  - Scale-out ccNUMA [Eurosys'18]: the Galene protocol, for performant read/write replication under skew
  - Hermes [ASPLOS'20]: the Hermes protocol, for fast reliable read/write replication
  - Zeus [Eurosys'21]: Zeus ownership and Zeus reliable commit, for locality-aware reliable txs with dynamic sharding
  All provide:
  - High performance (10s–100s of millions of ops/sec)
  - Strong consistency under concurrency and faults (formally verified in TLA+)
  Is this the end?
• 53. Follow-up research
  • The L2AW theorem [to be submitted]
  • Hardware offloading
  • Replication across datacenters
  • Single-shot reliable writes from external clients
  • Non-blocking reconfiguration on node crashes
  …
  Thank you! Questions?