SlideShare une entreprise Scribd logo
1  sur  74
Télécharger pour lire hors ligne
Distributed Consensus:
Making Impossible
Possible
Heidi Howard
PhD Student @ University of Cambridge
heidi.howard@cl.cam.ac.uk
@heidiann360
hh360.user.srcf.net
Sometimes inconsistency is
not an option
• Distributed locking
• Financial services/ blockchain
• Safety critical systems
• Distributed scheduling and coordination
• Strongly consistent databases
Anything which requires guaranteed agreement
What is Consensus?
“The process by which we reach agreement over
system state between unreliable machines connected
by asynchronous networks”
A walk through history
We are going to take a journey
through the developments in
distributed consensus, spanning
3 decades.
Bob
FLP Result
off to a slippery start
Impossibility of distributed
consensus with one faulty process
Michael Fischer, Nancy Lynch
and Michael Paterson
ACM SIGACT-SIGMOD
Symposium on Principles of
Database Systems
1983
FLP
We cannot guarantee agreement in an asynchronous
system where even one host might fail.
Why?
We cannot reliably detect failures. We cannot know
for sure the difference between a slow host/network
and a failed host
Note: We can still guarantee safety, the issue limited
to guaranteeing liveness.
Solution to FLP
In practice:
We accept that sometimes the system will not be
available. We mitigate this using timers and backoffs.
In theory:
We make weaker assumptions about the synchrony
of the system e.g. messages arrive within a year.
Viewstamped
Replication
the forgotten algorithm
Viewstamped Replication Revisited
Barbara Liskov and James
Cowling
MIT Tech Report
MIT-CSAIL-TR-2012-021
Not the original from 1988, but recommended
Viewstamped Replication
(Revisited)
In my view, the pioneer on the field of consensus.
Let one node be the ‘master’, rotating when failures
occur. Replicate requests for a state machine.
Now considered a variant of SMR + Multi-Paxos.
Paxos
Lamport’s consensus algorithm
The Part-Time Parliament
Leslie Lamport
ACM Transactions on Computer Systems
May 1998
Paxos
The textbook consensus algorithm for reaching
agreement on a single value.
• two phase process: promise and commit
• each requiring majority agreement (aka quorums)
• 2 RRTs to agreement a single value
Paxos Example -
Failure Free
1 2
3
P:
C:
P:
C:
P:
C:
1 2
3
P:
C:
P:
C:
P:
C:
B
Incoming request from Bob
1 2
3
P:
C:
P: 13
C:
P:
C:
B
Promise (13) ?
Phase 1
Promise (13) ?
1 2
3
P: 13
C:
OKOK
P: 13
C:
P: 13
C:
Phase 1
1 2
3
P: 13
C: 13, B
P: 13
C:
P: 13
C:
Phase 2
Commit (13, ) ?B Commit (13, ) ?B
1 2
3
P: 13
C: 13, B
P: 13
C: 13,
P: 13
C: 13,
Phase 2
B B
OKOK
1 2
3
P: 13
C: 13, B
P: 13
C: 13,
P: 13
C: 13, B B
OK
Bob is granted the lock
Paxos Example -
Node Failure
1 2
3
P:
C:
P:
C:
P:
C:
1 2
3
P:
C:
P: 13
C:
P:
C:
Promise (13) ?
Phase 1
B
Incoming request from Bob
Promise (13) ?
1 2
3
P: 13
C:
P: 13
C:
P: 13
C:
Phase 1
B
OK
OK
1 2
3
P: 13
C:
P: 13
C: 13,
P: 13
C:
Phase 2
Commit (13, ) ?B
B
1 2
3
P: 13
C:
P: 13
C: 13,
P: 13
C: 13,
Phase 2
B
B
1 2
3
P: 13
C:
P: 13
C: 13,
P: 13
C: 13,
Alice would also like the lock
B
B
A
1 2
3
P: 13
C:
P: 13
C: 13,
P: 13
C: 13,
Alice would also like the lock
B
B
A
1 2
3
P: 22
C:
P: 13
C: 13,
P: 13
C: 13,
Phase 1
B
B
A
Promise (22) ?
1 2
3
P: 22
C:
P: 13
C: 13,
P: 22
C: 13,
Phase 1
B
B
A
OK(13, )B
1 2
3
P: 22
C: 22,
P: 13
C: 13,
P: 22
C: 13,
Phase 2
B
B
A
Commit (22, ) ?B
B
1 2
3
P: 22
C: 22,
P: 13
C: 13,
P: 22
C: 22,
Phase 2
B
B
OK
B
NO
Paxos Example -
Conflict
1 2
3
P: 13
C:
P: 13
C:
P: 13
C:
B
Phase 1 - Bob
1 2
3
P: 21
C:
P: 21
C:
P: 21
C:
B
Phase 1 - Alice
A
1 2
3
P: 33
C:
P: 33
C:
P: 33
C:
B
Phase 1 - Bob
A
1 2
3
P: 41
C:
P: 41
C:
P: 41
C:
B
Phase 1 - Alice
A
Paxos
Clients must wait two round trips (2 RTT) to the
majority of nodes. Sometimes longer.
The system will continue as long as a majority of
nodes are up
Multi-Paxos
Lamport’s leader-driven consensus algorithm
Paxos Made Moderately Complex
Robbert van Renesse and Deniz
Altinbuken
ACM Computing Surveys
April 2015
Not the original, but highly recommended
Multi-Paxos
Lamport’s insight:
Phase 1 is not specific to the request so can be done
before the request arrives and can be reused.
Implication:
Bob now only has to wait one RTT
State Machine
Replication
fault-tolerant services using consensus
Implementing Fault-Tolerant
Services Using the State Machine
Approach: A Tutorial
Fred Schneider
ACM Computing Surveys
1990
State Machine Replication
A general technique for making a service, such as a
database, fault-tolerant.
Application
Client Client
Application
Application
Application
Client
Client
Network
Consensus
Consensus
Consensus
Consensus
Consensus
CAP Theorem
You cannot have your cake and eat it
CAP Theorem
Eric Brewer
Presented at Symposium on
Principles of Distributed
Computing, 2000
Consistency, Availability &
Partition Tolerance - Pick Two
1 2
3 4
B C
Paxos Made Live
How google uses Paxos
Paxos Made Live - An Engineering
Perspective
Tushar Chandra, Robert Griesemer
and Joshua Redstone
ACM Symposium on Principles of
Distributed Computing
2007
Isn’t this a solved problem?
“There are significant gaps between the description
of the Paxos algorithm and the needs of a real-world
system.
In order to build a real-world system, an expert needs
to use numerous ideas scattered in the literature and
make several relatively small protocol extensions.
The cumulative effort will be substantial and the final
system will be based on an unproven protocol.”
Paxos Made Live
Paxos made live documents the challenges in
constructing Chubby, a distributed coordination
service, built using Multi-Paxos and SMR.
Challenges
• Handling disk failure and corruption
• Dealing with limited storage capacity
• Effectively handling read-only requests
• Dynamic membership & reconfiguration
• Supporting transactions
• Verifying safety of the implementation
Fast Paxos
Like Multi-Paxos, but faster
Fast Paxos
Leslie Lamport
Microsoft Research Tech Report
MSR-TR-2005-112
Fast Paxos
Paxos: Any node can commit a value in 2 RTTs
Multi-Paxos: The leader node can commit a value in
1 RTT
But, what about any node committing a value in 1
RTT?
Fast Paxos
We can bypass the leader node for many operations,
so any node can commit a value in 1 RTT.
However, we must increase the size of the quorum.
Zookeeper
The open source solution
Zookeeper: wait-free coordination
for internet-scale systems
Hunt et al
USENIX ATC 2010
Code: zookeeper.apache.org
Zookeeper
Consensus for the masses.
It utilizes and extends Multi-Paxos for
strong consistency.
Unlike “Paxos made live”, this is
clearly discussed and openly
available.
Egalitarian Paxos
Don’t restrict yourself unnecessarily
There Is More Consensus in
Egalitarian Parliaments
Iulian Moraru, David G. Andersen,
Michael Kaminsky
SOSP 2013
also see Generalized Consensus and Paxos
Egalitarian Paxos
The basis of SMR is that every replica of an
application receives the same commands in the
same order.
However, sometimes the ordering can be relaxed…
C=1 B? C=C+1 C? B=0 B=C
C=1 B?
C=C+1
C?
B=0
B=C
Partial Ordering
Total Ordering
C=1 B? C=C+1 C? B=0 B=C
Many possible orderings
B? C=C+1 C?B=0 B=CC=1
B?C=C+1 C? B=0 B=CC=1
B? C=C+1 C? B=0 B=CC=1
Egalitarian Paxos
Allow requests to be out-of-order if they are
commutative.
Conflict becomes much less common.
Works well in combination with Fast Paxos.
Raft Consensus
Paxos made understandable
In Search of an Understandable
Consensus Algorithm
Diego Ongaro and John
Ousterhout
USENIX Annual Technical
Conference
2014
Raft
Raft has taken the wider community by storm. Largely,
due to its understandable description.
It’s another variant of SMR with Multi-Paxos.
Key features:
• Really strong leadership - all other nodes are passive
• Various optimizations - e.g. dynamic membership
and log compaction
Follower Candidate Leader
Startup/
Restart
Timeout Win
Timeout
Step down
Step down
Ios
Why do things yourself, when you can delegate it?
to appear
Ios
The issue with leader-driven algorithms like
Viewstamp Replication, Multi-Paxos, Zookeeper and
Raft is that throughput is limited to one node.
Ios allows a leader to safely and dynamically
delegate their responsibilities to other nodes in the
system.
Flexible Paxos
Paxos made scalable
Flexible Paxos: Quorum
intersection revisited
Heidi Howard, Dahlia Malkhi,
Alexander Spiegelman
ArXiv:1608.06696
Majorities are not needed
Usually, we use require majorities to agree so we can
guarantee that all quorums (groups) intersect.
This work shows that not all quorums need to
intersect. Only the ones used for replication and
leader election.
This applies to all algorithms in this class: Paxos,
Viewstamped Replication, Zookeeper, Raft etc..
The road we travelled
• 2 theoretical results: FLP & Flexible Paxos
• 2 popular ideas: CAP & Paxos made live
• 1 replication method: State machine Replication
• 8 consensus algorithms: Viewstamped
Replication, Paxos, Multi-Paxos, Fast Paxos,
Zookeeper, Egalitarian Paxos, Raft & Ios
How strong is the
leadership?
Strong
Leadership Leaderless
Paxos
Egalitarian
Paxos
Raft Viewstamped
Replication
Ios
Multi-Paxos
Fast Paxos
Leader with
Delegation
Leader only
when needed
Leader driven
Zookeeper
Who is the winner?
Depends on the award:
• Best for minimum latency: Viewstamped
Replication
• Most widely used open source project: Zookeeper
• Easiest to understand: Raft
• Best for WANs: Egalitarian Paxos
Future
1. More scalable and performant consensus
algorithms utilizing Flexible Paxos.
2. A clearer understanding of consensus and better
explained consensus algorithms.
3. Achieving consensus in challenge settings such
as geo-replicated systems.
Stops we drove passed
We have seen one path through history, but many
more exist.
• Alternative replication techniques e.g. chain
replication and primary backup replication
• Alternative failure models e.g. nodes acting
maliciously
• Alternative domains e.g. sensor networks, mobile
networks, between cores
Summary
Do not be discouraged by impossibility
results and dense abstract academic papers.
Don’t give up on consistency. Consensus is
achievable, even performant and scalable (if
done correctly)
Find the right algorithm for your specific
domain.
heidi.howard@cl.cam.ac.uk
@heidiann360

Contenu connexe

En vedette

Spark Stream and SEEP
Spark Stream and SEEPSpark Stream and SEEP
Spark Stream and SEEPAmir Payberah
 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module ProgrammingAmir Payberah
 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and SpannerAmir Payberah
 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2Amir Payberah
 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Amir Payberah
 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2Amir Payberah
 
File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2Amir Payberah
 
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformAmir Payberah
 
Data Intensive Computing Frameworks
Data Intensive Computing FrameworksData Intensive Computing Frameworks
Data Intensive Computing FrameworksAmir Payberah
 
The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformAmir Payberah
 
File System Interface
File System InterfaceFile System Interface
File System InterfaceAmir Payberah
 

En vedette (20)

Spark Stream and SEEP
Spark Stream and SEEPSpark Stream and SEEP
Spark Stream and SEEP
 
MapReduce
MapReduceMapReduce
MapReduce
 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module Programming
 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and Spanner
 
Security
SecuritySecurity
Security
 
Protection
ProtectionProtection
Protection
 
IO Systems
IO SystemsIO Systems
IO Systems
 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Main Memory - Part2
Main Memory - Part2Main Memory - Part2
Main Memory - Part2
 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2
 
Storage
StorageStorage
Storage
 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2
 
File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2
 
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform
 
Data Intensive Computing Frameworks
Data Intensive Computing FrameworksData Intensive Computing Frameworks
Data Intensive Computing Frameworks
 
The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics Platform
 
Paxos
PaxosPaxos
Paxos
 
File System Interface
File System InterfaceFile System Interface
File System Interface
 
Deadlocks
DeadlocksDeadlocks
Deadlocks
 

Similaire à Distributed Consensus: Making Impossible Possible [Revised]

Distributed Consensus: Making Impossible Possible by Heidi howard
Distributed Consensus: Making Impossible Possible by Heidi howardDistributed Consensus: Making Impossible Possible by Heidi howard
Distributed Consensus: Making Impossible Possible by Heidi howardJ On The Beach
 
Distributed Consensus: Making the Impossible Possible
Distributed Consensus: Making the Impossible PossibleDistributed Consensus: Making the Impossible Possible
Distributed Consensus: Making the Impossible PossibleC4Media
 
Beyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusBeyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusRebecca Bilbro
 
CAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesCAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesYoav Francis
 
CAP Theorem and Split Brain Syndrome
CAP Theorem and Split Brain SyndromeCAP Theorem and Split Brain Syndrome
CAP Theorem and Split Brain SyndromeDilum Bandara
 
[Paperreading] Paxos made easy (by sen han)
[Paperreading]  Paxos made easy (by sen han)[Paperreading]  Paxos made easy (by sen han)
[Paperreading] Paxos made easy (by sen han)PingCAP
 
Algorand Consensus Algorithm
Algorand Consensus AlgorithmAlgorand Consensus Algorithm
Algorand Consensus AlgorithmVanessa Lošić
 
Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goalskamaelian
 
Blockchain consensus algorithms
Blockchain consensus algorithmsBlockchain consensus algorithms
Blockchain consensus algorithmsAnurag Dashputre
 
Beyond the RTOS: A Better Way to Design Real-Time Embedded Software
Beyond the RTOS: A Better Way to Design Real-Time Embedded SoftwareBeyond the RTOS: A Better Way to Design Real-Time Embedded Software
Beyond the RTOS: A Better Way to Design Real-Time Embedded SoftwareQuantum Leaps, LLC
 
Distributed computing for new bloods
Distributed computing for new bloodsDistributed computing for new bloods
Distributed computing for new bloodsRaymond Tay
 
unit3consesence.pptx
unit3consesence.pptxunit3consesence.pptx
unit3consesence.pptxGopalSB
 
Maintaining data consistency in
Maintaining data consistency inMaintaining data consistency in
Maintaining data consistency inIMPULSE_TECHNOLOGY
 
Distributed Reactive Services with Reactor & Spring - Stéphane Maldini
Distributed Reactive Services with Reactor & Spring - Stéphane MaldiniDistributed Reactive Services with Reactor & Spring - Stéphane Maldini
Distributed Reactive Services with Reactor & Spring - Stéphane MaldiniVMware Tanzu
 
FIWARE Data Management in High Availability
FIWARE Data Management in High AvailabilityFIWARE Data Management in High Availability
FIWARE Data Management in High AvailabilityFederico Michele Facca
 

Similaire à Distributed Consensus: Making Impossible Possible [Revised] (20)

Distributed Consensus: Making Impossible Possible by Heidi howard
Distributed Consensus: Making Impossible Possible by Heidi howardDistributed Consensus: Making Impossible Possible by Heidi howard
Distributed Consensus: Making Impossible Possible by Heidi howard
 
Distributed Consensus: Making the Impossible Possible
Distributed Consensus: Making the Impossible PossibleDistributed Consensus: Making the Impossible Possible
Distributed Consensus: Making the Impossible Possible
 
Beyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusBeyond Off the-Shelf Consensus
Beyond Off the-Shelf Consensus
 
Hashgraph as Code
Hashgraph as CodeHashgraph as Code
Hashgraph as Code
 
CAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesCAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and Practices
 
CAP Theorem and Split Brain Syndrome
CAP Theorem and Split Brain SyndromeCAP Theorem and Split Brain Syndrome
CAP Theorem and Split Brain Syndrome
 
9X5u87KWa267pP7aGX3K
9X5u87KWa267pP7aGX3K9X5u87KWa267pP7aGX3K
9X5u87KWa267pP7aGX3K
 
[Paperreading] Paxos made easy (by sen han)
[Paperreading]  Paxos made easy (by sen han)[Paperreading]  Paxos made easy (by sen han)
[Paperreading] Paxos made easy (by sen han)
 
Algorand Consensus Algorithm
Algorand Consensus AlgorithmAlgorand Consensus Algorithm
Algorand Consensus Algorithm
 
Haystack + DASH7 Security
Haystack + DASH7 SecurityHaystack + DASH7 Security
Haystack + DASH7 Security
 
Ch3-2
Ch3-2Ch3-2
Ch3-2
 
Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goals
 
Blockchain consensus algorithms
Blockchain consensus algorithmsBlockchain consensus algorithms
Blockchain consensus algorithms
 
osi-oss-dbs.pptx
osi-oss-dbs.pptxosi-oss-dbs.pptx
osi-oss-dbs.pptx
 
Beyond the RTOS: A Better Way to Design Real-Time Embedded Software
Beyond the RTOS: A Better Way to Design Real-Time Embedded SoftwareBeyond the RTOS: A Better Way to Design Real-Time Embedded Software
Beyond the RTOS: A Better Way to Design Real-Time Embedded Software
 
Distributed computing for new bloods
Distributed computing for new bloodsDistributed computing for new bloods
Distributed computing for new bloods
 
unit3consesence.pptx
unit3consesence.pptxunit3consesence.pptx
unit3consesence.pptx
 
Maintaining data consistency in
Maintaining data consistency inMaintaining data consistency in
Maintaining data consistency in
 
Distributed Reactive Services with Reactor & Spring - Stéphane Maldini
Distributed Reactive Services with Reactor & Spring - Stéphane MaldiniDistributed Reactive Services with Reactor & Spring - Stéphane Maldini
Distributed Reactive Services with Reactor & Spring - Stéphane Maldini
 
FIWARE Data Management in High Availability
FIWARE Data Management in High AvailabilityFIWARE Data Management in High Availability
FIWARE Data Management in High Availability
 

Dernier

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 

Dernier (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 

Distributed Consensus: Making Impossible Possible [Revised]

  • 1. Distributed Consensus: Making Impossible Possible Heidi Howard PhD Student @ University of Cambridge heidi.howard@cl.cam.ac.uk @heidiann360 hh360.user.srcf.net
  • 2. Sometimes inconsistency is not an option • Distributed locking • Financial services/ blockchain • Safety critical systems • Distributed scheduling and coordination • Strongly consistent databases Anything which requires guaranteed agreement
  • 3. What is Consensus? “The process by which we reach agreement over system state between unreliable machines connected by asynchronous networks”
  • 4.
  • 5. A walk through history We are going to take a journey through the developments in distributed consensus, spanning 3 decades. Bob
  • 6. FLP Result off to a slippery start Impossibility of distributed consensus with one faulty process Michael Fischer, Nancy Lynch and Michael Paterson ACM SIGACT-SIGMOD Symposium on Principles of Database Systems 1983
  • 7. FLP We cannot guarantee agreement in an asynchronous system where even one host might fail. Why? We cannot reliably detect failures. We cannot know for sure the difference between a slow host/network and a failed host Note: We can still guarantee safety, the issue limited to guaranteeing liveness.
  • 8. Solution to FLP In practice: We accept that sometimes the system will not be available. We mitigate this using timers and backoffs. In theory: We make weaker assumptions about the synchrony of the system e.g. messages arrive within a year.
  • 9. Viewstamped Replication the forgotten algorithm Viewstamped Replication Revisited Barbara Liskov and James Cowling MIT Tech Report MIT-CSAIL-TR-2012-021 Not the original from 1988, but recommended
  • 10. Viewstamped Replication (Revisited) In my view, the pioneer on the field of consensus. Let one node be the ‘master’, rotating when failures occur. Replicate requests for a state machine. Now considered a variant of SMR + Multi-Paxos.
  • 11. Paxos Lamport’s consensus algorithm The Part-Time Parliament Leslie Lamport ACM Transactions on Computer Systems May 1998
  • 12. Paxos The textbook consensus algorithm for reaching agreement on a single value. • two phase process: promise and commit • each requiring majority agreement (aka quorums) • 2 RRTs to agreement a single value
  • 16. 1 2 3 P: C: P: 13 C: P: C: B Promise (13) ? Phase 1 Promise (13) ?
  • 17. 1 2 3 P: 13 C: OKOK P: 13 C: P: 13 C: Phase 1
  • 18. 1 2 3 P: 13 C: 13, B P: 13 C: P: 13 C: Phase 2 Commit (13, ) ?B Commit (13, ) ?B
  • 19. 1 2 3 P: 13 C: 13, B P: 13 C: 13, P: 13 C: 13, Phase 2 B B OKOK
  • 20. 1 2 3 P: 13 C: 13, B P: 13 C: 13, P: 13 C: 13, B B OK Bob is granted the lock
  • 23. 1 2 3 P: C: P: 13 C: P: C: Promise (13) ? Phase 1 B Incoming request from Bob Promise (13) ?
  • 24. 1 2 3 P: 13 C: P: 13 C: P: 13 C: Phase 1 B OK OK
  • 25. 1 2 3 P: 13 C: P: 13 C: 13, P: 13 C: Phase 2 Commit (13, ) ?B B
  • 26. 1 2 3 P: 13 C: P: 13 C: 13, P: 13 C: 13, Phase 2 B B
  • 27. 1 2 3 P: 13 C: P: 13 C: 13, P: 13 C: 13, Alice would also like the lock B B A
  • 28. 1 2 3 P: 13 C: P: 13 C: 13, P: 13 C: 13, Alice would also like the lock B B A
  • 29. 1 2 3 P: 22 C: P: 13 C: 13, P: 13 C: 13, Phase 1 B B A Promise (22) ?
  • 30. 1 2 3 P: 22 C: P: 13 C: 13, P: 22 C: 13, Phase 1 B B A OK(13, )B
  • 31. 1 2 3 P: 22 C: 22, P: 13 C: 13, P: 22 C: 13, Phase 2 B B A Commit (22, ) ?B B
  • 32. 1 2 3 P: 22 C: 22, P: 13 C: 13, P: 22 C: 22, Phase 2 B B OK B NO
  • 34. 1 2 3 P: 13 C: P: 13 C: P: 13 C: B Phase 1 - Bob
  • 35. 1 2 3 P: 21 C: P: 21 C: P: 21 C: B Phase 1 - Alice A
  • 36. 1 2 3 P: 33 C: P: 33 C: P: 33 C: B Phase 1 - Bob A
  • 37. 1 2 3 P: 41 C: P: 41 C: P: 41 C: B Phase 1 - Alice A
  • 38. Paxos Clients must wait two round trips (2 RTT) to the majority of nodes. Sometimes longer. The system will continue as long as a majority of nodes are up
  • 39. Multi-Paxos Lamport’s leader-driven consensus algorithm Paxos Made Moderately Complex Robbert van Renesse and Deniz Altinbuken ACM Computing Surveys April 2015 Not the original, but highly recommended
  • 40. Multi-Paxos Lamport’s insight: Phase 1 is not specific to the request so can be done before the request arrives and can be reused. Implication: Bob now only has to wait one RTT
  • 41. State Machine Replication fault-tolerant services using consensus Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial Fred Schneider ACM Computing Surveys 1990
  • 42. State Machine Replication A general technique for making a service, such as a database, fault-tolerant. Application Client Client
  • 43.
  • 45.
  • 46. CAP Theorem You cannot have your cake and eat it CAP Theorem Eric Brewer Presented at Symposium on Principles of Distributed Computing, 2000
  • 47. Consistency, Availability & Partition Tolerance - Pick Two 1 2 3 4 B C
  • 48. Paxos Made Live How google uses Paxos Paxos Made Live - An Engineering Perspective Tushar Chandra, Robert Griesemer and Joshua Redstone ACM Symposium on Principles of Distributed Computing 2007
  • 49. Isn’t this a solved problem? “There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.”
  • 50. Paxos Made Live Paxos made live documents the challenges in constructing Chubby, a distributed coordination service, built using Multi-Paxos and SMR.
  • 51. Challenges • Handling disk failure and corruption • Dealing with limited storage capacity • Effectively handling read-only requests • Dynamic membership & reconfiguration • Supporting transactions • Verifying safety of the implementation
  • 52. Fast Paxos Like Multi-Paxos, but faster Fast Paxos Leslie Lamport Microsoft Research Tech Report MSR-TR-2005-112
  • 53. Fast Paxos Paxos: Any node can commit a value in 2 RTTs Multi-Paxos: The leader node can commit a value in 1 RTT But, what about any node committing a value in 1 RTT?
  • 54. Fast Paxos We can bypass the leader node for many operations, so any node can commit a value in 1 RTT. However, we must increase the size of the quorum.
  • 55. Zookeeper The open source solution Zookeeper: wait-free coordination for internet-scale systems Hunt et al USENIX ATC 2010 Code: zookeeper.apache.org
  • 56. Zookeeper Consensus for the masses. It utilizes and extends Multi-Paxos for strong consistency. Unlike “Paxos made live”, this is clearly discussed and openly available.
  • 57. Egalitarian Paxos Don’t restrict yourself unnecessarily There Is More Consensus in Egalitarian Parliaments Iulian Moraru, David G. Andersen, Michael Kaminsky SOSP 2013 also see Generalized Consensus and Paxos
  • 58. Egalitarian Paxos The basis of SMR is that every replica of an application receives the same commands in the same order. However, sometimes the ordering can be relaxed…
  • 59. C=1 B? C=C+1 C? B=0 B=C C=1 B? C=C+1 C? B=0 B=C Partial Ordering Total Ordering
  • 60. C=1 B? C=C+1 C? B=0 B=C Many possible orderings B? C=C+1 C?B=0 B=CC=1 B?C=C+1 C? B=0 B=CC=1 B? C=C+1 C? B=0 B=CC=1
  • 61. Egalitarian Paxos Allow requests to be out-of-order if they are commutative. Conflict becomes much less common. Works well in combination with Fast Paxos.
  • 62. Raft Consensus Paxos made understandable In Search of an Understandable Consensus Algorithm Diego Ongaro and John Ousterhout USENIX Annual Technical Conference 2014
  • 63. Raft Raft has taken the wider community by storm. Largely, due to its understandable description. It’s another variant of SMR with Multi-Paxos. Key features: • Really strong leadership - all other nodes are passive • Various optimizations - e.g. dynamic membership and log compaction
  • 64. Follower Candidate Leader Startup/ Restart Timeout Win Timeout Step down Step down
  • 65. Ios Why do things yourself, when you can delegate it? to appear
  • 66. Ios The issue with leader-driven algorithms like Viewstamp Replication, Multi-Paxos, Zookeeper and Raft is that throughput is limited to one node. Ios allows a leader to safely and dynamically delegate their responsibilities to other nodes in the system.
  • 67. Flexible Paxos Paxos made scalable Flexible Paxos: Quorum intersection revisited Heidi Howard, Dahlia Malkhi, Alexander Spiegelman ArXiv:1608.06696
  • 68. Majorities are not needed Usually, we use require majorities to agree so we can guarantee that all quorums (groups) intersect. This work shows that not all quorums need to intersect. Only the ones used for replication and leader election. This applies to all algorithms in this class: Paxos, Viewstamped Replication, Zookeeper, Raft etc..
  • 69. The road we travelled • 2 theoretical results: FLP & Flexible Paxos • 2 popular ideas: CAP & Paxos made live • 1 replication method: State machine Replication • 8 consensus algorithms: Viewstamped Replication, Paxos, Multi-Paxos, Fast Paxos, Zookeeper, Egalitarian Paxos, Raft & Ios
  • 70. How strong is the leadership? Strong Leadership Leaderless Paxos Egalitarian Paxos Raft Viewstamped Replication Ios Multi-Paxos Fast Paxos Leader with Delegation Leader only when needed Leader driven Zookeeper
  • 71. Who is the winner? Depends on the award: • Best for minimum latency: Viewstamped Replication • Most widely used open source project: Zookeeper • Easiest to understand: Raft • Best for WANs: Egalitarian Paxos
  • 72. Future 1. More scalable and performant consensus algorithms utilizing Flexible Paxos. 2. A clearer understanding of consensus and better explained consensus algorithms. 3. Achieving consensus in challenge settings such as geo-replicated systems.
  • 73. Stops we drove passed We have seen one path through history, but many more exist. • Alternative replication techniques e.g. chain replication and primary backup replication • Alternative failure models e.g. nodes acting maliciously • Alternative domains e.g. sensor networks, mobile networks, between cores
  • 74. Summary Do not be discouraged by impossibility results and dense abstract academic papers. Don’t give up on consistency. Consensus is achievable, even performant and scalable (if done correctly) Find the right algorithm for your specific domain. heidi.howard@cl.cam.ac.uk @heidiann360