SlideShare a Scribd company logo
1 of 73
Distributed Consensus:
Why Can't We All Just Agree?
Heidi Howard
PhD Student @ University of Cambridge
heidi.howard@cl.cam.ac.uk
@heidiann360
hh360.user.srcf.net
InfoQ.com: News & Community Site
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
future-distributed-consensus
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
Sometimes inconsistency is not an option
• Distributed locking
• Safety critical systems
• Distributed scheduling
• Strongly consistent databases
• Blockchain
Anything which requires guaranteed agreement
• Leader election
• Orchestration services
• Distributed file systems
• Coordination & configuration
• Strongly consistent databases
What is Distributed Consensus?
“The process of reaching agreement over
state between unreliable hosts connected
by unreliable networks, all operating
asynchronously”
A walk through time
We are going to take a journey through the developments in
distributed consensus, spanning over three decades. Stops
include:
• FLP Result & CAP Theorem
• Viewstamped Replication, Paxos & Multi-Paxos
• State Machine Replication
• Paxos Made Live, Zookeeper & Raft
• Flexible Paxos
Bob
Fischer, Lynch & Paterson Result
We begin with a slippery start
Impossibility of distributed
consensus with one faulty process
Michael Fischer, Nancy Lynch
and Michael Paterson
ACM SIGACT-SIGMOD Symposium
on Principles of Database Systems
1983
FLP Result
We cannot guarantee agreement in an asynchronous system where even
one host might fail.
Why?
We cannot reliably detect failures. We cannot know for sure the difference
between a slow host/network and a failed host
Note: We can still guarantee safety, the issue limited to guaranteeing
liveness.
Solution to FLP
In practice:
We approximate reliable failure detectors using heartbeats and timers. We
accept that sometimes the service will not be available (when it could be).
In theory:
We make weak assumptions about the synchrony of the system e.g.
messages arrive within a year.
Viewstamped Replication
the forgotten algorithm
Viewstamped Replication Revisited
Barbara Liskov and James Cowling
MIT Tech Report
MIT-CSAIL-TR-2012-021
Not the original from 1988, but recommended
Viewstamped Replication
In my view, the pioneering algorithm on the field of distributed consensus.
Approach: Select one node to be the ‘master’. The master is responsible for
replicating decisions. Once a decision has been replicated onto the majority
of nodes then it is commit.
We rotate the master when the old master fails with agreement from the
majority of nodes.
Paxos
Lamport’s consensus algorithm
The Part-Time Parliament
Leslie Lamport
ACM Transactions on Computer Systems
May 1998
Paxos
The textbook algorithm for reaching consensus on a single value.
• two phase process: promise and commit
• each requiring majority agreement (aka quorums)
Paxos Example -
Failure Free
1 2
3
P:
C:
P:
C:
P:
C:
1 2
3
P:
C:
P:
C:
P:
C:
B
Incoming request from Bob
1 2
3
P:
C:
P: 13
C:
P:
C:
B
Promise (13) ?
Phase 1
Promise (13) ?
1 2
3
P: 13
C:
OKOK
P: 13
C:
P: 13
C:
Phase 1
1 2
3
P: 13
C: 13, B
P: 13
C:
P: 13
C:
Phase 2
Commit (13, ) ?B Commit (13, ) ?B
1 2
3
P: 13
C: 13, B
P: 13
C: 13,
P: 13
C: 13,
Phase 2
B B
OKOK
1 2
3
P: 13
C: 13, B
P: 13
C: 13,
P: 13
C: 13, B B
OK
Bob is granted the lock
Paxos Example - Node
Failure
1 2
3
P:
C:
P:
C:
P:
C:
1 2
3
P:
C:
P: 13
C:
P:
C:
Promise (13) ?
Phase 1
B
Incoming request from Bob
Promise (13) ?
1 2
3
P: 13
C:
P: 13
C:
P: 13
C:
Phase 1
B
OK
OK
1 2
3
P: 13
C:
P: 13
C: 13,
P: 13
C:
Phase 2
Commit (13, ) ?B
B
1 2
3
P: 13
C:
P: 13
C: 13,
P: 13
C: 13,
Phase 2
B
B
1 2
3
P: 13
C:
P: 13
C: 13,
P: 13
C: 13,
Alice
B
B
A
1 2
3
P: 22
C:
P: 13
C: 13,
P: 13
C: 13,
Phase 1
B
B
A
Promise (22) ?
1 2
3
P: 22
C:
P: 13
C: 13,
P: 22
C: 13,
Phase 1
B
B
A
OK(13, )B
1 2
3
P: 22
C: 22,
P: 13
C: 13,
P: 22
C: 13,
Phase 2
B
B
A
Commit (22, ) ?B
B
1 2
3
P: 22
C: 22,
P: 13
C: 13,
P: 22
C: 22,
Phase 2
B
B
OK
B
NO
Paxos Example -
Conflict
1 2
3
P: 13
C:
P: 13
C:
P: 13
C:
B
Phase 1 - Bob
1 2
3
P: 21
C:
P: 21
C:
P: 21
C:
B
Phase 1 - Alice
A
1 2
3
P: 33
C:
P: 33
C:
P: 33
C:
B
Phase 1 - Bob
A
1 2
3
P: 41
C:
P: 41
C:
P: 41
C:
B
Phase 1 - Alice
A
What does Paxos give us?
Safety - Decisions are always final
Liveness - Decision will be reached as long as a majority of nodes are up
and able to communicate*. Clients must wait two round trips to the majority
of nodes, sometimes longer.
*plus our weak synchrony assumptions for the FLP result
Multi-Paxos
Lamport’s leader-driven consensus algorithm
Paxos Made Moderately Complex
Robbert van Renesse and Deniz
Altinbuken
ACM Computing Surveys
April 2015
Not the original, but highly recommended
Multi-Paxos
Lamport’s insight:
Phase 1 is not specific to the request so can be done before the request
arrives and can be reused for multiple instances of Paxos.
Implication:
Bob now only has to wait one round trip
State Machine Replication
fault-tolerant services using consensus
Implementing Fault-Tolerant
Services Using the State Machine
Approach: A Tutorial
Fred Schneider
ACM Computing Surveys
1990
State Machine Replication (SMR)
A general technique for making a service, such as a
database, fault-tolerant.
Application
Client Client
Application
Application
Application
Client
Client
Network
Consensus
Consensus
Consensus
Consensus
Consensus
CAP Theorem
You cannot have your cake and eat it
CAP Theorem
Eric Brewer
Presented at Symposium on
Principles of Distributed
Computing, 2000
Consistency, Availability & Partition
Tolerance - Pick Two
1 2
3 4
B C
Paxos Made Live & Chubby
How google uses Paxos
Paxos Made Live - An Engineering
Perspective
Tushar Chandra, Robert Griesemer
and Joshua Redstone
ACM Symposium on Principles of
Distributed Computing
2007
Isn’t this a solved problem?
“There are significant gaps between the description of the
Paxos algorithm and the needs of a real-world system.
In order to build a real-world system, an expert needs to use
numerous ideas scattered in the literature and make several
relatively small protocol extensions.
The cumulative effort will be substantial and the final system
will be based on an unproven protocol.”
Paxos Made Live
Paxos made live documents the challenges in constructing Chubby, a
distributed coordination service, built using Multi-Paxos and State machine
replication.
Challenges
• Handling disk failure and corruption
• Dealing with limited storage capacity
• Effectively handling read-only requests
• Dynamic membership & reconfiguration
• Supporting transactions
• Verifying safety of the implementation
Fast Paxos
Like Multi-Paxos, but faster
Fast Paxos
Leslie Lamport
Microsoft Research Tech Report
MSR-TR-2005-112
Fast Paxos
Paxos: Any node can commit a value in 2 RTTs
Multi-Paxos: The leader node can commit a value in 1 RTT
But, what about any node committing a value in 1 RTT?
Fast Paxos
We can bypass the leader node for many operations, so any node can
commit a value in 1 RTT.
However, we must increase the size of the quorum.
Zookeeper
The open source solution
Zookeeper: wait-free coordination
for internet-scale systems
Hunt et al
USENIX ATC 2010
Code: zookeeper.apache.org
Zookeeper
Consensus for the masses.
It utilizes and extends Multi-Paxos for strong
consistency.
Unlike “Paxos made live”, this is clearly
discussed and openly available.
Egalitarian Paxos
Don’t restrict yourself unnecessarily
There Is More Consensus in
Egalitarian Parliaments
Iulian Moraru, David G. Andersen,
Michael Kaminsky
SOSP 2013
also see Generalized Consensus and Paxos
Egalitarian Paxos
The basis of SMR is that every replica of an
application receives the same commands in the same
order.
However, sometimes the ordering can be relaxed…
C=1 B? C=C+1 C? B=0 B=C
C=1 B?
C=C+1
C?
B=0
B=C
Partial Ordering
Total Ordering
C=1 B? C=C+1 C? B=0 B=C
Many possible orderings
B? C=C+1 C?B=0 B=CC=1
B?C=C+1 C? B=0 B=CC=1
B? C=C+1 C? B=0 B=CC=1
Egalitarian Paxos
Allow requests to be out-of-order if they are commutative.
Conflict becomes much less common.
Works well in combination with Fast Paxos.
Raft Consensus
Paxos made understandable
In Search of an Understandable
Consensus Algorithm
Diego Ongaro and John Ousterhout
USENIX ATC
2014
Raft
Raft has taken the wider community by storm. Largely, due to its
understandable description.
It’s another variant of SMR with Multi-Paxos.
Key features:
• Really strong leadership - all other nodes are passive
• Various optimizations - e.g. dynamic membership and log compaction
Flexible Paxos
Paxos made scalable
Flexible Paxos: Quorum
intersection revisited
Heidi Howard, Dahlia Malkhi,
Alexander Spiegelman
ArXiv:1608.06696
Majorities are not needed
Usually, we use require majorities to agree so we can guarantee that all
quorums (groups) intersect.
This work shows that not all quorums need to intersect. Only the ones used for
phase 2 (replication) and phase 1 (leader election).
This applies to all algorithms in this class: Paxos, Viewstamped Replication,
Zookeeper, Raft etc..
Example: Non-strict majorities
Phase 2
Replication quorum
Phase 1
Leader election quorum
Example: Counting quorums
Replication quorum Leader election quorum
Example: Group quorums
Replication quorum Leader election quorum
How strong is the leadership?
Strong
Leadership Leaderless
Paxos
Egalitarian
Paxos
Raft Viewstamped
Replication
Multi-Paxos
Fast Paxos
Leader only
when needed
Leader driven
Zookeeper
Chubby
Who is the winner?
Depends on the award:
• Best for minimum latency: Viewstamped Replication
• Most widely used open source project: Zookeeper
• Easiest to understand: Raft
• Best for WANs: Egalitarian Paxos
Future
1. More scalable consensus algorithms utilizing Flexible Paxos.
2. A clearer understanding of consensus and better explained
algorithms.
3. Consensus in challenging settings such as geo-replicated
systems.
Summary
Do not be discouraged by impossibility results and
dense abstract academic papers.
Don’t give up on consistency. Consensus is
achievable, even performant and scalable.
Find the right algorithm and quorum system for your
specific domain. There is no single silver bullet.
heidi.howard@cl.cam.ac.uk
@heidiann360
Watch the video with slide synchronization on
InfoQ.com!
https://www.infoq.com/presentations/future-
distributed-consensus

More Related Content

More from C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

More from C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Recently uploaded

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Recently uploaded (20)

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Consensus: Why Can't We All Just Agree?

  • 1. Distributed Consensus: Why Can't We All Just Agree? Heidi Howard PhD Student @ University of Cambridge heidi.howard@cl.cam.ac.uk @heidiann360 hh360.user.srcf.net
  • 2. InfoQ.com: News & Community Site Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ future-distributed-consensus • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  • 4. Sometimes inconsistency is not an option • Distributed locking • Safety critical systems • Distributed scheduling • Strongly consistent databases • Blockchain Anything which requires guaranteed agreement • Leader election • Orchestration services • Distributed file systems • Coordination & configuration • Strongly consistent databases
  • 5. What is Distributed Consensus? “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, all operating asynchronously”
  • 6.
  • 7. A walk through time We are going to take a journey through the developments in distributed consensus, spanning over three decades. Stops include: • FLP Result & CAP Theorem • Viewstamped Replication, Paxos & Multi-Paxos • State Machine Replication • Paxos Made Live, Zookeeper & Raft • Flexible Paxos Bob
  • 8. Fischer, Lynch & Paterson Result We begin with a slippery start Impossibility of distributed consensus with one faulty process Michael Fischer, Nancy Lynch and Michael Paterson ACM SIGACT-SIGMOD Symposium on Principles of Database Systems 1983
  • 9. FLP Result We cannot guarantee agreement in an asynchronous system where even one host might fail. Why? We cannot reliably detect failures. We cannot know for sure the difference between a slow host/network and a failed host Note: We can still guarantee safety, the issue limited to guaranteeing liveness.
  • 10. Solution to FLP In practice: We approximate reliable failure detectors using heartbeats and timers. We accept that sometimes the service will not be available (when it could be). In theory: We make weak assumptions about the synchrony of the system e.g. messages arrive within a year.
  • 11. Viewstamped Replication the forgotten algorithm Viewstamped Replication Revisited Barbara Liskov and James Cowling MIT Tech Report MIT-CSAIL-TR-2012-021 Not the original from 1988, but recommended
  • 12. Viewstamped Replication In my view, the pioneering algorithm on the field of distributed consensus. Approach: Select one node to be the ‘master’. The master is responsible for replicating decisions. Once a decision has been replicated onto the majority of nodes then it is commit. We rotate the master when the old master fails with agreement from the majority of nodes.
  • 13. Paxos Lamport’s consensus algorithm The Part-Time Parliament Leslie Lamport ACM Transactions on Computer Systems May 1998
  • 14. Paxos The textbook algorithm for reaching consensus on a single value. • two phase process: promise and commit • each requiring majority agreement (aka quorums)
  • 18. 1 2 3 P: C: P: 13 C: P: C: B Promise (13) ? Phase 1 Promise (13) ?
  • 19. 1 2 3 P: 13 C: OKOK P: 13 C: P: 13 C: Phase 1
  • 20. 1 2 3 P: 13 C: 13, B P: 13 C: P: 13 C: Phase 2 Commit (13, ) ?B Commit (13, ) ?B
  • 21. 1 2 3 P: 13 C: 13, B P: 13 C: 13, P: 13 C: 13, Phase 2 B B OKOK
  • 22. 1 2 3 P: 13 C: 13, B P: 13 C: 13, P: 13 C: 13, B B OK Bob is granted the lock
  • 23. Paxos Example - Node Failure
  • 25. 1 2 3 P: C: P: 13 C: P: C: Promise (13) ? Phase 1 B Incoming request from Bob Promise (13) ?
  • 26. 1 2 3 P: 13 C: P: 13 C: P: 13 C: Phase 1 B OK OK
  • 27. 1 2 3 P: 13 C: P: 13 C: 13, P: 13 C: Phase 2 Commit (13, ) ?B B
  • 28. 1 2 3 P: 13 C: P: 13 C: 13, P: 13 C: 13, Phase 2 B B
  • 29. 1 2 3 P: 13 C: P: 13 C: 13, P: 13 C: 13, Alice B B A
  • 30. 1 2 3 P: 22 C: P: 13 C: 13, P: 13 C: 13, Phase 1 B B A Promise (22) ?
  • 31. 1 2 3 P: 22 C: P: 13 C: 13, P: 22 C: 13, Phase 1 B B A OK(13, )B
  • 32. 1 2 3 P: 22 C: 22, P: 13 C: 13, P: 22 C: 13, Phase 2 B B A Commit (22, ) ?B B
  • 33. 1 2 3 P: 22 C: 22, P: 13 C: 13, P: 22 C: 22, Phase 2 B B OK B NO
  • 35. 1 2 3 P: 13 C: P: 13 C: P: 13 C: B Phase 1 - Bob
  • 36. 1 2 3 P: 21 C: P: 21 C: P: 21 C: B Phase 1 - Alice A
  • 37. 1 2 3 P: 33 C: P: 33 C: P: 33 C: B Phase 1 - Bob A
  • 38. 1 2 3 P: 41 C: P: 41 C: P: 41 C: B Phase 1 - Alice A
  • 39. What does Paxos give us? Safety - Decisions are always final Liveness - Decision will be reached as long as a majority of nodes are up and able to communicate*. Clients must wait two round trips to the majority of nodes, sometimes longer. *plus our weak synchrony assumptions for the FLP result
  • 40. Multi-Paxos Lamport’s leader-driven consensus algorithm Paxos Made Moderately Complex Robbert van Renesse and Deniz Altinbuken ACM Computing Surveys April 2015 Not the original, but highly recommended
  • 41. Multi-Paxos Lamport’s insight: Phase 1 is not specific to the request so can be done before the request arrives and can be reused for multiple instances of Paxos. Implication: Bob now only has to wait one round trip
  • 42. State Machine Replication fault-tolerant services using consensus Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial Fred Schneider ACM Computing Surveys 1990
  • 43. State Machine Replication (SMR) A general technique for making a service, such as a database, fault-tolerant. Application Client Client
  • 45.
  • 46. CAP Theorem You cannot have your cake and eat it CAP Theorem Eric Brewer Presented at Symposium on Principles of Distributed Computing, 2000
  • 47. Consistency, Availability & Partition Tolerance - Pick Two 1 2 3 4 B C
  • 48. Paxos Made Live & Chubby How google uses Paxos Paxos Made Live - An Engineering Perspective Tushar Chandra, Robert Griesemer and Joshua Redstone ACM Symposium on Principles of Distributed Computing 2007
  • 49. Isn’t this a solved problem? “There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.”
  • 50. Paxos Made Live Paxos made live documents the challenges in constructing Chubby, a distributed coordination service, built using Multi-Paxos and State machine replication.
  • 51. Challenges • Handling disk failure and corruption • Dealing with limited storage capacity • Effectively handling read-only requests • Dynamic membership & reconfiguration • Supporting transactions • Verifying safety of the implementation
  • 52. Fast Paxos Like Multi-Paxos, but faster Fast Paxos Leslie Lamport Microsoft Research Tech Report MSR-TR-2005-112
  • 53. Fast Paxos Paxos: Any node can commit a value in 2 RTTs Multi-Paxos: The leader node can commit a value in 1 RTT But, what about any node committing a value in 1 RTT?
  • 54. Fast Paxos We can bypass the leader node for many operations, so any node can commit a value in 1 RTT. However, we must increase the size of the quorum.
  • 55. Zookeeper The open source solution Zookeeper: wait-free coordination for internet-scale systems Hunt et al USENIX ATC 2010 Code: zookeeper.apache.org
  • 56. Zookeeper Consensus for the masses. It utilizes and extends Multi-Paxos for strong consistency. Unlike “Paxos made live”, this is clearly discussed and openly available.
  • 57. Egalitarian Paxos Don’t restrict yourself unnecessarily There Is More Consensus in Egalitarian Parliaments Iulian Moraru, David G. Andersen, Michael Kaminsky SOSP 2013 also see Generalized Consensus and Paxos
  • 58. Egalitarian Paxos The basis of SMR is that every replica of an application receives the same commands in the same order. However, sometimes the ordering can be relaxed…
  • 59. C=1 B? C=C+1 C? B=0 B=C C=1 B? C=C+1 C? B=0 B=C Partial Ordering Total Ordering
  • 60. C=1 B? C=C+1 C? B=0 B=C Many possible orderings B? C=C+1 C?B=0 B=CC=1 B?C=C+1 C? B=0 B=CC=1 B? C=C+1 C? B=0 B=CC=1
  • 61. Egalitarian Paxos Allow requests to be out-of-order if they are commutative. Conflict becomes much less common. Works well in combination with Fast Paxos.
  • 62. Raft Consensus Paxos made understandable In Search of an Understandable Consensus Algorithm Diego Ongaro and John Ousterhout USENIX ATC 2014
  • 63. Raft Raft has taken the wider community by storm. Largely, due to its understandable description. It’s another variant of SMR with Multi-Paxos. Key features: • Really strong leadership - all other nodes are passive • Various optimizations - e.g. dynamic membership and log compaction
  • 64. Flexible Paxos Paxos made scalable Flexible Paxos: Quorum intersection revisited Heidi Howard, Dahlia Malkhi, Alexander Spiegelman ArXiv:1608.06696
  • 65. Majorities are not needed Usually, we use require majorities to agree so we can guarantee that all quorums (groups) intersect. This work shows that not all quorums need to intersect. Only the ones used for phase 2 (replication) and phase 1 (leader election). This applies to all algorithms in this class: Paxos, Viewstamped Replication, Zookeeper, Raft etc..
  • 66. Example: Non-strict majorities Phase 2 Replication quorum Phase 1 Leader election quorum
  • 67. Example: Counting quorums Replication quorum Leader election quorum
  • 68. Example: Group quorums Replication quorum Leader election quorum
  • 69. How strong is the leadership? Strong Leadership Leaderless Paxos Egalitarian Paxos Raft Viewstamped Replication Multi-Paxos Fast Paxos Leader only when needed Leader driven Zookeeper Chubby
  • 70. Who is the winner? Depends on the award: • Best for minimum latency: Viewstamped Replication • Most widely used open source project: Zookeeper • Easiest to understand: Raft • Best for WANs: Egalitarian Paxos
  • 71. Future 1. More scalable consensus algorithms utilizing Flexible Paxos. 2. A clearer understanding of consensus and better explained algorithms. 3. Consensus in challenging settings such as geo-replicated systems.
  • 72. Summary Do not be discouraged by impossibility results and dense abstract academic papers. Don’t give up on consistency. Consensus is achievable, even performant and scalable. Find the right algorithm and quorum system for your specific domain. There is no single silver bullet. heidi.howard@cl.cam.ac.uk @heidiann360
  • 73. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/future- distributed-consensus