SlideShare une entreprise Scribd logo
1  sur  24
Télécharger pour lire hors ligne
May 2016
Raft in details
Ivan Glushkov
ivan.glushkov@gmail.com
@gliush
Raft is a consensus algorithm for managing a
replicated log
Why
❖ Very important topic
❖ [Multi-]Paxos, the key player, is very difficult to
understand
❖ Paxos -> Multi-Paxos -> different approaches
❖ Different optimization, closed-source implementation
Raft goals
❖ Main goal - understandability:
❖ Decomposition: separated leader election, log
replication, safety, membership changes
❖ State space reduction
Replicated state machines
❖ The same state on each server
❖ Compute the state from replicated log
❖ Consensus algorithm: keep the replicated log consistent
Properties of consensus algorithm
❖ Never return an incorrect result: Safety
❖ Functional as long as any majority of the severs are
operational: Availability
❖ Do not depend on timing to ensure consistency (at
worst - unavailable)
Raft in a nutshell
❖ Elect a leader
❖ Give full responsibility for replicated log to leader:
❖ Leader accepts log entries from client
❖ Leader replicates log entries on other servers
❖ Tell servers when it is safe to apply changes on their
state
Raft basics: States
❖ Follower
❖ Candidate
❖ Leader
RequestVote RPC
❖ Become a candidate, if no communication from leader over
a timeout (random election timeout: e.g. 150-300ms)
❖ Spec: (term, candidateId, lastLogIndex, lastLogTerm) 

-> (term, voteGranted)
❖ Receiver:
❖ false if term < currentTerm
❖ true if not voted yet and term and log are Up-To-Date
❖ false
Leader Election
❖ Heartbeats from leader
❖ Election timeout
❖ Candidate results:
❖ It wins
❖ Another server wins
❖ No winner for a period of time -> new election
Leader Election Property
❖ Election Safety: at most one leader can be elected in a
given term
Log Replication
❖ log index
❖ log term
❖ log command
Log Replication
❖ Leader appends command from client to its log
❖ Leader issues AppendEntries RPC to replicate the entry
❖ Leader replicates entry to majority -> apply entry to
state -> “committed entry”
❖ Follower checks if entry is committed by server ->
commit it locally
Log Replication Properties
❖ Leader Append-Only: a leader never overwrites or
deletes entries in its log; it only appends new entries
❖ Log Matching: If two entries in different logs have the
same index and term, then they store the same
command.
❖ Log Matching: If two entries in different logs have the
same index and term, then the logs are identical in all
preceding entries.
Inconsistency
Solving Inconsistency
❖ Leader forces the followers’ logs to duplicate its own
❖ Conflicting entries in the follower logs will be
overwritten with entries from the Leader’s log:
❖ Find latest log where two servers agree
❖ Remove everything after this point
❖ Write new logs
Safety Property
❖ Leader Completeness: if a log entry is committed in a
given term, then that entry will be present in the logs of
the leaders for all higher-numbered terms
❖ State Machine Safety: if a server has applied a log entry
at a given index to its state machine, no other server will
ever apply a different log entry for the same index.
Safety Restriction
❖ Up-to-date: if the logs have last entries with different
terms, then the log with the later term is more up-to-
date. If the logs end with the same term, then whichever
log is longer is more up-to-date.
time
Follower and Candidate Crashes
❖ Retry indefinitely
❖ Raft RPC are idempotent
Raft Joint Consensus
❖ Two phase to change membership configuration
❖ In Joint Consensus state:
❖ replicate log entries to both configuration
❖ any server from either configuration may be a leader
❖ agreement (for election and commitment) requires
separate majorities from both configuration
Log compaction
❖ Independent

snapshotting
❖ Snapshot metadata
❖ InstallSnapshot RPC
Client Interaction
❖ Works with leader
❖ Leader respond to a request when it commits an entry
❖ Assign uniqueID to every command, leader stores latest
ID with response.
❖ Read-only - could be without writing to a log, possible
stale data. Or write “no-op” entry at start
Correctness
❖ TLA+ formal specification for consensus alg (400 LOC)
❖ Proof of linearizability in Raft using Coq (community)
Questions?

Contenu connexe

Tendances

PyOpenCLによるGPGPU入門
PyOpenCLによるGPGPU入門PyOpenCLによるGPGPU入門
PyOpenCLによるGPGPU入門
Yosuke Onoue
 

Tendances (20)

PyOpenCLによるGPGPU入門
PyOpenCLによるGPGPU入門PyOpenCLによるGPGPU入門
PyOpenCLによるGPGPU入門
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controller
 
Elements of cache design
Elements of cache designElements of cache design
Elements of cache design
 
Multiprocessor Scheduling
Multiprocessor SchedulingMultiprocessor Scheduling
Multiprocessor Scheduling
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Distributed system
Distributed systemDistributed system
Distributed system
 
Using Kafka to scale database replication
Using Kafka to scale database replicationUsing Kafka to scale database replication
Using Kafka to scale database replication
 
Parallel and distributed Computing
Parallel and distributed Computing Parallel and distributed Computing
Parallel and distributed Computing
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems
 
Survey of Attention mechanism
Survey of Attention mechanismSurvey of Attention mechanism
Survey of Attention mechanism
 
Unit 6 interprocessor arbitration
Unit 6 interprocessor arbitrationUnit 6 interprocessor arbitration
Unit 6 interprocessor arbitration
 
Gossip-based algorithms
Gossip-based algorithmsGossip-based algorithms
Gossip-based algorithms
 
Scalar DL Technical Overview
Scalar DL Technical OverviewScalar DL Technical Overview
Scalar DL Technical Overview
 
Operating System- Multilevel Paging, Inverted Page Table
Operating System- Multilevel Paging, Inverted Page TableOperating System- Multilevel Paging, Inverted Page Table
Operating System- Multilevel Paging, Inverted Page Table
 
Apache Pulsar Overview
Apache Pulsar OverviewApache Pulsar Overview
Apache Pulsar Overview
 
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
 
Collective Communications in MPI
 Collective Communications in MPI Collective Communications in MPI
Collective Communications in MPI
 
unit 4 ai.pptx
unit 4 ai.pptxunit 4 ai.pptx
unit 4 ai.pptx
 
Superscalar Processor
Superscalar ProcessorSuperscalar Processor
Superscalar Processor
 

Similaire à Raft in details

Manging scalability of distributed system
Manging scalability of distributed systemManging scalability of distributed system
Manging scalability of distributed system
Atin Mukherjee
 
Google File Systems
Google File SystemsGoogle File Systems
Google File Systems
Azeem Mumtaz
 
Raft_Diego_Ongaro.pdf
Raft_Diego_Ongaro.pdfRaft_Diego_Ongaro.pdf
Raft_Diego_Ongaro.pdf
jsathy97
 

Similaire à Raft in details (20)

Consensus algo with_distributed_key_value_store_in_distributed_system
Consensus algo with_distributed_key_value_store_in_distributed_systemConsensus algo with_distributed_key_value_store_in_distributed_system
Consensus algo with_distributed_key_value_store_in_distributed_system
 
Manging scalability of distributed system
Manging scalability of distributed systemManging scalability of distributed system
Manging scalability of distributed system
 
Module: Mutable Content in IPFS
Module: Mutable Content in IPFSModule: Mutable Content in IPFS
Module: Mutable Content in IPFS
 
Coordination in distributed systems
Coordination in distributed systemsCoordination in distributed systems
Coordination in distributed systems
 
New awesome features in MySQL 5.7
New awesome features in MySQL 5.7New awesome features in MySQL 5.7
New awesome features in MySQL 5.7
 
Unveiling etcd: Architecture and Source Code Deep Dive
Unveiling etcd: Architecture and Source Code Deep DiveUnveiling etcd: Architecture and Source Code Deep Dive
Unveiling etcd: Architecture and Source Code Deep Dive
 
Top 10 interview question and answers for mcsa
Top 10 interview question and answers for mcsaTop 10 interview question and answers for mcsa
Top 10 interview question and answers for mcsa
 
NFSv4 Replication for Grid Computing
NFSv4 Replication for Grid ComputingNFSv4 Replication for Grid Computing
NFSv4 Replication for Grid Computing
 
Loadbalancing In-depth study for scale @ 80K TPS
Loadbalancing In-depth study for scale @ 80K TPSLoadbalancing In-depth study for scale @ 80K TPS
Loadbalancing In-depth study for scale @ 80K TPS
 
Google File Systems
Google File SystemsGoogle File Systems
Google File Systems
 
Distributed transactions
Distributed transactionsDistributed transactions
Distributed transactions
 
Real-Time Streaming Protocol -QOS
Real-Time Streaming Protocol -QOSReal-Time Streaming Protocol -QOS
Real-Time Streaming Protocol -QOS
 
M|18 Choosing the Right High Availability Strategy for You
M|18 Choosing the Right High Availability Strategy for YouM|18 Choosing the Right High Availability Strategy for You
M|18 Choosing the Right High Availability Strategy for You
 
Thriftypaxos
ThriftypaxosThriftypaxos
Thriftypaxos
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
Raft_Diego_Ongaro.pdf
Raft_Diego_Ongaro.pdfRaft_Diego_Ongaro.pdf
Raft_Diego_Ongaro.pdf
 
Denser, cooler, faster, stronger: PHP on ARM microservers
Denser, cooler, faster, stronger: PHP on ARM microserversDenser, cooler, faster, stronger: PHP on ARM microservers
Denser, cooler, faster, stronger: PHP on ARM microservers
 
Scalable Web Apps
Scalable Web AppsScalable Web Apps
Scalable Web Apps
 
ONOS Platform Architecture
ONOS Platform ArchitectureONOS Platform Architecture
ONOS Platform Architecture
 
Megastore by Google
Megastore by GoogleMegastore by Google
Megastore by Google
 

Plus de Ivan Glushkov

Plus de Ivan Glushkov (8)

Distributed tracing with erlang/elixir
Distributed tracing with erlang/elixirDistributed tracing with erlang/elixir
Distributed tracing with erlang/elixir
 
Kubernetes is not needed to 90 percents of the companies.rus
Kubernetes is not needed to 90 percents of the companies.rusKubernetes is not needed to 90 percents of the companies.rus
Kubernetes is not needed to 90 percents of the companies.rus
 
Mystery Machine Overview
Mystery Machine OverviewMystery Machine Overview
Mystery Machine Overview
 
Hashicorp Nomad
Hashicorp NomadHashicorp Nomad
Hashicorp Nomad
 
Google Dataflow Intro
Google Dataflow IntroGoogle Dataflow Intro
Google Dataflow Intro
 
NewSQL overview, Feb 2015
NewSQL overview, Feb 2015NewSQL overview, Feb 2015
NewSQL overview, Feb 2015
 
Comparing ZooKeeper and Consul
Comparing ZooKeeper and ConsulComparing ZooKeeper and Consul
Comparing ZooKeeper and Consul
 
fp intro
fp introfp intro
fp intro
 

Dernier

%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 

Dernier (20)

Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 

Raft in details

  • 1. May 2016 Raft in details Ivan Glushkov ivan.glushkov@gmail.com @gliush
  • 2. Raft is a consensus algorithm for managing a replicated log
  • 3. Why ❖ Very important topic ❖ [Multi-]Paxos, the key player, is very difficult to understand ❖ Paxos -> Multi-Paxos -> different approaches ❖ Different optimization, closed-source implementation
  • 4. Raft goals ❖ Main goal - understandability: ❖ Decomposition: separated leader election, log replication, safety, membership changes ❖ State space reduction
  • 5. Replicated state machines ❖ The same state on each server ❖ Compute the state from replicated log ❖ Consensus algorithm: keep the replicated log consistent
  • 6. Properties of consensus algorithm ❖ Never return an incorrect result: Safety ❖ Functional as long as any majority of the severs are operational: Availability ❖ Do not depend on timing to ensure consistency (at worst - unavailable)
  • 7. Raft in a nutshell ❖ Elect a leader ❖ Give full responsibility for replicated log to leader: ❖ Leader accepts log entries from client ❖ Leader replicates log entries on other servers ❖ Tell servers when it is safe to apply changes on their state
  • 8. Raft basics: States ❖ Follower ❖ Candidate ❖ Leader
  • 9. RequestVote RPC ❖ Become a candidate, if no communication from leader over a timeout (random election timeout: e.g. 150-300ms) ❖ Spec: (term, candidateId, lastLogIndex, lastLogTerm) 
 -> (term, voteGranted) ❖ Receiver: ❖ false if term < currentTerm ❖ true if not voted yet and term and log are Up-To-Date ❖ false
  • 10. Leader Election ❖ Heartbeats from leader ❖ Election timeout ❖ Candidate results: ❖ It wins ❖ Another server wins ❖ No winner for a period of time -> new election
  • 11. Leader Election Property ❖ Election Safety: at most one leader can be elected in a given term
  • 12. Log Replication ❖ log index ❖ log term ❖ log command
  • 13. Log Replication ❖ Leader appends command from client to its log ❖ Leader issues AppendEntries RPC to replicate the entry ❖ Leader replicates entry to majority -> apply entry to state -> “committed entry” ❖ Follower checks if entry is committed by server -> commit it locally
  • 14. Log Replication Properties ❖ Leader Append-Only: a leader never overwrites or deletes entries in its log; it only appends new entries ❖ Log Matching: If two entries in different logs have the same index and term, then they store the same command. ❖ Log Matching: If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.
  • 16. Solving Inconsistency ❖ Leader forces the followers’ logs to duplicate its own ❖ Conflicting entries in the follower logs will be overwritten with entries from the Leader’s log: ❖ Find latest log where two servers agree ❖ Remove everything after this point ❖ Write new logs
  • 17. Safety Property ❖ Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms ❖ State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
  • 18. Safety Restriction ❖ Up-to-date: if the logs have last entries with different terms, then the log with the later term is more up-to- date. If the logs end with the same term, then whichever log is longer is more up-to-date. time
  • 19. Follower and Candidate Crashes ❖ Retry indefinitely ❖ Raft RPC are idempotent
  • 20. Raft Joint Consensus ❖ Two phase to change membership configuration ❖ In Joint Consensus state: ❖ replicate log entries to both configuration ❖ any server from either configuration may be a leader ❖ agreement (for election and commitment) requires separate majorities from both configuration
  • 21. Log compaction ❖ Independent
 snapshotting ❖ Snapshot metadata ❖ InstallSnapshot RPC
  • 22. Client Interaction ❖ Works with leader ❖ Leader respond to a request when it commits an entry ❖ Assign uniqueID to every command, leader stores latest ID with response. ❖ Read-only - could be without writing to a log, possible stale data. Or write “no-op” entry at start
  • 23. Correctness ❖ TLA+ formal specification for consensus alg (400 LOC) ❖ Proof of linearizability in Raft using Coq (community)