Zab: High-performance broadcast
  for primary-backup systems
  Flavio Junqueira, Benjamin Reed, Marco Serafini

                Yahoo! Research
                     June 2011
Setting the stage

•   Background: ZooKeeper
•   Coordination service
    - Web-scale applications
    - Intensive use (high performance)
    - Source of truth for many applications

ZooKeeper

•   Open source Apache project
•   Used in production
    - Yahoo!
    - Facebook
    - Rackspace
    - ...

http://zookeeper.apache.org

ZooKeeper

•   ... is a leader-based, replicated service
    - Processes crash and recover
•   Leader
    - Executes requests
    - Propagates state updates
•   Follower
    - Applies state updates

[Figure: one leader and two followers; the leader broadcasts and the followers deliver, via atomic broadcast.]

ZooKeeper

•   Client
    - Submits operations to a server
    - If follower, forwards to leader
    - Leader executes and propagates state update

[Figure: a client sends a request to one of the servers; the leader broadcasts and the two followers deliver, via atomic broadcast.]

ZooKeeper

•   State updates
    - All followers apply the same updates
    - All followers apply them in the same order
    - Atomic broadcast
•   Performance requirements
    - Multiple outstanding operations
    - Low latency and high throughput

ZooKeeper
•   Update configuration and create ready
•   If ready exists, then configuration is consistent

[Figure: the leader pipelines four updates to two followers; left to right in the figure: create /cfg/ready, setData /cfg/server B, setData /cfg/client B, del /cfg/ready.]

•   If 1 doesn’t commit, then 2+3 can’t commit
•   If 2+3 don’t commit, then 4 must not commit

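The dependency between pipelined updates can be read as a prefix rule: an update may commit only if every update issued before it has committed. A minimal sketch, assuming a hypothetical helper `committable_prefix` and an `acked` set of update numbers (neither is part of ZooKeeper's API):

```python
def committable_prefix(pipeline, acked):
    """Return the longest prefix of `pipeline` whose updates are all acknowledged."""
    prefix = []
    for op in pipeline:
        if op not in acked:
            break  # a gap blocks every later update
        prefix.append(op)
    return prefix


pipeline = [1, 2, 3, 4]  # the four numbered updates from the slide
print(committable_prefix(pipeline, acked={1, 2, 3, 4}))  # [1, 2, 3, 4]
print(committable_prefix(pipeline, acked={2, 3, 4}))     # []
print(committable_prefix(pipeline, acked={1, 4}))        # [1]
```

With update 1 missing, nothing commits; with 2+3 missing, update 4 stays uncommitted even though it was acknowledged.
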
ZooKeeper

•   Exploring Paxos
    - Efficient consensus protocol
    - State-machine replication
    - Multiple consecutive instances
•   Why is it not suitable out of the box?
    - Does not guarantee order
    - Multiple outstanding operations

Paxos at a glance

[Figure: message flow among three processes (Acceptor + Learner; Acceptor + Proposer + Learner; Acceptor + Learner) exchanging messages 1a, 1b, 2a, 2b, 3a.]

•   Phase 1: Selects value to propose
    - 1b: Acceptor promises not to accept lower ballots
•   Phase 2: Proposes a value
    - 2b: If quorum, value is chosen
•   Phase 3: Value learned

Paxos run

•   Initial state (instance: <ballot, value>)
    - A1: 27: <1, A>, 28: <1, B> (has accepted A and B from P1)
    - A2: 27: <2, C> (has accepted C from P2)
    - A3: 27: <2, C>
•   P3 sends phase 1a with ballot 3
    - 27: <1a,3>, 28: <1a,3>, 29: <1a,3>
•   Phase 1b replies
    - A1: 27: <1b, 1, A>, 28: <1b, 1, B>, 29: <1b, _, _>
    - A3: 27: <1b, 2, C>, 28: <1b, _, _>, 29: <1b, _, _>
•   P3 sends phase 2a
    - 27: <2a, 3, C>, 28: <2a, 3, B>, 29: <2a, 3, D>
•   Final state at A2 and A3
    - 27: <3, C>, 28: <3, B>, 29: <3, D>
•   Interleaves operations of P1, P2, and P3

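The run above can be reproduced with a small sketch of how a new proposer picks its phase 2a values, one instance at a time. The function `choose_values` and its data layout are hypothetical and only illustrate the per-instance rule; this is not a full Paxos implementation.

```python
def choose_values(replies, instances, new_value):
    """Pick the value a new proposer sends in phase 2a for each instance.

    replies: one dict per acceptor, mapping instance -> (ballot, value) accepted so far.
    Each instance is decided independently: take the accepted value with the
    highest ballot, or `new_value` if no acceptor in the quorum reported one.
    """
    proposals = {}
    for instance in instances:
        accepted = [r[instance] for r in replies if instance in r]
        if accepted:
            proposals[instance] = max(accepted)[1]  # highest ballot wins
        else:
            proposals[instance] = new_value
    return proposals


a1 = {27: (1, "A"), 28: (1, "B")}  # A1 accepted A and B from P1
a3 = {27: (2, "C")}                # A3 accepted C from P2
proposals = choose_values([a1, a3], instances=[27, 28, 29], new_value="D")
print(proposals)  # {27: 'C', 28: 'B', 29: 'D'}
```

B survives in instance 28 although A, which P1 issued before it, is replaced by C in instance 27. Deciding each instance independently is what lets operations of different proposers interleave.
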
ZooKeeper

•   Another requirement
    - Minimize downtime
    - Efficient recovery
•   Reduce the amount of state transferred
•   Zab
    - One identifier
    - Missing values for each process

Zab and PO Broadcast

Definitions

•   Processes: Lead or Follow
•   Followers
    - Maintain a history of transactions (updates)
•   Transaction identifiers: ⟨e,c⟩
    - e : epoch number of the leader
    - c : epoch counter

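A transaction identifier can be modeled as a pair compared first by epoch and then by counter. A minimal sketch, assuming hypothetical names `Zxid`, `first_zxid` and `next_zxid`, and assuming counters start at 1 in each epoch as in the sample runs of this deck:

```python
from typing import NamedTuple


class Zxid(NamedTuple):
    epoch: int    # e: epoch number of the leader
    counter: int  # c: counter within that epoch


def first_zxid(epoch: int) -> Zxid:
    """First identifier a leader of `epoch` would assign (assumed to start at 1)."""
    return Zxid(epoch, 1)


def next_zxid(last: Zxid) -> Zxid:
    """Next identifier within the same epoch."""
    return Zxid(last.epoch, last.counter + 1)


# Tuples compare lexicographically, so any identifier of a later epoch
# is larger than every identifier of an earlier epoch.
print(Zxid(1, 1) > Zxid(0, 3))     # True
print(next_zxid(first_zxid(2)))    # Zxid(epoch=2, counter=2)
```
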
Properties of PO Broadcast


•   Integrity
    - Only broadcast transactions are delivered
    - Leader recovers before broadcasting new transactions
•   Total order and agreement
    - Followers deliver the same transactions and in the same order

Primary order

•   Local: Transactions of a leader accepted in order
•   Global: Transactions in history respect the order of epochs

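Both conditions can be checked on a follower's history. A simplified sketch, assuming a history is a list of (epoch, counter) pairs and that a leader numbers its transactions consecutively; the function name `respects_primary_order` is hypothetical:

```python
def respects_primary_order(history):
    """Check a history of (epoch, counter) pairs against both order conditions."""
    for (e_prev, c_prev), (e, c) in zip(history, history[1:]):
        if e < e_prev:
            return False  # global order: an older epoch after a newer one
        if e == e_prev and c != c_prev + 1:
            return False  # local order: gap or reordering within one epoch
    return True


print(respects_primary_order([(0, 10), (0, 11), (0, 12)]))  # True
print(respects_primary_order([(0, 10), (0, 12)]))           # False
print(respects_primary_order([(0, 10), (1, 1), (0, 11)]))   # False
print(respects_primary_order([(0, 10), (0, 11), (1, 1)]))   # True
```
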
[Figure: Leader issues abcast(⟨e,10⟩), abcast(⟨e,11⟩), abcast(⟨e,12⟩) to a Follower.]

[Figure: Leader issues abcast(⟨e,10⟩) and abcast(⟨e,11⟩); a new leader, Leader’, issues abcast(⟨e’,1⟩); one Follower.]

Zab in Phases

•   Phase 0 - Leader election
    - Prospective leader elected
•   Phase 1 - Discovery
    - Followers promise not to go back to previous epochs
    - Followers send the prospective leader their last epoch and history
    - Prospective leader selects longest history of latest epoch

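The selection rule of the Discovery phase can be sketched as taking a maximum over the followers' reports. The data layout and the name `select_initial_history` are assumptions made for illustration; histories are lists of (epoch, counter) pairs.

```python
def select_initial_history(reports):
    """Pick the follower whose history the prospective leader adopts.

    reports: dict mapping follower name -> (last_epoch, history).
    Prefer the latest epoch, then the history ending in the largest identifier.
    """
    def rank(item):
        _, (last_epoch, history) = item
        last_zxid = history[-1] if history else (-1, -1)  # empty history ranks lowest
        return (last_epoch, last_zxid)

    follower, (_, history) = max(reports.items(), key=rank)
    return follower, history


reports = {
    "f1": (0, [(0, 1), (0, 2), (0, 3)]),
    "f2": (0, [(0, 1), (0, 2)]),
    "f3": (0, [(0, 1)]),
}
print(select_initial_history(reports))  # ('f1', [(0, 1), (0, 2), (0, 3)])
```
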
Zab in Phases

•   Phase 2 - Synchronization
    - Leader sends new history to followers
    - Followers confirm leadership
•   Phase 3 - Broadcast
    - Leader proposes new transactions
    - Leader commits if quorum acknowledges

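The Broadcast phase can be sketched as a leader that tracks acknowledgements and commits proposals strictly in proposal order. The class `SketchLeader` and its fields are hypothetical; counting the leader's own acknowledgement toward the quorum is an assumption of this sketch.

```python
class SketchLeader:
    def __init__(self, epoch, ensemble_size):
        self.epoch = epoch
        self.quorum = ensemble_size // 2 + 1  # strict majority
        self.counter = 0
        self.acks = {}       # zxid -> set of servers that acknowledged
        self.pending = []    # proposed but not yet committed, in order
        self.committed = []  # committed zxids, in order

    def propose(self):
        self.counter += 1
        zxid = (self.epoch, self.counter)
        self.acks[zxid] = {"leader"}  # assumption: the leader acks its own proposal
        self.pending.append(zxid)
        return zxid

    def ack(self, zxid, follower):
        self.acks[zxid].add(follower)
        # Commit only from the head of the queue, so order is preserved
        # even with several outstanding proposals.
        while self.pending and len(self.acks[self.pending[0]]) >= self.quorum:
            self.committed.append(self.pending.pop(0))


leader = SketchLeader(epoch=1, ensemble_size=3)
z1, z2 = leader.propose(), leader.propose()
leader.ack(z2, "f2")
print(leader.committed)  # []
leader.ack(z1, "f1")
print(leader.committed)  # [(1, 1), (1, 2)]
```
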
Zab in Phases


•   Phases 1 and 2: Recovery
    - Critical to guarantee order with multiple outstanding transactions
•   Phase 3: Broadcast
    - Just like Phases 2 and 3 of Paxos

Zab: Sample run

    f1        f2        f3
    ⟨0,1⟩     ⟨0,1⟩     ⟨0,1⟩
    ⟨0,2⟩     ⟨0,2⟩
    ⟨0,3⟩

•   New epoch
    - f1.a = 0, last transaction ⟨0,3⟩
    - f2.a = 0, last transaction ⟨0,2⟩
    - f3.a = 0, last transaction ⟨0,1⟩
•   Initial history of new epoch: the history of f1

Zab: Sample run

    f1        f2        f3
    ⟨0,1⟩     ⟨0,1⟩     ⟨0,1⟩
    ⟨0,2⟩     ⟨0,2⟩     ⟨0,2⟩
    ⟨1,1⟩     ⟨1,1⟩               (Chosen!)
    ⟨1,2⟩

•   New epoch
    - f1.a = 1, last transaction ⟨1,2⟩
    - f2.a = 1, last transaction ⟨1,1⟩
    - f3.a = 2, last transaction ⟨0,2⟩
•   Can’t happen!

Paxos run (revisited)

                  Epoch 1, Phase 3     Epoch 2, Phase 3     Epoch 3, Phase 3
Leader history    L1: ∅                L2: ∅                L3: ⟨2,1⟩,C
Follower 1        Epoch: 1             Epoch: 1             Epoch: 3
                  ⟨1,1⟩,A ⟨1,2⟩,B      ⟨1,1⟩,A ⟨1,2⟩,B      ⟨2,1⟩,C ⟨3,1⟩,D
Follower 2        Epoch: 1             Epoch: 2             Epoch: 2
                  ∅                    ⟨2,1⟩,C              ⟨2,1⟩,C
Follower 3        Epoch: 1             Epoch: 2             Epoch: 3
                  ∅                    ⟨2,1⟩,C              ⟨2,1⟩,C ⟨3,1⟩,D

Phases 1 and 2 of Epoch 2 run between the first and second columns; Phases 1 and 2 of Epoch 3 run between the second and third.

Notes on implementation

•   Use of TCP
    - Ordered delivery, retransmissions, etc.
    - Notion of session
•   Elect leader with most committed txns
    - No follower → leader copies
•   Recovery
    - Last zxid is sufficient
    - In Phase 2, leader commands to add or truncate

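The recovery note can be sketched as a function that, given only the follower's last zxid, decides what the leader tells it to truncate and what to add. The names `SyncPlan` and `plan_sync` are hypothetical, and the sketch assumes the follower's history matches the leader's up to the last identifier they share, which is not the actual ZooKeeper code path.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional, Tuple

Zxid = Tuple[int, int]  # (epoch, counter)


class SyncPlan(NamedTuple):
    truncate: bool                # must the follower drop a suffix of its history?
    keep_through: Optional[Zxid]  # last zxid the follower keeps (None: keeps nothing)
    add: List[Zxid]               # transactions the leader sends afterwards


def plan_sync(leader_history: List[Zxid], follower_last: Optional[Zxid]) -> SyncPlan:
    """leader_history must be sorted; follower_last is None for an empty follower."""
    if follower_last is None:
        return SyncPlan(False, None, list(leader_history))
    i = bisect_right(leader_history, follower_last)
    common = leader_history[i - 1] if i > 0 else None
    # If the follower's last zxid is not in the leader's history, it holds
    # transactions the leader lacks and must truncate back to `common`.
    return SyncPlan(common != follower_last, common, leader_history[i:])


leader_history = [(0, 1), (0, 2), (1, 1)]
print(plan_sync(leader_history, (0, 2)))  # add only: [(1, 1)]
print(plan_sync(leader_history, (0, 3)))  # truncate to (0, 2), then add [(1, 1)]
```
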
Performance

Experimental setup

•   Implementation in Java
•   13 identical servers
    - Xeon 2.50GHz, Gigabit interface, two SATA disks

Throughput

[Figure: continuous saturated throughput in operations per second (axis 0 to 70000) versus number of servers in ensemble (axis 2 to 14). Series: Net only; Net + Disk; Net + Disk (no write cache); Net cap.]

Latency

[Figure: latency plot; no labels survive in the extracted text.]

Wrap up

Conclusion

•   ZooKeeper
    - Multiple outstanding operations
    - Dependencies between consecutive updates
•   Zab
    - Primary Order Broadcast
    - Synchronization phase
    - Efficient recovery

Questions?


http://zookeeper.apache.org
