SlideShare une entreprise Scribd logo
1  sur  58
Télécharger pour lire hors ligne
Synchronous Log Shipping
             Replication

      Takahiro Itagaki and Masao Fujii
     NTT Open Source Software Center

          PGCon 2008
Agenda

• Introduction: What is this?
        – Background
        – Compare with other replication solutions
• Details: How it works
        – Struggles in development
• Demo
• Future work: Where are we going?
• Conclusion



Copyright © 2008 NTT, Inc. All Rights Reserved.      2
What is this?
What is this?
• Successor of warm-standby servers
        – Replication system using WAL shipping.
                  • using Point-in-Time Recovery mechanism
        – However, no data loss after failover because of
          synchronous log-shipping.

                                                   WAL
                                   Active Server         Standby Server


• Based on PostgreSQL 8.2 with a patch and
  including several scripts
        – Patch: Add two processes into postgres
        – Scripts: For management commands

Copyright © 2008 NTT, Inc. All Rights Reserved.                           4
Warm-Standby Servers (v8.2~)
      Active Server (ACT)                                      Standby Server (SBY)

Commit              1
                                2 Flush WAL to disk
(Return)            3
                                                    Failover

            Crash!
                                   WAL seg                        Sent after commits

                                           4 archive_command
                                                                    WAL seg
                                                                                Redo



 The last segment is not                                        We need to wait for remounting
 available in the standby server                                active’s storage on the standby
 if the active crashes before                                   server, or we wait the active’s
 archiving it.                                                  reboot.
  Copyright © 2008 NTT, Inc. All Rights Reserved.                                                 5
Synchronous Log Shipping Servers
      Active Server (ACT)                                      Standby Server (SBY)

                                    WAL entries are sent
                                    before returning from
Commit              1               commits by records.
                                                                                 Segments are formed
                                2 Flush WAL to disk                              from records in the
                                                                  WAL records    standby server.
                                3 Send WAL records

(Return)             4
                                                                     WAL seg
                                                    Failover

            Crash!
                                                                               Redo

                                                                We can start the standby server
                                                                after redoing remaining segments;
                                                                We’ve received all transaction logs
                                                                already in it.
  Copyright © 2008 NTT, Inc. All Rights Reserved.                                                  6
Background: Why new solution?
• We have many migration projects from Oracles
  and compete with them with postgres.
        – So, we hope postgres to be SUPERIOR TO ORACLE!
• Our activity in PostgreSQL 8.3
        – Performance stability
                  • Smoothed checkpoint
        – Usability; Ease to tune server parameters
                  • Multiple autovacuum workers
                  • JIT bgwriter – automatic tuning of bgwriter



• Where are alternatives of RAC?
        – Oracle Real Application Clusters

Copyright © 2008 NTT, Inc. All Rights Reserved.                   7
Background: Alternatives of RAC
• Oracle RAC is a multi-purpose solution
        – … but we don’t need all of the properties.
• In our use:
        –     No downtime                         <-   Very Important
        –     No data loss                        <-   Very Important
        –     Automatic failover                  <-   Important
        –     Performance in updates              <-   Important
        –     Inexpensive hardware                <-   Important
        –     Performance scalability             <-   Not important
• Goal
        – Minimizing system downtime
        – Minimizing performance impact in updated-workloads
Copyright © 2008 NTT, Inc. All Rights Reserved.                         8
Compare with other replication solutions

                                  No data   No SQL                    Performance    Update      How to
                                                          Failover     scalability performance
                                    loss  restriction                                             copy?
Log Shipping                           OK           OK   Auto, Fast      No          Good
                                                                                                  Log
warm-standby                           NG           OK    Manual         No         Async
     Slony-I                           NG           OK    Manual        Good        Async        Trigger
                                                           Auto,
  pgpool-II                            OK           NG    Hard to       Good       Medium         SQL
                                                         re-attach
Shared Disks                           OK           OK   Auto, Slow      No          Good         Disk

• Log Shipping is excellent except performance scalability.
• Also, Re-attaching a repaired server is simple.
       – Just same as normal hot-backup procedure
                 • Copy active server’s data into standby and just wait for WAL replay.
       – No service stop during re-attaching

  Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     9
Compare downtime with shared disks
 • Cold standby with shared disks is an alternative solution
    – but it takes long time to failover in heavily-updated load.
    – Log-shipping saves time for mounting disks and recovery.
Shared disk system


Crash!
                                                               20 sec                  60 ~ 180 sec (*)
                                                             to umount                    to recover
Log-shipping system                                         and remount                 from the last
                                                            shared disks                 checkpoint

Crash!                                           Ok, the service is restarted!
                                     5 sec
   10 sec                         to recover                           (*) Measured in PostgreSQL 8.2.
  to detect                         the last                           8.3 would take less time because
server down                       segement                             of less i/o during recovery.

   Copyright © 2008 NTT, Inc. All Rights Reserved.                                                        10
Advantages and Disadvantages
• Advantages
        – Synchronous
                  • No data loss on failover
        – Log-based (Physically same structure)
                  • No functional restrictions in SQL
                  • Simple, Robust, and Easy to setup
        – Shared-nothing
                  • No Single Point of Failure
                  • No need for expensive shared disks
        – Automatic Fast Failover (within 15 seconds)
                  • “Automatic” is essential not to wait human operations
        – Less impact against update performance (less than 7%)
• Disadvantages
        – No performance scalability (for now)
        – Physical replication. Cannot use for upgrading purposes.

Copyright © 2008 NTT, Inc. All Rights Reserved.                             11
Where is it used?
• Interactive teleconference management package
        – Commercial service in active
        – Manage conference booking and file transfer
        – Log-shipping is an optional module for users requiring
          high availability




                                                                 Internet networks
                                                  Communicator




Copyright © 2008 NTT, Inc. All Rights Reserved.                                  12
How it works
System overview
•    Based on PostgreSQL 8.2, 8.3(under porting)
•    WALSender
        –     New child process of postmaster
        –     Reads WAL from walbuffers and sends WAL to WALReceiver
•    WALReceiver
        –     New daemon to receive WAL
        –     Writes WAL to disk and communicates with startup process
•    Using Heartbeat 2.1
        –     Open source high-availability software manages the resources via resource agent(RA)
        –     Heartbeat provides a virtual IP(VIP)
     Active                                                          Standby
                                                  Heartbeat             Heartbeat
                      VIP                            RA                    RA

                                                  PostgreSQL           PostgreSQL
                      DB                                                                 DB

                     WAL                           postgres              startup        WAL

                 wal
                buffers                                        WAL
                                                  WALSender            WALReceiver



Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     14
System overview
•    Based on PostgreSQL 8.2, 8.3(under porting)
•    WALSender
        –     New child process of postmaster
        –     Reads WAL from walbuffers and sends WAL to WALReceiver
•    WALReceiver
        –     New daemon to receive WAL
        –     Writes WAL to disk and communicates with startup process
•    Using Heartbeat 2.1                                  In our replicator, there are two
        –     Open source high-availability software manages the resources via resource agent(RA)
        –
                                             nodes, active and standby
              Heartbeat provides a virtual IP(VIP)
     Active                                                             Standby
                                                  Heartbeat                 Heartbeat
                      VIP                            RA                        RA

                                                  PostgreSQL               PostgreSQL
                      DB                                                                     DB

                     WAL                           postgres                  startup         WAL

                 wal
                buffers                                           WAL
                                                  WALSender                WALReceiver



Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     15
System overview
•    Based on PostgreSQL 8.2, 8.3(under porting)
•    WALSender
        –     New child process of postmaster
        –     Reads WAL from walbuffers and sends WAL to WALReceiver
•    WALReceiver
        –     New daemon to receive WAL
        –     Writes WAL to disk and communicates with startup process
•    Using Heartbeat 2.1
        –     Open source high-availability software manages the resources via resource agent(RA)
        –     Heartbeat provides a virtual IP(VIP)
     Active                                                           Standby
                                                  Heartbeat               Heartbeat
                      VIP                            RA                      RA

                                                               In the active node, postgres is
                                                  PostgreSQL              PostgreSQL
                      DB                                       running in normal mode with newDB
                                                               child process WALSender
                     WAL                           postgres                startup         WAL

                 wal
                buffers                                         WAL
                                                  WALSender              WALReceiver



Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     16
System overview
•    Based on PostgreSQL 8.2, 8.3(under porting)
•    WALSender
        –     New child process of postmaster
        –     Reads WAL from walbuffers and sends WAL to WALReceiver
•    WALReceiver
        –     New daemon to receive WAL
        –     Writes WAL to disk and communicates with startup process
•    Using Heartbeat 2.1
        –     Open source high-availability software manages the resources via resource agent(RA)
        –     Heartbeat provides a virtual IP(VIP)
     Active                                                         Standby
                                                  Heartbeat            Heartbeat
                      VIP                            RA                   RA

In the standby node, postgres is running
                       PostgreSQL                                     PostgreSQL
in continuous recovery mode with new
           DB                                                                            DB
daemon WALReceiver
                     WAL                           postgres             startup         WAL

                 wal
                buffers                                       WAL
                                                  WALSender           WALReceiver



Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     17
System overview
•    Based on PostgreSQL 8.2, 8.3(under porting)
•    WALSender
        –     New child process of postmaster
        –     Reads WAL from walbuffers and sends WAL to WALReceiver
•    WALReceiver
        –     New daemon to receive WAL
        –     Writes WAL to disk and communicates with startup process
•    Using Heartbeat 2.1
        –     Open source high-availability software manages the resources via resource agent(RA)
        –     Heartbeat provides a virtual IP(VIP)
     Active                                                          Standby
                                                  Heartbeat             Heartbeat
                      VIP                            RA                    RA

                                                  PostgreSQL           PostgreSQL
                      DB                                                                 DB
    In order to manage these resources,
             WAL           postgres                                      startup        WAL
    there is heartbeat in both nodes
                 wal
                buffers                                        WAL
                                                  WALSender            WALReceiver



Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     18
System overview
•    Based on PostgreSQL 8.2, 8.3(under porting)
•    WALSender
        –     New child process of postmaster
        –     Reads WAL from walbuffers and sends WAL to WALReceiver
•    WALReceiver
        –     New daemon to receive WAL
        –     Writes WAL to disk and communicates with startup process
•    Using Heartbeat 2.1
        –     Open source high-availability software manages the resources via resource agent(RA)
        –     Heartbeat provides a virtual IP(VIP)
     Active                                                          Standby
                                                  Heartbeat             Heartbeat
                      VIP                            RA                    RA

                                                  PostgreSQL           PostgreSQL
                      DB                                                                 DB

                     WAL                           postgres              startup        WAL

                 wal
                buffers                                        WAL
                                                  WALSender            WALReceiver



Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     19
WALSender
        Active
                                 postgres                    walbuffers         WALSender
               Update
                                                  Insert


               Commit
                                                  Flush
                                                                          WAL
                                                  Request

                                                                            Read
                                                                                     Send / Recv
                                                  (Return)

               (Return)




Copyright © 2008 NTT, Inc. All Rights Reserved.                                                    20
WALSender
        Active
                                 postgres                    walbuffers      WALSender
               Update                                                 XLogInsert()
                                                  Insert


               Commit
                                                  Flush       Update command triggers XLogInsert()
                                                              and inserts WAL into walbuffers
                                                                          WAL
                                                  Request

                                                                           Read
                                                                                     Send / Recv
                                                  (Return)

               (Return)




Copyright © 2008 NTT, Inc. All Rights Reserved.                                                      21
WALSender
        Active
                                 postgres                    walbuffers          WALSender
               Update                                                 XLogInsert()
                                                  Insert

                                                                      XLogWrite()
               Commit
                                                  Flush
                                                                           WAL
                                                  Request

                                                                              Read
                                                                                      Send / Recv
                                                  (Return)
                                                                          Commit command triggers
                                                                          XLogWrite() and flushs WAL to disk
               (Return)




Copyright © 2008 NTT, Inc. All Rights Reserved.                                                                22
WALSender
        Active
                                 postgres                    walbuffers         WALSender
               Update                                                 XLogInsert()
                                                  Insert

                                                                      XLogWrite()
               Commit
                                                  Flush
                                                                          WAL
             Changed                              Request

                                                                            Read
                                                                                     Send / Recv
                                                  (Return)

               (Return)
                                                    We changed XLogWrite() to request
                                                    WALSender to transfer WAL

Copyright © 2008 NTT, Inc. All Rights Reserved.                                                    23
WALSender
        Active
                                 postgres                    walbuffers      WALSender
               Update                                                 XLogInsert()
                                                  Insert

                                                                      XLogWrite()
               Commit                                                  WALSender reads WAL from
                                                  Flush
                                                                       walbuffers and transfer them
                                                                        WAL
             Changed                              Request

                                                                           Read
                                                                                     Send / Recv
                                                  (Return)

               (Return)
                                                   After transfer finishes, commit
                                                   command returns

Copyright © 2008 NTT, Inc. All Rights Reserved.                                                       24
WALReceiver
       Standby
                           WALReceiver                     WAL Disk      startup
           Recv / Send
                                                  Flush

                                                  Inform

                                                                      Read


                                                                              Replay




Copyright © 2008 NTT, Inc. All Rights Reserved.                                        25
WALReceiver
       Standby
                           WALReceiver                     WAL Disk      startup
           Recv / Send
                                                  Flush

                                                  Inform
    WALReceiver receives WAL from
                                                                      Read
    WALSender and flushes them to disk

                                                                              Replay




Copyright © 2008 NTT, Inc. All Rights Reserved.                                        26
WALReceiver
       Standby
                           WALReceiver                     WAL Disk      startup
           Recv / Send
                                                  Flush

                                                  Inform

                                                                      Read
    WALReceiver informs startup
    process of the latest LSN.
                                                                              Replay




Copyright © 2008 NTT, Inc. All Rights Reserved.                                        27
WALReceiver
       Standby
                           WALReceiver                     WAL Disk      startup
           Recv / Send
                                                  Flush                        ReadRecord()
                                                  Inform

                                                                      Read         Changed


Startup process reads WAL up to the latest LSN                                Replay
and replays.
We changed ReadRecord() so that startup
process could communicate with WALReceiver
and replay by each WAL record.




Copyright © 2008 NTT, Inc. All Rights Reserved.                                               28
Why replay by each WAL record?
• Minimize downtime
• Shorter delay in read-only queries (at the standby)

                                                          Our replicator   Warm-Standby
  Replay by each                                          WAL record       WAL segment
  Needed to be replayed at failover a few records                          the latest one segment
  Delay in read-only queries                              shorter          longer


       Our replicator
       Warm-standby
                                         segment1             segment2


                                                  WAL block
                                                  WAL which can be replayed now
                                                  WAL needed to be replayed at failover

Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     29
Why replay by each WAL record?
• Minimize downtime
• ShorterIn our replicator, becausequeries by each WAL record,
          delay in read-only of replay (at the standby)
                           the standby only has to replay a few records at failover

                                                          Our replicator   Warm-Standby
  Replay by each                                          WAL record       WAL segment
  Needed to be replayed at failover a few records                          the latest one segment
  Delay in read-only queries                              shorter          longer


       Our replicator
       Warm-standby
                                         segment1             segment2


                                                  WAL block
                                                  WAL which can be replayed now
                                                  WAL needed to be replayed at failover

Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     30
Why replay by each WAL record?
• Minimize downtime
               On the other hand, in warm-standby, because of
• Shorter delayreplay by each WAL segment, thethe standby)
                in read-only queries (at standby has to
                                            replay the latest one segment

                                                           Our replicator   Warm-Standby
  Replay by each                                           WAL record       WAL segment
  Needed to be replayed at failover a few records                           the latest one segment
  Delay in read-only queries                               shorter          longer


       Our replicator
       Warm-standby
                                         segment1               segment2


                                                   WAL block
                                                   WAL which can be replayed now
                                                   WAL needed to be replayed at failover

Copyright © 2008 NTT, Inc. All Rights Reserved.                                                      31
Why replay by each WAL record?
• Minimize downtime
• Shorter delay in read-only queries (at the standby)

                                                          Our replicator   Warm-Standby
  Replay by each                                          WAL record       WAL segment
  Needed to be replayed at failover a few records                          the latest one segment
  Delay in read-only queries                        In thisshorter
                                                            example, warm-standby needed to
                                                                            longer
                                                    replay most 'segment2' at failover.

       Our replicator
       Warm-standby
                                         segment1             segment2


                                                  WAL block
                                                  WAL which can be replayed now
                                                  WAL needed to be replayed at failover

Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     32
Why replay by each WAL record?
 • Minimize downtime
And,Shorter delay in read-only queries (at the standby)
 • in our replicator, because of
replay by each WAL record, delay
in read-only queries is shorter
                                                           Our replicator   Warm-Standby
   Replay by each                                          WAL record       WAL segment
   Needed to be replayed at failover a few records                          the latest one segment
   Delay in read-only queries                              shorter          longer


        Our replicator
        Warm-standby
                                          segment1             segment2


                                                   WAL block
                                                   WAL which can be replayed now
                                                   WAL needed to be replayed at failover

 Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     33
Why replay by each WAL record?
• Minimize downtime
• Shorter delay in read-only queries (at the standby)

                                   Our replicator
               Therefore, we implemeted replay                             Warm-Standby
  Replay by each each WAL record WAL record
               by                                                          WAL segment
  Needed to be replayed at failover a few records                          the latest one segment
  Delay in read-only queries                              shorter          longer


       Our replicator
       Warm-standby
                                         segment1             segment2


                                                  WAL block
                                                  WAL which can be replayed now
                                                  WAL needed to be replayed at failover

Copyright © 2008 NTT, Inc. All Rights Reserved.                                                     34
Heartbeat and resource agent
• Heartbeat needs resource agent (RA) to manage
  PostgreSQL(with WALSender) and WALReceiver as a
  resource
• RA is an executable providing the following feature
       Feature Description
       start                  start the resources as standby
       promote change the status from standby to active
       demote                 change the status from active to standby
       stop                   stop the resources
       monitor                check if the resource is running normally

                                Invoke            RA                Resources
         Heartbeat
                                                  start   promote    PostgreSQL
                                                  stop    demote     (WALSender)
                                                       monitor       WALReceiver



Copyright © 2008 NTT, Inc. All Rights Reserved.                                    35
Failover
• Failover occurs when heartbeat detects that the active
  node is not running normally
• After failover, clients can restart transactions only by
  reconnecting to virtual IP provided by Heartbeat
                                                  Standby
                                                      Heartbeat       startup

                                                   Detect
                    Act                                                     ReadRecord()

                                                            Request
                                                                                Changed
     At failover, heartbeat requests startup
     process to finish WAL replay.
                                                                                Replay
     We changed ReadRecord() to deal with
     this request.



Copyright © 2008 NTT, Inc. All Rights Reserved.                                            36
Failover
• Failover occurs when heartbeat detects that the active
  node is not running normally
• After failover, clients can restart transactions only by
  reconnecting to virtual IP provided by Heartbeat
                                                  Standby
                                                      Heartbeat       startup

                                                   Detect
                    Act                                                     ReadRecord()

                                                            Request
                                                                                Changed


                            After finishing WAL replay,                         Replay
                            the standby becomes active




Copyright © 2008 NTT, Inc. All Rights Reserved.                                            37
Struggles in development
Downtime caused by the standby down
• The active down triggers a failover and causes downtime
• Additionally, the standby down might also cause downtime
        – WALSender waits for the response from the standby after sending WAL
        – So, when the standby down occurs, unless WALSender detects the
          failure, WALSender is blocked
        – i.e. WALSender keeps waiting for the response which never comes

• How to detect
        – Timeout notification is needed to detect
        – Keepalive, but it doesn't work occasionally on Linux (Linux bug!?)
        – Original timeout
                                                     Active

                                                        postgres      WALSender

                                                  Commit
                                                               Request
                                                                             Send WAL             Down!!
                                                                                        Standby

                                                                             Wait
                                                               (Return)
                                                  (Return)                   Blocked

Copyright © 2008 NTT, Inc. All Rights Reserved.                                                            39
Downtime caused by clients
• Even if the database finishes a failover immediately,
  downtime might still be long by clients reason
        – Clients wait for the response from the database
        – So, when a failover occurs, unless clients detect a failover, they
          can't reconnect to the new active and restart the transaction
        – i.e. clients keeps waiting for the response which never comes

• How to detect
        – Timeout notification is needed to detect
        – Keepalive
                  • Our setKeepAlive patch was accepted in JDBC 8.4dev
        – Socket timeout
        – Query timeout                                                                  Client
                                                  We want to implement
                                                  these timeouts!!       Down!! Active        Active


Copyright © 2008 NTT, Inc. All Rights Reserved.                                                   40
Split-brain
• High-availability clusters must be able to handle split-
  brain
• Split-brain causes data inconsistency
        – Both nodes are active and provide the virtual IP
        – So, clients might update inconsistently each node

• Our replicator also causes split-brain unless the standby
  can distinguish network failure from the active down
                       Failure!!
    Active                                    Active     Active          Standby
                                                          Down!!


 If network failure are mis-detected as                If the active down are mis-detected as
 the active down, the standby becomes                  network failure, a failover doesn't start
 active even if the other active is still              even if the other active is down.
 running normally.                                     This scenario is also problem though
 This is split-brain scenario.                         split-brain doesn't occur.
Copyright © 2008 NTT, Inc. All Rights Reserved.                                                    41
Split-brain
• How to distinguish
        –       Combining the following solution

1. Redundant network between two nodes
        –       The standby can distinguish unless all networks fail

2. STONITH(Shoot The Other Node In The Head)
        –       Heartbeat's default solution for avoiding split-brain
        –       STONITH always forcibly turns off the active when activating the
                standby
        –       Split-brain doesn't occur because the active node is always only
                one
                                                        Active        Active

                                                  Turn off!!
                                                                 STONITH



Copyright © 2008 NTT, Inc. All Rights Reserved.                                42
What delays the activation of the standby
• In order to activate the standby immediately, recovery
  time at failover must be short!!

• In 8.2, recovery is very slow
        – A lot of WAL needed to be replayed at failover might be
          accumulated
        – Another problem: disk full failure might happen

• In 8.3, reocvery is fast☺
        – Because of avoiding unnecessary reads
        – But, there are still two problems




Copyright © 2008 NTT, Inc. All Rights Reserved.                     43
What delays the activation of the standby
1. Checkpoint during recovery
        –         It took 1min or more (in the worst case) and occupied 21% of
                  recovery time
        –         What is worse is that WAL replay is blocked during checkpoint
                  •       Because only startup process performs both checkpoint and WAL
                          replay
        -> Checkpoint delays recovery...

•        [Just idea] bgwriter during recovery
        –         Leaving checkpoint to bgwriter, and making startup process
                  concentrate on WAL replay




Copyright © 2008 NTT, Inc. All Rights Reserved.                                           44
What delays the activation of the standby
2. Checkpoint at the end of recovery
        – Activation of the standby is blocked during checkpoint
        -> Downtime might take 1min or more...

•        [Just idea] Skip of the checkpoint at the end of recovery
        –         But, postgres works fine if it fails before at least one checkpoint
                  after recovery?
        –         We have to reconsider why checkpoint is needed at the end of
                  recovery

!!! Of course, because recovery is a critical part for DBMS,
     more careful investigation is needed to realize these
     ideas



Copyright © 2008 NTT, Inc. All Rights Reserved.                                    45
How we choose the node with the later LSN

• When starting both two nodes, we should synchronize
  from the node with the later LSN to the other
        – But, it's unreliable to depend on server logs (e.g. heartbeat log)
          or a human memory in order to choose the node

• We choose the node from WAL which is most reliable
        – Find the latest LSN from WAL files in each node by using our
          original tool like xlogdump and compare them




Copyright © 2008 NTT, Inc. All Rights Reserved.                                46
Bottleneck
• Bad performance after failover
        – No FSM
        – A little commit hint bits in heap tuples
        – A little dead hint bits in indexes




Copyright © 2008 NTT, Inc. All Rights Reserved.      47
Demo
Demo
Environment
                                                   Client
•    2 nodes, 1 client
How to watch                               Node0        Node1
•   there are two kind of terminals
•   the terminal at the top of the screen displays the cluster status

                                                      The node with
                                                      • 3 lines is active
                                                      • 1 line is standby
                                                      • no line is not started yet


                                                                  active

                                                                 standby
•        the other terminal is for operation
        –         Client
        –         Node0
        –         Node1

Copyright © 2008 NTT, Inc. All Rights Reserved.                               49
Demo
Operation
1. start only node0 as the active

2. createdb and pgbench -i (from client)

3. online backup

4. copy the backup from node0 to node1

5. pgbench -c2 -t2000

6. start node1 as the standby during pgbench ->
   synchronization starts

7. killall -9 postgres (in active node0) -> failover occurs

Copyright © 2008 NTT, Inc. All Rights Reserved.               50
Future work
- Where are we going? -
Where are we going?
• We’re thinking to make it Open Source Software.
        – To be a multi-purpose replication framework
        – Collaborators welcome.
• TODO items
        – For 8.4 development
                  • Re-implement WAL-Sender and WAL-Receiver as extensions
                    using two new hooks
                  • Xlogdump to be an official contrib module
        – For performance
                  • Improve checkpointing during recovery
                  • Handling un-logged operations
        – For usability
                  • Improve detection of server down in client library
                  • Automatic retrying abundant transactions in client library


Copyright © 2008 NTT, Inc. All Rights Reserved.                                  52
For 8.4 : WAL-writing Hook
• Purpose
        – Make WAL-Sender to be one of general extensions
                  • WAL-Sender sends WAL records before commits
• Proposal
        – Introduce “WAL-subscriber model”
        – “WAL-writing Hook” enables to replace or filter WAL
          records just before they are written down to disks.
• Other extensions using this hook
        – “Software RAID” WAL writer for redundancy
                  • Writes WAL into two files for durability (it might be a
                    paranoia…)
        – Filter to make a bitmap for partial backup
                  • Writes changed pages into on-disk bitmaps
        – …

Copyright © 2008 NTT, Inc. All Rights Reserved.                               53
For 8.4 : WAL-reading Hook
• Purpose
        – Make WAL-Receiver to be one of general extensions
                  • WAL-Receiver redo in each record, not in each segment
• Proposal
        – “WAL-reading Hook” enables to filter WAL records
          during they are read in recovery.
• Other extensions using this hook
        – Read-ahead WAL reader
                  • Read a segment at once and pre-fetch required pages that are
                    not a full-page-writes and not in shared buffers
        – …



Copyright © 2008 NTT, Inc. All Rights Reserved.                                54
Future work : Multiple Configurations
 • Supports several synchronization modes
          – One configuration is not fit all,
            but one framework could fit many uses!
                                                       Before/After Commit in ACT
         No.                        Configuration    Send     Flush    Flush    Redo
                                                    to SBY   in ACT   in SBY   in SBY
              1 Speed                               After    After    After    After
              2 Speed + Durability                  After    Before   After    After
              3 HA + Speed                          Before   After    After    After
Now
              4 HA + Durability                     Before   Before   After    After
              5 HA + More durability                Before   Before   Before   After
              6 Synchronous Reads in SBY            Before   Before   Before   Before



  Copyright © 2008 NTT, Inc. All Rights Reserved.                                      55
Future work : Horizontal scalability
• Horizontal scalability is not our primary goal, but
  for potential users.
• Postgres TODO: “Allow a warm standby system to
  also allow read-only statements” helps us.
• NOTE: We need to support 3 or more servers
  if we need both scalability and availability.

           2 servers
                                                  2 * 50%   1 * 100%

           3 servers

                                                  3 * 66%   2 * 100%


Copyright © 2008 NTT, Inc. All Rights Reserved.                        56
Conclusion
• Synchronous log-shipping is the best for HA.
        – A direction of future warm-standby
        – Less downtime, No data loss, and Automatic failover.
• There remains rooms for improvements.
        – Minimize downtime and performance scalability.
        – Improvements for recovery also helps Log-shipping.


• We’ve shown requirements, advantages, and
  remaining tasks.
        – It has potential to improvements, but requires some
          works to be more useful solution
        – We’ll make it open source! Collaborators welcome!
Copyright © 2008 NTT, Inc. All Rights Reserved.                  57
Fin.



Contact
  itagaki.takahiro@oss.ntt.co.jp
  fujii.masao@oss.ntt.co.jp

Contenu connexe

Tendances

VMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A TutorialVMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A TutorialRichard McDougall
 
Virtualizacao de Servidores - Windows
Virtualizacao de Servidores - WindowsVirtualizacao de Servidores - Windows
Virtualizacao de Servidores - WindowsSergio Maia
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Santosh Kangane
 
The Magic of Hot Streaming Replication, Bruce Momjian
The Magic of Hot Streaming Replication, Bruce MomjianThe Magic of Hot Streaming Replication, Bruce Momjian
The Magic of Hot Streaming Replication, Bruce MomjianFuenteovejuna
 
KVM Tuning @ eBay
KVM Tuning @ eBayKVM Tuning @ eBay
KVM Tuning @ eBayXu Jiang
 
Docking postgres
Docking postgresDocking postgres
Docking postgresrycamor
 
Comandos
ComandosComandos
Comandos1 2d
 
Amazon EC2 in der Praxis
Amazon EC2 in der PraxisAmazon EC2 in der Praxis
Amazon EC2 in der PraxisJonathan Weiss
 
VMworld 2013: Capacity Jail Break: vSphere 5 Space Reclamation Nuts and Bolts
VMworld 2013: Capacity Jail Break: vSphere 5 Space Reclamation Nuts and Bolts VMworld 2013: Capacity Jail Break: vSphere 5 Space Reclamation Nuts and Bolts
VMworld 2013: Capacity Jail Break: vSphere 5 Space Reclamation Nuts and Bolts VMworld
 
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...VMworld
 
090507.New Replication Features(2)
090507.New Replication Features(2)090507.New Replication Features(2)
090507.New Replication Features(2)liufabin 66688
 
Atril-Déjà Vu Tea mserver 2 general presentation
Atril-Déjà Vu Tea mserver 2   general presentationAtril-Déjà Vu Tea mserver 2   general presentation
Atril-Déjà Vu Tea mserver 2 general presentationcohlmann
 
HA Clustering of PostgreSQL(replication)@2012.9.29 PG Study.
HA Clustering of PostgreSQL(replication)@2012.9.29 PG Study.HA Clustering of PostgreSQL(replication)@2012.9.29 PG Study.
HA Clustering of PostgreSQL(replication)@2012.9.29 PG Study.Takatoshi Matsuo
 

Tendances (20)

VMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A TutorialVMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A Tutorial
 
Virtualizacao de Servidores - Windows
Virtualizacao de Servidores - WindowsVirtualizacao de Servidores - Windows
Virtualizacao de Servidores - Windows
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0
 
The Magic of Hot Streaming Replication, Bruce Momjian
The Magic of Hot Streaming Replication, Bruce MomjianThe Magic of Hot Streaming Replication, Bruce Momjian
The Magic of Hot Streaming Replication, Bruce Momjian
 
Xen community update
Xen community updateXen community update
Xen community update
 
KVM Tuning @ eBay
KVM Tuning @ eBayKVM Tuning @ eBay
KVM Tuning @ eBay
 
XS Japan 2008 Oracle VM English
XS Japan 2008 Oracle VM EnglishXS Japan 2008 Oracle VM English
XS Japan 2008 Oracle VM English
 
Docking postgres
Docking postgresDocking postgres
Docking postgres
 
Comandos
ComandosComandos
Comandos
 
Amazon EC2 in der Praxis
Amazon EC2 in der PraxisAmazon EC2 in der Praxis
Amazon EC2 in der Praxis
 
VMworld 2013: Capacity Jail Break: vSphere 5 Space Reclamation Nuts and Bolts
VMworld 2013: Capacity Jail Break: vSphere 5 Space Reclamation Nuts and Bolts VMworld 2013: Capacity Jail Break: vSphere 5 Space Reclamation Nuts and Bolts
VMworld 2013: Capacity Jail Break: vSphere 5 Space Reclamation Nuts and Bolts
 
Xen ATG case study
Xen ATG case studyXen ATG case study
Xen ATG case study
 
PVOps Update
PVOps Update PVOps Update
PVOps Update
 
XS Boston 2008 VT-D PCI
XS Boston 2008 VT-D PCIXS Boston 2008 VT-D PCI
XS Boston 2008 VT-D PCI
 
XS Japan 2008 Citrix English
XS Japan 2008 Citrix EnglishXS Japan 2008 Citrix English
XS Japan 2008 Citrix English
 
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
 
High Availability Solutions in SQL 2012
High Availability Solutions in SQL 2012High Availability Solutions in SQL 2012
High Availability Solutions in SQL 2012
 
090507.New Replication Features(2)
090507.New Replication Features(2)090507.New Replication Features(2)
090507.New Replication Features(2)
 
Atril-Déjà Vu Tea mserver 2 general presentation
Atril-Déjà Vu Tea mserver 2   general presentationAtril-Déjà Vu Tea mserver 2   general presentation
Atril-Déjà Vu Tea mserver 2 general presentation
 
HA Clustering of PostgreSQL(replication)@2012.9.29 PG Study.
HA Clustering of PostgreSQL(replication)@2012.9.29 PG Study.HA Clustering of PostgreSQL(replication)@2012.9.29 PG Study.
HA Clustering of PostgreSQL(replication)@2012.9.29 PG Study.
 

Similaire à Synchronous Log Shipping Replication

Built-in Replication in PostgreSQL
Built-in Replication in PostgreSQLBuilt-in Replication in PostgreSQL
Built-in Replication in PostgreSQLMasao Fujii
 
Advanced MySQL Replication Architectures - Luis Soares
Advanced MySQL Replication Architectures - Luis SoaresAdvanced MySQL Replication Architectures - Luis Soares
Advanced MySQL Replication Architectures - Luis SoaresMySQL Brasil
 
Less14 br concepts
Less14 br conceptsLess14 br concepts
Less14 br conceptsAmit Bhalla
 
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...SQLExpert.pl
 
PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas) PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas) Ontico
 
The Magic of Hot Streaming Replication (Bruce Momjian)
The Magic of Hot Streaming Replication (Bruce Momjian)The Magic of Hot Streaming Replication (Bruce Momjian)
The Magic of Hot Streaming Replication (Bruce Momjian)Ontico
 
Sql server 2012 - always on deep dive - bob duffy
Sql server 2012 - always on deep dive - bob duffySql server 2012 - always on deep dive - bob duffy
Sql server 2012 - always on deep dive - bob duffyAnuradha
 
On The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL ClusterOn The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL ClusterSrihari Sriraman
 
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)Masao Fujii
 
Sql server logshipping
Sql server logshippingSql server logshipping
Sql server logshippingZeba Ansari
 
Persistence Is Futile - Implementing Delayed Durability
Persistence Is Futile - Implementing Delayed DurabilityPersistence Is Futile - Implementing Delayed Durability
Persistence Is Futile - Implementing Delayed DurabilityMark Broadbent
 
Accelerating Server Hardware Upgrades with PlateSpin Migrate P2P
Accelerating Server Hardware Upgrades with PlateSpin Migrate P2PAccelerating Server Hardware Upgrades with PlateSpin Migrate P2P
Accelerating Server Hardware Upgrades with PlateSpin Migrate P2PNovell
 
Sp2010 high availlability_sql
Sp2010 high availlability_sqlSp2010 high availlability_sql
Sp2010 high availlability_sqlSamuel Zürcher
 
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...LarryZaman
 
Sql server 2012 ha dr 24_hop_final
Sql server 2012 ha dr 24_hop_finalSql server 2012 ha dr 24_hop_final
Sql server 2012 ha dr 24_hop_finalJoseph D'Antoni
 
Sql server 2012 ha dr 24_hop_final
Sql server 2012 ha dr 24_hop_finalSql server 2012 ha dr 24_hop_final
Sql server 2012 ha dr 24_hop_finalJoseph D'Antoni
 
Oracle SOA Suite 12.2.1 new features
Oracle SOA Suite 12.2.1 new featuresOracle SOA Suite 12.2.1 new features
Oracle SOA Suite 12.2.1 new featuresMaarten Smeets
 
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...VMworld
 
[Altibase] 10 replication part3 (system design)
[Altibase] 10 replication part3 (system design)[Altibase] 10 replication part3 (system design)
[Altibase] 10 replication part3 (system design)altistory
 

Similaire à Synchronous Log Shipping Replication (20)

Built-in Replication in PostgreSQL
Built-in Replication in PostgreSQLBuilt-in Replication in PostgreSQL
Built-in Replication in PostgreSQL
 
Advanced MySQL Replication Architectures - Luis Soares
Advanced MySQL Replication Architectures - Luis SoaresAdvanced MySQL Replication Architectures - Luis Soares
Advanced MySQL Replication Architectures - Luis Soares
 
Less14 br concepts
Less14 br conceptsLess14 br concepts
Less14 br concepts
 
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
 
PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas) PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas)
 
The Magic of Hot Streaming Replication (Bruce Momjian)
The Magic of Hot Streaming Replication (Bruce Momjian)The Magic of Hot Streaming Replication (Bruce Momjian)
The Magic of Hot Streaming Replication (Bruce Momjian)
 
Sql server 2012 - always on deep dive - bob duffy
Sql server 2012 - always on deep dive - bob duffySql server 2012 - always on deep dive - bob duffy
Sql server 2012 - always on deep dive - bob duffy
 
On The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL ClusterOn The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL Cluster
 
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)
 
PostgreSQL replication
PostgreSQL replicationPostgreSQL replication
PostgreSQL replication
 
Sql server logshipping
Sql server logshippingSql server logshipping
Sql server logshipping
 
Persistence Is Futile - Implementing Delayed Durability
Persistence Is Futile - Implementing Delayed DurabilityPersistence Is Futile - Implementing Delayed Durability
Persistence Is Futile - Implementing Delayed Durability
 
Accelerating Server Hardware Upgrades with PlateSpin Migrate P2P
Accelerating Server Hardware Upgrades with PlateSpin Migrate P2PAccelerating Server Hardware Upgrades with PlateSpin Migrate P2P
Accelerating Server Hardware Upgrades with PlateSpin Migrate P2P
 
Sp2010 high availlability_sql
Sp2010 high availlability_sqlSp2010 high availlability_sql
Sp2010 high availlability_sql
 
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
 
Sql server 2012 ha dr 24_hop_final
Sql server 2012 ha dr 24_hop_finalSql server 2012 ha dr 24_hop_final
Sql server 2012 ha dr 24_hop_final
 
Sql server 2012 ha dr 24_hop_final
Sql server 2012 ha dr 24_hop_finalSql server 2012 ha dr 24_hop_final
Sql server 2012 ha dr 24_hop_final
 
Oracle SOA Suite 12.2.1 new features
Oracle SOA Suite 12.2.1 new featuresOracle SOA Suite 12.2.1 new features
Oracle SOA Suite 12.2.1 new features
 
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
 
[Altibase] 10 replication part3 (system design)
[Altibase] 10 replication part3 (system design)[Altibase] 10 replication part3 (system design)
[Altibase] 10 replication part3 (system design)
 

Plus de elliando dias

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slideselliando dias
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScriptelliando dias
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structureselliando dias
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de containerelliando dias
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agilityelliando dias
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Librarieselliando dias
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!elliando dias
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Webelliando dias
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduinoelliando dias
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorceryelliando dias
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Designelliando dias
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makeselliando dias
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.elliando dias
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Studyelliando dias
 

Plus de elliando dias (20)

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slides
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScript
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de container
 
Geometria Projetiva
Geometria ProjetivaGeometria Projetiva
Geometria Projetiva
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agility
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Libraries
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!
 
Ragel talk
Ragel talkRagel talk
Ragel talk
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Web
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduino
 
Minicurso arduino
Minicurso arduinoMinicurso arduino
Minicurso arduino
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorcery
 
Rango
RangoRango
Rango
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Design
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makes
 
Hadoop + Clojure
Hadoop + ClojureHadoop + Clojure
Hadoop + Clojure
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Study
 

Dernier

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 

Dernier (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Synchronous Log Shipping Replication

  • 1. Synchronous Log Shipping Replication Takahiro Itagaki and Masao Fujii NTT Open Source Software Center PGCon 2008
  • 2. Agenda • Introduction: What is this? – Background – Compare with other replication solutions • Details: How it works – Struggles in development • Demo • Future work: Where are we going? • Conclusion Copyright © 2008 NTT, Inc. All Rights Reserved. 2
  • 4. What is this? • Successor of warm-standby servers – Replication system using WAL shipping. • using Point-in-Time Recovery mechanism – However, no data loss after failover because of synchronous log-shipping. WAL Active Server Standby Server • Based on PostgreSQL 8.2 with a patch and including several scripts – Patch: Add two processes into postgres – Scripts: For management commands Copyright © 2008 NTT, Inc. All Rights Reserved. 4
  • 5. Warm-Standby Servers (v8.2~) Active Server (ACT) Standby Server (SBY) Commit 1 2 Flush WAL to disk (Return) 3 Failover Crash! WAL seg Sent after commits 4 archive_command WAL seg Redo The last segment is not We need to wait for remounting available in the standby server active’s storage on the standby if the active crashes before server, or we wait the active’s archiving it. reboot. Copyright © 2008 NTT, Inc. All Rights Reserved. 5
  • 6. Synchronous Log Shipping Servers Active Server (ACT) Standby Server (SBY) WAL entries are sent before returning from Commit 1 commits by records. Segments are formed 2 Flush WAL to disk from records in the WAL records standby server. 3 Send WAL records (Return) 4 WAL seg Failover Crash! Redo We can start the standby server after redoing remaining segments; We’ve received all transaction logs already in it. Copyright © 2008 NTT, Inc. All Rights Reserved. 6
  • 7. Background: Why new solution? • We have many migration projects from Oracles and compete with them with postgres. – So, we hope postgres to be SUPERIOR TO ORACLE! • Our activity in PostgreSQL 8.3 – Performance stability • Smoothed checkpoint – Usability; Ease to tune server parameters • Multiple autovacuum workers • JIT bgwriter – automatic tuning of bgwriter • Where are alternatives of RAC? – Oracle Real Application Clusters Copyright © 2008 NTT, Inc. All Rights Reserved. 7
  • 8. Background: Alternatives of RAC • Oracle RAC is a multi-purpose solution – … but we don’t need all of the properties. • In our use: – No downtime <- Very Important – No data loss <- Very Important – Automatic failover <- Important – Performance in updates <- Important – Inexpensive hardware <- Important – Performance scalability <- Not important • Goal – Minimizing system downtime – Minimizing performance impact in updated-workloads Copyright © 2008 NTT, Inc. All Rights Reserved. 8
  • 9. Compare with other replication solutions No data No SQL Performance Update How to Failover scalability performance loss restriction copy? Log Shipping OK OK Auto, Fast No Good Log warm-standby NG OK Manual No Async Slony-I NG OK Manual Good Async Trigger Auto, pgpool-II OK NG Hard to Good Medium SQL re-attach Shared Disks OK OK Auto, Slow No Good Disk • Log Shipping is excellent except performance scalability. • Also, Re-attaching a repaired server is simple. – Just same as normal hot-backup procedure • Copy active server’s data into standby and just wait for WAL replay. – No service stop during re-attaching Copyright © 2008 NTT, Inc. All Rights Reserved. 9
  • 10. Compare downtime with shared disks • Cold standby with shared disks is an alternative solution – but it takes long time to failover in heavily-updated load. – Log-shipping saves time for mounting disks and recovery. Shared disk system Crash! 20 sec 60 ~ 180 sec (*) to umount to recover Log-shipping system and remount from the last shared disks checkpoint Crash! Ok, the service is restarted! 5 sec 10 sec to recover (*) Measured in PostgreSQL 8.2. to detect the last 8.3 would take less time because server down segement of less i/o during recovery. Copyright © 2008 NTT, Inc. All Rights Reserved. 10
  • 11. Advantages and Disadvantages • Advantages – Synchronous • No data loss on failover – Log-based (Physically same structure) • No functional restrictions in SQL • Simple, Robust, and Easy to setup – Shared-nothing • No Single Point of Failure • No need for expensive shared disks – Automatic Fast Failover (within 15 seconds) • “Automatic” is essential not to wait human operations – Less impact against update performance (less than 7%) • Disadvantages – No performance scalability (for now) – Physical replication. Cannot use for upgrading purposes. Copyright © 2008 NTT, Inc. All Rights Reserved. 11
  • 12. Where is it used? • Interactive teleconference management package – Commercial service in active – Manage conference booking and file transfer – Log-shipping is an optional module for users requiring high availability Internet networks Communicator Copyright © 2008 NTT, Inc. All Rights Reserved. 12
  • 14. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 – Open source high-availability software manages the resources via resource agent(RA) – Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA PostgreSQL PostgreSQL DB DB WAL postgres startup WAL wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 14
  • 15. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 In our replicator, there are two – Open source high-availability software manages the resources via resource agent(RA) – nodes, active and standby Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA PostgreSQL PostgreSQL DB DB WAL postgres startup WAL wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 15
  • 16. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 – Open source high-availability software manages the resources via resource agent(RA) – Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA In the active node, postgres is PostgreSQL PostgreSQL DB running in normal mode with newDB child process WALSender WAL postgres startup WAL wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 16
  • 17. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 – Open source high-availability software manages the resources via resource agent(RA) – Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA In the standby node, postgres is running PostgreSQL PostgreSQL in continuous recovery mode with new DB DB daemon WALReceiver WAL postgres startup WAL wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 17
  • 18. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 – Open source high-availability software manages the resources via resource agent(RA) – Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA PostgreSQL PostgreSQL DB DB In order to manage these resources, WAL postgres startup WAL there is heartbeat in both nodes wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 18
  • 19. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 – Open source high-availability software manages the resources via resource agent(RA) – Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA PostgreSQL PostgreSQL DB DB WAL postgres startup WAL wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 19
  • 20. WALSender Active postgres walbuffers WALSender Update Insert Commit Flush WAL Request Read Send / Recv (Return) (Return) Copyright © 2008 NTT, Inc. All Rights Reserved. 20
  • 21. WALSender Active postgres walbuffers WALSender Update XLogInsert() Insert Commit Flush Update command triggers XLogInsert() and inserts WAL into walbuffers WAL Request Read Send / Recv (Return) (Return) Copyright © 2008 NTT, Inc. All Rights Reserved. 21
  • 22. WALSender Active postgres walbuffers WALSender Update XLogInsert() Insert XLogWrite() Commit Flush WAL Request Read Send / Recv (Return) Commit command triggers XLogWrite() and flushs WAL to disk (Return) Copyright © 2008 NTT, Inc. All Rights Reserved. 22
  • 23. WALSender Active postgres walbuffers WALSender Update XLogInsert() Insert XLogWrite() Commit Flush WAL Changed Request Read Send / Recv (Return) (Return) We changed XLogWrite() to request WALSender to transfer WAL Copyright © 2008 NTT, Inc. All Rights Reserved. 23
  • 24. WALSender Active postgres walbuffers WALSender Update XLogInsert() Insert XLogWrite() Commit WALSender reads WAL from Flush walbuffers and transfer them WAL Changed Request Read Send / Recv (Return) (Return) After transfer finishes, commit command returns Copyright © 2008 NTT, Inc. All Rights Reserved. 24
  • 25. WALReceiver Standby WALReceiver WAL Disk startup Recv / Send Flush Inform Read Replay Copyright © 2008 NTT, Inc. All Rights Reserved. 25
  • 26. WALReceiver Standby WALReceiver WAL Disk startup Recv / Send Flush Inform WALReceiver receives WAL from Read WALSender and flushes them to disk Replay Copyright © 2008 NTT, Inc. All Rights Reserved. 26
  • 27. WALReceiver Standby WALReceiver WAL Disk startup Recv / Send Flush Inform Read WALReceiver informs startup process of the latest LSN. Replay Copyright © 2008 NTT, Inc. All Rights Reserved. 27
  • 28. WALReceiver Standby WALReceiver WAL Disk startup Recv / Send Flush ReadRecord() Inform Read Changed Startup process reads WAL up to the latest LSN Replay and replays. We changed ReadRecord() so that startup process could communicate with WALReceiver and replay by each WAL record. Copyright © 2008 NTT, Inc. All Rights Reserved. 28
  • 29. Why replay by each WAL record? • Minimize downtime • Shorter delay in read-only queries (at the standby) Our replicator Warm-Standby Replay by each WAL record WAL segment Needed to be replayed at failover a few records the latest one segment Delay in read-only queries shorter longer Our replicator Warm-standby segment1 segment2 WAL block WAL which can be replayed now WAL needed to be replayed at failover Copyright © 2008 NTT, Inc. All Rights Reserved. 29
  • 30. Why replay by each WAL record? • Minimize downtime • ShorterIn our replicator, becausequeries by each WAL record, delay in read-only of replay (at the standby) the standby only has to replay a few records at failover Our replicator Warm-Standby Replay by each WAL record WAL segment Needed to be replayed at failover a few records the latest one segment Delay in read-only queries shorter longer Our replicator Warm-standby segment1 segment2 WAL block WAL which can be replayed now WAL needed to be replayed at failover Copyright © 2008 NTT, Inc. All Rights Reserved. 30
  • 31. Why replay by each WAL record? • Minimize downtime On the other hand, in warm-standby, because of • Shorter delayreplay by each WAL segment, thethe standby) in read-only queries (at standby has to replay the latest one segment Our replicator Warm-Standby Replay by each WAL record WAL segment Needed to be replayed at failover a few records the latest one segment Delay in read-only queries shorter longer Our replicator Warm-standby segment1 segment2 WAL block WAL which can be replayed now WAL needed to be replayed at failover Copyright © 2008 NTT, Inc. All Rights Reserved. 31
  • 32. Why replay by each WAL record? • Minimize downtime • Shorter delay in read-only queries (at the standby) Our replicator Warm-Standby Replay by each WAL record WAL segment Needed to be replayed at failover a few records the latest one segment Delay in read-only queries In thisshorter example, warm-standby needed to longer replay most 'segment2' at failover. Our replicator Warm-standby segment1 segment2 WAL block WAL which can be replayed now WAL needed to be replayed at failover Copyright © 2008 NTT, Inc. All Rights Reserved. 32
  • 33. Why replay by each WAL record? • Minimize downtime And,Shorter delay in read-only queries (at the standby) • in our replicator, because of replay by each WAL record, delay in read-only queries is shorter Our replicator Warm-Standby Replay by each WAL record WAL segment Needed to be replayed at failover a few records the latest one segment Delay in read-only queries shorter longer Our replicator Warm-standby segment1 segment2 WAL block WAL which can be replayed now WAL needed to be replayed at failover Copyright © 2008 NTT, Inc. All Rights Reserved. 33
  • 34. Why replay by each WAL record? • Minimize downtime • Shorter delay in read-only queries (at the standby) Our replicator Therefore, we implemeted replay Warm-Standby Replay by each each WAL record WAL record by WAL segment Needed to be replayed at failover a few records the latest one segment Delay in read-only queries shorter longer Our replicator Warm-standby segment1 segment2 WAL block WAL which can be replayed now WAL needed to be replayed at failover Copyright © 2008 NTT, Inc. All Rights Reserved. 34
  • 35. Heartbeat and resource agent • Heartbeat needs resource agent (RA) to manage PostgreSQL(with WALSender) and WALReceiver as a resource • RA is an executable providing the following feature Feature Description start start the resources as standby promote change the status from standby to active demote change the status from active to standby stop stop the resources monitor check if the resource is running normally Invoke RA Resources Heartbeat start promote PostgreSQL stop demote (WALSender) monitor WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 35
  • 36. Failover • Failover occurs when heartbeat detects that the active node is not running normally • After failover, clients can restart transactions only by reconnecting to virtual IP provided by Heartbeat Standby Heartbeat startup Detect Act ReadRecord() Request Changed At failover, heartbeat requests startup process to finish WAL replay. Replay We changed ReadRecord() to deal with this request. Copyright © 2008 NTT, Inc. All Rights Reserved. 36
  • 37. Failover • Failover occurs when heartbeat detects that the active node is not running normally • After failover, clients can restart transactions only by reconnecting to virtual IP provided by Heartbeat Standby Heartbeat startup Detect Act ReadRecord() Request Changed After finishing WAL replay, Replay the standby becomes active Copyright © 2008 NTT, Inc. All Rights Reserved. 37
  • 39. Downtime caused by the standby down • The active down triggers a failover and causes downtime • Additionally, the standby down might also cause downtime – WALSender waits for the response from the standby after sending WAL – So, when the standby down occurs, unless WALSender detects the failure, WALSender is blocked – i.e. WALSender keeps waiting for the response which never comes • How to detect – Timeout notification is needed to detect – Keepalive, but it doesn't work occasionally on Linux (Linux bug!?) – Original timeout Active postgres WALSender Commit Request Send WAL Down!! Standby Wait (Return) (Return) Blocked Copyright © 2008 NTT, Inc. All Rights Reserved. 39
  • 40. Downtime caused by clients • Even if the database finishes a failover immediately, downtime might still be long by clients reason – Clients wait for the response from the database – So, when a failover occurs, unless clients detect a failover, they can't reconnect to the new active and restart the transaction – i.e. clients keeps waiting for the response which never comes • How to detect – Timeout notification is needed to detect – Keepalive • Our setKeepAlive patch was accepted in JDBC 8.4dev – Socket timeout – Query timeout Client We want to implement these timeouts!! Down!! Active Active Copyright © 2008 NTT, Inc. All Rights Reserved. 40
  • 41. Split-brain • High-availability clusters must be able to handle split- brain • Split-brain causes data inconsistency – Both nodes are active and provide the virtual IP – So, clients might update inconsistently each node • Our replicator also causes split-brain unless the standby can distinguish network failure from the active down Failure!! Active Active Active Standby Down!! If network failure are mis-detected as If the active down are mis-detected as the active down, the standby becomes network failure, a failover doesn't start active even if the other active is still even if the other active is down. running normally. This scenario is also problem though This is split-brain scenario. split-brain doesn't occur. Copyright © 2008 NTT, Inc. All Rights Reserved. 41
  • 42. Split-brain • How to distinguish – Combining the following solution 1. Redundant network between two nodes – The standby can distinguish unless all networks fail 2. STONITH(Shoot The Other Node In The Head) – Heartbeat's default solution for avoiding split-brain – STONITH always forcibly turns off the active when activating the standby – Split-brain doesn't occur because the active node is always only one Active Active Turn off!! STONITH Copyright © 2008 NTT, Inc. All Rights Reserved. 42
  • 43. What delays the activation of the standby • In order to activate the standby immediately, recovery time at failover must be short!! • In 8.2, recovery is very slow – A lot of WAL needed to be replayed at failover might be accumulated – Another problem: disk full failure might happen • In 8.3, reocvery is fast☺ – Because of avoiding unnecessary reads – But, there are still two problems Copyright © 2008 NTT, Inc. All Rights Reserved. 43
  • 44. What delays the activation of the standby 1. Checkpoint during recovery – It took 1min or more (in the worst case) and occupied 21% of recovery time – What is worse is that WAL replay is blocked during checkpoint • Because only startup process performs both checkpoint and WAL replay -> Checkpoint delays recovery... • [Just idea] bgwriter during recovery – Leaving checkpoint to bgwriter, and making startup process concentrate on WAL replay Copyright © 2008 NTT, Inc. All Rights Reserved. 44
  • 45. What delays the activation of the standby 2. Checkpoint at the end of recovery – Activation of the standby is blocked during checkpoint -> Downtime might take 1min or more... • [Just idea] Skip of the checkpoint at the end of recovery – But, postgres works fine if it fails before at least one checkpoint after recovery? – We have to reconsider why checkpoint is needed at the end of recovery !!! Of course, because recovery is a critical part for DBMS, more careful investigation is needed to realize these ideas Copyright © 2008 NTT, Inc. All Rights Reserved. 45
  • 46. How we choose the node with the later LSN • When starting both two nodes, we should synchronize from the node with the later LSN to the other – But, it's unreliable to depend on server logs (e.g. heartbeat log) or a human memory in order to choose the node • We choose the node from WAL which is most reliable – Find the latest LSN from WAL files in each node by using our original tool like xlogdump and compare them Copyright © 2008 NTT, Inc. All Rights Reserved. 46
  • 47. Bottleneck • Bad performance after failover – No FSM – A little commit hint bits in heap tuples – A little dead hint bits in indexes Copyright © 2008 NTT, Inc. All Rights Reserved. 47
  • 48. Demo
  • 49. Demo Environment Client • 2 nodes, 1 client How to watch Node0 Node1 • there are two kind of terminals • the terminal at the top of the screen displays the cluster status The node with • 3 lines is active • 1 line is standby • no line is not started yet active standby • the other terminal is for operation – Client – Node0 – Node1 Copyright © 2008 NTT, Inc. All Rights Reserved. 49
  • 50. Demo Operation 1. start only node0 as the active 2. createdb and pgbench -i (from client) 3. online backup 4. copy the backup from node0 to node1 5. pgbench -c2 -t2000 6. start node1 as the standby during pgbench -> synchronization starts 7. killall -9 postgres (in active node0) -> failover occurs Copyright © 2008 NTT, Inc. All Rights Reserved. 50
  • 51. Future work - Where are we going? -
  • 52. Where are we going? • We’re thinking to make it Open Source Software. – To be a multi-purpose replication framework – Collaborators welcome. • TODO items – For 8.4 development • Re-implement WAL-Sender and WAL-Receiver as extensions using two new hooks • Xlogdump to be an official contrib module – For performance • Improve checkpointing during recovery • Handling un-logged operations – For usability • Improve detection of server down in client library • Automatic retrying abundant transactions in client library Copyright © 2008 NTT, Inc. All Rights Reserved. 52
  • 53. For 8.4 : WAL-writing Hook • Purpose – Make WAL-Sender to be one of general extensions • WAL-Sender sends WAL records before commits • Proposal – Introduce “WAL-subscriber model” – “WAL-writing Hook” enables to replace or filter WAL records just before they are written down to disks. • Other extensions using this hook – “Software RAID” WAL writer for redundancy • Writes WAL into two files for durability (it might be a paranoia…) – Filter to make a bitmap for partial backup • Writes changed pages into on-disk bitmaps – … Copyright © 2008 NTT, Inc. All Rights Reserved. 53
  • 54. For 8.4 : WAL-reading Hook • Purpose – Make WAL-Receiver to be one of general extensions • WAL-Receiver redo in each record, not in each segment • Proposal – “WAL-reading Hook” enables to filter WAL records during they are read in recovery. • Other extensions using this hook – Read-ahead WAL reader • Read a segment at once and pre-fetch required pages that are not a full-page-writes and not in shared buffers – … Copyright © 2008 NTT, Inc. All Rights Reserved. 54
  • 55. Future work : Multiple Configurations • Supports several synchronization modes – One configuration is not fit all, but one framework could fit many uses! Before/After Commit in ACT No. Configuration Send Flush Flush Redo to SBY in ACT in SBY in SBY 1 Speed After After After After 2 Speed + Durability After Before After After 3 HA + Speed Before After After After Now 4 HA + Durability Before Before After After 5 HA + More durability Before Before Before After 6 Synchronous Reads in SBY Before Before Before Before Copyright © 2008 NTT, Inc. All Rights Reserved. 55
  • 56. Future work : Horizontal scalability • Horizontal scalability is not our primary goal, but for potential users. • Postgres TODO: “Allow a warm standby system to also allow read-only statements” helps us. • NOTE: We need to support 3 or more servers if we need both scalability and availability. 2 servers 2 * 50% 1 * 100% 3 servers 3 * 66% 2 * 100% Copyright © 2008 NTT, Inc. All Rights Reserved. 56
  • 57. Conclusion • Synchronous log-shipping is the best for HA. – A direction of future warm-standby – Less downtime, No data loss, and Automatic failover. • There remains rooms for improvements. – Minimize downtime and performance scalability. – Improvements for recovery also helps Log-shipping. • We’ve shown requirements, advantages, and remaining tasks. – It has potential to improvements, but requires some works to be more useful solution – We’ll make it open source! Collaborators welcome! Copyright © 2008 NTT, Inc. All Rights Reserved. 57
  • 58. Fin. Contact itagaki.takahiro@oss.ntt.co.jp fujii.masao@oss.ntt.co.jp