1. LCA 2012 HA Miniconf
Building a non-shared storage HA cluster with
Pacemaker and PostgreSQL 9.1
2012/01/16
Keisuke MORI
NTT DATA Intellilink Corporation.
Linux-HA Japan Project.
http://linux-ha.sourceforge.jp/
Copyright(c) 2012 Linux-HA Japan Project
2. Introduction
The PostgreSQL database now supports Streaming
Replication (SR)
2010.09 Release 9.0 - "Asynchronous" replication supported.
2011.09 Release 9.1 - "Synchronous" replication supported.
NTT contributed the feature to the PostgreSQL community.
Integration with HA cluster software is necessary to
accomplish automatic fail-over.
We have developed an enhanced version of the
"pgsql" resource agent (RA) for integration with
Pacemaker.
3. Existing HA Configuration for PostgreSQL
[Diagram: two-node cluster with shared storage]
Node #1 runs PostgreSQL (Active) and answers read/write queries; PostgreSQL (Standby) on Node #2 is not running and is started only when a failure occurs. On each node the pgsql RA manages PostgreSQL (start / stop / monitor) under Pacemaker. The database resides on shared storage.
4. Advanced HA Configuration with PostgreSQL SR
[Diagram: two-node cluster without shared storage]
Node #1 runs PostgreSQL as Primary (PRI) and Node #2 as Hot-Standby (HS), linked by Streaming Replication (SR): the PRI sends "WAL" records to the HS; the HS is running and can also answer read queries. New enhancement: the pgsql RA manages the PRI/HS state in PostgreSQL as a Master/Slave resource (start / stop / monitor / promote / demote) under Pacemaker. No shared storage is needed on either node.
5. Benefits of Streaming Replication
Removing a Single Point Of Failure (SPOF)
Shared storage could be a SPOF.
Reducing the cost
Shared storage is very expensive!
Faster Fail-Over / Shorter Downtime
by eliminating the crash recovery time of the database.
Crash recovery time is the dominant factor in downtime,
particularly for large databases.
Load Balancing for read-only queries
6. Comparison of Replication Technology
                        | PostgreSQL 9.1 SR (sync)         | Shared Storage                      | DRBD (Protocol C)                   | Slony-I
High Availability usage | OK                               | OK                                  | OK                                  | N.A. (async only)
Fail-Over Time          | no mount/umount, no crash recovery | mount/umount, crash recovery        | mount/umount, crash recovery        | N.A.
Read Scalability        | OK                               | cluster-aware applications required | cluster-aware applications required | OK
Non-DB usage            | N.A.                             | OK                                  | OK                                  | N.A.
Throughput Performance  | approx. 90%-99% (*1)             | 100%                                | approx. 70%-80% (*2)                | -
(*1) Varies with the workload.
(*2) Assumes DB usage; varies with the workload.
7. Key Features of the new pgsql RA
Manages the Primary/Hot-Standby status in PostgreSQL
works as a Master/Slave resource in Pacemaker.
Data protection
Prevents PostgreSQL from starting when the stored data is
considered unreliable or inconsistent.
The RA creates a "lock file" and intentionally leaves it across reboots
to indicate that the data on the node is likely unreliable.
Lock file: /var/lib/pgsql/PGSQL.lock
Displays the PostgreSQL status on crm_mon (see the example output below)
running status, data status
makes operation easier.
Determines which node has the latest data
when both nodes are started at the same time.
but depending on this is not recommended, to keep the operation simple.
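As an illustration of the crm_mon display (a hypothetical, abbreviated one-shot output; the node names come from the sample configuration later in this deck, and the exact layout depends on the Pacemaker version):

# crm_mon -A -1    (one-shot run, showing node attributes)
 Master/Slave Set: msPostgresql
     Masters: [ devnode1 ]
     Slaves: [ devnode2 ]
Node Attributes:
* Node devnode1:
    + pgsql-status          : PRI
    + pgsql-data-status     : LATEST
* Node devnode2:
    + pgsql-status          : HS:sync
    + pgsql-data-status     : STREAMING|SYNC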
9. Sample CRM configuration for pgsql RA
ms msPostgresql postgresql \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"

primitive postgresql ocf:heartbeat:pgsql \
    params pgctl="/usr/pgsql-9.1/bin/pg_ctl" \
           psql="/usr/pgsql-9.1/bin/psql" \
           pgctldata="/usr/pgsql-9.1/bin/pg_controldata" \
           pgdata="/var/lib/pgsql/9.1/data/" \
           start_opt="-p 5432" \
           rep_mode="sync" \
           node_list="devnode1 devnode2" \
           restore_command="cp /var/lib/pgsql/9.1/data/pg_archive/%f %p" \
           master_ip="192.168.122.103" \
           stop_on_demote="yes" \
    op start timeout="60s" interval="0s" on-fail="restart" \
    op monitor timeout="60s" interval="7s" on-fail="restart" \
    op monitor timeout="60s" interval="2s" on-fail="restart" role="Master" \
    op promote timeout="60s" interval="0s" on-fail="restart" \
    op demote timeout="60s" interval="0s" on-fail="block" \
    op stop timeout="60s" interval="0s" on-fail="block" \
    op notify timeout="60s" interval="0s"

Notes:
rep_mode: "sync" enables the SR support
master_ip: the virtual IP for the replication
stop_on_demote: "yes" is recommended. See the discussion later.
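As a usage sketch (the file name is an assumption), the configuration above can be saved to a file and loaded into the CIB with the crm shell:

# save the snippet above as pgsql.crm, then load it and verify
crm configure load update pgsql.crm
crm configure show msPostgresql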
10. Sample CRM configuration for VIPs
group master-group vip-master vip-rep
primitive vip-master ocf:heartbeat:IPaddr2
params ip="192.168.100.101" nic="eth0" cidr_netmask="24"
op start timeout="60s" interval="0s" on-fail="restart"
op monitor timeout="60s" interval="10s" on-fail="restart"
op stop timeout="60s" interval="0s" on-fail="block"
primitive vip-rep ocf:heartbeat:IPaddr2
params ip="192.168.122.103" nic="eth3" cidr_netmask="24"
(ditto)
primitive vip-slave ocf:heartbeat:IPaddr2
params ip="192.168.100.102" nic="eth0" cidr_netmask="24"
meta resource-stickiness="1"
(ditto)
colocation rsc_colocation-2 inf: master-group msPostgresql:Master
order rsc_order-2 0: msPostgresql:promote master-group:start symmetrical=false
order rsc_order-3 0: msPostgresql:demote master-group:stop symmetrical=false
location rsc_location-1 vip-slave
rule 200: pgsql-status eq "HS:sync"
rule 100: pgsql-status eq "PRI"
rule -inf: not_defined pgsql-status
rule -inf: pgsql-status ne "HS:sync" and pgsql-status ne "PRI"
11. Newly introduced parameters for the pgsql RA
Name                 |   | Description
rep_mode             | R | Replication mode: none (default) / async / sync.
node_list            | R | All node names, separated by spaces.
restore_command      | R | restore_command for recovery.conf.
master_ip            | R | The Master's floating IP address to which the hot standby connects. Used for "primary_conninfo" in recovery.conf.
repuser              |   | User used to connect to the master server. Used for "primary_conninfo" in recovery.conf. Default: postgres
stop_on_demote       |   | Whether to stop PostgreSQL instead of restarting it on demote, to speed up fail-over (yes / no (default)).
primary_conninfo_opt |   | primary_conninfo options for recovery.conf other than host, port, user and application_name.
tmpdir               |   | Path to the temporary directory. Default: /var/lib/pgsql
pgctldata            |   | Path to the pg_controldata command. Default: /usr/bin/pg_controldata
xlog_check_count     |   | Number of xlog checks in monitor before promote. Default: 3
crm_attr_timeout     |   | Timeout of the crm_attribute "forever" update command. Default: 5
R: Required when streaming replication is enabled.
12. Sample PostgreSQL configuration
postgresql.conf (excerpt)
Only the part related to streaming replication is shown.
See the PostgreSQL manual for details.
listen_addresses = '*'
wal_level = hot_standby
synchronous_commit = on
archive_mode = on
archive_command = '/bin/cp %p /var/lib/pgsql/9.1/data/pg_archive/%f'
max_wal_senders=5
wal_keep_segments = 32
hot_standby = on
include = '/var/lib/pgsql/rep_mode.conf'
restart_after_crash = off
replication_timeout = 5000 # milliseconds
wal_receiver_status_interval = 2 # seconds
rep_mode.conf:
the pgsql RA creates this file to control the PostgreSQL
replication mode (an illustrative example of its contents follows).
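For illustration only: the exact contents are generated and rewritten by the RA, but in sync mode rep_mode.conf is expected to carry the synchronous replication setting, roughly like this (the standby name shown is an assumption based on the sample node_list):

# rep_mode.conf - written by the pgsql RA (illustrative contents)
synchronous_standby_names = 'devnode2'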
14. Status attributes
pgsql-status: running status of PostgreSQL on each node
Value Description
STOP Not running.
HS:alone Running as Hot-Standby, not connected to Primary.
HS:connected Running as Hot-Standby, connected to Primary, transient state.
HS:async Running as Hot-Standby in Asynchronous replication mode.
HS:potential Running as Hot-Standby (only appears when 3 or more nodes).
HS:sync Running as Hot-Standby in Synchronous replication mode.
PRI Running as Primary.
pgsql-data-status: data status on each node
Value Description
DISCONNECTED The data is out of date. Must not become Primary.
STREAMING|ASYNC The data is being replicated from the Primary but may not be up to date. Must not become Primary.
STREAMING|SYNC The data is being replicated from the Primary and is up to date. Ready to become Primary.
LATEST The data is up to date and the node is now Primary.
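The same attributes can also be queried directly per node; a small sketch (the node name comes from the sample configuration, and option spellings can differ slightly between Pacemaker versions):

# query the data status recorded for node devnode1
crm_attribute -l forever -N devnode1 -n pgsql-data-status -G -q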
15. Best Practice of the Operation Procedure
General Recommendations
Invoke the cluster nodes one by one: PRI first, HS second.
Always copy the database from the PRI node before starting the
HS node to make sure the data is consistent.
Initial Invocation
(0) Initialize the database on #1
or determine which node has the latest data manually
(1) Invoke Pacemaker and pgsql resource on #1(PRI)
(2) Copy the database from #1(PRI) to #2
pg_basebackup -h $MASTER_IP -U postgres -D $PGDATA --xlog
or any other backup/restore method should work (e.g. rsync)
(3) Invoke Pacemaker on #2(HS)
(4) Make sure the replication is working successfully on #2(HS)
wait until "pgsql-status : HS:sync" appears on #2
(a command-level sketch of steps (0)-(4) follows)
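A command-level sketch of the procedure above, assuming a Heartbeat-based stack (as in the tested versions) and the paths and replication VIP from the sample configuration; the init command and paths are assumptions for illustration:

# (0) on node #1: initialize the database (first-time setup only)
su - postgres -c '/usr/pgsql-9.1/bin/initdb -D /var/lib/pgsql/9.1/data'

# (1) on node #1: start Heartbeat/Pacemaker; the node is promoted to PRI
service heartbeat start

# (2) on node #2: copy the database from the PRI node
#     (192.168.122.103 is the replication VIP from the sample configuration)
su - postgres -c 'pg_basebackup -h 192.168.122.103 -U postgres -D /var/lib/pgsql/9.1/data --xlog'

# (3) on node #2: start Heartbeat/Pacemaker; the node joins as HS
service heartbeat start

# (4) on either node: wait until node #2 shows "pgsql-status : HS:sync"
crm_mon -A -1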
16. Best Practice of the Operation Procedure
Recovery from a failure
(0) fail-over or switch-over occurred; #2 is now PRI
(1) Stop Pacemaker on the failed node #1
(2) Repair whatever is broken, if needed
(3) Copy the database from PRI(#2) to #1
(4) Clear the "lock" file created by the RA
rm /var/lib/pgsql/PGSQL.lock
this lets the RA know that we have verified the data is consistent.
(5) Invoke Pacemaker on #1(HS)
(6) Make sure the replication is working successfully on #1(HS)
wait until "pgsql-status : HS:sync" appears on #1
(a command-level sketch of steps (1)-(6) follows)
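A command-level sketch of the recovery procedure, under the same assumptions as the previous slide (Heartbeat-based stack, paths and replication VIP taken from the sample configuration):

# (1) on node #1: stop Heartbeat/Pacemaker on the failed node
service heartbeat stop

# (3) on node #1: copy the database back from the current PRI (node #2);
#     pg_basebackup needs an empty data directory, so move the old one aside first
su - postgres -c 'pg_basebackup -h 192.168.122.103 -U postgres -D /var/lib/pgsql/9.1/data --xlog'

# (4) on node #1: clear the lock file left by the RA
rm /var/lib/pgsql/PGSQL.lock

# (5) on node #1: start Heartbeat/Pacemaker again; the node rejoins as HS
service heartbeat start

# (6) on either node: wait until node #1 shows "pgsql-status : HS:sync"
crm_mon -A -1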
17. Implementation Challenges
State transition models are different between
Pacemaker and PostgreSQL
PostgreSQL cannot "demote"
cannot transition from the Primary state to the Hot-Standby state
the Primary state is only allowed to transition to the Stop state
Difference of Concepts
Pacemaker: "Master" is an additional state
PostgreSQL: "Slave (Hot-Standby)" is an additional state
Current Solution
the "stop_on_demote" parameter
19. “stop_on_demote” parameter
yes: obey the PostgreSQL transition model
(recommended)
Always stops PostgreSQL on demote.
Simplifies the operation.
A monitor may fail if it runs between a demote and a stop
operation.
no: obey the Pacemaker transition model
Stops PostgreSQL once and invokes it again on demote.
Takes a very long time to complete the fail-over.
Requires the archive files to be kept up to date via scp or shared
storage (even though the goal is to build a non-shared storage cluster!).
Rather complicated configuration and operation.
20. Proposal for future enhancements of Pacemaker
Add new state transition paths by either:
(1) extending the return-code semantics of operations
when "start" returns OCF_RUNNING_MASTER:
Stopped → Master state
when "demote" returns OCF_NOT_RUNNING:
Master → Stopped state
RAs decide the next transition state.
(2) changing the operation semantics
"promote" may be invoked in the Stopped state
Supposed to move to the Stopped state after "demote" has succeeded.
Pacemaker decides the semantics via a configuration parameter.
This method does NOT work for the pgsql RA:
it cannot decide which node should be promoted before it is started.
21. Other Issues
(1) The crm_attribute command rarely hangs
When invoked at the moment the DC node is absent due to a node
failure.
Workaround: a wrapper function that makes the call time out
(see the sketch at the end of this slide).
Details will be filed to the bugzilla soon.
(2) Cannot obtain the Master uname in monitor
OCF_RESKEY_CRM_meta_notify_master_uname is not set in
monitor.
The pgsql RA needs to know the uname of the Master during the
monitor operation to manage the current status of PostgreSQL.
Workaround: parse the crm_mon output.
Filed to bugzilla LF#2607
(The Linux Foundation site is still down though...)
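A minimal sketch of such a timeout wrapper (not the RA's actual code); it assumes coreutils timeout(1) is available, uses 5 seconds to match the crm_attr_timeout default, and the node/attribute names in the usage line are taken from the examples in this deck (crm_attribute option spellings can differ slightly between Pacemaker versions):

# run a command but give up after 5 seconds instead of hanging
exec_with_timeout() {
    timeout 5 "$@"
    rc=$?
    if [ $rc -ne 0 ]; then
        echo "command failed or timed out (rc=$rc): $*" >&2
    fi
    return $rc
}

# example: update the pgsql-data-status attribute without risking a hang
exec_with_timeout crm_attribute -l forever -N devnode1 -n pgsql-data-status -v LATEST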
22. Conclusion
TODO
merging into the upstream resource-agents package
code clean-up and refactoring
Development code
https://github.com/t-matsuo/resource-agents/blob/pgsql91/heartbeat/pgsql
Tested versions:
postgresql-9.1.1 (or later)
pacemaker-1.0.11 and heartbeat-3.0.5
(should be independent of the cluster stack / versions)
The key developer: Takatoshi MATSUO
Documents / Sample configuration
https://github.com/t-matsuo/resource-agents/wiki/Resource-Agent-for-PostgreSQL-9.1-streaming-replication
Discussions
pacemaker or linux-ha-dev Mailing Lists
Any comments and improvements are welcome!