1. LCA 2012 HA Miniconf
Building a non-shared storage HA cluster with
Pacemaker and PostgreSQL 9.1
2012/01/16
Keisuke MORI
NTT DATA Intellilink Corporation.
Linux-HA Japan Project.
http://linux-ha.sourceforge.jp/
Copyright(c) 2012 Linux-HA Japan Project
2. Introduction
The PostgreSQL database now supports Streaming
Replication (SR)
2010.09 Release 9.0 - "Asynchronous" replication supported.
2011.09 Release 9.1 - "Synchronous" replication supported.
NTT contributed the feature to the PostgreSQL community.
Integration with HA cluster software is necessary to
accomplish automatic fail-over.
We have developed an enhanced version of the
"pgsql" resource agent (RA) for integration with
Pacemaker.
3. Existing HA Configuration for PostgreSQL
[Diagram: two-node cluster with shared storage]
Node #1 runs PostgreSQL (Active) and answers read/write queries; PostgreSQL (Standby) on Node #2 is not running and is started only when a failure occurs. On each node the pgsql RA manages PostgreSQL (start / stop / monitor) under Pacemaker. The database resides on shared storage.
4. Advanced HA Configuration with PostgreSQL SR
[Diagram: two-node cluster without shared storage]
Node #1 runs PostgreSQL as Primary (PRI) and Node #2 as Hot-Standby (HS), linked by Streaming Replication (SR): the PRI sends "WAL" records to the HS; the HS is running and can also answer read queries. New enhancement: the pgsql RA manages the PRI/HS state in PostgreSQL as a Master/Slave resource (start / stop / monitor / promote / demote) under Pacemaker. No shared storage is needed on either node.
5. Benefits of Streaming Replication
Removing a Single Point Of Failure (SPOF)
Shared storage could be a SPOF.
Reducing the cost
Shared storage is very expensive!
Faster Fail-Over / Shorter Downtime
by eliminating the crash recovery time of the database.
Crash recovery time is the dominant factor in downtime,
particularly for large databases.
Load Balancing for read-only queries
6. Comparison of Replication Technology
                        | PostgreSQL 9.1 SR (sync)         | Shared Storage                      | DRBD (Protocol C)                   | Slony-I
High Availability usage | OK                               | OK                                  | OK                                  | N.A. (async only)
Fail-Over Time          | no mount/umount, no crash recovery | mount/umount, crash recovery        | mount/umount, crash recovery        | N.A.
Read Scalability        | OK                               | cluster-aware applications required | cluster-aware applications required | OK
Non-DB usage            | N.A.                             | OK                                  | OK                                  | N.A.
Throughput Performance  | approx. 90%-99% (*1)             | 100%                                | approx. 70%-80% (*2)                | -
(*1) Varies with the workload.
(*2) Assumes DB usage; varies with the workload.
7. Key Features of the new pgsql RA
Manages the Primary/Hot-Standby status in PostgreSQL
works as a Master/Slave resource in Pacemaker.
Data protection
Prevents PostgreSQL from starting when the stored data is
considered unreliable or inconsistent.
The RA creates a "lock file" and intentionally leaves it across reboots
to indicate that the data on the node is likely unreliable.
Lock file: /var/lib/pgsql/PGSQL.lock
Displays the PostgreSQL status on crm_mon (see the example output below)
running status, data status
makes operation easier.
Determines which node has the latest data
when both nodes are started at the same time.
but depending on this is not recommended, to keep the operation simple.
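As an illustration of the crm_mon display (a hypothetical, abbreviated one-shot output; the node names come from the sample configuration later in this deck, and the exact layout depends on the Pacemaker version):

# crm_mon -A -1    (one-shot run, showing node attributes)
 Master/Slave Set: msPostgresql
     Masters: [ devnode1 ]
     Slaves: [ devnode2 ]
Node Attributes:
* Node devnode1:
    + pgsql-status          : PRI
    + pgsql-data-status     : LATEST
* Node devnode2:
    + pgsql-status          : HS:sync
    + pgsql-data-status     : STREAMING|SYNC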
9. Sample CRM configuration for pgsql RA
ms msPostgresql postgresql \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"

primitive postgresql ocf:heartbeat:pgsql \
    params pgctl="/usr/pgsql-9.1/bin/pg_ctl" \
           psql="/usr/pgsql-9.1/bin/psql" \
           pgctldata="/usr/pgsql-9.1/bin/pg_controldata" \
           pgdata="/var/lib/pgsql/9.1/data/" \
           start_opt="-p 5432" \
           rep_mode="sync" \
           node_list="devnode1 devnode2" \
           restore_command="cp /var/lib/pgsql/9.1/data/pg_archive/%f %p" \
           master_ip="192.168.122.103" \
           stop_on_demote="yes" \
    op start timeout="60s" interval="0s" on-fail="restart" \
    op monitor timeout="60s" interval="7s" on-fail="restart" \
    op monitor timeout="60s" interval="2s" on-fail="restart" role="Master" \
    op promote timeout="60s" interval="0s" on-fail="restart" \
    op demote timeout="60s" interval="0s" on-fail="block" \
    op stop timeout="60s" interval="0s" on-fail="block" \
    op notify timeout="60s" interval="0s"

Notes:
rep_mode: "sync" enables the SR support
master_ip: the virtual IP for the replication
stop_on_demote: "yes" is recommended. See the discussion later.
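As a usage sketch (the file name is an assumption), the configuration above can be saved to a file and loaded into the CIB with the crm shell:

# save the snippet above as pgsql.crm, then load it and verify
crm configure load update pgsql.crm
crm configure show msPostgresql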
10. Sample CRM configuration for VIPs
group master-group vip-master vip-rep
primitive vip-master ocf:heartbeat:IPaddr2
params ip="192.168.100.101" nic="eth0" cidr_netmask="24"
op start timeout="60s" interval="0s" on-fail="restart"
op monitor timeout="60s" interval="10s" on-fail="restart"
op stop timeout="60s" interval="0s" on-fail="block"
primitive vip-rep ocf:heartbeat:IPaddr2
params ip="192.168.122.103" nic="eth3" cidr_netmask="24"
(ditto)
primitive vip-slave ocf:heartbeat:IPaddr2
params ip="192.168.100.102" nic="eth0" cidr_netmask="24"
meta resource-stickiness="1"
(ditto)
colocation rsc_colocation-2 inf: master-group msPostgresql:Master
order rsc_order-2 0: msPostgresql:promote master-group:start symmetrical=false
order rsc_order-3 0: msPostgresql:demote master-group:stop symmetrical=false
location rsc_location-1 vip-slave
rule 200: pgsql-status eq "HS:sync"
rule 100: pgsql-status eq "PRI"
rule -inf: not_defined pgsql-status
rule -inf: pgsql-status ne "HS:sync" and pgsql-status ne "PRI"
11. Newly introduced parameters for the pgsql RA
Name                 |   | Description
rep_mode             | R | Replication mode: none (default) / async / sync.
node_list            | R | All node names, separated by spaces.
restore_command      | R | restore_command for recovery.conf.
master_ip            | R | The Master's floating IP address to which the hot standby connects. Used for "primary_conninfo" in recovery.conf.
repuser              |   | User used to connect to the master server. Used for "primary_conninfo" in recovery.conf. Default: postgres
stop_on_demote       |   | Whether to stop PostgreSQL instead of restarting it on demote, to speed up fail-over (yes / no (default)).
primary_conninfo_opt |   | primary_conninfo options for recovery.conf other than host, port, user and application_name.
tmpdir               |   | Path to the temporary directory. Default: /var/lib/pgsql
pgctldata            |   | Path to the pg_controldata command. Default: /usr/bin/pg_controldata
xlog_check_count     |   | Number of xlog checks in monitor before promote. Default: 3
crm_attr_timeout     |   | Timeout of the crm_attribute "forever" update command. Default: 5
R: Required when streaming replication is enabled.
12. Sample PostgreSQL configuration
postgresql.conf (excerpt)
Only the part related to streaming replication is shown.
See the PostgreSQL manual for details.
listen_addresses = '*'
wal_level = hot_standby
synchronous_commit = on
archive_mode = on
archive_command = '/bin/cp %p /var/lib/pgsql/9.1/data/pg_archive/%f'
max_wal_senders=5
wal_keep_segments = 32
hot_standby = on
include = '/var/lib/pgsql/rep_mode.conf'
restart_after_crash = off
replication_timeout = 5000 # milliseconds
wal_receiver_status_interval = 2 # seconds
rep_mode.conf:
the pgsql RA creates this file to control the PostgreSQL
replication mode (an illustrative example of its contents follows).
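For illustration only: the exact contents are generated and rewritten by the RA, but in sync mode rep_mode.conf is expected to carry the synchronous replication setting, roughly like this (the standby name shown is an assumption based on the sample node_list):

# rep_mode.conf - written by the pgsql RA (illustrative contents)
synchronous_standby_names = 'devnode2'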
14. Status attributes
pgsql-status: running status of PostgreSQL on each node
Value Description
STOP Not running.
HS:alone Running as Hot-Standby, not connected to Primary.
HS:connected Running as Hot-Standby, connected to Primary, transient state.
HS:async Running as Hot-Standby in Asynchronous replication mode.
HS:potential Running as Hot-Standby (only appears when 3 or more nodes).
HS:sync Running as Hot-Standby in Synchronous replication mode.
PRI Running as Primary.
pgsql-data-status: data status on each node
Value Description
DISCONNECTED The data is out of date. Must not become Primary.
STREAMING|ASYNC The data is being replicated from the Primary but may not be up to date. Must not become Primary.
STREAMING|SYNC The data is being replicated from the Primary and is up to date. Ready to become Primary.
LATEST The data is up to date and the node is now Primary.
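The same attributes can also be queried directly per node; a small sketch (the node name comes from the sample configuration, and option spellings can differ slightly between Pacemaker versions):

# query the data status recorded for node devnode1
crm_attribute -l forever -N devnode1 -n pgsql-data-status -G -q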
15. Best Practice of the Operation Procedure
General Recommendations
Invoke the cluster nodes one by one: PRI first, HS second.
Always copy the database from the PRI node before starting the
HS node to make sure the data is consistent.
Initial Invocation
(0) Initialize the database on #1
or determine which node has the latest data manually
(1) Invoke Pacemaker and pgsql resource on #1(PRI)
(2) Copy the database from #1(PRI) to #2
pg_basebackup -h $MASTER_IP -U postgres -D $PGDATA --xlog
or any other backup/restore method should work (e.g. rsync)
(3) Invoke Pacemaker on #2(HS)
(4) Make sure the replication is working successfully on #2(HS)
wait until "pgsql-status : HS:sync" appears on #2
(a command-level sketch of steps (0)-(4) follows)
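A command-level sketch of the procedure above, assuming a Heartbeat-based stack (as in the tested versions) and the paths and replication VIP from the sample configuration; the init command and paths are assumptions for illustration:

# (0) on node #1: initialize the database (first-time setup only)
su - postgres -c '/usr/pgsql-9.1/bin/initdb -D /var/lib/pgsql/9.1/data'

# (1) on node #1: start Heartbeat/Pacemaker; the node is promoted to PRI
service heartbeat start

# (2) on node #2: copy the database from the PRI node
#     (192.168.122.103 is the replication VIP from the sample configuration)
su - postgres -c 'pg_basebackup -h 192.168.122.103 -U postgres -D /var/lib/pgsql/9.1/data --xlog'

# (3) on node #2: start Heartbeat/Pacemaker; the node joins as HS
service heartbeat start

# (4) on either node: wait until node #2 shows "pgsql-status : HS:sync"
crm_mon -A -1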
16. Best Practice of the Operation Procedure
Recovery from a failure
(0) fail-over or switch-over occurred; #2 is now PRI
(1) Stop Pacemaker on the failed node #1
(2) Repair whatever is broken, if needed
(3) Copy the database from PRI(#2) to #1
(4) Clear the "lock" file created by the RA
rm /var/lib/pgsql/PGSQL.lock
this lets the RA know that we have verified the data is consistent.
(5) Invoke Pacemaker on #1(HS)
(6) Make sure the replication is working successfully on #1(HS)
wait until "pgsql-status : HS:sync" appears on #1
(a command-level sketch of steps (1)-(6) follows)
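A command-level sketch of the recovery procedure, under the same assumptions as the previous slide (Heartbeat-based stack, paths and replication VIP taken from the sample configuration):

# (1) on node #1: stop Heartbeat/Pacemaker on the failed node
service heartbeat stop

# (3) on node #1: copy the database back from the current PRI (node #2);
#     pg_basebackup needs an empty data directory, so move the old one aside first
su - postgres -c 'pg_basebackup -h 192.168.122.103 -U postgres -D /var/lib/pgsql/9.1/data --xlog'

# (4) on node #1: clear the lock file left by the RA
rm /var/lib/pgsql/PGSQL.lock

# (5) on node #1: start Heartbeat/Pacemaker again; the node rejoins as HS
service heartbeat start

# (6) on either node: wait until node #1 shows "pgsql-status : HS:sync"
crm_mon -A -1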
17. Implementation Challenges
State transition models are different between
Pacemaker and PostgreSQL
PostgreSQL cannot "demote"
cannot transition from the Primary state to the Hot-Standby state
the Primary state is only allowed to transition to the Stop state
Difference of Concepts
Pacemaker: "Master" is an additional state
PostgreSQL: "Slave (Hot-Standby)" is an additional state
Current Solution
the "stop_on_demote" parameter
19. “stop_on_demote” parameter
yes: obey the PostgreSQL transition model
(recommended)
Always stops PostgreSQL on demote.
Simplifies the operation.
A monitor may fail if it runs between a demote and a stop
operation.
no: obey the Pacemaker transition model
Stops PostgreSQL once and invokes it again on demote.
Takes a very long time to complete the fail-over.
Requires the archive files to be kept up to date via scp or shared
storage (even though the goal is to build a non-shared storage cluster!).
Rather complicated configuration and operation.
20. Proposal for future enhancements of Pacemaker
Add new state transition paths by either:
(1) extending the return-code semantics of operations
when "start" returns OCF_RUNNING_MASTER:
Stopped → Master state
when "demote" returns OCF_NOT_RUNNING:
Master → Stopped state
RAs decide the next transition state.
(2) changing the operation semantics
"promote" may be invoked in the Stopped state
Supposed to move to the Stopped state after "demote" has succeeded.
Pacemaker decides the semantics via a configuration parameter.
This method does NOT work for the pgsql RA:
it cannot decide which node should be promoted before it is started.
21. Other Issues
(1) The crm_attribute command rarely hangs
When invoked at the moment the DC node is absent due to a node
failure.
Workaround: a wrapper function that makes the call time out
(see the sketch at the end of this slide).
Details will be filed to the bugzilla soon.
(2) Cannot obtain the Master uname in monitor
OCF_RESKEY_CRM_meta_notify_master_uname is not set in
monitor.
The pgsql RA needs to know the uname of the Master during the
monitor operation to manage the current status of PostgreSQL.
Workaround: parse the crm_mon output.
Filed to bugzilla LF#2607
(The Linux Foundation site is still down though...)
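A minimal sketch of such a timeout wrapper (not the RA's actual code); it assumes coreutils timeout(1) is available, uses 5 seconds to match the crm_attr_timeout default, and the node/attribute names in the usage line are taken from the examples in this deck (crm_attribute option spellings can differ slightly between Pacemaker versions):

# run a command but give up after 5 seconds instead of hanging
exec_with_timeout() {
    timeout 5 "$@"
    rc=$?
    if [ $rc -ne 0 ]; then
        echo "command failed or timed out (rc=$rc): $*" >&2
    fi
    return $rc
}

# example: update the pgsql-data-status attribute without risking a hang
exec_with_timeout crm_attribute -l forever -N devnode1 -n pgsql-data-status -v LATEST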
22. Conclusion
TODO
merging into the upstream resource-agents package
code clean-up and refactoring
Development code
https://github.com/t-matsuo/resource-agents/blob/pgsql91/heartbeat/pgsql
Tested versions:
postgresql-9.1.1 (or later)
pacemaker-1.0.11 and heartbeat-3.0.5
(should be independent of the cluster stack / versions)
The key developer: Takatoshi MATSUO
Documents / Sample configuration
https://github.com/t-matsuo/resource-agents/wiki/Resource-Agent-for-PostgreSQL-9.1-streaming-replication
Discussions
pacemaker or linux-ha-dev Mailing Lists
Any comments and improvements are welcome!