1. Plaza Semanggi 9 Fl, Unit 9
Jl. Jend Sudirman Kav 50, Jakarta - 12930 Indonesia
8 Eu Tong Street, #14-94, THE CENTRAL Singapore
+6221-22866662 | info@equnix.asia
Mission Critical Production
High Availability PostgreSQL: 10 secs Failover!
PGConf.ASIA 2019 Bali
3. Table of Contents
1. Linux-HA Concepts
2. Failover and Recovery Mechanism
3. Combining Replication & Linux-HA on HA Implementation
4. High Availability
What is High Availability (HA)?
➢ HA is a “concept”
➢ The percentage of time that a given system has been providing service since it was
deployed to production
➢ For example: a system is 99% available if its downtime is about 3.65 days in a year
➢ Everyone craves the five 9s (downtime of about 5 minutes in a year –
99.999%)
➢ HA is NOT designed for high performance
➢ HA is NOT designed for high throughput (a.k.a. load balancing)
➢ HA operates at the OS level
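The availability percentages above translate directly into yearly downtime budgets; a quick way to compute them (a one-liner sketch using awk, not part of the original deck):

```shell
# Downtime budget per year for a given availability target
availability=99.999
awk -v a="$availability" 'BEGIN { printf "%.2f minutes/year\n", (100 - a) / 100 * 365 * 24 * 60 }'
```

99% yields about 5256 minutes (~3.65 days) per year; 99.999% yields about 5.26 minutes.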
5. High Availability
Why do we bother with HA?
❖ Downtime is VERY EXPENSIVE!
❖ It costs you your good name … and reputation!
❖ Users might not return!
❖ DBAs and SYSADMINs are also human
#SaveDBAandSYSADMIN
6. HA - PostgreSQL
❖ HA is not built into PostgreSQL core
❖ But PostgreSQL supports an HA mechanism (thanks to promote)
❖ Achieving HA in PostgreSQL requires a heartbeat tool:
➢ Linux-HA
➢ Pacemaker (from Linux-HA)
➢ Scripts (yes, only shell scripts)
❖ Requires a Floating IP
7. HA - Floating IP
Floating IP (a.k.a. VIP)
❖ Used to support failover in a high-availability cluster
❖ Used by the application to access the database server
❖ Always points to the Master (only 1 Floating IP needed)
8. HA - Failover
Failover (not swing-over)
❖ A replica is promoted to Master when the real Master goes down (touch trigger_file)
Only 10 seconds to failover!
9. HA - Failover
Post Failover
❖ The old Master becomes a slave and follows the new Master (the former slave)
10. HA - Cycle Mode
Cycle Mode (3 or More Replicas)
Same Sites
11. HA - Cycle Mode
Master Down, Replica 1 takes over and becomes Master
Same Sites
12. HA - Disaster Recovery
Disaster Recovery Configuration
13. HA - Disaster Recovery
Production Site is DOWN! Failover (don’t panic)
15. HA - Hands on
Set up PostgreSQL Streaming Replication (SYNC) FIRST
16. HA - Hands on
Open Port 694
# iptables -A INPUT -p udp --dport 694 -j ACCEPT
Rename Hostname
# vi /etc/hostname (change hostname)
Reboot server
17. HA - Hands on
Register Hostname
# vi /etc/hosts (both servers)
192.168.8.20 node1
192.168.8.21 node2
Floating IP
192.168.8.22 (Reminder)
Check Server Connections (both servers)
# ping node1
# ping node2
18. HA - Hands on
Install Heartbeat
# apt-get install heartbeat (both servers)
1. Configure Heartbeat (ha.cf)
# vi /etc/ha.d/ha.cf
logfile /var/log/ha.log
keepalive 2
deadtime 15 # no response for 15 seconds = dead
initdead 120
bcast ethername # broadcast interface, e.g. eth0 or bond0
udpport 694
auto_failback off
node node1 # node names must match the output of `uname -n`
node node2
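The `node` entries must match each server's real node name; a quick sanity check on each server (an assumed convenience, not from the deck):

```shell
# Verify this host appears as a "node" entry in ha.cf
grep -q "^node $(uname -n)$" /etc/ha.d/ha.cf \
    && echo "node name OK" \
    || echo "node name MISSING in ha.cf"
```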
19. HA - Hands on
2. File haresources
# vi /etc/ha.d/haresources
node1 192.168.8.22 activate_standby.sh
Note:
- node1 is the master node
- 192.168.8.22 is the Floating IP
- activate_standby.sh is the promote script, located in “/etc/init.d/”
3. File authkeys
# vi /etc/ha.d/authkeys (chmod 600 on both servers)
auth 2
2 sha1 hakeys (hakeys is the shared secret)
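`hakeys` above is just a placeholder; a random shared secret is safer in production. One way to generate it (a sketch, assuming root and standard coreutils; not part of the original deck):

```shell
# Generate a random secret and write authkeys in Heartbeat's format
key=$(head -c 512 /dev/urandom | sha1sum | awk '{print $1}')
printf 'auth 2\n2 sha1 %s\n' "$key" > /etc/ha.d/authkeys
chmod 600 /etc/ha.d/authkeys
```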
20. HA - Hands on
Send HA Configuration to the Standby Server
❖ ha.cf
❖ haresources
❖ authkeys
command:
# scp /etc/ha.d/ha.cf root@node2:/etc/ha.d/
# scp /etc/ha.d/haresources root@node2:/etc/ha.d/
# scp /etc/ha.d/authkeys root@node2:/etc/ha.d/
Note:
Check ha.cf on the standby server and change the bcast interface (if it differs)
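The three scp commands above can be collapsed into a single loop (same files, same destination; just a convenience):

```shell
# Copy all Heartbeat config files to the standby in one pass
for f in ha.cf haresources authkeys; do
    scp "/etc/ha.d/$f" root@node2:/etc/ha.d/
done
```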
21. HA - Hands on
Create PostgreSQL Trigger (activate_standby.sh)
❖ Location: “/etc/init.d/”
❖ Needs the PostgreSQL startup script “/etc/init.d/postgres.service”
❖ Executed on the master when the master heartbeat comes up (hence the commented-out lines at first)
❖ Executed on the slave when the master server fails (network down)
❖ Deploy on both servers
❖ Only 1 activate_standby.sh is needed if master and slave share the same environment
❖ Must be executable (chmod 755)
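Putting the bullets above together, deploying the script to both servers might look like this (a sketch using the deck's paths and hostnames):

```shell
# Make the trigger script executable locally, then copy it to the standby
chmod 755 /etc/init.d/activate_standby.sh
scp /etc/init.d/activate_standby.sh root@node2:/etc/init.d/
ssh root@node2 'chmod 755 /etc/init.d/activate_standby.sh'
```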
22. HA - Hands on
Create PostgreSQL Trigger Script (activate_standby.sh)
# vi /etc/init.d/activate_standby.sh
#!/bin/bash
case "$1" in
start)
#touch /equnix/data/trigger_file
#sed -i "s/synchronous_standby_names/#synchronous_standby_names/g" /equnix/data/postgresql.conf
/etc/init.d/postgres.service reload
exit 0
;;
stop)
#sed -i "s/#synchronous_standby_names/synchronous_standby_names/g" /equnix/data/postgresql.conf
/etc/init.d/postgres.service reload
;;
*)
exit 0
;;
esac
23. HA - Hands on
Start Heartbeat Service on Both Servers
# /etc/init.d/heartbeat start
Starting High-Availability services: IPaddr[13659]: INFO: Running OK
ResourceManager[13635]: CRITICAL: Resource 192.168.8.22 is active, and should not be!
ResourceManager[13635]: CRITICAL: Non-idle resources will affect resource takeback!
ResourceManager[13635]: CRITICAL: Non-idle resources may affect data integrity!
Done.
Check Heartbeat Log on Both Servers
# less /var/log/ha.log
Mar 27 18:32:07 node1 heartbeat: [13715]: info: Local status now set to: 'up'
Mar 27 18:32:07 node1 heartbeat: [13715]: info: Link node1:eth2 up.
Note: un-comment the touch/sed lines in activate_standby.sh -> they will be executed on the next failover
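A quick way to confirm both nodes came up cleanly is to filter the log for the status and link lines shown above (an assumed convenience, not from the deck):

```shell
# Show the most recent node-status and link events from the heartbeat log
grep -E "Local status now set to|Link .* up" /var/log/ha.log | tail -n 5
```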
27. HA - Recovery
What Happens When the Master Recovers?
❖ The master does not fail back (auto_failback off)
❖ The old master becomes the new slave (the slave has already become master)
❖ The old master follows the new master (using rsync)
❖ The Floating IP stays down on it (taken over by the slave via haresources)
28. HA - Recovery
Master Follow (resynchronizedb)
1. Stop the PostgreSQL service on the Master (now the new slave)
2. Log in to the PostgreSQL database on the Slave (the new master!)
3. Run pg_start_backup on the new master
postgres=# select pg_start_backup('new_master');
4. RSYNC the “data” directory from the new master to the new slave
# rsync --exclude 'backup_label' --exclude 'postmaster.pid' -argv
/equnix/data postgres@node1:/equnix/data
5. Run pg_stop_backup on the new master
postgres=# select pg_stop_backup();
29. HA - Recovery
Master Follow (DBFOLLOW)
6. Log in to the Master server (the new slave)
7. In the new standby’s “data” directory, create recovery.conf or rename recovery.done to
recovery.conf:
standby_mode='on'
primary_conninfo='host=192.168.8.21 port=5432 user=pgsql application_name=replica1'
trigger_file='/equnix/data/trigger_file'
8. Start the PostgreSQL service on the new standby
9. Don’t forget to check the replication status
30. Configure public and private interface
$ vi /etc/hostname => node1 or node2
$ vi /etc/hosts
10.0.0.1 node1
10.0.0.2 node2
As pgsql user on node 1 and node 2:
$ ssh-keygen -t rsa
$ scp .ssh/id_rsa.pub [peernode]:~/.ssh/authorized_keys
The pgsql user must be in the sudoers list:
pgsql ALL=(root) NOPASSWD:/sbin/reboot,/sbin/pcs *,/sbin/ip *,/bin/kill *
Configure the network on both nodes
31. Install Pacemaker, Corosync, and PCS
$ yum install pacemaker corosync pcs # RHEL/CentOS
$ apt-get install pacemaker corosync pcs pacemaker-cli-utils # Debian/Ubuntu
Open Firewall on both Nodes (if any)
$ iptables -I INPUT -m state --state NEW -p udp -m multiport --dports 5404,5405 -j ACCEPT
$ iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 2224 -j ACCEPT
Start PCS Daemon
$ systemctl start pcsd
Authorize hacluster user on Pacemaker
$ pcs cluster auth [node 1] [node 2]
Create Cluster
$ pcs cluster setup --name nodegroup node1 node2