krzysztof@severalnines.com
Copyright 2018 Severalnines AB
Presenter
Krzysztof Książek, Senior Support Engineer @Severalnines
How to Manage Replication Failover
Processes for MySQL, MariaDB &
PostgreSQL
December 11th, 2018
Your host & some logistics
I'm JJ from the Severalnines Team and I'm your host for today's webinar!
Feel free to ask any questions in the Questions section of this application or via the Chat box.
You can also contact me directly via the chat box or via email: jj@severalnines.com during or after the webinar.
About Severalnines & ClusterControl
About ClusterControl
# Free to Download
# Initial 30 Days Enterprise
Trial
# Reverts to Free
Community Edition
# Enterprise / Paid Versions
Available
ClusterControl Automation & Management
Deployment (Free Community)
# Deploy a Cluster in Minutes
○ On-Prem
○ Cloud (AWS/Azure/Google) - paid

Monitoring (Free Community)
# Systems View with 1 sec Resolution
# Agentless via SSH, or agent-based with Prometheus
# DB / OS stats & Performance Advisors
# Configurable Dashboards
# Query Analyzer
# Real-time / historical
Management (Paid Features)
# Backup Management
# Upgrades & Patching
# Security & Compliance
# Operational Reports
# Automatic Recovery & Repair
# Performance Management
# Automatic Performance Advisors
Supported Databases
Our Customers
Agenda
•An introduction to failover - what, when, how
○ in MySQL / MariaDB
○ in PostgreSQL
•To automate or not to automate
•Understanding the failover process
•Orchestrating failover across the whole HA stack
•Difficult problems
○ Network partitioning
○ Missed heartbeats
○ Split brain
•From assisted to fully automated failover with ClusterControl
○ Demo
An introduction to failover - what, when, how
An introduction to replication failover - what, when, how
•A switchover is the process of moving the master role to another server by promoting a slave while the old master is still available
•A failover is the same role change, performed when the old master is not available or its availability is limited
○ This is the worse scenario, as you cannot assume all the slaves are in sync
•Today we will focus on the failover process
An introduction to replication failover - what, when, how
•A failover is performed when the old master becomes unavailable. In both MySQL and PostgreSQL replication, writes have to be sent to the master, so its crash affects the whole cluster and makes it unavailable for writes
•Importantly, you should verify master connectivity from the point of view of the slaves
○ It may happen that the monitoring node cannot reach the master while the slaves are happily replicating from it
○ Failover should be triggered only if the master is reachable neither by the application nor by the slaves
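The rule above (trigger a failover only when nobody can reach the master) can be sketched as a small decision function. This is a minimal illustration, not the logic of any particular tool: the function name and the observer names are hypothetical, and how each observer actually tests connectivity is left out.

```python
from typing import Dict


def should_trigger_failover(master_reachable_from: Dict[str, bool]) -> bool:
    """Return True only if no observer can reach the master.

    master_reachable_from maps an observer name (monitoring node, slave,
    proxy, application host) to whether that observer can reach the master.
    """
    if not master_reachable_from:
        # No observations at all: do not act on missing information.
        return False
    return not any(master_reachable_from.values())


# The monitoring node lost the master, but the slaves still replicate from it.
print(should_trigger_failover({"monitor": False, "slave1": True, "slave2": True}))
# False

# Nobody can reach the master: a failover is justified.
print(should_trigger_failover({"monitor": False, "slave1": False, "slave2": False}))
# True
```

Requiring every observer to agree is the conservative choice: a single host that still sees the master is enough to block the failover.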
Failover in MySQL
•After a master crash you end up with one or more slaves
•Verify that the master is indeed not reachable
•Decide which slave is the most up to date and pick it as the master candidate
•Ensure there are no errant transactions on the master candidate
•Collect missing data from the old master (if possible) and replay it on the master candidate
•Reslave all remaining slaves off the new master
•Ensure, to the best of your abilities, that the old master will not be started again before it can be investigated
•Rebuild the old master as a slave using the data from the new master
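As an illustration of the "most up to date slave" step, here is a minimal sketch that compares the binary log coordinates each slave has fetched from the old master. It assumes every slave replicated directly from the same master, that binary log file names end in a numeric suffix, and that the Master_Log_File and Read_Master_Log_Pos values from SHOW SLAVE STATUS have already been collected into dictionaries. The host names and function names are hypothetical.

```python
from typing import Dict, Tuple


def binlog_position(status: Dict[str, object]) -> Tuple[int, int]:
    """Turn replication coordinates into a sortable (file number, offset) pair.

    Assumes binary log names end in a numeric suffix, e.g. 'binlog.000042'.
    """
    log_file = str(status["Master_Log_File"])
    file_number = int(log_file.rsplit(".", 1)[1])
    return file_number, int(status["Read_Master_Log_Pos"])


def most_up_to_date(slaves: Dict[str, Dict[str, object]]) -> str:
    """Return the slave that has fetched the most data from the old master."""
    return max(slaves, key=lambda name: binlog_position(slaves[name]))


# Hypothetical values, as they might be read from SHOW SLAVE STATUS.
slaves = {
    "slave1": {"Master_Log_File": "binlog.000042", "Read_Master_Log_Pos": 900},
    "slave2": {"Master_Log_File": "binlog.000043", "Read_Master_Log_Pos": 120},
    "slave3": {"Master_Log_File": "binlog.000042", "Read_Master_Log_Pos": 5000},
}
print(most_up_to_date(slaves))  # slave2
```

The file number is compared before the offset, so a slave that has already moved on to a newer binary log wins even if its offset within that file is small.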
Failover in PostgreSQL
•After an active server crash you end up with one or more standby servers
•Verify that the active server is indeed not reachable
•Find the most advanced standby server
•Trigger the failover using either pg_ctl promote or the trigger_file
•Use pg_rewind on the remaining standby servers, where needed, to bring them in sync with the new master
•Reslave the remaining standby servers to the new master
•Ensure, to the best of your abilities, that the old master will not be started again before it can be investigated
•Rebuild the old master as a slave using the data from the new master
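Finding the most advanced standby comes down to comparing WAL positions (LSNs). The sketch below assumes the last received LSN of each standby has already been collected, for example from pg_last_wal_receive_lsn() on PostgreSQL 10 or later. The standby names and function names are hypothetical.

```python
from typing import Dict


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an integer.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced_standby(received_lsn: Dict[str, str]) -> str:
    """Return the standby whose received WAL position is the highest."""
    return max(received_lsn, key=lambda name: parse_lsn(received_lsn[name]))


# Hypothetical LSNs collected from three standby servers.
standbys = {
    "standby1": "16/B374D848",
    "standby2": "16/B3750000",
    "standby3": "15/FFFFFFFF",
}
print(most_advanced_standby(standbys))  # standby2
```

Comparing the LSNs as plain strings would be wrong, since the two halves are hexadecimal and of variable length, which is why they are converted to integers first.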
To automate or not to automate?
To automate or not to automate?
•As the last two slides show, a failover requires a couple of steps
○ As usual, the more steps there are and the more complex they are, the higher the chance of human error
•Scripts can easily perform all the required tasks, run all the checks, and do it much faster and more reliably than a human can
•Scripts are only as smart as we wrote them, though. Humans tend to be more flexible and can handle unpredictable situations better
•Should we automate the failover or not? That’s the question!
•Let’s go through some pros and cons of automated failover
To automate or not to automate?
•Pros
○ Much faster reaction to the issue
○ Higher reliability in typical situations
○ When configured correctly, may handle the majority of cases properly
○ Reduced on-call burnout - even though you still page your staff, it’s not as critical given that the systems are up and running
•Cons
○ Limited situational awareness - does not understand the bigger picture (only what has been coded in)
○ Decisions made are not always correct
○ Requires intensive testing to ensure reliability
○ Has to be maintained (if it is your own script)
To automate or not to automate?
•The main differentiating factors are the reaction time and the lack of situational awareness
•Automated failover will be faster but may take actions a user would not take
•The logic can be improved, though, and safety features like whitelists and blacklists can be used in an attempt to reduce incorrect behaviour
•Better visibility can also be implemented:
○ Access tests through multiple hosts (slaves, proxies)
○ Utilising a clustering protocol like Raft or Paxos for network split detection
•Don’t expect automated failover to handle 100% of the cases correctly, though
•A third way may also be applicable - assisted failover
○ Does everything automatically but is initiated by the user, after the initial assessment
Understanding the failover process
Understanding the failover process
Ensure that the master is indeed down
•Ensuring that the master is indeed down is critical
•You never want to run two writable masters at the same time!
•You may want to implement some sort of STONITH (Shoot The Other Node In The Head) to ensure the dead master stays dead
•You can leverage data from multiple sources. Are slaves replicating? Do proxies see the master?
Understanding the failover process
Pick the correct slave as the master candidate
•Picking the correct slave as the master candidate is critical
•You want to use the most advanced slave to avoid data loss
•You want to ensure there are no errant transactions (in a GTID setup)
•You want to allow the slave to apply the events from its relay logs (as long as it does not take too long)
•You want to try to reach the master to see if there are non-replicated binary log events
○ A master failure does not always mean you cannot SSH to the host and parse the binlogs for missing transactions
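An errant transaction is one that was executed on the candidate but never on the rest of the topology. In a MySQL-style GTID setup this is a set difference between executed GTID sets. The sketch below is an illustration only: it expands intervals in memory, so it suits small sets, and on a live server the built-in GTID_SUBTRACT() function does the same job. The short source identifiers are placeholders for real server UUIDs, and the function names are hypothetical.

```python
from typing import Dict, List, Set


def parse_gtid_set(text: str) -> Dict[str, Set[int]]:
    """Parse a MySQL-style GTID set such as 'uuid1:1-5:7,uuid2:1-3'."""
    result: Dict[str, Set[int]] = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        source, *intervals = part.split(":")
        ids = result.setdefault(source, set())
        for interval in intervals:
            first, _, last = interval.partition("-")
            # A single number such as '7' is an interval of length one.
            ids.update(range(int(first), int(last or first) + 1))
    return result


def errant_transactions(candidate: str, reference: str) -> Dict[str, List[int]]:
    """Return transactions present on the candidate but missing from the reference."""
    ref = parse_gtid_set(reference)
    errant: Dict[str, List[int]] = {}
    for source, ids in parse_gtid_set(candidate).items():
        extra = sorted(ids - ref.get(source, set()))
        if extra:
            errant[source] = extra
    return errant


# The candidate executed two transactions of its own ('bbb') that nobody else has.
print(errant_transactions("aaa:1-10,bbb:1-2", "aaa:1-12"))
# {'bbb': [1, 2]}
```

An empty result means the candidate holds nothing the reference lacks, which is the condition to check before promoting it.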
Understanding the failover process
Correct usage of whitelists and blacklists
•Correct usage of whitelists and blacklists is critical
•You may not want every slave you have to be eligible for promotion
•Better to stay within the same datacenter to avoid a split brain scenario with two masters
•Better to stay within the same datastore version for compatibility reasons
•Better to stay on the same hardware for performance reasons
•While executing a failover, use the standard procedures for marking masters and slaves
○ read_only and super_read_only = 0 or 1?
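Whitelist and blacklist filtering can be sketched as follows. The semantics here are an assumption made for illustration (a blacklist always excludes a host, and a whitelist, when given, restricts the candidates to the listed hosts); real tools define their own rules and option names. The host names and the function name are hypothetical.

```python
from typing import Iterable, List, Optional


def eligible_candidates(
    slaves: Iterable[str],
    whitelist: Optional[Iterable[str]] = None,
    blacklist: Iterable[str] = (),
) -> List[str]:
    """Filter slaves down to those that may be promoted.

    Assumed semantics: blacklisted hosts are never eligible; if a whitelist
    is given, only hosts on it are eligible.
    """
    blocked = set(blacklist)
    allowed = set(whitelist) if whitelist is not None else None
    return [
        host
        for host in slaves
        if host not in blocked and (allowed is None or host in allowed)
    ]


slaves = ["dc1-db2", "dc1-db3", "dc2-db1", "dc1-backup"]

# Keep promotion inside the first datacenter and never promote the backup host.
print(eligible_candidates(
    slaves,
    whitelist=["dc1-db2", "dc1-db3", "dc1-backup"],
    blacklist=["dc1-backup"],
))
# ['dc1-db2', 'dc1-db3']
```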
Understanding the failover process
Pre- and post-failover actions
•An automated failover process can sometimes be augmented with pre- or post-failover actions
•Do you want to perform some action when the master has failed?
•Do you need to reconfigure some application when a new master is promoted?
•Do you want to remove the old master’s entry from your Consul key/value store?
•Most of the main tools that handle failover also support pre- and post-failover actions
○ MHA
○ Orchestrator
○ ClusterControl
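The idea of pre- and post-failover actions can be sketched as hooks wrapped around the promotion step. This is a generic illustration and not how MHA, Orchestrator or ClusterControl implement it; the function names, the hook signature and the host names are all hypothetical, and the hooks here only record what they would do.

```python
from typing import Callable, List

# A hook receives the old and the new master host names.
Hook = Callable[[str, str], None]


def run_failover(
    old_master: str,
    new_master: str,
    promote: Callable[[str], None],
    pre_hooks: List[Hook],
    post_hooks: List[Hook],
) -> None:
    # Pre-failover hooks run first; an exception raised here aborts the
    # failover before anything has been changed.
    for hook in pre_hooks:
        hook(old_master, new_master)
    promote(new_master)
    # Post-failover hooks run only after a successful promotion.
    for hook in post_hooks:
        hook(old_master, new_master)


events: List[str] = []
run_failover(
    "db1",
    "db2",
    promote=lambda host: events.append(f"promote {host}"),
    pre_hooks=[lambda old, new: events.append(f"fence {old}")],
    post_hooks=[lambda old, new: events.append(f"point application at {new}")],
)
print(events)
# ['fence db1', 'promote db2', 'point application at db2']
```

Letting a failing pre-failover hook abort the whole run is a design choice: it gives operators a place to veto a failover, for example when fencing the old master did not succeed.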
Orchestrating failover across the whole HA stack
Orchestrating failover across the whole HA stack
•Databases do not exist in a vacuum; they are surrounded by other services that together create a highly available environment
•Proxies need a way to distinguish between the master and a slave
○ In PostgreSQL streaming replication this is typically the existence of a recovery.conf file
○ In MySQL it can be, for example, the value of read_only and super_read_only: 1 or 0
•While a failover is happening, you have to make sure you manage the variable’s value correctly
○ You don’t want loadbalancers to send traffic to your databases while the failover is in progress
Orchestrating failover across the whole HA stack
•All loadbalancers deployed by ClusterControl follow those rules
○ recovery.conf file on PostgreSQL
○ read_only value on MySQL
•ClusterControl ensures that the values in MySQL are set according to the stage of the process
○ In a switchover, the master is demoted through read_only=1. In a failover this cannot be done
○ Still, read_only=1 is set in the MySQL configuration on all nodes to minimise the chance of the old master returning as a writable host
○ The new master is marked with read_only=0
•This process works but it does not cover all situations
Difficult problems
Difficult problems - network issues
•Networks can be unstable and packets may be lost in transfer
•Replication itself is robust and will work quite well even if there are network problems
•Health checks performed on the replication setup also have to take such conditions into consideration
•Make sure you do not take any action based on just a single health check
•Make sure you do not take any action based on just a single host’s point of view
•Expect network problems and try to understand their severity before any action is taken
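One simple way to avoid acting on a single failed health check is to require several consecutive failures before a node is treated as down. The sketch below illustrates that idea only; the class name and the threshold of three are arbitrary assumptions, not values taken from any tool.

```python
class HealthTracker:
    """Declare a node down only after several consecutive failed checks."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, check_succeeded: bool) -> bool:
        """Record one check result; return True once the node counts as down."""
        if check_succeeded:
            # Any successful check resets the streak: a lost packet or a
            # short network hiccup should not add up over time.
            self.failures = 0
        else:
            self.failures += 1
        return self.failures >= self.threshold


tracker = HealthTracker(threshold=3)
results = [tracker.record(ok) for ok in [False, False, True, False, False, False]]
print(results)
# [False, False, False, False, False, True]
```

In the example, the first two failures are forgiven because a successful check follows them; only the later run of three consecutive failures marks the node as down.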
Difficult problems - network split
•Every cluster type has its own problems. For MySQL and PostgreSQL replication, one of the biggest issues is the lack of cluster awareness and the lack of quorum support
•Replication clusters are prone to network split issues
•Automated topology detection by proxies can make things even more tricky
•There’s no easy, standard way to avoid this problem
Difficult problems - network split
•A network split happens when there’s a lack of connectivity between one part of the cluster and the other part
○ For example, the master cannot reach the slaves and the slaves cannot reach the master
•The master is unavailable, therefore the cluster cannot handle writes
○ A failover should be performed to restore the cluster’s ability to handle traffic
•The master is still running, though: when the networks converge, two writeable hosts will show up
•Standard topology detection logic will not be enough. Two nodes will have read_only=0, or two nodes will not have the recovery.conf file
○ Without additional measures to ensure the old master won’t get traffic, a split brain is imminent
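The situation described above, where two nodes report read_only=0 after the networks converge, can at least be detected. The sketch below assumes the read_only value of every MySQL node has already been collected into a dictionary. The node names, function names and returned messages are hypothetical, and detection alone does not prevent a split brain.

```python
from typing import Dict, List


def writable_nodes(read_only_by_node: Dict[str, bool]) -> List[str]:
    """Return the nodes that currently accept writes (read_only = 0)."""
    return sorted(
        node for node, read_only in read_only_by_node.items() if not read_only
    )


def check_topology(read_only_by_node: Dict[str, bool]) -> str:
    """Classify a replication topology by its number of writable nodes."""
    writers = writable_nodes(read_only_by_node)
    if len(writers) == 1:
        return f"ok: single master {writers[0]}"
    if not writers:
        return "no writable node: cluster cannot handle writes"
    return "split brain risk: multiple writable nodes " + ", ".join(writers)


# After a failover, the old master (db1) comes back with read_only=0.
print(check_topology({"db1": False, "db2": False, "db3": True}))
# split brain risk: multiple writable nodes db1, db2
```

A check like this is only a safety net; the old master still has to be fenced or removed from the proxies so that it cannot receive traffic.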
Difficult problems - split brain
•Split brain is a condition in which two writable nodes take traffic and, as a result, their data sets drift apart
•There’s no easy way to recover from such a condition
○ Shut down the rogue master as soon as possible to minimise the data drift
○ Manual action will be required to converge the data sets
•Make sure that whatever solution you choose, it works
○ You can do better than GitHub!
Difficult problems - how to avoid them?
•There are numerous ways in which you can reduce (but not eliminate) the impact and the probability of your data being affected by network issues
•Collect as much data as possible about the state of the replication topology before an action is taken
○ Utilize multiple nodes as points of view on the topology
•Try to implement STONITH to reduce the chance that the old master will show up
○ Some kind of lights-out solution (iLO, for example) might work in a physical environment
○ Kill scripts (destroying the given virtual instance) may work in the cloud
•Modify the configuration of the proxies to remove the old master after it’s deemed dead
•No solution will be 100% bulletproof
○ You may not be able to reach all the proxies, the node itself, or the cloud service to kill the master
Demo
End of Year Promotion
Get Three Months Free with an Annual Contract (25% in Savings)
Just Sign By December 20th!
Thank you!
•Blogs that cover failover:
○ https://severalnines.com/blog/introduction-failover-mysql-replication-101-blog
○ https://severalnines.com/blog/failover-postgresql-replication-101
○ https://severalnines.com/blog/how-control-replication-failover-mysql-and-mariadb
○ https://severalnines.com/blog/controlling-replication-failover-mysql-and-mariadb-pre-or-post-failover-scripts
•To automate or not to automate?
○ https://severalnines.com/blog/failover-mysql-replication-and-others-should-it-be-automated
•Contact: jj@severalnines.com

 

Webinar slides: How to Manage Replication Failover Processes for MySQL, MariaDB & PostgreSQL

  • 1. krzysztof@severalnines.com Copyright 2018 Severalnines AB Presenter Krzysztof Książek, Senior Support Engineer @Severalnines How to Manage Replication Failover Processes for MySQL, MariaDB & PostgreSQL December 11th, 2018
  • 2. Your host & some logistics
      I'm JJ from the Severalnines Team and I'm your host for today's webinar!
      Feel free to ask any questions in the Questions section of this application or via the Chat box.
      You can also contact me directly via the chat box or via email: jj@severalnines.com during or after the webinar.
  • 3. About Severalnines & ClusterControl
  • 5. About ClusterControl
      • Free to Download
      • Initial 30 Days Enterprise Trial
      • Reverts to Free Community Edition
      • Enterprise / Paid Versions Available
  • 6. ClusterControl Automation & Management
      • Deployment (Free Community)
          • Deploy a Cluster in Minutes
              ○ On-Prem
              ○ Cloud (AWS/Azure/Google) - paid
      • Monitoring (Free Community)
          • Systems View with 1 sec Resolution
          • Agentless via SSH, or agent-based with Prometheus
          • DB / OS stats & Performance Advisors
          • Configurable Dashboards
          • Query Analyzer
          • Real-time / historical
      • Management (Paid Features)
          • Backup Management
          • Upgrades & Patching
          • Security & Compliance
          • Operational Reports
          • Automatic Recovery & Repair
          • Performance Management
          • Automatic Performance Advisors
  • 7. Supported Databases
  • 8. Our Customers
  • 10. Agenda
      • An introduction to failover - what, when, how
          • in MySQL / MariaDB
          • in PostgreSQL
      • To automate or not to automate
      • Understanding the failover process
      • Orchestrating failover across the whole HA stack
      • Difficult problems
          • Network partitioning
          • Missed heartbeats
          • Split brain
      • From assisted to fully automated failover with ClusterControl
          • Demo
  • 11. An introduction to failover - what, when, how
  • 12. An introduction to replication failover - what, when, how
      • A switchover is the process of switching the master role to another server by promoting a slave
      • A failover is the process of switching the master role to another server by promoting a slave when the old master is not available or its availability is limited
          • This is the worse scenario, as you cannot assume that all the slaves are in sync
      • Today we will focus on the failover process
  • 13. An introduction to replication failover - what, when, how
      • A failover is performed when the old master becomes unavailable. In both MySQL and PostgreSQL replication, writes have to be sent to the master, so its crash affects the whole cluster and makes it unavailable
      • Importantly, you should verify the master's connectivity from the point of view of the slaves
          • It may happen that the monitoring node cannot reach the master while the slaves are happily replicating from it
          • Failover should be triggered only if the master is reachable neither by the application nor by the slaves
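The rule on this slide (trigger a failover only when nobody can reach the master) can be sketched as a small decision function. This is a minimal illustration, not taken from the webinar: the observer names and the `should_trigger_failover` helper are hypothetical, and the reachability results are assumed to have been collected beforehand.

```python
from typing import Dict


def should_trigger_failover(master_reachable_from: Dict[str, bool]) -> bool:
    """Return True only if no observer can reach the master.

    Keys are hypothetical observer names (monitoring node, application
    host, slaves); values say whether that observer reached the master.
    """
    if not master_reachable_from:
        # No evidence at all: do nothing rather than guess.
        return False
    return not any(master_reachable_from.values())


# The monitoring node lost the master, but the slaves still replicate from it.
partial_outage = {"monitor": False, "app1": False, "slave1": True, "slave2": True}
# Nobody can reach the master.
full_outage = {"monitor": False, "app1": False, "slave1": False, "slave2": False}

print(should_trigger_failover(partial_outage))  # False
print(should_trigger_failover(full_outage))     # True
```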
  • 14. Failover in MySQL
      • After a master crash you end up with one or more slaves
      • Verify that the master is indeed not reachable
      • Decide which slave is the most up to date and pick it as the master candidate
      • Ensure there are no errant transactions on the master candidate
      • Collect missing data from the master (if possible) and replay it on the master candidate
      • Reslave all remaining slaves off the new master
      • Ensure, to the best of your abilities, that the old master will not be started again before it can be investigated
      • Rebuild the old master as a slave using the data from the new master
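In a GTID setup, the errant transaction check can be thought of as a set subtraction: anything executed on the candidate that another slave has never seen is suspicious. The sketch below imitates what MySQL's `GTID_SUBTRACT()` function computes, in plain Python. The helper names and the shortened UUIDs (`aaaa`, `bbbb`) are hypothetical, and expanding intervals into sets is only practical for small illustrative ranges.

```python
from typing import Dict, Set


def parse_gtid_set(text: str) -> Dict[str, Set[int]]:
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: {transaction ids}}."""
    result: Dict[str, Set[int]] = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ids = result.setdefault(uuid, set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A single id such as '7' has no end; treat it as the range 7-7.
            ids.update(range(int(start), int(end or start) + 1))
    return result


def errant_transactions(candidate: str, reference: str) -> Dict[str, Set[int]]:
    """GTIDs executed on the candidate but missing from the reference set."""
    cand, ref = parse_gtid_set(candidate), parse_gtid_set(reference)
    errant: Dict[str, Set[int]] = {}
    for uuid, ids in cand.items():
        extra = ids - ref.get(uuid, set())
        if extra:
            errant[uuid] = extra
    return errant


# 'aaaa' stands in for the old master's server UUID, 'bbbb' for the candidate's own.
candidate_executed = "aaaa:1-100,bbbb:1-2"
other_slave_executed = "aaaa:1-100"

print(errant_transactions(candidate_executed, other_slave_executed))  # {'bbbb': {1, 2}}
```

An empty result means the candidate carries nothing the reference set lacks. A non-empty one is a reason to pick another candidate or to investigate first.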
  • 15. Failover in PostgreSQL
      • After an active server crash you end up with one or more standby servers
      • Verify that the active server is indeed not reachable
      • Find the most advanced standby server
      • Trigger the failover using either pg_ctl promote or the trigger_file
      • Run pg_rewind on the remaining standby servers to bring them in sync with the new master
      • Reslave the remaining standby servers to the new master
      • Ensure, to the best of your abilities, that the old master will not be started again before it can be investigated
      • Rebuild the old master as a slave using the data from the new master
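Finding the most advanced standby comes down to comparing WAL positions (LSNs). An LSN is printed as two hexadecimal numbers separated by a slash, so it has to be converted before comparison, since plain string comparison gives wrong answers. The sketch below assumes the LSNs were already collected from each standby (for example with `pg_last_wal_receive_lsn()` on PostgreSQL 10 or later). The host names and helper functions are hypothetical.

```python
from typing import Dict


def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to an integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced_standby(received_lsn: Dict[str, str]) -> str:
    """Return the host whose received LSN is the highest."""
    return max(received_lsn, key=lambda host: lsn_to_int(received_lsn[host]))


standbys = {"pg2": "0/3000060", "pg3": "0/30000D8", "pg4": "0/2FFFFF0"}
print(most_advanced_standby(standbys))  # pg3
```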
  • 16. To automate or not to automate?
  • 17. To automate or not to automate?
      • As shown in the last two slides, a failover requires a couple of steps to be performed
          • As usual, the more steps there are and the more complex they are, the higher the chance of human error
      • Scripts can easily perform all the required tasks, run all the checks, and do it much faster and more reliably than a human can
      • Scripts are only as smart as we wrote them, though. Humans tend to be more flexible and can handle unpredictable situations better
      • Should we automate the failover or not? That’s the question!
      • Let’s go through some pros and cons of automated failover
  • 18. To automate or not to automate?
      • Pros
          • Much faster reaction to the issue
          • Higher reliability in typical situations
          • When configured correctly, may handle the majority of cases properly
          • Reduces on-call burnout: even though you still page your staff, it’s not as critical given that the systems are up and running
      • Cons
          • Limited situational awareness: it does not understand the larger picture (it only understands what has been coded in)
          • Decisions made are not always correct
          • Requires intensive testing to ensure reliability
          • Has to be maintained (if it is your own script)
  • 19. To automate or not to automate?
      • The main differentiating factors are the reaction time and the lack of situational awareness
      • Automated failover will be faster but may take actions the user would not take
      • But the logic can be improved, and safety features like whitelists and blacklists can be used in an attempt to reduce incorrect behaviour
      • Better visibility can also be implemented:
          • Access tests through multiple hosts (slaves, proxies)
          • Utilising a clustering protocol like Raft or Paxos for network split detection
      • Don’t expect automated failover to correctly cover 100% of the cases, though
      • A third way may also be applicable: assisted failover
          • Does everything automatically but is initiated by the user, after the initial assessment
  • 20. Understanding the failover process
  • 21. Understanding the failover process - ensure that the master is indeed down
      • Ensuring that the master is indeed down is critical
      • You never want to run two writable masters at the same time!
      • You may want to implement some sort of STONITH (Shoot The Other Node In The Head) to ensure that a dead master stays dead
      • You can leverage data from multiple sources. Are the slaves replicating? Do the proxies see the master?
  • 22. Understanding the failover process - pick the correct slave as the master candidate
      • Picking the correct slave as the master candidate is critical
      • You want to use the most advanced slave to avoid data loss
      • You want to ensure there are no errant transactions (in a GTID setup)
      • You want to allow the slave to apply the events from its relay logs (as long as it does not take too long)
      • You want to try to reach the master to see if there are non-replicated binary log events
          • A master failure does not always mean you cannot SSH there and parse the binlogs for missing transactions
  • 23. Understanding the failover process - correct usage of whitelists and blacklists
      • Correct usage of whitelists and blacklists is critical
      • You may not want to promote just any slave that you have
      • Better to stay within the same datacenter to avoid a split brain scenario with two masters
      • Better to stay within the same datastore version for compatibility reasons
      • Better to stay on the same hardware for performance reasons
      • While executing a failover, use the standard procedures for marking masters and slaves
          • read_only and super_read_only = 0 or 1?
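One way to picture how whitelists and blacklists narrow down the choice is a filter that runs before the "most advanced slave" comparison. The sketch below is an illustration under stated assumptions and does not describe the logic of any particular tool: the `Slave` record, the `applied_position` metric, and the `pick_candidate` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Slave:
    host: str
    datacenter: str
    version: str
    applied_position: int  # hypothetical progress metric; higher is more advanced


def pick_candidate(
    slaves: Iterable[Slave],
    master_datacenter: str,
    master_version: str,
    whitelist: Optional[FrozenSet[str]] = None,
    blacklist: FrozenSet[str] = frozenset(),
) -> Optional[Slave]:
    """Return the most advanced slave that passes all the filters, or None."""
    eligible = [
        s for s in slaves
        if s.host not in blacklist
        and (whitelist is None or s.host in whitelist)
        and s.datacenter == master_datacenter  # stay within the same datacenter
        and s.version == master_version        # stay within the same version
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s.applied_position)


slaves = [
    Slave("db2", "dc1", "5.7", 900),
    Slave("db3", "dc1", "5.7", 950),
    Slave("db4", "dc2", "5.7", 990),  # most advanced, but in another datacenter
    Slave("backup1", "dc1", "5.7", 999),
]
chosen = pick_candidate(slaves, "dc1", "5.7", blacklist=frozenset({"backup1"}))
print(chosen.host)  # db3
```

Returning `None` when nothing is eligible leaves the decision to a human, which fits the assisted failover approach mentioned earlier in the deck.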
  • 24. Understanding the failover process - pre- and post-failover actions
      • An automated failover process can sometimes be augmented by pre- or post-failover actions
      • Do you want to perform some action when the master has failed?
      • Do you need to reconfigure some application when a new master is promoted?
      • Do you want to remove the old master's entry from your Consul key/value store?
      • Most of the main tools that handle failover also support pre- and post-failover actions
          • MHA
          • Orchestrator
          • ClusterControl
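The idea of pre- and post-failover actions can be sketched as hooks wrapped around the promotion step. The tools listed on the slide typically run external scripts. The version below uses plain Python callables instead, and every name in it (`run_failover`, the hook functions, the hosts) is hypothetical.

```python
from typing import Callable, List

# A hook receives (old_master, new_master) and reports success.
Hook = Callable[[str, str], bool]


def run_failover(
    old_master: str,
    new_master: str,
    promote: Callable[[str], None],
    pre_hooks: List[Hook],
    post_hooks: List[Hook],
) -> bool:
    """Run pre-hooks, promote, then run post-hooks. Abort if a pre-hook fails."""
    for hook in pre_hooks:
        if not hook(old_master, new_master):
            return False  # nothing has been changed yet, so stopping is safe
    promote(new_master)
    for hook in post_hooks:
        hook(old_master, new_master)
    return True


events: List[str] = []


def deregister_old_master(old: str, new: str) -> bool:
    events.append(f"deregister {old}")  # e.g. remove an entry from a key/value store
    return True


def repoint_application(old: str, new: str) -> bool:
    events.append(f"repoint app to {new}")
    return True


def promote(host: str) -> None:
    events.append(f"promote {host}")


ok = run_failover("db1", "db2", promote, [deregister_old_master], [repoint_application])
print(ok, events)
```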
  • 25. Orchestrating failover across the whole HA stack
  • 26. Orchestrating failover across the whole HA stack
      • Databases do not exist in a vacuum; they are surrounded by other services to create a highly available environment
      • Proxies need a way to distinguish between the master and a slave
          • In PostgreSQL streaming replication this is typically the existence of a recovery.conf file
          • In MySQL it can be, for example, the value of read_only and super_read_only: 1 or 0
      • When a failover is happening, you have to make sure you manage the variable’s value correctly
          • You don’t want load balancers to send traffic to your databases while the failover is happening
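A proxy's view of the topology can be approximated as a classification of each backend by reachability and its `read_only` value. The sketch below is a simplified illustration, not the logic of any specific proxy. `build_routing` and the host names are hypothetical, and the flags are assumed to have been read from each node already.

```python
from typing import Dict, List, Tuple


def build_routing(nodes: Dict[str, Tuple[bool, bool]]) -> Dict[str, List[str]]:
    """Group hosts by role. `nodes` maps host -> (reachable, read_only)."""
    routing: Dict[str, List[str]] = {"writers": [], "readers": [], "offline": []}
    for host, (reachable, read_only) in sorted(nodes.items()):
        if not reachable:
            routing["offline"].append(host)
        elif read_only:
            routing["readers"].append(host)
        else:
            routing["writers"].append(host)
    return routing


# Mid-failover: the old master is gone and nothing has been promoted yet.
during_failover = build_routing({
    "db1": (False, False),
    "db2": (True, True),
    "db3": (True, True),
})
# After promotion: db2 has been switched to read_only=0.
after_failover = build_routing({
    "db1": (False, False),
    "db2": (True, False),
    "db3": (True, True),
})

print(during_failover["writers"])  # []
print(after_failover["writers"])   # ['db2']
```

Keeping every surviving node read-only until the promotion completes means the writer group stays empty during the failover, so no writes are routed anywhere in the meantime.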
  • 27. Orchestrating failover across the whole HA stack
  • 28. Orchestrating failover across the whole HA stack
  • 29. Orchestrating failover across the whole HA stack
      • All load balancers deployed by ClusterControl follow those rules
          • recovery.conf file on PostgreSQL
          • read_only value on MySQL
      • ClusterControl ensures that the values in MySQL are set according to the stage of the process
          • In a switchover, the master is demoted through read_only=1. In a failover this cannot be done
          • Still, read_only=1 is configured in the MySQL configuration on all nodes to minimise the chance of the old master returning as a writable host
          • The new master is marked with read_only=0
      • This process works but it does not cover all situations
  • 30. Difficult problems
  • 31. Difficult problems - network issues
      • Networks can be unstable and packets may be lost in transfer
      • Replication itself is robust and will work quite well even if there are network problems
      • Health checks of the replication also have to take such conditions into consideration
      • Make sure you do not take any action based on just a single health check
      • Make sure you do not take any action based on just a single host’s point of view
      • Expect network problems and try to understand their severity before any action is taken
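"Do not act on a single health check" is commonly implemented as a requirement for several consecutive failures before a node is declared down. The sketch below shows that idea only. The `HealthTracker` class and the threshold of three are hypothetical choices, not values from the webinar.

```python
from collections import deque


class HealthTracker:
    """Declare a node down only after N consecutive failed checks."""

    def __init__(self, required_failures: int = 3) -> None:
        # Only the most recent N results are kept.
        self._recent = deque(maxlen=required_failures)

    def record(self, check_ok: bool) -> None:
        self._recent.append(check_ok)

    def is_down(self) -> bool:
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and not any(self._recent)


tracker = HealthTracker(required_failures=3)
for result in (False, True, False, False):
    tracker.record(result)
print(tracker.is_down())  # False: a success is still inside the last three checks

tracker.record(False)
print(tracker.is_down())  # True: three failures in a row
```

One tracker per observer fits the other advice on this slide: a master would be considered dead only when the trackers of several hosts agree.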
  • 32. Difficult problems - network split
      • Every cluster type has its own problems. For MySQL and PostgreSQL replication, one of the biggest issues is the lack of cluster awareness and of quorum support
      • Replication clusters are prone to network split issues
      • Automated topology detection by proxies can make things even trickier
      • There’s no easy, standard way to avoid this problem
  • 33. Difficult problems - network split
      • A network split happens when there is no connectivity between one part of the cluster and the other
          • For example, the master cannot reach the slaves and the slaves cannot reach the master
      • The master is unavailable, therefore the cluster cannot handle writes
          • A failover should be performed to restore the cluster’s ability to handle traffic
      • The master is still running, though; when the networks converge, two writable hosts will show up
      • Standard topology detection logic will not be enough. Two nodes will have read_only=0, two nodes will not have the recovery.conf file
          • Without additional measures to ensure the old master won’t get traffic, a split brain is imminent
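After the networks converge, the situation described above shows up as more than one node reporting `read_only=0`. A minimal sketch of the "additional measure" is to list every writable node other than the freshly promoted master, so it can be fenced or removed from the proxies. `nodes_to_fence` and the host names are hypothetical, and the flags are assumed to have been collected already.

```python
from typing import Dict, List


def nodes_to_fence(read_only_by_host: Dict[str, bool], promoted_master: str) -> List[str]:
    """Return writable hosts that are not the promoted master."""
    return sorted(
        host
        for host, read_only in read_only_by_host.items()
        if not read_only and host != promoted_master
    )


# db1 is the old master that reappeared; db2 was promoted during the split.
topology = {"db1": False, "db2": False, "db3": True}
print(nodes_to_fence(topology, promoted_master="db2"))  # ['db1']
```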
  • 34. Difficult problems - split brain
      • Split brain is a condition in which two writable nodes take traffic and, as a result, their data sets drift apart
      • There’s no easy way to recover from such a condition
          • Shut down the rogue master as soon as possible to minimise the data drift
          • Manual action will be required to converge the data sets
      • Make sure that whatever solution you choose, it works
          • You can do better than GitHub!
  • 35. Difficult problems - split brain
  • 36. Difficult problems - how to avoid them?
      • There are numerous ways in which you can reduce (but not eliminate) the impact and the probability that your data will be affected by network issues
      • Collect as much data as possible about the state of the replication topology before an action is taken
          • Utilize multiple nodes as points of view on the topology
      • Try to implement STONITH to reduce the chance that the old master will show up
          • Some kind of lights-out solution (iLO, for example) might work in a physical environment
          • Kill scripts (destroying a given virtual instance) may work in the cloud
      • Modify the configuration of the proxies to remove the old master after it is deemed dead
      • No solution will be 100% bulletproof
          • You may not be able to reach all the proxies, the node itself, or the cloud service to kill the master
  • 37. Demo
  • 38. End of Year Promotion: Get Three Months Free (25% in Savings) with an Annual Contract. Just Sign by December 20th!
  • 39. Thank you!
      • Blogs that cover failover:
          • https://severalnines.com/blog/introduction-failover-mysql-replication-101-blog
          • https://severalnines.com/blog/failover-postgresql-replication-101
          • https://severalnines.com/blog/how-control-replication-failover-mysql-and-mariadb
          • https://severalnines.com/blog/controlling-replication-failover-mysql-and-mariadb-pre-or-post-failover-scripts
      • To automate or not to automate?
          • https://severalnines.com/blog/failover-mysql-replication-and-others-should-it-be-automated
      • Contact: jj@severalnines.com