Planning for Disaster Recovery (DR) with Galera Cluster

Colin Charles, colin.charles@galeracluster.com

29 October 2019

https://twitter.com/galeracluster | www.galeracluster.com

Codership Webinar

Agenda
• Disasters happen

• Trade-offs

• A geo-distributed Galera Cluster

• Architecture

• A DR plan

• Is async the best solution for DR?

• Resources

Galera Cluster highlights
• We talk a lot about High Availability

• We talk a lot about multi-master replication

• Synchronous clusters that can ensure you’re always available

• Quorum-based failure handling, with optimistic concurrency control at commit

• Optimised for the cloud/Wide Area Networks (WANs)

However, how does all this work with Disaster Recovery?
• Galera Cluster does support being run in multiple data centres

• Effectively you can have a 9-node Galera Cluster across 3 data centres to
keep you highly available

• Galera Cluster supports geo-distributed database clusters

• https://galeracluster.com/2015/07/geo-distributed-database-clusters-with-galera/

Benefits of a geo-distributed Galera Cluster
• Increased redundancy

• All database operations are local (segmented)

• Network traffic is reduced across DCs (with optimised bandwidth consumption)

• Latency penalty is as small as possible (you pay it when it is time to COMMIT; hello speed of light, et al.)

• Flow control is fully configurable

• No split-brain issues

• Out-of-the-box encryption

• Can also work with asynchronous replication

So, architecture…
• If you’re running 9 Galera Cluster nodes at a minimum, you also have to
have your application clusters in all 3 DCs

• Sure, this is great for High Availability, but it gets costly after some time…

• You also have to ensure that your schema is planned sensibly. If you have
hot rows, deadlocks, and little tolerance for performance issues during
rollbacks, this may not be the best solution for a busy application that
does a lot of UPDATEs

We are here to talk Disaster Recovery (DR)
• It is the ability to run your business continuously without any interruptions irrespective of any damage occurring to
your infrastructure

• DR is definitely not cheap, but can you afford to lose business transactions? It is this “backup cost” that you need
to think about

• We’ve seen things inside the Linux kernel that can help with DR too, e.g. DRBD

• Basically a good DR plan is your Business Continuity Plan (BCP)

• Disaster Recovery and Business Continuity, 3rd Edition by Thejandra BS

• Recovery Time Objective (RTO): the time-scale (in hours or days) within which recovery must be achieved, that is, the
length of time the organisation can afford to cease operating its business.

• Recovery Point Objective (RPO): the point in time to which an organisation should be able to recover. For example, it
could be stated as ‘Data can be recovered as of 9 pm last night’. It defines the amount of data that the organisation
can afford to lose.

• You’re building resilience in your infrastructure
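The RTO and RPO definitions above reduce to two time comparisons. As a minimal sketch, the snippet below checks a single incident against both objectives. The function names, timestamps, and the 24-hour and 2-hour objectives are hypothetical values chosen for illustration, not figures from the webinar.

```python
from datetime import datetime, timedelta


def meets_rpo(last_recoverable: datetime, disaster_at: datetime,
              rpo: timedelta) -> bool:
    """True if the data written since the last recoverable point fits the RPO."""
    return disaster_at - last_recoverable <= rpo


def meets_rto(disaster_at: datetime, restored_at: datetime,
              rto: timedelta) -> bool:
    """True if service came back within the RTO."""
    return restored_at - disaster_at <= rto


# Hypothetical incident: data recoverable "as of 9 pm last night".
last_backup = datetime(2019, 10, 28, 21, 0)
disaster = datetime(2019, 10, 29, 10, 30)
restored = datetime(2019, 10, 29, 13, 0)

# 13.5 hours of data at risk against an assumed 24-hour RPO.
print(meets_rpo(last_backup, disaster, rpo=timedelta(hours=24)))
# 2.5 hours of downtime against an assumed 2-hour RTO.
print(meets_rto(disaster, restored, rto=timedelta(hours=2)))
```

In this made-up incident the RPO holds but the RTO is missed, which is the kind of gap a DR rehearsal is meant to surface.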

Cloud people understand resilience
• Cloud instances tend to be of varying quality

• Sometimes you spin up a poor instance. Best to kill/restart it, as long as you know your
baseline benchmarks

• The Simian Army (by Netflix) can help make more resilient infrastructure

• Includes Chaos Monkey, Latency Monkey, Chaos Gorilla (drops a whole AZ), etc.

• Even spawned a field, Chaos Engineering

• Chaos engineering is the discipline of experimenting on a software system in
production in order to build confidence in the system's capability to withstand turbulent
and unexpected conditions. (ref: https://en.wikipedia.org/wiki/Chaos_engineering)

What else do you need to think about?
• Keep track of the Mean Time to Recover (MTTR)

• Underrated is the Mean Time to Detect (MTTD): how long does it take before
you know a disaster has struck and can move your workloads?

• What is your SLA?
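To make the SLA question concrete, an availability percentage can be turned into a downtime budget and compared with how long an outage lasts. The sketch below assumes a 30-day accounting period and assumes the outage seen by users is detection time plus recovery time; both are simplifications, and all names and numbers are illustrative.

```python
def downtime_budget_minutes(sla_percent: float, period_days: float = 30.0) -> float:
    """Minutes of downtime allowed per period for a given availability SLA."""
    return (1.0 - sla_percent / 100.0) * period_days * 24 * 60


def outage_minutes(mttd_minutes: float, mttr_minutes: float) -> float:
    """Assumed model: user-visible outage = time to detect + time to recover."""
    return mttd_minutes + mttr_minutes


budget = downtime_budget_minutes(99.9)      # roughly 43.2 minutes per 30 days
outage = outage_minutes(mttd_minutes=10, mttr_minutes=40)

print(f"budget: {budget:.1f} min, one outage: {outage:.1f} min")
print("within SLA" if outage <= budget else "SLA breached")
```

With these assumed numbers a single 50-minute incident already exceeds a 99.9% monthly budget, which is why detection time matters as much as recovery time.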

So what do you need in a plan?
• In terms of a Galera Cluster, you’re really thinking about ensuring you have
another data centre to take over

• You could already be running a 3-DC cluster…

• But presumably, you’re planning for disaster recovery, likely via
asynchronous replication to another data centre (as it saves the cost of
having yet another DC)

• You also want to make sure all this is 100% fully automated…

You’ll have to think about your entire stack
• Beyond the database, you have to ensure that there will be quick DNS
switchover (so low TTL on your DNS)

• Application servers need to be running and ready to take on the load at
the other data centre

• If using a proxy, this too will have to be waiting at the other data centre

• So to mitigate a complete disaster AND have great performance, you
are going to want to create a replica of your setup at a remote site

Why async replication between data centres for DR?
• Async replication in MySQL 5.7/8 is really quite fast (same with MariaDB
10.3/10.4)

• The idea of “lagging slaves” should not be too much of an issue… this
can be tuned and configured

• You must ask: is fully synchronous replication right for your application?

• Callaghan’s Law: [In a Galera cluster] a given row can’t be modified more
than once per RTT.
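Callaghan’s Law translates directly into an upper bound on how often a single hot row can be updated: at most one modification per round trip, so at most 1/RTT per second. The short sketch below works that out for a few round-trip times; the function name and the sample RTT values are illustrative assumptions, not measurements.

```python
def max_row_updates_per_second(rtt_ms: float) -> float:
    """Upper bound on updates to one row: one modification per round trip."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return 1000.0 / rtt_ms


# Hypothetical round-trip times: same DC, nearby DCs, cross-continent.
for rtt_ms in (0.5, 5.0, 100.0):
    limit = max_row_updates_per_second(rtt_ms)
    print(f"RTT {rtt_ms:>5} ms -> at most {limit:.0f} updates/s on one row")
```

At an assumed 100 ms between continents, a counter row that the application bumps on every request is capped at about 10 updates per second, regardless of hardware.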

A practical case study
• A more practical example, by Marco Tusa:
https://www.percona.com/blog/2018/11/15/how-not-to-do-mysql-high-availability-geographic-node-distribution-with-galera-based-replication-misuse/
AND
https://www.percona.com/blog/2018/11/15/mysql-high-availability-on-premises-a-geographically-distributed-scenario/

Simple reasons…
• A Galera Cluster across 3 DCs is pricier than the previous solution, and it
gives you data consistency across all nodes. You do, however, need to ensure
that your application can take the commit time penalties and that you have a
high-performance link for replication…

• The other approach is more focused on “local commits” (just to your 3-node
cluster in one DC). You’ll see some difference in data state thanks to async
replication, but you don’t need a great replication link, DR works, and this
also works better across geographies

• We tend to think that a latency of even 5 ms isn’t high, but it actually is!

• We have to remember a Galera writeset can be as small as a 1-row INSERT,
but it can also be large, with many UPDATEs

• We have to think about IP frames

• In Galera, flow control is driven by the receive queue. There is a queue of
events, and the longer this queue is, the longer it takes for certification
too.
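The point about writeset size and IP frames can be illustrated with a rough count of how many frames a writeset needs on the wire. This sketch assumes a standard 1500-byte Ethernet MTU and 40 bytes of IPv4 and TCP headers without options, and it ignores any framing overhead that Galera itself adds; the function name and the writeset sizes are hypothetical.

```python
import math


def frames_for_writeset(writeset_bytes: int, mtu: int = 1500,
                        header_bytes: int = 40) -> int:
    """Approximate number of IP frames needed to carry one writeset."""
    payload = mtu - header_bytes  # usable bytes per frame under the assumptions
    if writeset_bytes <= 0 or payload <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(writeset_bytes / payload)


# A tiny 1-row INSERT versus a large batch of UPDATEs (illustrative sizes).
print(frames_for_writeset(200))          # fits in a single frame
print(frames_for_writeset(1024 * 1024))  # a 1 MiB writeset needs hundreds
```

Every one of those frames has to cross the inter-DC link before the writeset can be certified on the remote nodes, so large transactions multiply the effect of latency and packet loss.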

All this doesn’t absolve you from other things…
• Like some kind of “automatic failover framework” when you go the async
route for DR

• A good backup and restore solution

• A good rule-based solution for load balancing (ProxySQL, MariaDB
MaxScale)

The Galera Arbitrator Daemon (garbd)
• If you have access to a 3rd data centre for a one-node garbd, or put it in
your DR site, you could also spread the cluster across just 2 DCs, thus
bringing your node count to a mere 7 nodes (instead of 9)

• When you have an even number of nodes, garbd functions as an odd
node, to avoid split-brain situations. It can also request a consistent
application state snapshot, which helps with backups
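Why an odd member helps can be seen with a simplified majority check. The sketch below assumes every member, including garbd, carries equal weight and that a partition keeps quorum only if it holds strictly more than half of the previous membership; real Galera quorum also accounts for configurable node weights and graceful departures, which are left out here.

```python
def has_quorum(reachable: int, previous_members: int) -> bool:
    """Simplified rule: a partition needs a strict majority of the old membership."""
    return 2 * reachable > previous_members


# 3 + 3 nodes in two DCs, inter-DC link cut: neither side has a majority.
print(has_quorum(3, 6), has_quorum(3, 6))

# Same layout plus one garbd in a third site (7 members in total).
# DC A can still reach garbd, DC B cannot.
print(has_quorum(3 + 1, 7))  # DC A together with garbd
print(has_quorum(3, 7))      # DC B on its own
```

With six equal members a clean split leaves both halves without quorum, while the seventh vote lets exactly one side continue.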

So what are your choices for ultimate DR?
• If you have the money, 3 data centres so you have synchronous clusters
with 9 Galera Cluster nodes… This is also in addition to your application
servers, proxies, etc.

• 2 data centres, 7 nodes, with the Galera Arbitrator is a possibility

• If you don’t have as much budget, consider the async replication option
between 2 DCs. Just remember all the “manual glue” you may need to go
with this!

• “The dread of a disaster makes everybody act in a way that increases the
disaster.” — Bertrand Russell

Some Galera Cluster specific resources
• https://galeracluster.com/library/documentation/managing-fc.html

• https://galeracluster.com/library/documentation/auto-eviction.html

• https://galeracluster.com/library/documentation/using-sr.html (Galera 4
new feature)

• https://galeracluster.com/library/documentation/backup-cluster.html

• https://galeracluster.com/library/training/tutorials/geo-distributed-clusters.html

Resources
• Disaster Recovery and Business Continuity, 3rd Edition by Thejandra BS

• Disaster Recovery, Crisis Response, and Business Continuity: A
Management Desk Reference by Jamie Watters

• Business Continuity and Disaster Recovery Planning for IT Professionals,
2nd Edition by Susan Snedaker

• Effective MySQL Backup and Recovery by Ronald Bradford

Questions?
Colin Charles, colin.charles@galeracluster.com

https://twitter.com/galeracluster | www.galeracluster.com