We talk a lot about Galera Cluster being great for High Availability, but what about Disaster Recovery (DR)? Database outages can occur when you lose a data centre to a power outage or a natural disaster, so why not plan appropriately in advance?
In this webinar, we will discuss the business considerations, including achieving the highest possible uptime and analysing business impact and risk. We will then focus on disaster recovery itself and walk through various scenarios, from having no offsite data to having synchronous replication to another data centre.
This webinar will cover MySQL with Galera Cluster, as well as the branches MariaDB Galera Cluster and Percona XtraDB Cluster (PXC). We will focus on architecture solutions and DR scenarios, and have you on your way to success by the end of it.
Planning for Disaster Recovery (DR) with Galera Cluster
1. Planning for Disaster Recovery
with Galera Cluster
Colin Charles, colin.charles@galeracluster.com
29 October 2019
https://twitter.com/galeracluster | www.galeracluster.com
Codership Webinar
2. Agenda
• Disasters happen
• Trade-offs
• A geo-distributed Galera Cluster
• Architecture
• A DR plan
• Is async the best solution for DR?
• Resources
6. Galera Cluster highlights
• We talk a lot about High Availability
• We talk a lot about multi-master replication
• Synchronous clusters that can ensure you’re always available
• Quorum based failure handling, optimistic concurrency control to commit
• Optimised for the cloud/Wide Area Networks (WANs)
8. However, how does all this work with Disaster Recovery?
• Galera Cluster does support being run in multiple data centres
• Effectively you can have a 9-node Galera Cluster across 3 data centres to
keep you highly available
• Galera Cluster supports geo-distributed database clusters
• https://galeracluster.com/2015/07/geo-distributed-database-clusters-with-galera/
10. Benefits of a geo-distributed Galera Cluster
• Increased redundancy
• All database operations are local
(segmented)
• Network traffic is reduced across
DCs (with optimised bandwidth
consumption)
• Latency penalty is as small as
possible (you pay it when it is time
to COMMIT; hello, speed of light)
• Flow control fully configurable
• No split brain issues
• Out of the box encryption
• Can also work with asynchronous
replication
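To make the "network traffic is reduced across DCs" point concrete, here is a minimal sketch that counts how many copies of one writeset leave the originating data centre. It assumes a simplified model: without segments every remote node receives its own copy over the WAN, and with segments (assigned in Galera through the `gmcast.segment` provider option) one copy goes to each remote segment and is relayed locally. The function name `cross_dc_copies` and the node layout are illustrative assumptions, not anything from the webinar.

```python
# Simplified, hypothetical model of cross-DC replication traffic.
# It ignores protocol overhead, retransmits and relay-node selection.

def cross_dc_copies(nodes_per_dc: list[int], origin_dc: int, segmented: bool) -> int:
    """Return how many copies of one writeset leave the originating DC."""
    remote = [n for i, n in enumerate(nodes_per_dc) if i != origin_dc]
    if segmented:
        # One copy per remote segment; a node there relays it to its neighbours.
        return len(remote)
    # Without segments, every remote node is sent its own copy over the WAN.
    return sum(remote)


layout = [3, 3, 3]  # 9 nodes across 3 DCs, as in the slides

print("unsegmented:", cross_dc_copies(layout, origin_dc=0, segmented=False))
print("segmented:  ", cross_dc_copies(layout, origin_dc=0, segmented=True))
```

Under these assumptions the 9-node layout sends 6 copies across the WAN without segments and 2 with them, which is where the bandwidth saving comes from.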
11. So, architecture…
• If you’re doing 9 Galera Cluster nodes at the minimum, you also have to
have your application clusters in 3 DCs
• Sure, this is great for High Availability, but it gets costly after some time…
• You also have to ensure that your schema is planned sensibly. If you have
hot rows, deadlocks, and little tolerance for performance issues during
rollbacks, this may not be the best solution for a busy application
that does a lot of UPDATEs
12. We are here to talk Disaster Recovery (DR)
• It is the ability to run your business continuously without any interruptions irrespective of any damage occurring to
your infrastructure
• DR is definitely not cheap, but can you afford to lose business transactions? It is this “backup cost” that you need
to think about
• We’ve seen things inside the Linux kernel that can help with DR too, e.g. DRBD
• Basically a good DR plan is your Business Continuity Plan (BCP)
• Disaster Recovery and Business Continuity, 3rd Edition by Thejandra BS
• Recovery Time Objective (RTO): the time-scale (in hours or days) within which recovery must be achieved, that is, the
length of time the organisation can afford to cease operating its business.
• Recovery Point Objective (RPO): the point in time to which an organisation should recover; for example, it could be
stated as ‘Data can be recovered as of 9 pm last night’. It defines the amount of data that the organisation can afford to lose.
• You’re building resilience in your infrastructure
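The RTO and RPO definitions above can be turned into a simple check against a real or rehearsed incident. This is a minimal sketch: the `Objectives` class, the `meets_objectives` function and all the timestamps and targets are hypothetical, chosen only to mirror the "9 pm last night" example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rto: timedelta  # longest outage the business can tolerate
    rpo: timedelta  # largest window of lost data the business can tolerate


def meets_objectives(obj: Objectives,
                     outage_start: datetime,
                     service_restored: datetime,
                     last_recoverable_point: datetime) -> dict:
    """Compare one incident (real or a DR drill) with the stated objectives."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_recoverable_point
    return {
        "rto_met": downtime <= obj.rto,
        "rpo_met": data_loss_window <= obj.rpo,
    }


# Illustrative numbers only: a 4-hour RTO and a 1-hour RPO.
targets = Objectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
result = meets_objectives(
    targets,
    outage_start=datetime(2019, 10, 29, 3, 0),
    service_restored=datetime(2019, 10, 29, 6, 30),
    last_recoverable_point=datetime(2019, 10, 28, 21, 0),  # "9 pm last night"
)
print(result)
```

In this made-up incident the service is back within the RTO (3.5 hours), but recovering from a 9 pm backup misses a 1-hour RPO by a wide margin, which is the kind of gap a DR drill is meant to expose.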
13. Cloud people understand resilience
• Cloud instances tend to be of varying quality
• Sometimes you spin up a poor instance. Best to kill/restart, as long as you know baseline
benchmarks
• The Simian Army (by Netflix) can help make more resilient infrastructure
• Includes Chaos Monkey, Latency Monkey, Chaos Gorilla (drops a whole AZ), etc.
• Even spawned a field, Chaos Engineering
• Chaos engineering is the discipline of experimenting on a software system in
production in order to build confidence in the system's capability to withstand turbulent
and unexpected conditions. (ref: https://en.wikipedia.org/wiki/Chaos_engineering)
14. What else do you need to think about?
• Keep track of the Mean Time to Recover (MTTR)
• Underrated is the Mean Time to Detect (MTTD): how long does it take
before you know a disaster has struck and can move your workloads?
• What is your SLA?
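One way to connect MTTD, MTTR and the SLA is to compute the downtime budget an SLA leaves you and see how much of it a single incident would consume. The sketch below assumes that MTTR is measured from detection to recovery, so an outage lasts roughly MTTD + MTTR (definitions vary). The function names and the sample figures are illustrative, not from the webinar.

```python
def downtime_budget_minutes(sla_percent: float, period_days: int = 365) -> float:
    """Minutes of downtime allowed per period by an availability SLA."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def outage_minutes(mttd_minutes: float, mttr_minutes: float) -> float:
    """Assumption: MTTR starts at detection, so the outage is MTTD + MTTR."""
    return mttd_minutes + mttr_minutes


budget = downtime_budget_minutes(99.99)                    # "four nines" over a year
incident = outage_minutes(mttd_minutes=10, mttr_minutes=30)

print(f"yearly budget: {budget:.2f} min")
print(f"one incident:  {incident:.0f} min ({incident / budget:.0%} of the budget)")
```

With these sample numbers, a 99.99% SLA allows about 52.56 minutes of downtime a year, so a single 40-minute incident uses most of it; shaving detection time helps as much as shaving recovery time.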
15. So what do you need in a plan?
• In terms of a Galera Cluster, you’re really thinking about ensuring you have
another data centre to take over
• You could already be running a 3-DC cluster…
• But presumably, you’re planning for disaster recovery, likely via
asynchronous replication to another data centre (as it saves the cost of
having yet another DC)
• You also want to make sure all this is 100% fully automated…
16. You’ll have to think about your entire stack
• Beyond the database, you have to ensure that there will be quick DNS
switchover (so low TTL on your DNS)
• Application servers need to be running and ready to take on the load at
the other data centre
• If using a proxy, this too will have to be waiting at the other data centre
• So to mitigate a complete disaster AND have great performance, you
are going to want to create a replica of your setup at a remote site
18. Why async replication between data centres for
DR?
• Async replication in MySQL 5.7/8.0 is really quite fast (same with MariaDB
10.3/10.4)
• The idea of “lagging slaves” should not be too much of an issue… this
can be tuned and configured
• You must ask — is fully synchronous replication right for your application?
• Callaghan’s Law: [In a Galera cluster] a given row can’t be modified more
than once per RTT.
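Callaghan's Law translates directly into an upper bound on how often a single hot row can be updated: at most once per round trip, so at most 1/RTT times per second. The sketch below only does that arithmetic; the function name and the sample RTT values are illustrative assumptions.

```python
def max_row_updates_per_second(rtt_ms: float) -> float:
    """Upper bound on updates to one row if it can change once per RTT."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return 1000.0 / rtt_ms


# Sample round-trip times: same rack, nearby DC, cross-continent (illustrative).
for rtt in (0.5, 5, 100):
    bound = max_row_updates_per_second(rtt)
    print(f"{rtt} ms RTT: at most {bound:.0f} updates/s on a single row")
```

This is a bound for one row, not for the whole cluster: different rows can still be modified in parallel, which is why hot rows matter so much in a geo-distributed setup.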
19. A practical case study
• A more practical example, by Marco Tusa:
https://www.percona.com/blog/2018/11/15/how-not-to-do-mysql-high-availability-geographic-node-distribution-with-galera-based-replication-misuse/
and
https://www.percona.com/blog/2018/11/15/mysql-high-availability-on-premises-a-geographically-distributed-scenario/
20. Simple reasons…
• A Galera Cluster across 3 DCs is pricier
than the previous solution, but it gives
you data consistency across all nodes.
You do, however, need to ensure that your
application can take the commit time
penalties and that you have a
high-performance link for replication…
• The other approach is more focused on
“local commits” (just to your 3-node
cluster in one DC). You’ll see some
difference in data state thanks to async
replication, but you don’t need a great
replication link, DR works, and it also
works better across geographies
• We tend to think that a latency of even
5ms isn’t high, but it actually is!
• We have to remember a Galera writeset
can be as small as a 1-row INSERT, but
can also be large with many UPDATEs
• We have to think about IP frames
• In Galera, flow control is driven by the
receive queue. There is a queue of events,
and the longer this queue is, the longer
certification takes too.
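The remark about IP frames can be illustrated with a rough estimate of how many frames a writeset needs on the wire. This sketch assumes a 1500-byte MTU and 40 bytes of IPv4 plus TCP headers without options, and it ignores Galera's own protocol overhead, TLS and retransmits, so treat the numbers as order-of-magnitude only. The function name and the writeset sizes are hypothetical.

```python
import math


def frames_needed(writeset_bytes: int, mtu: int = 1500, header_bytes: int = 40) -> int:
    """Rough count of frames for one writeset (assumed MTU and header sizes)."""
    payload_per_frame = mtu - header_bytes
    return max(1, math.ceil(writeset_bytes / payload_per_frame))


# Illustrative sizes: a small single-row INSERT versus a large batch of UPDATEs.
for label, size in (("1-row INSERT", 200), ("bulk UPDATE", 1_000_000)):
    print(f"{label}: {size} bytes, about {frames_needed(size)} frame(s)")
```

A small writeset fits in one frame, while a megabyte-sized one needs several hundred, each of which has to cross the WAN link before the commit can be certified on the remote nodes.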
21. All this doesn’t absolve you from other things…
• Like some kind of “automatic failover framework” when you go the async
route for DR
• A good backup and restore solution
• A good rule based solution for load balancing (ProxySQL, MariaDB
MaxScale)
22. The Galera Arbitrator Daemon (garbd)
• If you have access to a 3rd data centre, or put a single garbd node in your
DR site, you could also run 3 nodes in each of 2 DCs, thus bringing
your node count to a mere 7 nodes (instead of 9)
• When you have an even number of nodes, garbd functions as an odd
node, to avoid split-brain situations. It can also request a consistent
application state snapshot, which helps with backups
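The effect of garbd on quorum can be checked with a small majority calculation. The sketch below assumes the simple case where every node and the arbitrator carry one equal vote and a partition stays primary only with a strict majority of the previous membership; weighted quorum is not modelled. The function names and the data-centre labels are illustrative.

```python
def has_quorum(surviving_votes: int, total_votes: int) -> bool:
    """Strict majority rule, assuming one equal vote per member."""
    return surviving_votes * 2 > total_votes


def survives_dc_loss(votes_per_dc: dict[str, int], lost_dc: str) -> bool:
    """Does the rest of the cluster keep quorum if one whole DC disappears?"""
    total = sum(votes_per_dc.values())
    surviving = total - votes_per_dc[lost_dc]
    return has_quorum(surviving, total)


layouts = {
    "9 nodes, 3 DCs": {"dc1": 3, "dc2": 3, "dc3": 3},
    "7 votes, garbd in a 3rd site": {"dc1": 3, "dc2": 3, "site3": 1},
    "7 votes, garbd in the DR site": {"primary": 3, "dr": 4},
    "6 nodes, 2 DCs, no garbd": {"primary": 3, "dr": 3},
}

for name, layout in layouts.items():
    outcome = {dc: survives_dc_loss(layout, dc) for dc in layout}
    print(name, outcome)
```

Under these assumptions, an even 3 + 3 split cannot survive the loss of either DC, an arbitrator in a third site lets the cluster survive the loss of any one site, and an arbitrator co-located with the DR nodes only helps when it is the primary site that fails.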
23. So what are your choices for ultimate DR?
• If you have the money, 3 data centres so you have synchronous clusters
with 9 Galera Cluster nodes… This is also in addition to your application
servers, proxies, etc.
• 2 data centres, 7 nodes, with the Galera Arbitrator is a possibility
• If you don’t have as much budget, consider the async replication option
between 2 DCs. Just remember all the “manual glue” you may need to go
with this!
24. • “The dread of a disaster makes everybody act in a way that increases the
disaster.” — Bertrand Russell
26. Resources
• Disaster Recovery and Business Continuity, 3rd Edition by Thejandra BS
• Disaster Recovery, Crisis Response, and Business Continuity: A
Management Desk Reference by Jamie Watters
• Business Continuity and Disaster Recovery Planning for IT Professionals,
2nd Edition by Susan Snedaker
• Effective MySQL Backup and Recovery by Ronald Bradford