SlideShare une entreprise Scribd logo
1  sur  60
Télécharger pour lire hors ligne
MySQL Scalability and Reliability
for Replicated Environment
by Jean-François Gagné - Presented at dataops.barcelona (June 2019)
Senior Infrastructure Engineer / System and MySQL Expert
jeanfrancois AT messagebird DOT com / @jfg956
2
MessageBird
MessageBird is a cloud communications platform
founded in Amsterdam 2011.
Example of our messaging and voice SaaS:
SMS in and out, call in (IVR) and out (alert), SIP,
WhatsApp, Facebook, Telegram, Twitter, WeChat, …
Omni-Channel Conversation
Details at www.messagebird.com
225+ Direct-to-Carrier Agreements
With operators from around the world
15,000+ Customers
In over 60+ countries
200+ Employees
Engineering office in Amsterdam
Sales and support offices worldwide
We are expanding:
{Software, Front-End, Infrastructure, Data, Security, Telecom, QA} Engineers
{Team, Tech, Product} Leads, Product Owners, Customer Support
{Commercial, Connectivity, Partnership} Managers
www.messagebird.com/careers
3
Summary
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
This talk is an overview of the MySQL Replication landscape:
• Introduction to replication (what, why and how)
• Replications generalities (including tooling and lag)
• Zoom in some replication use cases (read scaling, resilience and failover)
• War story + more failover topics (bad ideas, master/master, semi-sync, …)
• Some closing remarks on scalability and sharding
I will also mention Advances Replication Subjects and provide pointers
(No need to take pictures of the slides, they will be online)
Overview of MySQL Replication
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
In a replicated environment, you have one master with one or more slaves:
• The master records transactions in a journal (binary logs); and each slave:
• Downloads the journal and saves it locally in the relay logs (done by the IO thread)
• Executes the relay logs on its local database (done by the SQL thread)
• Could also produce binary logs to be an intermediate master (log-slave-updates)
• Delay between writing to the binary log on the master and execution on slave (lag)
5
Why would you want replication
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
• Point-in-time recovery with the master binary logs
• Stability: taking backups from a slave
• Risk management: data from X hours ago on “delayed” slaves
• Reliability of reads: reading from a slave or from the master
• Scalability: spread read load on many slaves
• Availability of writes: failing over to a slave if the master crashes
• Other reasons involving MySQL slaves (aggregation with multi-source)
• Other reasons for leveraging the binlogs (gh-ost, replicating to Hadoop)
• And much more (testing new version before upgrade, rollback of upgrade, …)
6
Setting-up replication [1 of 3]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Three things to configure for replication:
• Enabling binary logging (log_bin is not dynamic K)
(disabled by default before MySQL 8.0 K, enabled in 8.0 J)
• Setting a unique server id (server_id is a dynamic global variable J)
(the last 3 numbers of the IP address is a safe-ish choice: 10.98.76.54 à 98076054)
• Create a replication user (How-to in the manual)
• MySQL 5.6+ also have a server uuid (for GITDs)
(generated magically, saved in auto.cnf, remember to delete if cloning from file cp)
• Many binlog format (binlog_format is dynamic but care as global & session)
(ROW default in MySQL 5.7+, STATEMENT before à consider RBR if 5.6 or 5.5)
(because Group Replication and latest parallel replication need RBR)
(less important for MariaDB where parallel replication works well with SBR)
Then restore a backup, setup the slave, and start replicating !
Setting-up replication [2 of 3]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
GTIDs are a useful reliability/repointing tool, you should probably use them !
• If new database, I advise to enable GTIDs (and use 5.7+)
(gtid_mode is OFF by default in all of MySQL 5.6, 5.7 and 8.0)
• If existing application with MySQL, enabling GTIDs is not straightforward
• Needs enforce_gtid_consistency (not dynamic in 5.6 and only hard fail)
(Dynamic in 5.7 and has a warning option: hopefully you are logging warnings)
(https://jfg-mysql.blogspot.com/2017/01/do-not-ignore-warnings-in-mysql-mariadb.html)
• If MariaDB 10.0+, then you are lucky: GTIDs come for free
I have a love/hate relationship with GTIDs: they are not my favourite tool
• I prefer Binlog Servers, but the open-source implementation is very young (Ripple)
(And it is an Advance Replication Subject not detailed here)
https://jfg-mysql.blogspot.com/2015/10/binlog-servers-for-backups-and-point-in-time-recovery.html 7
Setting-up replication [3 of 3]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Care with binlog-do-db and binlog-ignore-db:
• They compromise PiT recovery and might not do what you expect
https://www.percona.com/blog/2009/05/14/why-mysqls-binlog-do-db-option-is-dangerous/
A few notes about slaves (for when we will get there):
• Consider enabling binary logging on slaves (log-slave-updates)
• For failover to slaves and a good way to keep copies of binary logs for PiTR
• Replication crash-safety: enabled by default in MySQL 8.0, not before
• If not 8.0 and for enabling replication crash safety on slaves
set relay-log-info-repository = TABLE and relay-log-recovery = 1
https://www.slideshare.net/JeanFranoisGagn/demystifying-mysql-replication-crash-safety
• But with GTID, it needs sync_binlog = 1 and trx_commit = 1 L (Bug#92109)
• Be careful with replication filters:
• Might not do what you expect (see binlog-do-db and binlog-ignore-db above)
9
Replication visualization tooling
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Orchestrator…
10.3.54.241:3306
+- 10.23.12.1:3306
+- 10.2.18.42:3306
+- 10.23.78.5:3306
+- 10.2.34.56:3306
• Orchestrator needs some effort in deployment and configuration
• And it can also be used as a failover tool
• But in the meantime, pt-slave-find can help
… and pt-slave-find:
10
Replication lag [1 of 7]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Biggest enemy in a replication setup: replication lag
• Temporary lag: caused by a cold cache after a restart
• Occasional lag: caused by a write burst or some long transactions
• Perpetual lag of death: when slaves almost never catch-up
Monitoring lag:
• Second Behind Master from SHOW SLAVE STATUS
• From the Percona Toolkit: pt-heartbeat (or variation)
• (Also a new feature in MySQL 8.0 which I am not familiar with)
Replication lag [2 of 7]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Replication lag [3 of 7]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
13
Replication lag [4 of 7]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Techniques for preventing lag:
• Avoid long/big transactions
• Manage your batches
• Increase the replication throughput
If all above fail, you have a capacity problem à sharding
14
Replication lag [5 of 7]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Avoiding replication lag by not generating long transactions:
• Avoid unbounded/unpredictable writes
(we do not like FOREIGN KEY … ON DELETE/UPDATE CASCADE)
(we do not like unbounded UPDATE/DELETE and we do not like LOAD DATA)
• Avoid ALTER TABLE, use non-blocking methods (pt-osc / ghost)
(note that an online ALTER on the master blocks replication on slaves)
• Row-Based-Replication needs a primary key to run efficiently on slaves
https://jfg-mysql.blogspot.com/2017/08/danger-no-pk-with-RBR-and-mariadb-protection.html
• Long transactions compound with Intermediate Masters
• And they prevent parallel replication from working well
https://medium.com/booking-com-infrastructure/better-parallel-replication-for-mysql-14e2d7857813
15
Replication lag [6 of 7]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Avoiding replication lag by throttling batches:
• Split a batch in many small chunks
• Have a parameter to tune the chunk sizes (small transactions)
• Do not hit the master hard with all the small transactions
• Slow down: sleep between chucks (parameter), but by how much…
Advanced replication lag avoidance techniques for batches:
• Throttling with DB-Waypoints (how to get the sleep between chunks right)
https://www.slideshare.net/JeanFranoisGagn/how-bookingcom-avoids-and-deals-with-replication-lag-74772032
• Also the MariaDB No-Slave-Left-Behind Patch/Feature
https://jira.mariadb.org/browse/MDEV-8112
16
Replication lag [7 of 7]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Avoiding lag by increasing write capacity:
• Make writes more efficient
(Many database/table optimisations possible: https://dbahire.com/pleu18/ )
• Use a faster storage (SSD for read before write & cache for quicker syncs)
• Consider parallel replication for improved slave throughput
https://www.slideshare.net/JeanFranoisGagn/the-full-mysql-and-mariadb-parallel-replication-tutorial
• Care with sync_binlog = 0 and trx_commit = 2
They have consequences…
https://jfg-mysql.blogspot.com/2018/10/consequences-sync-binlog-neq-1-part-1.html
Let’s dive in the use cases
of MySQL Replication
18
Backups and related subjects [1 of 2]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Without slaves, backups must be done on the master
• Options are mysqldump/pump, XtraBackup, stop MySQL + file cp…
• All are disruptive in some ways
It is better to do backup on slaves (less disruptive):
• Same solutions as above (in the worst case, cause some replication lag)
• If you are HA, you should not afraid to stop a slave for taking a backup
(mysqldump & XtraBackup have drawbacks avoided by stop MySQL + file cp)
or from wiping data on a slave for restoring – and testing – a backup
(tested backups are the only good backups)
19
Backups and related subjects [2 of 2]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Also, having slaves is a way to keep extra copies of the binlogs:
(Binlog Servers might be a better way to achieve that, more on this later)
• Allows PiT Recovery of a backup (restore and apply the binlogs)
• Care with sync_binlog != 1: missing trx in binlogs on OS crash
https://jfg-mysql.blogspot.com/2018/10/consequences-sync-binlog-neq-1-part-1.html
and not replication crash safe with GTIDs: Bug#70659 and #92109
Delayed replication:
• Useful to recover from whoopsies: DROPing wrong table/database
• Original tool in the Percona Toolkit: pt-slave-delay
• Not needed since MySQL 5.6 and MariaDB 10.2
(MASTER_DELAY option included in CHANGE MASTER TO)
20
Reading from Slave(s) [1 of 5]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Why would you want reading from slave(s):
• Resilience: still being able to read when a node is down
• Capacity/Scaling: when the master is overloaded by reads
• Performance: splitting fast/slow queries to different slaves
• Latency: bringing data closer to readers in a multi data-center setup
Complexity of reading from slaves: dealing with lag (consistency)
• Stale read on a lagging slave (not always a problem if locally consistent)
• Temporal inconsistencies (going from a lagging to a non-lagging slave)
Reading from Slave(s) [2 of 5]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
More complexity: service discovery at scale (where to read from)
• DNS, proxy, pooling… (not discussed in detail in this talk)
(ProxySQL with a LoadBalancer for High Availability is a good start)
• At scale, all centralised solutions will be a bottleneck (Proxies and DNS)
Bad way for increasing Read Reliability:
• Normally read on the master and when it fails read on a slave
Ø This hides consistency/lagging problem most of the time
and makes these visible only when the master fails
• When few slaves à read from them and fallback to the master
• When many slaves à never fallback to the master (embrace lag)
22
Reading from Slave(s) [3 of 5]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Bad solution #1 to cope with lag:
Falling back to reading on the master
• If slaves are lagging, maybe we should read from the master…
• This looks like an attractive solution to avoid stale reads, but…
• It does not scale (why are you reading from slaves ?)
• This will cause a sudden load on the master (in case of lag)
• And it might cause an outage on the master (and this would be bad)
• It might be better to fail a read than to fallback to (and kill) the master
• Reading from the master might be okay in very specific cases
(but make sure to avoid the problem described above)
23
Reading from Slave(s) [4 of 5]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Bad solution #2 to cope with lag:
Retry on another slave
• When reading from a slave: if lag, then retry on another slave
• This scales better and is OK-ish (when few slaves are lagging)
• But what happens if many slaves are lagging ?
• Increased load (retries) can slowdown replication
• This might overload the slaves and cause a good slave to start lagging
• In the worst case, this might kill slaves and cause a domino effect
• Again: probably better to fail a read than to cause a bigger problem
• Consider separate “interactive” and “reporting” slaves (for expensive reads)
24
Reading from Slave(s) [5 of 5]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Advanced Replication Lag Coping Solutions:
• ProxySQL GTID consistent reads
https://proxysql.com/blog/proxysql-gtid-causal-reads
• DB-Waypoints for consistent read (Booking.com solution)
(subject of Simon’s talk)
Read views:
• Lag might not be a problem if you can “read-your-own-writes”…
(subject of Simon’s talk)
25
High Availability of Writes [1 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
In a MySQL replication deployment, the master is a SPOF
• Being able to react to its failure is crucial for high availability of writes
(another solution is to INSERT in many places – much harder for UPDATEs)
(not just master/master but also possible 2 completely separate databases)
Solutions to restore the availability of writes after a master failure
• Restart the master
(what if hardware fails, delay of crash recovery, and the cache starts cold)
• Database on shared storage (storage fails, crash recovery, and cold cache)
• Failover to a slave (what if DROP DATABASE)
26
High Availability of Writes [2 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Failing-over the master to a slave is my favorite high availability method
• But it is not as easy as it sounds, and it is hard to automate well
• There are many corner cases, and things will sometimes go wrong
(what if slaves are lagging, what if some data is only on the master, …)
• An example of complete failover solution in production:
https://githubengineering.com/mysql-high-availability-at-github/
The five considerations when failing-over the master to a slave:
(https://jfg-mysql.blogspot.com/2019/02/mysql-master-high-availability-and-failover-more-thoughts.html)
• Plan how are your doing Write high availability
• Decide when you apply your plan (Failure Detection – FD)
• Tell the application about the change (Service Discovery – SD)
• Protect against the limit of FD and SD for avoiding split-brains (fencing)
• Fix your data if something goes wrong
27
High Availability of Writes [3 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Failure detection (FD) is the 1st part (and 1st challenge) of failover
• It is a very hard problem: partial failure, unreliable network, partitions, …
• It is impossible to be 100% sure of a failure, and confidence needs time
à quick FD is unreliable, relatively reliable FD implies longer unavailability
Ø You need to accept that FD generates false positive (and/or false negative)
Repointing is the 2nd part of failover:
• Relatively easy with the right tools: MHA, GTID, Pseudo-GTID, Binlog Servers, …
• Complexity grows with the number of direct slaves of the master
(what if you cannot contact some of those slaves…)
• Some software for easing repointing:
• Orchestrator, Ripple Binlog Server,
Replication Manager, MHA, Cluster Control, MaxScale, …
28
High Availability of Writes [4 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
In this configuration, when the master fails, one of the slave needs to be repointed:
This configuration might simplify repointing, but…
… what if the intermediate master fails ?
High Availability of Writes [5 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
… and you cannot avoid this problem in large deployments !
30
High Availability of Writes [6 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Service Discovery (SD) is the 3rd part (and 2nd challenge) of failover:
• If centralised, it is a SPOF; if distributed, impossible to update atomically
• SD will either introduce a bottleneck (including performance limits)
or will be unreliable in some way (pointing to the wrong master)
• Some ways to implement Service Discover: DNS, ViP, Proxy, Zookeeper, …
http://code.openark.org/blog/mysql/mysql-master-discovery-methods-part-1-dns
Ø Unreliable FD and unreliable SD is a receipt for split-brains !
Protecting against split-brains: Adv. Subject – not many solutions at scale
(Proxies and well configured semi-synchronous replication might help)
Fixing your data in case of a split-brain: only you can know how to do this !
(Avoiding auto-increment by using UUID might help)
31
Failover War Story [1 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Master failover does not always go as planned
We will look at a War Story from Booking.com
• Autopsy of a [MySQL] automation disaster (FOSDEM 2017)
https://www.slideshare.net/JeanFranoisGagn/autopsy-of-an-automation-disaster
GitHub also has a recent incident with a public Post-Mortem:
• October 21 post-incident analysis
https://blog.github.com/2018-10-30-oct21-post-incident-analysis/
(Please share your war stories so we can learn from each-others’ experience)
32
Failover War Story [2 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
DNS (master) +---+
points here --> | A |
+---+
|
+---------------------+
| |
Some reads +---+ +---+
happen here --> | B | | X |
+---+ +---+
|
+---+ And other reads
| Y | <-- happen here
+---+
33
Failover War Story [3 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
DNS (master) +-/+
points here --> | A |
but they are +/-+
now failing
Some reads +-/+ +---+
happen here --> | B | | X |
but they are +/-+ +---+
now failing |
+---+ Some reads
| Y | <-- happen here
+---+
34
Failover War Story [4 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
+-/+
| A |
+/-+
+-/+ +---+ DNS (master)
| B | | X | <-- now points here
+/-+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
Failover War Story [5 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
After some time...
+---+
| A |
+---+
DNS (master) +---+ +---+
points here --> | B | | X |
+---+ +---+
|
+---+ And reads
| Y | <-- happen here
+---+
36
Failover War Story [6 of 6]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
This is not good:
• When A and B failed, X was promoted as the new master
• Something made DNS point to B à writes are now happening there
• But B is outdated: all writes to X (after the failure of A) did not reach B
• So we have data on X that cannot be read on B
• And we have new data on B that is not on Y
Fixing the data would not have been easy:
• Same auto-increments were allocated on X and then B
Ø UUID would have been better (increasing for INSERTing in PK order)
http://mysql.rjweb.org/doc.php/uuid
https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
+---+
| A |
+---+
+---+ +---+
Writes --> | B | | X |
+---+ +---+
|
+---+
| Y | <-- Reads
+---+
Failover of writes is not easy !
38
Do not do circular replication [1 of 3]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Some people are deploying ring/circular replication for HA of writes
+-----------+
v |
+---+ +---+
Writes ---> | P | | S |
+---+ +---+
| ^
+-----------+
• If the primary fails, the application switches writes (and read) to the standby
This is a bad idea in the general case !
(Friends do not let friends do ring/circular replication)
Do not do circular replication [2 of 3]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Some people are deploying ring/circular replication for HA of writes
But this is a bad idea in the general case !
• Looks attractive as nothing to do on the DB side
• Implies S writable: easy to write there by mistake
• Split-brain if apps write simultaneously on P and S
• What is S is lagging when P fails ?
• Out-of-order writes on S when P comes back up
• Replication breakage on P when it gets writes from S
Ø Maintenance nightmare: do not go there !
https://www.percona.com/blog/2014/10/07/mysql-ring-replication-why-it-is-a-bad-option/
(and many more horror stories in the DBA community)
+-----------+
v |
+---+ +---+
| P | | S |
+---+ +---+
| ^
+-----------+
The right way is
failover to S and discard P
à FD + failover + SD
+---+
Writes --> | P |
+---+
|
+---+
| S |
+---+
The right way is
failover to S and discard P
1) Failure Detection
+-/+
| P |
+/-+
+---+
| S |
+---+
The right way is
failover to S and discard P
2) Failover + Service Discovery
+-/+
| P |
+/-+
+---+
Writes --> | S |
+---+
The right way is
failover to S and discard P
3) Fencing
+---+
| P |
+---+
+---+
Writes --> | S |
+---+
ß No replication here if P is back !
Hopefully no writes here
(if no SD edge-case)
(and/or fencing is done well)
44
Do not do circular replication [3 of 3]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Some people are deploying ring/circular replication for HA of writes:
• If the primary fails, the application switches writes (and read) to the standby
This is a bad idea in the general case !
• The use cases for ring replication are niche (and easy to slip out of by mistake)
If you still want to go there (and think you can tackle the Dragon):
• Embrace split-brain and write on both nodes all the time !
• Enable chaos monkey to see what happens with out-of-order writes
• Be prepared to have replication breakages and be ready to fix them
Galera/Percona XtraDB Cluster and Group Replication
are the only true (and safe) master-master solutions
Galera/XtraDB Cluster and Group Replication
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
(IANAE in those technologies and I have little experience with them)
Galera/Percona XtraDB Cluster and Group Replication:
are the only true and safe master-master solutions
• They do not provide a better failure detection (only agreement on FD)
• They guarantee data consistency in a master/master setup
which is interesting to work around the limitations of FD and Service Discovery
• The drawback is failing conflicting transactions on commit (optimistic locking)
or failing transaction when no quorum (and avoiding split-brain)
• IMHO, these are consistency solutions (failing transactions does not look like HA to me)
which is still very interesting in some respects (fencing + HA of writes by multi-master)
• Does not (yet) work in a multi data-center deployment (still need standard replication)
• This is clearly an Advance Replication Subject
46
Semi-Syncronous Replication
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
When failing-over to a slave, committed transactions can be lost
(Some transactions on the crashed master might not have reached slaves)
àviolation of durability (ACID) in the replication topology
Except if Lossless Semi-Sync is used (MySQL 5.7+)
Semi-sync makes transactions durable by replicating them on commit:
• The cost is higher commit latency
• And the risk of blocking on commit (lower availability, but interesting fencing)
Semi-sync can also be used to fence the master by refusing commit
(And it is an Advanced Subject that I cannot describe in more details here)
https://www.percona.com/community-blog/2018/08/23/question-about-semi-synchronous-replication-
answer-with-all-the-details/
47
Pseudo GTIDs
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Pseudo-GTIDs:
• A way to get GTID-like features without GTIDs
• They work with any version of MySQL/MariaDB (even 5.5)
• But they assume in-order-commit à does not work with MTS
They also provide slave replication crash safety:
• With log-slave-updates and sync_binlog = 1
https://github.com/github/orchestrator/blob/master/docs/pseudo-gtid.md
48
The JFG rules of replication (and failover) [1 of 4]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
A slave that is not a candidate for failover should not directly replicate
from the master (unless you accept to maybe losing it when failing-over)
Because it might be ahead of the others (and semi-sync does not help)
In more details:
• A slave without binlogs should not replicate from the master (unless…)
• A slave with RBR binlogs when the master is SBR should not…
• A slave in a VLAN (or data-center) not suitable for a master should not…
• A slave with replication filters should not…
• A mutli-source slave should not…
• Probably more…
So you need to replicate via Intermediate Master
(Binlog Servers would make this obsolete)
49
The JFG rules of replication (and failover) [2 of 4]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Potential violations of the rules:
+---+
| M |
+---+
|
+-------+-------+
| | |
+---+ +---+ +---+
| S1| | S2| | S3|
+---+ +---+ +---+
If S3 has binary logging disabled,
Or if S3 is RBR and the others SBR,
Or if S3 is in the wrong VLAN/data-center,
Or <many other reasons>…
This is ok:
+---+
| M |
+---+
|
+-------+
| |
+---+ +---+ +---+
| S1| | S2| --> | S3|
+---+ +---+ +---+
50
The JFG rules of replication (and failover) [3 of 4]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
More violations of the rules:
+---+ +---+
| M1| | M2|
+---+ +---+
| |
+-----+ +-----+
| | | |
+---+ +--+------+ | +---+
| S1| | +----|-+-+ | S2|
+---+ | | | | +---+
+---+ +---+
| X1| | X2|
+---+ +---+
+---+
| M1|--- Writes #1
+---+ Writes #2
| |
+-------+-------+----------+ |
| | | | |
+---+ +---+ +---+ +---+
| S1| | S2| | S3| | M2|
+---+ +---+ +---+ +-+-+
|
+------+------+
| | |
+---+ +---+ +---+
| T1| | T2| | T3|
+---+ +---+ +---+
51
The JFG rules of replication (and failover) [4 of 4]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
This is ok:
+---+ +---+
| M1| | M2|
+---+ +---+
| |
| |
| |
+---+ /-+------+ +---+
| S1|->-/ | +----|-+--<--| S2|
+---+ | | | | +---+
+---+ +---+
| X1| | X2|
+---+ +---+
+---+
| M1|--- Writes #1
+---+ Writes #2
| |
+-------+-------+ |
| | | |
+---+ +---+ +---+ +---+
| S1| | S2| | S3|--->---| M2|
+---+ +---+ +---+ +-+-+
|
+------+------+
| | |
+---+ +---+ +---+
| T1| | T2| | T3|
+---+ +---+ +---+
52
Thoughts on Intermediate Masters
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Intermediate Masters (and log-slave-updates) introduce many problems:
• They slow-down replication
• They increase the risk of propagating errant transactions
• They are an expensive solution to artificial problems
• And more…
Binlog Servers are a better alternative
for most (all?) Intermediate Masters use cases
https://jfg-mysql.blogspot.com/2015/10/binlog-servers-for-backups-and-point-in-time-recovery.html
(also includes links to other use cases like simple failover and scaling)
53
Conclusion [1 of 2]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
• Replication is a wide subject
• I hope you are now equipped to dive deeper in the topics you need
• FOREIGN KEYs at scale: DELETE/UPDATE CASCADE à big/long trx L
• The downsides of AUTO_INCREMENTs: hard data reconciliation
• But I did not talk much about
• UNIQUE KEYs at scale
• the limits of archiving
54
Conclusion [2 of 2]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Two types of sharding:
• Vertical (Split): some schemas/tables above, some below
• Horizontal: some data in shard #1, in #2, … #N (left to right)
(Archiving is a type of horizontal sharding with N usually equal 2)
• Splits and archiving divide a problem by 2 or 3 à short term win
• UNIQUE and FOREIGN KEYs are hard to enforce on a sharded dataset
Links [1 of 5]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
• Parts of this talk are inspired by some of Nicolai Plum’s talks:
• Booking.com: Evolution of MySQL System Design (PLMCE 2015)
https://www.percona.com/live/mysql-conference-2015/sessions/bookingcom-evolution-mysql-system-design
• MySQL Scalability: Dimensions, Methods and Technologies (PLDPC 2016)
https://www.percona.com/live/data-performance-conference-2016/sessions/mysql-scalability-dimensions-methods-and-technologies
• Datastore axes: Choosing the scalability direction you need (PLAMS 2016)
https://www.percona.com/live/plam16/sessions/datastore-axes-choosing-scalability-direction-you-need
• Combined with material from some of previous talks:
• Demystifying MySQL Replication Crash Safety (PLEU 2018 and HL18Moscow)
https://www.slideshare.net/JeanFranoisGagn/demystifying-mysql-replication-crash-safety
• The Full MySQL Parallel Replication Tutorial (PLSC 2018 and many others)
https://www.slideshare.net/JeanFranoisGagn/the-full-mysql-and-mariadb-parallel-replication-tutorial
• How B.com avoids and deals with replication lag (MariaDB Dev meeting 2017 and more)
https://www.slideshare.net/JeanFranoisGagn/how-bookingcom-avoids-and-deals-with-replication-lag-74772032
• Autopsy of a [MySQL] automation disaster (FOSDEM 2017)
https://www.slideshare.net/JeanFranoisGagn/autopsy-of-an-automation-disaster
55
Links [2 of 5]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Some posts about advanced replication topics:
• The five things to consider when implementing master high-availability
https://jfg-mysql.blogspot.com/2019/02/mysql-master-high-availability-and-failover-more-thoughts.html
• Binlog Servers for Simplifying Point in Time Recovery
(also includes links to other use cases like simple failover and scaling)
https://jfg-mysql.blogspot.com/2015/10/binlog-servers-for-backups-and-point-in-time-recovery.html
• On the consequences of sync_binlog != 1 (part #1)
https://jfg-mysql.blogspot.com/2018/10/consequences-sync-binlog-neq-1-part-1.html
• Question about Semi-Synchronous Replication: the Answer with All the Details
https://www.percona.com/community-blog/2018/08/23/question-about-semi-synchronous-replication-answer-with-all-the-details/
• The danger of no Primary Key when replicating in RBR
https://jfg-mysql.blogspot.com/2017/08/danger-no-pk-with-RBR-and-mariadb-protection.html
• Better Parallel Replication for MySQL
https://medium.com/booking-com-infrastructure/better-parallel-replication-for-mysql-14e2d7857813
57
Links [3 of 5]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
More posts about an advanced replication topic (parallel replication):
• A Metric for Tuning Parallel Replication in MySQL 5.7
https://jfg-mysql.blogspot.com/2017/02/metric-for-tuning-parallel-replication-mysql-5-7.html
• Evaluating MySQL Parallel Replication Part 4: More Benchmarks in Production
https://medium.com/booking-com-infrastructure/evaluating-mysql-parallel-replication-part-4-more-benchmarks-in-production-49ee255043ab
(links to other parts of the series in the post)
• An update on Write Set (parallel replication) bug fix in MySQL 8.0
https://jfg-mysql.blogspot.com/2018/01/an-update-on-write-set-parallel-replication-bug-fix-in-mysql-8-0.html
Post on related subjects:
• Why I wrote "do not ignore warnings" and "always investigate/fix warnings”
https://jfg-mysql.blogspot.com/2017/01/do-not-ignore-warnings-in-mysql-mariadb.html
58
Links [4 of 5]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Other interesting posts:
• MySQL Ripple: The First Impression of a MySQL Binlog Server
https://www.percona.com/blog/2019/03/15/mysql-ripple-first-impression-of-mysql-binlog-server/
• Why MySQL’s binlog-do-db option is dangerous
https://www.percona.com/blog/2009/05/14/why-mysqls-binlog-do-db-option-is-dangerous/
• MySQL ring replication: Why it is a bad option
https://www.percona.com/blog/2014/10/07/mysql-ring-replication-why-it-is-a-bad-option/
• Orchestrator and Pseudo-GTID:
https://github.com/github/orchestrator/blob/master/docs/pseudo-gtid.md
• ProxySQL GTID consistent reads:
https://proxysql.com/blog/proxysql-gtid-causal-reads
• MySQL High Availability at GitHub
https://githubengineering.com/mysql-high-availability-at-github/
• GitHub October 21 post-incident analysis
https://blog.github.com/2018-10-30-oct21-post-incident-analysis/
59
Links [5 of 5]
(MySQL Scalability and Reliability for Replication – do.bcn 2019)
Other interesting tools, etc…:
• Percona Toolkit: https://www.percona.com/doc/percona-toolkit/LATEST/index.html
• pt-heartbeat and pt-online-schema-change
• pt-slave-find, pt-slave-restart, pt-table-checksum and pt-table-sync
• pt-slave-delay (not needed since 5.6 and 10.2)
• GitHub Online Schema Migration/gh-ost: https://github.com/github/gh-ost
• Monotonically increasing UUID
http://mysql.rjweb.org/doc.php/uuid
https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
• Query Optimization: The Basics https://dbahire.com/pleu18/
• About sharding:
• Vitess: https://vitess.io/
• Spider: https://mariadb.com/kb/en/library/spider-storage-engine-overview/
• MySQL Cluster: https://www.mysql.com/products/cluster/
Thanks !
by Jean-François Gagné - Presented at dataops.barcelona (June 2019)
Senior Infrastructure Engineer / System and MySQL Expert
jeanfrancois AT messagebird DOT com / @jfg956

Contenu connexe

Tendances

Maria DB Galera Cluster for High Availability
Maria DB Galera Cluster for High AvailabilityMaria DB Galera Cluster for High Availability
Maria DB Galera Cluster for High AvailabilityOSSCube
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)Siglos
 
M|18 Architectural Overview: MariaDB MaxScale
M|18 Architectural Overview: MariaDB MaxScaleM|18 Architectural Overview: MariaDB MaxScale
M|18 Architectural Overview: MariaDB MaxScaleMariaDB plc
 
Galera cluster for high availability
Galera cluster for high availability Galera cluster for high availability
Galera cluster for high availability Mydbops
 
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDB
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDBWebinar slides: Migrating to Galera Cluster for MySQL and MariaDB
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDBSeveralnines
 
MySQL InnoDB Cluster - A complete High Availability solution for MySQL
MySQL InnoDB Cluster - A complete High Availability solution for MySQLMySQL InnoDB Cluster - A complete High Availability solution for MySQL
MySQL InnoDB Cluster - A complete High Availability solution for MySQLOlivier DASINI
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlJiangjie Qin
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringScyllaDB
 
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...Jean-François Gagné
 
Taking advantage of Prometheus relabeling
Taking advantage of Prometheus relabelingTaking advantage of Prometheus relabeling
Taking advantage of Prometheus relabelingJulien Pivotto
 
MariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAsMariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAsFederico Razzoli
 
Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator
Almost Perfect Service Discovery and Failover with ProxySQL and OrchestratorAlmost Perfect Service Discovery and Failover with ProxySQL and Orchestrator
Almost Perfect Service Discovery and Failover with ProxySQL and OrchestratorJean-François Gagné
 
Optimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database DriversOptimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database DriversScyllaDB
 
MMUG18 - MySQL Failover and Orchestrator
MMUG18 - MySQL Failover and OrchestratorMMUG18 - MySQL Failover and Orchestrator
MMUG18 - MySQL Failover and OrchestratorSimon J Mudd
 
MariaDB Galera Cluster
MariaDB Galera ClusterMariaDB Galera Cluster
MariaDB Galera ClusterAbdul Manaf
 
Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0Mydbops
 
Monitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTapMonitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTapPadraig O'Sullivan
 
ProxySQL on Kubernetes
ProxySQL on KubernetesProxySQL on Kubernetes
ProxySQL on KubernetesRené Cannaò
 

Tendances (20)

Maria DB Galera Cluster for High Availability
Maria DB Galera Cluster for High AvailabilityMaria DB Galera Cluster for High Availability
Maria DB Galera Cluster for High Availability
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)
 
M|18 Architectural Overview: MariaDB MaxScale
M|18 Architectural Overview: MariaDB MaxScaleM|18 Architectural Overview: MariaDB MaxScale
M|18 Architectural Overview: MariaDB MaxScale
 
Introduction to Galera Cluster
Introduction to Galera ClusterIntroduction to Galera Cluster
Introduction to Galera Cluster
 
Galera cluster for high availability
Galera cluster for high availability Galera cluster for high availability
Galera cluster for high availability
 
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDB
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDBWebinar slides: Migrating to Galera Cluster for MySQL and MariaDB
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDB
 
MySQL InnoDB Cluster - A complete High Availability solution for MySQL
MySQL InnoDB Cluster - A complete High Availability solution for MySQLMySQL InnoDB Cluster - A complete High Availability solution for MySQL
MySQL InnoDB Cluster - A complete High Availability solution for MySQL
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 
Mastering Real-time Linux
Mastering Real-time LinuxMastering Real-time Linux
Mastering Real-time Linux
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
 
Taking advantage of Prometheus relabeling
Taking advantage of Prometheus relabelingTaking advantage of Prometheus relabeling
Taking advantage of Prometheus relabeling
 
MariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAsMariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAs
 
Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator
Almost Perfect Service Discovery and Failover with ProxySQL and OrchestratorAlmost Perfect Service Discovery and Failover with ProxySQL and Orchestrator
Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator
 
Optimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database DriversOptimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database Drivers
 
MMUG18 - MySQL Failover and Orchestrator
MMUG18 - MySQL Failover and OrchestratorMMUG18 - MySQL Failover and Orchestrator
MMUG18 - MySQL Failover and Orchestrator
 
MariaDB Galera Cluster
MariaDB Galera ClusterMariaDB Galera Cluster
MariaDB Galera Cluster
 
Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0
 
Monitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTapMonitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTap
 
ProxySQL on Kubernetes
ProxySQL on KubernetesProxySQL on Kubernetes
ProxySQL on Kubernetes
 

Similaire à MySQL Scalability and Reliability for Replicated Environment

MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
 
My sql5.7 whatsnew_presentedatgids2015
My sql5.7 whatsnew_presentedatgids2015My sql5.7 whatsnew_presentedatgids2015
My sql5.7 whatsnew_presentedatgids2015Sanjay Manwani
 
My sql cluster case study apr16
My sql cluster case study apr16My sql cluster case study apr16
My sql cluster case study apr16Sumi Ryu
 
Data has a better idea the in-memory data grid
Data has a better idea   the in-memory data gridData has a better idea   the in-memory data grid
Data has a better idea the in-memory data gridBogdan Dina
 
MySQL 5.7: What's New, Nov. 2015
MySQL 5.7: What's New, Nov. 2015MySQL 5.7: What's New, Nov. 2015
MySQL 5.7: What's New, Nov. 2015Mario Beck
 
MySQL Day Paris 2018 - MySQL InnoDB Cluster; A complete High Availability sol...
MySQL Day Paris 2018 - MySQL InnoDB Cluster; A complete High Availability sol...MySQL Day Paris 2018 - MySQL InnoDB Cluster; A complete High Availability sol...
MySQL Day Paris 2018 - MySQL InnoDB Cluster; A complete High Availability sol...Olivier DASINI
 
Best practices for MySQL High Availability Tutorial
Best practices for MySQL High Availability TutorialBest practices for MySQL High Availability Tutorial
Best practices for MySQL High Availability TutorialColin Charles
 
MariaDB 10.0 - SkySQL Paris Meetup
MariaDB 10.0 - SkySQL Paris MeetupMariaDB 10.0 - SkySQL Paris Meetup
MariaDB 10.0 - SkySQL Paris MeetupMariaDB Corporation
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreDataStax Academy
 
What's New in MySQL 5.7
What's New in MySQL 5.7What's New in MySQL 5.7
What's New in MySQL 5.7Olivier DASINI
 
MySQL Day Paris 2018 - Upgrade from MySQL 5.7 to MySQL 8.0
MySQL Day Paris 2018 - Upgrade from MySQL 5.7 to MySQL 8.0MySQL Day Paris 2018 - Upgrade from MySQL 5.7 to MySQL 8.0
MySQL Day Paris 2018 - Upgrade from MySQL 5.7 to MySQL 8.0Olivier DASINI
 
What's new in MySQL Cluster 7.4 webinar charts
What's new in MySQL Cluster 7.4 webinar chartsWhat's new in MySQL Cluster 7.4 webinar charts
What's new in MySQL Cluster 7.4 webinar chartsAndrew Morgan
 
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...Mydbops
 
Marketing Automation at Scale: How Marketo Solved Key Data Management Challen...
Marketing Automation at Scale: How Marketo Solved Key Data Management Challen...Marketing Automation at Scale: How Marketo Solved Key Data Management Challen...
Marketing Automation at Scale: How Marketo Solved Key Data Management Challen...Continuent
 
20190817 coscup-oracle my sql innodb cluster sharing
20190817 coscup-oracle my sql innodb cluster sharing20190817 coscup-oracle my sql innodb cluster sharing
20190817 coscup-oracle my sql innodb cluster sharingIvan Ma
 
The New MariaDB Offering: MariaDB 10, MaxScale and More
The New MariaDB Offering: MariaDB 10, MaxScale and MoreThe New MariaDB Offering: MariaDB 10, MaxScale and More
The New MariaDB Offering: MariaDB 10, MaxScale and MoreMariaDB Corporation
 
01 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv101 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv1Ivan Ma
 
Using Snap Clone with Enterprise Manager 12c
Using Snap Clone with Enterprise Manager 12cUsing Snap Clone with Enterprise Manager 12c
Using Snap Clone with Enterprise Manager 12cPete Sharman
 
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera ClusterWebinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera ClusterContinuent
 

Similaire à MySQL Scalability and Reliability for Replicated Environment (20)

MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated Environment
 
My sql5.7 whatsnew_presentedatgids2015
My sql5.7 whatsnew_presentedatgids2015My sql5.7 whatsnew_presentedatgids2015
My sql5.7 whatsnew_presentedatgids2015
 
My sql cluster case study apr16
My sql cluster case study apr16My sql cluster case study apr16
My sql cluster case study apr16
 
Data has a better idea the in-memory data grid
Data has a better idea   the in-memory data gridData has a better idea   the in-memory data grid
Data has a better idea the in-memory data grid
 
MySQL 5.7: What's New, Nov. 2015
MySQL 5.7: What's New, Nov. 2015MySQL 5.7: What's New, Nov. 2015
MySQL 5.7: What's New, Nov. 2015
 
MySQL Day Paris 2018 - MySQL InnoDB Cluster; A complete High Availability sol...
MySQL Day Paris 2018 - MySQL InnoDB Cluster; A complete High Availability sol...MySQL Day Paris 2018 - MySQL InnoDB Cluster; A complete High Availability sol...
MySQL Day Paris 2018 - MySQL InnoDB Cluster; A complete High Availability sol...
 
Best practices for MySQL High Availability Tutorial
Best practices for MySQL High Availability TutorialBest practices for MySQL High Availability Tutorial
Best practices for MySQL High Availability Tutorial
 
MariaDB 10.0 - SkySQL Paris Meetup
MariaDB 10.0 - SkySQL Paris MeetupMariaDB 10.0 - SkySQL Paris Meetup
MariaDB 10.0 - SkySQL Paris Meetup
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
 
What's New in MySQL 5.7
What's New in MySQL 5.7What's New in MySQL 5.7
What's New in MySQL 5.7
 
MySQL Day Paris 2018 - Upgrade from MySQL 5.7 to MySQL 8.0
MySQL Day Paris 2018 - Upgrade from MySQL 5.7 to MySQL 8.0MySQL Day Paris 2018 - Upgrade from MySQL 5.7 to MySQL 8.0
MySQL Day Paris 2018 - Upgrade from MySQL 5.7 to MySQL 8.0
 
What's new in MySQL Cluster 7.4 webinar charts
What's new in MySQL Cluster 7.4 webinar chartsWhat's new in MySQL Cluster 7.4 webinar charts
What's new in MySQL Cluster 7.4 webinar charts
 
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
 
MySQL 5.7 what's new
MySQL 5.7 what's newMySQL 5.7 what's new
MySQL 5.7 what's new
 
Marketing Automation at Scale: How Marketo Solved Key Data Management Challen...
Marketing Automation at Scale: How Marketo Solved Key Data Management Challen...Marketing Automation at Scale: How Marketo Solved Key Data Management Challen...
Marketing Automation at Scale: How Marketo Solved Key Data Management Challen...
 
20190817 coscup-oracle my sql innodb cluster sharing
20190817 coscup-oracle my sql innodb cluster sharing20190817 coscup-oracle my sql innodb cluster sharing
20190817 coscup-oracle my sql innodb cluster sharing
 
The New MariaDB Offering: MariaDB 10, MaxScale and More
The New MariaDB Offering: MariaDB 10, MaxScale and MoreThe New MariaDB Offering: MariaDB 10, MaxScale and More
The New MariaDB Offering: MariaDB 10, MaxScale and More
 
01 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv101 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv1
 
Using Snap Clone with Enterprise Manager 12c
Using Snap Clone with Enterprise Manager 12cUsing Snap Clone with Enterprise Manager 12c
Using Snap Clone with Enterprise Manager 12c
 
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera ClusterWebinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
 

Plus de Jean-François Gagné

The consequences of sync_binlog != 1
The consequences of sync_binlog != 1The consequences of sync_binlog != 1
The consequences of sync_binlog != 1Jean-François Gagné
 
Autopsy of a MySQL Automation Disaster
Autopsy of a MySQL Automation DisasterAutopsy of a MySQL Automation Disaster
Autopsy of a MySQL Automation DisasterJean-François Gagné
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyJean-François Gagné
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyJean-François Gagné
 
MySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitationsMySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitationsJean-François Gagné
 
The two little bugs that almost brought down Booking.com
The two little bugs that almost brought down Booking.comThe two little bugs that almost brought down Booking.com
The two little bugs that almost brought down Booking.comJean-François Gagné
 
How Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagHow Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagJean-François Gagné
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsJean-François Gagné
 
MySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitationsMySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitationsJean-François Gagné
 
Riding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication StreamRiding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication StreamJean-François Gagné
 

Plus de Jean-François Gagné (11)

The consequences of sync_binlog != 1
The consequences of sync_binlog != 1The consequences of sync_binlog != 1
The consequences of sync_binlog != 1
 
Autopsy of a MySQL Automation Disaster
Autopsy of a MySQL Automation DisasterAutopsy of a MySQL Automation Disaster
Autopsy of a MySQL Automation Disaster
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash Safety
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash Safety
 
MySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitationsMySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitations
 
The two little bugs that almost brought down Booking.com
The two little bugs that almost brought down Booking.comThe two little bugs that almost brought down Booking.com
The two little bugs that almost brought down Booking.com
 
Autopsy of an automation disaster
Autopsy of an automation disasterAutopsy of an automation disaster
Autopsy of an automation disaster
 
How Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagHow Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lag
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitations
 
MySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitationsMySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitations
 
Riding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication StreamRiding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication Stream
 

Dernier

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Dernier (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

MySQL Scalability and Reliability for Replicated Environment

  • 1. MySQL Scalability and Reliability for Replicated Environment by Jean-François Gagné - Presented at dataops.barcelona (June 2019) Senior Infrastructure Engineer / System and MySQL Expert jeanfrancois AT messagebird DOT com / @jfg956
  • 2. 2 MessageBird MessageBird is a cloud communications platform founded in Amsterdam 2011. Example of our messaging and voice SaaS: SMS in and out, call in (IVR) and out (alert), SIP, WhatsApp, Facebook, Telegram, Twitter, WeChat, … Omni-Channel Conversation Details at www.messagebird.com 225+ Direct-to-Carrier Agreements With operators from around the world 15,000+ Customers In over 60+ countries 200+ Employees Engineering office in Amsterdam Sales and support offices worldwide We are expanding: {Software, Front-End, Infrastructure, Data, Security, Telecom, QA} Engineers {Team, Tech, Product} Leads, Product Owners, Customer Support {Commercial, Connectivity, Partnership} Managers www.messagebird.com/careers
  • 3. 3 Summary (MySQL Scalability and Reliability for Replication – do.bcn 2019) This talk is an overview of the MySQL Replication landscape: • Introduction to replication (what, why and how) • Replications generalities (including tooling and lag) • Zoom in some replication use cases (read scaling, resilience and failover) • War story + more failover topics (bad ideas, master/master, semi-sync, …) • Some closing remarks on scalability and sharding I will also mention Advances Replication Subjects and provide pointers (No need to take pictures of the slides, they will be online)
  • 4. Overview of MySQL Replication (MySQL Scalability and Reliability for Replication – do.bcn 2019) In a replicated environment, you have one master with one or more slaves: • The master records transactions in a journal (binary logs); and each slave: • Downloads the journal and saves it locally in the relay logs (done by the IO thread) • Executes the relay logs on its local database (done by the SQL thread) • Could also produce binary logs to be an intermediate master (log-slave-updates) • Delay between writing to the binary log on the master and execution on slave (lag)
  • 5. 5 Why would you want replication (MySQL Scalability and Reliability for Replication – do.bcn 2019) • Point-in-time recovery with the master binary logs • Stability: taking backups from a slave • Risk management: data from X hours ago on “delayed” slaves • Reliability of reads: reading from a slave or from the master • Scalability: spread read load on many slaves • Availability of writes: failing over to a slave if the master crashes • Other reasons involving MySQL slaves (aggregation with multi-source) • Other reasons for leveraging the binlogs (gh-ost, replicating to Hadoop) • And much more (testing new version before upgrade, rollback of upgrade, …)
  • 6. 6 Setting-up replication [1 of 3] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Three things to configure for replication: • Enabling binary logging (log_bin is not dynamic K) (disabled by default before MySQL 8.0 K, enabled in 8.0 J) • Setting a unique server id (server_id is a dynamic global variable J) (the last 3 numbers of the IP address is a safe-ish choice: 10.98.76.54 à 98076054) • Create a replication user (How-to in the manual) • MySQL 5.6+ also have a server uuid (for GITDs) (generated magically, saved in auto.cnf, remember to delete if cloning from file cp) • Many binlog format (binlog_format is dynamic but care as global & session) (ROW default in MySQL 5.7+, STATEMENT before à consider RBR if 5.6 or 5.5) (because Group Replication and latest parallel replication need RBR) (less important for MariaDB where parallel replication works well with SBR) Then restore a backup, setup the slave, and start replicating !
  • 7. Setting-up replication [2 of 3] (MySQL Scalability and Reliability for Replication – do.bcn 2019) GTIDs are a useful reliability/repointing tool, you should probably use them ! • If new database, I advise to enable GTIDs (and use 5.7+) (gtid_mode is OFF by default in all of MySQL 5.6, 5.7 and 8.0) • If existing application with MySQL, enabling GTIDs is not straightforward • Needs enforce_gtid_consistency (not dynamic in 5.6 and only hard fail) (Dynamic in 5.7 and has a warning option: hopefully you are logging warnings) (https://jfg-mysql.blogspot.com/2017/01/do-not-ignore-warnings-in-mysql-mariadb.html) • If MariaDB 10.0+, then you are lucky: GTIDs come for free I have a love/hate relationship with GTIDs: they are not my favourite tool • I prefer Binlog Servers, but the open-source implementation is very young (Ripple) (And it is an Advance Replication Subject not detailed here) https://jfg-mysql.blogspot.com/2015/10/binlog-servers-for-backups-and-point-in-time-recovery.html 7
  • 8. Setting-up replication [3 of 3] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Care with binlog-do-db and binlog-ignore-db: • They compromise PiT recovery and might not do what you expect https://www.percona.com/blog/2009/05/14/why-mysqls-binlog-do-db-option-is-dangerous/ A few notes about slaves (for when we will get there): • Consider enabling binary logging on slaves (log-slave-updates) • For failover to slaves and a good way to keep copies of binary logs for PiTR • Replication crash-safety: enabled by default in MySQL 8.0, not before • If not 8.0 and for enabling replication crash safety on slaves set relay-log-info-repository = TABLE and relay-log-recovery = 1 https://www.slideshare.net/JeanFranoisGagn/demystifying-mysql-replication-crash-safety • But with GTID, it needs sync_binlog = 1 and trx_commit = 1 L (Bug#92109) • Be careful with replication filters: • Might not do what you expect (see binlog-do-db and binlog-ignore-db above)
  • 9. 9 Replication visualization tooling (MySQL Scalability and Reliability for Replication – do.bcn 2019) Orchestrator… 10.3.54.241:3306 +- 10.23.12.1:3306 +- 10.2.18.42:3306 +- 10.23.78.5:3306 +- 10.2.34.56:3306 • Orchestrator needs some effort in deployment and configuration • And it can also be used as a failover tool • But in the meantime, pt-slave-find can help … and pt-slave-find:
  • 10. 10 Replication lag [1 of 7] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Biggest enemy in a replication setup: replication lag • Temporary lag: caused by a cold cache after a restart • Occasional lag: caused by a write burst or some long transactions • Perpetual lag of death: when slaves almost never catch-up Monitoring lag: • Second Behind Master from SHOW SLAVE STATUS • From the Percona Toolkit: pt-heartbeat (or variation) • (Also a new feature in MySQL 8.0 which I am not familiar with)
  • 11. Replication lag [2 of 7] (MySQL Scalability and Reliability for Replication – do.bcn 2019)
  • 12. Replication lag [3 of 7] (MySQL Scalability and Reliability for Replication – do.bcn 2019)
  • 13. 13 Replication lag [4 of 7] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Techniques for preventing lag: • Avoid long/big transactions • Manage your batches • Increase the replication throughput If all above fail, you have a capacity problem à sharding
  • 14. 14 Replication lag [5 of 7] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Avoiding replication lag by not generating long transactions: • Avoid unbounded/unpredictable writes (we do not like FOREIGN KEY … ON DELETE/UPDATE CASCADE) (we do not like unbounded UPDATE/DELETE and we do not like LOAD DATA) • Avoid ALTER TABLE, use non-blocking methods (pt-osc / ghost) (note that an online ALTER on the master blocks replication on slaves) • Row-Based-Replication needs a primary key to run efficiently on slaves https://jfg-mysql.blogspot.com/2017/08/danger-no-pk-with-RBR-and-mariadb-protection.html • Long transactions compound with Intermediate Masters • And they prevent parallel replication from working well https://medium.com/booking-com-infrastructure/better-parallel-replication-for-mysql-14e2d7857813
  • 15. 15 Replication lag [6 of 7] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Avoiding replication lag by throttling batches: • Split a batch in many small chunks • Have a parameter to tune the chunk sizes (small transactions) • Do not hit the master hard with all the small transactions • Slow down: sleep between chucks (parameter), but by how much… Advanced replication lag avoidance techniques for batches: • Throttling with DB-Waypoints (how to get the sleep between chunks right) https://www.slideshare.net/JeanFranoisGagn/how-bookingcom-avoids-and-deals-with-replication-lag-74772032 • Also the MariaDB No-Slave-Left-Behind Patch/Feature https://jira.mariadb.org/browse/MDEV-8112
  • 16. 16 Replication lag [7 of 7] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Avoiding lag by increasing write capacity: • Make writes more efficient (Many database/table optimisations possible: https://dbahire.com/pleu18/ ) • Use a faster storage (SSD for read before write & cache for quicker syncs) • Consider parallel replication for improved slave throughput https://www.slideshare.net/JeanFranoisGagn/the-full-mysql-and-mariadb-parallel-replication-tutorial • Care with sync_binlog = 0 and trx_commit = 2 They have consequences… https://jfg-mysql.blogspot.com/2018/10/consequences-sync-binlog-neq-1-part-1.html
  • 17. Let’s dive in the use cases of MySQL Replication
  • 18. 18 Backups and related subjects [1 of 2] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Without slaves, backups must be done on the master • Options are mysqldump/pump, XtraBackup, stop MySQL + file cp… • All are disruptive in some ways It is better to do backup on slaves (less disruptive): • Same solutions as above (in the worst case, cause some replication lag) • If you are HA, you should not afraid to stop a slave for taking a backup (mysqldump & XtraBackup have drawbacks avoided by stop MySQL + file cp) or from wiping data on a slave for restoring – and testing – a backup (tested backups are the only good backups)
  • 19. 19 Backups and related subjects [2 of 2] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Also, having slaves is a way to keep extra copies of the binlogs: (Binlog Servers might be a better way to achieve that, more on this later) • Allows PiT Recovery of a backup (restore and apply the binlogs) • Care with sync_binlog != 1: missing trx in binlogs on OS crash https://jfg-mysql.blogspot.com/2018/10/consequences-sync-binlog-neq-1-part-1.html and not replication crash safe with GTIDs: Bug#70659 and #92109 Delayed replication: • Useful to recover from whoopsies: DROPing wrong table/database • Original tool in the Percona Toolkit: pt-slave-delay • Not needed since MySQL 5.6 and MariaDB 10.2 (MASTER_DELAY option included in CHANGE MASTER TO)
  • 20. 20 Reading from Slave(s) [1 of 5] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Why would you want reading from slave(s): • Resilience: still being able to read when a node is down • Capacity/Scaling: when the master is overloaded by reads • Performance: splitting fast/slow queries to different slaves • Latency: bringing data closer to readers in a multi data-center setup Complexity of reading from slaves: dealing with lag (consistency) • Stale read on a lagging slave (not always a problem if locally consistent) • Temporal inconsistencies (going from a lagging to a non-lagging slave)
  • 21. Reading from Slave(s) [2 of 5] (MySQL Scalability and Reliability for Replication – do.bcn 2019) More complexity: service discovery at scale (where to read from) • DNS, proxy, pooling… (not discussed in detail in this talk) (ProxySQL with a LoadBalancer for High Availability is a good start) • At scale, all centralised solutions will be a bottleneck (Proxies and DNS) Bad way for increasing Read Reliability: • Normally read on the master and when it fails read on a slave Ø This hides consistency/lagging problem most of the time and makes these visible only when the master fails • When few slaves à read from them and fallback to the master • When many slaves à never fallback to the master (embrace lag)
  • 22. 22 Reading from Slave(s) [3 of 5] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Bad solution #1 to cope with lag: Falling back to reading on the master • If slaves are lagging, maybe we should read from the master… • This looks like an attractive solution to avoid stale reads, but… • It does not scale (why are you reading from slaves ?) • This will cause a sudden load on the master (in case of lag) • And it might cause an outage on the master (and this would be bad) • It might be better to fail a read than to fallback to (and kill) the master • Reading from the master might be okay in very specific cases (but make sure to avoid the problem described above)
  • 23. 23 Reading from Slave(s) [4 of 5] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Bad solution #2 to cope with lag: Retry on another slave • When reading from a slave: if lag, then retry on another slave • This scales better and is OK-ish (when few slaves are lagging) • But what happens if many slaves are lagging ? • Increased load (retries) can slowdown replication • This might overload the slaves and cause a good slave to start lagging • In the worst case, this might kill slaves and cause a domino effect • Again: probably better to fail a read than to cause a bigger problem • Consider separate “interactive” and “reporting” slaves (for expensive reads)
  • 24. 24 Reading from Slave(s) [5 of 5] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Advanced Replication Lag Coping Solutions: • ProxySQL GTID consistent reads https://proxysql.com/blog/proxysql-gtid-causal-reads • DB-Waypoints for consistent read (Booking.com solution) (subject of Simon’s talk) Read views: • Lag might not be a problem if you can “read-your-own-writes”… (subject of Simon’s talk)
  • 25. 25 High Availability of Writes [1 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) In a MySQL replication deployment, the master is a SPOF • Being able to react to its failure is crucial for high availability of writes (another solution is to INSERT in many places – much harder for UPDATEs) (not just master/master but also possible 2 completely separate databases) Solutions to restore the availability of writes after a master failure • Restart the master (what if hardware fails, delay of crash recovery, and the cache starts cold) • Database on shared storage (storage fails, crash recovery, and cold cache) • Failover to a slave (what if DROP DATABASE)
  • 26. 26 High Availability of Writes [2 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Failing-over the master to a slave is my favorite high availability method • But it is not as easy as it sounds, and it is hard to automate well • There are many corner cases, and things will sometimes go wrong (what if slaves are lagging, what if some data is only on the master, …) • An example of complete failover solution in production: https://githubengineering.com/mysql-high-availability-at-github/ The five considerations when failing-over the master to a slave: (https://jfg-mysql.blogspot.com/2019/02/mysql-master-high-availability-and-failover-more-thoughts.html) • Plan how are your doing Write high availability • Decide when you apply your plan (Failure Detection – FD) • Tell the application about the change (Service Discovery – SD) • Protect against the limit of FD and SD for avoiding split-brains (fencing) • Fix your data if something goes wrong
  • 27. 27 High Availability of Writes [3 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Failure detection (FD) is the 1st part (and 1st challenge) of failover • It is a very hard problem: partial failure, unreliable network, partitions, … • It is impossible to be 100% sure of a failure, and confidence needs time à quick FD is unreliable, relatively reliable FD implies longer unavailability Ø You need to accept that FD generates false positive (and/or false negative) Repointing is the 2nd part of failover: • Relatively easy with the right tools: MHA, GTID, Pseudo-GTID, Binlog Servers, … • Complexity grows with the number of direct slaves of the master (what if you cannot contact some of those slaves…) • Some software for easing repointing: • Orchestrator, Ripple Binlog Server, Replication Manager, MHA, Cluster Control, MaxScale, …
  • 28. 28 High Availability of Writes [4 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) In this configuration, when the master fails, one of the slave needs to be repointed: This configuration might simplify repointing, but… … what if the intermediate master fails ?
  • 29. High Availability of Writes [5 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) … and you cannot avoid this problem in large deployments !
  • 30. 30 High Availability of Writes [6 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Service Discovery (SD) is the 3rd part (and 2nd challenge) of failover: • If centralised, it is a SPOF; if distributed, impossible to update atomically • SD will either introduce a bottleneck (including performance limits) or will be unreliable in some way (pointing to the wrong master) • Some ways to implement Service Discover: DNS, ViP, Proxy, Zookeeper, … http://code.openark.org/blog/mysql/mysql-master-discovery-methods-part-1-dns Ø Unreliable FD and unreliable SD is a receipt for split-brains ! Protecting against split-brains: Adv. Subject – not many solutions at scale (Proxies and well configured semi-synchronous replication might help) Fixing your data in case of a split-brain: only you can know how to do this ! (Avoiding auto-increment by using UUID might help)
  • 31. 31 Failover War Story [1 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Master failover does not always go as planned We will look at a War Story from Booking.com • Autopsy of a [MySQL] automation disaster (FOSDEM 2017) https://www.slideshare.net/JeanFranoisGagn/autopsy-of-an-automation-disaster GitHub also has a recent incident with a public Post-Mortem: • October 21 post-incident analysis https://blog.github.com/2018-10-30-oct21-post-incident-analysis/ (Please share your war stories so we can learn from each-others’ experience)
  • 32. 32 Failover War Story [2 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) DNS (master) +---+ points here --> | A | +---+ | +---------------------+ | | Some reads +---+ +---+ happen here --> | B | | X | +---+ +---+ | +---+ And other reads | Y | <-- happen here +---+
  • 33. 33 Failover War Story [3 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) DNS (master) +-/+ points here --> | A | but they are +/-+ now failing Some reads +-/+ +---+ happen here --> | B | | X | but they are +/-+ +---+ now failing | +---+ Some reads | Y | <-- happen here +---+
  • 34. 34 Failover War Story [4 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) +-/+ | A | +/-+ +-/+ +---+ DNS (master) | B | | X | <-- now points here +/-+ +---+ | +---+ Reads | Y | <-- happen here +---+
  • 35. Failover War Story [5 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) After some time... +---+ | A | +---+ DNS (master) +---+ +---+ points here --> | B | | X | +---+ +---+ | +---+ And reads | Y | <-- happen here +---+
  • 36. 36 Failover War Story [6 of 6] (MySQL Scalability and Reliability for Replication – do.bcn 2019) This is not good: • When A and B failed, X was promoted as the new master • Something made DNS point to B à writes are now happening there • But B is outdated: all writes to X (after the failure of A) did not reach B • So we have data on X that cannot be read on B • And we have new data on B that is not on Y Fixing the data would not have been easy: • Same auto-increments were allocated on X and then B Ø UUID would have been better (increasing for INSERTing in PK order) http://mysql.rjweb.org/doc.php/uuid https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/ +---+ | A | +---+ +---+ +---+ Writes --> | B | | X | +---+ +---+ | +---+ | Y | <-- Reads +---+
  • 37. Failover of writes is not easy !
  • 38. 38 Do not do circular replication [1 of 3] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Some people are deploying ring/circular replication for HA of writes +-----------+ v | +---+ +---+ Writes ---> | P | | S | +---+ +---+ | ^ +-----------+ • If the primary fails, the application switches writes (and read) to the standby This is a bad idea in the general case ! (Friends do not let friends do ring/circular replication)
  • 39. Do not do circular replication [2 of 3] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Some people are deploying ring/circular replication for HA of writes But this is a bad idea in the general case ! • Looks attractive as nothing to do on the DB side • Implies S writable: easy to write there by mistake • Split-brain if apps write simultaneously on P and S • What is S is lagging when P fails ? • Out-of-order writes on S when P comes back up • Replication breakage on P when it gets writes from S Ø Maintenance nightmare: do not go there ! https://www.percona.com/blog/2014/10/07/mysql-ring-replication-why-it-is-a-bad-option/ (and many more horror stories in the DBA community) +-----------+ v | +---+ +---+ | P | | S | +---+ +---+ | ^ +-----------+
  • 40. The right way is failover to S and discard P à FD + failover + SD +---+ Writes --> | P | +---+ | +---+ | S | +---+
  • 41. The right way is failover to S and discard P 1) Failure Detection +-/+ | P | +/-+ +---+ | S | +---+
  • 42. The right way is failover to S and discard P 2) Failover + Service Discovery +-/+ | P | +/-+ +---+ Writes --> | S | +---+
  • 43. The right way is failover to S and discard P 3) Fencing +---+ | P | +---+ +---+ Writes --> | S | +---+ ß No replication here if P is back ! Hopefully no writes here (if no SD edge-case) (and/or fencing is done well)
  • 44. 44 Do not do circular replication [3 of 3] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Some people are deploying ring/circular replication for HA of writes: • If the primary fails, the application switches writes (and read) to the standby This is a bad idea in the general case ! • The use cases for ring replication are niche (and easy to slip out of by mistake) If you still want to go there (and think you can tackle the Dragon): • Embrace split-brain and write on both nodes all the time ! • Enable chaos monkey to see what happens with out-of-order writes • Be prepared to have replication breakages and be ready to fix them Galera/Percona XtraDB Cluster and Group Replication are the only true (and safe) master-master solutions
  • 45. Galera/XtraDB Cluster and Group Replication (MySQL Scalability and Reliability for Replication – do.bcn 2019) (IANAE in those technologies and I have little experience with them) Galera/Percona XtraDB Cluster and Group Replication: are the only true and safe master-master solutions • They do not provide a better failure detection (only agreement on FD) • They guarantee data consistency in a master/master setup which is interesting to work around the limitations of FD and Service Discovery • The drawback is failing conflicting transactions on commit (optimistic locking) or failing transaction when no quorum (and avoiding split-brain) • IMHO, these are consistency solutions (failing transactions does not look like HA to me) which is still very interesting in some respects (fencing + HA of writes by multi-master) • Does not (yet) work in a multi data-center deployment (still need standard replication) • This is clearly an Advance Replication Subject
  • 46. 46 Semi-Syncronous Replication (MySQL Scalability and Reliability for Replication – do.bcn 2019) When failing-over to a slave, committed transactions can be lost (Some transactions on the crashed master might not have reached slaves) àviolation of durability (ACID) in the replication topology Except if Lossless Semi-Sync is used (MySQL 5.7+) Semi-sync makes transactions durable by replicating them on commit: • The cost is higher commit latency • And the risk of blocking on commit (lower availability, but interesting fencing) Semi-sync can also be used to fence the master by refusing commit (And it is an Advanced Subject that I cannot describe in more details here) https://www.percona.com/community-blog/2018/08/23/question-about-semi-synchronous-replication- answer-with-all-the-details/
  • 47. 47 Pseudo GTIDs (MySQL Scalability and Reliability for Replication – do.bcn 2019) Pseudo-GTIDs: • A way to get GTID-like features without GTIDs • They work with any version of MySQL/MariaDB (even 5.5) • But they assume in-order-commit à does not work with MTS They also provide slave replication crash safety: • With log-slave-updates and sync_binlog = 1 https://github.com/github/orchestrator/blob/master/docs/pseudo-gtid.md
  • 48. 48 The JFG rules of replication (and failover) [1 of 4] (MySQL Scalability and Reliability for Replication – do.bcn 2019) A slave that is not a candidate for failover should not directly replicate from the master (unless you accept to maybe losing it when failing-over) Because it might be ahead of the others (and semi-sync does not help) In more details: • A slave without binlogs should not replicate from the master (unless…) • A slave with RBR binlogs when the master is SBR should not… • A slave in a VLAN (or data-center) not suitable for a master should not… • A slave with replication filters should not… • A mutli-source slave should not… • Probably more… So you need to replicate via Intermediate Master (Binlog Servers would make this obsolete)
  • 49. 49 The JFG rules of replication (and failover) [2 of 4] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Potential violations of the rules: +---+ | M | +---+ | +-------+-------+ | | | +---+ +---+ +---+ | S1| | S2| | S3| +---+ +---+ +---+ If S3 has binary logging disabled, Or if S3 is RBR and the others SBR, Or if S3 is in the wrong VLAN/data-center, Or <many other reasons>… This is ok: +---+ | M | +---+ | +-------+ | | +---+ +---+ +---+ | S1| | S2| --> | S3| +---+ +---+ +---+
  • 50. 50 The JFG rules of replication (and failover) [3 of 4] (MySQL Scalability and Reliability for Replication – do.bcn 2019) More violations of the rules: +---+ +---+ | M1| | M2| +---+ +---+ | | +-----+ +-----+ | | | | +---+ +--+------+ | +---+ | S1| | +----|-+-+ | S2| +---+ | | | | +---+ +---+ +---+ | X1| | X2| +---+ +---+ +---+ | M1|--- Writes #1 +---+ Writes #2 | | +-------+-------+----------+ | | | | | | +---+ +---+ +---+ +---+ | S1| | S2| | S3| | M2| +---+ +---+ +---+ +-+-+ | +------+------+ | | | +---+ +---+ +---+ | T1| | T2| | T3| +---+ +---+ +---+
  • 51. 51 The JFG rules of replication (and failover) [4 of 4] (MySQL Scalability and Reliability for Replication – do.bcn 2019) This is ok: +---+ +---+ | M1| | M2| +---+ +---+ | | | | | | +---+ /-+------+ +---+ | S1|->-/ | +----|-+--<--| S2| +---+ | | | | +---+ +---+ +---+ | X1| | X2| +---+ +---+ +---+ | M1|--- Writes #1 +---+ Writes #2 | | +-------+-------+ | | | | | +---+ +---+ +---+ +---+ | S1| | S2| | S3|--->---| M2| +---+ +---+ +---+ +-+-+ | +------+------+ | | | +---+ +---+ +---+ | T1| | T2| | T3| +---+ +---+ +---+
  • 52. 52 Thoughts on Intermediate Masters (MySQL Scalability and Reliability for Replication – do.bcn 2019) Intermediate Masters (and log-slave-updates) introduce many problems: • They slow-down replication • They increase the risk of propagating errant transactions • They are an expensive solution to artificial problems • And more… Binlog Servers are a better alternative for most (all?) Intermediate Masters use cases https://jfg-mysql.blogspot.com/2015/10/binlog-servers-for-backups-and-point-in-time-recovery.html (also includes links to other use cases like simple failover and scaling)
  • 53. 53 Conclusion [1 of 2] (MySQL Scalability and Reliability for Replication – do.bcn 2019) • Replication is a wide subject • I hope you are now equipped to dive deeper in the topics you need • FOREIGN KEYs at scale: DELETE/UPDATE CASCADE à big/long trx L • The downsides of AUTO_INCREMENTs: hard data reconciliation • But I did not talk much about • UNIQUE KEYs at scale • the limits of archiving
  • 54. 54 Conclusion [2 of 2] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Two types of sharding: • Vertical (Split): some schemas/tables above, some below • Horizontal: some data in shard #1, in #2, … #N (left to right) (Archiving is a type of horizontal sharding with N usually equal 2) • Splits and archiving divide a problem by 2 or 3 à short term win • UNIQUE and FOREIGN KEYs are hard to enforce on a sharded dataset
  • 55. Links [1 of 5] (MySQL Scalability and Reliability for Replication – do.bcn 2019) • Parts of this talk are inspired by some of Nicolai Plum’s talks: • Booking.com: Evolution of MySQL System Design (PLMCE 2015) https://www.percona.com/live/mysql-conference-2015/sessions/bookingcom-evolution-mysql-system-design • MySQL Scalability: Dimensions, Methods and Technologies (PLDPC 2016) https://www.percona.com/live/data-performance-conference-2016/sessions/mysql-scalability-dimensions-methods-and-technologies • Datastore axes: Choosing the scalability direction you need (PLAMS 2016) https://www.percona.com/live/plam16/sessions/datastore-axes-choosing-scalability-direction-you-need • Combined with material from some of previous talks: • Demystifying MySQL Replication Crash Safety (PLEU 2018 and HL18Moscow) https://www.slideshare.net/JeanFranoisGagn/demystifying-mysql-replication-crash-safety • The Full MySQL Parallel Replication Tutorial (PLSC 2018 and many others) https://www.slideshare.net/JeanFranoisGagn/the-full-mysql-and-mariadb-parallel-replication-tutorial • How B.com avoids and deals with replication lag (MariaDB Dev meeting 2017 and more) https://www.slideshare.net/JeanFranoisGagn/how-bookingcom-avoids-and-deals-with-replication-lag-74772032 • Autopsy of a [MySQL] automation disaster (FOSDEM 2017) https://www.slideshare.net/JeanFranoisGagn/autopsy-of-an-automation-disaster 55
  • 56. Links [2 of 5] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Some posts about advanced replication topics: • The five things to consider when implementing master high-availability https://jfg-mysql.blogspot.com/2019/02/mysql-master-high-availability-and-failover-more-thoughts.html • Binlog Servers for Simplifying Point in Time Recovery (also includes links to other use cases like simple failover and scaling) https://jfg-mysql.blogspot.com/2015/10/binlog-servers-for-backups-and-point-in-time-recovery.html • On the consequences of sync_binlog != 1 (part #1) https://jfg-mysql.blogspot.com/2018/10/consequences-sync-binlog-neq-1-part-1.html • Question about Semi-Synchronous Replication: the Answer with All the Details https://www.percona.com/community-blog/2018/08/23/question-about-semi-synchronous-replication-answer-with-all-the-details/ • The danger of no Primary Key when replicating in RBR https://jfg-mysql.blogspot.com/2017/08/danger-no-pk-with-RBR-and-mariadb-protection.html • Better Parallel Replication for MySQL https://medium.com/booking-com-infrastructure/better-parallel-replication-for-mysql-14e2d7857813
  • 57. 57 Links [3 of 5] (MySQL Scalability and Reliability for Replication – do.bcn 2019) More posts about an advanced replication topic (parallel replication): • A Metric for Tuning Parallel Replication in MySQL 5.7 https://jfg-mysql.blogspot.com/2017/02/metric-for-tuning-parallel-replication-mysql-5-7.html • Evaluating MySQL Parallel Replication Part 4: More Benchmarks in Production https://medium.com/booking-com-infrastructure/evaluating-mysql-parallel-replication-part-4-more-benchmarks-in-production-49ee255043ab (links to other parts of the series in the post) • An update on Write Set (parallel replication) bug fix in MySQL 8.0 https://jfg-mysql.blogspot.com/2018/01/an-update-on-write-set-parallel-replication-bug-fix-in-mysql-8-0.html Post on related subjects: • Why I wrote "do not ignore warnings" and "always investigate/fix warnings” https://jfg-mysql.blogspot.com/2017/01/do-not-ignore-warnings-in-mysql-mariadb.html
  • 58. 58 Links [4 of 5] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Other interesting posts: • MySQL Ripple: The First Impression of a MySQL Binlog Server https://www.percona.com/blog/2019/03/15/mysql-ripple-first-impression-of-mysql-binlog-server/ • Why MySQL’s binlog-do-db option is dangerous https://www.percona.com/blog/2009/05/14/why-mysqls-binlog-do-db-option-is-dangerous/ • MySQL ring replication: Why it is a bad option https://www.percona.com/blog/2014/10/07/mysql-ring-replication-why-it-is-a-bad-option/ • Orchestrator and Pseudo-GTID: https://github.com/github/orchestrator/blob/master/docs/pseudo-gtid.md • ProxySQL GTID consistent reads: https://proxysql.com/blog/proxysql-gtid-causal-reads • MySQL High Availability at GitHub https://githubengineering.com/mysql-high-availability-at-github/ • GitHub October 21 post-incident analysis https://blog.github.com/2018-10-30-oct21-post-incident-analysis/
  • 59. 59 Links [5 of 5] (MySQL Scalability and Reliability for Replication – do.bcn 2019) Other interesting tools, etc…: • Percona Toolkit: https://www.percona.com/doc/percona-toolkit/LATEST/index.html • pt-heartbeat and pt-online-schema-change • pt-slave-find, pt-slave-restart, pt-table-checksum and pt-table-sync • pt-slave-delay (not needed since 5.6 and 10.2) • GitHub Online Schema Migration/gh-ost: https://github.com/github/gh-ost • Monotonically increasing UUID http://mysql.rjweb.org/doc.php/uuid https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/ • Query Optimization: The Basics https://dbahire.com/pleu18/ • About sharding: • Vitess: https://vitess.io/ • Spider: https://mariadb.com/kb/en/library/spider-storage-engine-overview/ • MySQL Cluster: https://www.mysql.com/products/cluster/
  • 60. Thanks ! by Jean-François Gagné - Presented at dataops.barcelona (June 2019) Senior Infrastructure Engineer / System and MySQL Expert jeanfrancois AT messagebird DOT com / @jfg956