SlideShare une entreprise Scribd logo
1  sur  16
Télécharger pour lire hors ligne
Group Replication in Uber
Bumpy road to MultiMasterness
Giedrius Jaraminas (giedrius@uber.com)
January 2020
Giedrius Jaraminas
● Previously:
Oracle First Line Support
Oracle Certified Master
Oracle University lector
● Currently:
Engineering Manager of MySQL team in Uber
● There are more than 75 million active Uber
riders across the world.
● Uber is available in more than 80 countries
worldwide.
● Uber has completed more than 5 billion rides.
● Over 3 million people drive for Uber.
● In the United States, Uber fulfills 40 million
rides per month.
Uber Technologies, Inc.
● The main offer as a relational database
● Critical for Infrastructure
○ The first data store to be emerged in a new
Zone
● The platform for home-built offers
MySQL in Uber
Why Group Replication
What is troubling in classic MySQL?
● Primary node - SPOF
● Even planned promotion requires some write
outage
○ Were able to lower this <1 minute
■ Customers still not happy about it
○ Expected 99.99% availability gives only ~53
downtime minutes per year
○ Promotions are regular
● Cross-region latency
● Strong consistency across regions
Datastore Discovery
● Routing vs Smart client (SmartProxy). Holy war
○ Routing -> HAproxy on ~80k hosts
○ SC -> dependency on App rebuilds
● Current winner is Routing
● New offer gave chance for SC
● So, we implemented one
○ Customer said NO
● Needed to extend the existing routing
For customers two changes at the time might be too much
Latency
● In Single Primary two regions may result in a
cross-region latency
○ The favorite topic for Customers to complain
● GR in WAN is not welcomed
● But actually - GR performed better:
○ Than Galera (expected)
○ Than remote-region connection (surprise)
● Performed slower:
○ Than local MySQL for writes. Nothing
personal - just physics.
Apparently - steady latency is more welcomed by customers
than lower, but unpredictable one
The trap of 2 regions
● You can have equal number of nodes in both
regions
○ Loose High Availability
● Or you can have quorum within one region
○ Latency depending on the region
○ Serious problems in case the “favorite” region
goes down
Luckily, the regions have availability zones, so at least
the disappearance of a whole region is less probable
Vs
We are using 3:2 configuration as default, and one of the 3 is
in cloud-zone
Scalability
● Currently GR supports up to 9 members in group
● For a comparison - the biggest regular MySQL
cluster in our fleet has ~150 replicas
● There is a possibility to create a ‘Snowflake’
○ If you are willing to have a heterogenous
architecture
○ (to be honest 150 replicas are also in a
‘Snowflake’)
○ Might be a problem if you have >9 regions
and need to be writable everywhere
Actually, big clusters might not be required for every use-case
Migration is a charm
● Heterogenous topology is supported
○ Even 5.7<->8.0
● Prepared environment can wait
● ~5 minutes write outage during switchover
● Other than that - transparent to service
● Hooray - 1st datastore in production use!
Sugar-free soda
Well-deserved rest
… both ways … luckily
● After 8 hours of using GR errors threshold alert
triggered
● Service was receiving an error almost on every
update
● Migrated back to previous (classic) MySQL
● Apparently tests on staging were false-success
Post-mortems
Testing plan rewrite
Version upgrade
Concurrent DML’s
● Foreign keys are no-no (from 64% to 25% of
errors during concurrent inserts)
● Concurrent updates (even on not-clashing rows)
may fail (up to 40%)
○ “read uncommitted” does not help
○ this is not intuitive
● Big (~ 2M row, NOT concurrent) deletes may fail
● Operations may fail during commit
● 8.0.17, stress-testing, 5 nodes cluster. In
real-world environment the numbers may differ
The App must be able to repeat any failed transaction
3101 1180 1213
DDL + DML
● DDL in one node while DML in other (on the same
object) - “not supported”
● Cluster is destroyed
● To perform DDL you need:
○ Either block DML’s (write outage)
○ Either temporary switch to single-primary
● Luckily - we do not allow users to execute DDL’s
directly (there is a goal-state based process for
that). So we were able to integrate the switching to
Single Primary mode (still GR) in the workflow
To be honest - this was kind-a mentioned in documentation
Epilogue
● In a week the first service was re-onboarded
● There are several production services using Group
Replication for >6 months
● No major incidents
○ Was able to withstand partial network outage
● It is a niche product that made a nice addition to
our portfolio.
● If you can’t allow any downtime but can retry
transactions - the Group Replication is here for
you (especially if you have at least three Zones)
The key to success is the management of expectations
Q/A
Thank You!

Contenu connexe

Tendances

Sizing Your Scylla Cluster
Sizing Your Scylla ClusterSizing Your Scylla Cluster
Sizing Your Scylla ClusterScyllaDB
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScyllaDB
 
How Incremental Compaction Reduces Your Storage Footprint
How Incremental Compaction Reduces Your Storage FootprintHow Incremental Compaction Reduces Your Storage Footprint
How Incremental Compaction Reduces Your Storage FootprintScyllaDB
 
ScyllaDB's Avi Kivity on UDF, UDA, and the Future
ScyllaDB's Avi Kivity on UDF, UDA, and the FutureScyllaDB's Avi Kivity on UDF, UDA, and the Future
ScyllaDB's Avi Kivity on UDF, UDA, and the FutureScyllaDB
 
Looking towards an official cassandra sidecar netflix
Looking towards an official cassandra sidecar   netflixLooking towards an official cassandra sidecar   netflix
Looking towards an official cassandra sidecar netflixVinay Kumar Chella
 
Pass Elk: CAP Theorem since 90s and Beyond
Pass Elk: CAP Theorem since 90s and BeyondPass Elk: CAP Theorem since 90s and Beyond
Pass Elk: CAP Theorem since 90s and BeyondRaghavendra Prabhu
 
No sql db comparison
No sql db comparisonNo sql db comparison
No sql db comparisonPurnendu Rath
 
Clustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And AvailabilityClustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And AvailabilityConSanFrancisco123
 
MySQL 高可用性
MySQL 高可用性MySQL 高可用性
MySQL 高可用性YUCHENG HU
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Distributed systems and consistency
Distributed systems and consistencyDistributed systems and consistency
Distributed systems and consistencyseldo
 
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with ScyllaiFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with ScyllaScyllaDB
 
Low latency stream processing with jet
Low latency stream processing with jetLow latency stream processing with jet
Low latency stream processing with jetStreamNative
 
Scylla Summit 2016: Graph Processing with Titan and Scylla
Scylla Summit 2016: Graph Processing with Titan and ScyllaScylla Summit 2016: Graph Processing with Titan and Scylla
Scylla Summit 2016: Graph Processing with Titan and ScyllaScyllaDB
 
Ndb cluster 80_ycsb_mem
Ndb cluster 80_ycsb_memNdb cluster 80_ycsb_mem
Ndb cluster 80_ycsb_memmikaelronstrom
 
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...ScyllaDB
 
Database Shrading and cassandra architecture
Database Shrading and cassandra architectureDatabase Shrading and cassandra architecture
Database Shrading and cassandra architectureSoupik Chowdhury
 
How Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintHow Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintScyllaDB
 

Tendances (20)

Sizing Your Scylla Cluster
Sizing Your Scylla ClusterSizing Your Scylla Cluster
Sizing Your Scylla Cluster
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
 
How Incremental Compaction Reduces Your Storage Footprint
How Incremental Compaction Reduces Your Storage FootprintHow Incremental Compaction Reduces Your Storage Footprint
How Incremental Compaction Reduces Your Storage Footprint
 
ScyllaDB's Avi Kivity on UDF, UDA, and the Future
ScyllaDB's Avi Kivity on UDF, UDA, and the FutureScyllaDB's Avi Kivity on UDF, UDA, and the Future
ScyllaDB's Avi Kivity on UDF, UDA, and the Future
 
Looking towards an official cassandra sidecar netflix
Looking towards an official cassandra sidecar   netflixLooking towards an official cassandra sidecar   netflix
Looking towards an official cassandra sidecar netflix
 
Pass Elk: CAP Theorem since 90s and Beyond
Pass Elk: CAP Theorem since 90s and BeyondPass Elk: CAP Theorem since 90s and Beyond
Pass Elk: CAP Theorem since 90s and Beyond
 
No sql db comparison
No sql db comparisonNo sql db comparison
No sql db comparison
 
Clustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And AvailabilityClustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And Availability
 
Scalable IoT platform
Scalable IoT platformScalable IoT platform
Scalable IoT platform
 
MySQL 高可用性
MySQL 高可用性MySQL 高可用性
MySQL 高可用性
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Distributed systems and consistency
Distributed systems and consistencyDistributed systems and consistency
Distributed systems and consistency
 
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with ScyllaiFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
 
Geode - Day 1
Geode - Day 1Geode - Day 1
Geode - Day 1
 
Low latency stream processing with jet
Low latency stream processing with jetLow latency stream processing with jet
Low latency stream processing with jet
 
Scylla Summit 2016: Graph Processing with Titan and Scylla
Scylla Summit 2016: Graph Processing with Titan and ScyllaScylla Summit 2016: Graph Processing with Titan and Scylla
Scylla Summit 2016: Graph Processing with Titan and Scylla
 
Ndb cluster 80_ycsb_mem
Ndb cluster 80_ycsb_memNdb cluster 80_ycsb_mem
Ndb cluster 80_ycsb_mem
 
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
 
Database Shrading and cassandra architecture
Database Shrading and cassandra architectureDatabase Shrading and cassandra architecture
Database Shrading and cassandra architecture
 
How Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintHow Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter Footprint
 

Similaire à Pre fosdem2020 uber

MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
How Optimizely (Safely) Maximizes Database Concurrency.pdf
How Optimizely (Safely) Maximizes Database Concurrency.pdfHow Optimizely (Safely) Maximizes Database Concurrency.pdf
How Optimizely (Safely) Maximizes Database Concurrency.pdfScyllaDB
 
High-availability with Galera Cluster for MySQL
High-availability with Galera Cluster for MySQLHigh-availability with Galera Cluster for MySQL
High-availability with Galera Cluster for MySQLFromDual GmbH
 
Retaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingRetaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingScyllaDB
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017Severalnines
 
Experiences testing dev versions of MySQL and why it is good for you
Experiences testing dev versions of MySQL and why it is good for youExperiences testing dev versions of MySQL and why it is good for you
Experiences testing dev versions of MySQL and why it is good for youSimon J Mudd
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databasesjbellis
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedShubham Tagra
 
Google Cloud - Stand Out Features
Google Cloud - Stand Out FeaturesGoogle Cloud - Stand Out Features
Google Cloud - Stand Out FeaturesGDG Cloud Bengaluru
 
Buytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemakerBuytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemakerkuchinskaya
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokesGagan Bajpai
 
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDB
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDBWebinar slides: Migrating to Galera Cluster for MySQL and MariaDB
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDBSeveralnines
 
UKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningUKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningFromDual GmbH
 
M|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change MethodsM|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change MethodsMariaDB plc
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3LibbySchulze
 
5 levels of high availability from multi instance to hybrid cloud
5 levels of high availability  from multi instance to hybrid cloud5 levels of high availability  from multi instance to hybrid cloud
5 levels of high availability from multi instance to hybrid cloudRafał Leszko
 

Similaire à Pre fosdem2020 uber (20)

MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Running MySQL in AWS
Running MySQL in AWSRunning MySQL in AWS
Running MySQL in AWS
 
How Optimizely (Safely) Maximizes Database Concurrency.pdf
How Optimizely (Safely) Maximizes Database Concurrency.pdfHow Optimizely (Safely) Maximizes Database Concurrency.pdf
How Optimizely (Safely) Maximizes Database Concurrency.pdf
 
High-availability with Galera Cluster for MySQL
High-availability with Galera Cluster for MySQLHigh-availability with Galera Cluster for MySQL
High-availability with Galera Cluster for MySQL
 
Retaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingRetaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate Limiting
 
HA with Galera
HA with GaleraHA with Galera
HA with Galera
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017
 
Experiences testing dev versions of MySQL and why it is good for you
Experiences testing dev versions of MySQL and why it is good for youExperiences testing dev versions of MySQL and why it is good for you
Experiences testing dev versions of MySQL and why it is good for you
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speed
 
Google Cloud - Stand Out Features
Google Cloud - Stand Out FeaturesGoogle Cloud - Stand Out Features
Google Cloud - Stand Out Features
 
Buytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemakerBuytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemaker
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokes
 
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDB
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDBWebinar slides: Migrating to Galera Cluster for MySQL and MariaDB
Webinar slides: Migrating to Galera Cluster for MySQL and MariaDB
 
UKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningUKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL Tuning
 
M|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change MethodsM|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change Methods
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
 
5 levels of high availability from multi instance to hybrid cloud
5 levels of high availability  from multi instance to hybrid cloud5 levels of high availability  from multi instance to hybrid cloud
5 levels of high availability from multi instance to hybrid cloud
 

Dernier

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Dernier (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Pre fosdem2020 uber

  • 1. Group Replication in Uber Bumpy road to MultiMasterness Giedrius Jaraminas (giedrius@uber.com) January 2020
  • 2. Giedrius Jaraminas ● Previously: Oracle First Line Support Oracle Certified Master Oracle University lector ● Currently: Engineering Manager of MySQL team in Uber
  • 3. ● There are more than 75 million active Uber riders across the world. ● Uber is available in more than 80 countries worldwide. ● Uber has completed more than 5 billion rides. ● Over 3 million people drive for Uber. ● In the United States, Uber fulfills 40 million rides per month. Uber Technologies, Inc.
  • 4. ● The main offer as a relational database ● Critical for Infrastructure ○ The first data store to be emerged in a new Zone ● The platform for home-built offers MySQL in Uber
  • 5. Why Group Replication What is troubling in classic MySQL? ● Primary node - SPOF ● Even planned promotion requires some write outage ○ Were able to lower this <1 minute ■ Customers still not happy about it ○ Expected 99.99% availability gives only ~53 downtime minutes per year ○ Promotions are regular ● Cross-region latency ● Strong consistency across regions
  • 6. Datastore Discovery ● Routing vs Smart client (SmartProxy). Holy war ○ Routing -> HAproxy on ~80k hosts ○ SC -> dependency on App rebuilds ● Current winner is Routing ● New offer gave chance for SC ● So, we implemented one ○ Customer said NO ● Needed to extend the existing routing For customers two changes at the time might be too much
  • 7. Latency ● In Single Primary two regions may result in a cross-region latency ○ The favorite topic for Customers to complain ● GR in WAN is not welcomed ● But actually - GR performed better: ○ Than Galera (expected) ○ Than remote-region connection (surprise) ● Performed slower: ○ Than local MySQL for writes. Nothing personal - just physics. Apparently - steady latency is more welcomed by customers than lower, but unpredictable one
  • 8. The trap of 2 regions ● You can have equal number of nodes in both regions ○ Loose High Availability ● Or you can have quorum within one region ○ Latency depending on the region ○ Serious problems in case the “favorite” region goes down Luckily, the regions have availability zones, so at least the disappearance of a whole region is less probable Vs We are using 3:2 configuration as default, and one of the 3 is in cloud-zone
  • 9. Scalability ● Currently GR supports up to 9 members in group ● For a comparison - the biggest regular MySQL cluster in our fleet has ~150 replicas ● There is a possibility to create a ‘Snowflake’ ○ If you are willing to have a heterogenous architecture ○ (to be honest 150 replicas are also in a ‘Snowflake’) ○ Might be a problem if you have >9 regions and need to be writable everywhere Actually, big clusters might not be required for every use-case
  • 10. Migration is a charm ● Heterogenous topology is supported ○ Even 5.7<->8.0 ● Prepared environment can wait ● ~5 minutes write outage during switchover ● Other than that - transparent to service ● Hooray - 1st datastore in production use! Sugar-free soda Well-deserved rest
  • 11. … both ways … luckily ● After 8 hours of using GR errors threshold alert triggered ● Service was receiving an error almost on every update ● Migrated back to previous (classic) MySQL ● Apparently tests on staging were false-success Post-mortems Testing plan rewrite Version upgrade
  • 12. Concurrent DML’s ● Foreign keys are no-no (from 64% to 25% of errors during concurrent inserts) ● Concurrent updates (even on not-clashing rows) may fail (up to 40%) ○ “read uncommitted” does not help ○ this is not intuitive ● Big (~ 2M row, NOT concurrent) deletes may fail ● Operations may fail during commit ● 8.0.17, stress-testing, 5 nodes cluster. In real-world environment the numbers may differ The App must be able to repeat any failed transaction 3101 1180 1213
  • 13. DDL + DML ● DDL in one node while DML in other (on the same object) - “not supported” ● Cluster is destroyed ● To perform DDL you need: ○ Either block DML’s (write outage) ○ Either temporary switch to single-primary ● Luckily - we do not allow users to execute DDL’s directly (there is a goal-state based process for that). So we were able to integrate the switching to Single Primary mode (still GR) in the workflow To be honest - this was kind-a mentioned in documentation
  • 14. Epilogue ● In a week the first service was re-onboarded ● There are several production services using Group Replication for >6 months ● No major incidents ○ Was able to withstand partial network outage ● It is a niche product that made a nice addition to our portfolio. ● If you can’t allow any downtime but can retry transactions - the Group Replication is here for you (especially if you have at least three Zones) The key to success is the management of expectations
  • 15. Q/A