• One of the oldest and largest Ruby on Rails monoliths
• 1000+ developers
• 1000 Pull Requests per day
• 170K peak RPS
• 2 billion background jobs processed per day
Background Jobs at Shopify
Architecture Overview
Multi-Tenancy, Flash Sales
Scalability Problems & Solutions
Performance Bottlenecks, and Horizontal Scalability
• Asynchronous Units of Work
• Email
• Webhooks
• Checkout and payment processing
• Backfills, maintenance tasks
• Schema Migrations
• Our own library Hedwig
• Ruby
• Similar to Resque, but better fitted to our architecture
• Queues as Redis Lists
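As a rough illustration of "queues as Redis lists", the sketch below models a queue as a list where producers push serialized jobs onto one end and workers pop from the other. It uses an in-memory `deque` as a stand-in for a Redis list (roughly `RPUSH` to enqueue and `LPOP` to dequeue). The class name `ListQueue`, the JSON payload shape, and the job names are assumptions for illustration, not Hedwig's actual API.

```python
import json
from collections import deque


class ListQueue:
    """In-memory stand-in for a Redis list used as a FIFO job queue."""

    def __init__(self):
        self._items = deque()

    def enqueue(self, job_class, args):
        # Comparable to RPUSH: append the serialized job to the tail.
        self._items.append(json.dumps({"class": job_class, "args": args}))

    def dequeue(self):
        # Comparable to LPOP: take the oldest job from the head, or None if empty.
        if not self._items:
            return None
        return json.loads(self._items.popleft())

    def __len__(self):
        return len(self._items)


# Hypothetical job names, used only to show FIFO ordering.
queue = ListQueue()
queue.enqueue("SendEmailJob", {"shop_id": 1})
queue.enqueue("WebhookJob", {"shop_id": 2})
first = queue.dequeue()
```

Because every enqueue and dequeue for a queue touches the same list, all of that traffic lands on whichever single instance holds it, which is the pressure point the rest of the talk addresses.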
Background Jobs at Shopify
Problem: Error Queue
Solution: Kafka Streaming
Problem: Too Many Connections
Solution: Proxy
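To see why a proxy helps, it is worth counting connections. The speaker notes describe every job worker connecting to every Redis instance; with a proxy tier, workers connect to the proxies, and each proxy keeps a small pool of upstream connections. The sketch below is a simplified model of that arithmetic. The worker count, proxy count, and pool size are made-up assumptions, not Shopify's numbers.

```python
def direct_connections_per_redis(workers: int) -> int:
    # Every worker holds one connection to every Redis instance,
    # so each instance sees one connection per worker.
    return workers


def proxied_connections_per_redis(proxies: int, pool_size: int) -> int:
    # Each proxy keeps a fixed-size pool of upstream connections per instance.
    return proxies * pool_size


workers = 5000   # hypothetical fleet size
proxies = 20     # hypothetical number of proxy processes
pool_size = 10   # hypothetical upstream connections per proxy per instance

direct = direct_connections_per_redis(workers)
proxied = proxied_connections_per_redis(proxies, pool_size)
print(f"direct: {direct}, via proxy: {proxied}")
```

In this model the per-instance connection count depends on the size of the proxy tier rather than on the size of the worker fleet, so adding workers no longer adds connection-handling work on Redis.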
Problem: Locks
Solution: Dedicated Instances for Locking
http://gustavocaso.github.io/2019/04/30/migrating-millions-of-redis-keys-without-downtime/
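The uniqueness locks behind these slides follow an acquire, run, release pattern. The sketch below is a minimal in-memory model of that pattern, assuming a lock store with set-if-absent semantics (comparable to Redis `SET key value NX`). `LockStore`, `run_uniquely`, and the key format are hypothetical names. A real deployment would also want an expiry on each lock so that a crashed worker cannot hold it forever.

```python
import uuid


class LockStore:
    """In-memory stand-in for a Redis instance dedicated to locks."""

    def __init__(self):
        self._locks = {}

    def acquire(self, key, token):
        # Set-if-absent: succeeds only when nobody holds the lock.
        if key in self._locks:
            return False
        self._locks[key] = token
        return True

    def release(self, key, token):
        # Only the holder (matching token) may release the lock.
        if self._locks.get(key) == token:
            del self._locks[key]
            return True
        return False


def run_uniquely(store, job_name, work):
    """Run work() only if no other instance of job_name holds the lock."""
    key = f"lock:{job_name}"
    token = uuid.uuid4().hex
    if not store.acquire(key, token):
        return None  # another instance is already running, so skip
    try:
        return work()
    finally:
        store.release(key, token)


store = LockStore()
result = run_uniquely(store, "InventorySyncJob", lambda: "done")
```

Each unique job costs at least two extra round trips to the lock store, which is why moving these operations to a dedicated instance relieves the instance that holds the queues.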
Problem: Queueing Throughput
Solution 1: Dedicated Instances per Queue
Solution 2: Horizontally Scalable Queues
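One way to picture a horizontally scalable queue is a single logical queue spread over a fixed number of partitions, each living on its own Redis instance. The sketch below models that with in-memory deques: producers pick a partition round-robin, and workers sweep the partitions when dequeuing. `PartitionedQueue` and the round-robin policy are assumptions for illustration; the talk only says this design is being explored. Note that ordering across partitions is not guaranteed in general.

```python
from collections import deque
from itertools import count


class PartitionedQueue:
    """One logical queue spread over several list-backed partitions."""

    def __init__(self, partitions):
        self._partitions = [deque() for _ in range(partitions)]
        self._enqueue_counter = count()
        self._next_read = 0

    def enqueue(self, job):
        # Round-robin writes keep the load roughly equal across partitions.
        index = next(self._enqueue_counter) % len(self._partitions)
        self._partitions[index].append(job)
        return index

    def dequeue(self):
        # Start at the next partition in turn and skip empty ones.
        n = len(self._partitions)
        for offset in range(n):
            index = (self._next_read + offset) % n
            if self._partitions[index]:
                self._next_read = (index + 1) % n
                return self._partitions[index].popleft()
        return None

    def sizes(self):
        return [len(p) for p in self._partitions]


queue = PartitionedQueue(partitions=3)
placements = [queue.enqueue(f"job-{i}") for i in range(6)]
sizes_after_enqueue = queue.sizes()
drained = [queue.dequeue() for _ in range(6)]
```

The partition choice here lives in the client, which corresponds to the "partition-aware workers" option in the speaker notes; the alternative mentioned there is to let the proxy make that choice.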
• Single tenants pushing Redis limits
• Case-by-case solutions:
• Error reporting: Kafka
• Connections: Proxy
• Locking: Dedicated Redis Instance
• Queues: exploration phase
• Dedicated instances
• Horizontal scaling
Summary
Thank you!

How Shopify Is Scaling Up Its Redis Message Queues

Editor's notes

  1. Hello everyone, my name is Moe and I work at Shopify. I’m excited to speak for the second time at Redis Day and share with you some of the challenges and insights we have at Shopify.
  2. Shopify is the leading omni-channel commerce platform. Merchants use Shopify to design, set up, and manage their stores across multiple sales channels, including mobile, web, social media, marketplaces, brick-and-mortar locations, and pop-up shops. We allow anyone to sell anywhere.
  3. On the technical side of things: we are one of the oldest and largest Ruby on Rails monoliths; we have over a thousand developers; we merge over a thousand pull requests every day; we process 80 thousand requests per second at peak times; and we process 2 billion background jobs per day, which we will dive into in my talk today.
  4. The main focus of my talk today is going to be how we use Redis as our background job queue, the scalability challenges that come with that, and how we solve some of those challenges.
  5. Background jobs are a common pattern in web development. They let us encapsulate a process or unit of work that can be executed asynchronously from web requests. At Shopify, we use them quite a lot: for sending emails, processing webhooks, deferring the processing of payments and checkouts to speed up requests, and for maintenance tasks and backfills. We even use them as the backing engine for running our database schema migrations. This logic is all encapsulated inside our own library called Hedwig. It started out as a fork of the Resque library with some patches, but we diverged enough that it made more sense to own our own library, for simplicity and performance. From the Redis perspective, we persist queues as lists, which represent our main use case, but we also use many other Redis data types for various metadata, such as worker heartbeats and uniqueness locks.
  6. Flash sales are a big thing on Shopify. These are sales, scheduled or not, in which a very limited quantity of highly demanded items goes on sale. These sales drive huge amounts of traffic to the platform. But rather than steer our merchants away from them, we fully embraced them as a feature and as a way of continually building resiliency and scalability into our platform.
  7. These sales can drive orders of magnitude more traffic to a single merchant, which our platform needs to respond positively to. One key feature of this traffic is that it’s very write-heavy. A lot of book-keeping operations need to happen during a checkout such as updating inventory, persisting checkout and user information.
  8. Historically, Shopify started out as a simple, small Ruby on Rails monolith: a single Redis instance supported the background job queue, a single MySQL instance was the main persistent store, and workers processed web requests and jobs. With growing traffic, we started running into some scalability concerns.
  9. For a while, it was possible to scale up these operations by simply getting a bigger database with more CPU power, but eventually this started posing resiliency concerns as well, because a single database means a single point of failure. So we had to look into horizontal scaling, and the first candidate was MySQL. We partitioned our MySQL instances into shards. These shards have the exact same schema, but each contains a different subset of merchants and their data. A single shop belongs to a single shard, and a shard contains multiple shops.
  10. Scaling up workers is usually not as big of a problem, and we are able to do that by provisioning more nodes. So at this point in time, about 3 years ago, we had a tested mechanism for scaling up both our compute power and our MySQL cluster through sharding. However, we still had no way of scaling Redis, and this started to cause issues.
  11. So we decided to piggyback on this concept and apply the same partitioning of shops to the Redis instances. A single MySQL and Redis partition is what we call a Shopify Pod. These pods ensure that each subset of merchants has its own isolated, dedicated MySQL and Redis instances. Having a single Redis instance dedicated to a smaller set of shops means that we can get much more throughput out of it.
  12. Workers are not podded, which allows for capacity sharing and elasticity across multiple pods.
  13. However, in the past year, we’ve started to hit the limit of this scaling strategy. We are now at the point where a single merchant can drive enough traffic to hit the limit of a single Redis instance. This is mainly due to the large number of enqueueing and dequeueing operations that happen on a single queue in a given Shopify Pod. However, this is also due to some inefficient usage patterns that we found in our codebase.
  14. We sometimes see some other symptoms such as latency spikes that lead to cascading failures and a degraded state of the platform.
  15. To help us recover from these states, we have circuit breakers in place that allow us to fail fast and give the Redis instance a chance to recover. A circuit breaker is a software encapsulation of a given resource (like a Redis client), which keeps track of failure metrics and blocks access to that given resource if it fails more than a given threshold. This allows highly critical resources to dynamically get disconnected from sources of load when under pressure. This will hopefully allow that resource to recover faster. This is potentially costly. Although the access patterns are programmed with fallbacks, having open circuits can occasionally lead to inconsistent state and cause our merchants and our developers a lot of trouble to fix.
  16. When a Ruby exception occurs in Shopify, we need to generate a payload with some metadata and send it to Bugsnag, a service we use to aggregate metrics on exceptions. This requires a call to the Bugsnag API, which we use a background job to execute. This was a fine use case for background jobs on top of Redis, but in cases of massive flash sales that caused spikes in exceptions, this meant that the Redis instance, which was already drowning in exceptions, had even less capacity to perform enqueueing and dequeueing operations for critical application operations. An important trait of this background job is that it has no dependency on Rails or Ruby itself. It’s simply an abstraction around an HTTP call.
  17. For this reason, we deemed that a message streaming bus was better suited for this use case, especially since we already have operational expertise with Kafka and a dedicated team maintaining it. So we built a simple Kafka consumer in Go and made our web and job workers produce payloads to a Kafka topic instead. The consumer then took care of relaying those messages to Bugsnag. By doing this, we freed up around 25% of CPU capacity on Redis during peak loads for job enqueueing and dequeueing.
  18. The next problematic pattern was how we handled capacity sharing. In a given cluster, all job workers connect to all Redis instances and process jobs from each one in a round-robin fashion. This is great for sharing load across multiple workers, but it means that each Redis instance needs to maintain connections to thousands of workers in a cluster. We noticed that an estimated 20% of Redis CPU time was spent on connection handling.
  19. Our first attempt at mitigating this was by isolating subsets of Redis instances with smaller dedicated worker pools, which allowed for less capacity sharing between pods, but also reduced the number of connections per Redis instance drastically.
  20. However, a truly future-proof solution for this was to use a proxy. We are currently in the process of deploying Envoy as a proxy. This comes with the many benefits of proxies: it solves the problem of having too many connections, since the proxy can maintain a connection pool for a large deployment of workers and reduce overhead on the upstream Redis servers (it also keeps connections alive); it allows us to distribute load across multiple Redis instances; and it allows for high availability by dynamically routing commands to a “master” and failing over to replicas.
  21. Some background jobs written by our developers must run uniquely. That means that we don’t want certain jobs to have multiple instances running at the same time. For this, we use Redis to store uniqueness locks. A job worker acquires the lock for a job, processes the job, and then releases the lock. A key thing we found out was that many background jobs generated during flash sales used this pattern. Because of that, these locking operations have a significant CPU overhead during flash sales.
  22. We decided to dedicate entirely separate instances for these locking operations. We were able to come up with a zero-downtime migration scheme that allowed us to safely transition our locking operations to a separate Redis instance. This means that we free up the Redis instance persisting queues from locking operations entirely.
  23. Ultimately, no matter how well we optimize our CPU usage of Redis, we also anticipate that the flash sales on our platforms are only going to get bigger. This means that performing all enqueuing and dequeueing operations on a single Redis is eventually going to overwhelm that instance entirely. For that reason, we are exploring ways of distributing the job queues themselves across multiple Redis instances.
  24. The first and easiest way is to assign each job queue a separate instance. The downside of this approach is that, depending on the number of queues (currently a dozen), we might have a huge operational overhead, since we’ll have to manage an order of magnitude more Redis instances.
  25. Another approach is to horizontally distribute every single queue across a fixed number of Redis instances per pod. This is nice because we can achieve equal load between all Redis instances. Selecting an instance can either be done at the worker-level by making Ruby workers partition-aware, or we can leverage our Envoy proxy to distribute commands across our partitions.
  26. In conclusion, we think that scaling your Redis infrastructure is largely about knowing your usage patterns well. In our case, the driving factor has been single-tenant traffic during flash sales. This has forced us to look deeper into our Redis use cases and evaluate the way forward in each case. In the case of asynchronous error reporting through HTTP, we leveraged Kafka. To deal with an overload of connections from a growing number of workers, we decided to employ a proxy. For locking, we provisioned more Redis instances for that specific use case. Finally, we are in the early stages of horizontally scaling the queueing operations themselves across multiple Redis instances.
  27. If you have any questions or would like to chat, I’ll be around for the rest of the day and would be happy to talk! Thanks!
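
The circuit breaker described in note 15 can be sketched as a small wrapper around a resource call. This is a minimal model, assuming a consecutive-failure threshold and a fixed cool-down before a trial call is allowed. The class name, its parameters, and `CircuitOpenError` are hypothetical and do not reflect the library Shopify uses.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls fail fast."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                # Fail fast so the struggling resource gets room to recover.
                raise CircuitOpenError("circuit open")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def flaky():
    # Stand-in for a Redis call that times out under load.
    raise ConnectionError("redis timeout")


now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, cooldown_seconds=5.0, clock=lambda: now[0])
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("fast-fail")
    except ConnectionError:
        outcomes.append("error")
```

Callers still need a fallback for the fast-fail case, which is where the inconsistent state mentioned in note 15 can creep in.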