How Shopify Is Scaling Up Its Redis Message Queues
• One of the oldest and largest Ruby on Rails monoliths
• 1000+ developers
• 1000 Pull Requests per day
• 170K peak RPS
• 2 billion background jobs processed per day
Background Jobs at Shopify
Architecture Overview
Multi-Tenancy, Flash Sales
Scalability Problems & Solutions
Performance Bottlenecks and Horizontal Scalability
• Asynchronous Units of Work
• Email
• Webhooks
• Checkout and payment processing
• Backfills, maintenance tasks
• Schema Migrations
• Our own library Hedwig
• Ruby
• Similar to Resque, but better fitted to our architecture
• Queues as Redis Lists
Hello everyone, my name is Moe and I work at Shopify. I’m excited to speak for the second time at Redis Day and share with you some of the challenges and insights we have at Shopify.
Shopify is the leading omni-channel commerce platform. Merchants use Shopify to design, set up, and manage their stores across multiple sales channels, including mobile, web, social media, marketplaces, brick-and-mortar locations, and pop-up shops.
We allow anyone to sell anywhere.
On the technical side of things:
We are one of the oldest and largest Ruby on Rails monoliths
We have over a thousand developers
We merge over a thousand pull requests every day
We process 170 thousand requests per second at peak times
We process 2 billion background jobs per day, which we will dive into in my talk today
The main focus of my talk today is going to be how we use Redis as our background job queue, the scalability challenges that come with that, and how we solve some of those challenges.
Background jobs are a common pattern in web development. They let us encapsulate a process or unit of work and execute it asynchronously, outside the web request cycle.
At Shopify, we use them quite a lot, from sending emails, to processing webhooks, delaying the processing of payments and checkouts for speedup, as well as for maintenance and backfills. We even use them as the backing engine for running our database schema migrations.
This logic is all encapsulated inside our own library called Hedwig. It started out as a fork of the Resque library with some patches, but we started diverging enough that it made more sense to own our own library, for simplicity and performance.
From the Redis perspective, we persist queues as lists, which represent our main use case, but we also use many other Redis data types for various metadata, such as worker heartbeats and uniqueness locks.
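To make the queue-as-list idea concrete, here is a minimal sketch of the enqueue/dequeue flow. This is not Hedwig's actual code: `FakeQueueRedis` is a stand-in for a real Redis client so the example runs without a server (with the `redis` gem, the equivalent calls would be `lpush` and `rpop`/`brpop`), and the job payload shape is hypothetical.

```ruby
# Illustrative sketch of queue-as-Redis-list semantics (not Hedwig's
# real implementation). FakeQueueRedis stands in for a Redis client.
require "json"

class FakeQueueRedis
  def initialize
    @lists = Hash.new { |h, k| h[k] = [] }
  end

  # Mimics LPUSH: prepend a value to the list stored at `key`.
  def lpush(key, value)
    @lists[key].unshift(value)
  end

  # Mimics RPOP: remove and return the last element (oldest enqueued).
  def rpop(key)
    @lists[key].pop
  end
end

# Enqueue: serialize the job payload and LPUSH it onto the queue list.
def enqueue(redis, queue, job_class, args)
  redis.lpush("queue:#{queue}", JSON.dump("class" => job_class, "args" => args))
end

# Dequeue: RPOP the oldest payload off the list, giving FIFO order.
def dequeue(redis, queue)
  payload = redis.rpop("queue:#{queue}")
  payload && JSON.parse(payload)
end

redis = FakeQueueRedis.new
enqueue(redis, "default", "SendEmailJob", [42])
job = dequeue(redis, "default")
puts job["class"] # => SendEmailJob
```

In production, workers would use the blocking `BRPOP` rather than polling `RPOP`, so they sleep until a job arrives instead of spinning.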
Flash sales are a big thing on Shopify.
These are sales, scheduled or unannounced, in which a very limited quantity of highly sought-after items goes on sale. These sales drive huge amounts of traffic to the platform.
But rather than steer our merchants away from them, we fully embraced them as a feature and as a way of continually building resiliency and scalability into our platform.
These sales can drive orders of magnitude more traffic to a single merchant, which our platform needs to respond to gracefully. One key trait of this traffic is that it’s very write-heavy: a lot of bookkeeping operations need to happen during a checkout, such as updating inventory and persisting checkout and user information.
Historically, Shopify started out as a simple, small Ruby on Rails monolith, with:
a single Redis instance supporting the background job queue,
a single MySQL instance as the main persistent store, and
workers processing both web requests and jobs.
With growing traffic, we started running into some scalability concerns.
For a while, it was possible to scale up these operations by simply getting a bigger database with more CPU power, but eventually this started posing resiliency concerns as well because a single database means a single point of failure.
So we had to look into horizontal scaling. And the first candidate for horizontal scaling was MySQL. We partitioned our MySQL instances into shards.
These shards share the exact same schema but contain different subsets of merchants and their data. A single shop belongs to a single shard, and a shard contains multiple shops.
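As a purely illustrative sketch of the shop-to-shard property described above (Shopify's real routing is more sophisticated than this; `NUM_SHARDS` and `shard_for` are hypothetical names), the key invariant is that the mapping is stable: the same shop always resolves to the same shard.

```ruby
# Hypothetical shop-to-shard routing sketch. The invariant that matters:
# a shop maps to exactly one shard, and that mapping never changes
# between calls, while a shard holds many shops.
NUM_SHARDS = 8

def shard_for(shop_id)
  shop_id % NUM_SHARDS  # stable: same shop id, same shard, every time
end

puts shard_for(1001)               # => 1
puts shard_for(1001) == shard_for(1001)  # => true
```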
Scaling up workers is usually not as big of a problem, and we are able to do that by provisioning more nodes.
So at this point in time, about 3 years ago, we had a tested mechanism for scaling up both our compute power and our MySQL cluster through sharding. However, we still had no way of scaling Redis, and this started to cause issues.
So we decided to piggyback on this concept and apply the same partitioning of shops to the Redis instances. A single MySQL and Redis partition is what we call a Shopify Pod.
These pods ensure that each subset of merchants has its own isolated, dedicated MySQL and Redis instances. Having a single Redis instance dedicated to a smaller set of shops means we can get much more throughput out of it.
Workers are not podded to allow for capacity sharing and elasticity across multiple pods.
However, in the past year, we’ve been starting to hit the limit of this scaling strategy. We are now at the point where a single merchant can drive enough traffic to hit the limit of a single Redis instance. This is mainly due to the large amount of queuing and dequeueing operations that happen on a single queue in a given Shopify Pod.
However, this is also due to some inefficient usage patterns that we found in our codebase.
We sometimes see some other symptoms such as latency spikes that lead to cascading failures and a degraded state of the platform.
To help us recover from these states, we have circuit breakers in place that allow us to fail fast and give the Redis instance a chance to recover. A circuit breaker is a software encapsulation of a given resource (like a Redis client), which keeps track of failure metrics and blocks access to that given resource if it fails more than a given threshold. This allows highly critical resources to dynamically get disconnected from sources of load when under pressure. This will hopefully allow that resource to recover faster.
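The breaker described above can be sketched in a few lines of Ruby. This is a minimal illustration of the pattern, not Shopify's production implementation (Shopify has open-sourced a production-grade take on this idea as Semian); the class and parameter names here are hypothetical.

```ruby
# Minimal circuit-breaker sketch around a protected resource call.
class CircuitBreaker
  class OpenCircuitError < StandardError; end

  def initialize(error_threshold:, recovery_timeout:)
    @error_threshold  = error_threshold   # failures before the circuit opens
    @recovery_timeout = recovery_timeout  # seconds before we try again
    @failures  = 0
    @opened_at = nil
  end

  # Wraps a call to the resource (e.g. a Redis command).
  def call
    if open?
      # Fail fast instead of piling more load onto a struggling resource.
      raise OpenCircuitError if Time.now - @opened_at < @recovery_timeout
      @opened_at = nil  # half-open: let one trial request through
    end
    result = yield
    @failures = 0  # a success resets the failure count
    result
  rescue OpenCircuitError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @error_threshold
    raise
  end

  def open?
    !@opened_at.nil?
  end
end
```

Once the failure threshold is crossed, every call fails immediately with `OpenCircuitError` until the recovery timeout elapses, which is exactly the "disconnect sources of load under pressure" behavior described above.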
This is potentially costly. Although the access patterns are programmed with fallbacks, having open circuits can occasionally lead to inconsistent state and cause our merchants and our developers a lot of trouble to fix.
When a Ruby exception occurs in Shopify, we need to generate a payload with some metadata and send it to Bugsnag, a service we use to aggregate metrics on exceptions. This requires a call to the Bugsnag API, which we execute in a background job.
This was a fine use case for background jobs on top of Redis, but during massive flash sales that caused spikes in exceptions, it meant that the Redis instance, already drowning in exception-reporting jobs, had even less capacity to perform queueing and dequeueing for critical application operations.
An important trait of this background job is that it has no dependency on Rails or Ruby itself. It’s simply an abstraction around an HTTP call.
For this reason, we deemed that a message streaming bus was better suited for this use case. Especially since we already have operational expertise with Kafka with a dedicated team maintaining it.
So we built a simple Kafka consumer in Go and made our web and job workers produce payloads to a Kafka topic instead.
The consumer then took care of relaying those messages to Bugsnag. By doing this, we freed up around 25% CPU capacity during peak loads on Redis for job queueing and dequeuing.
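The producer side of this change can be sketched as follows. This is illustrative only: `FakeProducer` stands in for a real Kafka client (such as the rdkafka gem's producer), and the topic name and payload fields are hypothetical, not Shopify's actual schema.

```ruby
# Sketch of the reporting path after the change: instead of enqueuing a
# Redis-backed job, the exception payload is produced to a Kafka topic.
require "json"

class FakeProducer
  attr_reader :messages

  def initialize
    @messages = []
  end

  # Stand-in for a real Kafka produce call.
  def produce(topic:, payload:)
    @messages << [topic, payload]
  end
end

def report_exception(producer, error)
  payload = JSON.dump(
    "class"     => error.class.name,
    "message"   => error.message,
    "backtrace" => (error.backtrace || []).first(10)
  )
  # Producing to Kafka keeps this load entirely off the job-queue Redis;
  # a separate consumer relays the payloads to Bugsnag.
  producer.produce(topic: "exception-reports", payload: payload)
end

producer = FakeProducer.new
begin
  raise ArgumentError, "boom"
rescue => e
  report_exception(producer, e)
end

topic, payload = producer.messages.first
puts topic                         # => exception-reports
puts JSON.parse(payload)["class"]  # => ArgumentError
```

This works precisely because the job had no dependency on Rails or Ruby state: it is just a payload plus an HTTP call, so any consumer in any language can drain the topic.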
The next problematic pattern was how we handled capacity sharing. In a given cluster, all job workers connect to all Redis instances and process jobs from each one in a round-robin fashion. This is great to share load across multiple workers, but this means that each Redis instance needs to maintain a connection to thousands of workers in a cluster.
We noticed that an estimated 20% of Redis CPU time was spent on connection handling.
Our first attempt at mitigating this was by isolating subsets of Redis instances with smaller dedicated worker pools, which allowed for less capacity sharing between pods, but also reduced the number of connections per Redis instance drastically.
However, a truly future-proof solution for this was to use a proxy. We are currently in the process of deploying Envoy as a proxy. This comes with the many benefits of proxies:
Solves the problem of having too many connections, since the Proxy can maintain a connection pool with a large deployment of workers and reduce overhead on the upstream Redis servers (also keeps connections alive)
Allows us to distribute load across multiple Redis instances
Allows for high availability by dynamically routing commands to a “master” and failing over to replicas
Some background jobs written by our developers need to run uniquely: we don’t want multiple instances of certain jobs running at the same time.
For this, we use Redis to store uniqueness locks. Before a job worker processes a job, it acquires the lock for that job, processes it, and then releases the lock.
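The acquire/process/release cycle can be sketched like this. It is not Hedwig's real locking code: `FakeLockRedis` mimics Redis's `SET key value NX` semantics in memory so the example runs anywhere, and in production the lock would also carry an `EX` TTL so a crashed worker cannot hold it forever.

```ruby
# Illustrative uniqueness lock over SET-NX semantics (not Hedwig's
# actual implementation).
class FakeLockRedis
  def initialize
    @store = {}
  end

  # Mimics `SET key value NX`: only sets the key if it is absent,
  # returning whether the set happened. (A real lock would add EX
  # for a TTL safety net against crashed workers.)
  def set_nx(key, value)
    return false if @store.key?(key)
    @store[key] = value
    true
  end

  def del(key)
    @store.delete(key)
  end
end

def with_unique_lock(redis, job_key)
  # If another instance of this job holds the lock, skip this run.
  return :skipped unless redis.set_nx("lock:#{job_key}", "1")
  begin
    yield
    :ran
  ensure
    redis.del("lock:#{job_key}")  # release so the next instance can run
  end
end

redis = FakeLockRedis.new
puts with_unique_lock(redis, "Inventory:42") { }  # => ran

# While a lock is held, a second instance of the same job is skipped:
redis.set_nx("lock:Inventory:42", "1")
puts with_unique_lock(redis, "Inventory:42") { }  # => skipped
```

Each job run costs at least two extra Redis round trips (acquire and release), which is why this pattern became a significant CPU overhead during flash sales.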
A key thing we found was that many background jobs generated during flash sales used this pattern. Because of that, these locking operations carry a significant CPU overhead during flash sales.
We decided to dedicate entirely separate instances for these locking operations.
We were able to come up with a zero-downtime migration scheme that allowed us to safely transition our locking operations to a separate Redis instance. This means the Redis instance persisting queues is now entirely free of locking operations.
Ultimately, no matter how well we optimize our CPU usage of Redis, we also anticipate that the flash sales on our platforms are only going to get bigger. This means that performing all enqueuing and dequeueing operations on a single Redis is eventually going to overwhelm that instance entirely.
For that reason, we are exploring ways of distributing the job queues themselves across multiple Redis instances.
This first and easy way is to assign each job queue a separate instance.
The downside of this approach is that, depending on the number of queues (currently about a dozen), it could carry a huge operational overhead, since we would have to manage an order of magnitude more Redis instances.
Another approach is to horizontally distribute every single queue across a fixed number of Redis instances per pod.
This is nice because we can achieve equal load between all Redis instances. Selecting an instance can either be done at the worker-level by making Ruby workers partition-aware, or we can leverage our Envoy proxy to distribute commands across our partitions.
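One way to make workers partition-aware, sketched here purely as an illustration (the constant and function names are hypothetical, and Shopify's eventual scheme may differ), is to hash each job onto one of a fixed number of partitions per pod:

```ruby
# Illustrative partition-aware routing: each enqueue deterministically
# picks one of a fixed number of Redis partitions for the queue.
require "zlib"

PARTITIONS_PER_POD = 4

# Hashing by job id spreads load roughly evenly across partitions,
# while any given job always maps to the same partition.
def partition_for(job_id)
  Zlib.crc32(job_id.to_s) % PARTITIONS_PER_POD
end

p1 = partition_for("job-123")
puts p1 == partition_for("job-123")       # => true (stable)
puts (0...PARTITIONS_PER_POD).cover?(p1)  # => true (in range)
```

Workers would then pop from all partitions of a queue (for example, round-robin), so no single Redis instance has to absorb an entire pod's enqueue and dequeue traffic on its own.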
In conclusion, we think that scaling your Redis infrastructure is really about knowing your usage patterns well. In our case, the driving factor has been single-tenant traffic during flash sales.
This has forced us to look deeper into our Redis use cases, and evaluate the way forward in each case.
In the case of asynchronous error reporting through HTTP, we leveraged Kafka.
To deal with an overload of connections from a growing number of workers, we decided to employ a proxy.
For locking, we provisioned more Redis instances for that specific use case.
Finally we are in the early stages of horizontally scaling the queueing operations themselves across multiple Redis instances.
If you have any questions or would like to chat, I’ll be around for the rest of the day and would be happy to talk!
Thanks!