Roblox is a global online platform that brings millions of people together through play, with over 37 million daily active users and millions of games. Machine learning is key to scaling important services to this massive community. In this talk, we share our journey of scaling our deep learning text classifiers to process over 50,000 requests per second at latencies under 20ms. We explain how we made BERT not only fast enough for our users, but also economical enough to run in production on CPU at a manageable cost. Further details can be found in our blog post below:
https://robloxtechblog.com/how-we-scaled-bert-to-serve-1-billion-daily-requests-on-cpus-d99be090db26
3. Deep Learning for Text Classification
● Text classification is a key capability on the Roblox platform
● BERT is a deep learning model that has transformed the Natural Language Processing (NLP) landscape
● Performance (Precision/Recall Area Under Curve) of our text classifiers improved by 10 percentage points when fine-tuning BERT versus classical machine learning
4. Beyond Accuracy: Latency and Throughput
● Latency: the speed of a single request. Analogy: how long it takes one person to cross a bridge. We required latency under 20ms.
● Throughput: completed requests per second. Analogy: how many people can cross the bridge in a period of time. We required over 50k requests per second.
● We want a short, wide bridge: short to minimize latency, wide to maximize throughput in a real-time environment.
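The two bridge targets above pin down a third number. A back-of-the-envelope sketch (our own arithmetic, not a figure from the talk) using Little's law, concurrency = throughput × latency:

```python
def required_concurrency(throughput_rps: float, latency_s: float) -> float:
    """Little's law: average number of requests in flight at steady state."""
    return throughput_rps * latency_s

# Serving 50k requests/second where each takes up to 20ms means roughly
# 1,000 requests must be in flight at once, i.e. on the order of 1,000
# cores/workers handling inference in parallel across the fleet.
print(required_concurrency(50_000, 0.020))  # -> 1000.0
```

This is why per-request latency matters even for a throughput goal: halving latency halves the hardware needed to sustain the same request rate.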
6. GPU vs CPU (for our application)
● Higher throughput for model training: GPU (Tesla V100), ~10x faster at processing training examples than CPU, due to its efficiency at large batched matrix operations.
● Higher throughput for real-time inference: CPU (Intel Xeon Scalable Processor), ~5x more throughput than GPU, due to CPU-specific optimizations and spreading real-time inference requests across cores (with latency < 20ms).
● (Comparison on cost-equivalent hardware in 2020.)
8. Know Your Quoc Le’s
● Quoc V. Le: has over 85k citations as an AI researcher, according to Google Scholar.
● Quoc N. Le: once got kicked out of the Boomtown Casino in Reno for counting cards in blackjack.
9. Our Scaling Playbook on CPU: Less Is More!
❏ Smaller Model (Distillation)
❏ Smaller Inputs (Dynamic Inputs)
❏ Smaller Weights (Quantization)
❏ Smaller Number of Requests (Caching)
❏ Smaller Number of Threads per Core (Thread Tuning)
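Of these, quantization is the clearest "one-liner": in PyTorch, `torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)` rewrites the linear layers to use int8 weights. To show the underlying idea without requiring PyTorch, here is a minimal stdlib sketch (illustrative toy code, not our production path) that quantizes a float weight vector to int8 with a single scale factor:

```python
def quantize_int8(weights):
    """Map float weights to int8 values plus one per-tensor float scale."""
    scale = max(abs(w) for w in weights) / 127.0  # int8 range is [-127, 127]
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [x * scale for x in q]

weights = [0.3, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight is within one quantization step of the original.
assert all(abs(a - b) <= scale for a, b in zip(weights, restored))
```

Storing int8 instead of float32 shrinks weights ~4x, and CPUs execute int8 matrix math substantially faster than float32, which is where the inference speedup comes from.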
18. Smaller Number of Requests to Model (Caching)
[Diagram: Text Classification Service → Cache → DistilBERT Model. Image credit: https://peltarion.com/blog/data-science/illustration-3d-bert]
1. Retrieve the text classification result from the cache (we’re done if it’s there)
2. Else call the deep learning model for the result
3. Add the result to the cache, then return the result to the service
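The three steps above are the classic cache-aside pattern. A minimal sketch (function and label names are illustrative, not Roblox's actual service code):

```python
cache = {}

def classify_uncached(text: str) -> str:
    # Stand-in for the expensive DistilBERT model call.
    return "safe" if "hello" in text.lower() else "review"

def classify(text: str) -> str:
    # 1. Return the cached result if we have one.
    if text in cache:
        return cache[text]
    # 2. Else call the deep learning model for the result.
    result = classify_uncached(text)
    # 3. Add the result to the cache, then return it.
    cache[text] = result
    return result

assert classify("hello world") == "safe"
assert "hello world" in cache  # a repeat request now skips the model
```

For a single-process sketch like this, `functools.lru_cache` gives the same behavior in one decorator line; a shared service would typically use an external cache with an eviction policy instead of an unbounded dict.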
20. Our Scaling Playbook on CPU: Less Is More!
✓ Smaller Model (Distillation)
✓ Smaller Inputs (Dynamic Inputs)
✓ Smaller Weights (Quantization)
✓ Smaller Number of Requests (Caching)
✓ Smaller Number of Threads per Core (Thread Tuning)
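The last item, thread tuning, is another near one-liner. By default, CPU inference libraries spawn many threads per request and the requests fight over cores; pinning each worker to a single thread lets parallel requests spread cleanly across cores. A minimal sketch of the common knobs (these environment variables are standard OpenMP/MKL settings, though the exact values to use are workload-dependent):

```python
import os

# Set before importing the deep learning framework: limit each worker
# process to one compute thread so that concurrent inference requests
# each get their own core instead of contending for all of them.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
# In PyTorch the equivalent in-process call is torch.set_num_threads(1).
```

Counterintuitively, fewer threads per request can mean higher total throughput, because it removes cross-request contention and thread-scheduling overhead.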
21. 30x Improvement in Latency and Throughput on CPU
[Chart: cumulative speedup over the baseline BERT from the smaller model, smaller inputs, and smaller weights. Benchmarks run on Intel Xeon Scalable Processors.]
22. Takeaways
● For certain real-time deep learning applications, it is feasible and natural to super-scale inference on CPU
● The key to scaling is making things smaller, as shown in this presentation
● Many of the optimizations that enabled this scale are easy to implement (one-liners)
● Check out our blog for more details: https://robloxtechblog.com/how-we-scaled-bert-to-serve-1-billion-daily-requests-on-cpus-d99be090db26
23. Questions? Suggestions?
We are always looking to get more performance from our models. Please reach out to kkaehler@roblox.com
PS: We are always hiring 🤓