Cost Effectively Scaling Machine Learning Systems in the Cloud: E-commerce and publishing clients use Sailthru to personalize billions of digital experiences for their customers every week. Earlier this year, Sailthru launched Sightlines, which lets clients predict the future behavior of individual users. In this talk we cover how we scaled Sightlines cost-effectively in the cloud by combining inexpensive computing resources, an efficient architecture, and an implementation that is easy to maintain and evolve.
To access computing resources cost-effectively, we use Amazon spot instances and Apache Mesos to pool large quantities of CPU and memory. This approach can be more than an order of magnitude more cost-effective than traditional deployments, but it requires sophisticated automation and orchestration tools and a fine-grained, fault-tolerant application architecture.
Given cost-effective resources, the next challenge is to design the application to be efficient. Simple sampling and data pre-processing techniques significantly limit the computational requirements without adversely impacting model performance. Further, by controlling how often we run the various components of the pipeline, we minimize cost while keeping models up to date.
The final challenge is to make such a system maintainable and easy to evolve. This includes removing single points of failure, automating infrastructure management, building distributed logging and monitoring capabilities, and running identical A/B production environments to enable aggressive, iterative changes to the code base and architecture in production.
We hope to demonstrate that the challenges faced in scaling a complex machine learning system in the cloud are at least as interesting as the science behind it, and to provide some insight into modern tools and methods for addressing these scalability challenges.
Jeremy Stanley, EVP/Data Scientist, Sailthru at MLconf NYC
2. Online, Offline, Mobile, Email, Social
www.sailthru.com
Cost Effectively Scaling Machine Learning Systems in the Cloud
Agenda:
● Background on me, Sailthru & Sightlines (mercifully short)
● Cost effective resources in the AWS cloud
● Efficient(ish) application design
● Easy maintenance and evolution
● Machine learning details
3. @jeremystan
[Chart axes: Capitalism vs. Idealism; Indirect Value vs. Direct Value]
Timeline (2000, 2005, 2010, 2015): Graduate student (Math), Consultant (Finance), CTO (Ad Tech), Chief Data Scientist (Mar Tech)
6. Requirements
1. ~5 million users per client
2. JSON formatted user data, siloed across clients
3. Predict varying outcomes
normal, poisson, binomial, quantile, ...
4. Update models & predictions daily
5. Only really care about predictive performance
6. Scale to 1,000+ clients
7. Our Cost Effective Scaling Strategy
1. Get really cheap computing power
2. Make it work really, really hard
3. Optimize apps for ease of evolution
4. Set up identical A/B environments
Iterate aggressively based on data:
✓ Features
✓ Efficiency
✓ Scale
Net effect: 10x * 3x * 0.6x * 0.5x = 9x
0.6x = JSON to Features at 1x (half our processing) + GBM in Memory at 0.2x (half our processing)
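The multipliers on this slide appear to combine as a simple product. The sketch below reproduces that arithmetic. Pairing each multiplier with a strategy step, and reading 0.6x as a 50/50 blend of a 1x stage and a 0.2x stage, are assumptions based on the order of the labels, not something the slide states outright.

```python
# Multipliers as listed on the slide. Pairing each one with a strategy
# step is an assumption based on the order in which they appear.
multipliers = {
    "cheap computing power": 10.0,
    "make it work hard": 3.0,
    "ease of evolution": 0.6,
    "identical A/B environments": 0.5,
}


def net_gain(factors):
    """Multiply the individual factors into one net cost-efficiency gain."""
    result = 1.0
    for value in factors.values():
        result *= value
    return result


# Assumed breakdown of the 0.6x factor: half of the processing
# (JSON to features) runs at 1x, the other half (GBM in memory) at 0.2x.
blended = 0.5 * 1.0 + 0.5 * 0.2

print(round(net_gain(multipliers), 2))
print(round(blended, 2))
```

Read this way, the two factors below 1x are the price paid for maintainability and for running two environments, and the net gain is still roughly 9x.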
8. Cost Effective Resources in the AWS Cloud
9. Cost Effective r3.8xlarge (32 vCPU, 244GB RAM)
Resource utilization:
● 30% (typical cloud)
● 10% (data center)
● 90% (highly efficient)
Cost per hour:
● $2.80 (on demand)
● $1.76 (reserved 1yr)
● $1.05 (reserved 3yr)
● $0.28 (spot instance)
Cost per hour, adjusted for utilization:
● Cloud: $9.80
● Data Center: $10.50 ($10.50 = $1.05 / 10%)
● Spot + Mesos + Relay: $0.30
30x more cost efficient!
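The parenthetical on this slide suggests that each headline figure is the hourly price divided by the utilization achieved. A minimal sketch of that calculation follows, using only the prices and utilization levels shown on the slide. The function name is ours. The slide does not show how the $9.80 cloud figure is derived, so it is not reproduced here.

```python
def cost_per_utilized_hour(price_per_hour, utilization):
    """Hourly price divided by the fraction of the machine actually used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price_per_hour / utilization


# Data center: 3-year reserved price at 10% utilization (shown on the slide).
data_center = cost_per_utilized_hour(1.05, 0.10)

# Spot instance price at 90% utilization.
spot = cost_per_utilized_hour(0.28, 0.90)

print(f"data center: ${data_center:.2f} per utilized hour")
print(f"spot at 90%: ${spot:.2f} per utilized hour")
print(f"ratio: {data_center / spot:.0f}x")
```

The slide's $0.30 and 30x appear to be rounded versions of the spot figure and the ratio computed here.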
10. AWS Spot Instances
[Chart annotations: "Your bid", "What you pay", "All instances died!"]
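The chart annotations describe the bid-based spot market as we understand it at the time of the talk: you pay the market price rather than your bid, and instances are reclaimed once the market price rises above the bid. The toy simulation below illustrates that behavior. The hourly prices are hypothetical, not taken from the slide.

```python
def simulate_spot(bid, market_prices):
    """Return (total_paid, terminated_at_hour) for a series of hourly prices.

    While the market price stays at or below the bid, the instance runs and
    is charged the market price. The first hour in which the price exceeds
    the bid terminates the instance; None means it survived the whole series.
    """
    total = 0.0
    for hour, price in enumerate(market_prices):
        if price > bid:
            return total, hour
        total += price
    return total, None


# Hypothetical hourly spot prices with one spike above the bid.
prices = [0.28, 0.30, 0.27, 1.50, 0.29]
paid, died_at = simulate_spot(bid=1.00, market_prices=prices)
print(paid, died_at)
```

Because every instance bidding at the same level dies at the same moment, the applications on top have to tolerate losing a whole pool at once.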
11. Mesos
81 “slaves”
4 availability zones
2 instance types
1,360 CPUs
10TB of RAM
94% utilized
$11.90 per hour
$104,244 per year
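The yearly figure follows directly from the hourly one. The short check below reproduces it, along with a per-CPU hourly cost that is derived here and does not appear on the slide.

```python
HOURS_PER_YEAR = 24 * 365

cluster_cost_per_hour = 11.90  # from the slide
cpus = 1360                    # from the slide

annual_cost = cluster_cost_per_hour * HOURS_PER_YEAR

# Derived figure, not on the slide: what one CPU costs per hour.
cost_per_cpu_hour = cluster_cost_per_hour / cpus

print(f"annual cost: ${annual_cost:,.0f}")
print(f"cost per CPU-hour: ${cost_per_cpu_hour:.5f}")
```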
13. Mesos + Marathon
[Diagram labels: Zones 1-4; Mesos slaves (16 CPU and 8 CPU); Mesos master; App A, App B, App C; Queue Size]
Applications must be:
● Distributed, to be scheduled wherever Mesos wants
● Fine-grained, to maximize utilization in Mesos
● Idempotent, to handle duplicate runs in case the network is partitioned
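One common way to meet the idempotency requirement is to key each unit of work by a deterministic task id and skip it when its output already exists. The sketch below shows that pattern on a local directory. It is an illustration only, not the pipeline's actual code, and `run_once`, the task id format, and the JSON output are all hypothetical.

```python
import json
import os
import tempfile


def run_once(task_id, compute, out_dir):
    """Run compute() for task_id unless its output already exists.

    Returns True if the work ran, False if it was skipped as a duplicate.
    """
    final_path = os.path.join(out_dir, f"{task_id}.json")
    if os.path.exists(final_path):
        return False  # a previous (possibly duplicate) run already finished

    result = compute()

    # Write to a temporary file, then rename, so a crash mid-write never
    # leaves a partial output that looks complete.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(result, handle)
    os.replace(tmp_path, final_path)
    return True


out_dir = tempfile.mkdtemp()
first = run_once("client42-shard7", lambda: {"rows": 3}, out_dir)
second = run_once("client42-shard7", lambda: {"rows": 3}, out_dir)
print(first, second)
```

If two duplicates race, both may compute, but they write the same deterministic output to the same path, so the end state is unchanged.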
14. Mesos + Marathon
[Same diagram labels as the previous slide, plus a chart of available Mesos CPU jiffies over time, split into Idle and User]
Doesn’t work for apps with highly variable load
15. Mesos + Relay
[Charts: available Mesos CPU jiffies (User vs. Idle) over time, before Relay and after Relay]
Relay.Mesos
Auto-scaler for distributed applications
github.com/sailthru/relay.mesos
● Allocates resources based on queue size
● Wraps applications inside Mesos slaves
● Can significantly improve cluster utilization
[Diagram labels: App A, App B, App C; Queue Size; Relay.Mesos; Mesos Master]
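To make "allocates resources based on queue size" concrete, here is a minimal sketch of a queue-driven scaling rule. It is not Relay.Mesos's actual algorithm or API. The function names, the `tasks_per_worker` ratio, and the `max_workers` cap are assumptions chosen for illustration.

```python
import math


def desired_workers(queue_size, tasks_per_worker, max_workers):
    """Workers needed to drain the queue, capped by the cluster limit."""
    if queue_size <= 0:
        return 0
    return min(max_workers, math.ceil(queue_size / tasks_per_worker))


def scaling_delta(queue_size, running, tasks_per_worker=10, max_workers=100):
    """Positive: start this many workers. Negative: stop this many."""
    return desired_workers(queue_size, tasks_per_worker, max_workers) - running


print(scaling_delta(queue_size=95, running=4))
print(scaling_delta(queue_size=0, running=4))
```

Scaling to zero when the queue is empty is what frees CPU for other applications, which is where the utilization gain for bursty workloads comes from.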
18. Sampling Strategy
[Diagram labels: Mongo, JSON, S3; shard 1 through shard 1,000 for each of Day 1 through Day N]
● JSON sharded on hash(user)
● Consistent 0.1% of data to a Mesos Slave CPU
● Apps sample more as needed
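A minimal sketch of how sharding on a hash of the user can give a consistent sample, assuming that one of the 1,000 shards corresponds to the 0.1% sample and that larger samples are built by reading more shards. The choice of MD5 and the function names are ours. A stable hash is used because Python's built-in `hash()` on strings varies between processes.

```python
import hashlib

NUM_SHARDS = 1000


def shard_for(user_id):
    """Map a user id to a shard in 1..NUM_SHARDS, stable across runs and days."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS + 1


def shards_for_sample(fraction):
    """Shards to read for a sample of roughly `fraction` of all users.

    Samples are nested: a larger fraction always includes the shards
    (and therefore the users) of any smaller one.
    """
    count = max(1, round(fraction * NUM_SHARDS))
    return list(range(1, count + 1))


print(shard_for("user-123"))
print(shards_for_sample(0.001))
print(len(shards_for_sample(0.01)))
```

Because a user always lands in the same shard, the same users appear in the sample on Day 1 and Day N, so their history can be followed without scanning the full data set.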
23. Each User Radically Different
[Diagram labels: User, Feature, ???]
tidyjson
Turn JSON into data frames
github.com/sailthru/tidyjson
● Arbitrary JSON into R data.frames
● Guarantees deterministic structure
● Seamless with dplyr and %>%
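tidyjson is an R package, so the following is only a Python analogue of the idea, not its API: turn arbitrary, differently shaped JSON documents into rows with one fixed set of columns. The `flatten` and `to_rows` helpers and the sample document are hypothetical.

```python
import json


def flatten(value, prefix=""):
    """Yield (path, leaf_value) pairs for every scalar in a JSON value."""
    if isinstance(value, dict):
        for key in sorted(value):  # sorted keys keep the output deterministic
            child = f"{prefix}.{key}" if prefix else key
            yield from flatten(value[key], child)
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from flatten(item, f"{prefix}[{index}]")
    else:
        yield prefix, value


def to_rows(user_id, raw_json):
    """Return (user_id, path, value) rows, whatever shape the JSON has."""
    return [(user_id, path, leaf) for path, leaf in flatten(json.loads(raw_json))]


rows = to_rows(
    "u1",
    '{"geo": {"city": "NYC"}, "purchases": [{"price": 10}, {"price": 25}]}',
)
for row in rows:
    print(row)
```

Every user yields rows with the same three columns, however different the underlying JSON, which is the property that makes downstream feature code uniform.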
25. Why GBMs?
● Predict varying outcomes
normal, poisson, binomial, quantile, …
● Flexible enough to capture non-linearity & complex interactions
no need to feature engineer for each client
● Minimal number of hyper-parameters
depth, shrinkage, number of trees
● Robust to missing values
no need to impute
26. Distributing a GBM
$\alpha_1 \cdot \text{tree}_1 + \alpha_2 \cdot \text{tree}_2 + \alpha_3 \cdot \text{tree}_3 + \dots + \alpha_K \cdot \text{tree}_K$
29. Distributing a GBM
1. Across the sum
Gives bagging, not boosting (boosting is iterative)
=> less accurate
2. Within each tree (Spark MLlib, H2O)
A lot of overhead and coordination
=> not efficient for many small GBMs
3. Across the GBMs ✓
50,000 GBMs to build
=> each can be built independently
50,000 = 1,000 clients * 10 models * 5-fold CV
[Diagram: GBM 1 through GBM 50,000, each of the form $\alpha_1 \cdot \text{tree}_1 + \alpha_2 \cdot \text{tree}_2 + \alpha_3 \cdot \text{tree}_3 + \dots + \alpha_K \cdot \text{tree}_K$, on Mesos slaves in Zones 1-4]
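Distributing across the GBMs amounts to enumerating one independent task per (client, model, fold) combination and handing each to any free worker. The sketch below shows that enumeration. `build_gbm` is a placeholder, and the thread pool merely stands in for the Mesos cluster; neither is the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def gbm_tasks(num_clients=1000, num_models=10, num_folds=5):
    """One (client, model, fold) tuple per GBM that has to be built."""
    return product(range(num_clients), range(num_models), range(num_folds))


def build_gbm(task):
    """Placeholder for fitting one GBM; a real worker would train a model."""
    client, model, fold = task
    return f"gbm-{client}-{model}-{fold}"


tasks = list(gbm_tasks())
print(len(tasks))

# Tasks share no state, so any subset can run on any worker in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(build_gbm, tasks[:8]))
print(results[0])
```

No task depends on another, so no coordination is needed between workers, which is the overhead that distributing within each tree would incur.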
30. Grid Search
$\alpha_1 \cdot \text{tree}_1 + \alpha_2 \cdot \text{tree}_2 + \alpha_3 \cdot \text{tree}_3 + \dots + \alpha_K \cdot \text{tree}_K$
For each client & model:
1. Grid search over:
a. Depth: size of trees
b. Shrinkage: λ, the “learning rate” for $\{\alpha_i\}$
2. Cross-validate for optimal # of trees
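The two steps above can be sketched as a loop over (depth, shrinkage) pairs, with the number of trees chosen at the minimum of a cross-validated loss curve for each pair. The `toy_cv_curve` function is a synthetic stand-in for actually fitting GBMs, and its shape (lower shrinkage needing more trees) is an assumption made so the example runs without a modeling library.

```python
from itertools import product


def best_num_trees(cv_losses):
    """Return (number_of_trees, loss) at the minimum of a CV loss curve."""
    best_index = min(range(len(cv_losses)), key=cv_losses.__getitem__)
    return best_index + 1, cv_losses[best_index]


def grid_search(depths, shrinkages, cv_curve):
    """Try every (depth, shrinkage) pair and keep the lowest CV loss."""
    best = None
    for depth, shrinkage in product(depths, shrinkages):
        trees, loss = best_num_trees(cv_curve(depth, shrinkage))
        if best is None or loss < best["loss"]:
            best = {"depth": depth, "shrinkage": shrinkage,
                    "trees": trees, "loss": loss}
    return best


def toy_cv_curve(depth, shrinkage, max_trees=200):
    """Synthetic CV loss per tree count; NOT a real model fit."""
    target = int(round(5.0 / shrinkage))  # smaller steps need more trees
    penalty = 0.1 * abs(depth - 3) + shrinkage
    return [(k - target) ** 2 / 1e4 + penalty for k in range(1, max_trees + 1)]


best = grid_search(depths=[1, 3, 5], shrinkages=[0.1, 0.05], cv_curve=toy_cv_curve)
print(best)
```

In a real pipeline `cv_curve` would return the held-out loss after each added tree, averaged over the folds.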
32. Tools Used
Batch Applications:
● R: Modeling
● Python: ETL
● AWS S3: State
Frameworks:
● Zookeeper: Coordination
● Spark: Map Reduce
● Marathon: Running Apps
● Mesos: Cluster Sharing
Maintenance:
● ELK: Log Mgmt
● Consul: Discovery
● Chef: Configuration Automation
● Librato: Monitoring
● Sensu: Alerting
● Asgard: Auto Scaling
● AWS Spot: Compute
43. How we Iterate
[Diagram labels: environments A and B; Sailthru, User, API, Mongo, JSON]
● Tools
● Configuration
● Applications
Versions: v1.0.0, v1.0.1, v1.0.2
✓ Check monitoring
✓ Check logging
✓ Check performance
44. Thank You! Our team:
Divyanshu Vats, Alex Gaudio, Andras Kerekes, Jeremy Stanley