Machine learning is being deployed in a growing number of applications that demand real-time, accurate, and robust predictions under heavy serving loads. However, most machine learning frameworks and systems address only model training, not deployment.
Clipper is an open-source, general-purpose model-serving system that addresses these challenges. Interposing between applications that consume predictions and the machine-learning models that produce predictions, Clipper simplifies the model deployment process by adopting a modular serving architecture and isolating models in their own containers, allowing them to be evaluated using the same runtime environment as that used during training. Clipper’s modular architecture provides simple mechanisms for scaling out models to meet increased throughput demands and performing fine-grained physical resource allocation for each model. Further, by abstracting models behind a uniform serving interface, Clipper allows developers to compose many machine-learning models within a single application to support increasingly common techniques such as ensemble methods, multi-armed bandit algorithms, and prediction cascades.
In this talk I will provide an overview of the Clipper serving system and discuss how to get started using Clipper to serve Apache Spark and TensorFlow models on Kubernetes. I will then discuss some recent work on statistical performance monitoring for machine learning models.
2. Clipper Team
• Dan Crankshaw
• Corey Zumar
• Simon Mo
• Alexey Tumanov
• Eyal Sela
• Rehan Durrani
• Eric Sheng
• Joseph Gonzalez
• Ion Stoica
• Many other contributors
3. What is the Machine Learning Lifecycle?
[Diagram: Offline model development (data collection, cleaning & visualization, feature engineering & model design, training & validation) feeds training pipelines that turn training data into trained models. Online, a prediction service performs inference over live data: the end-user application sends queries and receives predictions, and feedback flows back into training and validation.]
10. Manageable
☐ Simple to deploy a wide range of ML applications and frameworks
☐ Debuggable
☐ Easy to monitor system and model performance
11. Fast and Scalable
☐ Low and predictable latencies for interactive applications
☐ Scale to high throughputs
☐ Serve on specialized hardware
18. Serving Pre-materialized Predictions
[Diagram: A batch training framework scores all possible queries offline (X → Y) and loads the results into a data management system; at serving time, the application issues a query against that system and receives a decision with low latency.]
19. Serving Pre-materialized Predictions
[Same diagram as the previous slide.]
Advantages:
• Simple to deploy with standard tools
• Low latency at serving time
Problems:
• Requires the full set of queries ahead of time
• Only works for a small, bounded input domain
• Requires substantial computation and space
  • Example: scoring all content for all customers!
• Costly to update → rescore everything!
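To make the pattern concrete, here is a minimal sketch of pre-materialized serving (the model and query domain are stand-ins, not part of any real system): score everything offline, store the results in a lookup table, and serve by lookup.

# Sketch of pre-materialized serving; ToyModel and the query range
# are placeholders for a real model and input domain.
class ToyModel:
    def predict(self, x):
        return x * 2  # stand-in for an expensive model evaluation

def batch_score(model, all_queries):
    # Offline scoring pass over the full (bounded) input domain.
    return {q: model.predict(q) for q in all_queries}

# Offline: materialize a prediction for every possible query.
prediction_store = batch_score(ToyModel(), all_queries=range(1000))

# Online: serving is a dictionary lookup with no model evaluation,
# but unseen queries cannot be answered and updates require rescoring.
def serve(query):
    return prediction_store[query]

print(serve(7))  # -> 14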
22. Model in a Container
[Diagram: The application sends a query to a model wrapped behind a {REST API} endpoint and receives a decision back.]
23. Model in a Container
[Same diagram as the previous slide.]
Advantages:
• General-purpose
• Renders predictions at serving time
Problems:
• Requires the data scientist to write performance-sensitive serving code
• Inefficient use of compute resources → uses throughput-optimized frameworks to render single predictions
• No support for monitoring or debugging models
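As a hedged illustration of this pattern (not Clipper code; the endpoint and model are hypothetical), a hand-rolled REST wrapper might look like this, with every serving concern left to its author:

# "Model in a container" sketch: a trained model behind a hand-written
# REST endpoint (illustrative only).
from flask import Flask, request, jsonify

app = Flask(__name__)

class ToyModel:
    def predict(self, features):
        return sum(features)  # stand-in for a real framework call

model = ToyModel()

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["input"]
    # One query per call: no batching, caching, metrics, or latency
    # control; all of that falls on whoever writes this file.
    return jsonify({"prediction": model.predict(features)})

if __name__ == "__main__":
    app.run(port=8080)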
24. Prediction-Serving Today
Current approaches do not meet the requirements for prediction-serving systems:
• Pre-materialize predictions: a batch training framework scores all possible queries (X → Y) ahead of time
• Model in a container: the model sits behind a hand-built {REST API}
Can we design a system that does meet those requirements?
30. Run on Kubernetes alongside other applications
[Diagram: A Kubernetes cluster hosting Clipper and its model containers (MC) next to other applications such as a web server, a database, and a cache.]
31. Clipper
[Diagram: Applications call a Predict RPC/REST interface; Clipper dispatches over RPC to model containers (MC).]
• Core system: 10K lines of C++ and 8K lines of Python
• Open source (Apache License) – http://clipper.ai
• Designed to support production-level query traffic
• Delivers low and predictable latency
32. How does Clipper satisfy the prediction-serving requirements?
☐ Manageable
  ☐ Simple model deployment
  ☐ Custom model metrics
33. Clipper
[Diagram: Clipper dispatches over RPC to model containers (MC) running frameworks such as Caffe.]
Common interface → simplifies deployment:
• Evaluate models using their original code & systems
• Abstract performance-sensitive serving code away from data scientists
• Models run in separate processes as Docker containers
• Resource isolation: cutting-edge ML frameworks can be buggy
36. Container-based Model Deployment
Package the model implementation and its dependencies in a model container (MC) that exposes a uniform interface:

class ModelContainer:
    def __init__(self, model_data):
        ...

    def predict_batch(self, inputs):
        ...
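As a hedged example, a container for a pickled scikit-learn model might implement this interface as follows (the class name and file format are illustrative, not Clipper's actual base class):

# Hypothetical container implementing the uniform interface above.
import pickle

class SKLearnContainer:
    def __init__(self, model_data):
        # model_data: path to a pickled, already-trained scikit-learn model.
        with open(model_data, "rb") as f:
            self.model = pickle.load(f)

    def predict_batch(self, inputs):
        # inputs: a batch of feature vectors; returns one prediction each.
        return self.model.predict(inputs).tolist()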
39. Clipper provides a library of model deployers
• The deployer automatically and intelligently saves all prediction code
• Captures both framework-specific models and arbitrary serializable code
• Replicates the training environment and loads the prediction code in a Clipper model container
43. Clipper provides a (growing) library of model deployers
• Python (see the deployment sketch after this list)
  • Combine framework-specific models with external featurization, post-processing, and business logic
  • Currently supports Scikit-Learn, PySpark, TensorFlow, PyTorch, MXNet, and XGBoost
• Arbitrary R functions
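As a sketch of how the Python deployer is used (assuming a Clipper instance is already running under Docker; names follow the clipper_admin package, but signatures may differ across versions):

# Deploying an arbitrary Python closure with clipper_admin.
from clipper_admin import ClipperConnection, DockerContainerManager
from clipper_admin.deployers import python as python_deployer

clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.connect()  # attach to an already-running Clipper instance

def predict(inputs):
    # Arbitrary Python: featurization, framework calls, post-processing.
    return [str(sum(x)) for x in inputs]

# The deployer captures predict() and its environment and loads it
# into a Clipper model container.
python_deployer.deploy_python_closure(
    clipper_conn, name="sum-model", version=1,
    input_type="doubles", func=predict)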
44. Custom Model Metrics
[Diagram: metrics flow out of the model container (MC).]
• Register and report metrics directly from the model container
• Available with model deployers and custom containers
• Results are exported to Prometheus
45. Custom Model Metrics
[Repeats the previous slide and adds:]
• Enables users to track both the physical and statistical performance of a model
  • Detect and diagnose model degradation
  • Debug performance issues
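As an illustration of the idea (using the prometheus_client library directly; Clipper's own container-side registration API differs, but the metrics it exports to Prometheus look like these):

# Container-side custom metrics, sketched with prometheus_client.
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total",
                      "Number of predictions served by this container")
CONFIDENCE = Histogram("model_confidence",
                       "Distribution of model confidence scores")

def predict_batch(inputs):
    preds = [(x, 0.9) for x in inputs]  # stand-in for a real model call
    PREDICTIONS.inc(len(preds))         # physical performance signal
    for _, conf in preds:
        CONFIDENCE.observe(conf)        # statistical performance signal
    return [p for p, _ in preds]

start_http_server(9090)  # expose /metrics for Prometheus to scrape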
46. How does Clipper satisfy the prediction-serving requirements?
☐ Fast and Scalable
  ☐ System optimizations: zero-copy RPC and prediction caching
  ☐ Hierarchical scaleout on Kubernetes
47. System Optimizations
[Diagram: Applications call the Predict RPC/REST interface; Clipper's model abstraction layer provides a common interface and system optimizations, dispatching over RPC to model containers (MC) running frameworks such as Caffe.]
48. System Optimizations
[Same diagram, with the model abstraction layer expanded: a common API providing zero-copy RPC, caching, and batching.]
Prediction caching: automatically maintain a per-model function cache to eliminate redundant computation
• Reduces latency and cost
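A minimal sketch of the caching idea (memoization keyed per model and input; Clipper's actual cache lives in the C++ core and manages its own eviction, so this is only illustrative):

# Per-model prediction cache sketched as memoization of the predict call.
from functools import lru_cache

class CachedModel:
    def __init__(self, name, version, predict_fn):
        self.name, self.version = name, version
        # A bounded LRU cache: repeated queries skip the model entirely.
        self._cached = lru_cache(maxsize=4096)(predict_fn)

    def predict(self, x):
        return self._cached(x)  # x must be hashable (e.g., a tuple)

model = CachedModel("sum-model", 1, lambda x: sum(x))
model.predict((1.0, 2.0))  # computed by the model
model.predict((1.0, 2.0))  # served from the cache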
50. Hierarchical Scaleout: Replicate Clipper
[Diagram: A web server fans out to multiple Clipper instances, each managing its own set of model containers (MC).]
Hierarchical scaleout enables fine-grained replication to minimize cost while scaling to massive workloads.
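With the Kubernetes container manager, individual models can be replicated independently of the rest of the deployment. A sketch (assuming an existing cluster and an already-deployed model; signatures may differ by clipper_admin version):

# Fine-grained replication on Kubernetes: scale only the hot model.
from clipper_admin import ClipperConnection, KubernetesContainerManager

clipper_conn = ClipperConnection(KubernetesContainerManager())
clipper_conn.connect()

# Other models and the Clipper core are left untouched.
clipper_conn.set_num_replicas("sum-model", num_replicas=8)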
51. How does Clipper satisfy the prediction-serving requirements?
☐ Affordable
  ☐ Hierarchical scaleout
  ☐ Latency-aware batching
52. Problem: Compute resources are expensive
• Many machine learning models require expensive compute resources to perform inference at interactive latencies:
  • Multi-core servers
  • GPUs
  • Other hardware accelerators like TPUs and FPGAs
• Models must be initialized and warmed to respond to queries in tens of milliseconds
Prediction-serving systems must maximize resource utilization to reduce cost.
53. System Optimizations
[Repeats the slide 48 diagram of the model abstraction layer and its optimizations: zero-copy RPC, caching, and batching.]
54. Batching to Improve Resource Utilization
• A single page load may generate many queries
• Why batching helps: throughput-optimized frameworks amortize per-call overhead across a batch
[Plot: throughput vs. batch size for a throughput-optimized framework; throughput grows with batch size.]
• The optimal batch size depends on:
  • hardware configuration
  • model and framework
  • system load
55. Latency-aware Batching
[Repeats the previous slide's plot and bullets.]
Clipper's solution: adaptively trade off latency and throughput
• Explore: increase the batch size until the latency objective is exceeded
• Exploit: use latency measurements from previous batches to estimate the optimal batch size for the SLO
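A toy sketch of that explore/exploit loop (additive increase plus a fitted linear latency model; the real controller is more refined):

# Latency-aware batch sizing: grow the batch until the SLO is exceeded,
# then estimate the largest batch that still fits under the SLO from the
# observed (batch size, latency) pairs.
import time

def find_batch_size(predict_batch, make_batch, slo_s):
    observations = []
    batch_size = 1
    while True:  # explore: additive increase
        start = time.perf_counter()
        predict_batch(make_batch(batch_size))
        latency = time.perf_counter() - start
        observations.append((batch_size, latency))
        if latency > slo_s:
            break
        batch_size += 1
    # exploit: fit latency ~ a * batch_size + b and solve for the SLO
    n = len(observations)
    mean_b = sum(b for b, _ in observations) / n
    mean_l = sum(l for _, l in observations) / n
    var = sum((b - mean_b) ** 2 for b, _ in observations) or 1e-9
    a = sum((b - mean_b) * (l - mean_l) for b, l in observations) / var
    intercept = mean_l - a * mean_b
    return max(1, int((slo_s - intercept) / a)) if a > 0 else batch_size

# Fake model whose latency grows ~1 ms per queued input, 20 ms SLO:
est = find_batch_size(lambda xs: time.sleep(0.001 * len(xs)),
                      lambda n: list(range(n)), slo_s=0.02)
print("estimated max batch size under SLO:", est)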
60. Clipper: a middle layer for prediction serving
• Common abstractions make Clipper manageable
• System optimizations make Clipper fast, scalable, and affordable
[Logos of supported frameworks, including VW and Caffe.]
62. Just Released: Version 0.3
Release highlights:
• Metrics and monitoring infrastructure using Prometheus
• Support for user-defined model metrics
• Support for hierarchical replication
• Several new model deployers
63. Community and Adoption
• Active collaborations with several organizations:
  • SAP
  • ScotiaBank
  • ARM
  • IBM
  • AI Singapore
• Status of the community:
  • Initial users now have Clipper deployments in production
  • 29 contributors
64. Getting Started with Clipper
Docker images are available on Docker Hub, and the Clipper admin tool is distributed as a pip package:
pip install clipper_admin
Get up and running without cloning or compiling!
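A quick-start sketch (assuming Docker is running locally; the application name and latency objective are examples, and signatures may differ by version):

# Stand up a local Clipper instance and register an application.
from clipper_admin import ClipperConnection, DockerContainerManager

clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.start_clipper()  # pulls and starts the Clipper Docker images

clipper_conn.register_application(
    name="hello-app", input_type="doubles",
    default_output="-1.0", slo_micros=100000)  # 100 ms latency objective

# After deploying a model (see the deployer sketch above), route the
# application's queries to it:
clipper_conn.link_model_to_app(app_name="hello-app", model_name="sum-model")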
66. Model Selection*
• Cross-framework model composition: improved accuracy through ensembles and contextual bandits
• Exploit multiple models to estimate confidence
• Use multi-armed bandit algorithms to learn the optimal model selection online
• Online personalization across ML frameworks
*Explored in a research prototype [NSDI 2017]
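As a hedged sketch of bandit-based model selection (plain epsilon-greedy; the research prototype explored more sophisticated bandit algorithms):

# Online model selection with an epsilon-greedy multi-armed bandit.
import random

class ModelBandit:
    def __init__(self, models, epsilon=0.1):
        self.models = models                  # name -> predict function
        self.counts = {m: 0 for m in models}
        self.rewards = {m: 0.0 for m in models}
        self.epsilon = epsilon

    def select(self):
        if random.random() < self.epsilon:    # explore a random model
            return random.choice(list(self.models))
        return max(self.models,               # exploit the best average reward
                   key=lambda m: self.rewards[m] / max(self.counts[m], 1))

    def predict(self, x):
        name = self.select()
        return name, self.models[name](x)

    def feedback(self, name, reward):
        # Fold application feedback (e.g., click, correct label) back in.
        self.counts[name] += 1
        self.rewards[name] += reward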
68. Model Composition
• Ensembles can improve accuracy
• Faster inference with prediction cascades: query a fast model first; if it is confident, return its answer, otherwise fall back to a slow but accurate model (see the sketch below)
• Faster development through model reuse: feed a pre-trained DNN into a task-specific model
• Model specialization: e.g., an object detector that invokes a face detector only when an object is detected, branching on whether a face is found
How can we efficiently support serving arbitrary model pipelines?
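A sketch of a prediction cascade (the models and confidence threshold are stand-ins):

# Prediction cascade: answer with the fast model when it is confident,
# otherwise pay for the slow but accurate model.
def cascade(x, fast_model, slow_model, threshold=0.9):
    pred, confidence = fast_model(x)
    if confidence >= threshold:
        return pred          # cheap path: most queries stop here
    return slow_model(x)     # expensive path for hard inputs

# Toy usage: the fast model is only confident on "easy" inputs.
fast = lambda x: (x > 0, 0.95 if abs(x) > 1 else 0.5)
slow = lambda x: x > 0
print(cascade(5.0, fast, slow))  # fast path
print(cascade(0.3, fast, slow))  # falls back to the slow model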
69. Model Composition
• InferLine: a research project studying resource allocation and scheduling for arbitrary model pipelines
• Investigating languages and APIs for expressing model compositions
  • Extends the work on model deployers and function shipping
70. Integration with Workflow Systems
• There are now several workflow systems for managing the ML lifecycle
• Their serving solutions tend to adopt the "model in a container" approach
• Use Clipper as a better drop-in replacement serving system
  • Initial prototype with AWS SageMaker
  • Will look at Kubeflow as well going forward