Machine learning is being deployed in a growing number of applications that demand real-time, accurate, and robust predictions under heavy serving loads. However, most machine learning frameworks and systems address only model training, not deployment.
Clipper is an open-source, general-purpose model-serving system that addresses these challenges. Interposing between applications that consume predictions and the machine-learning models that produce predictions, Clipper simplifies the model deployment process by adopting a modular serving architecture and isolating models in their own containers, allowing them to be evaluated using the same runtime environment as that used during training. Clipper’s modular architecture provides simple mechanisms for scaling out models to meet increased throughput demands and performing fine-grained physical resource allocation for each model. Further, by abstracting models behind a uniform serving interface, Clipper allows developers to compose many machine-learning models within a single application to support increasingly common techniques such as ensemble methods, multi-armed bandit algorithms, and prediction cascades.
In this talk I will provide an overview of the Clipper serving system and discuss how to get started using Clipper to serve Apache Spark and TensorFlow models on Kubernetes. I will then discuss some recent work on statistical performance monitoring for machine learning models.
2. Clipper Team
• Dan Crankshaw
• Corey Zumar
• Simon Mo
• Alexey Tumanov
• Eyal Sela
• Rehan Durrani
• Eric Sheng
• Joseph Gonzalez
• Ion Stoica
• Many other contributors
3. What is the Machine Learning Lifecycle?
[Diagram: Offline model development (data collection, cleaning & visualization, feature engineering & model design, training & validation) feeds training pipelines that turn training data into trained models. Online, a prediction service performs inference over live data: the end-user application sends queries and receives predictions, and feedback flows back into training and validation.]
10. Manageable
☐ Simple to deploy a wide range of ML applications and frameworks
☐ Debuggable
☐ Easy to monitor system and model performance
11. Fast and Scalable
☐ Low and predictable latencies for interactive applications
☐ Scale to high throughputs
☐ Serve on specialized hardware
18. Serving Pre-materialized Predictions
[Diagram: A batch training framework scores all possible queries offline (X → Y) and loads the results into a data management system; at serving time, the application issues a query against that system and receives a decision with low latency.]
19. Serving Pre-materialized Predictions
[Same diagram as the previous slide.]
Advantages:
• Simple to deploy with standard tools
• Low latency at serving time
Problems:
• Requires the full set of queries ahead of time
• Only works for a small, bounded input domain
• Requires substantial computation and space
  • Example: scoring all content for all customers!
• Costly to update → rescore everything!
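To make the pattern concrete, here is a minimal sketch of pre-materialized serving (the model and query domain are stand-ins, not part of any real system): score everything offline, store the results in a lookup table, and serve by lookup.

# Sketch of pre-materialized serving; ToyModel and the query range
# are placeholders for a real model and input domain.
class ToyModel:
    def predict(self, x):
        return x * 2  # stand-in for an expensive model evaluation

def batch_score(model, all_queries):
    # Offline scoring pass over the full (bounded) input domain.
    return {q: model.predict(q) for q in all_queries}

# Offline: materialize a prediction for every possible query.
prediction_store = batch_score(ToyModel(), all_queries=range(1000))

# Online: serving is a dictionary lookup with no model evaluation,
# but unseen queries cannot be answered and updates require rescoring.
def serve(query):
    return prediction_store[query]

print(serve(7))  # -> 14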
22. Model in a Container
[Diagram: The application sends a query to a model wrapped behind a {REST API} endpoint and receives a decision back.]
23. Model in a Container
[Same diagram as the previous slide.]
Advantages:
• General-purpose
• Renders predictions at serving time
Problems:
• Requires the data scientist to write performance-sensitive serving code
• Inefficient use of compute resources → uses throughput-optimized frameworks to render single predictions
• No support for monitoring or debugging models
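As a hedged illustration of this pattern (not Clipper code; the endpoint and model are hypothetical), a hand-rolled REST wrapper might look like this, with every serving concern left to its author:

# "Model in a container" sketch: a trained model behind a hand-written
# REST endpoint (illustrative only).
from flask import Flask, request, jsonify

app = Flask(__name__)

class ToyModel:
    def predict(self, features):
        return sum(features)  # stand-in for a real framework call

model = ToyModel()

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["input"]
    # One query per call: no batching, caching, metrics, or latency
    # control; all of that falls on whoever writes this file.
    return jsonify({"prediction": model.predict(features)})

if __name__ == "__main__":
    app.run(port=8080)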
24. Prediction-Serving Today
Current approaches do not meet the requirements for prediction-serving systems:
• Pre-materialize predictions: a batch training framework scores all possible queries (X → Y) ahead of time
• Model in a container: the model sits behind a hand-built {REST API}
Can we design a system that does meet those requirements?
30. Run on Kubernetes alongside other applications
[Diagram: A Kubernetes cluster hosting Clipper and its model containers (MC) next to other applications such as a web server, a database, and a cache.]
31. Clipper
[Diagram: Applications call a Predict RPC/REST interface; Clipper dispatches over RPC to model containers (MC).]
• Core system: 10K lines of C++ and 8K lines of Python
• Open source (Apache License) – http://clipper.ai
• Designed to support production-level query traffic
• Delivers low and predictable latency
32. How does Clipper satisfy the prediction-serving requirements?
☐ Manageable
  ☐ Simple model deployment
  ☐ Custom model metrics
33. Clipper
[Diagram: Clipper dispatches over RPC to model containers (MC) running frameworks such as Caffe.]
Common interface → simplifies deployment:
• Evaluate models using their original code & systems
• Abstract performance-sensitive serving code away from data scientists
• Models run in separate processes as Docker containers
• Resource isolation: cutting-edge ML frameworks can be buggy
36. Container-based Model Deployment
Package the model implementation and its dependencies in a model container (MC) that exposes a uniform interface:

class ModelContainer:
    def __init__(self, model_data):
        ...

    def predict_batch(self, inputs):
        ...
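As a hedged example, a container for a pickled scikit-learn model might implement this interface as follows (the class name and file format are illustrative, not Clipper's actual base class):

# Hypothetical container implementing the uniform interface above.
import pickle

class SKLearnContainer:
    def __init__(self, model_data):
        # model_data: path to a pickled, already-trained scikit-learn model.
        with open(model_data, "rb") as f:
            self.model = pickle.load(f)

    def predict_batch(self, inputs):
        # inputs: a batch of feature vectors; returns one prediction each.
        return self.model.predict(inputs).tolist()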
39. Clipper provides a library of model deployers
• The deployer automatically and intelligently saves all prediction code
• Captures both framework-specific models and arbitrary serializable code
• Replicates the training environment and loads the prediction code in a Clipper model container
43. Clipper provides a (growing) library of model deployers
• Python (see the deployment sketch after this list)
  • Combine framework-specific models with external featurization, post-processing, and business logic
  • Currently supports Scikit-Learn, PySpark, TensorFlow, PyTorch, MXNet, and XGBoost
• Arbitrary R functions
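As a sketch of how the Python deployer is used (assuming a Clipper instance is already running under Docker; names follow the clipper_admin package, but signatures may differ across versions):

# Deploying an arbitrary Python closure with clipper_admin.
from clipper_admin import ClipperConnection, DockerContainerManager
from clipper_admin.deployers import python as python_deployer

clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.connect()  # attach to an already-running Clipper instance

def predict(inputs):
    # Arbitrary Python: featurization, framework calls, post-processing.
    return [str(sum(x)) for x in inputs]

# The deployer captures predict() and its environment and loads it
# into a Clipper model container.
python_deployer.deploy_python_closure(
    clipper_conn, name="sum-model", version=1,
    input_type="doubles", func=predict)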
44. Custom Model Metrics
[Diagram: metrics flow out of the model container (MC).]
• Register and report metrics directly from the model container
• Available with model deployers and custom containers
• Results are exported to Prometheus
45. Custom Model Metrics
[Repeats the previous slide and adds:]
• Enables users to track both the physical and statistical performance of a model
  • Detect and diagnose model degradation
  • Debug performance issues
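As an illustration of the idea (using the prometheus_client library directly; Clipper's own container-side registration API differs, but the metrics it exports to Prometheus look like these):

# Container-side custom metrics, sketched with prometheus_client.
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total",
                      "Number of predictions served by this container")
CONFIDENCE = Histogram("model_confidence",
                       "Distribution of model confidence scores")

def predict_batch(inputs):
    preds = [(x, 0.9) for x in inputs]  # stand-in for a real model call
    PREDICTIONS.inc(len(preds))         # physical performance signal
    for _, conf in preds:
        CONFIDENCE.observe(conf)        # statistical performance signal
    return [p for p, _ in preds]

start_http_server(9090)  # expose /metrics for Prometheus to scrape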
46. How does Clipper satisfy the prediction-serving requirements?
☐ Fast and Scalable
  ☐ System optimizations: zero-copy RPC and prediction caching
  ☐ Hierarchical scaleout on Kubernetes
47. System Optimizations
[Diagram: Applications call the Predict RPC/REST interface; Clipper's model abstraction layer provides a common interface and system optimizations, dispatching over RPC to model containers (MC) running frameworks such as Caffe.]
48. System Optimizations
[Same diagram, with the model abstraction layer expanded: a common API providing zero-copy RPC, caching, and batching.]
Prediction caching: automatically maintain a per-model function cache to eliminate redundant computation
• Reduces latency and cost
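A minimal sketch of the caching idea (memoization keyed per model and input; Clipper's actual cache lives in the C++ core and manages its own eviction, so this is only illustrative):

# Per-model prediction cache sketched as memoization of the predict call.
from functools import lru_cache

class CachedModel:
    def __init__(self, name, version, predict_fn):
        self.name, self.version = name, version
        # A bounded LRU cache: repeated queries skip the model entirely.
        self._cached = lru_cache(maxsize=4096)(predict_fn)

    def predict(self, x):
        return self._cached(x)  # x must be hashable (e.g., a tuple)

model = CachedModel("sum-model", 1, lambda x: sum(x))
model.predict((1.0, 2.0))  # computed by the model
model.predict((1.0, 2.0))  # served from the cache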
50. Hierarchical Scaleout: Replicate Clipper
[Diagram: A web server fans out to multiple Clipper instances, each managing its own set of model containers (MC).]
Hierarchical scaleout enables fine-grained replication to minimize cost while scaling to massive workloads.
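With the Kubernetes container manager, individual models can be replicated independently of the rest of the deployment. A sketch (assuming an existing cluster and an already-deployed model; signatures may differ by clipper_admin version):

# Fine-grained replication on Kubernetes: scale only the hot model.
from clipper_admin import ClipperConnection, KubernetesContainerManager

clipper_conn = ClipperConnection(KubernetesContainerManager())
clipper_conn.connect()

# Other models and the Clipper core are left untouched.
clipper_conn.set_num_replicas("sum-model", num_replicas=8)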
51. How does Clipper satisfy the prediction-serving requirements?
☐ Affordable
  ☐ Hierarchical scaleout
  ☐ Latency-aware batching
52. Problem: Compute resources are expensive
• Many machine learning models require expensive compute resources to perform inference at interactive latencies:
  • Multi-core servers
  • GPUs
  • Other hardware accelerators like TPUs and FPGAs
• Models must be initialized and warmed to respond to queries in tens of milliseconds
Prediction-serving systems must maximize resource utilization to reduce cost.
53. System Optimizations
[Repeats the slide 48 diagram of the model abstraction layer and its optimizations: zero-copy RPC, caching, and batching.]
54. Batching to Improve Resource Utilization
• A single page load may generate many queries
• Why batching helps: throughput-optimized frameworks amortize per-call overhead across a batch
[Plot: throughput vs. batch size for a throughput-optimized framework; throughput grows with batch size.]
• The optimal batch size depends on:
  • hardware configuration
  • model and framework
  • system load
55. Latency-aware Batching
[Repeats the previous slide's plot and bullets.]
Clipper's solution: adaptively trade off latency and throughput
• Explore: increase the batch size until the latency objective is exceeded
• Exploit: use latency measurements from previous batches to estimate the optimal batch size for the SLO
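A toy sketch of that explore/exploit loop (additive increase plus a fitted linear latency model; the real controller is more refined):

# Latency-aware batch sizing: grow the batch until the SLO is exceeded,
# then estimate the largest batch that still fits under the SLO from the
# observed (batch size, latency) pairs.
import time

def find_batch_size(predict_batch, make_batch, slo_s):
    observations = []
    batch_size = 1
    while True:  # explore: additive increase
        start = time.perf_counter()
        predict_batch(make_batch(batch_size))
        latency = time.perf_counter() - start
        observations.append((batch_size, latency))
        if latency > slo_s:
            break
        batch_size += 1
    # exploit: fit latency ~ a * batch_size + b and solve for the SLO
    n = len(observations)
    mean_b = sum(b for b, _ in observations) / n
    mean_l = sum(l for _, l in observations) / n
    var = sum((b - mean_b) ** 2 for b, _ in observations) or 1e-9
    a = sum((b - mean_b) * (l - mean_l) for b, l in observations) / var
    intercept = mean_l - a * mean_b
    return max(1, int((slo_s - intercept) / a)) if a > 0 else batch_size

# Fake model whose latency grows ~1 ms per queued input, 20 ms SLO:
est = find_batch_size(lambda xs: time.sleep(0.001 * len(xs)),
                      lambda n: list(range(n)), slo_s=0.02)
print("estimated max batch size under SLO:", est)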
60. Clipper: a middle layer for prediction serving
• Common abstractions make Clipper manageable
• System optimizations make Clipper fast, scalable, and affordable
[Logos of supported frameworks, including VW and Caffe.]
62. Just Released: Version 0.3
Release highlights:
• Metrics and monitoring infrastructure using Prometheus
• Support for user-defined model metrics
• Support for hierarchical replication
• Several new model deployers
63. Community and Adoption
• Active collaborations with several organizations:
  • SAP
  • ScotiaBank
  • ARM
  • IBM
  • AI Singapore
• Status of the community:
  • Initial users now have Clipper deployments in production
  • 29 contributors
64. Getting Started with Clipper
Docker images are available on Docker Hub, and the Clipper admin tool is distributed as a pip package:
pip install clipper_admin
Get up and running without cloning or compiling!
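A quick-start sketch (assuming Docker is running locally; the application name and latency objective are examples, and signatures may differ by version):

# Stand up a local Clipper instance and register an application.
from clipper_admin import ClipperConnection, DockerContainerManager

clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.start_clipper()  # pulls and starts the Clipper Docker images

clipper_conn.register_application(
    name="hello-app", input_type="doubles",
    default_output="-1.0", slo_micros=100000)  # 100 ms latency objective

# After deploying a model (see the deployer sketch above), route the
# application's queries to it:
clipper_conn.link_model_to_app(app_name="hello-app", model_name="sum-model")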
66. Model Selection*
• Cross-framework model composition: improved accuracy through ensembles and contextual bandits
• Exploit multiple models to estimate confidence
• Use multi-armed bandit algorithms to learn the optimal model selection online
• Online personalization across ML frameworks
*Explored in a research prototype [NSDI 2017]
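As a hedged sketch of bandit-based model selection (plain epsilon-greedy; the research prototype explored more sophisticated bandit algorithms):

# Online model selection with an epsilon-greedy multi-armed bandit.
import random

class ModelBandit:
    def __init__(self, models, epsilon=0.1):
        self.models = models                  # name -> predict function
        self.counts = {m: 0 for m in models}
        self.rewards = {m: 0.0 for m in models}
        self.epsilon = epsilon

    def select(self):
        if random.random() < self.epsilon:    # explore a random model
            return random.choice(list(self.models))
        return max(self.models,               # exploit the best average reward
                   key=lambda m: self.rewards[m] / max(self.counts[m], 1))

    def predict(self, x):
        name = self.select()
        return name, self.models[name](x)

    def feedback(self, name, reward):
        # Fold application feedback (e.g., click, correct label) back in.
        self.counts[name] += 1
        self.rewards[name] += reward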
68. Model Composition
• Ensembles can improve accuracy
• Faster inference with prediction cascades: query a fast model first; if it is confident, return its answer, otherwise fall back to a slow but accurate model (see the sketch below)
• Faster development through model reuse: feed a pre-trained DNN into a task-specific model
• Model specialization: e.g., an object detector that invokes a face detector only when an object is detected, branching on whether a face is found
How can we efficiently support serving arbitrary model pipelines?
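A sketch of a prediction cascade (the models and confidence threshold are stand-ins):

# Prediction cascade: answer with the fast model when it is confident,
# otherwise pay for the slow but accurate model.
def cascade(x, fast_model, slow_model, threshold=0.9):
    pred, confidence = fast_model(x)
    if confidence >= threshold:
        return pred          # cheap path: most queries stop here
    return slow_model(x)     # expensive path for hard inputs

# Toy usage: the fast model is only confident on "easy" inputs.
fast = lambda x: (x > 0, 0.95 if abs(x) > 1 else 0.5)
slow = lambda x: x > 0
print(cascade(5.0, fast, slow))  # fast path
print(cascade(0.3, fast, slow))  # falls back to the slow model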
69. Model Composition
• InferLine: a research project studying resource allocation and scheduling for arbitrary model pipelines
• Investigating languages and APIs for expressing model compositions
  • Extends the work on model deployers and function shipping
70. Integration with Workflow Systems
• There are now several workflow systems for managing the ML lifecycle
• Their serving solutions tend to adopt the "model in a container" approach
• Use Clipper as a better drop-in replacement serving system
  • Initial prototype with AWS SageMaker
  • Will look at Kubeflow as well going forward