Machine Learning (ML) models are often composed as pipelines of operators, from "classical" ML operators to pre-processing and featurization operators. Current systems deploy pipelines as black boxes, where the same implementation used for training is run for inference. This solution is convenient, but it leaves substantial room for improving performance and resource usage. This talk presents PRETZEL, a framework for the deployment of ML pipelines that is inspired by Database Systems: PRETZEL inspects and optimizes pipelines end-to-end, much like queries, and manages resources that are common to multiple pipelines, such as operators' state. PRETZEL is joint work with Seoul National University and Microsoft Research and was recently presented at OSDI '18. After the overview, this talk also shows experimental results of PRETZEL against state-of-the-art ML solutions and discusses limitations and extensions.
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
1. Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
Politecnico di Milano, Milano, 19/10/2018
Yunseong Lee, Alberto Scolari, Byung-Gon Chun, Marco D. Santambrogio, Markus Weimer, Matteo Interlandi
2. ML-as-a-Service
• ML models are learnt from data during training
• They are deployed on cloud platforms for Prediction Serving
• Key requirements:
  1. Performance: latency/throughput
  2. Minimal resource usage: minimal service cost
• State-of-the-art deployment strategy: Black Box
3. Inside the black box: interesting facts
• Applications host multiple models per machine (10s-100s)
• Deployed models are often similar in structure and state
  - e.g., customer personalization, templates, transfer learning
• Inside, models are DAGs of different operators
• But with black boxes you can apply only external optimizations: caching, batching, … (a minimal caching sketch follows this list)
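As an illustration, the sketch below shows the kind of external caching a serving system can wrap around an opaque model. The IModel interface and class names are hypothetical, not PRETZEL or ML.Net code; the point is that nothing inside Predict() can be inspected or optimized.

using System.Collections.Concurrent;

// Hypothetical black-box model interface: the server only sees Predict().
public interface IModel
{
    float Predict(string input);
}

// External caching: the only kind of optimization possible without
// knowing the pipeline's structure or state.
public sealed class CachingModel : IModel
{
    private readonly IModel _inner;
    private readonly ConcurrentDictionary<string, float> _cache =
        new ConcurrentDictionary<string, float>();

    public CachingModel(IModel inner) { _inner = inner; }

    // GetOrAdd runs the black-box model only on a cache miss.
    public float Predict(string input) => _cache.GetOrAdd(input, _inner.Predict);
}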
4. Breaking the black-box model
We need to know structure and state: the PRETZEL white-box model
1. To generate an optimised version of a model at deployment time: higher performance
2. To allocate shared state only once and share resources among models: higher density
5. Outline
‣ Limitations of Black Box approaches
‣ PRETZEL, a White Box Prediction Serving System
‣ Evaluation
‣ Conclusions and Future Work
7. Case study
• 250 Sentiment Analysis models in ML.Net (C#), each run 100 times
• The first warm-up execution is cold; the 99 following executions are hot
• Long-tail latency, especially when cold: Service-Level Objectives (SLOs) cannot be ensured
• Overheads: JIT compilation, memory allocation
• Profiling shows no clear bottleneck: the ML operator LogReg accounts for only 0.3% of the runtime of simple models
8. Resource waste
• Black Box models cannot share resources
• Each model has its own container/process/thread: overhead, poor scalability
• Each model has its own state
• But many operators have similar or equal state, and models are deployed together
9. Related work
• Optimisations for single operators, like DNNs [1, 2]
• TensorFlow Serving [3] deploys models as Servable Python objects; ML.Net deploys them as zip files with state files and DLLs
• Clipper [4] and Rafiki [5] deploy pipelines as Docker containers
  – They schedule requests based on a latency target
  – They can apply caching and batching
• MauveDB [6] accepts regression and interpolation models and optimises them as DB views
• Tensor Comprehensions [7] optimizes DNN models only, via tensor abstractions

[1] Microsoft unveils Project Brainwave. https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
[2] N. P. Jouppi, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. arXiv, Apr. 2017
[3] C. Olston, et al. TensorFlow-Serving: Flexible, High-Performance ML Serving. In Workshop on ML Systems at NIPS, 2017
[4] D. Crankshaw, et al. Clipper: A Low-Latency Online Prediction Serving System. In NSDI, 2017
[5] W. Wang, et al. Rafiki: Machine Learning as an Analytics Service System. arXiv e-prints, Apr. 2018
[6] A. Deshpande and S. Madden. MauveDB: Supporting Model-Based User Views in Database Systems. In SIGMOD, 2006
[7] N. Vasilache, et al. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. arXiv, 2018
11. Design principles
White Box Prediction Serving: make pipelines co-exist better, schedule them better
1. End-to-end Optimizations: inspect models and optimise their internal execution
2. Multi-model Optimizations: share data, code and resources
12. Model optimizations
End-to-end:
1. Ahead-of-Time compilation at deployment, to minimise JIT overhead
2. Vector pooling: pre-allocate data structures (see the sketch after this list)
Multi-model:
1. Use an Object Store to share operators' parameters/weights
2. Sub-plan materialisation, to re-use intermediate results across models
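A minimal sketch of the vector-pooling idea, with hypothetical names (not the actual PRETZEL implementation): buffers are pre-allocated once and rented/returned on the hot path, so serving a request never touches the allocator or the GC.

using System.Collections.Concurrent;

// Hypothetical vector pool: fixed-size buffers are allocated up front.
public sealed class VectorPool
{
    private readonly ConcurrentBag<float[]> _buffers = new ConcurrentBag<float[]>();
    private readonly int _size;

    public VectorPool(int size, int count)
    {
        _size = size;
        for (int i = 0; i < count; i++)
            _buffers.Add(new float[size]);
    }

    // Falls back to allocation only if the pool is exhausted.
    public float[] Rent() =>
        _buffers.TryTake(out var buf) ? buf : new float[_size];

    public void Return(float[] buf) => _buffers.Add(buf);
}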
13. Off-line phase - Flour + Oven

[Figure 6 of the paper: model optimization and compilation in PRETZEL. (1) A model is translated into a Flour program. (2) The Oven Optimizer generates a DAG of logical stages from the program; additionally, parameters and statistics are extracted. (3) A DAG of physical stages is generated by the Oven Compiler using logical stages, parameters, and statistics. The model plan is the union of all these elements and is fed to the runtime.]

• One-to-one transformations (such as most featurizers) are pipelined together in a single pass over the data; this strategy achieves the best data locality because records are likely to reside in CPU registers [33, 38]
• In the Sentiment Analysis example, Oven recognizes that the Linear Regression can be pushed into CharNgram and WordNgram, bypassing the execution of Concat; Tokenizer can be reused between CharNgram and WordNgram, so it is pipelined with CharNgram (in one stage) and a dependency between CharNgram and WordNgram (in another stage) is created; the final plan therefore has fewer stages than the initial 4 operators (and vectors) of ML.Net
• Model Plan Compiler: model plans are composed of two DAGs, a DAG of logical stages and a DAG of physical stages; logical stages are an abstraction of the stages output by the Oven Optimizer, with related parameters, while physical stages contain the actual code that will be executed by the PRETZEL runtime
• For each given DAG there is a 1-to-n mapping between logical and physical stages, so that a logical stage can represent the execution code of different physical implementations; the physical implementation is selected based on the parameters characterizing the logical stage
• Plan compilation is a two-step process

Flour API:
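The slide shows only a truncated Flour program; it looks approximately like the example below, paraphrased from the paper's Sentiment Analysis example. Flour is not publicly released, so names and signatures are illustrative only, and the "..." elisions are the paper's own.

// Paraphrased sketch of a Flour program (a LINQ-style C# API):
// featurizers and the classifier are chained, then Plan() hands the
// program to Oven for optimization and compilation.
var fContext = new FlourContext(objectStore, ...);
var tTokenizer = fContext.CSV
                 .FromText(fields, fieldsType, sep)
                 .Tokenize();
var tCNgram = tTokenizer.CharNgram(numCNgrams, ...);
var tWNgram = tTokenizer.WordNgram(numWNgrams, ...);
var fPrgm = tCNgram.Concat(tWNgram)
            .ClassifierBinaryLinear(cParams);
return fPrgm.Plan();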
14. Oven optimizations
• Oven optimizes models much like DB queries
• It uses a rule-based optimiser (a sketch follows the steps below) that:
  – repeatedly looks for patterns of operators within the model DAG
  – merges matching operators into a stage
• Optimization steps, as illustrated in the slide's figure:
  1. Initial model DAG
  2. Push the linear predictor back and remove Concat
  3. Apply rules and group operators into stages
  4. Add statistics and create the Model Plan
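To make the rule-based approach concrete, here is a hypothetical sketch (not Oven's actual code, and simplified to a linear chain of operators rather than a DAG): each rule pattern-matches operators and rewrites them, and the optimizer applies its rules until a fixpoint is reached.

using System.Collections.Generic;

// Hypothetical operator node; the real Oven works on ML.Net transforms.
public abstract class Operator { }

// A rule matches a pattern of operators and, when it fires, returns a
// rewritten chain (e.g., several one-to-one featurizers merged into one stage).
public interface IRewriteRule
{
    bool TryRewrite(List<Operator> ops, out List<Operator> rewritten);
}

public sealed class RuleBasedOptimizer
{
    private readonly IReadOnlyList<IRewriteRule> _rules;

    public RuleBasedOptimizer(IReadOnlyList<IRewriteRule> rules)
    {
        _rules = rules;
    }

    // Keep applying rules until none fires any more (fixpoint).
    public List<Operator> Optimize(List<Operator> ops)
    {
        bool changed;
        do
        {
            changed = false;
            foreach (var rule in _rules)
            {
                if (rule.TryRewrite(ops, out var next))
                {
                    ops = next;
                    changed = true;
                }
            }
        } while (changed);
        return ops;
    }
}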
15. On-line phase
• Two main components:
  – a Runtime, with an Object Store
  – a Scheduler
• The Runtime handles physical resources: threads and buffers
• The Object Store caches objects across all models
  – models register and retrieve state objects via a key, such as the MD5 hash of the state file (see the sketch below)
• The Scheduler is event-based, each stage being an event
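A minimal sketch of the Object Store idea, with hypothetical names: models register heavy state objects (e.g., featurizer dictionaries or model weights) under a content-derived key such as an MD5 hash, so identical state is allocated only once and shared across models.

using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;

// Hypothetical object store: state objects are keyed by a content hash,
// so two models with identical state share a single allocation.
public sealed class ObjectStore
{
    private readonly ConcurrentDictionary<string, object> _objects =
        new ConcurrentDictionary<string, object>();

    // Example key: MD5 hash of the serialized state file.
    public static string KeyFor(byte[] content)
    {
        using (var md5 = MD5.Create())
            return Convert.ToBase64String(md5.ComputeHash(content));
    }

    // Loads the object only if no model registered this key before.
    public T GetOrRegister<T>(string key, Func<T> load) where T : class
    {
        return (T)_objects.GetOrAdd(key, _ => load());
    }
}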
17. Workload and testbed
• Two model classes written in ML.Net, running on ML.Net and PRETZEL:
  – 250 Sentiment Analysis (SA) models
  – 250 Attendee Count (AC) models
• Testbed representing a small production server:
  – 2× 8-core Xeon E5-2620 v4 at 2.10 GHz, HyperThreading disabled
  – 32 GB RAM
  – Windows 10
  – .Net Core 2.0
18. Memory
• Experiments with all 250 AC models, which are smaller than the SA models
• With the SA models, only PRETZEL can load all of them

Setting                       | Shared Objects | Shared Runtime
ML.Net + Clipper              |                |
ML.Net                        |                | ✓
PRETZEL without ObjectStore   |                | ✓
PRETZEL                       | ✓              | ✓
19. Latency
• Micro-benchmark with the stand-alone system, no communication
• All 250 SA models

Latency (ms)   | ML.Net | PRETZEL
P99 (hot)      | 0.6    | 0.2
P99 (cold)     | 8.1    | 0.8
Worst (cold)   | 280.2  | 6.2
20. Throughput
• 250 AC models, each run 1000 times
• 1000 queries per batch
• ML.Net vs PRETZEL
22. Conclusions and Limitations
• We addressed performance/density bottlenecks in ML inference for Model-as-a-Service
• We advocate the adoption of a white-box approach
• We apply DB query-optimization techniques to ML Prediction Serving
• The work was accepted at OSDI '18
• Limitations:
  - PRETZEL currently supports a subset of ML.Net operators
  - No NN operators
  - No automated code generation: implementing stages still involves some manual work
23. Future Work 1
• NUMA-aware Scheduler and Runtime
• Fully automated code generation of stages:
  - hardware-specific templates [8]
  - a Halide-based generator for CPU and GPU: no JIT anymore
• Support user-coded operators for filtering and pre-processing

[8] K. Krikellas, S. Viglas, et al. Generating Code for Holistic Query Evaluation. In ICDE, pages 613-624. IEEE Computer Society, 2010
24. Future Work 2
• Supporting all ML.Net operators, as well as ONNX [9], is complex
• It is not just a matter of manpower: Oven rules need to scale fairly with the number of operators
  - We cannot write rules for all possible (sequences of) operators
• We need a formal framework to describe operators
  - Something like what Relational Algebra is for query optimisers
  - Maybe Tensor Algebra, as in Tensor Comprehensions [10]?

QUESTIONS?
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it

[9] Open Neural Network Exchange (ONNX). https://onnx.ai, 2017
[10] Announcing Tensor Comprehensions. https://research.fb.com/announcing-tensor-comprehensions/

Y. Lee, A. Scolari, B.-G. Chun, M. D. Santambrogio, M. Weimer, M. Interlandi
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
https://arxiv.org/abs/1810.06115