Monitoring AI applications with AI
The best-performing offline algorithm can still lose in production. The most accurate model does not always improve business metrics. Environment misconfiguration or an inconsistent upstream data pipeline can silently kill model performance. And neither prodops, data science, nor engineering teams alone have the skills to detect, monitor, and debug these kinds of incidents.
Could Microsoft have tested the Tay chatbot in advance, and then monitored and adjusted it continuously in production, to prevent its unexpected behaviour? Real mission-critical AI systems require an advanced monitoring and testing ecosystem that enables continuous, reliable delivery of machine learning models and data pipelines into production. Common production incidents include:
Data drift, new data, wrong features
Vulnerability issues, malicious users
Concept drift
Model degradation
Biased training set / training issues
Performance issues
In this demo-based talk we discuss a solution, tooling, and architecture that allow a machine learning engineer to be involved in the delivery phase and take ownership of the deployment and monitoring of machine learning pipelines.
It allows data scientists to safely deploy early results as end-to-end AI applications in a self-serve mode, without assistance from engineering and operations teams. It shifts the experimentation and even training phases from offline datasets to live production, and closes the feedback loop between research and production.
The technical part of the talk covers the following topics:
Automatic Data Profiling
Anomaly Detection
Clustering of inputs and outputs of the model
A/B Testing
Service Mesh, Envoy Proxy, traffic shadowing
Stateless and stateful models
Monitoring of regression, classification and prediction models
2. About
Mission: Accelerate Machine Learning to Production
Open-source products:
- ML Lambda: ML Deployment and Serving
- Sonar: Data and ML Monitoring
- Mist: Serverless proxy for Spark
Business model: PaaS and hands-on consulting
3. Traditional Software vs. Machine Learning Applications
Traditional Software                 | Machine Learning Applications
Explicit business rules              | ML-generated model
Unit testing                         | Model evaluation
(Micro)service                       | Model as a Service
Docker per service                   | Docker per model
1 version of a microservice in prod  | 1-10-20 model versions in prod at a time
Eng + QA team owning a service       | 1 ML engineer owning 10-20 models
Fail loudly (exception, stack trace) | Fail silently
Can work forever if verified         | Performance declines over time; needs continuous retraining/redeployment
App metrics monitoring               | Data monitoring, model metrics monitoring
11. Where/why may AI fail in prod?
● Bad training data
● Bad serving data
● Training/serving data skew
● Misconfiguration
● Deployment issue
● Retraining issue
● Performance
● Concept Drift
Everywhere!
17. Model Deployment takeaways
● Eliminates hand-offs: Data Scientist -> ML Eng -> Data Eng -> SA Eng -> QA -> Ops
● Sticks components together: Data + Model + Applications + Automation = AI Application
● Enables a quick transition from research to production: ML engineers can deploy models many times a day
But wait… this is not safe! How do we ensure we won't break things in prod?
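One answer previewed in the agenda is traffic shadowing. Below is a minimal application-level sketch in Python with hypothetical endpoints; in production this mirroring is usually done by the service mesh (e.g., Envoy) rather than in application code:

# Sketch of traffic shadowing: the prod model keeps answering while the same
# request is mirrored to a candidate model whose output is only logged.
import concurrent.futures
import logging

import requests  # standard HTTP client

PROD_URL = "http://models/prod/predict"        # hypothetical endpoints
SHADOW_URL = "http://models/candidate/predict"

log = logging.getLogger("shadow")

def predict(features: dict) -> dict:
    """Serve from prod; mirror the same request to the candidate model."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        shadow_future = pool.submit(requests.post, SHADOW_URL, json=features, timeout=1)
        prod_response = requests.post(PROD_URL, json=features, timeout=1).json()
        try:
            shadow_response = shadow_future.result().json()
            if shadow_response != prod_response:
                log.info("shadow diff: prod=%s candidate=%s", prod_response, shadow_response)
        except Exception as exc:  # the shadow path must never break prod traffic
            log.warning("shadow call failed: %s", exc)
    return prod_response  # only the prod answer reaches the caller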
22-24. Data exploration in production
Research: the data scientist explores datasets and makes assumptions/hypotheses based on the results of data exploration.
Push to prod.
Production: the model works if and only if the format and statistical properties of prod data are the same as in research.
Continuous data exploration and validation?
25. Automatic Data Profiling
● An Avro/Protobuf schema can catch data format drift
● Statistical properties of input features are to be captured and continuously validated
{
  "name": "User",
  "fields": [
    {"name": "name", "type": "string", "min_length": 2, "max_length": 128},
    {"name": "age", "type": ["int", "null"], "range": "[10, 100]"},
    {"name": "sex", "type": ["string", "null"], "enum": "[male, female, ...]"},
    {"name": "wage", "type": ["int", "null"], "validator": "a-distance"}
  ]
}
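As an illustration, here is a hedged Python sketch of how a validator for the extended schema above might check a production record. Field names and bounds come from the schema; the enum is truncated there, and the a-distance check for wage needs a reference sample, so it is omitted:

# Format checks catch schema drift; range/enum checks catch out-of-range values.
def validate_user(record: dict) -> list[str]:
    errors = []
    name = record.get("name")
    if not isinstance(name, str) or not (2 <= len(name) <= 128):
        errors.append("name: must be a string of length 2..128")
    age = record.get("age")
    if age is not None and not (isinstance(age, int) and 10 <= age <= 100):
        errors.append("age: must be an int in [10, 100] or null")
    sex = record.get("sex")
    if sex is not None and sex not in {"male", "female"}:  # enum truncated in the schema
        errors.append("sex: not in the allowed enum")
    return errors

print(validate_user({"name": "A", "age": 7, "sex": "other"}))  # three violations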
27. How to deal with:
- multidimensional datasets
- data timeliness
- data completeness
- image data
- complicated seasonality?
29. Anomaly detection
● Rule-based programs -> statistical models -> machine learning models
● Deal with multidimensional datasets, timeliness, and complicated seasonality
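As a minimal illustration of the "statistical models" stage, here is a sketch of a rolling robust z-score detector; the window size and threshold are illustrative, and seasonality-aware detection would need a dedicated model on top:

# Rolling robust z-score (median/MAD) over a stream of scalar values.
import math
import statistics
from collections import deque

class RollingAnomalyDetector:
    def __init__(self, window: int = 100, threshold: float = 4.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, x: float) -> bool:
        """Return True if x looks anomalous relative to the recent window."""
        anomalous = False
        if len(self.window) >= 10:
            med = statistics.median(self.window)
            mad = statistics.median(abs(v - med) for v in self.window) or 1e-9
            anomalous = abs(x - med) / (1.4826 * mad) > self.threshold
        self.window.append(x)
        return anomalous

det = RollingAnomalyDetector()
for i in range(200):
    det.observe(10 + math.sin(i / 5.0))  # in-pattern traffic: not flagged
print(det.observe(42.0))                 # True: far outside the recent window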
30. Model Monitoring Metrics on streaming data
● System metrics (latency/throughput)
● Kolmogorov-Smirnov
● Q-Q plot, t-digest
● Spearman and Pearson correlations
● Density-based clustering algorithms with the Elbow or Silhouette method
● Deep autoencoders
● Generative Adversarial Networks
● Random Cut Forest (AWS paper)
● "Bring your own" metric
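For instance, a hedged sketch of the first statistical test in the list: a two-sample Kolmogorov-Smirnov check of a production window against the training sample. The data here is a synthetic stand-in, and the alert threshold is a deployment-specific choice:

# Two-sample KS test: does the production window match the training distribution?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
training_sample = rng.normal(loc=0.0, scale=1.0, size=5000)   # stand-in data
production_window = rng.normal(loc=0.4, scale=1.0, size=1000)  # drifted mean

statistic, p_value = stats.ks_2samp(training_sample, production_window)
if p_value < 0.01:  # illustrative alert threshold
    print(f"possible drift: KS={statistic:.3f}, p={p_value:.2e}")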
31. GANs for monitoring data quality at serving time
[Diagram: a GAN discriminator scores production inputs as "good" vs. "drift (fake)".]
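The discriminator idea can be approximated with a lighter-weight classifier two-sample test: train any classifier to separate training rows from production rows; an AUC near 0.5 means the distributions match, near 1.0 means drift. A sketch of this stand-in (not the talk's GAN implementation; the scikit-learn model choice is arbitrary):

# Domain-classifier drift test: a discriminator between training and prod data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def drift_auc(train_X: np.ndarray, prod_X: np.ndarray) -> float:
    X = np.vstack([train_X, prod_X])
    y = np.r_[np.zeros(len(train_X)), np.ones(len(prod_X))]  # 0 = train, 1 = prod
    X_fit, X_eval, y_fit, y_eval = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = GradientBoostingClassifier().fit(X_fit, y_fit)
    return roc_auc_score(y_eval, clf.predict_proba(X_eval)[:, 1])

rng = np.random.default_rng(1)
auc = drift_auc(rng.normal(size=(800, 4)), rng.normal(loc=0.5, size=(400, 4)))
print(f"drift AUC: {auc:.2f}")  # ~0.5 would mean no detectable drift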
32. Model server = Metadata + Model Artifact + Runtime + Deps + Sidecar + Training Metadata
[Diagram: a gRPC/HTTP model server exposes /predict (JVM runtime with DL4j/TF/other on CPU/GPU, model v2); a sidecar receives the serving requests and holds training data stats (min/max, range, clusters, quantiles, autoencoder) to compare with prod data at runtime.]
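A sketch of the sidecar's runtime check under these assumptions: quantiles captured at training time are shipped alongside the model artifact and compared against a window of serving requests. The stored stats, feature name, and threshold are illustrative:

# Compare quantiles of a window of production inputs with training-time quantiles.
import numpy as np

# Assumed to be captured at training time and shipped with the model artifact.
TRAINING_QUANTILES = {"age": np.array([18.0, 27.0, 34.0, 45.0, 61.0])}
LEVELS = [0.05, 0.25, 0.50, 0.75, 0.95]

def quantile_gap(feature: str, window: np.ndarray) -> float:
    """Max absolute gap between training and production quantiles."""
    prod_q = np.quantile(window, LEVELS)
    return float(np.max(np.abs(prod_q - TRAINING_QUANTILES[feature])))

# Called by the sidecar on each batch of serving requests:
if quantile_gap("age", np.array([51, 63, 58, 70, 66, 75])) > 10.0:
    print("age distribution shifted: raise an alert")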
33. Change of the Paradigm
Shifts experimentation to the prod/shadowed environment.
35. Use Case: Monitoring NLU system
Figure from: Bapna, Ankur, et al. "Towards zero-shot frame semantic parsing for domain scaling." arXiv preprint arXiv:1707.02363 (2017).
36. Use Case: Monitoring NLU system
Source image: Kurata, Gakuto, et al. "Leveraging sentence-level information with encoder LSTM for semantic slot filling." arXiv preprint arXiv:1601.01530 (2016).
● Train and test offline on the restaurants domain
● Deploy to prod
● Feed the model with new random Wiki data
● Monitor intermediate input representations (neural network hidden states)
37. Use Case: Monitoring NLU system
● Red and purple: clusters of "bad" production data
● Yellow and blue: dev and test data
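A hedged sketch of how such a picture could be produced: project the model's hidden states to 2D and measure how far production points fall from clusters fitted on dev/test data. Shapes, cluster count, and the threshold are assumptions, not the talk's exact pipeline:

# Cluster hidden-state embeddings of dev/test data; flag prod points far from
# every known cluster as out-of-domain inputs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dev_hidden = rng.normal(size=(500, 256))          # hidden states on dev/test data
prod_hidden = rng.normal(size=(200, 256)) + 3.0   # shifted: out-of-domain inputs

pca = PCA(n_components=2).fit(dev_hidden)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pca.transform(dev_hidden))

# Distance of each production point to its nearest dev/test cluster centre.
dists = np.min(kmeans.transform(pca.transform(prod_hidden)), axis=1)
print(f"{np.mean(dists > 3.0):.0%} of prod inputs fall outside known clusters")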
39. Drift Handling
● Unexpected or dramatic drift? Alert and add an ML/Data Engineer into the loop.
● Expected drift? Retrain.
Open question, to be solved with ML: classify expected vs. unexpected drift.
40-41. Model Retraining - common questions
When to retrain? When/how to push to prod safely? What data to retrain with?
● Manually, on demand: works well for one model, but does not scale.
● Automatically, with the latest batch: not safe, can be expensive, and the latest batch may not be representative.
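A middle ground, sketched below with stub components, is to retrain automatically but gate promotion behind a frozen holdout and a shadow stage. Every name here is a placeholder, not a prescribed implementation:

# Gated automatic retraining: the stubs stand in for real training, evaluation,
# and shadow-traffic comparison; only the gating logic is the point.
PROD_BASELINE_SCORE = 0.90

def train(data): return {"model": "candidate", "data": data}
def evaluate(model, holdout): return 0.93   # stub: score on a frozen holdout set
def shadow_metrics_ok(model): return True   # stub: compare outputs on mirrored traffic

def retraining_pipeline(drift_alert: bool) -> str:
    """Retrain on drift, but gate promotion behind holdout + shadow checks."""
    if not drift_alert:
        return "no-op"
    candidate = train("latest representative batch")
    if evaluate(candidate, holdout="frozen holdout set") < PROD_BASELINE_SCORE:
        return "alert: candidate underperforms the prod baseline"  # human in the loop
    if not shadow_metrics_ok(candidate):
        return "alert: candidate diverges on shadowed traffic"
    return "promote candidate (gradual rollout / A-B test)"

print(retraining_pipeline(drift_alert=True))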