Monitoring AI applications with AI
The best-performing offline algorithm can still lose in production. The most accurate model does not always improve business metrics. Environment misconfiguration or an inconsistent upstream data pipeline can silently kill model performance. And neither prodops, data science, nor engineering teams alone have the skills to detect, monitor, and debug these kinds of incidents.
Could Microsoft have tested the Tay chatbot in advance, and then monitored and adjusted it continuously in production, to prevent its unexpected behaviour? Real mission-critical AI systems require an advanced monitoring and testing ecosystem that enables continuous, reliable delivery of machine learning models and data pipelines into production. Common production incidents include:
Data drift, new data, wrong features
Vulnerability issues, malicious users
Concept drift
Model degradation
Biased training set / training issues
Performance issues
In this demo-based talk we discuss a solution, tooling, and architecture that allow a machine learning engineer to be involved in the delivery phase and take ownership of the deployment and monitoring of machine learning pipelines.
It allows data scientists to safely deploy early results as end-to-end AI applications in a self-serve mode, without assistance from engineering and operations teams. It shifts the experimentation and even training phases from offline datasets to live production, and closes the feedback loop between research and production.
The technical part of the talk covers the following topics:
Automatic Data Profiling
Anomaly Detection
Clustering of inputs and outputs of the model
A/B Testing
Service Mesh, Envoy Proxy, traffic shadowing
Stateless and stateful models
Monitoring of regression, classification and prediction models
2. About
Mission: Accelerate Machine Learning to Production
Open-source products:
- ML Lambda: ML Deployment and Serving
- Sonar: Data and ML Monitoring
- Mist: Serverless proxy for Spark
Business model: PaaS and hands-on consulting
3. Traditional Software vs. Machine Learning Applications
Traditional Software                 | Machine Learning Applications
Explicit business rules              | ML-generated model
Unit testing                         | Model evaluation
(Micro)service                       | Model as a Service
Docker per service                   | Docker per model
1 version of a microservice in prod  | 1-10-20 model versions in prod at a time
Eng + QA team owning a service       | 1 ML engineer owning 10-20 models
Fail loudly (exception, stack trace) | Fail silently
Can work forever if verified         | Performance declines over time; needs continuous retraining/redeployment
App metrics monitoring               | Data monitoring, model metrics monitoring
11. Where/why may AI fail in prod?
● Bad training data
● Bad serving data
● Training/serving data skew
● Misconfiguration
● Deployment issue
● Retraining issue
● Performance
● Concept Drift
Everywhere!
17. Model Deployment takeaways
● Eliminates hand-offs: Data Scientist -> ML Eng -> Data Eng -> SA Eng -> QA -> Ops
● Sticks components together: Data + Model + Applications + Automation = AI Application
● Enables a quick transition from research to production: ML engineers can deploy models many times a day
But wait… this is not safe! How do we ensure we won't break things in prod?
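One answer previewed in the agenda is traffic shadowing. Below is a minimal application-level sketch in Python with hypothetical endpoints; in production this mirroring is usually done by the service mesh (e.g., Envoy) rather than in application code:

# Sketch of traffic shadowing: the prod model keeps answering while the same
# request is mirrored to a candidate model whose output is only logged.
import concurrent.futures
import logging

import requests  # standard HTTP client

PROD_URL = "http://models/prod/predict"        # hypothetical endpoints
SHADOW_URL = "http://models/candidate/predict"

log = logging.getLogger("shadow")

def predict(features: dict) -> dict:
    """Serve from prod; mirror the same request to the candidate model."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        shadow_future = pool.submit(requests.post, SHADOW_URL, json=features, timeout=1)
        prod_response = requests.post(PROD_URL, json=features, timeout=1).json()
        try:
            shadow_response = shadow_future.result().json()
            if shadow_response != prod_response:
                log.info("shadow diff: prod=%s candidate=%s", prod_response, shadow_response)
        except Exception as exc:  # the shadow path must never break prod traffic
            log.warning("shadow call failed: %s", exc)
    return prod_response  # only the prod answer reaches the caller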
22-24. Data exploration in production
Research: the data scientist explores datasets and makes assumptions/hypotheses based on the results of data exploration.
Push to prod.
Production: the model works if and only if the format and statistical properties of prod data are the same as in research.
Continuous data exploration and validation?
25. Automatic Data Profiling
● An Avro/Protobuf schema can catch data format drift
● Statistical properties of input features are to be captured and continuously validated
{
  "name": "User",
  "fields": [
    {"name": "name", "type": "string", "min_length": 2, "max_length": 128},
    {"name": "age", "type": ["int", "null"], "range": "[10, 100]"},
    {"name": "sex", "type": ["string", "null"], "enum": "[male, female, ...]"},
    {"name": "wage", "type": ["int", "null"], "validator": "a-distance"}
  ]
}
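As an illustration, here is a hedged Python sketch of how a validator for the extended schema above might check a production record. Field names and bounds come from the schema; the enum is truncated there, and the a-distance check for wage needs a reference sample, so it is omitted:

# Format checks catch schema drift; range/enum checks catch out-of-range values.
def validate_user(record: dict) -> list[str]:
    errors = []
    name = record.get("name")
    if not isinstance(name, str) or not (2 <= len(name) <= 128):
        errors.append("name: must be a string of length 2..128")
    age = record.get("age")
    if age is not None and not (isinstance(age, int) and 10 <= age <= 100):
        errors.append("age: must be an int in [10, 100] or null")
    sex = record.get("sex")
    if sex is not None and sex not in {"male", "female"}:  # enum truncated in the schema
        errors.append("sex: not in the allowed enum")
    return errors

print(validate_user({"name": "A", "age": 7, "sex": "other"}))  # three violations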
27. How to deal with:
- multidimensional datasets
- data timeliness
- data completeness
- image data
- complicated seasonality?
29. Anomaly detection
● Rule-based programs -> statistical models -> machine learning models
● Deal with multidimensional datasets, timeliness, and complicated seasonality
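As a minimal illustration of the "statistical models" stage, here is a sketch of a rolling robust z-score detector; the window size and threshold are illustrative, and seasonality-aware detection would need a dedicated model on top:

# Rolling robust z-score (median/MAD) over a stream of scalar values.
import math
import statistics
from collections import deque

class RollingAnomalyDetector:
    def __init__(self, window: int = 100, threshold: float = 4.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, x: float) -> bool:
        """Return True if x looks anomalous relative to the recent window."""
        anomalous = False
        if len(self.window) >= 10:
            med = statistics.median(self.window)
            mad = statistics.median(abs(v - med) for v in self.window) or 1e-9
            anomalous = abs(x - med) / (1.4826 * mad) > self.threshold
        self.window.append(x)
        return anomalous

det = RollingAnomalyDetector()
for i in range(200):
    det.observe(10 + math.sin(i / 5.0))  # in-pattern traffic: not flagged
print(det.observe(42.0))                 # True: far outside the recent window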
30. Model Monitoring Metrics on streaming data
● System metrics (latency/throughput)
● Kolmogorov-Smirnov
● Q-Q plot, t-digest
● Spearman and Pearson correlations
● Density-based clustering algorithms with the Elbow or Silhouette method
● Deep autoencoders
● Generative Adversarial Networks
● Random Cut Forest (AWS paper)
● "Bring your own" metric
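For instance, a hedged sketch of the first statistical test in the list: a two-sample Kolmogorov-Smirnov check of a production window against the training sample. The data here is a synthetic stand-in, and the alert threshold is a deployment-specific choice:

# Two-sample KS test: does the production window match the training distribution?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
training_sample = rng.normal(loc=0.0, scale=1.0, size=5000)   # stand-in data
production_window = rng.normal(loc=0.4, scale=1.0, size=1000)  # drifted mean

statistic, p_value = stats.ks_2samp(training_sample, production_window)
if p_value < 0.01:  # illustrative alert threshold
    print(f"possible drift: KS={statistic:.3f}, p={p_value:.2e}")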
31. GANs for monitoring data quality at serving time
[Diagram: a GAN discriminator scores production inputs as "good" vs. "drift (fake)".]
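The discriminator idea can be approximated with a lighter-weight classifier two-sample test: train any classifier to separate training rows from production rows; an AUC near 0.5 means the distributions match, near 1.0 means drift. A sketch of this stand-in (not the talk's GAN implementation; the scikit-learn model choice is arbitrary):

# Domain-classifier drift test: a discriminator between training and prod data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def drift_auc(train_X: np.ndarray, prod_X: np.ndarray) -> float:
    X = np.vstack([train_X, prod_X])
    y = np.r_[np.zeros(len(train_X)), np.ones(len(prod_X))]  # 0 = train, 1 = prod
    X_fit, X_eval, y_fit, y_eval = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = GradientBoostingClassifier().fit(X_fit, y_fit)
    return roc_auc_score(y_eval, clf.predict_proba(X_eval)[:, 1])

rng = np.random.default_rng(1)
auc = drift_auc(rng.normal(size=(800, 4)), rng.normal(loc=0.5, size=(400, 4)))
print(f"drift AUC: {auc:.2f}")  # ~0.5 would mean no detectable drift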
32. Model server = Metadata + Model Artifact + Runtime + Deps + Sidecar + Training Metadata
[Diagram: a gRPC/HTTP model server exposes /predict (JVM runtime with DL4j/TF/other on CPU/GPU, model v2); a sidecar receives the serving requests and holds training data stats (min/max, range, clusters, quantiles, autoencoder) to compare with prod data at runtime.]
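A sketch of the sidecar's runtime check under these assumptions: quantiles captured at training time are shipped alongside the model artifact and compared against a window of serving requests. The stored stats, feature name, and threshold are illustrative:

# Compare quantiles of a window of production inputs with training-time quantiles.
import numpy as np

# Assumed to be captured at training time and shipped with the model artifact.
TRAINING_QUANTILES = {"age": np.array([18.0, 27.0, 34.0, 45.0, 61.0])}
LEVELS = [0.05, 0.25, 0.50, 0.75, 0.95]

def quantile_gap(feature: str, window: np.ndarray) -> float:
    """Max absolute gap between training and production quantiles."""
    prod_q = np.quantile(window, LEVELS)
    return float(np.max(np.abs(prod_q - TRAINING_QUANTILES[feature])))

# Called by the sidecar on each batch of serving requests:
if quantile_gap("age", np.array([51, 63, 58, 70, 66, 75])) > 10.0:
    print("age distribution shifted: raise an alert")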
33. Change of the Paradigm
Shifts experimentation to the prod/shadowed environment.
35. Use Case: Monitoring NLU system
Figure from: Bapna, Ankur, et al. "Towards zero-shot frame semantic parsing for domain scaling." arXiv preprint arXiv:1707.02363 (2017).
36. Use Case: Monitoring NLU system
Source image: Kurata, Gakuto, et al. "Leveraging sentence-level information with encoder LSTM for semantic slot filling." arXiv preprint arXiv:1601.01530 (2016).
● Train and test offline on the restaurants domain
● Deploy to prod
● Feed the model with new random Wiki data
● Monitor intermediate input representations (neural network hidden states)
37. Use Case: Monitoring NLU system
● Red and purple: clusters of "bad" production data
● Yellow and blue: dev and test data
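A hedged sketch of how such a picture could be produced: project the model's hidden states to 2D and measure how far production points fall from clusters fitted on dev/test data. Shapes, cluster count, and the threshold are assumptions, not the talk's exact pipeline:

# Cluster hidden-state embeddings of dev/test data; flag prod points far from
# every known cluster as out-of-domain inputs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dev_hidden = rng.normal(size=(500, 256))          # hidden states on dev/test data
prod_hidden = rng.normal(size=(200, 256)) + 3.0   # shifted: out-of-domain inputs

pca = PCA(n_components=2).fit(dev_hidden)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pca.transform(dev_hidden))

# Distance of each production point to its nearest dev/test cluster centre.
dists = np.min(kmeans.transform(pca.transform(prod_hidden)), axis=1)
print(f"{np.mean(dists > 3.0):.0%} of prod inputs fall outside known clusters")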
39. Drift Handling
● Unexpected or dramatic drift? Alert and add an ML/Data Engineer into the loop.
● Expected drift? Retrain.
Open question, to be solved with ML: classify expected vs. unexpected drift.
40-41. Model Retraining - common questions
When to retrain? When/how to push to prod safely? What data to retrain with?
● Manually, on demand: works well for one model, but does not scale.
● Automatically, with the latest batch: not safe, can be expensive, and the latest batch may not be representative.
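A middle ground, sketched below with stub components, is to retrain automatically but gate promotion behind a frozen holdout and a shadow stage. Every name here is a placeholder, not a prescribed implementation:

# Gated automatic retraining: the stubs stand in for real training, evaluation,
# and shadow-traffic comparison; only the gating logic is the point.
PROD_BASELINE_SCORE = 0.90

def train(data): return {"model": "candidate", "data": data}
def evaluate(model, holdout): return 0.93   # stub: score on a frozen holdout set
def shadow_metrics_ok(model): return True   # stub: compare outputs on mirrored traffic

def retraining_pipeline(drift_alert: bool) -> str:
    """Retrain on drift, but gate promotion behind holdout + shadow checks."""
    if not drift_alert:
        return "no-op"
    candidate = train("latest representative batch")
    if evaluate(candidate, holdout="frozen holdout set") < PROD_BASELINE_SCORE:
        return "alert: candidate underperforms the prod baseline"  # human in the loop
    if not shadow_metrics_ok(candidate):
        return "alert: candidate diverges on shadowed traffic"
    return "promote candidate (gradual rollout / A-B test)"

print(retraining_pipeline(drift_alert=True))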