Why APM Is Not the Same As ML Monitoring

Databricks
ML Monitoring is not APM
Cory A. Johannsen
Product Engineer, Verta Inc.
www.verta.ai
Agenda
▴ What is APM?
▴ What is ML monitoring?
▴ How ML monitoring and APM differ
▴ The unique needs of ML monitoring
▴ A very cool solution to model monitoring from Verta
About
https://www.verta.ai/product
- End-to-end MLOps platform for ML
model delivery, operations and
management
- Kubernetes-based, operations stack
for ML
- 23 years as a software engineer
- Embedded systems, enterprise
software, SaaS
- 6 years in APM working at scale
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
What is APM?
▴ Application Performance Monitoring
▴ Metrics
○ Name
○ Value
○ Labels
○ Timestamp
▴ Visualization
▴ Alerting
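The four metric fields listed above can be pictured as a small record type. This is a minimal sketch, not any particular APM vendor's data model; the class name `MetricSample`, the helper `matches`, and the example metric name and labels are all hypothetical.

```python
from dataclasses import dataclass, field
import time


@dataclass
class MetricSample:
    """One APM data point: name, value, labels, timestamp (hypothetical type)."""
    name: str
    value: float
    labels: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


def matches(sample: MetricSample, **wanted) -> bool:
    """Return True if the sample carries every requested label value."""
    return all(sample.labels.get(k) == v for k, v in wanted.items())


# Illustrative values only; the metric name and labels are made up.
sample = MetricSample(
    name="http_request_latency_ms",
    value=42.0,
    labels={"service": "checkout", "region": "us-east"},
    timestamp=1_700_000_000.0,
)
```

Labels are what make visualization and alerting practical: the same metric name can be filtered or grouped by service, region, or any other key attached at emit time.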
What do I care about monitoring in APM?
▴ Health
▴ Availability
▴ Performance
▴ Stability
▴ Notification
APM in practice
▴ Production operations
▴ Diagnostics and debugging
▴ Critical incident response
What is Model Monitoring?
▴ Know when models are failing
▴ Quickly find the root cause
▴ Close the loop by fast recovery
Ensuring model results are
consistently of high quality
*We refer to all latency, throughput etc. as model service health
Know when a model fails
▴ Without ground truth, model failures are challenging to detect
▴ Need to monitor complex statistical summaries
▴ Distributions, anomalies, missing values, quantiles, etc.
▴ Often model-specific
Quickly find the root cause
▴ A model is one part of an inference pipeline
▴ Need a global view of the pipeline jungle to see where the root issue may be
Close the loop
▴ Intelligent detection and alerting to pre-emptively identify issues and trigger remediations
▴ Execute retrains, fallback models, and human intervention
How APM and ML monitoring align
▴ Error rate, Throughput, Latency
○ You need to know your production systems are
operational
▴ Visualization
○ You need to see change over time
▴ Alerting
○ You need to know when
something has gone wrong
(and only when something
has gone wrong)
What do you care about in ML Monitoring?
▴ Distribution
○ Training versus test
○ Iteration over iteration
○ Live prediction
▴ Drift
○ Change in Distribution over
time
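One way to put a number on "change in distribution over time" is to bin a reference sample and a live sample the same way and compare the bin fractions. The sketch below uses the Population Stability Index (PSI) as the comparison; that choice, the function names, the bin edges, and the sample data are assumptions for illustration, not something the talk prescribes.

```python
import math


def bin_fractions(values, edges):
    """Fraction of values per bin; out-of-range values go to the first or last bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Count how many interior edges the value has reached to find its bin.
        i = sum(1 for e in edges[1:-1] if v >= e)
        counts[i] += 1
    return [c / len(values) for c in counts]


def psi(reference, live, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(live, edges)
    total = 0.0
    for r, c in zip(ref, cur):
        # Floor empty bins at eps so the logarithm stays defined.
        r, c = max(r, eps), max(c, eps)
        total += (c - r) * math.log(c / r)
    return total


edges = [0.0, 1.0, 2.0, 3.0]
training = [0.5, 1.5, 2.5, 0.5, 1.5, 2.5]   # hypothetical reference sample
shifted = [2.5, 2.5, 2.5, 2.5, 2.5, 1.5]    # hypothetical live sample

no_drift = psi(training, training, edges)
drift = psi(training, shifted, edges)
```

Identical samples score zero and the score grows as the live distribution moves away from the reference. Which score counts as "drift" is a judgment that depends on the model and its tolerances.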
How APM and ML monitoring differ
▴ Error Rate, Throughput, Latency
○ Necessary, no longer sufficient
▴ Not all work is production work
○ ML monitoring happens from the beginning
of the pipeline
▴ APM can tell you what is wrong
○ ML monitoring is about understanding why
What makes ML monitoring unique
▴ Quantitative analysis of model performance
○ Information you can use
▴ Controlled comparison of distributions
○ Repeatable
○ Reliable
○ Consistent
▴ Alerting on meaningful deviation
○ Actionable
○ Timely
○ Accurate
Only you know the shape of your data
▴ Every model and pipeline is different and specialized
○ You built them, you understand them
▴ You know what metrics and distributions are valuable
○ This is your model, you know the data and processes that created it
▴ You know the expected distributions
○ You can determine whether the behavior is correct
Only you know how to measure change
▴ Compare to reference set
○ Training, test, golden data set
▴ Compare to a baseline
○ Calculate a baseline from your data or production systems
▴ Compare to other
○ Use a comparison that makes sense in your domain
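As one illustration of "calculate a baseline from your data or production systems", the sketch below summarizes a window of past values as a mean and standard deviation, then expresses a new observation as a distance from that baseline. The function names and numbers are hypothetical, and mean plus standard deviation is only one possible baseline; a domain-specific comparison may suit a given model better.

```python
import statistics


def build_baseline(values):
    """Summarize a reference window as mean and sample standard deviation."""
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}


def deviation_from_baseline(baseline, observed):
    """Absolute distance of an observation from the baseline, in standard deviations."""
    gap = abs(observed - baseline["mean"])
    if baseline["stdev"] == 0:
        # A constant baseline: any difference at all is infinitely unusual.
        return 0.0 if gap == 0 else float("inf")
    return gap / baseline["stdev"]


# Hypothetical daily mean prediction scores from a stable period.
baseline = build_baseline([10.0, 12.0, 11.0, 9.0, 13.0])
typical = deviation_from_baseline(baseline, 11.0)
unusual = deviation_from_baseline(baseline, 20.0)
```

The baseline is computed once from data you trust and then reused, which keeps comparisons repeatable from one run to the next.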
Only you know when a change matters
▴ You know your model and tolerances
▴ You know when a deviation is significant (or not!)
▴ You know when these conditions need to change
Verta understands model monitoring
▴ Designed for your workflows
▴ Easy integration to capture your monitoring data
▴ Visualize and understand your metrics, distributions, and drift
▴ Get alerted when you should - not otherwise
Introducing a generalized
framework for Model Monitoring
Concepts
▴ Monitored Entity: A reference name (e.g. model or pipeline) that you want to
monitor
▴ Profiler: A function that computes statistics about your data
▴ Summary: A collection of statistics about your data (output of profiler)
○ Samples: instance of a summary, i.e., a statistic
○ Labels: key-values attached to summary samples. Used for rich filtering and
aggregation
▴ Alerter: Triggered periodically, it can talk with the Verta API to fetch information
about summaries and identify if they look wrong
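The relationship between a profiler, its summaries, samples, and labels can be sketched in plain Python. This is not the Verta client API; `SummarySample`, `profile_column`, and the entity and column names are hypothetical stand-ins that only mirror the vocabulary above.

```python
from dataclasses import dataclass, field
import statistics


@dataclass
class SummarySample:
    """One computed statistic plus the labels used to filter and aggregate it."""
    summary_name: str
    value: float
    labels: dict = field(default_factory=dict)


def profile_column(entity, column, values):
    """Profiler: turn raw values (None means missing) into summary samples.

    Needs at least two non-missing values to compute quartiles.
    """
    present = [v for v in values if v is not None]
    labels = {"entity": entity, "column": column}
    q1, median, q3 = statistics.quantiles(present, n=4)
    return [
        SummarySample("missing_rate", 1 - len(present) / len(values), dict(labels)),
        SummarySample("median", median, dict(labels)),
        SummarySample("iqr", q3 - q1, dict(labels)),
    ]


# "churn-model" is the monitored entity; "tenure" is a made-up input column.
samples = profile_column("churn-model", "tenure", [1.0, 2.0, None, 3.0, 4.0])
by_name = {s.summary_name: s.value for s in samples}
```

An alerter in this picture would be a separate, periodically triggered function that reads stored samples back by entity and label and decides whether they look wrong.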
How does it work?
1. Define monitored entity: the entity to be monitored (e.g., model, data, pipeline)
2. Define summaries to monitor for the entity
3. Run profilers (manually or automatically) to produce summary samples
4. View samples, define alerts
5. Get alerted (e.g. via Slack)
6. Close the loop!
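The steps above can be sketched end to end with an in-memory store standing in for the monitoring backend. Everything here is hypothetical: the store, the function names, the entity name, the 10% bound, and the `notify` callback, which in a real setup might post to Slack.

```python
from collections import defaultdict

# Stand-in for the monitoring backend: (entity, summary) -> list of sample values.
store = defaultdict(list)


def record(entity, summary, value):
    """Persist one summary sample for a monitored entity."""
    store[(entity, summary)].append(value)


def check_alert(entity, summary, upper_bound):
    """Return an alert message if the newest sample exceeds the bound, else None."""
    samples = store[(entity, summary)]
    if samples and samples[-1] > upper_bound:
        return f"{entity}/{summary}={samples[-1]:.2f} exceeds {upper_bound:.2f}"
    return None


def run_once(entity, batch, notify):
    """Profile one batch, store the sample, and notify only on a breach."""
    missing_rate = sum(v is None for v in batch) / len(batch)   # step 3
    record(entity, "missing_rate", missing_rate)
    message = check_alert(entity, "missing_rate", upper_bound=0.10)  # step 4
    if message:
        notify(message)                                          # step 5
    return message


sent = []  # collects notifications in place of a chat webhook
first = run_once("churn-model", [1, 2, 3, 4], sent.append)
second = run_once("churn-model", [1, None, None, 4], sent.append)
```

Only the second batch triggers a notification. Closing the loop (step 6) would mean acting on that message, for example by retraining, rolling back, or escalating to a person.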
How does it work?
[Diagram. Labeled components: Data/Model Pipelines, Model (Live), Model (Batch), Prediction Log, Ground truth, Time-series DB for statistical summaries, Remediation (Retrain, Rollback, Human loop)]
Summary
▴ Performance monitoring is no longer sufficient for the needs of modern ML systems
○ Model monitoring starts at the beginning of the pipeline and continues through production
○ Batch and live can be addressed in the same framework
▴ Knowing something is wrong is not enough; you need to know why
▴ Timely, actionable alerting is mandatory
▴ Building these tools on-site is difficult, error-prone, and expensive
▴ Spark is a fantastic tool to enable model monitoring
Monitor Your Models with Verta
▴ Visit monitoring.verta.ai today and see it in action
▴ Join our community
▴ Get more out of your models
▴ Get more out of your alerts
Thank you.
Cory A. Johannsen
Product Engineer, Verta Inc.
www.verta.ai