Real-World Strategies for
Model Debugging
Patrick Hall
Principal Scientist, bnh.ai
Visiting Faculty, George Washington School of Business
Disclaimer: bnh.ai leverages a unique blend of legal and technical expertise to protect and advance clients’ data,
analytics, and AI investments. Not all firm personnel, including named partners, are authorized to practice law.
All software has bugs.
Machine learning is software.
Model Debugging
▪ Model debugging is an emergent discipline focused on remediating errors in the
internal mechanisms and outputs of machine learning (ML) models.
▪ Model debugging attempts to test ML models like software (because models are
code).
▪ Model debugging is similar to regression diagnostics, but for ML models.
▪ Model debugging promotes trust directly and enhances interpretability as a side
effect.
See https://debug-ml-iclr2019.github.io for numerous model debugging approaches.
Why Debug ML Models?
AI Incidents on the Rise
This information is based on a qualitative assessment of 146 publicly reported incidents between 2015 and 2020.
Common Failure Modes
This information is based on a qualitative assessment of 169 publicly reported incidents between 1988 and February 1, 2021.
Regulatory and Legal Considerations
EU: Proposal for a Regulation on a European Approach for Artificial Intelligence
https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-european-approach-artificial-intelligence
● Article 17 - Quality management system (c): “techniques, procedures and systematic actions to be
used for the development, quality control and quality assurance of the high-risk AI system”
U.S. FTC: Using Artificial Intelligence and Algorithms
https://www.ftc.gov/news-events/blogs/business-blog/2020/04/using-artificial-intelligence-algorithms
● “Make sure that your AI models are validated and revalidated to ensure that they work as intended”
Brookings Institution: Products liability law as a way to address AI harms
https://www.brookings.edu/research/products-liability-law-as-a-way-to-address-ai-harms/
● “Manufacturers have an obligation to make products that will be safe when used in reasonably
foreseeable ways. If an AI system is used in a foreseeable way and yet becomes a source of harm, a
plaintiff could assert that the manufacturer was negligent in not recognizing the possibility of that
outcome.”
Textbook assessment is insufficient for real-world woes ...
The Strawman Model: gmono
The Strawman: gmono
▪ Constrained, monotonic GBM probability of default (PD) classifier, gmono.
▪ Grid search over hundreds of models.
▪ Best model selected by validation-based early
stopping.
▪ Seemingly well-regularized (row and column
sampling, explicit specification of L1 and L2
penalties).
▪ No evidence of over- or underfitting.
▪ Better validation logloss than benchmark GLM.
▪ Decision threshold selected by maximization of the F1 statistic.
▪ BUT traditional assessment can be insufficient!
ML Models Can Be Unnecessary
gmono is a glorified business rule: IF PAY_0 > 1, THEN DEFAULT.
PAY_0 is overemphasized.
ML Models Perpetuate Sociological Biases
Group disparity metrics are out-of-range for gmono across different marital statuses.
ML Models Have Security Vulnerabilities
Full-size image available: https://resources.oreilly.com/examples/0636920415947/blob/master/Attack_Cheat_Sheet.png
Methods of Debugging:
Software Quality Assurance and IT Governance
IT Governance and Software QA
Software Quality Assurance (QA)
● Unit testing
● Integration testing
● Functional testing
● Chaos testing
More for ML:
● Reproducible benchmarks
● Random attack
IT Governance
● Incident response
● Managed development processes
● Code reviews (even pair programming)
● Security and privacy policies
More for ML: Model governance and
model risk management (MRM)
● Executive oversight
● Documentation standards
● Multiple lines of defense
● Model inventories and monitoring
Further Reading:
Interagency Guidance on Model Risk Management (SR 11-7)
https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf
IT Governance and Software QA
● Due to hype, data scientists and ML engineers are often:
○ Excused from basic QA requirements and IT governance.
○ Allowed to operate in violation of security and privacy policies (and laws).
● Many organizations have incident response plans for all mission-critical computing except ML.
● Very few nonregulated organizations practice MRM.
● We are in the Wild West days of AI.
Further Reading: Overview of Debugging ML Models (Google)
https://developers.google.com/machine-learning/testing-debugging/common/overview
Methods of Debugging:
Detection and Remediation Strategies
Sensitivity Analysis
● ML models behave in complex and
unexpected ways.
● The only way to know how they will behave is
to test them.
● With sensitivity analysis, we can test model
behavior in interesting, critical, adversarial, or
random situations.
Important Tests:
● Visualizations of model performance
(ALE, ICE, partial dependence)
● Stress-testing and adversarial example
searches
● Random attacks
● Tests for underspecification
Source: http://www.vias.org/tmdatanaleng/cc_ann_extrapolation.html
Sensitivity Analysis Example—Partial Dependence
▪ Training data is sparse for PAY_0 > 1.
▪ ICE curves indicate that partial dependence is likely trustworthy and empirically confirm monotonicity, but also expose adversarial attack vulnerabilities.
▪ Partial dependence and ICE indicate gmono likely learned very little for PAY_0 > 1.
▪ PAY_0 = missing gives the lowest probability of default?!
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_sens_analysis_redux.ipynb
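Partial dependence and ICE are straightforward to compute by brute force. A minimal sketch, assuming a fitted classifier with a scikit-learn-style predict_proba and a pandas DataFrame of validation data (gmono and X_valid are hypothetical names standing in for the deck's model and data):

```python
import pandas as pd

def ice_and_pd(model, X, feature, grid):
    """For each grid value, force `feature` to that value for every row
    and record predictions: each row of the result is an ICE curve, and
    the column-wise mean is the partial dependence curve."""
    curves = {}
    for v in grid:
        X_mod = X.copy()
        X_mod[feature] = v                            # override the feature
        curves[v] = model.predict_proba(X_mod)[:, 1]  # P(default)
    ice = pd.DataFrame(curves)                        # one ICE curve per row
    return ice, ice.mean(axis=0)                      # (ICE, partial dependence)

# Hypothetical usage over the sparse PAY_0 > 1 region:
# ice, pdp = ice_and_pd(gmono, X_valid, "PAY_0", grid=range(-2, 9))
```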
Sensitivity Analysis Example—Adversarial Example Search
An adversarial example is a row of data that evokes a strange prediction; we can learn a lot from such examples.
Adversary search confirms multiple avenues of attack and exposes a potential flaw in gmono's inductive logic: default is predicted for customers who make payments above their credit limit.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_sens_analysis_redux.ipynb
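A heuristic random search is often enough to surface adversarial rows. A minimal sketch, assuming a model with predict_proba and a single-customer pandas Series `row`; `feature_ranges`, a mapping from feature names to plausible (low, high) bounds, is an assumption of this sketch:

```python
import numpy as np

def random_adversary_search(model, row, feature_ranges, n_iter=10_000, seed=0):
    """Perturb one feature at a time at random and keep the perturbations
    that swing the predicted probability far from the original score."""
    rng = np.random.default_rng(seed)
    base = model.predict_proba(row.to_frame().T)[:, 1][0]
    hits = []
    for _ in range(n_iter):
        cand = row.copy()
        f = rng.choice(list(feature_ranges))      # pick a feature to perturb
        lo, hi = feature_ranges[f]
        cand[f] = rng.uniform(lo, hi)
        p = model.predict_proba(cand.to_frame().T)[:, 1][0]
        if abs(p - base) > 0.5:                   # arbitrary "strange" cutoff
            hits.append((f, cand[f], p))
    return hits
```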
Residual Analysis
● Learning from mistakes is important.
● Residual analysis is the mathematical study of
modeling mistakes.
● With residual analysis, we can see the mistakes
our models are likely to make and correct or
mitigate them.
Important Tests:
● Residuals by feature and level
● Segmented error analysis
(including differential validity tests for
social discrimination)
● Shapley contributions to logloss
● Models of residuals
Source: Residual (Sur)Realism, https://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/Residual_Surrealism_TAS_2007.pdf
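Per-row logloss residuals are the raw material for most of these tests. A minimal sketch (gmono, X_valid, and y_valid are hypothetical names):

```python
import numpy as np

def logloss_residuals(y, p, eps=1e-15):
    """Per-row logloss: large values flag observations the model gets
    badly wrong; slice them by feature and level to find weak spots."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical usage:
# resid = logloss_residuals(y_valid, gmono.predict_proba(X_valid)[:, 1])
# X_valid.assign(resid=resid).groupby("PAY_0")["resid"].mean()
```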
Residual Analysis Example—Segmented Error
For PAY_0:
▪ Notable change in accuracy and error characteristics for PAY_0 > 1.
▪ Varying performance across segments can also be an indication of underspecification.
▪ For SEX, accuracy and error characteristics vary little across individuals represented in the training data.
Nondiscrimination should be tested by more involved disparate impact analysis.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
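Segmented error analysis is a groupby away. A minimal sketch, assuming scikit-learn-style objects; pass the F1-maximizing decision threshold chosen earlier:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def segmented_metrics(model, X, y, segment_col, threshold=0.5):
    """Accuracy and F1 per level of `segment_col`: large swings across
    levels can indicate underspecification, and swings across demographic
    levels call for a full disparate impact analysis."""
    yhat = (model.predict_proba(X)[:, 1] > threshold).astype(int)
    frame = pd.DataFrame({"seg": X[segment_col].values,
                          "y": np.asarray(y), "yhat": yhat})
    return frame.groupby("seg").apply(lambda g: pd.Series({
        "n": len(g),
        "accuracy": accuracy_score(g["y"], g["yhat"]),
        "f1": f1_score(g["y"], g["yhat"], zero_division=0),
    }))

# Hypothetical usage:
# segmented_metrics(gmono, X_valid, y_valid, "PAY_0")
```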
Residual Analysis Example—Shapley Values
Globally important features PAY_3 and PAY_2 are more important, on average, to the loss than to the predictions!
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
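For tree ensembles, the shap package can attribute the loss itself rather than the prediction. A sketch, assuming gmono is a tree model that shap's TreeExplainer supports and X_valid/y_valid are held-out data:

```python
import numpy as np
import shap

background = X_valid.sample(100, random_state=0)   # keeps estimation tractable
explainer = shap.TreeExplainer(
    gmono,
    data=background,
    feature_perturbation="interventional",
    model_output="log_loss",   # attribute each feature's share of the *loss*
)
loss_shap = explainer.shap_values(X_valid, y=y_valid)

# Features with large mean |contribution| to logloss relative to their
# contribution to predictions (PAY_2 and PAY_3 here) are non-robust candidates.
mean_loss_contrib = np.abs(loss_shap).mean(axis=0)
```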
Residual Analysis Example—Modeling Residuals
This tree encodes rules describing when gmono is probably wrong!
Decision tree model of gmono's DEFAULT_NEXT_MONTH=1 logloss residuals, with 3-fold CV MSE = 0.0070 and R² = 0.8871.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
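A shallow decision tree fit to the residuals turns them into readable rules. A minimal sketch, assuming `resid` holds the per-row logloss residuals computed above for the DEFAULT_NEXT_MONTH = 1 rows:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor, export_text

surrogate = DecisionTreeRegressor(max_depth=3, random_state=0)
surrogate.fit(X_valid, resid)

# Human-readable rules for where gmono is probably wrong:
print(export_text(surrogate, feature_names=list(X_valid.columns)))

# Cross-validated fit quality tells you whether to trust the rules:
mse = -cross_val_score(surrogate, X_valid, resid, cv=3,
                       scoring="neg_mean_squared_error").mean()
```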
Benchmark Models
● Technical progress in training: Take small steps from reproducible
benchmarks. How else do you know if the code changes you made
today to your incredibly complex ML system made it any better?
● Sanity checks on real-world performance: Compare complex model
predictions to benchmark model predictions. How else can you know if
your incredibly complex ML system is giving strange predictions on
real-world data?
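A minimal sketch of the second point, assuming both models expose predict_proba (glm_benchmark and the tolerance are hypothetical):

```python
import numpy as np

def benchmark_disagreements(complex_model, benchmark_model, X, tol=0.15):
    """Indices of rows where the complex model and a trusted, interpretable
    benchmark disagree by more than `tol` in predicted probability; those
    predictions deserve review before they are used."""
    gap = np.abs(complex_model.predict_proba(X)[:, 1]
                 - benchmark_model.predict_proba(X)[:, 1])
    return np.flatnonzero(gap > tol)

# Hypothetical usage on incoming scoring data:
# suspect_rows = benchmark_disagreements(gmono, glm_benchmark, X_new)
```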
Remediation of the gmono Strawman
▪ Overemphasis of PAY_0:
▪ Collect better data!
▪ Engineer features for payment trends or stability.
▪ Strong regularization or missing value injection.
▪ Sparsity of PAY_0 > 1 training data: Get better data! (Increase observation weights?)
▪ Payments ≥ credit limit: Inference-time model assertion (see the sketch after this list).
▪ Disparate impact: Model selection by minimal disparate impact.
(Pre-, in-, post-processing?)
▪ Security vulnerabilities: API throttling, authentication, real-time model monitoring.
▪ Large logloss importance: Evaluate dropping non-robust features.
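A minimal sketch of an inference-time model assertion for the payments ≥ credit limit flaw; PAY_AMT1 and LIMIT_BAL are assumed column names, and the review action is a placeholder:

```python
def payment_assertion(row, p_default):
    """Checked at inference time: a customer paying at or above their
    credit limit should not be scored as a near-certain default, so
    route such rows for human review instead of auto-decisioning."""
    if row["PAY_AMT1"] >= row["LIMIT_BAL"] and p_default > 0.5:
        return {"p_default": p_default, "action": "hold_for_review"}
    return {"p_default": p_default, "action": "auto"}
```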
Process Remediation Strategies
● Appeal and override: Always enable users to appeal inevitable wrong decisions.
● Audits or red-teaming: Pay external experts to find bugs and problems.
● Bug bounties: Pay rewards to researchers (and teenagers) who find bugs in your
(ML) software.
● Demographic and professional diversity: Diverse teams spot different kinds of problems.
● Domain expertise: Understand the context in which you are operating; crucial for testing.
● Incident response plan: Complex systems fail; be prepared.
● IT governance and QA: Treat ML systems like other mission-critical software assets!
● Model risk management: Empower executives; align incentives; challenge and document
design decisions; and monitor models.
● Past known incidents: Those who ignore history are doomed to repeat it.
Technical Remediation Strategies
▪ Anomaly detection: Strange predictions can signal performance or security problems.
▪ Calibration to past data: Make output probabilities meaningful in the real world.
▪ Experimental design: Use science to select training data that addresses your implicit
hypotheses.
▪ Interpretable models/XAI: It’s easier to debug systems we can actually understand.
▪ Manual prediction limits: Don’t let models make embarrassing, harmful, or illegal
predictions.
▪ Model or model artifact editing: Directly edit the inference code of your model.
▪ Model monitoring: Always watch the behavior of ML models in the real world.
▪ Monotonicity and interaction constraints: Force your models to obey reality (see the sketch after this list).
▪ Strong regularization or missing value injection: Penalize your models for
overemphasizing non-robust input features.
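A minimal sketch of monotonicity constraints with XGBoost, assuming the repayment-delinquency features should only push the probability of default up; the feature names and hyperparameters are illustrative, not the deck's exact settings:

```python
import xgboost as xgb

# Assumed delinquency feature names; +1 = prediction non-decreasing in the
# feature, -1 = non-increasing, 0 = unconstrained. One entry per column,
# in X_train's column order.
delinq = {"PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"}
monotone = tuple(1 if c in delinq else 0 for c in X_train.columns)

model = xgb.XGBClassifier(
    monotone_constraints=monotone,
    reg_alpha=0.1, reg_lambda=1.0,        # explicit L1/L2 penalties
    subsample=0.8, colsample_bytree=0.8,  # row and column sampling
    n_estimators=500,
    early_stopping_rounds=50,             # validation-based early stopping
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
```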
References and
Resources
Must Reads
AI Incidents: study and catalog incidents so you don't repeat them; apply the same processes used for transportation incidents.
Fundamental Limitations: ML must be constrained and tested in the context of domain knowledge … or it doesn't really work. Some things cannot be predicted, no matter how good the data or how many data scientists are involved.
Risk Management: executive oversight, incentives, culture, and process are crucial to mitigate risk.
Resources
ModelTracker: Redesigning Performance Analysis Tools for Machine Learning
https://www.microsoft.com/en-us/research/publication/modeltracker-redesigning-performance-analysis-tools-for-machine-learning/
BIML Interactive Machine Learning Risk Framework
https://berryvilleiml.com/interactive/
Debugging Machine Learning Models
https://debug-ml-iclr2019.github.io/
Safe and Reliable Machine Learning
https://www.dropbox.com/s/sdu26h96bc0f4l7/FAT19-AI-Reliability-Final.pdf?dl=0
Tools:
allennlp, cleverhans, manifold, SALib, shap, What-If Tool
QUESTIONS? • CONTACT US • CONTACT@BNH.AI
Patrick Hall
Principal Scientist, bnh.ai
ph@bnh.ai
Disclaimer: bnh.ai leverages a unique blend of legal and technical expertise to protect and advance clients’ data,
analytics, and AI investments. Not all firm personnel, including named partners, are authorized to practice law.
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.