You used cross-validation, early stopping, grid search, monotonicity constraints, and regularization to train a generalizable, interpretable, and stable machine learning (ML) model. Its fit statistics look just fine on out-of-time test data, and better than the linear model it’s replacing. You selected your probability cutoff based on business goals and you even containerized your model to create a real-time scoring engine for your pals in information technology (IT). Time to deploy?
Not so fast. Current best practices for ML model training and assessment can be insufficient for high-stakes, real-world systems. Much like other complex IT systems, ML models must be debugged for logical or run-time errors and security vulnerabilities. Recent, high-profile failures have made it clear that ML models must also be debugged for disparate impact and other types of discrimination.
This presentation introduces model debugging, an emergent discipline focused on finding and fixing errors in the internal mechanisms and outputs of ML models. Model debugging attempts to test ML models like code (because they are code). It enhances trust in ML directly by increasing accuracy in new or holdout data, by decreasing or identifying hackable attack surfaces, or by decreasing discrimination. As a side-effect, model debugging should also increase the understanding and interpretability of model mechanisms and predictions.
Real-world Strategies for Debugging Machine Learning Systems
1. Real-World Strategies for
Model Debugging
Patrick Hall
Principal Scientist, bnh.ai
Visiting Faculty, George Washington School of Business
Disclaimer: bnh.ai leverages a unique blend of legal and technical expertise to protect and advance clients’ data,
analytics, and AI investments. Not all firm personnel, including named partners, are authorized to practice law.
3. Model Debugging
▪ Model debugging is an emergent discipline focused on remediating errors in the
internal mechanisms and outputs of machine learning (ML) models.
▪ Model debugging attempts to test ML models like software (because models are
code).
▪ Model debugging is similar to regression diagnostics, but for ML models.
▪ Model debugging promotes trust directly and enhances interpretability as a side
effect.
See https://debug-ml-iclr2019.github.io for numerous model debugging approaches.
6. AI Incidents on the Rise
This information is based on a qualitative assessment of 146 publicly reported incidents between 2015 and 2020.
7. Common Failure Modes
This information is based on a qualitative assessment of 169 publicly reported incidents between 1988 and February 1, 2021.
8. Regulatory and Legal Considerations
EU: Proposal for a Regulation on a European Approach for Artificial Intelligence
https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-european-approach-artificial-intelligence
● Article 17 - Quality management system (c): “techniques, procedures and systematic actions to be
used for the development, quality control and quality assurance of the high-risk AI system”
U.S. FTC: Using Artificial Intelligence and Algorithms
https://www.ftc.gov/news-events/blogs/business-blog/2020/04/using-artificial-intelligence-algorithms
● “Make sure that your AI models are validated and revalidated to ensure that they work as intended”
Brookings Institution: Products liability law as a way to address AI harms
https://www.brookings.edu/research/products-liability-law-as-a-way-to-address-ai-harms/
● “Manufacturers have an obligation to make products that will be safe when used in reasonably
foreseeable ways. If an AI system is used in a foreseeable way and yet becomes a source of harm, a
plaintiff could assert that the manufacturer was negligent in not recognizing the possibility of that
outcome.”
10. The Strawman: gmono
▪ Constrained, monotonic GBM probability of default (PD) classifier, gmono.
▪ Grid search over hundreds of models.
▪ Best model selected by validation-based early
stopping.
▪ Seemingly well-regularized (row and column
sampling, explicit specification of L1 and L2
penalties).
▪ No evidence of over- or underfitting.
▪ Better validation logloss than benchmark GLM.
▪ Decision threshold selected by maximization of the F1 statistic.
▪ BUT traditional assessment can be insufficient!
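The decision-threshold step above can be made concrete. A minimal, dependency-free sketch of selecting a probability cutoff by F1 maximization (the sample labels and probabilities in the test are hypothetical):

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """Compute the F1 statistic for a given probability cutoff."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f1_threshold(y_true, y_prob, grid=None):
    """Scan candidate cutoffs and return the one maximizing F1."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda t: f1_at_threshold(y_true, y_prob, t))
```

In practice the scan would run on validation data, never on the training partition used to fit the model.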
11. ML Models Can Be Unnecessary
gmono is a glorified business rule: IF PAY_0 > 1, THEN DEFAULT.
PAY_0 is overemphasized.
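One way to check whether a complex model is really necessary is to score the one-line business rule it appears to have learned and compare accuracy directly. A minimal sketch (the rows and labels in the test are hypothetical, not the credit card data):

```python
def rule_classifier(row):
    """The rule gmono effectively learned: IF PAY_0 > 1, THEN DEFAULT."""
    return 1 if row["PAY_0"] > 1 else 0

def accuracy(model_fn, rows, labels):
    """Fraction of rows where a classifier matches the known labels."""
    hits = sum(1 for row, y in zip(rows, labels) if model_fn(row) == y)
    return hits / len(labels)
```

If the GBM's holdout accuracy is only marginally better than `accuracy(rule_classifier, ...)`, the added complexity may not be worth its interpretability and security costs.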
12. ML Models Perpetuate Sociological Biases
Group disparity metrics are out-of-range for gmono across different marital statuses.
13. ML Models Have Security Vulnerabilities
Full-size image available: https://resources.oreilly.com/examples/0636920415947/blob/master/Attack_Cheat_Sheet.png
15. IT Governance and Software QA
Software Quality Assurance (QA)
● Unit testing
● Integration testing
● Functional testing
● Chaos testing
More for ML:
● Reproducible benchmarks
● Random attack
IT Governance
● Incident response
● Managed development processes
● Code reviews (even pair programming)
● Security and privacy policies
More for ML: Model governance and
model risk management (MRM)
● Executive oversight
● Documentation standards
● Multiple lines of defense
● Model inventories and monitoring
Further Reading:
Interagency Guidance on Model Risk Management (SR 11-7)
https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf
16. IT Governance and Software QA
● Due to hype, data scientists and ML engineers are often:
○ Excused from basic QA requirements and IT governance.
○ Allowed to operate in violation of security and privacy policies
(and laws).
● Many organizations have incident response plans for all
mission-critical computing except ML.
● Very few nonregulated organizations practice MRM.
● We are in the Wild West days of AI.
Further Reading: Overview of Debugging ML Models (Google)
https://developers.google.com/machine-learning/testing-debugging/common/overview
18. Sensitivity Analysis
● ML models behave in complex and
unexpected ways.
● The only way to know how they will behave is
to test them.
● With sensitivity analysis, we can test model
behavior in interesting, critical, adversarial, or
random situations.
Important Tests:
● Visualizations of model performance
(ALE, ICE, partial dependence)
● Stress-testing and adversarial example
searches
● Random attacks
● Tests for underspecification
Source: http://www.vias.org/tmdatanaleng/cc_ann_extrapolation.html
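The random attack in the list above is the simplest of these tests: score many random rows and collect any that produce invalid or extreme outputs. A minimal sketch, assuming a `predict_fn` that returns a probability and user-supplied numeric feature ranges (both hypothetical):

```python
import random

def random_attack(predict_fn, feature_ranges, n_rows=1000, seed=42):
    """Score many random rows and collect any that yield invalid
    predictions -- a basic check for unstable model behavior."""
    rng = random.Random(seed)
    suspicious = []
    for _ in range(n_rows):
        row = {name: rng.uniform(lo, hi)
               for name, (lo, hi) in feature_ranges.items()}
        p = predict_fn(row)
        # Flag probabilities outside [0, 1] or NaN outputs (NaN != NaN).
        if not (0.0 <= p <= 1.0) or p != p:
            suspicious.append((row, p))
    return suspicious
```

An empty result does not prove the model is safe; it only means this particular random probe found nothing.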
19. Sensitivity Analysis Example—Partial Dependence
▪ Training data is sparse for PAY_0 > 1.
▪ ICE curves indicate that partial dependence is likely trustworthy and empirically confirm monotonicity, but also expose adversarial attack vulnerabilities.
▪ Partial dependence and ICE indicate gmono likely learned very little for PAY_0 > 1.
▪ PAY_0 = missing gives lowest probability of default?!
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_sens_analysis_redux.ipynb
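Both diagnostics on this slide reduce to the same loop: vary one feature over a grid while holding the rest of each row fixed. A minimal sketch, assuming a black-box `predict_fn` (the linear toy model in the test is hypothetical, not gmono):

```python
def ice_curves(predict_fn, rows, feature, grid):
    """ICE: for each row, vary one feature over a grid with all
    other features held fixed, and record the predictions."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            perturbed = dict(row)
            perturbed[feature] = value
            curve.append(predict_fn(perturbed))
        curves.append(curve)
    return curves

def partial_dependence(predict_fn, rows, feature, grid):
    """Partial dependence is the pointwise average of the ICE curves."""
    curves = ice_curves(predict_fn, rows, feature, grid)
    return [sum(c[i] for c in curves) / len(curves) for i in range(len(grid))]
```

Checking that every ICE curve is nondecreasing in the constrained feature is a direct empirical test of the monotonicity constraint.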
20. Sensitivity Analysis Example—Adversarial Example
Search
Adversarial examples are rows of data that evoke strange predictions—we can learn a lot from them.
Adversary search confirms multiple avenues of attack and exposes a potential flaw in gmono's inductive logic: default is predicted for customers who make payments above their credit limit.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_sens_analysis_redux.ipynb
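A crude adversarial search needs nothing more than random perturbation of a seed row while keeping whichever change moves the prediction closest to a target. This is a hedged sketch of the idea, not the heuristic used in the linked notebook (the toy `predict_fn` in the test is hypothetical):

```python
import random

def adversarial_search(predict_fn, seed_row, feature_ranges, target=1.0,
                       n_tries=2000, seed=0):
    """Randomly perturb one feature of a seed row at a time and keep
    the perturbation whose prediction lands closest to the target."""
    rng = random.Random(seed)
    best_row = dict(seed_row)
    best_gap = abs(predict_fn(seed_row) - target)
    for _ in range(n_tries):
        row = dict(seed_row)
        name = rng.choice(list(feature_ranges))
        lo, hi = feature_ranges[name]
        row[name] = rng.uniform(lo, hi)
        gap = abs(predict_fn(row) - target)
        if gap < best_gap:
            best_row, best_gap = row, gap
    return best_row
```

Gradient-based or genetic searches find adversarial rows far more efficiently; the value of a random search is that it works on any black-box scoring engine.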
21. Residual Analysis
● Learning from mistakes is important.
● Residual analysis is the mathematical study of
modeling mistakes.
● With residual analysis, we can see the mistakes
our models are likely to make and correct or
mitigate them.
Important Tests:
● Residuals by feature and level
● Segmented error analysis
(including differential validity tests for
social discrimination)
● Shapley contributions to logloss
● Models of residuals
Source: Residual (Sur)Realism
https://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/Residual_Surrealism_TAS_2007.pdf
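The tests above all start from per-row residuals. For a probability-of-default classifier, the natural residual is per-row logloss; a minimal sketch:

```python
import math

def logloss_residuals(y_true, y_prob, eps=1e-15):
    """Per-row logloss: large values mark the rows the model
    gets most confidently wrong."""
    residuals = []
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        residuals.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return residuals
```

Sorting rows by this quantity, or plotting it by feature and level, is the starting point for every residual analysis on the following slides.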
22. Residual Analysis Example—Segmented Error
For PAY_0:
▪ Notable change in accuracy and error characteristics for PAY_0 > 1.
▪ Varying performance across segments can also be an indication of underspecification.
▪ For SEX, accuracy and error characteristics vary little across individuals represented in the training data.
Nondiscrimination should be tested by more involved disparate impact analysis.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
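Segmented error analysis is a group-by over one feature's levels. A minimal sketch (the rows, labels, and predictions in the test are hypothetical), easily extended to per-segment confusion matrices or false positive rates:

```python
def segmented_accuracy(rows, labels, preds, feature):
    """Group rows by the levels of one feature and compute
    accuracy within each segment."""
    hits, counts = {}, {}
    for row, y, p in zip(rows, labels, preds):
        level = row[feature]
        counts[level] = counts.get(level, 0) + 1
        hits[level] = hits.get(level, 0) + (1 if y == p else 0)
    return {level: hits[level] / counts[level] for level in counts}
```

Large gaps between segments signal underspecification; gaps across demographic segments call for the fuller disparate impact analysis noted above.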
23. Residual Analysis Example—Shapley Values
Globally important
features PAY_3 and
PAY_2 are more
important, on
average, to the loss
than to the
predictions!
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
24. Residual Analysis Example—Modeling Residuals
This tree encodes rules describing when gmono is probably wrong!
Decision tree model of gmono DEFAULT_NEXT_MONTH=1 logloss residuals, with 3-fold CV MSE = 0.0070 and R² = 0.8871.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
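The core of modeling residuals is finding feature regions where error concentrates. As a dependency-free sketch of the idea, here is a single-split regression stump on one feature's residuals (the linked notebook fits a full decision tree; the data in the test is hypothetical):

```python
def best_split(x, residuals):
    """Find the single cut on one feature that best separates large
    and small residuals (minimum total within-group squared error)."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best_cut, best_err = None, float("inf")
    for cut in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= cut]
        right = [r for xi, r in zip(x, residuals) if xi > cut]
        err = sse(left) + sse(right)
        if err < best_err:
            best_cut, best_err = cut, err
    return best_cut
```

Each split the residual model finds is a human-readable rule for when to distrust gmono's predictions.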
25. Benchmark Models
● Technical progress in training: Take small steps from reproducible
benchmarks. How else do you know if the code changes you made
today to your incredibly complex ML system made it any better?
● Sanity checks on real-world performance: Compare complex model
predictions to benchmark model predictions. How else can you know if
your incredibly complex ML system is giving strange predictions on
real-world data?
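The sanity check above can run as a one-line comparison in production: flag any row where the complex model and a trusted benchmark disagree beyond a tolerance. A minimal sketch (the predictions and the 0.2 tolerance are hypothetical):

```python
def disagreement_rows(complex_preds, benchmark_preds, tolerance=0.2):
    """Indices where the complex model and a trusted benchmark
    disagree by more than a tolerance -- candidates for review."""
    return [i for i, (c, b) in enumerate(zip(complex_preds, benchmark_preds))
            if abs(c - b) > tolerance]
```

Flagged rows can be routed to the benchmark's prediction, to a human reviewer, or to an incident response process, depending on the stakes.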
26. Remediation of the gmono Strawman
▪ Overemphasis of PAY_0:
▪ Collect better data!
▪ Engineer features for payment trends or stability.
▪ Strong regularization or missing value injection.
▪ Sparsity of PAY_0 > 1 training data: Get better data! (Increase observation weights?)
▪ Payments ≥ credit limit: Inference-time model assertion.
▪ Disparate impact: Model selection by minimal disparate impact.
(Pre-, in-, post-processing?)
▪ Security vulnerabilities: API throttling, authentication, real-time model monitoring.
▪ Large logloss importance: Evaluate dropping non-robust features.
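The inference-time model assertion above can be a thin wrapper around the scoring call. A hedged sketch (the column names `PAY_AMT` and `LIMIT_BAL` and the 0.5 cap are assumptions for illustration, not the deck's actual remediation):

```python
def assert_and_predict(predict_fn, row):
    """Inference-time model assertion: if a customer's payment meets
    or exceeds their credit limit, override an implausibly high
    default probability before it leaves the scoring engine."""
    p = predict_fn(row)
    if row["PAY_AMT"] >= row["LIMIT_BAL"]:
        return min(p, 0.5)  # assumption: cap the score for full payers
    return p
```

Logging every triggered assertion also feeds model monitoring: a spike in overrides is itself a signal that the model or the incoming data has drifted.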
27. Process Remediation Strategies
● Appeal and override: Always enable users to appeal inevitable wrong decisions.
● Audits or red-teaming: Pay external experts to find bugs and problems.
● Bug bounties: Pay rewards to researchers (and teenagers) who find bugs in your
(ML) software.
● Demographic and professional diversity: Diverse teams spot different kinds of problems.
● Domain expertise: Understand the context in which you are operating; crucial for testing.
● Incident response plan: Complex systems fail; be prepared.
● IT governance and QA: Treat ML systems like other mission-critical software assets!
● Model risk management: Empower executives; align incentives; challenge and document
design decisions; and monitor models.
● Past known incidents: Those who ignore history are doomed to repeat it.
28. Technical Remediation Strategies
▪ Anomaly detection: Strange predictions can signal performance or security problems.
▪ Calibration to past data: Make output probabilities meaningful in the real world.
▪ Experimental design: Use science to select training data that addresses your implicit
hypotheses.
▪ Interpretable models/XAI: It’s easier to debug systems we can actually understand.
▪ Manual prediction limits: Don’t let models make embarrassing, harmful, or illegal
predictions.
▪ Model or model artifact editing: Directly edit the inference code of your model.
▪ Model monitoring: Always watch the behavior of ML models in the real world.
▪ Monotonicity and interaction constraints: Force your models to obey reality.
▪ Strong regularization or missing value injection: Penalize your models for
overemphasizing non-robust input features.
30. Must Reads
AI Incidents Fundamental Limitations Risk Management
Study and catalog incidents
so you don’t repeat them.
Same processes from
transportation incidents.
ML must be constrained and
tested in the context of
domain knowledge … or it
doesn’t really work.
Somethings cannot be
predicted … no matter how
good the data or how many
data scientists are
involved.
Executive oversight,
incentives, culture, and
process are crucial to
mitigate risk.
32. QUESTIONS? • CONTACT US • CONTACT@BNH.AI
Patrick Hall
Principal Scientist, bnh.ai
ph@bnh.ai