State of the Art in Machine Learning, by Thomas Dietterich, Distinguished Professor Emeritus in the School of EECS at Oregon State University and Chief Scientist of BigML.
MLSEV 2020: Virtual Conference.
6. Decision Tree Method: ID3
• Ross Quinlan introduced ID3, a decision tree learning algorithm
• Goal: Compress chess endgame tables into simple decision rules
• Ken Thompson had reverse-enumerated the winning positions for certain chess endgames, producing a large table of (board position, outcome) pairs
• ID3 was applied to compress these into a more understandable representation
• Notes:
  • No generalization was required, and the data were noise-free
  • Interpretability was important
[Figures: ID3 decision tree for a "Win in 10" endgame; photo captioned "Breda, 2006"]
7. Today: Generalization is the Key
• Generalization for i.i.d. data
  • Assume training and runtime data are drawn from the same distribution
  • Strong theoretical guarantees
• Generalization across domains
  • Causal transportability
  • Domain-adversarial training
8. Causal Transportability (Pearl & Bareinboim, 2011)
• Predicting lung cancer:
  • T: Lung cancer
  • C: Chest pain
  • A: Patient is taking aspirin
  • K: Patient is a smoker (not observed)
  • S: The distribution of A may change between training and deployment (change of hospital)
• Goal: Create a predictive model that does not depend on S
  • Guaranteed to generalize to a new hospital (assuming this causal model is correct)
9. Graph Surgery Technique
• Generate all models that can make T independent of S
• Evaluate each model on validation data
• Keep the best model (sketch below)
• Guaranteed to transport across hospitals, provided that the causal diagram is correct
• Encourages thinking ahead about possible changes at deployment time
(Subbaswamy et al., 2018)
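A minimal sketch of the evaluate-and-select step, assuming the candidate feature subsets that render T independent of S have already been enumerated from the causal graph (that enumeration, the actual graph surgery, is omitted; the logistic-regression model and all names here are illustrative stand-ins):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def select_transportable_model(candidate_feature_sets, X_train, y_train, X_val, y_val):
    # Each candidate is a list of column indices chosen so that the target T
    # is independent of the selection variable S under the causal diagram.
    best_model, best_features, best_score = None, None, -1.0
    for features in candidate_feature_sets:
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[:, features], y_train)
        score = accuracy_score(y_val, model.predict(X_val[:, features]))
        if score > best_score:
            best_model, best_features, best_score = model, features, score
    return best_model, best_features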
10. Domain-Adversarial Training
• Given:
  • Training data points from two or more domains: D1, D2
  • D1 points are labeled pairs (x_i, y_i)
  • D2 points are unlabeled x_i
• Training:
  • For D1 points: Predict the correct label
  • For all points: Predict the domain (1 vs. 2)
  • Find weights that give accurate predictions for D1 and chance predictions for the domain (sketch below)
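A minimal PyTorch sketch of the gradient-reversal construction commonly used for domain-adversarial training (layer sizes and names are illustrative):

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on backward."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class DANN(nn.Module):
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.label_head = nn.Linear(64, n_classes)  # trained on labeled D1 points
        self.domain_head = nn.Linear(64, 2)         # trained on all points
    def forward(self, x):
        z = self.encoder(x)
        # The reversed gradient drives the encoder toward features on which
        # the domain head can do no better than chance.
        return self.label_head(z), self.domain_head(GradReverse.apply(z))

Training minimizes the label loss on D1 plus the domain loss on all points; because of the reversal, the same backward pass pushes the encoder toward domain-invariant features while keeping them predictive for the labels.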
13. Domain-Adversarial Training: Weaknesses
• The method assumes that the class label distributions are not changing
• The method can be unstable; it works best if you have at least some labeled data from the target domain to help choose hyperparameters
15. Feature Engineering
• In 1980, Quinlan carefully designed interpretable features with predictive power. This is still important today in most applications
• Claim: Features should include meta-data definitions
  • "Numbers should never travel alone across the internet" – Mark Fox
  • BigML Flatline language
  • SQL statements/procedures
  • Trifacta rules
Example (pandas sketch below):
  Student_Teacher_Ratio(school, time) =
    |{s : registered(s, school, time)}| / Σ_t FTE(t, school, time)
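A hypothetical pandas rendering of this feature definition (table schemas and column names are assumed for illustration):

import pandas as pd

def student_teacher_ratio(registrations: pd.DataFrame, staffing: pd.DataFrame,
                          school: str, time: str) -> float:
    # Assumed schema matching the slide's relational definition:
    # registrations: columns [student, school, time]
    # staffing:      columns [teacher, school, time, fte]
    at_school = (registrations["school"] == school) & (registrations["time"] == time)
    n_students = registrations.loc[at_school, "student"].nunique()
    staffed = (staffing["school"] == school) & (staffing["time"] == time)
    total_fte = staffing.loc[staffed, "fte"].sum()
    return n_students / total_fte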
16. Importance of Feature Meta-Data
• Allows data consumers to detect when the meaning of a feature has changed even when the feature name has not changed
• Important for detecting data errors and debugging classifier failures
17. Does Deep Learning Automate Feature Engineering? Yes and No
• No: Deep learning applications still require careful data preparation
  • Image normalization, contrast enhancement, etc.
• Yes: Deep learning can learn powerful intermediate representations
  • Before 2012: Manually-designed SIFT and HOG features for images, combined with support vector machines or random forests
  • After 2012: Deep learning produces much better results
[Chart: top-5 classification error (%) on ImageNet (1000 classes), 2010-2014; series: Before, After]
19. Interpretability and Explanation
• 1980: Quinlan wanted interpretability because he expected people to memorize the learned decision tree
  • In practice, we needed to check whether the learning algorithm got the right answer
• Today: Our highest-performing models (random forests, boosted trees, deep neural networks) are not interpretable
  • Interpretability and explanation are "hot topics" in ML research
20. Explanation and Interpretability
• Claim: Explanations should help the user perform some task
• BigML has worked hard on visualization tools to provide interpretability
• At Oregon State, we are developing explanation tools for reinforcement learning
ML System                   | User        | Task
----------------------------|-------------|---------------------------------------------
Predictive Model            | ML Engineer | Find errors and holes in the data
Recommendation System       | End User    | Decide whether to follow the recommendation
Predictive Model / RL Model | ML Engineer | Acceptance testing: decide whether the delivered system is sufficiently accurate
22. Uncertainty Quantification
• 1980: This issue was totally ignored
• Today: Giving calibrated uncertainty estimates is important
• Calibrated probabilities:
  • When the classifier says "X belongs to class C with probability 0.94", it is correct 94% of the time
  • This is measured using a separate labeled "calibration set"
  • In random forests, the "out of bag" training data can serve as the calibration set
23. Calibration
• Some classifiers are always well-calibrated:
  • Decision trees
  • Random forests
• Others must be post-processed to achieve good calibration:
  • Boosted trees
  • Support vector machines
  • Deep neural networks
24. Measuring Calibration via a Reliability Diagram
• Sort the predicted probabilities into bins: 0.0-0.1, 0.1-0.2, etc.
• For each bin, measure the average accuracy on the calibration data (sketch below)
• Plot the accuracy for each bin
  • The points should lie on the diagonal if the classifier is well-calibrated
• The example shows that Naïve Bayes is generally very optimistic

[Figure: Reliability diagram for Naïve Bayes on the ADULT dataset (Zadrozny & Elkan, 2002)]
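A minimal NumPy sketch of the binning computation behind a reliability diagram (function and variable names are illustrative):

import numpy as np

def reliability_curve(y_true, p_pred, n_bins=10):
    # y_true: 0/1 labels on the calibration set; p_pred: predicted
    # probability of class 1. Returns per-bin mean prediction and accuracy.
    y_true, p_pred = np.asarray(y_true), np.asarray(p_pred)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_of = np.digitize(p_pred, edges[1:-1])  # bin index 0..n_bins-1
    mean_pred, accuracy = [], []
    for b in range(n_bins):
        mask = bin_of == b
        if mask.any():
            mean_pred.append(p_pred[mask].mean())  # x-coordinate
            accuracy.append(y_true[mask].mean())   # y-coordinate
    return np.array(mean_pred), np.array(accuracy)

Plotting accuracy against mean_pred and comparing to the diagonal reproduces the diagram described above.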
25. Fitting a Recalibration Function
• Fit a function to the reliability diagram
  • Often a sigmoid (logistic regression) function works well (sketch below)
• Use this function to convert the predicted values (on the X axis) into calibrated values (on the Y axis)
• Similar techniques can calibrate Naïve Bayes, deep nets, boosted trees, etc.
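A sketch of sigmoid recalibration (Platt scaling) with scikit-learn, fit on a held-out calibration set as the slides recommend:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_recalibrator(p_cal, y_cal):
    # Fit a sigmoid mapping raw predicted probabilities to calibrated ones.
    lr = LogisticRegression()
    lr.fit(np.asarray(p_cal).reshape(-1, 1), y_cal)
    return lambda p: lr.predict_proba(np.asarray(p).reshape(-1, 1))[:, 1]

scikit-learn's CalibratedClassifierCV(method="sigmoid") packages the same idea; isotonic regression is a common non-parametric alternative.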
26. Local vs. Global Calibration
• Calibration compares predicted probability and expected accuracy globally, across the entire calibration data set
• This can be misleading:
  • A classifier could achieve 95% accuracy and perfect global calibration by classifying 95% of the data set perfectly and the remaining 5% completely incorrectly
  • That 5% could be a specific customer segment
  • Within that segment, the classifier is actually very poorly calibrated, because it outputs a confidence of 0.95 but is correct 0% of the time
• Lesson: Calibration should be checked separately for each customer segment or local group (sketch below)
  • Decision trees calibrate separately at each leaf of the tree, so they usually don't exhibit this problem
• It is always important to look at model accuracy by customer segments and other customer features (gender, race, region, age, etc.)
  • Example: Face recognition is less accurate on dark skin and on women
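A minimal sketch of a per-segment calibration check (segment labels and names are illustrative):

import numpy as np

def calibration_by_segment(y_true, p_pred, segments):
    # Compare mean predicted probability to observed accuracy within each
    # segment; a large gap signals poor local calibration.
    y_true, p_pred, segments = map(np.asarray, (y_true, p_pred, segments))
    for seg in np.unique(segments):
        mask = segments == seg
        print(f"{seg}: predicted {p_pred[mask].mean():.2f}, "
              f"observed {y_true[mask].mean():.2f}, n={int(mask.sum())}")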
28. Why Monitor?
• Predictive models are only guaranteed to be accurate if run-time queries are drawn from the same distribution as the training data
• Open category problem: Run-time data may involve new classes
  • New types of objects in computer vision
  • New classes of items (books, restaurants) in recommender systems
  • New diseases in medical systems
  • New types of fraud in supervised fraud detection
29. How to Monitor?
• Outlier detection
  • Detect whether a new query x_q is an outlier compared to the training data x_1, …, x_N
• Change detection
  • Detect whether the data distribution has changed
  • Compare the L most recent points x_{t-L+1}, …, x_t to the L points before them, x_{t-2L+1}, …, x_{t-L}. Do they come from different distributions?
30. Anomaly Detection Benchmarking Study
• Most anomaly detection papers evaluate on only a few datasets
  • Often proprietary or very easy (e.g., KDD 1999)
• The ML community needs a large and growing collection of public anomaly benchmarks

[Emmott, Das, Dietterich, Fern, Wong, 2013; KDD ODD-2013]
[Emmott, Das, Dietterich, Fern, Wong, 2016; arXiv 1503.01158v2]
31. Algorithms
• Density-based approaches
  • RKDE: Robust Kernel Density Estimation (Kim & Scott, 2008)
  • EGMM: Ensemble Gaussian Mixture Model (our group)
• Quantile-based methods
  • OCSVM: One-Class SVM (Schoelkopf et al., 1999)
  • SVDD: Support Vector Data Description (Tax & Duin, 2004)
• Neighbor-based methods
  • LOF: Local Outlier Factor (Breunig et al., 2000)
  • ABOD: kNN Angle-Based Outlier Detector (Kriegel et al., 2008)
• Projection-based methods
  • IFOR: Isolation Forest (Liu et al., 2008)
  • LODA: Lightweight Online Detector of Anomalies (Pevny, 2016)
32. Algorithm Comparison
[Bar chart: change in logit(AUC) and log(LIFT) relative to a control dataset, by algorithm]

Based on this study, BigML implemented Isolation Forest.
33. Open Category Detection
• Only make a prediction if the query x_q has a low anomaly score (sketch below)
• Liu et al. (2018) showed how to set the threshold τ to guarantee detecting new-category queries with high probability

[Diagram: the query x_q is scored by an anomaly detector built from the training examples (x_i, y_i); if A(x_q) > τ, the query is rejected; otherwise the classifier f outputs y = f(x_q)]

[Liu, Garrepalli, Fern, Dietterich, ICML 2018]
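A minimal sketch of the reject-option wrapper, using scikit-learn's IsolationForest as the anomaly detector; the threshold here is a plain quantile of the training scores, not the guarantee-backed τ of Liu et al.:

import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

class OpenCategoryClassifier:
    def __init__(self, quantile=0.99):
        self.detector = IsolationForest(random_state=0)
        self.classifier = RandomForestClassifier(random_state=0)
        self.quantile = quantile

    def fit(self, X, y):
        self.detector.fit(X)
        self.classifier.fit(X, y)
        # score_samples: higher means more normal, so negate to get an
        # anomaly score, then set tau at the chosen training-score quantile.
        self.tau = np.quantile(-self.detector.score_samples(X), self.quantile)
        return self

    def predict(self, X):
        preds = self.classifier.predict(X).astype(object)
        preds[-self.detector.score_samples(X) > self.tau] = "reject"
        return preds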
34. Change Detection
• "Two-sample" test: given S_a ~ P_a and S_b ~ P_b, is P_a ≠ P_b?
• Method 1: Kernel two-sample test
• Method 2: Old-vs-new classifier (sketch below)
  • Train a classifier to distinguish between S_a and S_b. Can it do better than random guessing?
  • At each time t, slide S_a and S_b one step forward in time (this requires online methods)
• An area of active research

[Diagram: sliding windows S_a = x_1, …, x_100 and S_b = x_101, …, x_200 over the data stream]
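A minimal batch sketch of the old-vs-new classifier test (Method 2); the online sliding update is omitted, and the decision threshold is an illustrative choice rather than a calibrated significance test:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def distribution_changed(S_a, S_b, threshold=0.6):
    # Label old-window points 0 and new-window points 1, then ask whether a
    # classifier can separate them better than the 50% chance rate.
    X = np.vstack([S_a, S_b])
    y = np.concatenate([np.zeros(len(S_a), dtype=int),
                        np.ones(len(S_b), dtype=int)])
    acc = cross_val_score(RandomForestClassifier(random_state=0),
                          X, y, cv=5, scoring="accuracy").mean()
    return acc > threshold, acc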
36. Application-Specific Metrics are Essential
• Standard metrics for evaluating classifiers, such as F1 and AUC, were developed for machine learning research
• Most applications require separate metrics
• Example: Financial fraud
  • Suppose we have 5 analysts, and each analyst can examine 10 cases per day
  • Metric: Expected value of the top 50 alarms (value@50) (sketch below)
  • Incorporates the estimated value of each candidate fraud alarm
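A minimal sketch of value@k, assuming each candidate alarm carries a model score and an estimated value if confirmed (names are illustrative):

import numpy as np

def value_at_k(scores, values, k=50):
    # Sum the estimated values of the k highest-scoring alarms -- the cases
    # the 5 analysts can actually examine in a day.
    scores, values = np.asarray(scores), np.asarray(values)
    top_k = np.argsort(scores)[::-1][:k]
    return values[top_k].sum()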
37. More Examples
• Open category detection:
  • Detect 99% of all open-category queries
  • Metric: Precision at 99% recall (sketch below)
• Obstacle detection for self-driving cars:
  • Detect 99.999% of all dangerous obstacles
  • Metric: Precision at 99.999% recall
• Cancer screening:
  • Must trade off false alarms versus missed alarms
  • Metric: Cost to the patient (may vary from one patient to another)
  • AUC is a fairly good metric for this case
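A sketch of precision at a fixed recall target, using scikit-learn's precision_recall_curve:

from sklearn.metrics import precision_recall_curve

def precision_at_recall(y_true, scores, target_recall=0.99):
    # Best precision attainable while still detecting at least
    # target_recall of the positives.
    precision, recall, _ = precision_recall_curve(y_true, scores)
    return max(p for p, r in zip(precision, recall) if r >= target_recall)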