State of the Art in Machine Learning, by Thomas Dietterich, Distinguished Professor Emeritus in the School of EECS at Oregon State University and Chief Scientist of BigML.
MLSEV 2020: Virtual Conference.
6. Decision Tree Method: ID3
• Ross Quinlan introduced ID3, a decision tree learning algorithm
• Goal: Compress chess endgame tables into simple decision rules
• Ken Thompson had reverse-enumerated the winning positions for certain chess endgames, producing a large table of (board position, outcome) pairs
• ID3 was applied to compress these into a more understandable representation
• Notes:
  • No generalization was required, and the data were noise-free
  • Interpretability was important
[Figures: ID3 decision tree for a "Win in 10" endgame; photo captioned "Breda, 2006"]
7. Today: Generalization is the Key
• Generalization for i.i.d. data
  • Assume training and runtime data are drawn from the same distribution
  • Strong theoretical guarantees
• Generalization across domains
  • Causal transportability
  • Domain-adversarial training
8. Causal Transportability (Pearl & Bareinboim, 2011)
• Predicting lung cancer:
  • T: Lung cancer
  • C: Chest pain
  • A: Patient is taking aspirin
  • K: Patient is a smoker (not observed)
  • S: The distribution of A may change between training and deployment (change of hospital)
• Goal: Create a predictive model that does not depend on S
  • Guaranteed to generalize to a new hospital (assuming this causal model is correct)
9. Graph Surgery Technique
• Generate all models that can make T independent of S
• Evaluate each model on validation data
• Keep the best model (sketch below)
• Guaranteed to transport across hospitals, provided that the causal diagram is correct
• Encourages thinking ahead about possible changes at deployment time
(Subbaswamy et al., 2018)
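A minimal sketch of the evaluate-and-select step, assuming the candidate feature subsets that render T independent of S have already been enumerated from the causal graph (that enumeration, the actual graph surgery, is omitted; the logistic-regression model and all names here are illustrative stand-ins):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def select_transportable_model(candidate_feature_sets, X_train, y_train, X_val, y_val):
    # Each candidate is a list of column indices chosen so that the target T
    # is independent of the selection variable S under the causal diagram.
    best_model, best_features, best_score = None, None, -1.0
    for features in candidate_feature_sets:
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[:, features], y_train)
        score = accuracy_score(y_val, model.predict(X_val[:, features]))
        if score > best_score:
            best_model, best_features, best_score = model, features, score
    return best_model, best_features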
10. Domain-Adversarial Training
• Given:
  • Training data points from two or more domains: D1, D2
  • D1 points are labeled pairs (x_i, y_i)
  • D2 points are unlabeled x_i
• Training:
  • For D1 points: Predict the correct label
  • For all points: Predict the domain (1 vs. 2)
  • Find weights that give accurate predictions for D1 and chance predictions for the domain (sketch below)
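A minimal PyTorch sketch of the gradient-reversal construction commonly used for domain-adversarial training (layer sizes and names are illustrative):

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on backward."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class DANN(nn.Module):
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.label_head = nn.Linear(64, n_classes)  # trained on labeled D1 points
        self.domain_head = nn.Linear(64, 2)         # trained on all points
    def forward(self, x):
        z = self.encoder(x)
        # The reversed gradient drives the encoder toward features on which
        # the domain head can do no better than chance.
        return self.label_head(z), self.domain_head(GradReverse.apply(z))

Training minimizes the label loss on D1 plus the domain loss on all points; because of the reversal, the same backward pass pushes the encoder toward domain-invariant features while keeping them predictive for the labels.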
13. Domain-Adversarial Training: Weaknesses
• The method assumes that the class label distributions are not changing
• The method can be unstable; it works best if you have at least some labeled data from the target domain to help choose hyperparameters
15. Feature Engineering
• In 1980, Quinlan carefully designed interpretable features with predictive power. This is still important today in most applications
• Claim: Features should include meta-data definitions
  • "Numbers should never travel alone across the internet" – Mark Fox
  • BigML Flatline language
  • SQL statements/procedures
  • Trifacta rules
Example (pandas sketch below):
  Student_Teacher_Ratio(school, time) =
    |{s : registered(s, school, time)}| / Σ_t FTE(t, school, time)
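A hypothetical pandas rendering of this feature definition (table schemas and column names are assumed for illustration):

import pandas as pd

def student_teacher_ratio(registrations: pd.DataFrame, staffing: pd.DataFrame,
                          school: str, time: str) -> float:
    # Assumed schema matching the slide's relational definition:
    # registrations: columns [student, school, time]
    # staffing:      columns [teacher, school, time, fte]
    at_school = (registrations["school"] == school) & (registrations["time"] == time)
    n_students = registrations.loc[at_school, "student"].nunique()
    staffed = (staffing["school"] == school) & (staffing["time"] == time)
    total_fte = staffing.loc[staffed, "fte"].sum()
    return n_students / total_fte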
16. Importance of Feature Meta-Data
• Allows data consumers to detect when the meaning of a feature has changed even when the feature name has not changed
• Important for detecting data errors and debugging classifier failures
17. Does Deep Learning Automate Feature Engineering? Yes and No
• No: Deep learning applications still require careful data preparation
  • Image normalization, contrast enhancement, etc.
• Yes: Deep learning can learn powerful intermediate representations
  • Before 2012: Manually-designed SIFT and HOG features for images, combined with support vector machines or random forests
  • After 2012: Deep learning produces much better results
[Chart: top-5 classification error (%) on ImageNet (1000 classes), 2010-2014; series: Before, After]
19. Interpretability and Explanation
• 1980: Quinlan wanted interpretability because he expected people to memorize the learned decision tree
  • In practice, we needed to check whether the learning algorithm got the right answer
• Today: Our highest-performing models (random forests, boosted trees, deep neural networks) are not interpretable
  • Interpretability and explanation are "hot topics" in ML research
20. Explanation and Interpretability
• Claim: Explanations should help the user perform some task
• BigML has worked hard on visualization tools to provide interpretability
• At Oregon State, we are developing explanation tools for reinforcement learning
ML System                   | User        | Task
----------------------------|-------------|---------------------------------------------
Predictive Model            | ML Engineer | Find errors and holes in the data
Recommendation System       | End User    | Decide whether to follow the recommendation
Predictive Model / RL Model | ML Engineer | Acceptance testing: decide whether the delivered system is sufficiently accurate
22. Uncertainty Quantification
• 1980: This issue was totally ignored
• Today: Giving calibrated uncertainty estimates is important
• Calibrated probabilities:
  • When the classifier says "X belongs to class C with probability 0.94", it is correct 94% of the time
  • This is measured using a separate labeled "calibration set"
  • In random forests, the "out of bag" training data can serve as the calibration set
23. Calibration
• Some classifiers are always well-calibrated:
  • Decision trees
  • Random forests
• Others must be post-processed to achieve good calibration:
  • Boosted trees
  • Support vector machines
  • Deep neural networks
24. Measuring Calibration via a Reliability Diagram
• Sort the predicted probabilities into bins: 0.0-0.1, 0.1-0.2, etc.
• For each bin, measure the average accuracy on the calibration data (sketch below)
• Plot the accuracy for each bin
  • The points should lie on the diagonal if the classifier is well-calibrated
• The example shows that Naïve Bayes is generally very optimistic

[Figure: Reliability diagram for Naïve Bayes on the ADULT dataset (Zadrozny & Elkan, 2002)]
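A minimal NumPy sketch of the binning computation behind a reliability diagram (function and variable names are illustrative):

import numpy as np

def reliability_curve(y_true, p_pred, n_bins=10):
    # y_true: 0/1 labels on the calibration set; p_pred: predicted
    # probability of class 1. Returns per-bin mean prediction and accuracy.
    y_true, p_pred = np.asarray(y_true), np.asarray(p_pred)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_of = np.digitize(p_pred, edges[1:-1])  # bin index 0..n_bins-1
    mean_pred, accuracy = [], []
    for b in range(n_bins):
        mask = bin_of == b
        if mask.any():
            mean_pred.append(p_pred[mask].mean())  # x-coordinate
            accuracy.append(y_true[mask].mean())   # y-coordinate
    return np.array(mean_pred), np.array(accuracy)

Plotting accuracy against mean_pred and comparing to the diagonal reproduces the diagram described above.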
25. Fitting a Recalibration Function
• Fit a function to the reliability diagram
  • Often a sigmoid (logistic regression) function works well (sketch below)
• Use this function to convert the predicted values (on the X axis) into calibrated values (on the Y axis)
• Similar techniques can calibrate Naïve Bayes, deep nets, boosted trees, etc.
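A sketch of sigmoid recalibration (Platt scaling) with scikit-learn, fit on a held-out calibration set as the slides recommend:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_recalibrator(p_cal, y_cal):
    # Fit a sigmoid mapping raw predicted probabilities to calibrated ones.
    lr = LogisticRegression()
    lr.fit(np.asarray(p_cal).reshape(-1, 1), y_cal)
    return lambda p: lr.predict_proba(np.asarray(p).reshape(-1, 1))[:, 1]

scikit-learn's CalibratedClassifierCV(method="sigmoid") packages the same idea; isotonic regression is a common non-parametric alternative.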
26. Local vs. Global Calibration
• Calibration compares predicted probability and expected accuracy globally, across the entire calibration data set
• This can be misleading:
  • A classifier could achieve 95% accuracy and perfect global calibration by classifying 95% of the data set perfectly and the remaining 5% completely incorrectly
  • That 5% could be a specific customer segment
  • Within that segment, the classifier is actually very poorly calibrated, because it outputs a confidence of 0.95 but is correct 0% of the time
• Lesson: Calibration should be checked separately for each customer segment or local group (sketch below)
  • Decision trees calibrate separately at each leaf of the tree, so they usually don't exhibit this problem
• It is always important to look at model accuracy by customer segments and other customer features (gender, race, region, age, etc.)
  • Example: Face recognition is less accurate on dark skin and on women
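A minimal sketch of a per-segment calibration check (segment labels and names are illustrative):

import numpy as np

def calibration_by_segment(y_true, p_pred, segments):
    # Compare mean predicted probability to observed accuracy within each
    # segment; a large gap signals poor local calibration.
    y_true, p_pred, segments = map(np.asarray, (y_true, p_pred, segments))
    for seg in np.unique(segments):
        mask = segments == seg
        print(f"{seg}: predicted {p_pred[mask].mean():.2f}, "
              f"observed {y_true[mask].mean():.2f}, n={int(mask.sum())}")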
28. Why Monitor?
• Predictive models are only guaranteed to be accurate if run-time queries are drawn from the same distribution as the training data
• Open category problem: Run-time data may involve new classes
  • New types of objects in computer vision
  • New classes of items (books, restaurants) in recommender systems
  • New diseases in medical systems
  • New types of fraud in supervised fraud detection
29. How to Monitor?
• Outlier detection
  • Detect whether a new query x_q is an outlier compared to the training data x_1, …, x_N
• Change detection
  • Detect whether the data distribution has changed
  • Compare the L most recent points x_{t-L+1}, …, x_t to the L points before them, x_{t-2L+1}, …, x_{t-L}. Do they come from different distributions?
30. Anomaly Detection Benchmarking Study
• Most anomaly detection papers evaluate on only a few datasets
  • Often proprietary or very easy (e.g., KDD 1999)
• The ML community needs a large and growing collection of public anomaly benchmarks

[Emmott, Das, Dietterich, Fern, Wong, 2013; KDD ODD-2013]
[Emmott, Das, Dietterich, Fern, Wong, 2016; arXiv 1503.01158v2]
31. Algorithms
• Density-based approaches
  • RKDE: Robust Kernel Density Estimation (Kim & Scott, 2008)
  • EGMM: Ensemble Gaussian Mixture Model (our group)
• Quantile-based methods
  • OCSVM: One-Class SVM (Schoelkopf et al., 1999)
  • SVDD: Support Vector Data Description (Tax & Duin, 2004)
• Neighbor-based methods
  • LOF: Local Outlier Factor (Breunig et al., 2000)
  • ABOD: kNN Angle-Based Outlier Detector (Kriegel et al., 2008)
• Projection-based methods
  • IFOR: Isolation Forest (Liu et al., 2008)
  • LODA: Lightweight Online Detector of Anomalies (Pevny, 2016)
32. Algorithm Comparison
[Bar chart: change in logit(AUC) and log(LIFT) relative to a control dataset, by algorithm]

Based on this study, BigML implemented Isolation Forest.
33. Open Category Detection
• Only make a prediction if the query x_q has a low anomaly score (sketch below)
• Liu et al. (2018) showed how to set the threshold τ to guarantee detecting new-category queries with high probability

[Diagram: the query x_q is scored by an anomaly detector built from the training examples (x_i, y_i); if A(x_q) > τ, the query is rejected; otherwise the classifier f outputs y = f(x_q)]

[Liu, Garrepalli, Fern, Dietterich, ICML 2018]
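A minimal sketch of the reject-option wrapper, using scikit-learn's IsolationForest as the anomaly detector; the threshold here is a plain quantile of the training scores, not the guarantee-backed τ of Liu et al.:

import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

class OpenCategoryClassifier:
    def __init__(self, quantile=0.99):
        self.detector = IsolationForest(random_state=0)
        self.classifier = RandomForestClassifier(random_state=0)
        self.quantile = quantile

    def fit(self, X, y):
        self.detector.fit(X)
        self.classifier.fit(X, y)
        # score_samples: higher means more normal, so negate to get an
        # anomaly score, then set tau at the chosen training-score quantile.
        self.tau = np.quantile(-self.detector.score_samples(X), self.quantile)
        return self

    def predict(self, X):
        preds = self.classifier.predict(X).astype(object)
        preds[-self.detector.score_samples(X) > self.tau] = "reject"
        return preds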
34. Change Detection
• "Two-sample" test: given S_a ~ P_a and S_b ~ P_b, is P_a ≠ P_b?
• Method 1: Kernel two-sample test
• Method 2: Old-vs-new classifier (sketch below)
  • Train a classifier to distinguish between S_a and S_b. Can it do better than random guessing?
  • At each time t, slide S_a and S_b one step forward in time (this requires online methods)
• An area of active research

[Diagram: sliding windows S_a = x_1, …, x_100 and S_b = x_101, …, x_200 over the data stream]
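A minimal batch sketch of the old-vs-new classifier test (Method 2); the online sliding update is omitted, and the decision threshold is an illustrative choice rather than a calibrated significance test:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def distribution_changed(S_a, S_b, threshold=0.6):
    # Label old-window points 0 and new-window points 1, then ask whether a
    # classifier can separate them better than the 50% chance rate.
    X = np.vstack([S_a, S_b])
    y = np.concatenate([np.zeros(len(S_a), dtype=int),
                        np.ones(len(S_b), dtype=int)])
    acc = cross_val_score(RandomForestClassifier(random_state=0),
                          X, y, cv=5, scoring="accuracy").mean()
    return acc > threshold, acc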
36. Application-Specific Metrics are Essential
• Standard metrics for evaluating classifiers, such as F1 and AUC, were developed for machine learning research
• Most applications require separate metrics
• Example: Financial fraud
  • Suppose we have 5 analysts, and each analyst can examine 10 cases per day
  • Metric: Expected value of the top 50 alarms (value@50) (sketch below)
  • Incorporates the estimated value of each candidate fraud alarm
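A minimal sketch of value@k, assuming each candidate alarm carries a model score and an estimated value if confirmed (names are illustrative):

import numpy as np

def value_at_k(scores, values, k=50):
    # Sum the estimated values of the k highest-scoring alarms -- the cases
    # the 5 analysts can actually examine in a day.
    scores, values = np.asarray(scores), np.asarray(values)
    top_k = np.argsort(scores)[::-1][:k]
    return values[top_k].sum()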
37. More Examples
• Open category detection:
  • Detect 99% of all open-category queries
  • Metric: Precision at 99% recall (sketch below)
• Obstacle detection for self-driving cars:
  • Detect 99.999% of all dangerous obstacles
  • Metric: Precision at 99.999% recall
• Cancer screening:
  • Must trade off false alarms versus missed alarms
  • Metric: Cost to the patient (may vary from one patient to another)
  • AUC is a fairly good metric for this case
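A sketch of precision at a fixed recall target, using scikit-learn's precision_recall_curve:

from sklearn.metrics import precision_recall_curve

def precision_at_recall(y_true, scores, target_recall=0.99):
    # Best precision attainable while still detecting at least
    # target_recall of the positives.
    precision, recall, _ = precision_recall_curve(y_true, scores)
    return max(p for p, r in zip(precision, recall) if r >= target_recall)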