Valencian Summer School in Machine Learning 2017 - Day 1
Lecture 2: Ensembles and Logistic Regressions. By Poul Petersen (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
BigML, Inc — Ensembles

What is an Ensemble?
• Rather than build a single model…
• Combine the output of several typically “weaker” models into
a powerful ensemble…
• Q1: Why is this necessary?
• Q2: How do we build “weaker” models?
• Q3: How do we “combine” models?
No Model is Perfect
• A given ML algorithm may simply not be able to exactly
model the “real solution” of a particular dataset.
• Try to fit a line to a curve
• Even if the model is very capable, the “real solution” may be
elusive
• Decision trees and neural networks can model any decision boundary
given enough training data, but finding the optimal solution is NP-hard
• Practical algorithms involve random processes and may
arrive at different, yet equally good, “solutions” depending
on the starting conditions, local optima, etc.
• If that wasn’t bad enough…
No Data is Perfect
• Not enough data!
• Always working with finite training data
• Therefore, every “model” is an approximation of the “real
solution” and there may be several good approximations.
• Anomalies / Outliers
• The model is trying to generalize from discrete training
data.
• Outliers can “skew” the model by encouraging overfitting
• Mistakes in your data
• Does the model have to do everything for you?
• But really, there are always mistakes in your data
Ensemble Techniques
• Key Idea:
• By combining several good “models”, the combination
may be closer to the best possible “model”
• We want to ensure diversity: it’s not useful to use an
ensemble of 100 models that are all the same
• Training Data Tricks
• Build several models, each with only some of the data
• Introduce randomness directly into the algorithm
• Add training weights to “focus” the additional models on
the mistakes made
• Prediction Tricks
• Model the mistakes
• Model the output of several different algorithms
Decision Forest Config
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
• Number of models: How many trees to build
• Sampling options:
• Deterministic / Random
• Replacement:
• Allows sampling the same instance more than once
• Effectively the same as sampling ≈ 63.21% of the distinct instances
• “Full size” samples with zero covariance (a good thing)
• At prediction time
• Combiner…
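The 63.21% figure comes from bootstrap arithmetic: when sampling n instances with replacement, each instance is never picked with probability (1 − 1/n)ⁿ → 1/e. A quick sketch in plain Python (illustration only, not BigML's API):

```python
import random

# Sampling n instances with replacement: an instance is never picked
# with probability (1 - 1/n)^n -> 1/e, so a "full size" sample holds
# about 1 - 1/e = 63.21% of the distinct instances.
def unique_fraction(n, seed=0):
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

print(round(unique_fraction(100_000), 3))  # close to 0.632
```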
Quick Review

Classification (the objective “action” is a category):

  animal    state   …  proximity  action
  tiger     hungry  …  close      run
  elephant  happy   …  far        take picture
  …         …       …  …          …

Regression (the objective “min_kmh” is a number):

  animal    state   …  proximity  min_kmh
  tiger     hungry  …  close      70
  hippo     angry   …  far        10
  …         …       …  …          …
Ensemble Combiners
• Regression: Average of the predictions and expected error
• Classification:
• Plurality - majority wins.
• Confidence Weighted - majority wins but each vote is
weighted by the confidence.
• Probability Weighted - each tree votes the distribution at
its leaf node.
• K Threshold - predicts the specified class only if the
required number of trees votes for it. For example, allowing
a “True” vote if and only if at least 9 out of 10 trees vote
“True”.
• Confidence Threshold - only votes the specified class if
the minimum confidence is met.
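A minimal sketch of the first two combiners in plain Python (the vote format is a simplification, not BigML's actual prediction payload):

```python
from collections import defaultdict

# Each tree votes (class, confidence).
votes = [("run", 0.80), ("run", 0.55), ("take picture", 0.90)]

def plurality(votes):
    counts = defaultdict(int)
    for cls, _ in votes:
        counts[cls] += 1
    return max(counts, key=counts.get)          # majority wins

def confidence_weighted(votes):
    weights = defaultdict(float)
    for cls, conf in votes:
        weights[cls] += conf                    # votes weighted by confidence
    return max(weights, key=weights.get)

print(plurality(votes))            # run  (2 votes vs 1)
print(confidence_weighted(votes))  # run  (1.35 vs 0.90)
```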
Outlier Example

  Diameter  Color  Shape  Fruit
  4         red    round  plum
  5         red    round  apple
  5         red    round  apple
  6         red    round  plum
  7         red    round  apple

What is a round, red, 6 cm fruit?
  All Data: “plum”
  Sample 1: “plum”
  Sample 2: “apple”
  Sample 3: “apple”
  Combined: “apple”
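Why sampling helps here: a bootstrap sample of this dataset omits the 6 cm “plum” row about a third of the time, so many models in the ensemble never learn the outlier at all, diluting its influence. A quick simulation (plain Python, illustration only):

```python
import random

# (diameter, fruit) rows from the slide; the 6 cm "plum" is the outlier.
data = [(4, "plum"), (5, "apple"), (5, "apple"), (6, "plum"), (7, "apple")]
outlier = (6, "plum")

rng = random.Random(42)
trials = 10_000
missing = 0
for _ in range(trials):
    sample = [rng.choice(data) for _ in data]   # sample with replacement
    if outlier not in sample:
        missing += 1

# Roughly (4/5)^5 = 33% of bootstrap samples never contain the outlier,
# so a third of the ensemble's models are untouched by it.
print(round(missing / trials, 2))
```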
Random Decision Forest

[Diagram: the DATASET is sampled into SAMPLE 1–4; each sample trains
MODEL 1–4; each model emits PREDICTION 1–4; a COMBINER merges these
into a single PREDICTION.]
RDF Config
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
• Decision Forest parameters still available
• Number of models, sampling, etc.
• Random candidates:
• The number of features to consider at each split
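What “random candidates” does, sketched in plain Python (hypothetical feature names; not BigML's internals):

```python
import random

# At each split, an RDF tree considers only a random subset of the
# features ("random candidates") instead of all of them, which
# decorrelates the trees in the forest.
features = ["beds", "baths", "sqft", "lot_size", "year_built"]
random_candidates = 2

rng = random.Random(7)
for split in range(3):
    candidates = rng.sample(features, random_candidates)
    print(f"split {split}: pick the best split among {candidates}")
```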
Boosting

“Hey Model 1, what do you predict is the sale price of this home?”

  ADDRESS            BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  LATITUDE    LONGITUDE    LAST SALE PRICE
  1522 NW Jonquil    4     3      2424  5227      1991        44.594828   -123.269328  360000
  7360 NW Valley Vw  3     2      1785  25700     1979        44.643876   -123.238189  307500
  4748 NW Veronica   5     3.5    4135  6098      2004        44.5929659  -123.306916  600000
  411 NW 16th        3            2825  4792      1938        44.570883   -123.272113  435350

  MODEL 1 PREDICTED SALE PRICE: 360750, 306875, 587500, 435350
  ERROR:                        750, -625, -12500, 0

“Hey Model 2, how much error do you predict Model 1 just made?”

The errors become the objective of a new dataset with the same input
features, with ERROR values 750, 625, 12500, 0 — and MODEL 2 predicts
that error:

  MODEL 2 PREDICTED ERROR: 750, 625, 12393.83333, 6879.67857

Why stop at one iteration?
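The Model 1 / Model 2 dialogue above can be sketched as a loop in plain Python. The one-split “stump” on SQFT below is a stand-in for BigML's trees, purely for illustration; the numbers are the slide's sale prices:

```python
# Minimal boosting-for-regression sketch: each iteration fits a weak
# model to the current residuals and adds its correction to the
# running prediction.
sqft  = [2424, 1785, 4135, 2825]            # one input feature
price = [360000, 307500, 600000, 435350]    # LAST SALE PRICE objective

def fit_stump(x, r):
    """Least-squares single split on x predicting residuals r."""
    best = None
    for t in sorted(set(x))[:-1]:           # candidate thresholds
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - (lm if xi <= t else rm)) ** 2
                  for xi, ri in zip(x, r))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

pred = [0.0] * len(price)
for _ in range(20):                         # boosting iterations
    resid = [p - q for p, q in zip(price, pred)]
    stump = fit_stump(sqft, resid)          # model the current errors
    pred = [q + stump(xi) for q, xi in zip(pred, sqft)]

print([round(p) for p in pred])             # approaches the true prices
```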
Boosting

[Diagram: iteration 1 trains MODEL 1 on the DATASET; each subsequent
iteration k builds DATASET k from the previous model's errors and
trains MODEL k on it; at prediction time, PREDICTION 1–4 are combined
by a SUM.]
Boosting Config
• Number of iterations - similar to number of models for DF/RDF
• Iterations can be limited with Early Stopping:
• Early out of bag: tests with the out-of-bag samples
Boosting Config

[Diagram: the same boosting pipeline, but at each iteration a set of
“out of bag” samples is held out from the dataset and used to test for
early stopping.]
Boosting Config
• Early holdout: tests with a portion of the dataset
• None: performs all iterations. Note: in general, it is better to
use a high number of iterations and let the early stopping work.
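The early-stopping loop can be sketched as follows; `holdout_error` is a made-up stand-in for evaluating the partial ensemble on held-out rows:

```python
# Early-stopping sketch: keep iterating while the held-out error keeps
# improving; stop once the improvement between iterations is tiny.
def holdout_error(iteration):
    # hypothetical error curve: improves quickly, then plateaus
    return 1.0 / (iteration + 1) + 0.05

max_iterations, eps = 500, 1e-4
previous = float("inf")
for i in range(1, max_iterations + 1):
    current = holdout_error(i)
    if previous - current < eps:
        break                              # converged: stop early
    previous = current

print(i)  # stops long before max_iterations
```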
Iterations

Boosted Ensemble #1:
  1 2 3 4 5 6 7 8 9 10 ….. 41 42 43 44 45 46 47 48 49 50
  (the early stop occurs before the # iterations limit)
This is OK: the early stop means the iterative improvement is small
and we have “converged” before being forcibly stopped by the
# iterations limit.

Boosted Ensemble #2:
  1 2 3 4 5 6 7 8 9 10 ….. 41 42 43 44 45 46 47 48 49 50
  (the # iterations limit is reached before any early stop)
This is NOT OK: the hard limit on iterations stopped the boosting
long before there were enough iterations to achieve the best quality.
Boosting Config
• Learning Rate: controls how aggressively boosting will fit the data
• Larger values may fit more quickly, but risk overfitting
• You can combine sampling with Boosting!
• Samples with Replacement
• Add Randomize
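The learning-rate trade-off in one dimension (illustration only; real boosting damps each tree's correction the same way):

```python
# A small learning rate shrinks each iteration's correction: many
# cautious steps instead of one aggressive jump, which lowers the risk
# of chasing noise (overfitting) at the cost of more iterations.
target, learning_rate = 100.0, 0.1

prediction, steps = 0.0, 0
while abs(target - prediction) > 0.01:
    prediction += learning_rate * (target - prediction)
    steps += 1

print(steps)  # the residual shrinks by a factor of 0.9 per step
```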
Boosting Randomize

[Diagram: the same boosting pipeline, with randomization applied when
building each iteration's model.]
Boosting Config
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
Wait a Second…

  pregnancies  plasma glucose  blood pressure  triceps skin thickness  insulin  bmi   diabetes pedigree  age  favorite color
  6            148             72              35                      0        33.6  0.627              50   RED
  1            85              66              29                      0        26.6  0.351              31   GREEN
  8            183             64              0                       0        23.3  0.672              32   BLUE
  1            89              66              23                      94       28.1  0.167              21   RED

  MODEL 1 predicted favorite color: BLUE, GREEN, RED, GREEN
  ERROR: ?, ?, ?, ?

…but then what about multiple classes?
Boosting Classification

For a multi-class objective, boosting fits one sequence of models per
class (one-vs-rest). Using the “favorite color” rows
(RED, GREEN, BLUE, RED):

Iteration 1 — MODEL 1 (RED/NOT RED) predicts a Class RED probability
per row:

  Class RED probability:  0.9,  0.7,  0.46,  0.12
  Class RED error:        0.1, -0.7,  0.54, -0.12

Iteration 2 — the errors become the objective of a new dataset (same
input features, ERROR column), and MODEL 2 (RED/NOT RED ERR) predicts
the error:

  Predicted error:        0.05, -0.54, 0.32, -0.22

In parallel, MODEL 1 (BLUE/NOT BLUE) predicts per row:

  Class BLUE probability: 0.1,  0.3,  0.54,  0.88
  Class BLUE error:      -0.1,  0.7, -0.54,  0.12

…and repeat for each class at each iteration.
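The per-class error computation can be sketched in plain Python, using one common residual definition (class indicator minus predicted probability); it is an assumption here, not necessarily the slide's exact convention:

```python
# One-vs-rest residuals for the RED class: the "actual" is 1 when the
# row's favorite color is RED, else 0; the error is actual minus the
# predicted P(RED), and it becomes the next iteration's objective.
rows = ["RED", "GREEN", "BLUE", "RED"]
p_red = [0.9, 0.7, 0.46, 0.12]        # MODEL 1 RED/NOT RED probabilities

actual = [1.0 if c == "RED" else 0.0 for c in rows]
error = [a - p for a, p in zip(actual, p_red)]
print([round(e, 2) for e in error])   # [0.1, -0.7, -0.46, 0.88]
```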
Boosting Classification

[Diagram: iteration 1 builds MODELS 1 (one per class) from the
DATASET; each subsequent iteration builds per-class DATASETS and
MODELS from the previous errors; the per-class PREDICTIONS from every
iteration are combined into a PROBABILITY per class.]
Which Ensemble Method
• The one that works best!
• Ok, but seriously. Did you evaluate?
• For "large" / "complex" datasets
• Use DF/RDF with deeper node threshold
• Even better, use Boosting with more iterations
• For "noisy" data
• Boosting may overfit
• RDF preferred
• For "wide" data
• Randomize features (RDF) will be quicker
• For "easy" data
• A single model may be fine
• Bonus: also has the best interpretability!
• For classification with "large" number of classes
• Boosting will be slower
• For "general" data
• DF/RDF likely better than a single model or Boosting.
• Boosting will be slower since the models are processed serially
Too Many Parameters?
• How many trees?
• How many nodes?
• Missing splits?
• Random candidates?
• Too many parameters?
SMACdown!
Summary
• Models have shortcomings: ability to fit, NP-hard, etc
• Data has shortcomings: not enough, outliers, mistakes, etc
• Ensemble Techniques can improve on single models
• Sampling: partitioning, Decision Tree bagging
• Adding Randomness: RDF
• Modeling the Error: Boosting
• Modeling the Models: Stacking
• Guidelines for knowing which one might work best in a given
situation
Logistic Regressions

Logistic Regression

Potential confusion: Logistic Regression is a classification algorithm.
• Classification implies a discrete objective. How can this be a
regression?
• Why do we need another classification algorithm?
• more questions…
Regression

Key take-away: regression is the process of “fitting” a function to
the data.

• Linear Regression: β₀ + β₁·(INPUT) ≈ OBJECTIVE
• Quadratic Regression: β₀ + β₁·(INPUT) + β₂·(INPUT)² ≈ OBJECTIVE
• Decision Tree Regression: DT(INPUT) ≈ OBJECTIVE
• Problem:
• What if we want to do a classification problem: T/F or 1/0?
• What function can we fit to discrete data?
Logistic Function

  f(x) = 1 / (1 + e^(−x))

• As x → −∞, f(x) → 0; as x → +∞, f(x) → 1 — exactly the goal
• Looks promising, but still not “discrete”
• What about the “green” in the middle?
• Let’s change the problem…
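The function itself, for concreteness:

```python
import math

# The logistic function squashes any real input into (0, 1).
def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0))              # 0.5: the "green" middle
print(round(logistic(10), 4))   # ~1.0 as x grows
print(round(logistic(-10), 4))  # ~0.0 as x shrinks
```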
Logistic Regression

Clarification: LR is a classification algorithm… that uses a
regression… to model the probability of the discrete objective.

Caveats:
• Assumes that the output is linearly related to the “predictors”
• What? (hang in there…)
• Sometimes we can “fix” this with feature engineering
• Question: how do we “fit” the logistic function to real data?
Logistic Regression

• Given training data consisting of inputs x and probabilities P
• Solve for β₀ and β₁ to fit the logistic function:

  P(x) = 1 / (1 + e^(−(β₀ + β₁x)))

• How? The inverse of the logistic function is called the “logit”:

  ln( P(x) / (1 − P(x)) ) = β₀ + β₁x

• Solving for β₀ (the “intercept”) and β₁ (the “coefficient”) is now
a linear regression
• But this is only one dimension, that is, one feature x…
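Checking that the logit undoes the logistic function, in plain Python (the β values are made up for the demonstration):

```python
import math

# The logit ln(P / (1 - P)) is the inverse of the logistic function:
# applying it to P(x) recovers the linear part beta0 + beta1 * x,
# which is why the fitting reduces to a linear regression.
def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

b0, b1, x = -1.5, 0.8, 2.0
p = logistic(b0 + b1 * x)
print(round(logit(p), 6))  # 0.1, i.e. b0 + b1*x
```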
LR Parameters
1. Default Numeric: Replaces missing numeric values
2. Missing Numeric: Adds a field for missing numerics
3. Stats: Extended statistics, ex: p-value (runs slower)
4. Bias: Enables/Disables the intercept term - 𝛽₀
• Don’t disable this…
5. Regularization: Reduces over-fitting by keeping the coefficients 𝛽𝑗 small
• L1: prefers reducing individual coefficients
• L2 (default): prefers reducing all coefficients
6. Strength "C": Higher values reduce regularization
7. EPS: The minimum error improvement between steps to stop
Larger values stop earlier, but quality may be lower
8. Auto-scaling: Ensures that all features contribute equally
• Don’t change this unless you have a specific reason
LR Questions

• How do we handle multiple classes?
• For a binary class True/False we only need to solve for one, since
P(True) ≡ 1 − P(False)
• What about non-numeric inputs?
• Text/Items fields
• Categorical fields
LR - Multi Class

• Instead of a binary class, e.g. [ true, false ],
we have multiple classes, e.g. [ red, green, blue, … ]
• “k” classes: C = [ c₁, c₂, ⋯, c_k ]
• Solve a one-vs-rest LR per class:

  ln( P(c₁) / (1 − P(c₁)) ) = β₁,₀ + 𝞫₁·X
  ln( P(c₂) / (1 − P(c₂)) ) = β₂,₀ + 𝞫₂·X
  ⋯
  ln( P(c_k) / (1 − P(c_k)) ) = β_k,₀ + 𝞫_k·X

• Result: 𝞫ⱼ for each class cⱼ
• Apply a combiner to ensure all probabilities add to 1
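A sketch of the combiner step, with hypothetical raw per-class outputs and simple proportional normalization (one common choice):

```python
# One-vs-rest gives a raw probability per class; these need not sum
# to 1, so a combiner normalizes them into a proper distribution.
raw = {"red": 0.70, "green": 0.40, "blue": 0.10}   # hypothetical outputs

total = sum(raw.values())
probs = {c: p / total for c, p in raw.items()}
print({c: round(p, 3) for c, p in probs.items()})
```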
LR - Field Codings
• LR is expecting numeric values to perform regression.
• How do we handle categorical values, or text?
Class color=red color=blue color=green color=NULL
red 1 0 0 0
blue 0 1 0 0
green 0 0 1 0
MISSING 0 0 0 1
One-hot encoding
• Only one feature is "hot" for each class
• This is the default
LR - Field Codings
Dummy Encoding
• Chooses a *reference class*
• uses one fewer degree of freedom
Class color_1 color_2 color_3
*red* 0 0 0
blue 1 0 0
green 0 1 0
MISSING 0 0 1
LR - Field Codings
Contrast Encoding
• Field values must sum to zero
• Allows comparison between classes
Class field "influence"
red 0.5 positive
blue -0.25 negative
green -0.25 negative
MISSING 0 excluded
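The sum-to-zero constraint is easy to check with the slide's values:

```python
# A contrast coding is valid only if its values sum to zero; the
# fitted coefficient then measures a comparison between class groups.
coding = {"red": 0.5, "blue": -0.25, "green": -0.25}

total = sum(coding.values())
print(abs(total) < 1e-9)  # True: a valid contrast
```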
LR - Field Codings

Which one to use?
• One-hot is the default
• Use this unless you have a specific need
• Dummy
• Use when there is a control group in mind, which
becomes the reference class
• Contrast
• Allows for testing specific hypotheses of relationships.
• Ex: customers give a “rating” of bad / ok / good

  rating  Contrast Encoding
  bad     -0.66
  ok       0.33
  good     0.33

Hypothesis: a good and an ok review have the same impact, but a bad
review has a negative impact twice as great.

  rating  Contrast Encoding
  bad     -0.5
  ok       0
  good     0.5

Hypothesis: a good and a bad review have an equal but opposite impact,
while an ok rating has no impact.
LR - Field Codings

Text / Items
• Text/Items field types are handled by creating a field for each
text token/item and setting it to 1 or 0

  Text                                    “hippo”  “safari”  “zebra”
  “we saw hippos and zebras…”             1        0         1
  “The best safari for seeing zebras”     0        1         1
  “The Oregon coast is rainy in winter”   0        0         0
  “Have you ever tried a hippo burger”    1        0         0
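The token-to-binary-feature idea, sketched in plain Python (a crude prefix match stands in for real tokenization and stemming, so that “zebras” matches “zebra”):

```python
# One binary feature per known token.
tokens = ["hippo", "safari", "zebra"]

def text_features(text):
    words = text.lower().split()
    return [1 if any(w.startswith(t) for w in words) else 0
            for t in tokens]

print(text_features("we saw hippos and zebras"))           # [1, 0, 1]
print(text_features("The best safari for seeing zebras"))  # [0, 1, 1]
```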
Curvilinear LR
• Logistic Regression is expecting a linear relationship
between the features and the objective
• Remember - it’s a linear regression under the hood
• Non-linear relationships are actually pretty common in natural
datasets, and they will impact model quality
• This can be addressed by adding non-linear
transformations to the features
• Knowing which transformations requires
• domain knowledge
• experimentation
• both
Curvilinear LR

Instead of fitting

  β₀ + β₁x₁

we could add a feature x₂, where x₂ ≡ x₁², and fit

  β₀ + β₁x₁ + β₂x₂

It is possible to add any higher-order terms or other functions to
match the shape of the data.
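A tiny demonstration of why the engineered feature helps, assuming data shaped like y = x² (least-squares slopes through the origin, for simplicity):

```python
# A single linear term cannot fit a symmetric curve, but the
# engineered feature x2 = x1^2 makes the relationship linear in the
# features and fits it exactly.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [4.0, 1.0, 0.0, 1.0, 4.0]   # y = x^2: curved, not linear in x

def fit_one_feature(feature):
    # least-squares slope for a single feature, then the fit's error
    beta = (sum(f * y for f, y in zip(feature, ys))
            / sum(f * f for f in feature))
    return sum((y - beta * f) ** 2 for f, y in zip(feature, ys))

print(fit_one_feature(xs))                   # 34.0: a line can't fit it
print(fit_one_feature([x * x for x in xs]))  # 0.0: x2 fits exactly
```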
LR vs DT

Logistic Regression:
• Expects a “smooth” linear relationship with predictors
• LR is concerned with the probability of a discrete outcome
• Lots of parameters to get wrong: regularization, scaling, codings
• Slightly less prone to over-fitting
• Because it fits a shape, it might work better when less data is
available

Decision Tree:
• Adapts well to ragged non-linear relationships
• No concern: classification, regression, multi-class are all fine
• Virtually parameter free
• Slightly more prone to over-fitting
• Prefers surfaces parallel to the parameter axes, but given enough
data will discover any shape
Summary
• Logistic Regression is a classification algorithm that
models the probabilities of each class
• How the algorithm works and why this is important
• Expects a linear relationship between the features
and the objective, and how to fix it
• Categorical encodings
• LR outputs a set of coefficients, and how to interpret them
• Scale relates to impact
• Sign relates to direction of impact
• Guidelines for comparing to Decision Trees