1. How to Create Predictive Models in R using Ensembles
Giovanni Seni, Ph.D.
Intuit
@IntuitInc
Giovanni_Seni@intuit.com
Santa Clara University
GSeni@scu.edu
Strata - Hadoop World, New York
October 28, 2013
3. Overview
• Motivation, In a Nutshell & Timeline
• Predictive Learning & Decision Trees
• Ensemble Methods - Diversity & Importance Sampling
– Bagging
– Random Forest
– AdaBoost
– Gradient Boosting
– Rule Ensembles
• Summary
5. Motivation (2)
“1st Place Algorithm Description: … 4. Classification: Ensemble classification methods are used to combine
multiple classifiers. Two separate Random Forest ensembles are created based on the shadow index (one for the
shadow-covered area and one for the shadow-free area). The random forest “Out of Bag” error is used to
automatically evaluate features according to their impact, resulting in 45 features selected for the shadow-free and
55 for the shadow-covered part.”
6. Motivation (3)
• “What are the best of the best techniques at winning
Kaggle competitions?
– Ensembles of Decision Trees
– Deep Learning
account for 90% of top 3 winners!”
Jeremy Howard, Chief Scientist of Kaggle
KDD 2013
⇒ Key common characteristics:
– Resistance to overfitting
– Universal approximators
7. Ensemble Methods in a Nutshell
• “Algorithmic” statistical procedure
• Based on combining the fitted values from a number of
fitting attempts
• Loosely related to:
– Iterative procedures
– Bootstrap procedures
• Original idea: a “weak” procedure can be strengthened if
it can operate “by committee”
– e.g., combining low-bias/high-variance procedures
• Accompanied by interpretation methodology
8. Timeline
• CART (Breiman, Friedman, Olshen, Stone, 1984)
• Bagging (Breiman, 1996)
– Random Forest (Ho, 1995; Breiman 2001)
• AdaBoost (Freund, Schapire, 1997)
• Boosting – a statistical view (Friedman et al., 2000)
– Gradient Boosting (Friedman, 2001)
– Stochastic Gradient Boosting (Friedman, 1999)
• Importance Sampling Learning Ensembles (ISLE)
(Friedman, Popescu, 2003)
9. Timeline (2)
• Regularization – variance control techniques:
– Lasso (Tibshirani, 1996)
– LARS (Efron, 2004)
– Elastic Net (Zou, Hastie, 2005)
– GLMs via Coordinate Descent (Friedman, Hastie, Tibshirani, 2008)
• Rule Ensembles (Friedman, Popescu, 2008)
10. Overview
• Motivation, In a Nutshell & Timeline
Ø Predictive Learning & Decision Trees
• Ensemble Methods
• Summary
11. Predictive Learning
Procedure Summary
• Given “training” data D = {y_i, x_{i1}, x_{i2}, …, x_{in}}_1^N = {y_i, x_i}_1^N
– D is a random sample from some unknown (joint) distribution
• Build a functional model y = F(x_1, x_2, …, x_n) = F(x)
– Offers adequate and interpretable description of how the inputs
affect the outputs
– Parsimony is an important criterion: simpler models are preferred
for the sake of scientific insight into the x - y relationship
• Need to specify: < model, score criterion, search strategy >
12. Predictive Learning
Procedure Summary (2)
• Model: underlying functional form sought from data
F(x) = F(x; a) ∈ ℱ, a family of functions indexed by a
• Score criterion: judges (lack of) quality of the fitted model
– Loss function L(y, F): penalizes individual errors in prediction
– Risk R(a) = E_{y,x} L(y, F(x; a)): the expected loss over all predictions
• Search strategy: minimization procedure for the score criterion
a* = arg min_a R(a)
13. Predictive Learning
Procedure Summary (3)
• “Surrogate” score criterion:
– Training data: {y_i, x_i}_1^N ~ p(x, y)
– p(x, y) unknown ⇒ a* unknown
⇒ Use an approximation, the Empirical Risk:
R̂(a) = (1/N) Σ_{i=1}^N L(y_i, F(x_i; a))   ⇒   â = arg min_a R̂(a)
• If not N >> n, then R(â) >> R(a*)
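To make the <model, score criterion, search strategy> triple concrete, here is a minimal R sketch (an illustration assumed for this writeup, not one of the tutorial scripts): a linear model, squared-error empirical risk, and a generic numerical search.

# Minimal sketch (assumed example): empirical risk minimization in R.
set.seed(1)
N <- 200
x <- runif(N)
y <- 2 + 3 * x + rnorm(N, sd = 0.5)          # data from an unknown joint distribution

# Score criterion: empirical risk with squared-error loss,
# R_hat(a) = (1/N) * sum_i (y_i - F(x_i; a))^2 for the model F(x; a) = a1 + a2*x
empirical_risk <- function(a, x, y) mean((y - (a[1] + a[2] * x))^2)

# Search strategy: generic numerical minimization of the score criterion
a_hat <- optim(c(0, 0), empirical_risk, x = x, y = y)$par
a_hat                                        # should be close to c(2, 3)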
14. Predictive Learning
Example
• A simple data set:

Attribute-1 (x1)   Attribute-2 (x2)   Class (y)
1.0                2.0                blue
2.0                1.0                green
…                  …                  …
4.5                3.5                ?

• What is the class of the new point (4.5, 3.5)?
• Many approaches… no method is universally better; try several / use a committee
15. Predictive Learning
Example (2)
• Ordinary Linear Regression (OLR)
– Model: F(x) = a_0 + Σ_{j=1}^n a_j x_j; predict one class if F(x) ≥ 0, else the other

[figure: linear decision boundary in the (x1, x2) plane]

⇒ Not flexible enough
16. Decision Trees
Overview
[figure: partition of the (x1, x2) plane into regions R1–R4 by the splits x1 ≥ 5, x2 ≥ 3, and x1 ≥ 2]

• Model: ŷ = T̂(x) = Σ_{m=1}^M ĉ_m I_{R_m}(x)
– {R_m}_{m=1}^M = sub-regions of the input variable space
– where I_R(x) = 1 if x ∈ R, 0 otherwise
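As a quick illustration of this model form, a minimal R sketch (assuming the rpart package and the built-in iris data, neither of which is tied to the tutorial's examples):

# Minimal sketch (assumed example): a CART-style tree is a piecewise-constant
# model y_hat = sum_m c_m * I(x in R_m); rpart finds the regions and constants.
library(rpart)

fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris, method = "class")
print(fit)                                # the splits define the regions R_m
plot(fit); text(fit)                      # binary tree graphic
predict(fit, iris[1:3, ], type = "class")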
17. Decision Trees
Overview (2)
• Score criterion:
– Classification – “0-1 loss” ⇒ misclassification error (or a surrogate)
{ĉ_m, R̂_m}_1^M = arg min_{T_M = {c_m, R_m}_1^M} Σ_{i=1}^N I(y_i ≠ T_M(x_i))
– Regression – least squares – i.e., L(y, ŷ) = (y − ŷ)²
{ĉ_m, R̂_m}_1^M = arg min_{T_M = {c_m, R_m}_1^M} Σ_{i=1}^N (y_i − T_M(x_i))² = R̂(T_M)
• Search: find T̂ = arg min_T R̂(T)
– i.e., find the best regions R_m and constants c_m
18. Decision Trees
Overview (3)
• Joint optimization with respect to R_m and c_m simultaneously is very difficult
⇒ use a greedy iterative procedure
[figure: greedy recursive partitioning – starting from R0, each step selects a split variable and split point (j1, s1), (j2, s2), … , successively subdividing the regions]
19. Decision Trees
What is the “right” size of a model?
[figure: one-dimensional example – a fit with two constants c1, c2 vs. a fit with three constants c1, c2, c3 on the same scatter of points]
• Dilemma
– If model (# of splits) is too small, then approximation is too crude
(bias) ⇒ increased errors
– If model is too large, then it fits the training data too closely
(overfitting, increased variance) ⇒ increased errors
20. Decision Trees
What is the “right” size of a model? (2)
[figure: prediction error vs. model complexity – training error decreases monotonically; test error is U-shaped, going from high bias/low variance (left) to low bias/high variance (right), with its minimum at M*]

– The right-sized tree, M*, is where the test error is at a minimum
– Error on the training data is not a useful estimator!
• If a test set is not available, an alternative method is needed
21. Decision Trees
Pruning to obtain “right” size
• Two strategies
– Prepruning - stop growing a branch when information becomes
unreliable
• #(Rm) – i.e., number of data points, too small
⇒ same bound everywhere in the tree
• Next split not worthwhile
⇒ Not sufficient condition
– Postpruning - take a fully-grown tree and discard unreliable parts
(i.e., not supported by test data)
• C4.5: pessimistic pruning
• CART: cost-complexity pruning (more statistically grounded)
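For example, CART-style cost-complexity pruning can be sketched in R with rpart (an assumed example; the iris data and parameter settings are illustrative only):

# Minimal sketch (assumed example): grow a large tree, then cost-complexity prune.
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 2))   # deliberately overgrown
printcp(fit)                                   # cross-validated error per subtree
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)            # discard unreliable branches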
22. Decision Trees
Hands-on Exercise

• Start RStudio
• Set working directory: use setwd() or with the GUI
• Navigate to directory: example.1.LinearBoundary
• Load and run “fitModel_CART.R”
• If curious, also see “gen2DdataLinear.R”
• After the boosting discussion, load and run “fitModel_GBM.R”

[figure: 2-D training data on the unit square (x1, x2) with a linear class boundary]
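The exercise scripts ship with the tutorial materials; purely as an orientation, here is a hypothetical sketch of what a fitModel_CART.R-style script might contain (the data file and column names below are assumptions, not the actual script):

# Hypothetical sketch of a fitModel_CART.R-style script (assumed file/column names)
library(rpart)

dat <- read.csv("data2DLinear.csv")            # assumed columns: x1, x2, y
dat$y <- factor(dat$y)

fit <- rpart(y ~ x1 + x2, data = dat, method = "class")
plot(dat$x1, dat$x2, col = as.integer(dat$y) + 1, pch = 19)  # data by class
print(fit)        # axis-parallel splits approximating the linear boundary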
23. Decision Trees
Key Features
• Ability to deal with irrelevant inputs
– i.e., automatic variable subset selection
– Measure anything you can measure
– Score provided for selected variables ("importance")
• No data preprocessing needed
– Naturally handles all types of variables (numeric, binary, categorical)
– Invariant under monotone transformations x_j → g_j(x_j):
• Variable scales are irrelevant
• Immune to bad x_j distributions (e.g., outliers)
24. Decision Trees
Key Features (2)
• Computational scalability
– Relatively fast: O(nN log N )
• Missing value tolerant
- Moderate loss of accuracy due to missing values
- Handling via "surrogate" splits
• "Off-the-shelf" procedure
- Few tunable parameters
• Interpretable model representation
- Binary tree graphic
25. Decision Trees
Limitations
• Discontinuous piecewise constant model
[figure: a discontinuous piecewise-constant F(x) approximating a smooth curve]
– In order to have many splits you need to have a lot of data
• In high dimensions, you often run out of data after a few splits
– Also note error is bigger near region boundaries
26. Decision Trees
Limitations (2)
• Not good for low-interaction F*(x)
– e.g., F*(x) = a_0 + Σ_{j=1}^n a_j x_j = Σ_{j=1}^n f_j*(x_j) (no interactions, additive) is the worst function for trees
– In order for x_l to enter the model, it must be split on
• Path from root to node is a product of indicators
• Not good for F*(x) that has dependence on many variables
– Each split reduces the training data available for subsequent splits (data fragmentation)
27. Decision Trees
Limitations (3)
• High variance caused by greedy search strategy (local optima)
– Errors in upper splits propagate down to affect all splits below them
⇒ Small changes in data (sampling fluctuations) can cause
big changes in tree
- Very deep trees might be questionable
- Pruning is important
• What to do next?
– Live with problems
– Use other methods (when possible)
– Fix-up trees: use ensembles
28. Overview
• In a Nutshell & Timeline
• Predictive Learning & Decision Trees
Ø Ensemble Methods
– In a Nutshell, Diversity & Importance Sampling
– Generic Ensemble Generation
– Bagging, RF, AdaBoost, Boosting, Rule Ensembles
• Summary
29. Ensemble Methods
In a Nutshell
• Model: F(x) = c_0 + Σ_{m=1}^M c_m T_m(x)
– {T_m(x)}_1^M: “basis” functions (or “base learners”)
– i.e., a linear model in a (very) high-dimensional space of derived variables
• Learner characterization: T_m(x) = T(x; p_m)
– p_m: a specific set of joint parameter values – e.g., split definitions at internal nodes and predictions at terminal nodes
– {T(x; p)}_{p∈P}: function class – i.e., the set of all base learners of the specified family
30. Ensemble Methods
In a Nutshell (2)
• Learning: a two-step process; an approximate solution to
{c_m, p_m}_0^M = arg min Σ_{i=1}^N L(y_i, c_0 + Σ_{m=1}^M c_m T(x_i; p_m))
– Step 1: choose points {p_m}_1^M
• i.e., select {T_m(x)}_1^M ⊂ {T(x; p)}_{p∈P}
– Step 2: determine weights {c_m}_0^M
• e.g., via regularized linear regression
31. Ensemble Methods
Importance Sampling (Friedman, 2003)
• How to judiciously choose the “basis” functions (i.e., {p_m}_1^M)?
• Goal: find “good” {p_m}_1^M so that F(x; {p_m}_1^M, {c_m}_1^M) ≅ F*(x)
• Connection with numerical integration:
∫_P I(p) dp ≈ Σ_{m=1}^M w_m I(p_m)

[figure: an integrand I(p) – accuracy improves when more points are chosen from the region where I(p) is large]
32. Importance Sampling
Numerical Integration via Monte Carlo Methods
M
• r (p) = sampling pdf of p ∈ P -- i.e, {p m ~ r (p)}1
– Simple approach: r (p m ) iid -- i.e., uniform
– In our problem: inversely related to p m’s “risk”
• i.e., T (x; p m ) has high error ⇒ lack of relevance of p m ⇒ low r (pm )
• “Quasi” Monte Carlo:
– with/out knowledge of the other points that will be used
• i.e., single point vs. group importance
– Sequential approximation: p’s relevance judged in the context of
the (fixed) previously selected points
33. Ensemble Methods
Importance Sampling – Characterization of r(p)
• Let p* = arg min_p Risk(p)
• Narrow r(p):
– Ensemble {T(x; p_m)}_1^M of “strong” base learners – i.e., all with Risk(p_m) ≈ Risk(p*)
– The T(x; p_m)’s yield similar, highly correlated predictions ⇒ unexceptional performance
• Broad r(p):
– Diverse ensemble – i.e., predictions are not highly correlated with each other
– However, many “weak” base learners – i.e., Risk(p_m) >> Risk(p*) ⇒ poor performance
34. Ensemble Methods
Approximate Process of Drawing from r(p)
• Heuristic sampling strategy: sample around p* by iteratively applying small perturbations to an existing problem structure
– Generating ensemble members T_m(x) = T(x; p_m):

For m = 1 to M {
p_m = PERTURB_m { arg min_p E_xy L(y, T(x; p)) }
}

– PERTURB{·} is a (random) modification of any of:
• the data distribution – e.g., by re-weighting the observations
• the loss function – e.g., by modifying its argument
• the search algorithm (used to find min_p)
– The width of r(p) is controlled by the degree of perturbation
35. Generic Ensemble Generation
Step 1: Choose Base Learners {p_m}

• Forward Stagewise Fitting Procedure:

F_0(x) = 0
For m = 1 to M {
// Fit a single base learner (modification of the data distribution via S_m(η))
p_m = arg min_p Σ_{i∈S_m(η)} L(y_i, F_{m−1}(x_i) + T(x_i; p))
// Update the additive expansion
T_m(x) = T(x; p_m)
F_m(x) = F_{m−1}(x) + υ ⋅ T_m(x)
}
write {T_m(x)}_1^M

– Algorithm control: L, η, υ
• S_m(η): random sub-sample of size η ≤ N ⇒ impacts ensemble “diversity”
• F_{m−1}(x) = υ ⋅ Σ_{k=1}^{m−1} T_k(x): “memory” function (0 ≤ υ ≤ 1); modifies the loss function (a “sequential” approximation)
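A minimal R sketch of this generic loop (an assumed example with squared-error loss and small rpart trees as base learners; the data and settings are illustrative):

# Minimal sketch (assumed example): generic forward stagewise ensemble generation.
library(rpart)

set.seed(1)
N <- 500
x <- runif(N)
y <- sin(2 * pi * x) + rnorm(N, sd = 0.3)
dat <- data.frame(x = x, y = y)

M   <- 100     # ensemble size
nu  <- 0.1     # "memory" parameter upsilon (0 <= upsilon <= 1)
eta <- N / 2   # sub-sample size S_m(eta) -> controls diversity

Fm <- rep(0, N)                        # F_{m-1}(x_i) on the training data
ensemble <- vector("list", M)
for (m in 1:M) {
  S <- sample(N, eta)                  # perturb the data distribution
  dat$r <- dat$y - Fm                  # for squared-error loss, fit the residuals
  tree <- rpart(r ~ x, data = dat[S, ], control = rpart.control(maxdepth = 2))
  ensemble[[m]] <- tree
  Fm <- Fm + nu * predict(tree, dat)   # "memory" update F_m = F_{m-1} + nu * T_m
}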
36. Generic Ensemble Generation
Step 2: Choose Coefficients {c_m}

• Given {T_m(x)}_{m=1}^M = {T(x; p_m)}_{m=1}^M, the coefficients can be obtained by a regularized linear regression:
{ĉ_m}_0^M = arg min_{{c_m}} Σ_{i=1}^N L(y_i, c_0 + Σ_{m=1}^M c_m T_m(x_i)) + λ ⋅ P(c)
– Regularization here helps reduce bias (in addition to variance) of the model
– New fast iterative algorithms exist for various loss/penalty combinations
• “GLMs via Coordinate Descent” (2008)
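Continuing the sketch above (still an assumed example), the lasso via the glmnet package can play the role of the regularized linear regression in Step 2:

# Minimal sketch (assumed example): post-process base learners with the lasso.
# Reuses `ensemble` and `dat` from the forward stagewise sketch above.
library(glmnet)

Tmat <- sapply(ensemble, function(tree) predict(tree, dat))  # column m = T_m(x_i)

cv  <- cv.glmnet(Tmat, dat$y, alpha = 1)    # lasso penalty: P(c) = sum |c_m|
c_m <- coef(cv, s = "lambda.min")           # c_0 and the c_m's (many exactly zero)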
37. Bagging (Breiman, 1996)
• Bagging = Bootstrap Aggregation
• In terms of the generic procedure:
– L(y, ŷ): as available for a single tree
– υ = 0 ⇒ no memory
– η = N/2
– T_m(x) ⇒ large un-pruned trees
– c_0 = 0, {c_m = 1/M}_1^M – i.e., not fit to the data (simple averaging)

F_0(x) = 0
For m = 1 to M {
p_m = arg min_p Σ_{i∈S_m(η)} L(y_i, F_{m−1}(x_i) + T(x_i; p))
T_m(x) = T(x; p_m)
F_m(x) = F_{m−1}(x) + υ ⋅ T_m(x)
}
write {T_m(x)}_1^M

– i.e., a perturbation of the data distribution only
– Potential improvements?
– R package: ipred
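As a sketch of the procedure (an assumed example, not the tutorial's fitModel_Bagging_by_hand.R), bagging can be written in a few lines of R with rpart:

# Minimal sketch (assumed example): bagging "by hand" with rpart.
library(rpart)

set.seed(1)
dat <- iris[iris$Species != "setosa", ]      # a two-class problem
dat$Species <- droplevels(dat$Species)

M <- 50
bag <- lapply(1:M, function(m) {
  S <- sample(nrow(dat), replace = TRUE)     # bootstrap sample
  rpart(Species ~ ., data = dat[S, ], method = "class",
        control = rpart.control(cp = 0))     # large un-pruned tree
})

# Aggregate with c_m = 1/M: average the class probabilities
probs <- Reduce(`+`, lapply(bag, predict, newdata = dat)) / M
pred  <- colnames(probs)[max.col(probs)]
mean(pred == dat$Species)                    # training accuracy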
38. Bagging
Hands-on Exercise
• Navigate to directory: example.2.EllipticalBoundary
• Set working directory: use setwd() or with the GUI
• Load and run
– fitModel_Bagging_by_hand.R
– fitModel_CART.R (optional)
• If curious, also see gen2DdataNonLinear.R
• After class, load and run fitModel_Bagging.R

[figure: 2-D training data with an elliptical class boundary in the (x1, x2) plane]
39. Bagging
Why does it help?
• Under L(y, ŷ) = (y − ŷ)², averaging reduces variance and leaves bias unchanged
• Consider an “idealized” bagging (aggregate) estimator: f̄(x) = E f̂_Z(x)
– f̂_Z fit to bootstrap data set Z = {y_i, x_i}_1^N
– Z is sampled from the actual population distribution (not the training data)
– We can write:
E[Y − f̂_Z(x)]² = E[Y − f̄(x) + f̄(x) − f̂_Z(x)]²
= E[Y − f̄(x)]² + E[f̂_Z(x) − f̄(x)]²
≥ E[Y − f̄(x)]²
⇒ true population aggregation never increases mean squared error!
⇒ Bagging will often decrease MSE…
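This variance-reduction effect is easy to see empirically; a minimal R sketch (an assumed simulation, not from the slides) compares the variance of a single tree's prediction with that of a bagged average at a fixed test point:

# Minimal sketch (assumed example): averaging reduces prediction variance.
library(rpart)
set.seed(1)

predict_once <- function(M) {
  x <- runif(200); y <- sin(2 * pi * x) + rnorm(200, sd = 0.5)
  dat <- data.frame(x, y)
  fits <- replicate(M, {
    S <- sample(200, replace = TRUE)                      # bootstrap sample
    predict(rpart(y ~ x, data = dat[S, ]), data.frame(x = 0.5))
  })
  mean(fits)                     # average of M bootstrap trees at x = 0.5
}

var(replicate(100, predict_once(1)))    # single tree: higher variance
var(replicate(100, predict_once(25)))   # bagged average: lower variance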
40. Random Forest (Ho, 1995; Breiman, 2001)
• Random Forest = Bagging + algorithm randomization
– Subset splitting: as each tree is constructed…
• Draw a random sample of n_s predictors before each node is split: n_s = ⌊log₂(n) + 1⌋
• Find the best split as usual, but selecting only from that subset of predictors
⇒ Increased diversity among {T_m(x)}_1^M – i.e., a wider r(p)
• Width (inversely) controlled by n_s
– Speed improvement over Bagging
– R package: randomForest
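In R (an assumed example; the data and the mtry choice are illustrative, and note that randomForest's own default for classification is sqrt(n)):

# Minimal sketch (assumed example): Random Forest in R.
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500,
                    mtry  = floor(log2(4) + 1))   # n_s predictors tried per split
print(fit)          # includes the "Out of Bag" error estimate
importance(fit)     # variable importance scores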
41. Bagging vs. Random Forest vs. ISLE
100 Target Functions Comparison (Popescu, 2005)
• ISLE improvements (over Bagging/RF):
– Different data sampling strategy (not fixed)
– Coefficients fit to the data
• xxx_6_5%_P: 6-terminal-node trees, 5% samples without replacement, post-processing – i.e., using estimated “optimal” quadrature coefficients
⇒ Significantly faster to build!

[figure: comparative RMS error over the 100 target functions for Bag, RF, Bag_6_5%_P, and RF_6_5%_P]
42. AdaBoost (Freund & Schapire, 1997)
• AdaBoost algorithm:

Initialize observation weights: w_i^(0) = 1/N
For m = 1 to M {
a. Fit a classifier T_m(x) to the training data with weights w_i^(m)
b. Compute err_m = Σ_{i=1}^N w_i^(m) I(y_i ≠ T_m(x_i))
c. Compute α_m = log((1 − err_m)/err_m)
d. Set w_i^(m+1) = w_i^(m) ⋅ exp[α_m ⋅ I(y_i ≠ T_m(x_i))]
}
Output: sign(Σ_{m=1}^M α_m T_m(x))

• Equivalence to the Forward Stagewise Fitting Procedure:

F_0(x) = 0
For m = 1 to M {
(c_m, p_m) = arg min_{c,p} Σ_{i∈S_m(η)} L(y_i, F_{m−1}(x_i) + c ⋅ T(x_i; p))
T_m(x) = T(x; p_m)
F_m(x) = F_{m−1}(x) + υ ⋅ c_m ⋅ T_m(x)
}
write {c_m, T_m(x)}_1^M

– We need to show p_m = arg min(⋅) is equivalent to line a. above
– and c_m = arg min(⋅) is equivalent to line c. (see book)
• R package: adabag
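The algorithm above is short enough to code directly; a minimal R sketch of discrete AdaBoost with decision stumps (an assumed example, not the tutorial's fitModel_Adaboost_by_hand.R), with comments keyed to lines a.–d. above:

# Minimal sketch (assumed example): discrete AdaBoost "by hand" with stumps.
library(rpart)

set.seed(1)
N <- 300
x1 <- runif(N); x2 <- runif(N)
y  <- ifelse(x1^2 + x2^2 > 0.5, 1, -1)               # labels in {-1, +1}
dat <- data.frame(x1, x2, y = factor(y))

M <- 50
w <- rep(1 / N, N)                                   # w_i^(0) = 1/N
alpha <- numeric(M)
pred_train <- matrix(0, N, M)

for (m in 1:M) {
  fit  <- rpart(y ~ x1 + x2, data = dat, weights = w,     # line a.
                control = rpart.control(maxdepth = 1))
  pred <- ifelse(predict(fit, dat)[, "1"] > 0.5, 1, -1)
  err  <- sum(w * (pred != y)) / sum(w)                   # line b.
  alpha[m] <- log((1 - err) / err)                        # line c.
  w <- w * exp(alpha[m] * (pred != y)); w <- w / sum(w)   # line d.
  pred_train[, m] <- pred
}
committee <- sign(pred_train %*% alpha)                   # sign(sum alpha_m T_m(x))
mean(committee == y)                                      # training accuracy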
43. AdaBoost
Hands-on Exercise
• Navigate to directory: example.2.EllipticalBoundary
• Set working directory: use setwd() or with the GUI
• Load and run
– fitModel_Adaboost_by_hand.R
• After class, load and run fitModel_Adaboost.R and fitModel_RandomForest.R

[figure: 2-D training data with an elliptical class boundary in the (x1, x2) plane]
44. Stochastic Gradient Boosting (Friedman, 2001)
• Boosting with any differentiable loss criterion
• In terms of the generic procedure:
– General L(y, ŷ)
– υ = 0.1 ⇒ sequential sampling
– η = N/2
– T_m(x) ⇒ any “weak” learner
– c_0 = arg min_c Σ_{i=1}^N L(y_i, c)
– {c_m}_1^M ⇒ “shrunk” sequential partial regression coefficients

F_0(x) = c_0
For m = 1 to M {
(c_m, p_m) = arg min_{c,p} Σ_{i∈S_m(η)} L(y_i, F_{m−1}(x_i) + c ⋅ T(x_i; p))
T_m(x) = T(x; p_m)
F_m(x) = F_{m−1}(x) + υ ⋅ c_m ⋅ T_m(x)
}
write {(υ ⋅ c_m), T_m(x)}_1^M

– Potential improvements?
– R package: gbm
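In R, the gbm package implements this procedure; a minimal sketch (the data and parameter values below are assumptions for illustration):

# Minimal sketch (assumed example): stochastic gradient boosting with gbm.
library(gbm)

set.seed(1)
N <- 1000
x1 <- runif(N); x2 <- runif(N)
y  <- sin(2 * pi * x1) + x2 + rnorm(N, sd = 0.3)
dat <- data.frame(x1, x2, y)

fit <- gbm(y ~ x1 + x2, data = dat,
           distribution = "gaussian",
           n.trees = 1000,
           shrinkage = 0.1,          # upsilon: the "memory" / learning rate
           bag.fraction = 0.5,       # eta = N/2: stochastic sub-sampling
           interaction.depth = 2)    # small "weak" trees

best <- gbm.perf(fit, method = "OOB")        # choose M by out-of-bag estimate
pred <- predict(fit, dat, n.trees = best)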
45. Stochastic Gradient Boosting
LAD Regression – L(y, ŷ) = |y − ŷ|

• More robust than (y − F)² – resistant to outliers in y
…trees already provide resistance to outliers in x

F_0(x) = median{y_i}_1^N
For m = 1 to M {
// Step 1: find T_m(x)
ỹ_i = sign(y_i − F_{m−1}(x_i))
{R_jm}_1^J = terminal regions of a J-node LS regression tree fit to {ỹ_i, x_i}_1^N
// Step 2: find the coefficients
γ̂_jm = median_{x_i ∈ R_jm} {y_i − F_{m−1}(x_i)},  j = 1…J
// Update the expansion
F_m(x) = F_{m−1}(x) + υ ⋅ Σ_{j=1}^J γ̂_jm I(x ∈ R_jm)
}

• Note:
– Trees are fitted to the pseudo-response ⇒ can’t interpret individual trees
– A “shrunk” version of the tree gets added to the ensemble
– The original tree constants are overwritten
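With gbm this is just a change of loss; a minimal sketch (assumed example) using distribution = "laplace" and a few injected y-outliers:

# Minimal sketch (assumed example): LAD (absolute-loss) boosting with gbm.
library(gbm)

set.seed(1)
N <- 500
x <- runif(N)
y <- sin(2 * pi * x) + rnorm(N, sd = 0.2)
out <- sample(N, 10); y[out] <- y[out] + 10   # outliers in y

fit <- gbm(y ~ x, data = data.frame(x, y),
           distribution = "laplace",          # L(y, y_hat) = |y - y_hat|
           n.trees = 500, shrinkage = 0.1, bag.fraction = 0.5)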
46. Parallel vs. Sequential Ensembles
100 Target Functions Comparison (Popescu, 2005)
• xxx_6_5%_P: 6-terminal-node trees, 5% samples without replacement, post-processing – i.e., using estimated “optimal” quadrature coefficients
• Seq_υ_η%_P: “sequential” ensemble, 6-terminal-node trees, υ = “memory” factor, η% samples without replacement, post-processing

[figure: comparative RMS error – “parallel” ensembles (Bag, RF, Bag_6_5%_P, RF_6_5%_P) vs. “sequential” ensembles (Boost, Seq_0.01_20%_P, Seq_0.1_50%_P)]

• Sequential ISLEs tend to perform better than parallel ones
– Consistent with results observed in classical Monte Carlo integration
47. Rule Ensembles (Friedman & Popescu, 2005)
• Trees as a collection of conjunctive rules: T̂_m(x) = Σ_{j=1}^J ĉ_jm I(x ∈ R_jm)

[figure: a tree partitioning the (x1, x2) plane into terminal regions R1–R5, with splits at x1 = 15, 22 and x2 = 15, 27]

R1 ⇒ r1(x) = I(x1 > 22) ⋅ I(x2 > 27)
R2 ⇒ r2(x) = I(x1 > 22) ⋅ I(0 ≤ x2 ≤ 27)
R3 ⇒ r3(x) = I(15 < x1 ≤ 22) ⋅ I(0 ≤ x2)
R4 ⇒ r4(x) = I(0 ≤ x1 ≤ 15) ⋅ I(x2 > 15)
R5 ⇒ r5(x) = I(0 ≤ x1 ≤ 15) ⋅ I(0 ≤ x2 ≤ 15)

– These simple rules, r_m(x) ∈ {0, 1}, can be used as base learners
– The main motivation is interpretability
48. Rule Ensembles
ISLE Procedure
• Rule-based model: F(x) = a_0 + Σ_m a_m r_m(x)
– Still a piecewise-constant model ⇒ complement the non-linear rules with purely linear terms:
F(x) = a_0 + Σ_{k=1}^K a_k r_k(x) + Σ_{j=1}^n b_j x_j
• Fitting
– Step 1: derive rules from a tree ensemble (shortcut)
• Tree size controls rule “complexity” (interaction order)
– Step 2: fit coefficients using a regularized linear procedure:
({â_k}, {b̂_j}) = arg min_{{a_k},{b_j}} Σ_{i=1}^N L(y_i, F(x_i; {a_k}, {b_j})) + λ ⋅ [P(a) + P(b)]
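A rule ensemble can be sketched "by hand" in R (an assumed example; the tutorial's fitModel_RE.R presumably differs): derive 0/1 rule indicators from the terminal regions of small trees, then fit them, together with the purely linear terms, by the lasso:

# Minimal sketch (assumed example): a rule ensemble "by hand".
library(rpart)
library(glmnet)

set.seed(1)
N <- 500
x1 <- runif(N); x2 <- runif(N)
y  <- 2 * (x1 > 0.5 & x2 > 0.3) + x2 + rnorm(N, sd = 0.2)
dat <- data.frame(x1, x2, y)

# Step 1: derive rules (terminal-region indicators) from a small tree ensemble;
# identifying leaves by their fitted constant is a simplification for this sketch.
rules <- lapply(1:20, function(m) {
  S <- sample(N, N / 2)
  tree <- rpart(y ~ x1 + x2, data = dat[S, ],
                control = rpart.control(maxdepth = 2))  # small => low interaction order
  leaf <- factor(predict(tree, dat))
  model.matrix(~ leaf - 1)                              # one 0/1 column per region
})
R <- do.call(cbind, rules)

# Step 2: regularized fit over the rules plus the purely linear terms
X  <- cbind(R, x1 = dat$x1, x2 = dat$x2)
cv <- cv.glmnet(X, dat$y, alpha = 1)   # lasso keeps a few interpretable rules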
49. Boosting & Rule Ensembles
Hands-on Exercise
• Navigate to directory: example.3.Diamonds
• Set working directory: use setwd() or with the GUI
• Load and run
– viewDiamondData.R
– fitModel_GBM.R
– fitModel_RE.R
• After class, go to example.1.LinearBoundary and run fitModel_GBM.R

[figure: absolute loss vs. iteration (0–1000) for the boosted model]
50. Overview
• Motivation, In a Nutshell & Timeline
• Predictive Learning & Decision Trees
• Ensemble Methods
Ø Summary
51. Summary
• Ensemble methods have been found to perform extremely
well in a variety of problem domains
• Shown to have desirable statistical properties
• Latest ensemble research brings together important
foundational strands of statistics
• Emphasis on accuracy but significant progress has been
made on interpretability
Go build Ensembles and keep in touch!