SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Ensemble Learners
• Classification Trees (Done)
• Bagging: Averaging Trees
• Random Forests: Cleverer Averaging of Trees
• Boosting: Cleverest Averaging of Trees
• Details in chapters 10, 15 and 16.

Methods for dramatically improving the performance of weak
learners such as Trees. Classification trees are adaptive and robust,
but do not make good prediction models. The techniques discussed
here enhance their performance considerably.

227
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Properties of Trees
• ✔ Can handle huge datasets
• ✔ Can handle mixed predictors—quantitative and qualitative
• ✔ Easily ignore redundant variables
• ✔ Handle missing data elegantly through surrogate splits
• ✔ Small trees are easy to interpret
• ✘ Large trees are hard to interpret
• ✘ Prediction performance is often poor

228
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

[Figure: pruned classification tree for the SPAM data. The root node (email, 600/1536) splits on ch$ < 0.0555; subsequent splits use remove, hp, ch!, george, free, CAPMAX, business, receive, edu, our, CAPAVE and 1999, and each terminal node is labeled spam or email together with its misclassification count.]
229
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

ROC curve for pruned tree on SPAM data

[Figure: ROC curve (Sensitivity versus Specificity) for the pruned tree on the SPAM data. TREE − Error: 8.7%.]
Overall error rate on test data: 8.7%.
ROC curve obtained by varying the threshold c0 of the classifier: C(X) = +1 if Pr(+1|X) > c0.
Sensitivity: proportion of true spam identified.
Specificity: proportion of true email identified.

We may want specificity to be high, and suffer some spam:
Specificity: 95% =⇒ Sensitivity: 79%
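To make the threshold sweep concrete, here is a minimal R sketch (my own illustration, not code from the slides): it takes a vector `prob` of fitted probabilities Pr(+1|x) and true labels `y` in {−1, +1} (both hypothetical placeholders) and traces out sensitivity and specificity as c0 varies.

```r
# prob: fitted Pr(+1 | x) on the test set;  y: true labels, +1 = spam, -1 = email
roc_points <- function(prob, y, grid = seq(0, 1, by = 0.01)) {
  t(sapply(grid, function(c0) {
    pred <- ifelse(prob > c0, 1, -1)                 # C(X) = +1 if Pr(+1|X) > c0
    c(threshold   = c0,
      sensitivity = mean(pred[y ==  1] ==  1),       # proportion of true spam identified
      specificity = mean(pred[y == -1] == -1))       # proportion of true email identified
  }))
}
# pts <- roc_points(prob, y)
# plot(pts[, "specificity"], pts[, "sensitivity"], type = "l")
```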

230
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Toy Classification Problem
Bayes Error Rate: 0.25
[Figure: scatter plot of the training data in the (X1, X2) plane, with the points labeled by class (0 or 1) and the Bayes decision boundary drawn in black.]

• Data X and Y, with Y taking values +1 or −1.
• Here X ∈ IR^2.
• The black boundary is the Bayes Decision Boundary: the best one can do. Here its error is 25%.
• Goal: given N training pairs (Xi, Yi), produce a classifier Ĉ(X) ∈ {−1, 1}.
• Also estimate the probability of the class labels Pr(Y = +1|X).
231
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Toy Example — No Noise
Bayes Error Rate: 0
[Figure: scatter plot of the training data in the (X1, X2) plane; class 0 occupies an inner disc and class 1 the surrounding region, so the two classes do not overlap.]

• Deterministic problem; noise comes from the sampling distribution of X.
• Use a training sample of size 200.
• Here the Bayes Error is 0%.
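The slides do not spell out the generating mechanism, so the sketch below is an assumption in the spirit of the nested-spheres example of ESL chapter 10: X is standard Gaussian and the class boundary is the sphere whose squared radius is the median of the corresponding chi-squared distribution, so the classes are equally likely and separable. The function name make_spheres and the sample sizes are mine.

```r
# Nested-spheres data (a sketch): X ~ N(0, I_p); label by whether the squared
# radius exceeds the median of a chi-squared_p variable, so the Bayes error is 0.
make_spheres <- function(n, p) {
  X <- matrix(rnorm(n * p), n, p)
  y <- ifelse(rowSums(X^2) > qchisq(0.5, df = p), 1, -1)
  data.frame(X, y = y)
}
set.seed(1)
train2d  <- make_spheres(200, 2)     # the 2-dimensional toy example above
train10d <- make_spheres(2000, 10)   # the 10-dimensional version used later
```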

232
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Classification Tree
[Figure: classification tree grown on the 200 training points. The root node (94/200) splits at x.2 < −1.067; further splits on x.1 and x.2 lead to terminal nodes labeled 0 or 1, each shown with its misclassification count.]
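A minimal sketch of fitting such a tree with the rpart package (my illustration; the slides' exact splits will not be reproduced). It uses the simulated `train2d` data frame from the earlier sketch.

```r
library(rpart)

d   <- transform(train2d, y = factor(y))       # class labels as a factor
fit <- rpart(y ~ ., data = d, method = "class")
print(fit)                                     # the splits and node counts, as in the figure
plotcp(fit)                                    # cross-validated error vs. tree size, for pruning
mean(predict(fit, d, type = "class") != d$y)   # training misclassification rate
```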

233
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

234

Decision Boundary: Tree
Error Rate: 0.073

[Figure: decision boundary of the classification tree in the (X1, X2) plane, overlaid on the training data; the boundary is made up of axis-parallel segments.]

• The eight-node tree is boxy, and has a test-error rate of 7.3%.

• When the nested spheres are in 10 dimensions, a classification tree produces a rather noisy and inaccurate rule Ĉ(X), with error rates around 30%.
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Model Averaging
Classification trees can be simple, but often produce noisy (bushy)
or weak (stunted) classifiers.
• Bagging (Breiman, 1996): Fit many large trees to
bootstrap-resampled versions of the training data, and classify
by majority vote.
• Boosting (Freund & Schapire, 1996): Fit many large or small
trees to reweighted versions of the training data. Classify by
weighted majority vote.
• Random Forests (Breiman 1999): Fancier version of bagging.
In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.

235
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Bagging
• Bagging or bootstrap aggregation averages a noisy fitted function refit to many bootstrap samples, to reduce its variance. Bagging is a poor man's Bayes; see pp. 282.

• Suppose C(S, x) is a fitted classifier, such as a tree, fit to our
training data S, producing a predicted class label at input
point x.
• To bag C, we draw bootstrap samples S*1, . . . , S*B, each of size N, with replacement from the training data. Then

Cbag(x) = Majority Vote { C(S*b, x) }_{b=1}^{B}.

• Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple structure in C (e.g. a tree) is lost.
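A minimal sketch of this recipe in R (my own illustration, not course code): grow B unpruned rpart trees on bootstrap samples and classify by majority vote. It uses the simulated `train2d` data frame from the earlier sketch, with y coded −1/+1.

```r
library(rpart)

bag_trees <- function(data, B = 100) {
  lapply(seq_len(B), function(b) {
    boot   <- data[sample(nrow(data), replace = TRUE), ]        # bootstrap sample S*b
    boot$y <- factor(boot$y)
    rpart(y ~ ., data = boot, method = "class",
          control = rpart.control(cp = 0, minsplit = 5, xval = 0))  # a large, unpruned tree
  })
}

predict_bag <- function(trees, newdata) {
  votes <- sapply(trees, function(tr)                           # one column of -1/+1 per tree
    as.numeric(as.character(predict(tr, newdata, type = "class"))))
  ifelse(rowMeans(votes == 1) > 0.5, 1, -1)                     # majority vote
}

trees <- bag_trees(train2d, B = 100)
yhat  <- predict_bag(trees, train2d)
```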

236
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

[Figure: the original tree and eleven trees (b = 1, ..., 11) grown on bootstrap samples of the training data. The bootstrap trees use different split variables and split points (e.g. x.1 < 0.395, x.1 < 0.555, x.2 < 0.205, x.3 < 0.985, x.4 < −1.36), illustrating how variable a single tree is.]
237
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Decision Boundary: Bagging

Error Rate: 0.032

[Figure: decision boundary of the bagged classifier in the (X1, X2) plane, overlaid on the training data; the boundary is much smoother than that of the single tree.]

• Bagging reduces error by variance reduction.

• Bagging averages many trees, and produces smoother decision boundaries.

• Bias is slightly increased because bagged trees are slightly shallower.

238
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Random forests
• A refinement of bagged trees; very popular—see chapter 15.
• At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2 p, where p is the number of features.
• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" or OOB error rate.
• Random forests tries to improve on bagging by "de-correlating" the trees, and reducing the variance.
• Each tree has the same expectation, so increasing the number
of trees does not alter the bias of bagging or random forests.
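A minimal sketch with the randomForest package (my illustration of the points above): mtry is the m just described, and the OOB error estimate comes for free. It uses the simulated `train10d` data frame from the earlier sketch.

```r
library(randomForest)

d  <- transform(train10d, y = factor(y))
p  <- ncol(d) - 1
rf <- randomForest(y ~ ., data = d,
                   ntree = 500,
                   mtry  = floor(sqrt(p)))   # m = sqrt(p) variables tried at each split
rf$err.rate[rf$ntree, "OOB"]                 # out-of-bag error estimate
importance(rf)                               # variable importance
```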

239
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Wisdom of Crowds

[Figure: expected number correct out of 10 as a function of P, the probability of an informed person being correct, comparing the consensus (orange) with an individual (teal).]

• 10 movie categories, 4 choices in each category.
• 50 people; for each question, 15 informed and 35 guessers.
• For informed people, Pr(correct) = p and Pr(incorrect) = (1 − p)/3 for each wrong choice.
• For guessers, all 4 choices are equally likely: Pr = 1/4.

Consensus (orange) improves over individuals (teal) even for moderate p.
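A small simulation sketch of this setup (my illustration; the mechanics are assumed from the bullets above): each question has four choices, informed voters pick the right one with probability p, guessers pick uniformly, and the consensus is the plurality vote.

```r
# Expected number correct out of 10: consensus vs. a single informed voter.
crowd_sim <- function(p, n_q = 10, n_informed = 15, n_guess = 35, reps = 2000) {
  consensus_correct <- replicate(reps, {
    correct <- 0
    for (q in seq_len(n_q)) {
      # choice 1 is the correct answer; choices 2-4 are wrong
      informed <- sample(1:4, n_informed, replace = TRUE,
                         prob = c(p, rep((1 - p) / 3, 3)))
      guessers <- sample(1:4, n_guess, replace = TRUE)
      votes    <- tabulate(c(informed, guessers), nbins = 4)
      if (which.max(votes) == 1) correct <- correct + 1   # ties resolved toward choice 1
    }
    correct
  })
  c(consensus  = mean(consensus_correct),
    individual = n_q * p)                # a single informed voter expects n_q * p correct
}
crowd_sim(p = 0.6)
```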

240
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

241

ROC curve for TREE, SVM and Random Forest on SPAM data

[Figure: ROC curves (Sensitivity versus Specificity) for the three classifiers. Random Forest − Error: 4.75%; SVM − Error: 6.7%; TREE − Error: 8.7%.]

Random Forest dominates both other methods on the SPAM data — 4.75% error. Used 500 trees with default settings for the randomForest package in R.
SLDM III c Hastie & Tibshirani - October 9, 2013

242

Random Forests and Boosting

RF: decorrelation

[Figure: correlation between trees plotted against m, the number of randomly selected splitting variables.]

• Var f̂RF(x) = ρ(x) σ²(x)
• The correlation is because we are growing trees to the same data.
• The correlation decreases as m decreases—m is the number of variables sampled at each split.

Based on a simple experiment with random forest regression models.
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Random Forest Ensemble

[Figure: mean squared error, squared bias and variance of the random forest ensemble as functions of m.]

• The smaller m, the lower the variance of the random forest ensemble.
• Small m is also associated with higher bias, because important variables can be missed by the sampling.

243
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

244

Boosting
• Breakthrough invention of Freund & Schapire (1997).

[Figure: schematic of boosting. A training sample gives C1(x); successive weighted samples give C2(x), C3(x), ..., CM(x).]

• Average many trees, each grown to reweighted versions of the training data.
• Weighting decorrelates the trees, by focussing on regions missed by past trees.
• Final Classifier is a weighted average of classifiers:

C(x) = sign[ Σ_{m=1}^{M} αm Cm(x) ]

Note: Cm(x) ∈ {−1, +1}.
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

245

Boosting vs Bagging

[Figure: test error versus number of terms for Bagging and AdaBoost, both using 100-node trees.]

• 2000 points from Nested Spheres in R^10.
• Bayes error rate is 0%.
• Trees are grown best-first without pruning.
• Leftmost term is a single tree.
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

AdaBoost (Freund & Schapire, 1996)
1. Initialize the observation weights wi = 1/N, i = 1, 2, . . . , N .
2. For m = 1 to M repeat steps (a)–(d):
(a) Fit a classifier Cm (x) to the training data using weights wi .
(b) Compute the weighted error of the newest tree:

errm = Σ_{i=1}^{N} wi I(yi ≠ Cm(xi)) / Σ_{i=1}^{N} wi

(c) Compute αm = log[(1 − errm)/errm].
(d) Update weights for i = 1, . . . , N:

wi ← wi · exp[αm · I(yi ≠ Cm(xi))]

and renormalize the wi to sum to 1.
3. Output C(x) = sign[ Σ_{m=1}^{M} αm Cm(x) ].
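A direct R transcription of these steps using rpart stumps as the classifiers Cm (a sketch I wrote for illustration, not the authors' code). It uses the simulated `train10d` data frame from the earlier sketch, with y coded −1/+1.

```r
library(rpart)

d  <- transform(train10d, y = factor(y))   # response as a factor for rpart
yy <- as.numeric(as.character(d$y))        # numeric copy (-1/+1) for the weight updates
N  <- nrow(d); M <- 200
w  <- rep(1 / N, N)                        # 1. initialize observation weights
alpha  <- numeric(M)
stumps <- vector("list", M)

for (m in 1:M) {                                       # 2. for m = 1, ..., M
  stumps[[m]] <- rpart(y ~ ., data = d, weights = w, method = "class",   # (a) weighted fit
                       control = rpart.control(maxdepth = 1, cp = 0,
                                               minsplit = 2, xval = 0))  # a stump
  pred <- as.numeric(as.character(predict(stumps[[m]], d, type = "class")))
  err  <- sum(w * (pred != yy)) / sum(w)               # (b) weighted error
  alpha[m] <- log((1 - err) / err)                     # (c)
  w <- w * exp(alpha[m] * (pred != yy))                # (d) reweight misclassified points
  w <- w / sum(w)                                      #     and renormalize
}

# 3. weighted majority vote over the M stumps
Fhat  <- rowSums(sapply(1:M, function(m)
  alpha[m] * as.numeric(as.character(predict(stumps[[m]], d, type = "class")))))
C_hat <- sign(Fhat)
```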

246
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Boosting Stumps

[Figure: test error versus boosting iterations for boosted stumps on the nested-spheres problem, with reference lines for a single stump and a 400-node tree.]

• A stump is a two-node tree, after a single split.
• Boosting stumps works remarkably well on the nested-spheres problem.

247
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

248

Training Error

[Figure: training error (green) and test error (red) versus number of terms.]

• Nested spheres in 10 dimensions.
• Bayes error is 0%.
• Boosting drives the training error to zero (green curve).
• Further iterations continue to improve test error in many examples (red curve).
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Noisy Problems

[Figure: train and test error versus number of terms for boosting with stumps; a horizontal line marks the Bayes error.]

• Nested Gaussians in 10 dimensions.
• Bayes error is 25%.
• Boosting with stumps.
• Here the test error does increase, but quite slowly.

249
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Stagewise Additive Modeling
Boosting builds an additive model

F(x) = Σ_{m=1}^{M} βm b(x; γm).

Here b(x; γm) is a tree, and γm parametrizes the splits.
We do things like that in statistics all the time!
• GAMs: F(x) = Σ_j fj(xj)
• Basis expansions: F(x) = Σ_{m=1}^{M} θm hm(x)

Traditionally the parameters fm , θm are fit jointly (i.e. least
squares, maximum likelihood).
With boosting, the parameters (βm , γm ) are fit in a stagewise
fashion. This slows the process down, and overfits less quickly.

250
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Example: Least Squares Boosting
1. Start with function F0 (x) = 0 and residual r = y, m = 0
2. m ← m + 1
3. Fit a CART regression tree to r giving g(x)
4. Set fm (x) ← ǫ · g(x)
5. Set Fm(x) ← Fm−1(x) + fm(x), r ← r − fm(x), and repeat steps 2–5 many times
• g(x) typically shallow tree; e.g. constrained to have k = 3 splits
only (depth).
• Stagewise fitting (no backfitting!). Slows down (over-) fitting
process.
• 0 < ǫ ≤ 1 is shrinkage parameter; slows overfitting even further.
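A sketch of this loop in R with rpart regression trees (my illustration; maxdepth approximates the "k = 3 splits" constraint, and ǫ is set to 0.01). It assumes a data frame `X` containing only predictors and a numeric response vector `y`, both placeholders.

```r
library(rpart)

# Least squares boosting: many small steps of size eps along shallow regression trees.
eps <- 0.01; M <- 500; k <- 3
Fhat <- rep(0, length(y))            # F_0(x) = 0
r    <- y                            # initial residual
fits <- vector("list", M)

for (m in 1:M) {
  g <- rpart(r ~ ., data = data.frame(X, r = r),                       # fit a tree to the residual
             control = rpart.control(maxdepth = k, cp = 0, xval = 0))  # shallow tree
  step <- eps * predict(g, data.frame(X))    # f_m(x) = eps * g(x)
  Fhat <- Fhat + step                        # F_m(x) = F_{m-1}(x) + f_m(x)
  r    <- r - step                           # r <- r - f_m(x)
  fits[[m]] <- g
}
```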

251
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Least Squares Boosting and ǫ
[Figure: prediction error versus number of steps m, comparing a single tree with boosted trees using ǫ = 1 and ǫ = .01.]

• If instead of trees we use linear regression terms, boosting with small ǫ is very similar to the lasso!

252
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Adaboost: Stagewise Modeling
• AdaBoost builds an additive logistic regression model

F(x) = log[ Pr(Y = 1|x) / Pr(Y = −1|x) ] = Σ_{m=1}^{M} αm fm(x)

by stagewise fitting using the loss function

L(y, F(x)) = exp(−y F(x)).

• Instead of fitting trees to residuals, the special form of the exponential loss leads to fitting trees to weighted versions of the original data. See pp. 343.

253
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Why Exponential Loss?

[Figure: loss as a function of the margin y·F, comparing misclassification, exponential, binomial deviance, squared error and support vector losses.]

• Leads to a simple reweighting scheme.
• e^{−yF(x)} is a monotone, smooth upper bound on misclassification loss at x.
• y · F is the margin; a positive margin means correct classification.
• Has the logit transform as population minimizer:

F*(x) = (1/2) log[ Pr(Y = 1|x) / Pr(Y = −1|x) ]

• Other more robust loss functions, like binomial deviance.
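Why the population minimizer takes this form (a standard derivation, added here for completeness): conditioning on x and writing p = Pr(Y = 1|x),

E[ e^{−Y F(x)} | x ] = p e^{−F(x)} + (1 − p) e^{F(x)}.

Setting the derivative with respect to F(x) to zero gives −p e^{−F(x)} + (1 − p) e^{F(x)} = 0, i.e. e^{2F(x)} = p/(1 − p), so

F*(x) = (1/2) log[ p / (1 − p) ] = (1/2) log[ Pr(Y = 1|x) / Pr(Y = −1|x) ].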

254
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

General Boosting Algorithms
The boosting paradigm can be extended to general loss functions
— not only squared error or exponential.
1. Initialize F0 (x) = 0.
2. For m = 1 to M :
(a) Compute

(βm, γm) = argmin_{β,γ} Σ_{i=1}^{N} L(yi, Fm−1(xi) + β b(xi; γ)).

(b) Set Fm (x) = Fm−1 (x) + βm b(x; γm ).
Sometimes we replace step (b) in item 2 by
(b∗ ) Set Fm (x) = Fm−1 (x) + ǫβm b(x; γm )
Here ǫ is a shrinkage factor; in the gbm package in R, the default is
ǫ = 0.001. Shrinkage slows the stagewise model-building even more,
and typically leads to better performance.
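A minimal gbm sketch of this shrunken update (my illustration; the data, formula and number of trees are placeholders, with the simulated `train10d` from the earlier sketch standing in for real data).

```r
library(gbm)

d <- transform(train10d, y = as.integer(y == 1))   # gbm's bernoulli loss expects y in {0, 1}
fit <- gbm(y ~ ., data = d,
           distribution = "bernoulli",    # logistic loss; "gaussian", "adaboost", ... also available
           n.trees = 2000,
           interaction.depth = 5,         # tree size
           shrinkage = 0.001)             # the shrinkage factor eps in step (b*)
best <- gbm.perf(fit, method = "OOB")     # estimate of a good number of trees
```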

255
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Gradient Boosting
• General boosting algorithm that works with a variety of
different loss functions. Models include regression, resistant
regression, K-class classification and risk modeling.
• Gradient Boosting builds additive tree models, for example, for
representing the logits in logistic regression.
• Tree size is a parameter that determines the order of
interaction (next slide).
• Gradient Boosting inherits all the good features of trees
(variable selection, missing data, mixed predictors), and
improves on the weak features, such as prediction performance.
• Gradient Boosting is described in detail in section 10.10, and is implemented in Greg Ridgeway's gbm package in R.

256
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam Data

[Figure: test error versus number of trees for Bagging, Random Forest and Gradient Boosting (5-node trees).]

257
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam Example Results
With 3000 training and 1500 test observations, Gradient Boosting
fits an additive logistic model
f(x) = log[ Pr(spam|x) / Pr(email|x) ]

using trees with J = 6 terminal nodes.
Gradient Boosting achieves a test error of 4.5%, compared to 5.3%
for an additive GAM, 4.75% for Random Forests, and 8.7% for
CART.

258
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam: Variable Importance
[Figure: relative variable importance (scale 0 to 100) for the spam model. The most important variables are !, $, hp, remove, free, CAPAVE, your, CAPMAX, george and CAPTOT; variables such as 3d, addresses, labs and telnet matter least.]

259
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam: Partial Dependence

[Figure: partial dependence of the fitted log-odds on four important predictors: !, remove, edu and hp.]
260
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

California Housing Data

[Figure: test average absolute error versus number of trees for RF (m = 2, m = 6) and GBM (depth = 4, depth = 6).]

261
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Tree Size

[Figure: test error versus number of terms for boosting with stumps, 10-node trees, 100-node trees, and AdaBoost.]

The tree size J determines the interaction order of the model:

η(X) = Σ_j ηj(Xj) + Σ_{jk} ηjk(Xj, Xk) + Σ_{jkl} ηjkl(Xj, Xk, Xl) + · · ·

262
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Stumps win!
Since the true decision boundary is the surface of a sphere, the
function that describes it has the form
f(X) = X1^2 + X2^2 + . . . + Xp^2 − c = 0.

Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions.

[Figure: Coordinate Functions for Additive Logistic Trees: the fitted coordinate functions f1(x1), ..., f10(x10).]

263
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Learning Ensembles
Ensemble Learning consists of two steps:
1. Construction of a dictionary D = {T1 (X), T2 (X), . . . , TM (X)}
of basis elements (weak learners) Tm (X).
2. Fitting a model f(X) = Σ_{m∈D} αm Tm(X).

Simple examples of ensembles
• Linear regression: The ensemble consists of the coordinate
functions Tm (X) = Xm . The fitting is done by least squares.
• Random Forests: The ensemble consists of trees grown to bootstrapped versions of the data, with additional randomization at each split. The fitting simply averages.
• Gradient Boosting: The ensemble is grown in an adaptive
fashion, but then simply averaged at the end.

264
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Learning Ensembles and the Lasso
Proposed by Popescu and Friedman (2004):
Fit the model f(X) = Σ_{m∈D} αm Tm(X) in step (2) using Lasso regularization:

α̂(λ) = argmin_α Σ_{i=1}^{N} L[ yi, α0 + Σ_{m=1}^{M} αm Tm(xi) ] + λ Σ_{m=1}^{M} |αm|.

Advantages:
• Often ensembles are large, involving thousands of small trees.
The post-processing selects a smaller subset of these and
combines them efficiently.
• Often the post-processing improves the process that generated
the ensemble.
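A sketch of this post-processing for a random forest ensemble using glmnet (my illustration of the idea, not the Friedman and Popescu implementation): treat each tree's prediction as a basis function Tm(X) and fit a lasso over the tree outputs. It reuses the simulated `train10d` data from the earlier sketch.

```r
library(randomForest)
library(glmnet)

d  <- transform(train10d, y = factor(y))
rf <- randomForest(y ~ ., data = d, ntree = 500)

# Columns of TX are the individual tree predictions T_m(x) on the training data.
TX <- predict(rf, d, predict.all = TRUE)$individual      # n x ntree matrix of "-1"/"1" labels
TX <- matrix(as.numeric(TX), nrow(TX), ncol(TX))         # convert to numeric -1/+1

cvfit <- cv.glmnet(TX, d$y, family = "binomial", alpha = 1)   # lasso over the tree outputs
coef(cvfit, s = "lambda.min")                                 # selected trees and their weights
```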

265
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Post-Processing Random Forest Ensembles

Spam Data

[Figure: test error versus number of trees for Random Forest, Random Forest (5%, 6) and Gradient Boost (5-node trees).]

266
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Software
• R: free GPL statistical computing environment available from
CRAN, implements the S language. Includes:
– randomForest: implementation of Leo Breiman's algorithms.
– rpart: Terry Therneau’s implementation of classification and
regression trees.
– gbm: Greg Ridgeway’s implementation of Friedman’s
gradient boosting algorithm.
• Salford Systems: Commercial implementation of trees, random
forests and gradient boosting.
• Splus (Insightful): Commercial version of S.
• Weka: GPL software from University of Waikato, New Zealand.
Includes Trees, Random Forests and many other procedures.

267
