SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Ensemble Learners
• Classification Trees (Done)
• Bagging: Averaging Trees
• Random Forests: Cleverer Averaging of Trees
• Boosting: Cleverest Averaging of Trees
• Details in chapters 10, 15 and 16.

Methods for dramatically improving the performance of weak
learners such as Trees. Classification trees are adaptive and robust,
but do not make good prediction models. The techniques discussed
here enhance their performance considerably.

227
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Properties of Trees
• ✔ Can handle huge datasets
• ✔ Can handle mixed predictors—quantitative and qualitative
• ✔ Easily ignore redundant variables
• ✔ Handle missing data elegantly through surrogate splits
• ✔ Small trees are easy to interpret
• ✘ Large trees are hard to interpret
• ✘ Prediction performance is often poor

228
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

[Figure: pruned classification tree for the SPAM data. The root node (email, 600/1536) splits on ch$ < 0.0555; subsequent splits use remove, hp, ch!, george, free, CAPMAX, business, receive, edu, our, CAPAVE and 1999, and each terminal node is labeled spam or email together with its misclassification count.]
229
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

ROC curve for pruned tree on SPAM data

[Figure: ROC curve (Sensitivity versus Specificity) for the pruned tree on the SPAM data. TREE − Error: 8.7%.]
Overall error rate on test data: 8.7%.
ROC curve obtained by varying the threshold c0 of the classifier: C(X) = +1 if Pr(+1|X) > c0.
Sensitivity: proportion of true spam identified.
Specificity: proportion of true email identified.

We may want specificity to be high, and suffer some spam:
Specificity: 95% =⇒ Sensitivity: 79%
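To make the threshold sweep concrete, here is a minimal R sketch (my own illustration, not code from the slides): it takes a vector `prob` of fitted probabilities Pr(+1|x) and true labels `y` in {−1, +1} (both hypothetical placeholders) and traces out sensitivity and specificity as c0 varies.

```r
# prob: fitted Pr(+1 | x) on the test set;  y: true labels, +1 = spam, -1 = email
roc_points <- function(prob, y, grid = seq(0, 1, by = 0.01)) {
  t(sapply(grid, function(c0) {
    pred <- ifelse(prob > c0, 1, -1)                 # C(X) = +1 if Pr(+1|X) > c0
    c(threshold   = c0,
      sensitivity = mean(pred[y ==  1] ==  1),       # proportion of true spam identified
      specificity = mean(pred[y == -1] == -1))       # proportion of true email identified
  }))
}
# pts <- roc_points(prob, y)
# plot(pts[, "specificity"], pts[, "sensitivity"], type = "l")
```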

230
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Toy Classification Problem
Bayes Error Rate: 0.25
[Figure: scatter plot of the training data in the (X1, X2) plane, with the points labeled by class (0 or 1) and the Bayes decision boundary drawn in black.]

• Data X and Y, with Y taking values +1 or −1.
• Here X ∈ IR^2.
• The black boundary is the Bayes Decision Boundary: the best one can do. Here its error is 25%.
• Goal: given N training pairs (Xi, Yi), produce a classifier Ĉ(X) ∈ {−1, 1}.
• Also estimate the probability of the class labels Pr(Y = +1|X).
231
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Toy Example — No Noise
Bayes Error Rate: 0
[Figure: scatter plot of the training data in the (X1, X2) plane; class 0 occupies an inner disc and class 1 the surrounding region, so the two classes do not overlap.]

• Deterministic problem; noise comes from the sampling distribution of X.
• Use a training sample of size 200.
• Here the Bayes Error is 0%.
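The slides do not spell out the generating mechanism, so the sketch below is an assumption in the spirit of the nested-spheres example of ESL chapter 10: X is standard Gaussian and the class boundary is the sphere whose squared radius is the median of the corresponding chi-squared distribution, so the classes are equally likely and separable. The function name make_spheres and the sample sizes are mine.

```r
# Nested-spheres data (a sketch): X ~ N(0, I_p); label by whether the squared
# radius exceeds the median of a chi-squared_p variable, so the Bayes error is 0.
make_spheres <- function(n, p) {
  X <- matrix(rnorm(n * p), n, p)
  y <- ifelse(rowSums(X^2) > qchisq(0.5, df = p), 1, -1)
  data.frame(X, y = y)
}
set.seed(1)
train2d  <- make_spheres(200, 2)     # the 2-dimensional toy example above
train10d <- make_spheres(2000, 10)   # the 10-dimensional version used later
```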

232
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Classification Tree
[Figure: classification tree grown on the 200 training points. The root node (94/200) splits at x.2 < −1.067; further splits on x.1 and x.2 lead to terminal nodes labeled 0 or 1, each shown with its misclassification count.]
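A minimal sketch of fitting such a tree with the rpart package (my illustration; the slides' exact splits will not be reproduced). It uses the simulated `train2d` data frame from the earlier sketch.

```r
library(rpart)

d   <- transform(train2d, y = factor(y))       # class labels as a factor
fit <- rpart(y ~ ., data = d, method = "class")
print(fit)                                     # the splits and node counts, as in the figure
plotcp(fit)                                    # cross-validated error vs. tree size, for pruning
mean(predict(fit, d, type = "class") != d$y)   # training misclassification rate
```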

233
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

234

Decision Boundary: Tree
Error Rate: 0.073

[Figure: decision boundary of the classification tree in the (X1, X2) plane, overlaid on the training data; the boundary is made up of axis-parallel segments.]

• The eight-node tree is boxy, and has a test-error rate of 7.3%.

• When the nested spheres are in 10 dimensions, a classification tree produces a rather noisy and inaccurate rule Ĉ(X), with error rates around 30%.
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Model Averaging
Classification trees can be simple, but often produce noisy (bushy)
or weak (stunted) classifiers.
• Bagging (Breiman, 1996): Fit many large trees to
bootstrap-resampled versions of the training data, and classify
by majority vote.
• Boosting (Freund & Schapire, 1996): Fit many large or small
trees to reweighted versions of the training data. Classify by
weighted majority vote.
• Random Forests (Breiman 1999): Fancier version of bagging.
In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.

235
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Bagging
• Bagging or bootstrap aggregation averages a noisy fitted function refit to many bootstrap samples, to reduce its variance. Bagging is a poor man's Bayes; see pp. 282.

• Suppose C(S, x) is a fitted classifier, such as a tree, fit to our
training data S, producing a predicted class label at input
point x.
• To bag C, we draw bootstrap samples S*1, . . . , S*B, each of size N, with replacement from the training data. Then

Cbag(x) = Majority Vote { C(S*b, x) }_{b=1}^{B}.

• Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple structure in C (e.g. a tree) is lost.
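A minimal sketch of this recipe in R (my own illustration, not course code): grow B unpruned rpart trees on bootstrap samples and classify by majority vote. It uses the simulated `train2d` data frame from the earlier sketch, with y coded −1/+1.

```r
library(rpart)

bag_trees <- function(data, B = 100) {
  lapply(seq_len(B), function(b) {
    boot   <- data[sample(nrow(data), replace = TRUE), ]        # bootstrap sample S*b
    boot$y <- factor(boot$y)
    rpart(y ~ ., data = boot, method = "class",
          control = rpart.control(cp = 0, minsplit = 5, xval = 0))  # a large, unpruned tree
  })
}

predict_bag <- function(trees, newdata) {
  votes <- sapply(trees, function(tr)                           # one column of -1/+1 per tree
    as.numeric(as.character(predict(tr, newdata, type = "class"))))
  ifelse(rowMeans(votes == 1) > 0.5, 1, -1)                     # majority vote
}

trees <- bag_trees(train2d, B = 100)
yhat  <- predict_bag(trees, train2d)
```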

236
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

[Figure: the original tree and eleven trees (b = 1, ..., 11) grown on bootstrap samples of the training data. The bootstrap trees use different split variables and split points (e.g. x.1 < 0.395, x.1 < 0.555, x.2 < 0.205, x.3 < 0.985, x.4 < −1.36), illustrating how variable a single tree is.]
237
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Decision Boundary: Bagging

Error Rate: 0.032

[Figure: decision boundary of the bagged classifier in the (X1, X2) plane, overlaid on the training data; the boundary is much smoother than that of the single tree.]

• Bagging reduces error by variance reduction.

• Bagging averages many trees, and produces smoother decision boundaries.

• Bias is slightly increased because bagged trees are slightly shallower.

238
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Random forests
• A refinement of bagged trees; very popular—see chapter 15.
• At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2 p, where p is the number of features.
• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" or OOB error rate.
• Random forests tries to improve on bagging by "de-correlating" the trees, and reducing the variance.
• Each tree has the same expectation, so increasing the number
of trees does not alter the bias of bagging or random forests.
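A minimal sketch with the randomForest package (my illustration of the points above): mtry is the m just described, and the OOB error estimate comes for free. It uses the simulated `train10d` data frame from the earlier sketch.

```r
library(randomForest)

d  <- transform(train10d, y = factor(y))
p  <- ncol(d) - 1
rf <- randomForest(y ~ ., data = d,
                   ntree = 500,
                   mtry  = floor(sqrt(p)))   # m = sqrt(p) variables tried at each split
rf$err.rate[rf$ntree, "OOB"]                 # out-of-bag error estimate
importance(rf)                               # variable importance
```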

239
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Wisdom of Crowds

[Figure: expected number correct out of 10 as a function of P, the probability of an informed person being correct, comparing the consensus (orange) with an individual (teal).]

• 10 movie categories, 4 choices in each category.
• 50 people; for each question, 15 informed and 35 guessers.
• For informed people, Pr(correct) = p and Pr(incorrect) = (1 − p)/3 for each wrong choice.
• For guessers, all 4 choices are equally likely: Pr = 1/4.

Consensus (orange) improves over individuals (teal) even for moderate p.
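A small simulation sketch of this setup (my illustration; the mechanics are assumed from the bullets above): each question has four choices, informed voters pick the right one with probability p, guessers pick uniformly, and the consensus is the plurality vote.

```r
# Expected number correct out of 10: consensus vs. a single informed voter.
crowd_sim <- function(p, n_q = 10, n_informed = 15, n_guess = 35, reps = 2000) {
  consensus_correct <- replicate(reps, {
    correct <- 0
    for (q in seq_len(n_q)) {
      # choice 1 is the correct answer; choices 2-4 are wrong
      informed <- sample(1:4, n_informed, replace = TRUE,
                         prob = c(p, rep((1 - p) / 3, 3)))
      guessers <- sample(1:4, n_guess, replace = TRUE)
      votes    <- tabulate(c(informed, guessers), nbins = 4)
      if (which.max(votes) == 1) correct <- correct + 1   # ties resolved toward choice 1
    }
    correct
  })
  c(consensus  = mean(consensus_correct),
    individual = n_q * p)                # a single informed voter expects n_q * p correct
}
crowd_sim(p = 0.6)
```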

240
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

241

ROC curve for TREE, SVM and Random Forest on SPAM data

[Figure: ROC curves (Sensitivity versus Specificity) for the three classifiers. Random Forest − Error: 4.75%; SVM − Error: 6.7%; TREE − Error: 8.7%.]

Random Forest dominates both other methods on the SPAM data — 4.75% error. Used 500 trees with default settings for the randomForest package in R.
SLDM III c Hastie & Tibshirani - October 9, 2013

242

Random Forests and Boosting

RF: decorrelation

[Figure: correlation between trees plotted against m, the number of randomly selected splitting variables.]

• Var f̂RF(x) = ρ(x) σ²(x)
• The correlation is because we are growing trees to the same data.
• The correlation decreases as m decreases—m is the number of variables sampled at each split.

Based on a simple experiment with random forest regression models.
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Random Forest Ensemble

[Figure: mean squared error, squared bias and variance of the random forest ensemble as functions of m.]

• The smaller m, the lower the variance of the random forest ensemble.
• Small m is also associated with higher bias, because important variables can be missed by the sampling.

243
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

244

Boosting
• Breakthrough invention of Freund & Schapire (1997).

[Figure: schematic of boosting. A training sample gives C1(x); successive weighted samples give C2(x), C3(x), ..., CM(x).]

• Average many trees, each grown to reweighted versions of the training data.
• Weighting decorrelates the trees, by focussing on regions missed by past trees.
• Final Classifier is a weighted average of classifiers:

C(x) = sign[ Σ_{m=1}^{M} αm Cm(x) ]

Note: Cm(x) ∈ {−1, +1}.
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

245

Boosting vs Bagging

[Figure: test error versus number of terms for Bagging and AdaBoost, both using 100-node trees.]

• 2000 points from Nested Spheres in R^10.
• Bayes error rate is 0%.
• Trees are grown best-first without pruning.
• Leftmost term is a single tree.
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

AdaBoost (Freund & Schapire, 1996)
1. Initialize the observation weights wi = 1/N, i = 1, 2, . . . , N .
2. For m = 1 to M repeat steps (a)–(d):
(a) Fit a classifier Cm (x) to the training data using weights wi .
(b) Compute the weighted error of the newest tree:

errm = Σ_{i=1}^{N} wi I(yi ≠ Cm(xi)) / Σ_{i=1}^{N} wi

(c) Compute αm = log[(1 − errm)/errm].
(d) Update weights for i = 1, . . . , N:

wi ← wi · exp[αm · I(yi ≠ Cm(xi))]

and renormalize the wi to sum to 1.
3. Output C(x) = sign[ Σ_{m=1}^{M} αm Cm(x) ].
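A direct R transcription of these steps using rpart stumps as the classifiers Cm (a sketch I wrote for illustration, not the authors' code). It uses the simulated `train10d` data frame from the earlier sketch, with y coded −1/+1.

```r
library(rpart)

d  <- transform(train10d, y = factor(y))   # response as a factor for rpart
yy <- as.numeric(as.character(d$y))        # numeric copy (-1/+1) for the weight updates
N  <- nrow(d); M <- 200
w  <- rep(1 / N, N)                        # 1. initialize observation weights
alpha  <- numeric(M)
stumps <- vector("list", M)

for (m in 1:M) {                                       # 2. for m = 1, ..., M
  stumps[[m]] <- rpart(y ~ ., data = d, weights = w, method = "class",   # (a) weighted fit
                       control = rpart.control(maxdepth = 1, cp = 0,
                                               minsplit = 2, xval = 0))  # a stump
  pred <- as.numeric(as.character(predict(stumps[[m]], d, type = "class")))
  err  <- sum(w * (pred != yy)) / sum(w)               # (b) weighted error
  alpha[m] <- log((1 - err) / err)                     # (c)
  w <- w * exp(alpha[m] * (pred != yy))                # (d) reweight misclassified points
  w <- w / sum(w)                                      #     and renormalize
}

# 3. weighted majority vote over the M stumps
Fhat  <- rowSums(sapply(1:M, function(m)
  alpha[m] * as.numeric(as.character(predict(stumps[[m]], d, type = "class")))))
C_hat <- sign(Fhat)
```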

246
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Boosting Stumps

[Figure: test error versus boosting iterations for boosted stumps on the nested-spheres problem, with reference lines for a single stump and a 400-node tree.]

• A stump is a two-node tree, after a single split.
• Boosting stumps works remarkably well on the nested-spheres problem.

247
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

248

Training Error

[Figure: training error (green) and test error (red) versus number of terms.]

• Nested spheres in 10 dimensions.
• Bayes error is 0%.
• Boosting drives the training error to zero (green curve).
• Further iterations continue to improve test error in many examples (red curve).
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Noisy Problems

[Figure: train and test error versus number of terms for boosting with stumps; a horizontal line marks the Bayes error.]

• Nested Gaussians in 10 dimensions.
• Bayes error is 25%.
• Boosting with stumps.
• Here the test error does increase, but quite slowly.

249
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Stagewise Additive Modeling
Boosting builds an additive model

F(x) = Σ_{m=1}^{M} βm b(x; γm).

Here b(x; γm) is a tree, and γm parametrizes the splits.
We do things like that in statistics all the time!
• GAMs: F(x) = Σ_j fj(xj)
• Basis expansions: F(x) = Σ_{m=1}^{M} θm hm(x)

Traditionally the parameters fm , θm are fit jointly (i.e. least
squares, maximum likelihood).
With boosting, the parameters (βm , γm ) are fit in a stagewise
fashion. This slows the process down, and overfits less quickly.

250
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Example: Least Squares Boosting
1. Start with function F0 (x) = 0 and residual r = y, m = 0
2. m ← m + 1
3. Fit a CART regression tree to r giving g(x)
4. Set fm (x) ← ǫ · g(x)
5. Set Fm(x) ← Fm−1(x) + fm(x), r ← r − fm(x), and repeat steps 2–5 many times
• g(x) typically shallow tree; e.g. constrained to have k = 3 splits
only (depth).
• Stagewise fitting (no backfitting!). Slows down (over-) fitting
process.
• 0 < ǫ ≤ 1 is shrinkage parameter; slows overfitting even further.
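A sketch of this loop in R with rpart regression trees (my illustration; maxdepth approximates the "k = 3 splits" constraint, and ǫ is set to 0.01). It assumes a data frame `X` containing only predictors and a numeric response vector `y`, both placeholders.

```r
library(rpart)

# Least squares boosting: many small steps of size eps along shallow regression trees.
eps <- 0.01; M <- 500; k <- 3
Fhat <- rep(0, length(y))            # F_0(x) = 0
r    <- y                            # initial residual
fits <- vector("list", M)

for (m in 1:M) {
  g <- rpart(r ~ ., data = data.frame(X, r = r),                       # fit a tree to the residual
             control = rpart.control(maxdepth = k, cp = 0, xval = 0))  # shallow tree
  step <- eps * predict(g, data.frame(X))    # f_m(x) = eps * g(x)
  Fhat <- Fhat + step                        # F_m(x) = F_{m-1}(x) + f_m(x)
  r    <- r - step                           # r <- r - f_m(x)
  fits[[m]] <- g
}
```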

251
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Least Squares Boosting and ǫ
[Figure: prediction error versus number of steps m, comparing a single tree with boosted trees using ǫ = 1 and ǫ = .01.]

• If instead of trees we use linear regression terms, boosting with small ǫ is very similar to the lasso!

252
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Adaboost: Stagewise Modeling
• AdaBoost builds an additive logistic regression model

F(x) = log[ Pr(Y = 1|x) / Pr(Y = −1|x) ] = Σ_{m=1}^{M} αm fm(x)

by stagewise fitting using the loss function

L(y, F(x)) = exp(−y F(x)).

• Instead of fitting trees to residuals, the special form of the exponential loss leads to fitting trees to weighted versions of the original data. See pp. 343.

253
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Why Exponential Loss?

[Figure: loss as a function of the margin y·F, comparing misclassification, exponential, binomial deviance, squared error and support vector losses.]

• Leads to a simple reweighting scheme.
• e^{−yF(x)} is a monotone, smooth upper bound on misclassification loss at x.
• y · F is the margin; a positive margin means correct classification.
• Has the logit transform as population minimizer:

F*(x) = (1/2) log[ Pr(Y = 1|x) / Pr(Y = −1|x) ]

• Other more robust loss functions, like binomial deviance.
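Why the population minimizer takes this form (a standard derivation, added here for completeness): conditioning on x and writing p = Pr(Y = 1|x),

E[ e^{−Y F(x)} | x ] = p e^{−F(x)} + (1 − p) e^{F(x)}.

Setting the derivative with respect to F(x) to zero gives −p e^{−F(x)} + (1 − p) e^{F(x)} = 0, i.e. e^{2F(x)} = p/(1 − p), so

F*(x) = (1/2) log[ p / (1 − p) ] = (1/2) log[ Pr(Y = 1|x) / Pr(Y = −1|x) ].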

254
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

General Boosting Algorithms
The boosting paradigm can be extended to general loss functions
— not only squared error or exponential.
1. Initialize F0 (x) = 0.
2. For m = 1 to M :
(a) Compute

(βm, γm) = argmin_{β,γ} Σ_{i=1}^{N} L(yi, Fm−1(xi) + β b(xi; γ)).

(b) Set Fm (x) = Fm−1 (x) + βm b(x; γm ).
Sometimes we replace step (b) in item 2 by
(b∗ ) Set Fm (x) = Fm−1 (x) + ǫβm b(x; γm )
Here ǫ is a shrinkage factor; in the gbm package in R, the default is
ǫ = 0.001. Shrinkage slows the stagewise model-building even more,
and typically leads to better performance.
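A minimal gbm sketch of this shrunken update (my illustration; the data, formula and number of trees are placeholders, with the simulated `train10d` from the earlier sketch standing in for real data).

```r
library(gbm)

d <- transform(train10d, y = as.integer(y == 1))   # gbm's bernoulli loss expects y in {0, 1}
fit <- gbm(y ~ ., data = d,
           distribution = "bernoulli",    # logistic loss; "gaussian", "adaboost", ... also available
           n.trees = 2000,
           interaction.depth = 5,         # tree size
           shrinkage = 0.001)             # the shrinkage factor eps in step (b*)
best <- gbm.perf(fit, method = "OOB")     # estimate of a good number of trees
```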

255
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Gradient Boosting
• General boosting algorithm that works with a variety of
different loss functions. Models include regression, resistant
regression, K-class classification and risk modeling.
• Gradient Boosting builds additive tree models, for example, for
representing the logits in logistic regression.
• Tree size is a parameter that determines the order of
interaction (next slide).
• Gradient Boosting inherits all the good features of trees
(variable selection, missing data, mixed predictors), and
improves on the weak features, such as prediction performance.
• Gradient Boosting is described in detail in section 10.10, and is implemented in Greg Ridgeway's gbm package in R.

256
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam Data

[Figure: test error versus number of trees for Bagging, Random Forest and Gradient Boosting (5-node trees).]

257
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam Example Results
With 3000 training and 1500 test observations, Gradient Boosting
fits an additive logistic model
f(x) = log[ Pr(spam|x) / Pr(email|x) ]

using trees with J = 6 terminal nodes.
Gradient Boosting achieves a test error of 4.5%, compared to 5.3%
for an additive GAM, 4.75% for Random Forests, and 8.7% for
CART.

258
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam: Variable Importance
[Figure: relative variable importance (scale 0 to 100) for the spam model. The most important variables are !, $, hp, remove, free, CAPAVE, your, CAPMAX, george and CAPTOT; variables such as 3d, addresses, labs and telnet matter least.]

259
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam: Partial Dependence

[Figure: partial dependence of the fitted log-odds on four important predictors: !, remove, edu and hp.]
260
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

California Housing Data

[Figure: test average absolute error versus number of trees for RF (m = 2, m = 6) and GBM (depth = 4, depth = 6).]

261
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Tree Size

[Figure: test error versus number of terms for boosting with stumps, 10-node trees, 100-node trees, and AdaBoost.]

The tree size J determines the interaction order of the model:

η(X) = Σ_j ηj(Xj) + Σ_{jk} ηjk(Xj, Xk) + Σ_{jkl} ηjkl(Xj, Xk, Xl) + · · ·

262
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Stumps win!
Since the true decision boundary is the surface of a sphere, the
function that describes it has the form
f(X) = X1^2 + X2^2 + . . . + Xp^2 − c = 0.

Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions.

[Figure: Coordinate Functions for Additive Logistic Trees: the fitted coordinate functions f1(x1), ..., f10(x10).]

263
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Learning Ensembles
Ensemble Learning consists of two steps:
1. Construction of a dictionary D = {T1 (X), T2 (X), . . . , TM (X)}
of basis elements (weak learners) Tm (X).
2. Fitting a model f(X) = Σ_{m∈D} αm Tm(X).

Simple examples of ensembles
• Linear regression: The ensemble consists of the coordinate
functions Tm (X) = Xm . The fitting is done by least squares.
• Random Forests: The ensemble consists of trees grown to bootstrapped versions of the data, with additional randomization at each split. The fitting simply averages.
• Gradient Boosting: The ensemble is grown in an adaptive
fashion, but then simply averaged at the end.

264
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Learning Ensembles and the Lasso
Proposed by Popescu and Friedman (2004):
Fit the model f(X) = Σ_{m∈D} αm Tm(X) in step (2) using Lasso regularization:

α̂(λ) = argmin_α Σ_{i=1}^{N} L[ yi, α0 + Σ_{m=1}^{M} αm Tm(xi) ] + λ Σ_{m=1}^{M} |αm|.

Advantages:
• Often ensembles are large, involving thousands of small trees.
The post-processing selects a smaller subset of these and
combines them efficiently.
• Often the post-processing improves the process that generated
the ensemble.
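A sketch of this post-processing for a random forest ensemble using glmnet (my illustration of the idea, not the Friedman and Popescu implementation): treat each tree's prediction as a basis function Tm(X) and fit a lasso over the tree outputs. It reuses the simulated `train10d` data from the earlier sketch.

```r
library(randomForest)
library(glmnet)

d  <- transform(train10d, y = factor(y))
rf <- randomForest(y ~ ., data = d, ntree = 500)

# Columns of TX are the individual tree predictions T_m(x) on the training data.
TX <- predict(rf, d, predict.all = TRUE)$individual      # n x ntree matrix of "-1"/"1" labels
TX <- matrix(as.numeric(TX), nrow(TX), ncol(TX))         # convert to numeric -1/+1

cvfit <- cv.glmnet(TX, d$y, family = "binomial", alpha = 1)   # lasso over the tree outputs
coef(cvfit, s = "lambda.min")                                 # selected trees and their weights
```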

265
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Post-Processing Random Forest Ensembles

Spam Data

[Figure: test error versus number of trees for Random Forest, Random Forest (5%, 6) and Gradient Boost (5-node trees).]

266
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Software
• R: free GPL statistical computing environment available from
CRAN, implements the S language. Includes:
– randomForest: implementation of Leo Breiman's algorithms.
– rpart: Terry Therneau’s implementation of classification and
regression trees.
– gbm: Greg Ridgeway’s implementation of Friedman’s
gradient boosting algorithm.
• Salford Systems: Commercial implementation of trees, random
forests and gradient boosting.
• Splus (Insightful): Commercial version of S.
• Weka: GPL software from University of Waikato, New Zealand.
Includes Trees, Random Forests and many other procedures.

267
