Slides of a report given at the Machine Learning Seminar Series'11 at Kazan (Volga Region) Federal University. See http://cll.niimm.ksu.ru/cms/main/seminars/mlseminar
1. Model Assessment and Selection
Machine Learning Seminar Series'11
Nikita Zhiltsov
Kazan (Volga Region) Federal University, Russia
18 November 2011
2. Outline
1 Bias, Variance and Model Complexity
2 Nature of Prediction Error
3 Error Estimation: Analytical methods
AIC
BIC
SRM Approach
4 Error Estimation: Sample re-use
Cross-validation
Bootstrapping
5 Model Assessment in R
4. Notation
x = (x_1, \ldots, x_D) \in \mathcal{X}, a vector of inputs
t \in \mathcal{T}, a target variable
y(x), a prediction model
L(t, y(x)), the loss function for measuring errors.
Usual choices for regression:
L(t, y(x)) = \begin{cases} (y(x) - t)^2 & \text{squared error} \\ |y(x) - t| & \text{absolute error} \end{cases}
... and for classification:
L(t, y(x)) = \begin{cases} I(y(x) \neq t) & \text{0-1 loss} \\ -2 \log p_t(x) & \text{log-likelihood loss} \end{cases}
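To make these loss choices concrete, a small illustrative R snippet (our addition, not from the original slides); the targets, predictions, and class labels are hypothetical values:

t    <- c(1.0, 2.0, 3.0)            # hypothetical targets
yhat <- c(0.8, 2.5, 2.9)            # hypothetical predictions
(yhat - t)^2                        # squared error
abs(yhat - t)                       # absolute error
t.cl    <- c("a", "b", "a")         # hypothetical class labels
yhat.cl <- c("a", "a", "a")
as.numeric(yhat.cl != t.cl)         # 0-1 loss per observation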
5. Notation (cont.)
err = \frac{1}{N} \sum_{i=1}^{N} L(t_i, y(x_i)), the training error
Err_D = E[L(t, y(x)) \mid D], the test error (prediction error) for a given training set D
Err = E[Err_D] = E[L(t, y(x))], the expected test error
NB
Most methods effectively estimate only Err.
6. Typical behavior of test and training error
Example (figure: training and test error as functions of model complexity)
Training error is not a good estimate of the test error
There is some intermediate model complexity that gives minimum expected test error
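A minimal simulation sketch of this behavior (our addition), with hypothetical data generated as sin(2*pi*x) plus Gaussian noise: training error keeps falling as the polynomial degree grows, while test error eventually rises.

set.seed(1)
n <- 50
x <- runif(n); t <- sin(2*pi*x) + rnorm(n, sd=0.3)
x.new <- runif(1000); t.new <- sin(2*pi*x.new) + rnorm(1000, sd=0.3)
for (d in c(1, 3, 9, 15)) {
  fit <- lm(t ~ poly(x, d))                      # polynomial of degree d
  cat("degree", d,
      " train:", round(mean((t - fitted(fit))^2), 3),
      " test:",  round(mean((t.new - predict(fit, data.frame(x=x.new)))^2), 3),
      "\n")
}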
7. Defining our goals
Model Selection
Estimating the performance of different models in order to choose the best one
Model Assessment
Having chosen a final model, estimating its generalization error on new data
8. Data-rich situation
Training set is used to learn the models
Validation set is used to estimate prediction error for model
selection
Test set is used for assessment of the generalization error of the
chosen model
10. Bias-Variance Decomposition
Let's consider the expected loss E[L] for the regression task:
E[L] = \int_{\mathcal{X}} \int_{\mathbb{R}} L(t, y(x)) \, p(x, t) \, dt \, dx
Under squared error loss, h(x) = E[t \mid x] = \int t \, p(t \mid x) \, dt is the optimal prediction.
Then E[L] can be decomposed into the sum of three parts:
E[L] = \text{bias}^2 + \text{variance} + \text{noise}
where
\text{bias}^2 = \int (E_D[y(x; D)] - h(x))^2 \, p(x) \, dx
\text{variance} = \int E_D[(y(x; D) - E_D[y(x; D)])^2] \, p(x) \, dx
\text{noise} = \iint (h(x) - t)^2 \, p(x, t) \, dx \, dt
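The three terms can be estimated by simulation over repeated training sets D; a sketch (our addition), assuming h(x) = sin(2*pi*x), Gaussian noise, and a cubic polynomial fit:

set.seed(42)
h <- function(x) sin(2*pi*x)
sigma <- 0.3; n <- 25; B <- 200
x.grid <- seq(0.05, 0.95, length.out=50)
preds <- matrix(NA, B, length(x.grid))
for (b in 1:B) {                                   # B independent training sets D
  x <- runif(n); t <- h(x) + rnorm(n, sd=sigma)
  fit <- lm(t ~ poly(x, 3))
  preds[b, ] <- predict(fit, data.frame(x=x.grid))
}
bias2    <- mean((colMeans(preds) - h(x.grid))^2)  # (E_D[y] - h)^2, averaged over x
variance <- mean(apply(preds, 2, var))             # E_D[(y - E_D[y])^2]
noise    <- sigma^2                                # irreducible error
cat("bias^2:", bias2, " variance:", variance, " noise:", noise, "\n")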
11. Bias-Variance Decomposition
Examples
For a linear model y(x, w) = \sum_{j=1}^{p} w_j x_j (with all w_j \neq 0), the in-sample error is:
Err = \frac{1}{N} \sum_{i=1}^{N} (\bar{y}(x_i) - h(x_i))^2 + \frac{p}{N} \sigma^2 + \sigma^2
For a ridge regression model (Tikhonov regularization):
Err = \frac{1}{N} \sum_{i=1}^{N} \left\{ (\hat{y}(x_i) - h(x_i))^2 + (\hat{y}(x_i) - \bar{y}(x_i))^2 \right\} + Var + \sigma^2
where \hat{y}(x_i) is the best-fitting linear approximation to h
13. Bias-variance tradeoff
Example
Regression with squared loss
Classification with 0-1 loss
In the 2nd case, prediction error is no longer the sum of squared bias and variance
⇒ The best choices of tuning parameters may differ substantially in the two settings
15. Analytical methods: AIC, BIC, SRM
They give in-sample error estimates of the general form:
\widehat{Err} = err + \hat{w}
where \hat{w} is an estimate of the average optimism
By using \hat{w}, the methods penalize too complex models
Unlike regularization, they do not impose a specific regularization parameter λ
Each criterion defines its own notion of model complexity involved in the penalizing term
16. Akaike Information Criterion (AIC)
Applicable for linear models
Either log-likelihood loss or squared error loss is used
Given a set of models indexed by a tuning parameter α, denote by d(α) the number of parameters for each model. Then
AIC(α) = err + 2 \, \frac{d(α)}{N} \, \hat{\sigma}^2
where \hat{\sigma}^2 is typically estimated by the mean squared error of a low-bias model
Finally, we choose the model giving the smallest AIC
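A sketch of this criterion (our addition) for a nested family of linear models on simulated data; here \hat{\sigma}^2 comes from the largest (low-bias) model, and d counts the predictors in use:

set.seed(7)
N <- 100; p <- 10
X <- matrix(rnorm(N*p), N, p)
t <- X[,1] + 0.5*X[,2] + rnorm(N)
sigma2.hat <- mean(residuals(lm(t ~ X))^2)   # sigma^2 from the full (low-bias) model
for (d in 1:p) {
  err <- mean(residuals(lm(t ~ X[, 1:d, drop=FALSE]))^2)  # training error of model d
  cat("d =", d, " AIC =", round(err + 2*(d/N)*sigma2.hat, 4), "\n")
}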
17. Akaike Information Criterion (AIC)
Example
Phoneme recognition task (N = 1000)
Input vector is the log-periodogram of the spoken vowel, quantized to 256 uniformly spaced frequencies
Linear logistic regression is used to predict the phoneme class
Here d(α) is the number of basis functions
18. Bayesian Information Criterion (BIC)
BIC, like AIC, is applicable in settings where log-likelihood maximization is involved
BIC = \frac{N}{\hat{\sigma}^2} \left( err + (\log N) \, \frac{d}{N} \, \hat{\sigma}^2 \right)
BIC is proportional to AIC, with the factor 2 replaced by log N
For N > e^2 \approx 7.4, BIC tends to penalize complex models more heavily than AIC
BIC also provides the posterior probability of each model m:
\frac{e^{-\frac{1}{2} BIC_m}}{\sum_{l=1}^{M} e^{-\frac{1}{2} BIC_l}}
BIC is asymptotically consistent as N → ∞
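The posterior model probabilities are direct to compute from raw BIC values; a sketch (our addition, with hypothetical BIC values), shifting by the minimum to avoid numerical underflow in exp():

bic.posterior <- function(bic) {
  w <- exp(-0.5 * (bic - min(bic)))   # the shift cancels in the ratio
  w / sum(w)
}
bic.posterior(c(210.3, 212.1, 215.7))   # hypothetical BICs for M = 3 models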
19. Structural Risk Minimization
The Vapnik-Chervonenkis (VC) theory provides a general measure of model complexity and gives associated bounds on the optimism
Such a complexity measure, the VC dimension, is defined as follows:
The VC dimension of the class of functions {f(x, α)} is the largest number of points that can be shattered by members of {f(x, α)}
E.g., a linear indicator function in p dimensions has VC dimension p + 1; sin(αx) has infinite VC dimension
20. Structural Risk Minimization (cont.)
If we fit N training points using {f(x, α)} having VC dimension h, then with probability at least 1 - η the following bound holds:
Err \leq err + \sqrt{ \frac{h}{N} \left( \ln \frac{2N}{h} + 1 \right) - \frac{\ln \eta}{N} }
The SRM approach fits a nested sequence of models of increasing VC dimension h_1 \leq h_2 \leq \ldots and then chooses the model with the smallest upper bound
The SVM classifier efficiently carries out the SRM approach
Issues
Calculating the VC dimension of a class of functions is difficult
In practice, the upper bound is often very loose
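The bound itself is straightforward to tabulate; a sketch (our addition) showing how it loosens as the VC dimension h grows for a fixed training error:

vc.bound <- function(err, h, N, eta=0.05) {
  # upper bound on Err, per the inequality above
  err + sqrt((h/N) * (log(2*N/h) + 1) - log(eta)/N)
}
for (h in c(5, 20, 100))
  cat("h =", h, " bound =", round(vc.bound(err=0.1, h=h, N=1000), 3), "\n")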
22. Sample re-use: cross-validation, bootstrapping
These methods directly (and quite accurately) estimate the average generalization error
The extra-sample error is evaluated rather than the in-sample one (test input vectors need not coincide with training ones)
They can be used with any loss function, and with nonlinear, adaptive fitting techniques
However, they may underestimate the true error for such fitting methods as trees
23. Cross-validation
Probably the simplest and most widely used method
However, it is time-consuming
The CV procedure looks as follows:
1 Split the data into K roughly equal-sized parts
2 For the k-th part, fit the model y^{-k}(x) to the other K - 1 parts
3 Then the cross-validation estimate of the prediction error is
CV = \frac{1}{N} \sum_{i=1}^{N} L(t_i, y^{-k(i)}(x_i))
The case K = N (leave-one-out cross-validation) is roughly unbiased, but can have high variance
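For comparison with the crossval()-based example later in the deck, a from-scratch K-fold CV sketch (our addition) for lm(), mirroring the formula above; the built-in cars dataset appears only in the usage line:

kfold.cv <- function(formula, data, K=10) {
  folds <- sample(rep(1:K, length.out=nrow(data)))  # random partition into K parts
  target <- model.response(model.frame(formula, data))
  loss <- numeric(nrow(data))
  for (k in 1:K) {
    fit <- lm(formula, data=data[folds != k, ])     # y^{-k}: fit on the other K-1 parts
    idx <- which(folds == k)
    loss[idx] <- (target[idx] - predict(fit, newdata=data[idx, ]))^2
  }
  mean(loss)                                        # CV estimate of prediction error
}
kfold.cv(dist ~ speed, data=cars, K=10)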
24. Cross-validation (cont.)
In practice, 5- or 10-fold cross-validation is recommended
CV tends to overestimate the true prediction error on small datasets
Often the one-standard-error rule is used with CV. Example (figure: CV error vs. subset size p):
We choose the most parsimonious model whose error is no more than one standard error above the error of the best model
A model with p = 9 would be chosen (see the sketch below)
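A sketch of the rule (our addition), assuming vectors of per-model CV means and standard errors indexed by increasing model size p; the numbers are hypothetical, chosen so that p = 9 is selected:

one.se.choice <- function(p, cv.mean, cv.se) {
  best <- which.min(cv.mean)
  threshold <- cv.mean[best] + cv.se[best]
  p[min(which(cv.mean <= threshold))]   # smallest model within one SE of the best
}
p <- 1:15
cv.mean <- c(10, 9, 8, 7.5, 7, 6.6, 6.3, 6.1, 5.95, 5.9, 5.88, 5.87, 5.88, 5.9, 5.93)
cv.se   <- rep(0.08, 15)
one.se.choice(p, cv.mean, cv.se)   # returns 9 with these numbers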
25. Bootstrapping
General method for assessing statistical accuracy
Given a training set, the bootstrapping procedure steps are:
1 Randomly draw datasets with replacement from it; each sample is of the same size as the original one
2 This is done B times, producing B bootstrap datasets
3 Fit the model to each of the bootstrap datasets
4 Examine the prediction error using the original training set as a test set:
\widehat{Err}_{boot} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L(t_i, y^{*b}(x_i))
where C^{-i} is the set of indices of the bootstrap samples that do not contain observation i
To alleviate the upward bias, the .632 estimator is used (see the sketch below):
\widehat{Err}^{(.632)} = 0.368 \, err + 0.632 \, \widehat{Err}_{boot}
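A compact sketch (our addition) of the leave-one-out bootstrap error and the .632 correction for lm() under squared error loss; cars is again used only to show a call:

err.632 <- function(formula, data, B=200) {
  n <- nrow(data)
  target <- model.response(model.frame(formula, data))
  err.bar <- mean(residuals(lm(formula, data=data))^2)  # training error
  loss  <- matrix(NA, B, n)
  inbag <- matrix(FALSE, B, n)
  for (b in 1:B) {
    idx <- sample(n, replace=TRUE)                      # bootstrap sample b
    inbag[b, unique(idx)] <- TRUE
    fit <- lm(formula, data=data[idx, ])
    loss[b, ] <- (target - predict(fit, newdata=data))^2
  }
  # average each point's loss over the bootstraps that do NOT contain it (C^{-i})
  err.boot <- mean(sapply(1:n, function(i) mean(loss[!inbag[, i], i])))
  0.368 * err.bar + 0.632 * err.boot
}
err.632(dist ~ speed, data=cars)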
27. http://r-project.org
Free software environment for statistical computing and graphics
R packages for machine learning and data mining: kernlab, rpart, randomForest, animation, gbm, tm, etc.
R packages for evaluation: bootstrap, boot
RStudio IDE
28. Housing dataset at the UCI Machine Learning repository
http://archive.ics.uci.edu/ml/datasets/Housing
Housing values in suburbs of Boston
506 instances, 13 attributes + 1 numeric class attribute (MEDV)
29. Loading data in R
> housing <- read.table("~/projects/r/housing.data",
+   header=T)
> attach(housing)
30. Cross-validation example in R
Helper function
Creating a function using crossval() from bootstrap package
> eval <- function(fit, k=10){
+   require(bootstrap)
+   # least-squares fit and prediction functions for crossval()
+   theta.fit <- function(x,y){lsfit(x,y)}
+   theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
+   x <- fit$model[,2:ncol(fit$model)]
+   y <- fit$model[,1]
+   results <- crossval(x, y, theta.fit, theta.predict,
+     ngroup=k)
+   squared.error <- sum((y-results$cv.fit)^2)/length(y)
+   cat("Cross-validated squared error =",
+     squared.error, "\n")}
31. Cross-validation example in R
Model assessment
> fit <- lm(MEDV ~ ., data=housing) # A linear model that uses all the attributes
> eval(fit)
Cross-validated squared error = 23.15827
> fit <- lm(MEDV ~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS,
+   data=housing) # Less complex model
> eval(fit)
Cross-validated squared error = 23.24319
> fit <- lm(MEDV ~ RM, data=housing) # Too simple model
> eval(fit)
Cross-validated squared error = 44.38424
32. Bootstrapping example in R
Helper function
Creating a function using boot() function from boot package
> sqer <- function(formula, data, indices){
+   d <- data[indices,]          # bootstrap sample
+   fit <- lm(formula, data=d)
+   return(sum(fit$residuals^2)/length(fit$residuals))
+ }
33. Bootstrapping example in R
Model assessment
> results <- boot(data=housing, statistic=sqer, R=1000,
+   formula=MEDV ~ .) # 1000 bootstrapped datasets
> print(results)
Bootstrap Statistics :
    original      bias    std. error
t1* 21.89483 -0.76001      2.296025
> results <- boot(data=housing, statistic=sqer, R=1000,
+   formula=MEDV ~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS)
> print(results)
Bootstrap Statistics :
    original       bias    std. error
t1* 22.88726 -0.5400892     2.744437
> results <- boot(data=housing, statistic=sqer, R=1000,
+   formula=MEDV ~ RM)
> print(results)
Bootstrap Statistics :
    original       bias    std. error
t1* 43.60055 -0.3379168     5.407933
34. Resources
T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, 2008
Stanford Engineering Everywhere CS229 Machine Learning, Handouts 4 and 5
http://videolectures.net/stanfordcs229f07_machine_learning/