Slides of a report given at the Machine Learning Seminar Series'11 at Kazan (Volga Region) Federal University. See http://cll.niimm.ksu.ru/cms/main/seminars/mlseminar
1. Model Assessment and Selection
Machine Learning Seminar Series'11
Nikita Zhiltsov
Kazan (Volga Region) Federal University, Russia
18 November 2011
2. Outline
1 Bias, Variance and Model Complexity
2 Nature of Prediction Error
3 Error Estimation: Analytical methods
AIC
BIC
SRM Approach
4 Error Estimation: Sample re-use
Cross-validation
Bootstrapping
5 Model Assessment in R
4. Notation
x = (x_1, \ldots, x_D) \in \mathcal{X}, a vector of inputs
t \in \mathcal{T}, a target variable
y(x), a prediction model
L(t, y(x)), the loss function for measuring errors.
Usual choices for regression:
L(t, y(x)) = \begin{cases} (y(x) - t)^2 & \text{squared error} \\ |y(x) - t| & \text{absolute error} \end{cases}
... and for classification:
L(t, y(x)) = \begin{cases} I(y(x) \neq t) & \text{0-1 loss} \\ -2 \log p_t(x) & \text{log-likelihood loss} \end{cases}
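To make these loss choices concrete, a small illustrative R snippet (our addition, not from the original slides); the targets, predictions, and class labels are hypothetical values:

t    <- c(1.0, 2.0, 3.0)            # hypothetical targets
yhat <- c(0.8, 2.5, 2.9)            # hypothetical predictions
(yhat - t)^2                        # squared error
abs(yhat - t)                       # absolute error
t.cl    <- c("a", "b", "a")         # hypothetical class labels
yhat.cl <- c("a", "a", "a")
as.numeric(yhat.cl != t.cl)         # 0-1 loss per observation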
5. Notation (cont.)
err = \frac{1}{N} \sum_{i=1}^{N} L(t_i, y(x_i)), the training error
Err_D = E[L(t, y(x)) \mid D], the test error (prediction error) for a given training set D
Err = E[Err_D] = E[L(t, y(x))], the expected test error
NB
Most methods effectively estimate only Err.
6. Typical behavior of test and training error
Example (figure: training and test error as functions of model complexity)
Training error is not a good estimate of the test error
There is some intermediate model complexity that gives minimum expected test error
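A minimal simulation sketch of this behavior (our addition), with hypothetical data generated as sin(2*pi*x) plus Gaussian noise: training error keeps falling as the polynomial degree grows, while test error eventually rises.

set.seed(1)
n <- 50
x <- runif(n); t <- sin(2*pi*x) + rnorm(n, sd=0.3)
x.new <- runif(1000); t.new <- sin(2*pi*x.new) + rnorm(1000, sd=0.3)
for (d in c(1, 3, 9, 15)) {
  fit <- lm(t ~ poly(x, d))                      # polynomial of degree d
  cat("degree", d,
      " train:", round(mean((t - fitted(fit))^2), 3),
      " test:",  round(mean((t.new - predict(fit, data.frame(x=x.new)))^2), 3),
      "\n")
}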
7. Defining our goals
Model Selection
Estimating the performance of different models in order to choose the best one
Model Assessment
Having chosen a final model, estimating its generalization error on new data
8. Data-rich situation
Training set is used to learn the models
Validation set is used to estimate prediction error for model
selection
Test set is used for assessment of the generalization error of the
chosen model
10. Bias-Variance Decomposition
Let's consider the expected loss E[L] for the regression task:
E[L] = \int_{\mathcal{X}} \int_{\mathbb{R}} L(t, y(x)) \, p(x, t) \, dt \, dx
Under squared error loss, h(x) = E[t \mid x] = \int t \, p(t \mid x) \, dt is the optimal prediction.
Then E[L] can be decomposed into the sum of three parts:
E[L] = \text{bias}^2 + \text{variance} + \text{noise}
where
\text{bias}^2 = \int (E_D[y(x; D)] - h(x))^2 \, p(x) \, dx
\text{variance} = \int E_D[(y(x; D) - E_D[y(x; D)])^2] \, p(x) \, dx
\text{noise} = \iint (h(x) - t)^2 \, p(x, t) \, dx \, dt
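The three terms can be estimated by simulation over repeated training sets D; a sketch (our addition), assuming h(x) = sin(2*pi*x), Gaussian noise, and a cubic polynomial fit:

set.seed(42)
h <- function(x) sin(2*pi*x)
sigma <- 0.3; n <- 25; B <- 200
x.grid <- seq(0.05, 0.95, length.out=50)
preds <- matrix(NA, B, length(x.grid))
for (b in 1:B) {                                   # B independent training sets D
  x <- runif(n); t <- h(x) + rnorm(n, sd=sigma)
  fit <- lm(t ~ poly(x, 3))
  preds[b, ] <- predict(fit, data.frame(x=x.grid))
}
bias2    <- mean((colMeans(preds) - h(x.grid))^2)  # (E_D[y] - h)^2, averaged over x
variance <- mean(apply(preds, 2, var))             # E_D[(y - E_D[y])^2]
noise    <- sigma^2                                # irreducible error
cat("bias^2:", bias2, " variance:", variance, " noise:", noise, "\n")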
11. Bias-Variance Decomposition
Examples
For a linear model y(x, w) = \sum_{j=1}^{p} w_j x_j (with all w_j \neq 0), the in-sample error is:
Err = \frac{1}{N} \sum_{i=1}^{N} (\bar{y}(x_i) - h(x_i))^2 + \frac{p}{N} \sigma^2 + \sigma^2
For a ridge regression model (Tikhonov regularization):
Err = \frac{1}{N} \sum_{i=1}^{N} \left\{ (\hat{y}(x_i) - h(x_i))^2 + (\hat{y}(x_i) - \bar{y}(x_i))^2 \right\} + Var + \sigma^2
where \hat{y}(x_i) is the best-fitting linear approximation to h
13. Bias-variance tradeoff
Example
Regression with squared loss
Classification with 0-1 loss
In the 2nd case, prediction error is no longer the sum of squared bias and variance
⇒ The best choices of tuning parameters may differ substantially in the two settings
15. Analytical methods: AIC, BIC, SRM
They give in-sample error estimates of the general form:
\widehat{Err} = err + \hat{w}
where \hat{w} is an estimate of the average optimism
By using \hat{w}, the methods penalize too complex models
Unlike regularization, they do not impose a specific regularization parameter λ
Each criterion defines its own notion of model complexity involved in the penalizing term
16. Akaike Information Criterion (AIC)
Applicable for linear models
Either log-likelihood loss or squared error loss is used
Given a set of models indexed by a tuning parameter α, denote by d(α) the number of parameters for each model. Then
AIC(α) = err + 2 \, \frac{d(α)}{N} \, \hat{\sigma}^2
where \hat{\sigma}^2 is typically estimated by the mean squared error of a low-bias model
Finally, we choose the model giving the smallest AIC
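A sketch of this criterion (our addition) for a nested family of linear models on simulated data; here \hat{\sigma}^2 comes from the largest (low-bias) model, and d counts the predictors in use:

set.seed(7)
N <- 100; p <- 10
X <- matrix(rnorm(N*p), N, p)
t <- X[,1] + 0.5*X[,2] + rnorm(N)
sigma2.hat <- mean(residuals(lm(t ~ X))^2)   # sigma^2 from the full (low-bias) model
for (d in 1:p) {
  err <- mean(residuals(lm(t ~ X[, 1:d, drop=FALSE]))^2)  # training error of model d
  cat("d =", d, " AIC =", round(err + 2*(d/N)*sigma2.hat, 4), "\n")
}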
17. Akaike Information Criterion (AIC)
Example
Phoneme recognition task (N = 1000)
Input vector is the log-periodogram of the spoken vowel, quantized to 256 uniformly spaced frequencies
Linear logistic regression is used to predict the phoneme class
Here d(α) is the number of basis functions
18. Bayesian Information Criterion (BIC)
BIC, like AIC, is applicable in settings where log-likelihood maximization is involved
BIC = \frac{N}{\hat{\sigma}^2} \left( err + (\log N) \, \frac{d}{N} \, \hat{\sigma}^2 \right)
BIC is proportional to AIC, with the factor 2 replaced by log N
For N > e^2 \approx 7.4, BIC tends to penalize complex models more heavily than AIC
BIC also provides the posterior probability of each model m:
\frac{e^{-\frac{1}{2} BIC_m}}{\sum_{l=1}^{M} e^{-\frac{1}{2} BIC_l}}
BIC is asymptotically consistent as N → ∞
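The posterior model probabilities are direct to compute from raw BIC values; a sketch (our addition, with hypothetical BIC values), shifting by the minimum to avoid numerical underflow in exp():

bic.posterior <- function(bic) {
  w <- exp(-0.5 * (bic - min(bic)))   # the shift cancels in the ratio
  w / sum(w)
}
bic.posterior(c(210.3, 212.1, 215.7))   # hypothetical BICs for M = 3 models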
19. Structural Risk Minimization
The Vapnik-Chervonenkis (VC) theory provides a general measure of model complexity and gives associated bounds on the optimism
Such a complexity measure, the VC dimension, is defined as follows:
The VC dimension of the class of functions {f(x, α)} is the largest number of points that can be shattered by members of {f(x, α)}
E.g., a linear indicator function in p dimensions has VC dimension p + 1; sin(αx) has infinite VC dimension
20. Structural Risk Minimization (cont.)
If we fit N training points using {f(x, α)} having VC dimension h, then with probability at least 1 - η the following bound holds:
Err \leq err + \sqrt{ \frac{h}{N} \left( \ln \frac{2N}{h} + 1 \right) - \frac{\ln \eta}{N} }
The SRM approach fits a nested sequence of models of increasing VC dimension h_1 \leq h_2 \leq \ldots and then chooses the model with the smallest upper bound
The SVM classifier efficiently carries out the SRM approach
Issues
Calculating the VC dimension of a class of functions is difficult
In practice, the upper bound is often very loose
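The bound itself is straightforward to tabulate; a sketch (our addition) showing how it loosens as the VC dimension h grows for a fixed training error:

vc.bound <- function(err, h, N, eta=0.05) {
  # upper bound on Err, per the inequality above
  err + sqrt((h/N) * (log(2*N/h) + 1) - log(eta)/N)
}
for (h in c(5, 20, 100))
  cat("h =", h, " bound =", round(vc.bound(err=0.1, h=h, N=1000), 3), "\n")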
22. Sample re-use: cross-validation, bootstrapping
These methods directly (and quite accurately) estimate the average generalization error
The extra-sample error is evaluated rather than the in-sample one (test input vectors need not coincide with training ones)
They can be used with any loss function, and with nonlinear, adaptive fitting techniques
However, they may underestimate the true error for such fitting methods as trees
23. Cross-validation
Probably the simplest and most widely used method
However, it is time-consuming
The CV procedure looks as follows:
1 Split the data into K roughly equal-sized parts
2 For the k-th part, fit the model y^{-k}(x) to the other K - 1 parts
3 Then the cross-validation estimate of the prediction error is
CV = \frac{1}{N} \sum_{i=1}^{N} L(t_i, y^{-k(i)}(x_i))
The case K = N (leave-one-out cross-validation) is roughly unbiased, but can have high variance
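For comparison with the crossval()-based example later in the deck, a from-scratch K-fold CV sketch (our addition) for lm(), mirroring the formula above; the built-in cars dataset appears only in the usage line:

kfold.cv <- function(formula, data, K=10) {
  folds <- sample(rep(1:K, length.out=nrow(data)))  # random partition into K parts
  target <- model.response(model.frame(formula, data))
  loss <- numeric(nrow(data))
  for (k in 1:K) {
    fit <- lm(formula, data=data[folds != k, ])     # y^{-k}: fit on the other K-1 parts
    idx <- which(folds == k)
    loss[idx] <- (target[idx] - predict(fit, newdata=data[idx, ]))^2
  }
  mean(loss)                                        # CV estimate of prediction error
}
kfold.cv(dist ~ speed, data=cars, K=10)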
24. Cross-validation (cont.)
In practice, 5- or 10-fold cross-validation is recommended
CV tends to overestimate the true prediction error on small datasets
Often the one-standard-error rule is used with CV. Example (figure: CV error vs. subset size p):
We choose the most parsimonious model whose error is no more than one standard error above the error of the best model
A model with p = 9 would be chosen (see the sketch below)
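A sketch of the rule (our addition), assuming vectors of per-model CV means and standard errors indexed by increasing model size p; the numbers are hypothetical, chosen so that p = 9 is selected:

one.se.choice <- function(p, cv.mean, cv.se) {
  best <- which.min(cv.mean)
  threshold <- cv.mean[best] + cv.se[best]
  p[min(which(cv.mean <= threshold))]   # smallest model within one SE of the best
}
p <- 1:15
cv.mean <- c(10, 9, 8, 7.5, 7, 6.6, 6.3, 6.1, 5.95, 5.9, 5.88, 5.87, 5.88, 5.9, 5.93)
cv.se   <- rep(0.08, 15)
one.se.choice(p, cv.mean, cv.se)   # returns 9 with these numbers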
25. Bootstrapping
General method for assessing statistical accuracy
Given a training set, the bootstrapping procedure steps are:
1 Randomly draw datasets with replacement from it; each sample is of the same size as the original one
2 This is done B times, producing B bootstrap datasets
3 Fit the model to each of the bootstrap datasets
4 Examine the prediction error using the original training set as a test set:
\widehat{Err}_{boot} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L(t_i, y^{*b}(x_i))
where C^{-i} is the set of indices of the bootstrap samples that do not contain observation i
To alleviate the upward bias, the .632 estimator is used (see the sketch below):
\widehat{Err}^{(.632)} = 0.368 \, err + 0.632 \, \widehat{Err}_{boot}
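A compact sketch (our addition) of the leave-one-out bootstrap error and the .632 correction for lm() under squared error loss; cars is again used only to show a call:

err.632 <- function(formula, data, B=200) {
  n <- nrow(data)
  target <- model.response(model.frame(formula, data))
  err.bar <- mean(residuals(lm(formula, data=data))^2)  # training error
  loss  <- matrix(NA, B, n)
  inbag <- matrix(FALSE, B, n)
  for (b in 1:B) {
    idx <- sample(n, replace=TRUE)                      # bootstrap sample b
    inbag[b, unique(idx)] <- TRUE
    fit <- lm(formula, data=data[idx, ])
    loss[b, ] <- (target - predict(fit, newdata=data))^2
  }
  # average each point's loss over the bootstraps that do NOT contain it (C^{-i})
  err.boot <- mean(sapply(1:n, function(i) mean(loss[!inbag[, i], i])))
  0.368 * err.bar + 0.632 * err.boot
}
err.632(dist ~ speed, data=cars)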
27. http://r-project.org
Free software environment for statistical computing and graphics
R packages for machine learning and data mining: kernlab, rpart, randomForest, animation, gbm, tm, etc.
R packages for evaluation: bootstrap, boot
RStudio IDE
28. Housing dataset at the UCI Machine Learning repository
http://archive.ics.uci.edu/ml/datasets/Housing
Housing values in suburbs of Boston
506 instances, 13 attributes + 1 numeric class attribute (MEDV)
29. Loading data in R
> housing <- read.table("~/projects/r/housing.data",
+   header=T)
> attach(housing)
30. Cross-validation example in R
Helper function
Creating a function using crossval() from bootstrap package
> eval <- function(fit, k=10){
+   require(bootstrap)
+   # least-squares fit and prediction functions for crossval()
+   theta.fit <- function(x,y){lsfit(x,y)}
+   theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
+   x <- fit$model[,2:ncol(fit$model)]
+   y <- fit$model[,1]
+   results <- crossval(x, y, theta.fit, theta.predict,
+     ngroup=k)
+   squared.error <- sum((y-results$cv.fit)^2)/length(y)
+   cat("Cross-validated squared error =",
+     squared.error, "\n")}
31. Cross-validation example in R
Model assessment
> fit <- lm(MEDV ~ ., data=housing) # A linear model that uses all the attributes
> eval(fit)
Cross-validated squared error = 23.15827
> fit <- lm(MEDV ~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS,
+   data=housing) # Less complex model
> eval(fit)
Cross-validated squared error = 23.24319
> fit <- lm(MEDV ~ RM, data=housing) # Too simple model
> eval(fit)
Cross-validated squared error = 44.38424
32. Bootstrapping example in R
Helper function
Creating a function using boot() function from boot package
> sqer <- function(formula, data, indices){
+   d <- data[indices,]          # bootstrap sample
+   fit <- lm(formula, data=d)
+   return(sum(fit$residuals^2)/length(fit$residuals))
+ }
33. Bootstrapping example in R
Model assessment
> results <- boot(data=housing, statistic=sqer, R=1000,
+   formula=MEDV ~ .) # 1000 bootstrapped datasets
> print(results)
Bootstrap Statistics :
    original      bias    std. error
t1* 21.89483 -0.76001      2.296025
> results <- boot(data=housing, statistic=sqer, R=1000,
+   formula=MEDV ~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS)
> print(results)
Bootstrap Statistics :
    original       bias    std. error
t1* 22.88726 -0.5400892     2.744437
> results <- boot(data=housing, statistic=sqer, R=1000,
+   formula=MEDV ~ RM)
> print(results)
Bootstrap Statistics :
    original       bias    std. error
t1* 43.60055 -0.3379168     5.407933
34. Resources
T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, 2008
Stanford Engineering Everywhere CS229 Machine Learning, Handouts 4 and 5
http://videolectures.net/stanfordcs229f07_machine_learning/