Model Assessment and Selection
      Machine Learning Seminar Series'11




                Nikita Zhiltsov


 Kazan (Volga Region) Federal University, Russia




             18 November 2011




                                                   1 / 34
Outline
1   Bias, Variance and Model Complexity


2   Nature of Prediction Error


3   Error Estimation: Analytical methods
      AIC
      BIC
      SRM Approach


4   Error Estimation: Sample re-use
      Cross-validation
      Bootstrapping


5   Model Assessment in R



                                           2 / 34
Outline
1   Bias, Variance and Model Complexity


2   Nature of Prediction Error


3   Error Estimation: Analytical methods
      AIC
      BIC
      SRM Approach


4   Error Estimation: Sample re-use
      Cross-validation
      Bootstrapping


5   Model Assessment in R



                                           3 / 34
Notation
   x = (x1 , . . . , xD ) ∈ X, a vector of inputs
   t ∈ T, a target variable
   y(x), a prediction model

   L(t, y(x)), the loss function for measuring errors.
   Usual choices for regression:
                            (y(x) − t)²   squared error
        L(t, y(x)) =
                            |y(x) − t|    absolute error
   ... and classification:
                        I(y(x) ≠ t)      0-1 loss
        L(t, y(x)) =
                        −2 log pt (x)    log-likelihood loss
                                                            4 / 34
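The loss choices above can be evaluated directly. A minimal Python sketch (an illustration, not part of the deck's own R code):

```python
def squared_error(t, y):
    """Squared error loss for regression: (y(x) - t)^2."""
    return (y - t) ** 2

def absolute_error(t, y):
    """Absolute error loss for regression: |y(x) - t|."""
    return abs(y - t)

def zero_one_loss(t, y):
    """0-1 loss for classification: 1 iff the predicted class is wrong."""
    return int(y != t)

print(squared_error(3.0, 2.5))   # 0.25
print(absolute_error(3.0, 2.5))  # 0.5
print(zero_one_loss("a", "b"))   # 1
```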
Notation (cont.)

     err = (1/N) Σ_{i=1}^N L(t_i, y(x_i)), the training error


     Err_D = E[L(t, y(x)) | D], the test error (prediction error) for a given
     training set D


     Err = E[Err_D] = E[L(t, y(x))], the expected test error


NB
Most methods effectively estimate only Err.




                                                                             5 / 34
Typical behavior of test and training error
Example




     Training error is not a good estimate of the test error

     There is some intermediate model complexity that gives
     minimum expected test error

                                                               6 / 34
Defining our goals


Model Selection
Estimating the performance of different models in order to choose
the best one




Model Assessment
Having chosen a final model, estimating its generalization error on
new data




                                                                     7 / 34
Data-rich situation




   Training set is used to learn the models

   Validation set is used to estimate prediction error for model
   selection

   Test set is used for assessment of the generalization error of the
   chosen model




                                                                   8 / 34
Outline
1   Bias, Variance and Model Complexity


2   Nature of Prediction Error


3   Error Estimation: Analytical methods
      AIC
      BIC
      SRM Approach


4   Error Estimation: Sample re-use
      Cross-validation
      Bootstrapping


5   Model Assessment in R



                                           9 / 34
Bias-Variance Decomposition
Let's consider the expected loss   E[L]   for the regression task:


                     E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt


Under squared error loss,   h(x) = E[t|x] = ∫ t p(t|x) dt   is the optimal
prediction.
Then   E[L]   can be decomposed into the sum of three parts:


                      E[L] = bias² + variance + noise
where

           bias²     = ∫ (E_D[y(x; D)] − h(x))² p(x) dx
           variance  = ∫ E_D[(y(x; D) − E_D[y(x; D)])²] p(x) dx
           noise     = ∫∫ (h(x) − t)² p(x, t) dx dt

                                                                             10 / 34
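The three terms can be estimated by simulation: draw many training sets D, refit, and average. A minimal Python sketch, assuming for illustration a true function h(x) = sin x, Gaussian noise, and a cubic polynomial model (none of these are from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                        # noise standard deviation; noise term = sigma^2
h = np.sin                         # assumed true regression function h(x) = E[t|x]
x_grid = np.linspace(0, np.pi, 50)

def fit_predict(degree, n=30):
    """Fit a polynomial to one noisy training set D; predict on the grid."""
    x = rng.uniform(0, np.pi, n)
    t = h(x) + rng.normal(0, sigma, n)
    return np.polyval(np.polyfit(x, t, degree), x_grid)

# Average over many training sets D to approximate E_D[y(x; D)]
preds = np.array([fit_predict(degree=3) for _ in range(500)])
avg_pred = preds.mean(axis=0)

bias2 = np.mean((avg_pred - h(x_grid)) ** 2)   # (E_D[y] - h)^2, averaged over x
variance = np.mean(preds.var(axis=0))          # E_D[(y - E_D[y])^2], averaged over x
noise = sigma ** 2                             # irreducible error
print(bias2, variance, noise)
```

For a flexible low-bias model like this cubic, the bias term comes out tiny and the variance term dominates, matching the tradeoff discussed on the following slides.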
Bias-Variance Decomposition
Examples



     For a linear model   y(x, w) = Σ_{j=1}^p w_j x_j , ∀ w_j ≠ 0,
     the in-sample error is:

                 Err = (1/N) Σ_{i=1}^N (ȳ(x_i) − h(x_i))² + (p/N) σ² + σ²

     For a ridge regression model (Tikhonov regularization):

          Err = (1/N) Σ_{i=1}^N {(ŷ(x_i) − h(x_i))² + (ŷ(x_i) − ȳ(x_i))²} + Var + σ²

     where   ŷ(x_i)   is the best-fitting linear approximation to   h

                                                                            11 / 34
Behavior of bias and variance




                                12 / 34
Bias-variance tradeoff
Example




                     Regression with squared loss

                     Classification with 0-1 loss

                     In the 2nd case, prediction error is no
                     longer the sum of squared bias and
                     variance

                 ⇒   The best choices of tuning parameters
                     may differ substantially in the two
                     settings




                                                               13 / 34
Outline
1   Bias, Variance and Model Complexity


2   Nature of Prediction Error


3   Error Estimation: Analytical methods
      AIC
      BIC
      SRM Approach


4   Error Estimation: Sample re-use
      Cross-validation
      Bootstrapping


5   Model Assessment in R



                                           14 / 34
Analytical methods: AIC, BIC, SRM

   They give in-sample estimates of the general form:

                                Êrr = err + ŵ

   where   ŵ   is an estimate of the average optimism

   By using   ŵ,   the methods penalize overly complex models

   Unlike regularization, they do not impose a specific
   regularization parameter   λ

   Each criterion defines its own notion of model complexity, which
   enters the penalty term




                                                                15 / 34
Akaike Information Criterion (AIC)

   Applicable to linear models

   Either log-likelihood loss or squared error loss is used

   Given a set of models indexed by a tuning parameter   α,   denote
   by   d(α)   the number of parameters of each model. Then,


                         AIC(α) = err + 2 (d(α)/N) σ̂²

   where   σ̂²   is typically estimated by the mean squared error of a
   low-bias model

   Finally, we choose the model giving the smallest AIC




                                                                       16 / 34
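A minimal Python sketch of AIC-based selection, with polynomial degree standing in for the tuning parameter α (the synthetic data-generating model and the degree-10 "low-bias" fit used to estimate σ̂² are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.uniform(-1, 1, N)
t = 1 + 2 * x + rng.normal(0, 0.5, N)     # assumed: the true model is linear

def training_err(d):
    """Mean squared training error err of a degree-d polynomial fit."""
    coef = np.polyfit(x, t, d)
    return np.mean((np.polyval(coef, x) - t) ** 2)

# sigma^2 estimated from a deliberately low-bias (high-degree) model
sigma2_hat = training_err(10)

def aic(d):
    # A degree-d polynomial has d(alpha) = d + 1 parameters
    return training_err(d) + 2 * (d + 1) / N * sigma2_hat

best = min(range(11), key=aic)   # choose the model with smallest AIC
print(best)
```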
Akaike Information Criterion (AIC)
Example




                     Phoneme recognition task (N = 1000)

                     Input vector is the log-periodogram of
                     the spoken vowel, quantized to 256
                     uniformly spaced frequencies

                     Linear logistic regression is used to
                     predict the phoneme class

                     Here   d(α)   is the number of basis
                     functions




                                                             17 / 34
Bayesian Information Criterion (BIC)
   BIC, like AIC, is applicable in settings where log-likelihood
   maximization is involved

                      BIC = (N/σ̂²) (err + (log N) (d/N) σ̂²)

   BIC is proportional to AIC, with the factor 2 replaced by   log N

   For   N ≥ 8   (so that log N > 2), BIC tends to penalize complex models
   more heavily than AIC

   BIC also provides the posterior probability of each model   m:

                      exp(−BIC_m /2) / Σ_{l=1}^M exp(−BIC_l /2)

   BIC is asymptotically consistent as   N → ∞
                                                                      18 / 34
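The posterior-probability expression is a softmax over −BIC/2 and can be evaluated directly. A minimal Python sketch with hypothetical BIC scores for three candidate models (shifting by the minimum BIC is purely for numerical stability and cancels in the ratio):

```python
import numpy as np

# Hypothetical BIC scores for three candidate models (smaller is better)
bic = np.array([412.7, 408.1, 415.3])

# Posterior probability of model m: exp(-BIC_m/2) / sum_l exp(-BIC_l/2)
z = np.exp(-(bic - bic.min()) / 2)
posterior = z / z.sum()
print(posterior)   # the model with the smallest BIC gets the largest posterior
```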
Structural Risk Minimization
   The Vapnik-Chervonenkis (VC) theory provides a general
   measure of model complexity and gives associated bounds
   on the optimism

   Such a complexity measure, the VC dimension, is defined as follows:

              The VC dimension of the class of functions {f (x, α)} is
              the largest number of points that can be shattered by
              members of {f (x, α)}

   E.g. a linear indicator function in   p   dimensions has VC
   dimension   p + 1;   sin(αx)   has infinite VC dimension




                                                                 19 / 34
Structural Risk Minimization (cont.)
    If we fit   N   training points using   {f (x, α)}   having VC dimension
    h,   then with probability at least   1 − η   the following bound holds:

                 Err ≤ err + √( (h/N)(ln(2N/h) + 1) − (ln η)/N )

    The SRM approach fits a nested sequence of models of increasing VC
    dimensions   h₁ ≤ h₂ ≤ . . .   and then chooses the model with the
    smallest upper bound

    The SVM classifier efficiently carries out the SRM approach

Issues
  • There is a difficulty in calculating the VC dimension of a class
    of functions
  • In practice, the upper bound is often very loose

                                                                         20 / 34
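Evaluating the bound numerically illustrates both how it tightens with N and how loose it can be. A minimal Python sketch, taking h = 21 (a linear indicator function in p = 20 dimensions) and a hypothetical training error of 0.10:

```python
import math

def vc_bound(train_err, h, N, eta=0.05):
    """Upper bound on Err holding with probability at least 1 - eta."""
    return train_err + math.sqrt(h / N * (math.log(2 * N / h) + 1)
                                 - math.log(eta) / N)

# The penalty term shrinks as N grows, but stays far above train_err
for N in (100, 1000, 100000):
    print(N, round(vc_bound(0.10, 21, N), 3))
```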
Outline
1   Bias, Variance and Model Complexity


2   Nature of Prediction Error


3   Error Estimation: Analytical methods
      AIC
      BIC
      SRM Approach


4   Error Estimation: Sample re-use
      Cross-validation
      Bootstrapping


5   Model Assessment in R



                                           21 / 34
Sample re-use: cross-validation, bootstrapping

   These methods directly (and quite accurately) estimate
   the average generalization error
   The extra-sample error is evaluated, rather than the
   in-sample one (test input vectors need not
   coincide with training ones)
   They can be used with any loss function, and with
   nonlinear, adaptive fitting techniques
   However, they may underestimate the true error for such
   fitting methods as trees


                                                       22 / 34
Cross-validation
   Probably the simplest and most widely used method

   However, it is time-consuming

   The CV procedure looks as follows:
     1   Split the data into K roughly equal-sized parts
     2   For the k-th part, fit the model ŷ^{−k}(x) to the other K − 1 parts
     3   Then the cross-validation estimate of the prediction error is

                          CV = (1/N) Σ_{i=1}^N L(t_i , ŷ^{−k(i)}(x_i))




   The case   K = N   (leave-one-out cross-validation) is roughly
   unbiased, but can have high variance
                                                                         23 / 34
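The three steps of the CV procedure can be sketched in a few lines of Python (synthetic linear data and a polynomial model are assumptions for illustration; the deck's own worked example uses R):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
x = rng.uniform(-1, 1, N)
t = 1 + 2 * x + rng.normal(0, 0.5, N)    # assumed data-generating model

def cv_error(degree, K=10):
    """K-fold cross-validation estimate of squared-error prediction error."""
    idx = rng.permutation(N)
    folds = np.array_split(idx, K)       # step 1: K roughly equal-sized parts
    total = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        # step 2: fit to the other K - 1 parts
        coef = np.polyfit(x[train_idx], t[train_idx], degree)
        total += np.sum((np.polyval(coef, x[test_idx]) - t[test_idx]) ** 2)
    return total / N                     # step 3: average the held-out losses

e1, e8 = cv_error(1), cv_error(8)
print(e1, e8)   # the overfit degree-8 model typically scores worse
```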
Cross-validation (cont.)
    In practice, 5- or 10-fold cross-validation is recommended

    CV tends to overestimate the true prediction error on small
    datasets

    Often the one-standard-error rule is used with CV. See example:



                                         We choose the most
                                         parsimonious model
                                         whose error is no more
                                         than one standard error
                                         above the error of the
                                         best model

                                         A model with   p=9
                                         would be chosen


                                                                   24 / 34
Bootstrapping
   General method for assessing statistical accuracy
   Given a training set, the bootstrapping procedure steps are:
     1   Randomly draw datasets with replacement from it; each
         sample is of the same size as the original one
     2   This is done B times, producing B bootstrap datasets
     3   Fit the model to each of the bootstrap datasets
     4   Examine the prediction error, using the original training set as a
         test set:

             Êrr_boot = (1/N) Σ_{i=1}^N (1/|C^{−i}|) Σ_{b ∈ C^{−i}} L(t_i , ŷ^{*b}(x_i))

         where C^{−i} is the set of indices of the bootstrap samples that
         do not contain observation i
   To alleviate the upward bias, the .632 estimator is used:

                   Êrr^(.632) = 0.368 err + 0.632 Êrr_boot
                                                                                25 / 34
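A minimal Python sketch of this bootstrap estimate and the .632 correction (the synthetic linear data, B = 200 resamples, and the linear fit are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N, B = 60, 200
x = rng.uniform(-1, 1, N)
t = 1 + 2 * x + rng.normal(0, 0.5, N)    # assumed data-generating model

def predict(train_idx, x_new):
    coef = np.polyfit(x[train_idx], t[train_idx], 1)
    return np.polyval(coef, x_new)

# For each observation i, collect losses from bootstrap fits whose sample excludes i
losses = [[] for _ in range(N)]
for _ in range(B):
    b = rng.integers(0, N, N)             # steps 1-2: draw with replacement, same size
    out = np.setdiff1d(np.arange(N), b)   # observations not in this bootstrap sample
    if out.size:
        for i, p in zip(out, predict(b, x[out])):   # steps 3-4
            losses[i].append((t[i] - p) ** 2)

err_boot = np.mean([np.mean(l) for l in losses if l])
err_train = np.mean((np.polyval(np.polyfit(x, t, 1), x) - t) ** 2)  # training error err
err_632 = 0.368 * err_train + 0.632 * err_boot
print(err_train, err_boot, err_632)
```

As expected, the bootstrap estimate sits above the (optimistic) training error, and the .632 estimator lands between the two.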
Outline
1   Bias, Variance and Model Complexity


2   Nature of Prediction Error


3   Error Estimation: Analytical methods
      AIC
      BIC
      SRM Approach


4   Error Estimation: Sample re-use
      Cross-validation
      Bootstrapping


5   Model Assessment in R



                                           26 / 34
http://r-project.org



     Free software environment for statistical
     computing and graphics
     R packages for machine learning and data
     mining: kernlab, rpart, randomForest,
     animation, gbm, tm etc.
     R packages for evaluation: bootstrap, boot
     RStudio IDE
                                                 27 / 34
Housing dataset at UCI Machine learning
repository
http://archive.ics.uci.edu/ml/datasets/Housing

     Housing values in suburbs of Boston

     506 instances, 13 attributes + 1 numeric class attribute
     (MEDV)




                                                                 28 / 34
Loading data in R



> housing <- read.table("~/projects/r/housing.data",
+ header=T)
> attach(housing)




                                                       29 / 34
Cross-validation example in R
Helper function




Creating a function using crossval() from the bootstrap package


> eval <- function(fit,k=10){
+   require(bootstrap)
+   theta.fit <- function(x,y){lsfit(x,y)}
+   theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
+   x <- fit$model[,2:ncol(fit$model)]
+   y <- fit$model[,1]
+   results <- crossval(x,y,theta.fit,theta.predict,
+   ngroup=k)
+   squared.error=sum((y-results$cv.fit)^2)/length(y)
+   cat("Cross-validated squared error =",
+   squared.error, "\n")}

                                                              30 / 34
Cross-validation example in R
Model assessment




> fit <- lm(MEDV~., data=housing) # A linear model that uses all the attributes
> eval(fit)
Cross-validated squared error = 23.15827
> fit <- lm(MEDV~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS,
+ data=housing) # Less complex model
> eval(fit)
Cross-validated squared error = 23.24319
> fit <- lm(MEDV~ RM, data=housing) # Too simple model
> eval(fit)
Cross-validated squared error = 44.38424




                                                             31 / 34
Bootstrapping example in R
Helper function




Creating a function using the boot() function from the boot package


> sqer <- function(formula,data,indices){
+   d <- data[indices,]
+   fit <- lm(formula, data=d)
+   return(sum(fit$residuals^2)/length(fit$residuals))
+   }




                                                              32 / 34
Bootstrapping example in R
Model assessment

> results <- boot(data=housing,statistic=sqer,R=1000,
+ formula=MEDV~.) # 1000 bootstrapped datasets
> print(results)
Bootstrap Statistics :
    original   bias     std. error
t1* 21.89483 -0.76001     2.296025
> results <- boot(data=housing,statistic=sqer,R=1000,
+ formula=MEDV~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS)
> print(results)
Bootstrap Statistics :
    original     bias     std. error
t1* 22.88726 -0.5400892     2.744437
> results <- boot(data=housing,statistic=sqer,R=1000,
+ formula=MEDV~ RM)
> print(results)
Bootstrap Statistics :
    original     bias     std. error
t1* 43.60055 -0.3379168     5.407933
                                                             33 / 34
Resources


   T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical
   Learning, 2008
   Stanford Engineering Everywhere CS229: Machine Learning.
   Handouts 4 and 5
   http://videolectures.net/stanfordcs229f07_machine_learning/




                                                                34 / 34

 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

7 - Model Assessment and Selection

     There is some intermediate model complexity that gives
     minimum expected test error

                                                           6 / 34
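The gap between the two curves is easy to reproduce with a small simulation. Below is a minimal NumPy sketch (the target function, noise level, sample sizes, and polynomial degrees are illustrative assumptions, not from the slides): training error falls as model complexity grows, while error on fresh data eventually turns back up.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # noisy samples of a smooth target; the noise level is an arbitrary choice
    x = rng.uniform(-1, 1, n)
    t = np.sin(np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, t

x_tr, t_tr = make_data(30)      # small training set
x_te, t_te = make_data(1000)    # large held-out test set

train_err, test_err = {}, {}
for degree in (1, 3, 9):        # model complexity = polynomial degree
    coef = np.polyfit(x_tr, t_tr, degree)  # least-squares fit on training data
    train_err[degree] = np.mean((np.polyval(coef, x_tr) - t_tr) ** 2)
    test_err[degree] = np.mean((np.polyval(coef, x_te) - t_te) ** 2)
    print(degree, train_err[degree], test_err[degree])
```

Because the polynomial models are nested, training error can only decrease with degree; test error has no such guarantee, which is exactly why training error is a poor estimate of it.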
Defining our goals

Model Selection
     Estimating the performance of different models in order to
     choose the best one

Model Assessment
     Having chosen a final model, estimating its generalization
     error on new data

                                                           7 / 34
Data-rich situation

     Training set is used to learn the models
     Validation set is used to estimate prediction error for
     model selection
     Test set is used for assessment of the generalization error
     of the chosen model

                                                           8 / 34
Outline
1   Bias, Variance and Model Complexity

2   Nature of Prediction Error

3   Error Estimation: Analytical methods
      AIC
      BIC
      SRM Approach

4   Error Estimation: Sample re-use
      Cross-validation
      Bootstrapping

5   Model Assessment in R

                                           9 / 34
Bias-Variance Decomposition

     Let's consider the expected loss E[L] for the regression task:

         E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt

     Under squared error loss, h(x) = E[t|x] = ∫ t p(t|x) dt is
     the optimal prediction. Then, E[L] can be decomposed into
     the sum of three parts:

         E[L] = bias² + variance + noise

     where
         bias²    = ∫ (E_D[y(x; D)] − h(x))² p(x) dx
         variance = ∫ E_D[(y(x; D) − E_D[y(x; D)])²] p(x) dx
         noise    = ∫∫ (h(x) − t)² p(x, t) dx dt

                                                          10 / 34
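The decomposition can be verified by Monte Carlo. Here is a sketch in Python (the target h, noise level, and the deliberately underfit linear model are illustrative assumptions): averaging over many training sets D, the mean squared distance of y(x; D) from h(x) splits exactly into bias² + variance, and the noise term σ² sits on top.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3                         # noise std dev (assumed known here)
x_grid = np.linspace(-1, 1, 50)     # points over which terms are averaged
h = np.sin(np.pi * x_grid)          # optimal prediction h(x) = E[t|x]

def fit_predict():
    # draw one training set D and fit a (deliberately underfit) linear model
    x = rng.uniform(-1, 1, 40)
    t = np.sin(np.pi * x) + rng.normal(0.0, sigma, x.size)
    return np.polyval(np.polyfit(x, t, 1), x_grid)

preds = np.array([fit_predict() for _ in range(500)])  # y(x; D) over many D
mean_pred = preds.mean(axis=0)                         # estimate of E_D[y(x; D)]

bias2 = np.mean((mean_pred - h) ** 2)      # (E_D[y] - h)^2 averaged over x
variance = np.mean(preds.var(axis=0))      # E_D[(y - E_D[y])^2] averaged over x
noise = sigma ** 2
print("bias^2:", bias2, "variance:", variance, "noise:", noise)
print("expected squared-error loss ~", bias2 + variance + noise)
```

The split of mean squared distance into bias² + variance holds as an exact algebraic identity on the sampled predictions; only the noise term relies on the assumed σ.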
Bias-Variance Decomposition
Examples

     For a linear model y(x, w) = Σ_{j=1}^p w_j x_j, ∀w_j ≠ 0,
     the in-sample error is:

         Err = (1/N) Σ_{i=1}^N (ȳ(x_i) − h(x_i))² + (p/N) σ² + σ²

     For a ridge regression model (Tikhonov regularization):

         Err = (1/N) Σ_{i=1}^N {(ŷ(x_i) − h(x_i))² + (ŷ(x_i) − ȳ(x_i))²}
               + Var + σ²

     where ŷ(x_i) is the best-fitting linear approximation to h

                                                          11 / 34
Behavior of bias and variance

     [Figure: bias and variance as functions of model complexity]

                                                          12 / 34
Bias-variance tradeoff
Example

     Regression with squared loss
     Classification with 0-1 loss
     In the 2nd case, prediction error is no longer the sum of
     squared bias and variance ⇒ the best choices of tuning
     parameters may differ substantially in the two settings

                                                          13 / 34
Outline
1   Bias, Variance and Model Complexity

2   Nature of Prediction Error

3   Error Estimation: Analytical methods
      AIC
      BIC
      SRM Approach

4   Error Estimation: Sample re-use
      Cross-validation
      Bootstrapping

5   Model Assessment in R

                                          14 / 34
Analytical methods: AIC, BIC, SRM

     They give in-sample estimates in the general form:

         Êrr = err + ŵ

     where ŵ is an estimate of the average optimism
     By using ŵ, the methods penalize overly complex models
     Unlike regularization, they do not impose a specific
     regularization parameter λ
     Each criterion defines its own notion of model complexity
     involved in the penalizing term

                                                          15 / 34
Akaike Information Criterion (AIC)

     Applicable for linear models
     Either log-likelihood loss or squared error loss is used
     Given a set of models indexed by a tuning parameter α,
     denote by d(α) the number of parameters for each model.
     Then,

         AIC(α) = err + 2 (d(α)/N) σ̂²

     where σ̂² is typically estimated by the mean squared error
     of a low-bias model
     Finally, we choose the model giving the smallest AIC

                                                          16 / 34
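A hedged sketch of AIC-based selection in Python (synthetic data; the quadratic ground truth, the range of degrees, and using a degree-8 model as the low-bias estimate of σ̂² are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
x = rng.uniform(-1, 1, N)
t = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.4, N)  # true model: quadratic

def train_err(d):
    # training MSE of a degree-d polynomial fit (d + 1 parameters)
    coef = np.polyfit(x, t, d)
    return np.mean((np.polyval(coef, x) - t) ** 2)

sigma2_hat = train_err(8)    # sigma^2 estimated from a low-bias (rich) model
aic = {d: train_err(d) + 2 * ((d + 1) / N) * sigma2_hat for d in range(1, 7)}
best = min(aic, key=aic.get)  # choose the model with the smallest AIC
print({d: round(v, 4) for d, v in aic.items()}, "-> degree", best)
```

The underfit degree-1 model loses on the err term; degrees above the truth barely reduce err, so the 2 d(α)/N σ̂² penalty steers the choice back toward a parsimonious model.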
Akaike Information Criterion (AIC)
Example

     Phoneme recognition task (N = 1000)
     Input vector is the log-periodogram of the spoken vowel,
     quantized to 256 uniformly spaced frequencies
     Linear logistic regression is used to predict the phoneme
     class
     Here d(α) is the number of basis functions

                                                          17 / 34
Bayesian Information Criterion (BIC)

     BIC, like AIC, is applicable in settings where log-likelihood
     maximization is involved

         BIC = (N/σ̂²) (err + (log N) (d/N) σ̂²)

     BIC is proportional to AIC, with the factor 2 replaced by
     log N
     For N > 8, BIC tends to penalize complex models more
     heavily than AIC
     BIC also provides the posterior probability of each model m:

         e^(−BIC_m / 2) / Σ_{l=1}^M e^(−BIC_l / 2)

     BIC is asymptotically consistent as N → ∞

                                                          18 / 34
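The posterior-probability formula is a one-liner in Python (the BIC scores below are hypothetical illustrative numbers; subtracting the minimum BIC first is a standard numerical-stability trick, since raw BIC values are typically large enough to underflow exp):

```python
import numpy as np

# Hypothetical BIC scores for three candidate models (illustrative only)
bic = np.array([1010.3, 1002.7, 1005.9])

# Posterior of model m: exp(-BIC_m / 2) / sum_l exp(-BIC_l / 2).
# Subtracting the minimum BIC leaves the ratios unchanged but
# keeps exp() away from underflow.
z = -(bic - bic.min()) / 2.0
posterior = np.exp(z) / np.exp(z).sum()
print(posterior)   # the lowest-BIC model gets the largest posterior weight
```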
Structural Risk Minimization

     The Vapnik-Chervonenkis (VC) theory provides a general
     measure of model complexity and gives associated bounds on
     the optimism
     Such a complexity measure, the VC dimension, is defined as
     follows:

     The VC dimension of the class of functions {f(x, α)} is the
     largest number of points that can be shattered by members
     of {f(x, α)}

     E.g. a linear indicator function in p dimensions has VC
     dimension p + 1; sin(αx) has infinite VC dimension

                                                          19 / 34
Structural Risk Minimization (cont.)

     If we fit N training points using {f(x, α)} having VC
     dimension h, then with probability at least 1 − η the
     following bound holds:

         Err ≤ err + √( (h/N) (ln(2N/h) + 1) − (ln η)/N )

     The SRM approach fits a nested sequence of models of
     increasing VC dimensions h₁ < h₂ < . . . and then chooses
     the model with the smallest upper bound
     The SVM classifier efficiently carries out the SRM approach

Issues
     There is a difficulty in calculating the VC dimension of a
     class of functions
     In practice, the upper bound is often very loose

                                                          20 / 34
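The bound is easy to evaluate numerically. A sketch in Python (the training error, VC dimensions, and η are made-up illustrative numbers, chosen only to show how the complexity penalty grows with h):

```python
import math

def vc_bound(train_err, h, N, eta=0.05):
    # Slide's bound: Err <= err + sqrt((h/N)(ln(2N/h) + 1) - ln(eta)/N),
    # holding with probability at least 1 - eta (assumes h <= N)
    slack = (h / N) * (math.log(2 * N / h) + 1.0) - math.log(eta) / N
    return train_err + math.sqrt(slack)

N = 1000
for h in (10, 50, 200):
    # same training error for every class, to isolate the complexity penalty
    print(h, round(vc_bound(0.10, h, N), 3))
```

SRM would pick, from a nested family h₁ < h₂ < . . ., the member minimizing this upper bound; note how quickly the penalty term grows relative to a training error of 0.10, which illustrates why the bound is often loose in practice.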
Outline
1   Bias, Variance and Model Complexity

2   Nature of Prediction Error

3   Error Estimation: Analytical methods
      AIC
      BIC
      SRM Approach

4   Error Estimation: Sample re-use
      Cross-validation
      Bootstrapping

5   Model Assessment in R

                                          21 / 34
Sample re-use: cross-validation, bootstrapping

     These methods directly (and quite accurately) estimate the
     average generalization error
     The extra-sample error is evaluated rather than the
     in-sample one (test input vectors do not need to coincide
     with training ones)
     They can be used with any loss function, and with nonlinear,
     adaptive fitting techniques
     However, they may underestimate the true error for such
     fitting methods as trees

                                                          22 / 34
Cross-validation

     Probably the simplest and most widely used method
     However, a time-consuming method
     The CV procedure looks as follows:
       1  Split the data into K roughly equal-sized parts
       2  For the k-th part, fit the model y^{−k}(x) to the
          other K − 1 parts
       3  Then the cross-validation estimate of the prediction
          error is

              CV = (1/N) Σ_{i=1}^N L(t_i, y^{−k(i)}(x_i))

     The case K = N (leave-one-out cross-validation) is roughly
     unbiased, but can have high variance

                                                          23 / 34
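The three steps can be sketched from scratch in Python (synthetic linear data and squared-error loss; the data-generating model, K = 10, and the candidate degrees are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100
x = rng.uniform(-1, 1, N)
t = 2.0 * x + rng.normal(0.0, 0.5, N)   # linear signal, noise variance 0.25

def cv_error(degree, K=10):
    # Step 1: split shuffled indices into K roughly equal-sized parts
    folds = np.array_split(rng.permutation(N), K)
    sq_err = np.empty(N)
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(N), test)
        # Step 2: fit y^{-k} on the other K - 1 parts
        coef = np.polyfit(x[train], t[train], degree)
        sq_err[test] = (np.polyval(coef, x[test]) - t[test]) ** 2
    # Step 3: average the held-out losses over all N points
    return sq_err.mean()

for d in (1, 5):
    print(d, round(cv_error(d), 4))
```

For the degree-1 model the CV estimate should land near the irreducible noise variance of 0.25, since the model class contains the truth.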
Cross-validation (cont.)

     In practice, 5- or 10-fold cross-validation is recommended
     CV tends to overestimate the true prediction error on small
     datasets
     Often the one-standard-error rule is used with CV. See
     example:

     We choose the most parsimonious model whose error is no
     more than one standard error above the error of the best
     model

     A model with p = 9 would be chosen

                                                          24 / 34
Bootstrapping

     A general method for assessing statistical accuracy
     Given a training set, the bootstrapping procedure steps are:
       1  Randomly draw datasets with replacement from it; each
          sample is of the same size as the original one
       2  This is done B times, producing B bootstrap datasets
       3  Fit the model to each of the bootstrap datasets
       4  Examine the prediction error using the original
          training set as a test set:

              Êrr_boot = (1/N) Σ_{i=1}^N (1/|C^{−i}|) Σ_{b∈C^{−i}} L(t_i, y*^b(x_i))

          where C^{−i} is the set of indices of the bootstrap
          samples that do not contain observation i

     To alleviate the upward bias, the .632 estimator is used:

         Êrr^(.632) = 0.368 · err + 0.632 · Êrr_boot

                                                          25 / 34
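A sketch of the leave-one-out bootstrap estimate and the .632 correction in Python (squared-error loss on synthetic linear data; the sample size, B = 200, and the data model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
N, B = 80, 200
x = rng.uniform(-1, 1, N)
t = 2.0 * x + rng.normal(0.0, 0.5, N)

def fit(idx):
    return np.polyfit(x[idx], t[idx], 1)    # linear least-squares fit

# training error of the model fit on the full original sample
err = np.mean((np.polyval(fit(np.arange(N)), x) - t) ** 2)

# leave-one-out bootstrap: score observation i only under bootstrap fits
# whose sample does not contain i (the set C^{-i})
loss = [[] for _ in range(N)]
for b in range(B):
    sample = rng.integers(0, N, N)                # draw N indices with replacement
    coef = fit(sample)
    for i in np.setdiff1d(np.arange(N), sample):  # observations left out of sample b
        loss[i].append((np.polyval(coef, x[i]) - t[i]) ** 2)

err_boot = np.mean([np.mean(l) for l in loss if l])
err_632 = 0.368 * err + 0.632 * err_boot          # the .632 estimator
print(round(err, 4), round(err_boot, 4), round(err_632, 4))
```

By construction the .632 estimate lies between the optimistic training error and the pessimistic leave-one-out bootstrap error, which is exactly the bias trade it is designed to make.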
Outline
1   Bias, Variance and Model Complexity

2   Nature of Prediction Error

3   Error Estimation: Analytical methods
      AIC
      BIC
      SRM Approach

4   Error Estimation: Sample re-use
      Cross-validation
      Bootstrapping

5   Model Assessment in R

                                          26 / 34
http://r-project.org

     Free software environment for statistical computing and
     graphics
     R packages for machine learning and data mining: kernlab,
     rpart, randomForest, animation, gbm, tm etc.
     R packages for evaluation: bootstrap, boot
     RStudio IDE

                                                          27 / 34
Housing dataset at UCI Machine Learning Repository

     http://archive.ics.uci.edu/ml/datasets/Housing
     Housing values in suburbs of Boston
     506 instances, 13 attributes + 1 numeric class attribute
     (MEDV)

                                                          28 / 34
Loading data in R

     > housing <- read.table("~/projects/r/housing.data",
     +                       header=T)
     > attach(housing)

                                                          29 / 34
Cross-validation example in R
Helper function

     Creating a function using crossval() from the bootstrap
     package

     > eval <- function(fit, k=10){
     +   require(bootstrap)
     +   theta.fit <- function(x,y){lsfit(x,y)}
     +   theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
     +   x <- fit$model[,2:ncol(fit$model)]
     +   y <- fit$model[,1]
     +   results <- crossval(x,y,theta.fit,theta.predict,
     +                       ngroup=k)
     +   squared.error <- sum((y-results$cv.fit)^2)/length(y)
     +   cat("Cross-validated squared error =",
     +       squared.error, "\n")}

                                                          30 / 34
Cross-validation example in R
Model assessment

     > fit <- lm(MEDV ~ ., data=housing)
     > # A linear model that uses all the attributes
     > eval(fit)
     Cross-validated squared error = 23.15827

     > fit <- lm(MEDV ~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS,
     +           data=housing)  # Less complex model
     > eval(fit)
     Cross-validated squared error = 23.24319

     > fit <- lm(MEDV ~ RM, data=housing)  # Too simple model
     > eval(fit)
     Cross-validated squared error = 44.38424

                                                          31 / 34
Bootstrapping example in R
Helper function

     Creating a function using the boot() function from the boot
     package

     > sqer <- function(formula, data, indices){
     +   d <- data[indices,]
     +   fit <- lm(formula, data=d)
     +   return(sum(fit$residuals^2)/length(fit$residuals))
     + }

                                                          32 / 34
Bootstrapping example in R
Model assessment

     > results <- boot(data=housing, statistic=sqer, R=1000,
     +                 formula=MEDV ~ .)  # 1000 bootstrapped datasets
     > print(results)
     Bootstrap Statistics :
         original     bias    std. error
     t1* 21.89483  -0.76001     2.296025

     > results <- boot(data=housing, statistic=sqer, R=1000,
     +                 formula=MEDV ~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS)
     > print(results)
     Bootstrap Statistics :
         original       bias    std. error
     t1* 22.88726  -0.5400892    2.744437

     > results <- boot(data=housing, statistic=sqer, R=1000,
     +                 formula=MEDV ~ RM)
     > print(results)
     Bootstrap Statistics :
         original       bias    std. error
     t1* 43.60055  -0.3379168    5.407933

                                                          33 / 34
Resources

     T. Hastie, R. Tibshirani, J. Friedman. The Elements of
     Statistical Learning, 2008
     Stanford Engineering Everywhere CS229 Machine Learning.
     Handouts 4 and 5
     http://videolectures.net/stanfordcs229f07_machine_learning/

                                                          34 / 34