Background

Purpose: predict Y from X.
What we have: n observations of (X, Y): $(x_1, y_1), \ldots, (x_n, y_n)$.
What we want: estimate the unknown Y for new values of X: $x_{n+1}, \ldots, x_m$.

X can be:
- numeric variables;
- or factors;
- or a combination of numeric variables and factors.

Y can be:
- a numeric variable ($Y \in \mathbb{R}$) $\Rightarrow$ (supervised) regression;
- a factor $\Rightarrow$ (supervised) classification.
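As a minimal illustration of the two prediction tasks (an illustrative sketch, not from the original slides; all names and the simulated data are assumptions), the R code below builds a toy dataset where the same X is paired once with a numeric response (regression) and once with a factor response (classification).

## Illustrative sketch (not from the slides): regression vs classification responses
set.seed(0)
n <- 50
x <- runif(n, 0, 10)                          # a numeric predictor X

y_num <- 2 * x + rnorm(n)                     # numeric Y => (supervised) regression
y_fac <- factor(ifelse(x + rnorm(n) > 5,      # factor Y  => (supervised) classification
                       "high", "low"))

str(y_num)   # num [1:50] ...
str(y_fac)   # Factor w/ 2 levels "high","low": ...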
Basics

From the observations $(x_i, y_i)_i$, a machine $\Phi_n$ is defined such that
$$\hat{y}_{\mathrm{new}} = \Phi_n(x_{\mathrm{new}}).$$
- if Y is numeric, $\Phi_n$ is called a regression function;
- if Y is a factor, $\Phi_n$ is called a classifier;
- $\Phi_n$ is said to be trained or learned from the observations $(x_i, y_i)_i$.

Desirable properties
- accuracy to the observations: predictions made on known data are close to the observed values;
- generalization ability: predictions made on new data are also accurate.

Conflicting objectives!
Underfitting/Overfitting

[Sequence of figures illustrating, for the same function $x \mapsto y$ to be estimated:]
- the function to be estimated;
- observations we might have;
- observations we do have;
- a first estimation from the observations: underfitting;
- a second estimation from the observations: accurate estimation;
- a third estimation from the observations: overfitting.
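To make this picture concrete, here is a minimal R sketch (illustrative, not from the original slides; the simulated function, noise level and polynomial degrees are assumptions) that fits models of increasing complexity to the same noisy observations.

## Illustrative sketch (not from the slides): under/overfitting with polynomials
set.seed(42)
n <- 30
x <- sort(runif(n, 0, 1))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)    # noisy observations of a smooth function

fit_under <- lm(y ~ x)                       # degree 1: underfitting
fit_ok    <- lm(y ~ poly(x, 4))              # degree 4: reasonable fit
fit_over  <- lm(y ~ poly(x, 20))             # degree 20: overfitting

xs <- seq(0, 1, length.out = 200)
plot(x, y, pch = 19, main = "Underfitting vs overfitting")
lines(xs, sin(2 * pi * xs), lty = 2)                        # true function
lines(xs, predict(fit_under, data.frame(x = xs)), col = 2)
lines(xs, predict(fit_ok,    data.frame(x = xs)), col = 3)
lines(xs, predict(fit_over,  data.frame(x = xs)), col = 4)
legend("topright", lty = c(2, 1, 1, 1), col = 1:4,
       legend = c("true", "degree 1", "degree 4", "degree 20"))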
Errors

Training error (measures the accuracy to the observations):
- if y is a factor: misclassification rate
  $$\frac{\#\{\hat{y}_i \neq y_i,\ i = 1, \ldots, n\}}{n}$$
- if y is numeric: mean square error (MSE)
  $$\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
  or root mean square error (RMSE), or pseudo-$R^2$: $1 - \mathrm{MSE}/\mathrm{Var}((y_i)_i)$.

Test error: a way to prevent overfitting (and to estimate the generalization error) is simple validation:
1. split the data into training/test sets (usually 80%/20%);
2. train $\Phi_n$ on the training dataset;
3. compute the test error on the remaining data.
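A minimal R sketch of simple validation (illustrative, not from the original slides; the mtcars data, the linear model and the 80%/20% ratio are assumptions) computing the training and test MSE and a test pseudo-R².

## Illustrative sketch (not from the slides): training error vs test error by simple validation
set.seed(1)
data(mtcars)
n <- nrow(mtcars)
train <- sample(seq_len(n), size = round(0.8 * n))    # 80%/20% split

fit <- lm(mpg ~ wt + hp, data = mtcars[train, ])      # train the machine on the training set

mse <- function(y, yhat) mean((y - yhat)^2)
train_mse <- mse(mtcars$mpg[train],  predict(fit))
test_mse  <- mse(mtcars$mpg[-train], predict(fit, newdata = mtcars[-train, ]))
c(train = train_mse, test = test_mse,
  pseudo_R2 = 1 - test_mse / var(mtcars$mpg[-train]))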
Consistency in the parametric/non parametric case

Example in the parametric framework (linear methods): an assumption is made on the form of the relation between X and Y,
$$Y = \beta^T X + \epsilon,$$
and $\beta$ is estimated from the observations $(x_1, y_1), \ldots, (x_n, y_n)$ by $\hat{\beta}_n$. The estimation is said to be consistent if $\hat{\beta}_n \xrightarrow{n \to +\infty} \beta$, under (possibly) technical assumptions on X, $\epsilon$, Y.

Example in the nonparametric framework: the form of the relation between X and Y is unknown,
$$Y = \Phi(X) + \epsilon,$$
and $\Phi$ is estimated from the observations $(x_1, y_1), \ldots, (x_n, y_n)$ by a given method which produces a $\Phi_n$. The estimation is said to be consistent if $\Phi_n \xrightarrow{n \to +\infty} \Phi$, under (possibly) technical assumptions on X, $\epsilon$, Y.
Consistency from the statistical learning perspective [Vapnik, 1995]

Question: are we really interested in estimating $\Phi$ or... rather in having the smallest prediction error?

Statistical learning perspective: a method that builds a machine $\Phi_n$ from the observations is said to be (universally) consistent if, given a risk function $R: \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ (which calculates an error),
$$E\left(R(\Phi_n(X), Y)\right) \xrightarrow{n \to +\infty} \inf_{\Phi: \mathcal{X} \to \mathbb{R}} E\left(R(\Phi(X), Y)\right),$$
for any distribution of $(X, Y) \in \mathcal{X} \times \mathbb{R}$.

Definitions: $L^* = \inf_{\Phi: \mathcal{X} \to \mathbb{R}} E(R(\Phi(X), Y))$ and $L_\Phi = E(R(\Phi(X), Y))$.
Desirable properties from a mathematical perspective

Simplified framework: $X \in \mathcal{X}$ and $Y \in \{-1, 1\}$ (binary classification).

Learning process: choose a machine $\Phi_n$ in a class of functions $\mathcal{C} \subset \{f: \mathcal{X} \to \mathbb{R}\}$ (e.g., $\mathcal{C}$ is the set of all functions that can be built using an SVM).

Error decomposition
$$L_{\Phi_n} - L^* = \left( L_{\Phi_n} - \inf_{\Phi \in \mathcal{C}} L_\Phi \right) + \left( \inf_{\Phi \in \mathcal{C}} L_\Phi - L^* \right)$$
with
- $\inf_{\Phi \in \mathcal{C}} L_\Phi - L^*$ the richness of $\mathcal{C}$ (i.e., $\mathcal{C}$ must be rich to ensure that this term is small);
- $L_{\Phi_n} - \inf_{\Phi \in \mathcal{C}} L_\Phi \leq 2 \sup_{\Phi \in \mathcal{C}} |\hat{L}_n(\Phi) - L_\Phi|$, with $\hat{L}_n(\Phi) = \frac{1}{n} \sum_{i=1}^{n} R(\Phi(x_i), y_i)$, the generalization capability of $\mathcal{C}$ (i.e., in the worst case, the empirical error must be close to the true error: $\mathcal{C}$ must not be too rich to ensure that this term is small).
Basic introduction

Binary classification problem: $X \in \mathcal{H}$ and $Y \in \{-1, 1\}$.
A training set is given: $(x_1, y_1), \ldots, (x_n, y_n)$.

SVM is a kernel-based method. It is universally consistent, provided that the kernel is universal [Steinwart, 2002].
Extensions to the regression case exist (SVR or LS-SVM) that are also universally consistent when the kernel is universal.
Optimal margin classification

[Figure: separating hyperplane with normal vector $w$, margin $1/\|w\|_2$, and the support vectors highlighted.]

$w$ is chosen such that:
- $\min_w \|w\|^2$ (the margin is the largest),
- under the constraints $y_i(\langle w, x_i \rangle + b) \geq 1$, $1 \leq i \leq n$ (the separation between the two classes is perfect).
$\Rightarrow$ ensures a good generalization capability.
Soft margin classification

[Figure: separating hyperplane with normal vector $w$, margin $1/\|w\|_2$, and the support vectors highlighted.]

$w$ is chosen such that:
- $\min_{w, \xi} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$ (the margin is the largest),
- under the constraints $y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes is almost perfect).
$\Rightarrow$ allowing a few errors improves the richness of the class.
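To see the role of the cost C in practice, here is a minimal R sketch (illustrative, not from the original slides; the simulated data and the two cost values are assumptions) comparing the number of support vectors of a linear SVM from e1071 for a small and a large C.

## Illustrative sketch (not from the slides): effect of the cost C on a linear SVM
library(e1071)
set.seed(2)
n <- 100
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, 1, -1))
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

fit_small_C <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.01)
fit_large_C <- svm(y ~ ., data = dat, kernel = "linear", cost = 100)

# a small C tolerates many margin violations (wider margin, more support vectors);
# a large C penalizes them heavily (narrower margin, usually fewer support vectors)
c(small_C = nrow(fit_small_C$SV), large_C = nrow(fit_large_C$SV))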
Non linear SVM

[Figure: the original space $\mathcal{X}$ is mapped into a feature space $\mathcal{H}$ by a (non linear) map $\phi$.]

$w \in \mathcal{H}$ is chosen such that $(P_{C,\mathcal{H}})$:
- $\min_{w, \xi} \|w\|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \xi_i$ (the margin in the feature space is the largest),
- under the constraints $y_i(\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes in the feature space is almost perfect).
SVM from different points of view

A regularization problem: $(P_{C,\mathcal{H}})$ is equivalent to
$$(P_{\lambda,\mathcal{H}}): \quad \min_{w \in \mathcal{H}} \underbrace{\frac{1}{n} \sum_{i=1}^{n} R(f_w(x_i), y_i)}_{\text{error term}} + \underbrace{\lambda \|w\|_{\mathcal{H}}^2}_{\text{penalization term}},$$
where $f_w(x) = \langle \phi(x), w \rangle_{\mathcal{H}}$ and $R(\hat{y}, y) = \max(0, 1 - \hat{y}y)$ (the hinge loss function).

[Figure: errors versus $\hat{y}$ for $y = 1$; blue: hinge loss, green: misclassification error. A sketch reproducing this comparison is given after this slide.]
A dual problem: $(P_{C,\mathcal{H}})$ is equivalent to
$$(D_{C,\mathcal{X}}): \quad \max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$
with $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$, $1 \leq i \leq n$.

There is no need to know $\phi$ and $\mathcal{H}$:
- choose a function K with a few good properties;
- use it as the dot product in $\mathcal{H}$: $\forall\, u, v \in \mathcal{X}$, $K(u, v) = \langle \phi(u), \phi(v) \rangle_{\mathcal{H}}$.
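The blue/green loss comparison mentioned above can be reproduced with a few lines of R; this is an illustrative sketch (the grid of predicted values is an arbitrary choice), not a figure from the original slides.

## Illustrative sketch (not from the slides): hinge loss vs misclassification error for y = 1
yhat <- seq(-2, 2, length.out = 400)          # predicted values (decision function output)
y <- 1                                        # true label

hinge  <- pmax(0, 1 - yhat * y)               # hinge loss: max(0, 1 - yhat * y)
misclf <- as.numeric(sign(yhat) != y)         # misclassification (0/1) error

plot(yhat, hinge, type = "l", col = "blue", ylim = c(0, 3),
     xlab = expression(hat(y)), ylab = "error")
lines(yhat, misclf, col = "green")
legend("topright", lty = 1, col = c("blue", "green"),
       legend = c("hinge loss", "misclassification error"))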
Which kernels?

Minimum properties that a kernel should fulfill:
- symmetry: $K(u, u') = K(u', u)$;
- positivity: $\forall\, N \in \mathbb{N}$, $\forall\, (\alpha_i) \in \mathbb{R}^N$, $\forall\, (x_i) \in \mathcal{X}^N$, $\sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \geq 0$.

[Aronszajn, 1950]: there exist a Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ and a function $\phi: \mathcal{X} \to \mathcal{H}$ such that
$$\forall\, u, v \in \mathcal{X}, \quad K(u, v) = \langle \phi(u), \phi(v) \rangle_{\mathcal{H}}.$$

Examples
- the Gaussian kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = e^{-\gamma \|x - x'\|^2}$ (it is universal on every bounded subset of $\mathbb{R}^d$);
- the linear kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = x^T x'$ (it is not universal).
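As a quick numerical check of these two properties, the R sketch below (illustrative, not from the original slides; the random points and the value of gamma are arbitrary) builds a Gaussian kernel matrix and verifies symmetry and positive (semi-)definiteness.

## Illustrative sketch (not from the slides): checking symmetry and positivity of a Gaussian kernel
set.seed(3)
X <- matrix(rnorm(20 * 2), ncol = 2)          # 20 points in R^2
gamma <- 0.5                                  # arbitrary kernel parameter

# Gaussian kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
D2 <- as.matrix(dist(X))^2
K <- exp(-gamma * D2)

max(abs(K - t(K)))                            # symmetry: should be 0 (up to rounding)
min(eigen(K, symmetric = TRUE)$values)        # positivity: all eigenvalues >= 0 (up to rounding)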
In summary, what does the solution look like?

$$\Phi_n(x) = \sum_{i} \alpha_i y_i K(x_i, x),$$
where only a few $\alpha_i \neq 0$. The $x_i$ such that $\alpha_i \neq 0$ are the support vectors!
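To connect this expansion with the e1071 objects used below, here is an illustrative R sketch (an assumption about the workflow, not part of the original slides) that rebuilds the decision values from the fitted coefficients $\alpha_i y_i$ (coefs), the support vectors (SV) and the offset (rho) of an svm fit, using a linear kernel and scale = FALSE to keep the algebra simple.

## Illustrative sketch (not from the slides): recovering the SVM decision function
## f(x) = sum_i alpha_i y_i K(x_i, x) - rho from an e1071 fit (linear kernel, no scaling)
library(e1071)
data(iris)
iris2 <- droplevels(iris[iris$Species %in% c("versicolor", "virginica"), ])

fit <- svm(Species ~ ., data = iris2, kernel = "linear", scale = FALSE)

X  <- as.matrix(iris2[, 1:4])
SV <- fit$SV                                  # support vectors x_i
a  <- as.vector(fit$coefs)                    # alpha_i * y_i for the support vectors

manual  <- X %*% t(SV) %*% a - fit$rho        # sum_i (alpha_i y_i) <x_i, x> - rho
builtin <- attr(predict(fit, iris2, decision.values = TRUE), "decision.values")

max(abs(manual - as.vector(builtin)))         # should be ~0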
I'm almost dead with all this stuff on my mind!!! What in practice?

Restrict iris to two classes and plot them:

data(iris)
iris <- iris[iris$Species %in% c("versicolor", "virginica"), ]
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19)
legend("topleft", pch = 19, col = c(2, 3),
       legend = c("versicolor", "virginica"))
Tuning the cost of a linear SVM by 10-fold cross-validation:

library(e1071)
res.tune <- tune.svm(Species ~ ., data = iris, kernel = "linear",
                     cost = 2^(-1:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#     cost
#      0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4),
#          kernel = "linear")
# Parameters:
#    SVM-Type:  C-classification
#  SVM-Kernel:  linear
#        cost:  0.5
#       gamma:  0.25
# Number of Support Vectors:  21
Training confusion matrix and plot of the fitted model:

table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         45         0
#   virginica       0          5        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
Tuning the cost and gamma of a radial (Gaussian) SVM:

res.tune <- tune.svm(Species ~ ., data = iris, gamma = 2^(-1:1),
                     cost = 2^(2:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#   gamma cost
#     0.5    4
# - best performance: 0.08
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1),
#          cost = 2^(2:4))
# Parameters:
#    SVM-Type:  C-classification
#  SVM-Kernel:  radial
#        cost:  4
#       gamma:  0.5
# Number of Support Vectors:  32
Training confusion matrix and plot of the fitted radial model:

table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         49         0
#   virginica       0          1        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
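The confusion matrices above are computed on the training data. As a complement (an illustrative sketch, not from the original slides; the split and the random seed are assumptions), the code below estimates the test error of the tuned radial SVM by the simple validation scheme introduced earlier.

## Illustrative sketch (not from the slides): test error of the tuned SVM by simple validation
set.seed(4)
iris2 <- droplevels(iris)                     # drop the unused 'setosa' level
train <- sample(seq_len(nrow(iris2)), size = round(0.8 * nrow(iris2)))

fit <- svm(Species ~ ., data = iris2[train, ], kernel = "radial",
           cost = 4, gamma = 0.5)
pred <- predict(fit, newdata = iris2[-train, ])

mean(pred != iris2$Species[-train])           # test misclassification rate
table(pred, iris2$Species[-train])            # test confusion matrix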
References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.

Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York, USA.

More can be found on my website: http://nathalievilla.org/learning.html