Background

Purpose: predict Y from X.
What we have: n observations of (X, Y): $(x_1, y_1), \ldots, (x_n, y_n)$.
What we want: estimate the unknown Y for new values of X: $x_{n+1}, \ldots, x_m$.

X can be:
- numeric variables;
- or factors;
- or a combination of numeric variables and factors.

Y can be:
- a numeric variable ($Y \in \mathbb{R}$) $\Rightarrow$ (supervised) regression;
- a factor $\Rightarrow$ (supervised) classification.
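As a minimal illustration of the two prediction tasks (an illustrative sketch, not from the original slides; all names and the simulated data are assumptions), the R code below builds a toy dataset where the same X is paired once with a numeric response (regression) and once with a factor response (classification).

## Illustrative sketch (not from the slides): regression vs classification responses
set.seed(0)
n <- 50
x <- runif(n, 0, 10)                          # a numeric predictor X

y_num <- 2 * x + rnorm(n)                     # numeric Y => (supervised) regression
y_fac <- factor(ifelse(x + rnorm(n) > 5,      # factor Y  => (supervised) classification
                       "high", "low"))

str(y_num)   # num [1:50] ...
str(y_fac)   # Factor w/ 2 levels "high","low": ...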
Basics

From the observations $(x_i, y_i)_i$, a machine $\Phi_n$ is defined such that
$$\hat{y}_{\mathrm{new}} = \Phi_n(x_{\mathrm{new}}).$$
- if Y is numeric, $\Phi_n$ is called a regression function;
- if Y is a factor, $\Phi_n$ is called a classifier;
- $\Phi_n$ is said to be trained or learned from the observations $(x_i, y_i)_i$.

Desirable properties
- accuracy to the observations: predictions made on known data are close to the observed values;
- generalization ability: predictions made on new data are also accurate.

Conflicting objectives!
Underfitting/Overfitting

[Sequence of figures illustrating, for the same function $x \mapsto y$ to be estimated:]
- the function to be estimated;
- observations we might have;
- observations we do have;
- a first estimation from the observations: underfitting;
- a second estimation from the observations: accurate estimation;
- a third estimation from the observations: overfitting.
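To make this picture concrete, here is a minimal R sketch (illustrative, not from the original slides; the simulated function, noise level and polynomial degrees are assumptions) that fits models of increasing complexity to the same noisy observations.

## Illustrative sketch (not from the slides): under/overfitting with polynomials
set.seed(42)
n <- 30
x <- sort(runif(n, 0, 1))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)    # noisy observations of a smooth function

fit_under <- lm(y ~ x)                       # degree 1: underfitting
fit_ok    <- lm(y ~ poly(x, 4))              # degree 4: reasonable fit
fit_over  <- lm(y ~ poly(x, 20))             # degree 20: overfitting

xs <- seq(0, 1, length.out = 200)
plot(x, y, pch = 19, main = "Underfitting vs overfitting")
lines(xs, sin(2 * pi * xs), lty = 2)                        # true function
lines(xs, predict(fit_under, data.frame(x = xs)), col = 2)
lines(xs, predict(fit_ok,    data.frame(x = xs)), col = 3)
lines(xs, predict(fit_over,  data.frame(x = xs)), col = 4)
legend("topright", lty = c(2, 1, 1, 1), col = 1:4,
       legend = c("true", "degree 1", "degree 4", "degree 20"))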
Errors

Training error (measures the accuracy to the observations):
- if y is a factor: misclassification rate
  $$\frac{\#\{\hat{y}_i \neq y_i,\ i = 1, \ldots, n\}}{n}$$
- if y is numeric: mean square error (MSE)
  $$\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
  or root mean square error (RMSE), or pseudo-$R^2$: $1 - \mathrm{MSE}/\mathrm{Var}((y_i)_i)$.

Test error: a way to prevent overfitting (and to estimate the generalization error) is simple validation:
1. split the data into training/test sets (usually 80%/20%);
2. train $\Phi_n$ on the training dataset;
3. compute the test error on the remaining data.
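A minimal R sketch of simple validation (illustrative, not from the original slides; the mtcars data, the linear model and the 80%/20% ratio are assumptions) computing the training and test MSE and a test pseudo-R².

## Illustrative sketch (not from the slides): training error vs test error by simple validation
set.seed(1)
data(mtcars)
n <- nrow(mtcars)
train <- sample(seq_len(n), size = round(0.8 * n))    # 80%/20% split

fit <- lm(mpg ~ wt + hp, data = mtcars[train, ])      # train the machine on the training set

mse <- function(y, yhat) mean((y - yhat)^2)
train_mse <- mse(mtcars$mpg[train],  predict(fit))
test_mse  <- mse(mtcars$mpg[-train], predict(fit, newdata = mtcars[-train, ]))
c(train = train_mse, test = test_mse,
  pseudo_R2 = 1 - test_mse / var(mtcars$mpg[-train]))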
Consistency in the parametric/non parametric case

Example in the parametric framework (linear methods): an assumption is made on the form of the relation between X and Y,
$$Y = \beta^T X + \epsilon,$$
and $\beta$ is estimated from the observations $(x_1, y_1), \ldots, (x_n, y_n)$ by $\hat{\beta}_n$. The estimation is said to be consistent if $\hat{\beta}_n \xrightarrow{n \to +\infty} \beta$, under (possibly) technical assumptions on X, $\epsilon$, Y.

Example in the nonparametric framework: the form of the relation between X and Y is unknown,
$$Y = \Phi(X) + \epsilon,$$
and $\Phi$ is estimated from the observations $(x_1, y_1), \ldots, (x_n, y_n)$ by a given method which produces a $\Phi_n$. The estimation is said to be consistent if $\Phi_n \xrightarrow{n \to +\infty} \Phi$, under (possibly) technical assumptions on X, $\epsilon$, Y.
Consistency from the statistical learning perspective [Vapnik, 1995]

Question: are we really interested in estimating $\Phi$ or... rather in having the smallest prediction error?

Statistical learning perspective: a method that builds a machine $\Phi_n$ from the observations is said to be (universally) consistent if, given a risk function $R: \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ (which calculates an error),
$$E\left(R(\Phi_n(X), Y)\right) \xrightarrow{n \to +\infty} \inf_{\Phi: \mathcal{X} \to \mathbb{R}} E\left(R(\Phi(X), Y)\right),$$
for any distribution of $(X, Y) \in \mathcal{X} \times \mathbb{R}$.

Definitions: $L^* = \inf_{\Phi: \mathcal{X} \to \mathbb{R}} E(R(\Phi(X), Y))$ and $L_\Phi = E(R(\Phi(X), Y))$.
Desirable properties from a mathematical perspective

Simplified framework: $X \in \mathcal{X}$ and $Y \in \{-1, 1\}$ (binary classification).

Learning process: choose a machine $\Phi_n$ in a class of functions $\mathcal{C} \subset \{f: \mathcal{X} \to \mathbb{R}\}$ (e.g., $\mathcal{C}$ is the set of all functions that can be built using an SVM).

Error decomposition
$$L_{\Phi_n} - L^* = \left( L_{\Phi_n} - \inf_{\Phi \in \mathcal{C}} L_\Phi \right) + \left( \inf_{\Phi \in \mathcal{C}} L_\Phi - L^* \right)$$
with
- $\inf_{\Phi \in \mathcal{C}} L_\Phi - L^*$ the richness of $\mathcal{C}$ (i.e., $\mathcal{C}$ must be rich to ensure that this term is small);
- $L_{\Phi_n} - \inf_{\Phi \in \mathcal{C}} L_\Phi \leq 2 \sup_{\Phi \in \mathcal{C}} |\hat{L}_n(\Phi) - L_\Phi|$, with $\hat{L}_n(\Phi) = \frac{1}{n} \sum_{i=1}^{n} R(\Phi(x_i), y_i)$, the generalization capability of $\mathcal{C}$ (i.e., in the worst case, the empirical error must be close to the true error: $\mathcal{C}$ must not be too rich to ensure that this term is small).
Basic introduction

Binary classification problem: $X \in \mathcal{H}$ and $Y \in \{-1, 1\}$.
A training set is given: $(x_1, y_1), \ldots, (x_n, y_n)$.

SVM is a kernel-based method. It is universally consistent, provided that the kernel is universal [Steinwart, 2002].
Extensions to the regression case exist (SVR or LS-SVM) that are also universally consistent when the kernel is universal.
Optimal margin classification

[Figure: separating hyperplane with normal vector $w$, margin $1/\|w\|_2$, and the support vectors highlighted.]

$w$ is chosen such that:
- $\min_w \|w\|^2$ (the margin is the largest),
- under the constraints $y_i(\langle w, x_i \rangle + b) \geq 1$, $1 \leq i \leq n$ (the separation between the two classes is perfect).
$\Rightarrow$ ensures a good generalization capability.
Soft margin classification

[Figure: separating hyperplane with normal vector $w$, margin $1/\|w\|_2$, and the support vectors highlighted.]

$w$ is chosen such that:
- $\min_{w, \xi} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$ (the margin is the largest),
- under the constraints $y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes is almost perfect).
$\Rightarrow$ allowing a few errors improves the richness of the class.
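To see the role of the cost C in practice, here is a minimal R sketch (illustrative, not from the original slides; the simulated data and the two cost values are assumptions) comparing the number of support vectors of a linear SVM from e1071 for a small and a large C.

## Illustrative sketch (not from the slides): effect of the cost C on a linear SVM
library(e1071)
set.seed(2)
n <- 100
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, 1, -1))
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

fit_small_C <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.01)
fit_large_C <- svm(y ~ ., data = dat, kernel = "linear", cost = 100)

# a small C tolerates many margin violations (wider margin, more support vectors);
# a large C penalizes them heavily (narrower margin, usually fewer support vectors)
c(small_C = nrow(fit_small_C$SV), large_C = nrow(fit_large_C$SV))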
Non linear SVM

[Figure: the original space $\mathcal{X}$ is mapped into a feature space $\mathcal{H}$ by a (non linear) map $\phi$.]

$w \in \mathcal{H}$ is chosen such that $(P_{C,\mathcal{H}})$:
- $\min_{w, \xi} \|w\|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \xi_i$ (the margin in the feature space is the largest),
- under the constraints $y_i(\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes in the feature space is almost perfect).
SVM from different points of view

A regularization problem: $(P_{C,\mathcal{H}})$ is equivalent to
$$(P_{\lambda,\mathcal{H}}): \quad \min_{w \in \mathcal{H}} \underbrace{\frac{1}{n} \sum_{i=1}^{n} R(f_w(x_i), y_i)}_{\text{error term}} + \underbrace{\lambda \|w\|_{\mathcal{H}}^2}_{\text{penalization term}},$$
where $f_w(x) = \langle \phi(x), w \rangle_{\mathcal{H}}$ and $R(\hat{y}, y) = \max(0, 1 - \hat{y}y)$ (the hinge loss function).

[Figure: errors versus $\hat{y}$ for $y = 1$; blue: hinge loss, green: misclassification error. A sketch reproducing this comparison is given after this slide.]
A dual problem: $(P_{C,\mathcal{H}})$ is equivalent to
$$(D_{C,\mathcal{X}}): \quad \max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$
with $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$, $1 \leq i \leq n$.

There is no need to know $\phi$ and $\mathcal{H}$:
- choose a function K with a few good properties;
- use it as the dot product in $\mathcal{H}$: $\forall\, u, v \in \mathcal{X}$, $K(u, v) = \langle \phi(u), \phi(v) \rangle_{\mathcal{H}}$.
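The blue/green loss comparison mentioned above can be reproduced with a few lines of R; this is an illustrative sketch (the grid of predicted values is an arbitrary choice), not a figure from the original slides.

## Illustrative sketch (not from the slides): hinge loss vs misclassification error for y = 1
yhat <- seq(-2, 2, length.out = 400)          # predicted values (decision function output)
y <- 1                                        # true label

hinge  <- pmax(0, 1 - yhat * y)               # hinge loss: max(0, 1 - yhat * y)
misclf <- as.numeric(sign(yhat) != y)         # misclassification (0/1) error

plot(yhat, hinge, type = "l", col = "blue", ylim = c(0, 3),
     xlab = expression(hat(y)), ylab = "error")
lines(yhat, misclf, col = "green")
legend("topright", lty = 1, col = c("blue", "green"),
       legend = c("hinge loss", "misclassification error"))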
Which kernels?

Minimum properties that a kernel should fulfill:
- symmetry: $K(u, u') = K(u', u)$;
- positivity: $\forall\, N \in \mathbb{N}$, $\forall\, (\alpha_i) \in \mathbb{R}^N$, $\forall\, (x_i) \in \mathcal{X}^N$, $\sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \geq 0$.

[Aronszajn, 1950]: there exist a Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ and a function $\phi: \mathcal{X} \to \mathcal{H}$ such that
$$\forall\, u, v \in \mathcal{X}, \quad K(u, v) = \langle \phi(u), \phi(v) \rangle_{\mathcal{H}}.$$

Examples
- the Gaussian kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = e^{-\gamma \|x - x'\|^2}$ (it is universal on every bounded subset of $\mathbb{R}^d$);
- the linear kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = x^T x'$ (it is not universal).
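As a quick numerical check of these two properties, the R sketch below (illustrative, not from the original slides; the random points and the value of gamma are arbitrary) builds a Gaussian kernel matrix and verifies symmetry and positive (semi-)definiteness.

## Illustrative sketch (not from the slides): checking symmetry and positivity of a Gaussian kernel
set.seed(3)
X <- matrix(rnorm(20 * 2), ncol = 2)          # 20 points in R^2
gamma <- 0.5                                  # arbitrary kernel parameter

# Gaussian kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
D2 <- as.matrix(dist(X))^2
K <- exp(-gamma * D2)

max(abs(K - t(K)))                            # symmetry: should be 0 (up to rounding)
min(eigen(K, symmetric = TRUE)$values)        # positivity: all eigenvalues >= 0 (up to rounding)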
In summary, what does the solution look like?

$$\Phi_n(x) = \sum_{i} \alpha_i y_i K(x_i, x),$$
where only a few $\alpha_i \neq 0$. The $x_i$ such that $\alpha_i \neq 0$ are the support vectors!
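To connect this expansion with the e1071 objects used below, here is an illustrative R sketch (an assumption about the workflow, not part of the original slides) that rebuilds the decision values from the fitted coefficients $\alpha_i y_i$ (coefs), the support vectors (SV) and the offset (rho) of an svm fit, using a linear kernel and scale = FALSE to keep the algebra simple.

## Illustrative sketch (not from the slides): recovering the SVM decision function
## f(x) = sum_i alpha_i y_i K(x_i, x) - rho from an e1071 fit (linear kernel, no scaling)
library(e1071)
data(iris)
iris2 <- droplevels(iris[iris$Species %in% c("versicolor", "virginica"), ])

fit <- svm(Species ~ ., data = iris2, kernel = "linear", scale = FALSE)

X  <- as.matrix(iris2[, 1:4])
SV <- fit$SV                                  # support vectors x_i
a  <- as.vector(fit$coefs)                    # alpha_i * y_i for the support vectors

manual  <- X %*% t(SV) %*% a - fit$rho        # sum_i (alpha_i y_i) <x_i, x> - rho
builtin <- attr(predict(fit, iris2, decision.values = TRUE), "decision.values")

max(abs(manual - as.vector(builtin)))         # should be ~0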
I'm almost dead with all this stuff on my mind!!! What in practice?

Restrict iris to two classes and plot them:

data(iris)
iris <- iris[iris$Species %in% c("versicolor", "virginica"), ]
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19)
legend("topleft", pch = 19, col = c(2, 3),
       legend = c("versicolor", "virginica"))
Tuning the cost of a linear SVM by 10-fold cross-validation:

library(e1071)
res.tune <- tune.svm(Species ~ ., data = iris, kernel = "linear",
                     cost = 2^(-1:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#     cost
#      0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4),
#          kernel = "linear")
# Parameters:
#    SVM-Type:  C-classification
#  SVM-Kernel:  linear
#        cost:  0.5
#       gamma:  0.25
# Number of Support Vectors:  21
Training confusion matrix and plot of the fitted model:

table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         45         0
#   virginica       0          5        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
Tuning the cost and gamma of a radial (Gaussian) SVM:

res.tune <- tune.svm(Species ~ ., data = iris, gamma = 2^(-1:1),
                     cost = 2^(2:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#   gamma cost
#     0.5    4
# - best performance: 0.08
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1),
#          cost = 2^(2:4))
# Parameters:
#    SVM-Type:  C-classification
#  SVM-Kernel:  radial
#        cost:  4
#       gamma:  0.5
# Number of Support Vectors:  32
Training confusion matrix and plot of the fitted radial model:

table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         49         0
#   virginica       0          1        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
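The confusion matrices above are computed on the training data. As a complement (an illustrative sketch, not from the original slides; the split and the random seed are assumptions), the code below estimates the test error of the tuned radial SVM by the simple validation scheme introduced earlier.

## Illustrative sketch (not from the slides): test error of the tuned SVM by simple validation
set.seed(4)
iris2 <- droplevels(iris)                     # drop the unused 'setosa' level
train <- sample(seq_len(nrow(iris2)), size = round(0.8 * nrow(iris2)))

fit <- svm(Species ~ ., data = iris2[train, ], kernel = "radial",
           cost = 4, gamma = 0.5)
pred <- predict(fit, newdata = iris2[-train, ])

mean(pred != iris2$Species[-train])           # test misclassification rate
table(pred, iris2$Species[-train])            # test confusion matrix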
References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.

Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York, USA.

More can be found on my website: http://nathalievilla.org/learning.html