1. FDA and Statistical learning theory
Nathalie Villa-Vialaneix - nathalie.villa@math.univ-toulouse.fr
http://www.nathalievilla.org
Institut de Mathématiques de Toulouse - IUT de Carcassonne, Université de Perpignan
France
La Havane, September 17th, 2008
Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 1 / 39
2. Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
3. Purpose of statistical learning theory
In the previous presentations, the aim was to find an estimator that is
“close” to the model.
The aim of statistical learning theory is slightly different: find a regression
function that has a small error.
More precisely, in the binary classification case:
we are given a pair of random variables (X, Y) from X × {−1, 1},
where X is any topological space;
we observe n i.i.d. realizations of (X, Y), (x1, y1), . . . , (xn, yn), called
the learning set;
we intend to find a function Ψn : X → {−1, 1}, built from
(x1, y1), . . . , (xn, yn), that minimizes P (Ψn(X) ≠ Y).
8. First remarks on the aim
1 infΨ:X→{−1,1} P (Ψ(X) ≠ Y) is the “target” for the expectation of the
error. This lower bound for the expected error is called the Bayes risk,
denoted by L∗.
2 Generally, Ψn is chosen in a restricted class C of functions from X to
{−1, 1}; then the performance of Ψn can be quantified by:
P (Ψn(X) ≠ Y) − L∗ = [P (Ψn(X) ≠ Y) − infΨ∈C P (Ψ(X) ≠ Y)]
(error due to the training method)
+ [infΨ∈C P (Ψ(X) ≠ Y) − L∗]
(error due to the choice of C)
12. Consistency
From this last remark, we can define:
Definition: Weak consistency
An algorithm building the classifier Ψn is said to be (weakly universally)
consistent if, for all distributions of the random pair (X, Y), we have
E (LΨn) → L∗ as n → +∞,
where LΨn := P (Ψn(X) ≠ Y | (xi, yi)i).
Definition: Strong consistency
Moreover, it is said to be strongly (universally) consistent if, for all
distributions of the random pair (X, Y), we have
LΨn → L∗ a.s. as n → +∞.
14. Choice of C and of Ψn
1 The choice of C is of major importance to obtain good performance of Ψn:
a too small (not rich) class C has a poor value of
infΨ∈C P (Ψ(X) ≠ Y) − L∗,
but a too rich class C has a poor value of
P (Ψn(X) ≠ Y) − infΨ∈C P (Ψ(X) ≠ Y)
because the learning algorithm tends to overfit the data.
2 A naive approach to find a good Ψn over the class C is to minimize the
empirical risk on C:
Ψn := arg minΨ∈C LnΨ, where LnΨ := (1/n) Σni=1 I{Ψ(xi) ≠ yi}.
The work of [Vapnik, 1995, Vapnik, 1998] links the choice of C to the
accuracy of the empirical risk.
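As an illustration of the empirical risk minimization rule above, the sketch below enumerates a deliberately small class C (1-D threshold classifiers, an illustrative choice of this note, not a class from the slides) and returns the minimizer of the empirical risk LnΨ.

```python
# Empirical risk minimization (ERM) over a small toy class C:
# 1-D threshold classifiers x -> s * sign(x - t), s in {-1, +1}.
# Illustrative sketch only; the class and names are assumptions of this note.

def erm_threshold(xs, ys):
    """Return (risk, s, t) minimizing L_n(Psi) = (1/n) sum I{Psi(x_i) != y_i}."""
    n = len(xs)
    pts = sorted(xs)
    # Midpoints between sorted points (plus one threshold on each side)
    # realize every achievable decision of the class on the sample.
    cands = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    best = None
    for s in (-1, 1):
        for t in cands:
            risk = sum(1 for x, y in zip(xs, ys)
                       if (s if x > t else -s) != y) / n
            if best is None or risk < best[0]:
                best = (risk, s, t)
    return best

xs = [0.1, 0.4, 0.5, 0.9, 1.3, 2.0]
ys = [-1, -1, -1, 1, 1, 1]
risk, s, t = erm_threshold(xs, ys)
print(risk, s, t)  # this sample is separable by a threshold, so risk is 0
```

Enumeration is only feasible because the class is tiny; for richer classes, the link between the size of C and the reliability of LnΨ is exactly what the VC theory below quantifies.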
17. VC-dimension
A way to quantify the “richness” of a class of functions is to calculate its
VC-dimension:
Definition: VC-dimension
A class of classifiers (functions from X to {−1, 1}), C, is said to shatter a
set of data points z1, z2, . . . , zd ∈ X if, for all assignments of labels to
those points, m1, m2, . . . , md ∈ {−1, 1}, there exists Ψ ∈ C such that:
∀ i = 1, . . . , d, Ψ(zi) = mi.
The VC-dimension of the class of functions C is the maximum number of
points that can be shattered by C.
19. Example: VC-dimension of hyperplanes
Suppose that X = R2 and
C = {Ψ : x ∈ R2 → ±Sign(aT x + b), a ∈ R2 and b ∈ R}. Then,
2 points are shattered by C;
3 points are shattered by C;
4 points cannot be shattered by C: no Ψ ∈ C can have value 1 on the red
circles and −1 on the black ones.
Hence, the VC-dimension of C is 3.
More generally, the VC-dimension of hyperplanes in Rd is d + 1.
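The shattering claims of this example can be checked numerically. The sketch below uses a capped perceptron as a heuristic linear-separability test (the cap, the point configurations and all names are assumptions of this note): 3 points in general position are shattered, while a 4-point XOR layout is not.

```python
# Brute-force check of shattering for linear classifiers
# Psi(x) = sign(a.x + b) in R^2, using a capped perceptron as a
# (heuristic) linear-separability test.
from itertools import product

def separable(points, labels, max_updates=5000):
    a, b = [0.0, 0.0], 0.0
    updates = 0
    while updates < max_updates:
        clean = True
        for (x1, x2), y in zip(points, labels):
            if y * (a[0] * x1 + a[1] * x2 + b) <= 0:  # point misclassified
                a[0] += y * x1; a[1] += y * x2; b += y
                updates += 1
                clean = False
        if clean:
            return True
    return False  # did not converge within the cap: treated as non-separable

def shattered(points):
    # Try every assignment of labels; all must be realizable.
    return all(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]  # XOR layout
print(shattered(three))  # 3 points in general position are shattered
print(shattered(four))   # the XOR labeling is not linearly separable
```

The perceptron is guaranteed to converge on separable data, so the cap only matters for the non-separable labelings.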
32. Relationship between VC-dimension and empirical error
Theorem [Vapnik, 1995, Vapnik, 1998]
With probability at least 1 − η,
supΨ∈C [E (LΨ) − LnΨ] ≤ √( (VC(C) − log(η/4)) / n ).
33. An alternative to VC-dimension
Remark: In most cases, the VC-dimension is not precise enough. Then,
another quantity can also be considered:
Definition: Shatter coefficient
The n-th shatter coefficient of the set of functions C is the maximum
number of partitions of n points into two sets that can be obtained from C.
This number, denoted by S(C, n), is at most equal to 2^n.
Example: If C is the space of hyperplanes in Rd,
S(C, n) = 2^n if n ≤ d + 1 (= VC(C)) and S(C, n) < 2^n if n > d + 1.
Remark: For all n > 2, S(C, n) ≤ n^VC(C).
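For very small classes the shatter coefficient can be computed by exhaustive enumeration. The sketch below does so for half-line classifiers on R (an illustrative class with VC-dimension 2; the class and names are choices of this note) and checks the remark S(C, n) ≤ n^VC(C).

```python
# Compute S(C, n) by enumeration for the class
# C = {x -> s * sign(x - t), s in {-1, +1}} of half-line classifiers on R.
# This class has VC-dimension 2 (it shatters 2 points, not 3).

def shatter_coefficient(points):
    pts = sorted(points)
    # Thresholds strictly between consecutive points, plus one on each
    # side, realize every achievable labeling of the sample.
    thresholds = ([pts[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
                  + [pts[-1] + 1.0])
    patterns = set()
    for s in (-1, 1):
        for t in thresholds:
            patterns.add(tuple(s if x > t else -s for x in pts))
    return len(patterns)

for n in range(2, 8):
    S = shatter_coefficient(list(range(n)))
    print(n, S, S <= n ** 2)  # S(C, n) = 2n, well below 2^n for n > 2
```

Here S(C, n) = 2n grows linearly while 2^n grows exponentially, which is exactly why the shatter coefficient gives sharper bounds than the worst case 2^n.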
36. Vapnik-Chervonenkis inequality
Theorem [Vapnik, 1995, Vapnik, 1998]
P ( supΨ∈C |LnΨ − E (LΨ)| > ε ) ≤ S(C, n) e^(−nε²/32).
Consequences for the learning error on C: if Ψn has been chosen by
minimizing the empirical risk, i.e.,
Ψn := arg minΨ∈C (1/n) Σni=1 I{Ψ(xi) ≠ yi},
then, as
P ( E (LΨn) − infΨ∈C E (LΨ) > ε ) ≤ P ( 2 supΨ∈C |LnΨ − E (LΨ)| > ε ),
we obtain
P ( E (LΨn) − infΨ∈C E (LΨ) > ε ) ≤ S(C, n) e^(−nε²/128).
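The right-hand side of the last bound can be evaluated numerically. The sketch below combines it with the earlier remark S(C, n) ≤ n^VC(C) to see how large n must be before the bound becomes informative (the values ε = 0.1 and VC(C) = 3 are arbitrary choices of this note).

```python
# Evaluate the tail bound P(E(L_Psi_n) - inf_C E(L_Psi) > eps)
# <= S(C, n) * exp(-n * eps**2 / 128), with S(C, n) bounded by n^VC(C).
import math

def vc_bound(n, vc_dim, eps):
    return n ** vc_dim * math.exp(-n * eps ** 2 / 128)

# Smallest power of two making the bound drop below 5% for eps = 0.1:
n = 1
while vc_bound(n, 3, 0.1) >= 0.05:
    n *= 2
print(n, vc_bound(n, 3, 0.1))
```

The polynomial factor n^VC(C) is eventually crushed by the exponential term, but only for quite large n: distribution-free guarantees of this kind are loose, which motivates sharper, data-dependent analyses.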
40. Additional notes for the regression case
The same theory can be developed for the regression case under additional
assumptions. To summarize, let (X, Y) be a random pair taking its values
in X × R and (x1, y1), . . . , (xn, yn) a training set of n i.i.d. realizations of
(X, Y). Then, we can introduce:
the risk as, for example, the mean squared error: for Ψ : X → R,
LΨ = E[ (Ψ(X) − Y)² | (xi, yi)i ];
the Bayes risk: L∗ = infΨ:X→R E[ (Ψ(X) − Y)² ]. In this case,
L∗ = E (LΨ∗) where Ψ∗ = E (Y | X);
the empirical risk: for Ψ : X → R, LnΨ = (1/n) Σni=1 (yi − Ψ(xi))².
Hence, in this case, a consistent regression scheme Ψn satisfies
limn→+∞ E (LΨn) = L∗, and a strongly consistent regression scheme Ψn
satisfies limn→+∞ LΨn = L∗ a.s.
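The relation L∗ = E(LΨ∗) with Ψ∗ = E(Y | X) can be illustrated by Monte Carlo on a synthetic model (the model Y = sin(2πX) + noise is an arbitrary choice of this note): the empirical risk of the Bayes regressor approaches L∗, which here equals the noise variance.

```python
# Monte Carlo illustration on a toy model Y = f(X) + noise: the Bayes
# regressor is Psi*(x) = E(Y | X = x) = f(x), and L* = Var(noise).
import math
import random

random.seed(0)

def sample(n, sigma=0.5):
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + random.gauss(0, sigma) for x in xs]
    return xs, ys

def empirical_risk(psi, xs, ys):
    # L_n(Psi) = (1/n) sum (y_i - Psi(x_i))^2
    return sum((y - psi(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

bayes = lambda x: math.sin(2 * math.pi * x)  # Psi* = E(Y | X)
xs, ys = sample(100000)
print(empirical_risk(bayes, xs, ys))  # close to L* = sigma^2 = 0.25
```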
44. Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
45. Reminders on the functional multilayer perceptron by the projection
approach
Data: Suppose that we are given a random pair (X, Y) taking its values in
X × R where (X, ⟨., .⟩X) is a Hilbert space. Suppose also that we have n
i.i.d. observations of (X, Y), (x1, y1), . . . , (xn, yn).
Functional MLP: The projection approach is based on the knowledge of a
Hilbert basis of X, denoted by (φk)k≥1. The data (xi)i and also the weights
of the MLP are projected on this basis truncated at q:
Cnq = { Ψ : X → R : ∀ x ∈ X,
Ψ(x) = Σl=1..pn w(2)l G( w(0)l + Σk=1..q β(1)lk (Pq(x))k ),
with Σl=1..pn |w(2)l| ≤ αn },
where (pn)n is a sequence of integers, (αn)n is a sequence of positive real
numbers, G is a given continuous function and the weights (w(2)l)l, (w(0)l)l
and (β(1)lk)l,k have to be learned from the data set in R (see Presentation 2
for further details).
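The projection step Pq can be sketched numerically: each sampled curve is replaced by its first q coefficients on a Hilbert basis of L²([0, 1]). The Fourier basis and the Riemann-sum quadrature below are illustrative choices of this note, not prescriptions of the slides.

```python
# Sketch of the projection step: replace each observed curve x_i by its
# first q coefficients <x_i, phi_k> on a Hilbert basis of L^2([0, 1]),
# here the Fourier basis, approximated from a regular sampling grid.
import math

def fourier_basis(k, t):
    if k == 0:
        return 1.0                                                   # phi_0
    if k % 2 == 1:
        return math.sqrt(2) * math.sin(2 * math.pi * ((k + 1) // 2) * t)
    return math.sqrt(2) * math.cos(2 * math.pi * (k // 2) * t)

def project(curve, grid, q):
    """Approximate (<x, phi_0>, ..., <x, phi_{q-1}>) by a Riemann sum."""
    h = grid[1] - grid[0]
    return [sum(x * fourier_basis(k, t) for x, t in zip(curve, grid)) * h
            for k in range(q)]

grid = [i / 1000 for i in range(1000)]
curve = [math.sqrt(2) * math.sin(2 * math.pi * t) for t in grid]  # = phi_1
coeffs = project(curve, grid, 4)
print([round(c, 3) for c in coeffs])  # approximately [0, 1, 0, 0]
```

The MLP of Cnq then acts only on these q coefficients, which is what makes the functional problem finite dimensional.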
47. Assumptions for consistency of the functional MLP
Note
Ψpn = arg minΨ∈Cnq LnΨ
and suppose that:
(A1) G : R → [0, 1] is monotone, non-decreasing, with
limt→+∞ G(t) = 1 and limt→−∞ G(t) = 0;
(A2) limn→+∞ pnαn log(pnαn)/n = 0 and ∃ δ > 0: limn→+∞ α²n / n^(1−δ) = 0;
(A3) Y is square integrable.
51. Strong consistency of the projection-based functional MLP
Theorem [Rossi and Conan-Guez, 2006]
Under assumptions (A1)-(A3),
limp→+∞ limn→+∞ LΨpn = L∗ a.s.
Sketch of the proof: The proof is divided into two parts:
1 The first one shows that
L∗p := infΨ:Rp→R E[ (Ψ(Pp(X)) − Y)² ] → L∗ as p → +∞.
2 The second one shows that, for any fixed p,
limn→+∞ LΨpn = L∗p a.s.
Remark: The limitation of this result lies in the fact that it is a double
limit: no indication is given on the way n and p should be linked.
Remark 2: The principle of the proof is very general and can be applied to
any other consistent method in Rp.
56. Presentation of k-nearest neighbors for functional classification
This method was introduced in [Biau et al., 2005] for the binary
classification case; a regression version exists in the work of [Laloë, 2008].
Context: We are given a random pair (X, Y) taking its values in
X × {−1, 1} where (X, ⟨., .⟩X) is a Hilbert space. Moreover, we are given n
i.i.d. observations of (X, Y), denoted (x1, y1), . . . , (xn, yn).
Functional k-nearest neighbors also consists in using the projection of
the data on a Hilbert basis (φj)j≥1: denote xdi = (xi1, . . . , xid) where
∀ i = 1, . . . , n and ∀ j = 1, . . . , d, xij = ⟨xi, φj⟩X.
k-nearest neighbors for d-dimensional data is then performed on the
dataset (xd1, y1), . . . , (xdn, yn): if, for all u ∈ Rd,
Vk(u) := {i ∈ [[1, n]] : ‖xdi − u‖Rd belongs to the k smallest of these values},
then
Ψn : x ∈ X → −1 if Σi∈Vk(xd) I{yi=−1} > Σi∈Vk(xd) I{yi=1}, and +1 otherwise.
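Once the coefficients xdi are computed, the classifier above reduces to plain k-NN with majority vote in Rd. A minimal sketch (the toy coefficients and all names are assumptions of this note):

```python
# Plain k-NN with majority vote in R^d, applied to (assumed precomputed)
# projection coefficients x_i^d of the curves.
from collections import Counter

def knn_classify(train_coeffs, train_labels, u, k):
    """Majority vote among the k training points closest to u in R^d."""
    order = sorted(range(len(train_coeffs)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train_coeffs[i], u)))
    votes = Counter(train_labels[i] for i in order[:k])
    return 1 if votes[1] >= votes[-1] else -1  # ties broken toward +1

coeffs = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.1]]  # toy x_i^d
labels = [-1, -1, 1, 1]
print(knn_classify(coeffs, labels, [0.1, 0.95], k=3))  # -1
```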
60. Selection of the projection dimension d and of the parameter k
d and k are then automatically selected from the dataset by a validation
strategy:
1 For all k ∈ N∗ and all d ∈ N∗, compute the k-nearest neighbors
classifier, Ψd,l,kn, from the data {(xdi, yi)}i=1,...,l.
2 Choose
(dn, kn) = arg mink∈N∗, d∈N∗ (1/(n − l)) Σni=l+1 I{Ψd,l,kn(xi) ≠ yi} + λd/√(n − l),
where λd is a penalization term to avoid the selection of (possibly
overfitting) very large dimensions.
Then, define Ψn = Ψdn,l,knn.
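The validation strategy above can be sketched end to end on toy projected data. Here λd = d and the synthetic coefficients are illustrative choices of this note: the first coefficient alone separates the classes, so the penalty should favor d = 1.

```python
# Validation-based selection of (d, k): train k-NN on the first l pairs,
# evaluate on the remaining n - l, and penalize large dimensions d.
import math
from collections import Counter

def knn_predict(train, labels, u, k, d):
    # k-NN majority vote using only the first d coefficients.
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train[i][:d], u[:d])))
    votes = Counter(labels[i] for i in order[:k])
    return 1 if votes[1] >= votes[-1] else -1

def select_d_k(coeffs, labels, l, d_max, lam=lambda d: float(d)):
    n = len(coeffs)
    best = None
    for d in range(1, d_max + 1):
        for k in range(1, l + 1):
            err = sum(knn_predict(coeffs[:l], labels[:l], coeffs[i], k, d)
                      != labels[i] for i in range(l, n)) / (n - l)
            score = err + lam(d) / math.sqrt(n - l)  # validation error + penalty
            if best is None or score < best[0]:
                best = (score, d, k)
    return best[1], best[2]

# Toy projected curves: coefficient 1 separates the classes, coefficient 2
# is noise, so the penalized criterion should pick d = 1.
coeffs = [[-1.0, 0.3], [1.0, 0.2], [-0.8, -0.2], [0.8, -0.3],
          [-1.2, 0.4], [1.1, 0.0], [-0.9, 0.1], [0.9, 0.4]]
labels = [-1, 1, -1, 1, -1, 1, -1, 1]
print(select_d_k(coeffs, labels, l=4, d_max=2))  # (1, 1)
```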
64. An oracle inequality
Oracle inequality [Biau et al., 2005]
Let ∆ = Σd≥1 e^(−2λ²d) < +∞. Then there exists C > 0, only depending on
∆, such that ∀ l > 1/∆,
E (LΨn) − L∗ ≤ infd≥1 { (L∗d − L∗) + inf1≤k≤l [ E (LΨl,k,dn) − L∗d ] + λd/√(n − l) }
+ C √( log l / (n − l) ).
Then, we have:
by a martingale property: limd→+∞ L∗d = L∗;
by consistency of k-nearest neighbors in Rd: for all d ≥ 1,
inf1≤k≤l E (LΨl,k,dn) − L∗d → 0 as l → +∞;
the rest of the right-hand side of the inequality can be made to converge
to 0 as n grows to infinity, for suitable choices of n, l and λd.
66. Consistency of functional k-nearest neighbors
Theorem [Biau et al., 2005]
Suppose that
limn→+∞ l = +∞, limn→+∞ (n − l) = +∞ and limn→+∞ log l / (n − l) = 0;
then
limn→+∞ E (LΨn) = L∗.
67. Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
68. A binary classification problem
Suppose that we are given a random pair of variables (X, Y) where X takes
its values in Rd and Y takes its values in {−1, 1}.
Moreover, we know n i.i.d. realizations of the random pair (X, Y), denoted
(x1, y1), . . . , (xn, yn).
We try to learn a classification machine, Ψn, of the form
x → Sign(⟨x, w⟩Rd + b) or, more precisely, of the form
x → Sign(⟨φ(x), w⟩X + b),
where the exact nature of φ and X will be discussed later.
71. Linear discrimination with optimal margin
Learn Ψn : x → Sign(⟨x, w⟩Rd + b).
The margin is 1/‖w‖Rd and the points lying on it are the support vectors.
w is such that:
minw,b ‖w‖Rd,
such that: yi(wT xi + b) ≥ 1, 1 ≤ i ≤ n.
75. Linear discrimination with soft margin
Learn Ψn : x → Sign(⟨x, w⟩Rd + b).
The margin is 1/‖w‖Rd and the points lying on it are the support vectors.
w is such that:
minw,b,ξ ‖w‖Rd + C Σni=1 ξi,
where: yi(wT xi + b) ≥ 1 − ξi, 1 ≤ i ≤ n,
ξi ≥ 0, 1 ≤ i ≤ n.
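The soft-margin problem can be solved approximately by rewriting it in a regularized hinge-loss form and running plain subgradient descent. This is an illustrative sketch (the step size, iteration count and λ are arbitrary choices of this note), not the quadratic program usually solved for SVM.

```python
# Approximate soft-margin linear SVM via subgradient descent on
#   f(w, b) = lam * |w|^2 + (1/n) * sum max(0, 1 - y_i (w.x_i + b)).

def train_linear_svm(xs, ys, lam=0.01, steps=3000, lr=0.02):
    d, n = len(xs[0]), len(xs)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw = [2 * lam * wi for wi in w]  # gradient of the penalty term
        gb = 0.0
        for x, y in zip(xs, ys):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:
                # point inside the margin: hinge subgradient is active
                for j in range(d):
                    gw[j] -= y * x[j] / n
                gb -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

xs = [(0.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 2.0)]
ys = [-1, -1, 1, 1]
w, b = train_linear_svm(xs, ys)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in xs]
print(preds)  # the separable toy set is classified correctly
```

The regularization view also explains the next slides: the same objective reappears there as (Rλ,X) once x is replaced by φ(x).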
79. Mapping the data onto a high-dimensional space
Learn Ψn : x → Sign(⟨φ(x), w⟩X + b), where φ is a nonlinear map from the
original space Rd to the feature space X.
w is such that:
(PC,X) minw,b,ξ ‖w‖X + C Σni=1 ξi,
where: yi(⟨w, φ(xi)⟩X + b) ≥ 1 − ξi, 1 ≤ i ≤ n,
ξi ≥ 0, 1 ≤ i ≤ n.
83. Details about the feature space: a regularization framework
Regularization framework: (PC,X) ⇔
(Rλ,X) minF∈X (1/n) Σni=1 max(0, 1 − yiF(xi)) + λ‖F‖X.
Dual problem: (PC,X) ⇔
(DC,X) maxα Σni=1 αi − Σni=1 Σnj=1 αiαjyiyj⟨φ(xi), φ(xj)⟩X
where Σni=1 αiyi = 0,
0 ≤ αi ≤ C, 1 ≤ i ≤ n.
Inner product in X:
∀ u, v ∈ X, K(u, v) = ⟨φ(u), φ(v)⟩X.
86. Examples of useful kernels

Provided that
∀ m ∈ ℕ*, ∀ (u_i)_{i=1,...,m} ∈ ℝ^d, ∀ (α_i)_{i=1,...,m} ∈ ℝ,
∑_{i,j=1}^m α_i α_j K(u_i, u_j) ≥ 0,
K can be used as a kernel mapping the original data onto a high-dimensional feature space [Aronszajn, 1950].

The Gaussian kernel: K(u, v) = e^{−σ² ‖u−v‖²_{ℝ^d}} for σ > 0;
The exponential kernel: K(u, v) = e^{⟨u,v⟩_{ℝ^d}};
Vovk's real infinite polynomial: K(u, v) = (1 − ⟨u, v⟩_{ℝ^d})^{−α} for α > 0;
. . .
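Aronszajn's positivity condition can be checked numerically for the Gaussian kernel. A minimal sketch (the sample points, σ, and the coefficient vector are arbitrary choices for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 3))        # m = 6 arbitrary points in R^3
sigma = 0.5

# Gaussian kernel matrix: K_ij = exp(-sigma^2 * ||u_i - u_j||^2)
D2 = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)
K = np.exp(-sigma**2 * D2)

# Aronszajn's condition: sum_ij alpha_i alpha_j K(u_i, u_j) >= 0 for
# every alpha, i.e. K is positive semi-definite.
alpha = rng.normal(size=6)
quad_form = alpha @ K @ alpha
eigmin = np.linalg.eigvalsh(K).min()
```

Both `quad_form` and the smallest eigenvalue are non-negative (up to rounding), as the condition requires.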
90. Assumptions for consistency of SVM in ℝ^d

Suppose that:
(A1) X takes its values in a compact subset W of ℝ^d;
(A2) the kernel K is universal on W (i.e., the set of all functions {u ∈ W ↦ ⟨w, φ(u)⟩_X, w ∈ X} is dense in C⁰(W));
(A3) ∀ ε > 0, the ε-covering number of φ(W), that is, the minimum number of balls of radius ε needed to cover φ(W), satisfies N(K, ε) = O(ε^{−α}) for some α > 0;
(A4) the regularization parameter C depends on n with lim_{n→+∞} nC_n = +∞ and C_n = O(n^{β−1}) for 0 < β < 1/α.

Remark: The Gaussian kernel satisfies all these assumptions, with N(K, ε) = O(ε^{−d}).
95. Consistency of SVM in ℝ^d

Theorem [Steinwart, 2002]
Under assumptions (A1)-(A4), SVM are consistent.
96. Why can't SVM be directly applied to functional data?

Suppose now that X takes its values in a Hilbert space (X, ⟨·,·⟩_X).

1 We have already discussed the advantages of regularization or projection of the functional data as a pre-processing step;
2 The consistency result cannot be applied directly to infinite-dimensional data, because the covering-number condition does not hold for the infinite-dimensional Gaussian kernel.
99. A consistent approach based on the ideas of [Biau et al., 2005]

1 (ψ_j)_j is a Hilbert basis of X: projection on (ψ_j)_{j=1,...,d};
2 Choice of the parameters a ≡ (d, K, C), with d ∈ ℕ, K ∈ J_d, C ∈ [0; C_d]:
Splitting the data: B1 = (x_1, y_1), . . . , (x_l, y_l) and B2 = (x_{l+1}, y_{l+1}), . . . , (x_n, y_n);
Learn an SVM on B1: Ψ_n^{l,a};
Validation on B2:

a* = arg min_a { L_{n−l} Ψ_n^{l,a} + λ_d / √(n − l) }

with L_{n−l} Ψ_n^{l,a} = (1/(n−l)) ∑_{i=l+1}^n 𝕀_{Ψ_n^{l,a}(x_i) ≠ y_i}.

⇒ The obtained classifier is denoted Ψ_n.
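The project / split / penalized-validation scheme above can be sketched as follows. This is a toy illustration, not the slides' exact method: the curves are synthetic, λ_d = d is an arbitrary penalty choice, and a nearest-class-mean classifier stands in for the SVM so the code stays self-contained:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 128)

# Synthetic functional data: noisy sine curves (y = 1) vs cosine curves (y = -1)
y = np.tile([1, -1], 30)                       # n = 60 alternating labels
clean = np.where(y[:, None] == 1, np.sin(2 * np.pi * t), np.cos(2 * np.pi * t))
X_curves = clean + 0.3 * rng.normal(size=clean.shape)

def project(curves, d):
    """Coefficients on the first d trigonometric basis functions."""
    basis = np.stack([np.sin(2 * np.pi * (j // 2 + 1) * t) if j % 2 == 0
                      else np.cos(2 * np.pi * (j // 2 + 1) * t)
                      for j in range(d)])
    return curves @ basis.T / t.size           # crude quadrature on [0, 1]

def fit_predict(tr_x, tr_y, te_x):
    """Nearest-class-mean classifier (stand-in for the SVM of the slides)."""
    mu_p = tr_x[tr_y == 1].mean(0)
    mu_m = tr_x[tr_y == -1].mean(0)
    return np.where(((te_x - mu_p) ** 2).sum(1)
                    < ((te_x - mu_m) ** 2).sum(1), 1, -1)

l = 40                                         # B1 = first l pairs, B2 = rest
scores = {}
for d in (2, 4, 8):
    coefs = project(X_curves, d)
    err = (fit_predict(coefs[:l], y[:l], coefs[l:]) != y[l:]).mean()
    scores[d] = err + d / np.sqrt(len(y) - l)  # penalized validation error
best_d = min(scores, key=scores.get)           # a* (here, the dimension d)
```

The penalty λ_d/√(n−l) discourages large d when the validation errors are comparable, which is exactly what drives the selection of a* above.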
105. Assumptions

Assumptions on X
(A1) X takes its values in a bounded subset of X.

Assumptions on the parameters: ∀ d ≥ 1,
(A2) J_d is a finite set;
(A3) ∃ K_d ∈ J_d such that K_d is universal on any compact of ℝ^d and ∃ ν_d > 0 : N(K_d, ε) = O(ε^{−ν_d});
(A4) C_d > 1;
(A5) ∑_{d≥1} |J_d| e^{−2λ_d²} < +∞.

Assumptions on training/validation sets
(A6) lim_{n→+∞} l = +∞;
(A7) lim_{n→+∞} (n − l) = +∞;
(A8) lim_{n→+∞} l log(n−l) / (n−l) = 0.
108. Consistency

Theorem [Rossi and Villa, 2006]
Under assumptions (A1)-(A8), Ψ_n is consistent:

E(LΨ_n) → L* as n → +∞.

Ideas of the proof: The proof follows a sketch similar to that of [Biau et al., 2005], but the result allows the use of a continuous parameter (the regularization parameter C), relying on the shatter coefficient of a class of functions that includes SVM.
110. Application 1: Voice recognition

Description of the data and methods
3 problems; for each problem, 100 records sampled at 8 192 points;
consistent approach:
Projection on a trigonometric basis;
Splitting the data base into 50 curves (training) / 49 curves (validation);
Performances calculated by leave-one-out.

Results (error rates)
Prob.       k-nn   QDA    SVM gau. (proj)   SVM lin. (proj)   SVM lin. (direct)
yes/no      10%    7%     10%               19%               58%
boat/goat   21%    35%    8%                29%               46%
sh/ao       16%    19%    12%               25%               47%
113. Regression by SVM

Suppose that we are given a random pair of variables (X, Y), where X takes its values in ℝ^d and Y takes its values in ℝ.

Moreover, we know n i.i.d. realizations of the random pair (X, Y), denoted (x_1, y_1), . . . , (x_n, y_n).

Once again, we try to learn a regression machine Ψ_n of the form

x ↦ ⟨φ(x), w⟩_X + b

where the exact nature of φ and X will be discussed later.
116. Generalization of the classification case to regression

w and b minimize

C‖w‖²_X + ∑_{i=1}^n L^k_ε(x_i, y_i, w)

where L^k_ε, for k = 1, 2 and ε ≥ 0, is the ε-sensitive loss function:

L^k_ε(x_i, y_i, w) = max(0, |y_i − ⟨φ(x_i), w⟩_X|^k − ε),

or any other loss function.

Remark: A dual version, which is a quadratic optimization problem in ℝ^n, also exists.
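The ε-sensitive loss can be evaluated directly; the residuals below are made-up numbers chosen to show which ones fall inside the ε-tube (and hence cost nothing):

```python
import numpy as np

def eps_loss(y, pred, eps, k):
    """epsilon-sensitive loss: max(0, |y_i - pred_i|^k - eps)."""
    return np.maximum(0.0, np.abs(y - pred) ** k - eps)

y = np.array([1.0, 2.0, 3.0])
pred = np.array([1.05, 2.5, 1.0])   # hypothetical predictions <phi(x_i), w> + b

# Residuals are 0.05, 0.5, 2.0; with eps = 0.1 the first one is free
loss1 = eps_loss(y, pred, eps=0.1, k=1)
```

Only residuals larger than ε contribute, which is what makes the minimizer sparse in the dual variables.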
118. A kernel ridge regression

When ε is equal to 0 and k = 2, the previous problem becomes: find w and b that minimize

Υ‖w‖²_X + ∑_{i=1}^n (y_i − ⟨φ(x_i), w⟩_X)²,

which can be viewed as a kernel ridge regression. This method is also known under the name of Least Squares SVM (LS-SVM).

A multidimensional consistency result is available in [Christmann and Steinwart, 2007]: the same method as for SVM classifiers can then be used in the regression case!
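For ε = 0 and k = 2, the minimizer has the closed form of kernel ridge regression, α = (K + Υ I)^{−1} y with fitted values Kα. A minimal sketch on synthetic 1-D data (the Gaussian kernel, the values of Υ and σ, and the dropped intercept b are simplifying choices, not the slides' settings):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=20)
y = np.sin(x) + 0.05 * rng.normal(size=20)   # synthetic regression data

sigma, ups = 1.0, 0.1                        # kernel width and Upsilon

# Gaussian Gram matrix: K_ij = exp(-sigma^2 (x_i - x_j)^2)
K = np.exp(-sigma**2 * (x[:, None] - x[None, :]) ** 2)

# Kernel ridge solution: alpha = (K + ups I)^{-1} y, fitted values K alpha
alpha = np.linalg.solve(K + ups * np.eye(20), y)
fitted = K @ alpha

train_mse = ((fitted - y) ** 2).mean()
baseline = ((y - y.mean()) ** 2).mean()      # constant predictor, for comparison
```

The regularizer Υ keeps the linear system well conditioned; as Υ → 0 the fit interpolates the data, as it grows the fit shrinks toward zero.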
120. Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
121. References
Further details on the references are given in the accompanying document.