1. FDA and Statistical learning theory
Nathalie Villa-Vialaneix - nathalie.villa@math.univ-toulouse.fr
http://www.nathalievilla.org
Institut de Mathématiques de Toulouse - IUT de Carcassonne, Université de Perpignan
France
La Havane, September 17th, 2008
Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 1 / 39
2. Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
3. Purpose of statistical learning theory
In the previous presentations, the aim was to find an estimator that is
“close” to the model.
The aim of statistical learning theory is slightly different: find a regression
function that has a small error.
More precisely, in the binary classification case:
we are given a pair of random variables (X, Y) from X × {−1, 1},
where X is any topological space;
we observe n i.i.d. realizations of (X, Y), (x1, y1), . . . , (xn, yn), called
the learning set;
we intend to find a function Ψn : X → {−1, 1}, built from
(x1, y1), . . . , (xn, yn), that minimizes P (Ψn(X) ≠ Y).
8. First remarks on the aim
1 infΨ:X→{−1,1} P (Ψ(X) ≠ Y) is the “target” for the expectation of the
error. This lower bound for the expected error is called the Bayes risk,
denoted by L∗.
2 Generally, Ψn is chosen in a restricted class C of functions from X to
{−1, 1}; then the performance of Ψn can be quantified by:
P (Ψn(X) ≠ Y) − L∗ = [P (Ψn(X) ≠ Y) − infΨ∈C P (Ψ(X) ≠ Y)]
(error due to the training method)
+ [infΨ∈C P (Ψ(X) ≠ Y) − L∗]
(error due to the choice of C)
12. Consistency
From this last remark, we can define:
Definition: Weak consistency
An algorithm building the classifier Ψn is said to be (weakly universally)
consistent if, for all distributions of the random pair (X, Y), we have
E (LΨn) → L∗ as n → +∞,
where LΨn := P (Ψn(X) ≠ Y | (xi, yi)i).
Definition: Strong consistency
Moreover, it is said to be strongly (universally) consistent if, for all
distributions of the random pair (X, Y), we have
LΨn → L∗ a.s. as n → +∞.
14. Choice of C and of Ψn
1 The choice of C is of major importance to obtain good performance of Ψn:
a too small (not rich) class C has a poor value of
infΨ∈C P (Ψ(X) ≠ Y) − L∗,
but a too rich class C has a poor value of
P (Ψn(X) ≠ Y) − infΨ∈C P (Ψ(X) ≠ Y)
because the learning algorithm tends to overfit the data.
2 A naive approach to find a good Ψn over the class C is to minimize the
empirical risk on C:
Ψn := arg minΨ∈C LnΨ, where LnΨ := (1/n) Σni=1 I{Ψ(xi) ≠ yi}.
The work of [Vapnik, 1995, Vapnik, 1998] links the choice of C to the
accuracy of the empirical risk.
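As an illustration of the empirical risk minimization rule above, the sketch below enumerates a deliberately small class C (1-D threshold classifiers, an illustrative choice of this note, not a class from the slides) and returns the minimizer of the empirical risk LnΨ.

```python
# Empirical risk minimization (ERM) over a small toy class C:
# 1-D threshold classifiers x -> s * sign(x - t), s in {-1, +1}.
# Illustrative sketch only; the class and names are assumptions of this note.

def erm_threshold(xs, ys):
    """Return (risk, s, t) minimizing L_n(Psi) = (1/n) sum I{Psi(x_i) != y_i}."""
    n = len(xs)
    pts = sorted(xs)
    # Midpoints between sorted points (plus one threshold on each side)
    # realize every achievable decision of the class on the sample.
    cands = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    best = None
    for s in (-1, 1):
        for t in cands:
            risk = sum(1 for x, y in zip(xs, ys)
                       if (s if x > t else -s) != y) / n
            if best is None or risk < best[0]:
                best = (risk, s, t)
    return best

xs = [0.1, 0.4, 0.5, 0.9, 1.3, 2.0]
ys = [-1, -1, -1, 1, 1, 1]
risk, s, t = erm_threshold(xs, ys)
print(risk, s, t)  # this sample is separable by a threshold, so risk is 0
```

Enumeration is only feasible because the class is tiny; for richer classes, the link between the size of C and the reliability of LnΨ is exactly what the VC theory below quantifies.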
17. VC-dimension
A way to quantify the “richness” of a class of functions is to calculate its
VC-dimension:
Definition: VC-dimension
A class of classifiers (functions from X to {−1, 1}), C, is said to shatter a
set of data points z1, z2, . . . , zd ∈ X if, for all assignments of labels to
those points, m1, m2, . . . , md ∈ {−1, 1}, there exists Ψ ∈ C such that:
∀ i = 1, . . . , d, Ψ(zi) = mi.
The VC-dimension of the class of functions C is the maximum number of
points that can be shattered by C.
19. Example: VC-dimension of hyperplanes
Suppose that X = R2 and
C = {Ψ : x ∈ R2 → ±Sign(aT x + b), a ∈ R2 and b ∈ R}. Then,
2 points are shattered by C;
3 points are shattered by C;
4 points cannot be shattered by C: no Ψ ∈ C can have value 1 on the red
circles and −1 on the black ones.
Hence, the VC-dimension of C is 3.
More generally, the VC-dimension of hyperplanes in Rd is d + 1.
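The shattering claims of this example can be checked numerically. The sketch below uses a capped perceptron as a heuristic linear-separability test (the cap, the point configurations and all names are assumptions of this note): 3 points in general position are shattered, while a 4-point XOR layout is not.

```python
# Brute-force check of shattering for linear classifiers
# Psi(x) = sign(a.x + b) in R^2, using a capped perceptron as a
# (heuristic) linear-separability test.
from itertools import product

def separable(points, labels, max_updates=5000):
    a, b = [0.0, 0.0], 0.0
    updates = 0
    while updates < max_updates:
        clean = True
        for (x1, x2), y in zip(points, labels):
            if y * (a[0] * x1 + a[1] * x2 + b) <= 0:  # point misclassified
                a[0] += y * x1; a[1] += y * x2; b += y
                updates += 1
                clean = False
        if clean:
            return True
    return False  # did not converge within the cap: treated as non-separable

def shattered(points):
    # Try every assignment of labels; all must be realizable.
    return all(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]  # XOR layout
print(shattered(three))  # 3 points in general position are shattered
print(shattered(four))   # the XOR labeling is not linearly separable
```

The perceptron is guaranteed to converge on separable data, so the cap only matters for the non-separable labelings.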
32. Relationship between VC-dimension and empirical error
Theorem [Vapnik, 1995, Vapnik, 1998]
With probability at least 1 − η,
supΨ∈C [E (LΨ) − LnΨ] ≤ √( (VC(C) − log(η/4)) / n ).
33. An alternative to VC-dimension
Remark: In most cases, the VC-dimension is not precise enough. Then,
another quantity can also be considered:
Definition: Shatter coefficient
The n-th shatter coefficient of the set of functions C is the maximum
number of partitions of n points into two sets that can be obtained from C.
This number, denoted by S(C, n), is at most equal to 2^n.
Example: If C is the space of hyperplanes in Rd,
S(C, n) = 2^n if n ≤ d + 1 (= VC(C)) and S(C, n) < 2^n if n > d + 1.
Remark: For all n > 2, S(C, n) ≤ n^VC(C).
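For very small classes the shatter coefficient can be computed by exhaustive enumeration. The sketch below does so for half-line classifiers on R (an illustrative class with VC-dimension 2; the class and names are choices of this note) and checks the remark S(C, n) ≤ n^VC(C).

```python
# Compute S(C, n) by enumeration for the class
# C = {x -> s * sign(x - t), s in {-1, +1}} of half-line classifiers on R.
# This class has VC-dimension 2 (it shatters 2 points, not 3).

def shatter_coefficient(points):
    pts = sorted(points)
    # Thresholds strictly between consecutive points, plus one on each
    # side, realize every achievable labeling of the sample.
    thresholds = ([pts[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
                  + [pts[-1] + 1.0])
    patterns = set()
    for s in (-1, 1):
        for t in thresholds:
            patterns.add(tuple(s if x > t else -s for x in pts))
    return len(patterns)

for n in range(2, 8):
    S = shatter_coefficient(list(range(n)))
    print(n, S, S <= n ** 2)  # S(C, n) = 2n, well below 2^n for n > 2
```

Here S(C, n) = 2n grows linearly while 2^n grows exponentially, which is exactly why the shatter coefficient gives sharper bounds than the worst case 2^n.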
36. Vapnik-Chervonenkis inequality
Theorem [Vapnik, 1995, Vapnik, 1998]
P ( supΨ∈C |LnΨ − E (LΨ)| > ε ) ≤ S(C, n) e^(−nε²/32).
Consequences for the learning error on C: if Ψn has been chosen by
minimizing the empirical risk, i.e.,
Ψn := arg minΨ∈C (1/n) Σni=1 I{Ψ(xi) ≠ yi},
then, as
P ( E (LΨn) − infΨ∈C E (LΨ) > ε ) ≤ P ( 2 supΨ∈C |LnΨ − E (LΨ)| > ε ),
we obtain
P ( E (LΨn) − infΨ∈C E (LΨ) > ε ) ≤ S(C, n) e^(−nε²/128).
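The right-hand side of the last bound can be evaluated numerically. The sketch below combines it with the earlier remark S(C, n) ≤ n^VC(C) to see how large n must be before the bound becomes informative (the values ε = 0.1 and VC(C) = 3 are arbitrary choices of this note).

```python
# Evaluate the tail bound P(E(L_Psi_n) - inf_C E(L_Psi) > eps)
# <= S(C, n) * exp(-n * eps**2 / 128), with S(C, n) bounded by n^VC(C).
import math

def vc_bound(n, vc_dim, eps):
    return n ** vc_dim * math.exp(-n * eps ** 2 / 128)

# Smallest power of two making the bound drop below 5% for eps = 0.1:
n = 1
while vc_bound(n, 3, 0.1) >= 0.05:
    n *= 2
print(n, vc_bound(n, 3, 0.1))
```

The polynomial factor n^VC(C) is eventually crushed by the exponential term, but only for quite large n: distribution-free guarantees of this kind are loose, which motivates sharper, data-dependent analyses.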
40. Additional notes for the regression case
The same theory can be developed for the regression case under additional
assumptions. To summarize, let (X, Y) be a random pair taking its values
in X × R and (x1, y1), . . . , (xn, yn) a training set of n i.i.d. realizations of
(X, Y). Then, we can introduce:
the risk as, for example, the mean squared error: for Ψ : X → R,
LΨ = E[ (Ψ(X) − Y)² | (xi, yi)i ];
the Bayes risk: L∗ = infΨ:X→R E[ (Ψ(X) − Y)² ]. In this case,
L∗ = E (LΨ∗) where Ψ∗ = E (Y | X);
the empirical risk: for Ψ : X → R, LnΨ = (1/n) Σni=1 (yi − Ψ(xi))².
Hence, in this case, a consistent regression scheme Ψn satisfies
limn→+∞ E (LΨn) = L∗, and a strongly consistent regression scheme Ψn
satisfies limn→+∞ LΨn = L∗ a.s.
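The relation L∗ = E(LΨ∗) with Ψ∗ = E(Y | X) can be illustrated by Monte Carlo on a synthetic model (the model Y = sin(2πX) + noise is an arbitrary choice of this note): the empirical risk of the Bayes regressor approaches L∗, which here equals the noise variance.

```python
# Monte Carlo illustration on a toy model Y = f(X) + noise: the Bayes
# regressor is Psi*(x) = E(Y | X = x) = f(x), and L* = Var(noise).
import math
import random

random.seed(0)

def sample(n, sigma=0.5):
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + random.gauss(0, sigma) for x in xs]
    return xs, ys

def empirical_risk(psi, xs, ys):
    # L_n(Psi) = (1/n) sum (y_i - Psi(x_i))^2
    return sum((y - psi(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

bayes = lambda x: math.sin(2 * math.pi * x)  # Psi* = E(Y | X)
xs, ys = sample(100000)
print(empirical_risk(bayes, xs, ys))  # close to L* = sigma^2 = 0.25
```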
44. Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
45. Reminders on the functional multilayer perceptron by the projection
approach
Data: Suppose that we are given a random pair (X, Y) taking its values in
X × R where (X, ⟨., .⟩X) is a Hilbert space. Suppose also that we have n
i.i.d. observations of (X, Y), (x1, y1), . . . , (xn, yn).
Functional MLP: The projection approach is based on the knowledge of a
Hilbert basis of X, denoted by (φk)k≥1. The data (xi)i and also the weights
of the MLP are projected on this basis truncated at q:
Cnq = { Ψ : X → R : ∀ x ∈ X,
Ψ(x) = Σl=1..pn w(2)l G( w(0)l + Σk=1..q β(1)lk (Pq(x))k ),
with Σl=1..pn |w(2)l| ≤ αn },
where (pn)n is a sequence of integers, (αn)n is a sequence of positive real
numbers, G is a given continuous function and the weights (w(2)l)l, (w(0)l)l
and (β(1)lk)l,k have to be learned from the data set in R (see Presentation 2
for further details).
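The projection step Pq can be sketched numerically: each sampled curve is replaced by its first q coefficients on a Hilbert basis of L²([0, 1]). The Fourier basis and the Riemann-sum quadrature below are illustrative choices of this note, not prescriptions of the slides.

```python
# Sketch of the projection step: replace each observed curve x_i by its
# first q coefficients <x_i, phi_k> on a Hilbert basis of L^2([0, 1]),
# here the Fourier basis, approximated from a regular sampling grid.
import math

def fourier_basis(k, t):
    if k == 0:
        return 1.0                                                   # phi_0
    if k % 2 == 1:
        return math.sqrt(2) * math.sin(2 * math.pi * ((k + 1) // 2) * t)
    return math.sqrt(2) * math.cos(2 * math.pi * (k // 2) * t)

def project(curve, grid, q):
    """Approximate (<x, phi_0>, ..., <x, phi_{q-1}>) by a Riemann sum."""
    h = grid[1] - grid[0]
    return [sum(x * fourier_basis(k, t) for x, t in zip(curve, grid)) * h
            for k in range(q)]

grid = [i / 1000 for i in range(1000)]
curve = [math.sqrt(2) * math.sin(2 * math.pi * t) for t in grid]  # = phi_1
coeffs = project(curve, grid, 4)
print([round(c, 3) for c in coeffs])  # approximately [0, 1, 0, 0]
```

The MLP of Cnq then acts only on these q coefficients, which is what makes the functional problem finite dimensional.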
47. Assumptions for consistency of the functional MLP
Note
Ψpn = arg minΨ∈Cnq LnΨ
and suppose that:
(A1) G : R → [0, 1] is monotone, non-decreasing, with
limt→+∞ G(t) = 1 and limt→−∞ G(t) = 0;
(A2) limn→+∞ pnαn log(pnαn)/n = 0 and ∃ δ > 0: limn→+∞ α²n / n^(1−δ) = 0;
(A3) Y is square integrable.
51. Strong consistency of the projection-based functional MLP
Theorem [Rossi and Conan-Guez, 2006]
Under assumptions (A1)-(A3),
limp→+∞ limn→+∞ LΨpn = L∗ a.s.
Sketch of the proof: The proof is divided into two parts:
1 The first one shows that
L∗p := infΨ:Rp→R E[ (Ψ(Pp(X)) − Y)² ] → L∗ as p → +∞.
2 The second one shows that, for any fixed p,
limn→+∞ LΨpn = L∗p a.s.
Remark: The limitation of this result lies in the fact that it is a double
limit: no indication is given on the way n and p should be linked.
Remark 2: The principle of the proof is very general and can be applied to
any other consistent method in Rp.
56. Presentation of k-nearest neighbors for functional classification
This method was introduced in [Biau et al., 2005] for the binary
classification case; a regression version exists in the work of [Laloë, 2008].
Context: We are given a random pair (X, Y) taking its values in
X × {−1, 1} where (X, ⟨., .⟩X) is a Hilbert space. Moreover, we are given n
i.i.d. observations of (X, Y), denoted (x1, y1), . . . , (xn, yn).
Functional k-nearest neighbors also consists in using the projection of
the data on a Hilbert basis (φj)j≥1: denote xdi = (xi1, . . . , xid) where
∀ i = 1, . . . , n and ∀ j = 1, . . . , d, xij = ⟨xi, φj⟩X.
k-nearest neighbors for d-dimensional data is then performed on the
dataset (xd1, y1), . . . , (xdn, yn): if, for all u ∈ Rd,
Vk(u) := {i ∈ [[1, n]] : ‖xdi − u‖Rd belongs to the k smallest of these values},
then
Ψn : x ∈ X → −1 if Σi∈Vk(xd) I{yi=−1} > Σi∈Vk(xd) I{yi=1}, and +1 otherwise.
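Once the coefficients xdi are computed, the classifier above reduces to plain k-NN with majority vote in Rd. A minimal sketch (the toy coefficients and all names are assumptions of this note):

```python
# Plain k-NN with majority vote in R^d, applied to (assumed precomputed)
# projection coefficients x_i^d of the curves.
from collections import Counter

def knn_classify(train_coeffs, train_labels, u, k):
    """Majority vote among the k training points closest to u in R^d."""
    order = sorted(range(len(train_coeffs)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train_coeffs[i], u)))
    votes = Counter(train_labels[i] for i in order[:k])
    return 1 if votes[1] >= votes[-1] else -1  # ties broken toward +1

coeffs = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.1]]  # toy x_i^d
labels = [-1, -1, 1, 1]
print(knn_classify(coeffs, labels, [0.1, 0.95], k=3))  # -1
```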
60. Selection of the projection dimension d and of the parameter k
d and k are then automatically selected from the dataset by a validation
strategy:
1 For all k ∈ N∗ and all d ∈ N∗, compute the k-nearest neighbors
classifier, Ψd,l,kn, from the data {(xdi, yi)}i=1,...,l.
2 Choose
(dn, kn) = arg mink∈N∗, d∈N∗ (1/(n − l)) Σni=l+1 I{Ψd,l,kn(xi) ≠ yi} + λd/√(n − l),
where λd is a penalization term to avoid the selection of (possibly
overfitting) very large dimensions.
Then, define Ψn = Ψdn,l,knn.
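The validation strategy above can be sketched end to end on toy projected data. Here λd = d and the synthetic coefficients are illustrative choices of this note: the first coefficient alone separates the classes, so the penalty should favor d = 1.

```python
# Validation-based selection of (d, k): train k-NN on the first l pairs,
# evaluate on the remaining n - l, and penalize large dimensions d.
import math
from collections import Counter

def knn_predict(train, labels, u, k, d):
    # k-NN majority vote using only the first d coefficients.
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train[i][:d], u[:d])))
    votes = Counter(labels[i] for i in order[:k])
    return 1 if votes[1] >= votes[-1] else -1

def select_d_k(coeffs, labels, l, d_max, lam=lambda d: float(d)):
    n = len(coeffs)
    best = None
    for d in range(1, d_max + 1):
        for k in range(1, l + 1):
            err = sum(knn_predict(coeffs[:l], labels[:l], coeffs[i], k, d)
                      != labels[i] for i in range(l, n)) / (n - l)
            score = err + lam(d) / math.sqrt(n - l)  # validation error + penalty
            if best is None or score < best[0]:
                best = (score, d, k)
    return best[1], best[2]

# Toy projected curves: coefficient 1 separates the classes, coefficient 2
# is noise, so the penalized criterion should pick d = 1.
coeffs = [[-1.0, 0.3], [1.0, 0.2], [-0.8, -0.2], [0.8, -0.3],
          [-1.2, 0.4], [1.1, 0.0], [-0.9, 0.1], [0.9, 0.4]]
labels = [-1, 1, -1, 1, -1, 1, -1, 1]
print(select_d_k(coeffs, labels, l=4, d_max=2))  # (1, 1)
```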
64. An oracle inequality
Oracle inequality [Biau et al., 2005]
Let ∆ = Σd≥1 e^(−2λ²d) < +∞. Then there exists C > 0, only depending on
∆, such that ∀ l > 1/∆,
E (LΨn) − L∗ ≤ infd≥1 { (L∗d − L∗) + inf1≤k≤l [ E (LΨl,k,dn) − L∗d ] + λd/√(n − l) }
+ C √( log l / (n − l) ).
Then, we have:
by a martingale property: limd→+∞ L∗d = L∗;
by consistency of k-nearest neighbors in Rd: for all d ≥ 1,
inf1≤k≤l E (LΨl,k,dn) − L∗d → 0 as l → +∞;
the rest of the right-hand side of the inequality can be made to converge
to 0 as n grows to infinity, for suitable choices of n, l and λd.
66. Consistency of functional k-nearest neighbors
Theorem [Biau et al., 2005]
Suppose that
limn→+∞ l = +∞, limn→+∞ (n − l) = +∞ and limn→+∞ log l / (n − l) = 0;
then
limn→+∞ E (LΨn) = L∗.
67. Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
68. A binary classification problem
Suppose that we are given a random pair of variables (X, Y) where X takes
its values in Rd and Y takes its values in {−1, 1}.
Moreover, we know n i.i.d. realizations of the random pair (X, Y), denoted
(x1, y1), . . . , (xn, yn).
We try to learn a classification machine, Ψn, of the form
x → Sign(⟨x, w⟩Rd + b) or, more precisely, of the form
x → Sign(⟨φ(x), w⟩X + b),
where the exact nature of φ and X will be discussed later.
71. Linear discrimination with optimal margin
Learn Ψn : x → Sign(⟨x, w⟩Rd + b).
The margin is 1/‖w‖Rd and the points lying on it are the support vectors.
w is such that:
minw,b ‖w‖Rd,
such that: yi(wT xi + b) ≥ 1, 1 ≤ i ≤ n.
75. Linear discrimination with soft margin
Learn Ψn : x → Sign(⟨x, w⟩Rd + b).
The margin is 1/‖w‖Rd and the points lying on it are the support vectors.
w is such that:
minw,b,ξ ‖w‖Rd + C Σni=1 ξi,
where: yi(wT xi + b) ≥ 1 − ξi, 1 ≤ i ≤ n,
ξi ≥ 0, 1 ≤ i ≤ n.
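The soft-margin problem can be solved approximately by rewriting it in a regularized hinge-loss form and running plain subgradient descent. This is an illustrative sketch (the step size, iteration count and λ are arbitrary choices of this note), not the quadratic program usually solved for SVM.

```python
# Approximate soft-margin linear SVM via subgradient descent on
#   f(w, b) = lam * |w|^2 + (1/n) * sum max(0, 1 - y_i (w.x_i + b)).

def train_linear_svm(xs, ys, lam=0.01, steps=3000, lr=0.02):
    d, n = len(xs[0]), len(xs)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw = [2 * lam * wi for wi in w]  # gradient of the penalty term
        gb = 0.0
        for x, y in zip(xs, ys):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:
                # point inside the margin: hinge subgradient is active
                for j in range(d):
                    gw[j] -= y * x[j] / n
                gb -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

xs = [(0.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 2.0)]
ys = [-1, -1, 1, 1]
w, b = train_linear_svm(xs, ys)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in xs]
print(preds)  # the separable toy set is classified correctly
```

The regularization view also explains the next slides: the same objective reappears there as (Rλ,X) once x is replaced by φ(x).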
79. Mapping the data onto a high-dimensional space
Learn Ψn : x → Sign(⟨φ(x), w⟩X + b), where φ is a nonlinear map from the
original space Rd to the feature space X.
w is such that:
(PC,X) minw,b,ξ ‖w‖X + C Σni=1 ξi,
where: yi(⟨w, φ(xi)⟩X + b) ≥ 1 − ξi, 1 ≤ i ≤ n,
ξi ≥ 0, 1 ≤ i ≤ n.
83. Details about the feature space: a regularization framework
Regularization framework: (PC,X) ⇔
(Rλ,X) minF∈X (1/n) Σni=1 max(0, 1 − yiF(xi)) + λ‖F‖X.
Dual problem: (PC,X) ⇔
(DC,X) maxα Σni=1 αi − Σni=1 Σnj=1 αiαjyiyj⟨φ(xi), φ(xj)⟩X
where Σni=1 αiyi = 0,
0 ≤ αi ≤ C, 1 ≤ i ≤ n.
Inner product in X:
∀ u, v ∈ X, K(u, v) = ⟨φ(u), φ(v)⟩X.
86. Examples of useful kernels

Provided that
∀ m ∈ ℕ*, ∀ (u_i)_{i=1,...,m} ∈ ℝ^d, ∀ (α_i)_{i=1,...,m} ∈ ℝ,
∑_{i,j=1}^m α_i α_j K(u_i, u_j) ≥ 0,
K can be used as a kernel mapping the original data onto a high-dimensional feature space [Aronszajn, 1950].

The Gaussian kernel: K(u, v) = e^{−σ² ‖u−v‖²_{ℝ^d}} for σ > 0;
The exponential kernel: K(u, v) = e^{⟨u,v⟩_{ℝ^d}};
Vovk's real infinite polynomial: K(u, v) = (1 − ⟨u, v⟩_{ℝ^d})^{−α} for α > 0;
. . .
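Aronszajn's positivity condition can be checked numerically for the Gaussian kernel. A minimal sketch (the sample points, σ, and the coefficient vector are arbitrary choices for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 3))        # m = 6 arbitrary points in R^3
sigma = 0.5

# Gaussian kernel matrix: K_ij = exp(-sigma^2 * ||u_i - u_j||^2)
D2 = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)
K = np.exp(-sigma**2 * D2)

# Aronszajn's condition: sum_ij alpha_i alpha_j K(u_i, u_j) >= 0 for
# every alpha, i.e. K is positive semi-definite.
alpha = rng.normal(size=6)
quad_form = alpha @ K @ alpha
eigmin = np.linalg.eigvalsh(K).min()
```

Both `quad_form` and the smallest eigenvalue are non-negative (up to rounding), as the condition requires.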
90. Assumptions for consistency of SVM in ℝ^d

Suppose that:
(A1) X takes its values in a compact subset W of ℝ^d;
(A2) the kernel K is universal on W (i.e., the set of all functions {u ∈ W ↦ ⟨w, φ(u)⟩_X, w ∈ X} is dense in C⁰(W));
(A3) ∀ ε > 0, the ε-covering number of φ(W), that is, the minimum number of balls of radius ε needed to cover φ(W), satisfies N(K, ε) = O(ε^{−α}) for some α > 0;
(A4) the regularization parameter C depends on n with lim_{n→+∞} nC_n = +∞ and C_n = O(n^{β−1}) for 0 < β < 1/α.

Remark: The Gaussian kernel satisfies all these assumptions, with N(K, ε) = O(ε^{−d}).
95. Consistency of SVM in ℝ^d

Theorem [Steinwart, 2002]
Under assumptions (A1)-(A4), SVM are consistent.
96. Why can't SVM be directly applied to functional data?

Suppose now that X takes its values in a Hilbert space (X, ⟨·,·⟩_X).

1 We have already discussed the advantages of regularization or projection of the functional data as a pre-processing step;
2 The consistency result cannot be applied directly to infinite-dimensional data, because the covering-number condition does not hold for the infinite-dimensional Gaussian kernel.
99. A consistent approach based on the ideas of [Biau et al., 2005]

1 (ψ_j)_j is a Hilbert basis of X: projection on (ψ_j)_{j=1,...,d};
2 Choice of the parameters a ≡ (d, K, C), with d ∈ ℕ, K ∈ J_d, C ∈ [0; C_d]:
Splitting the data: B1 = (x_1, y_1), . . . , (x_l, y_l) and B2 = (x_{l+1}, y_{l+1}), . . . , (x_n, y_n);
Learn an SVM on B1: Ψ_n^{l,a};
Validation on B2:

a* = arg min_a { L_{n−l} Ψ_n^{l,a} + λ_d / √(n − l) }

with L_{n−l} Ψ_n^{l,a} = (1/(n−l)) ∑_{i=l+1}^n 𝕀_{Ψ_n^{l,a}(x_i) ≠ y_i}.

⇒ The obtained classifier is denoted Ψ_n.
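The project / split / penalized-validation scheme above can be sketched as follows. This is a toy illustration, not the slides' exact method: the curves are synthetic, λ_d = d is an arbitrary penalty choice, and a nearest-class-mean classifier stands in for the SVM so the code stays self-contained:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 128)

# Synthetic functional data: noisy sine curves (y = 1) vs cosine curves (y = -1)
y = np.tile([1, -1], 30)                       # n = 60 alternating labels
clean = np.where(y[:, None] == 1, np.sin(2 * np.pi * t), np.cos(2 * np.pi * t))
X_curves = clean + 0.3 * rng.normal(size=clean.shape)

def project(curves, d):
    """Coefficients on the first d trigonometric basis functions."""
    basis = np.stack([np.sin(2 * np.pi * (j // 2 + 1) * t) if j % 2 == 0
                      else np.cos(2 * np.pi * (j // 2 + 1) * t)
                      for j in range(d)])
    return curves @ basis.T / t.size           # crude quadrature on [0, 1]

def fit_predict(tr_x, tr_y, te_x):
    """Nearest-class-mean classifier (stand-in for the SVM of the slides)."""
    mu_p = tr_x[tr_y == 1].mean(0)
    mu_m = tr_x[tr_y == -1].mean(0)
    return np.where(((te_x - mu_p) ** 2).sum(1)
                    < ((te_x - mu_m) ** 2).sum(1), 1, -1)

l = 40                                         # B1 = first l pairs, B2 = rest
scores = {}
for d in (2, 4, 8):
    coefs = project(X_curves, d)
    err = (fit_predict(coefs[:l], y[:l], coefs[l:]) != y[l:]).mean()
    scores[d] = err + d / np.sqrt(len(y) - l)  # penalized validation error
best_d = min(scores, key=scores.get)           # a* (here, the dimension d)
```

The penalty λ_d/√(n−l) discourages large d when the validation errors are comparable, which is exactly what drives the selection of a* above.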
105. Assumptions

Assumptions on X
(A1) X takes its values in a bounded subset of X.

Assumptions on the parameters: ∀ d ≥ 1,
(A2) J_d is a finite set;
(A3) ∃ K_d ∈ J_d such that K_d is universal on any compact of ℝ^d and ∃ ν_d > 0 : N(K_d, ε) = O(ε^{−ν_d});
(A4) C_d > 1;
(A5) ∑_{d≥1} |J_d| e^{−2λ_d²} < +∞.

Assumptions on training/validation sets
(A6) lim_{n→+∞} l = +∞;
(A7) lim_{n→+∞} (n − l) = +∞;
(A8) lim_{n→+∞} l log(n−l) / (n−l) = 0.
108. Consistency

Theorem [Rossi and Villa, 2006]
Under assumptions (A1)-(A8), Ψ_n is consistent:

E(LΨ_n) → L* as n → +∞.

Ideas of the proof: The proof follows a sketch similar to that of [Biau et al., 2005], but the result allows the use of a continuous parameter (the regularization parameter C), relying on the shatter coefficient of a class of functions that includes SVM.
110. Application 1: Voice recognition

Description of the data and methods
3 problems; for each problem, 100 records sampled at 8 192 points;
consistent approach:
Projection on a trigonometric basis;
Splitting the data base into 50 curves (training) / 49 curves (validation);
Performances calculated by leave-one-out.

Results (error rates)
Prob.       k-nn   QDA    SVM gau. (proj)   SVM lin. (proj)   SVM lin. (direct)
yes/no      10%    7%     10%               19%               58%
boat/goat   21%    35%    8%                29%               46%
sh/ao       16%    19%    12%               25%               47%
113. Regression by SVM

Suppose that we are given a random pair of variables (X, Y), where X takes its values in ℝ^d and Y takes its values in ℝ.

Moreover, we know n i.i.d. realizations of the random pair (X, Y), denoted (x_1, y_1), . . . , (x_n, y_n).

Once again, we try to learn a regression machine Ψ_n of the form

x ↦ ⟨φ(x), w⟩_X + b

where the exact nature of φ and X will be discussed later.
116. Generalization of the classification case to regression

w and b minimize

C‖w‖²_X + ∑_{i=1}^n L^k_ε(x_i, y_i, w)

where L^k_ε, for k = 1, 2 and ε ≥ 0, is the ε-sensitive loss function:

L^k_ε(x_i, y_i, w) = max(0, |y_i − ⟨φ(x_i), w⟩_X|^k − ε),

or any other loss function.

Remark: A dual version, which is a quadratic optimization problem in ℝ^n, also exists.
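The ε-sensitive loss can be evaluated directly; the residuals below are made-up numbers chosen to show which ones fall inside the ε-tube (and hence cost nothing):

```python
import numpy as np

def eps_loss(y, pred, eps, k):
    """epsilon-sensitive loss: max(0, |y_i - pred_i|^k - eps)."""
    return np.maximum(0.0, np.abs(y - pred) ** k - eps)

y = np.array([1.0, 2.0, 3.0])
pred = np.array([1.05, 2.5, 1.0])   # hypothetical predictions <phi(x_i), w> + b

# Residuals are 0.05, 0.5, 2.0; with eps = 0.1 the first one is free
loss1 = eps_loss(y, pred, eps=0.1, k=1)
```

Only residuals larger than ε contribute, which is what makes the minimizer sparse in the dual variables.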
118. A kernel ridge regression

When ε is equal to 0 and k = 2, the previous problem becomes: find w and b that minimize

Υ‖w‖²_X + ∑_{i=1}^n (y_i − ⟨φ(x_i), w⟩_X)²,

which can be viewed as a kernel ridge regression. This method is also known under the name of Least Squares SVM (LS-SVM).

A multidimensional consistency result is available in [Christmann and Steinwart, 2007]: the same method as for SVM classifiers can then be used in the regression case!
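For ε = 0 and k = 2, the minimizer has the closed form of kernel ridge regression, α = (K + Υ I)^{−1} y with fitted values Kα. A minimal sketch on synthetic 1-D data (the Gaussian kernel, the values of Υ and σ, and the dropped intercept b are simplifying choices, not the slides' settings):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=20)
y = np.sin(x) + 0.05 * rng.normal(size=20)   # synthetic regression data

sigma, ups = 1.0, 0.1                        # kernel width and Upsilon

# Gaussian Gram matrix: K_ij = exp(-sigma^2 (x_i - x_j)^2)
K = np.exp(-sigma**2 * (x[:, None] - x[None, :]) ** 2)

# Kernel ridge solution: alpha = (K + ups I)^{-1} y, fitted values K alpha
alpha = np.linalg.solve(K + ups * np.eye(20), y)
fitted = K @ alpha

train_mse = ((fitted - y) ** 2).mean()
baseline = ((y - y.mean()) ** 2).mean()      # constant predictor, for comparison
```

The regularizer Υ keeps the linear system well conditioned; as Υ → 0 the fit interpolates the data, as it grows the fit shrinks toward zero.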
120. Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
121. References
Further details on the references are given in the accompanying document.