FDA and Statistical learning theory
Nathalie Villa-Vialaneix - nathalie.villa@math.univ-toulouse.fr
http://www.nathalievilla.org
Institut de Mathématiques de Toulouse - IUT de Carcassonne, Université de Perpignan
France
La Havane, September 17th, 2008
Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
Purpose of statistical learning theory
In the previous presentations, the aim was to find an estimator that is "close" to the model. The aim of statistical learning theory is slightly different: find a regression function that has a small error.
More precisely, in the binary classification case:
- we are given a pair of random variables, $(X, Y)$, taking values in $\mathcal{X} \times \{-1, 1\}$, where $\mathcal{X}$ is any topological space;
- we observe $n$ i.i.d. realizations of $(X, Y)$, $(x_1, y_1), \ldots, (x_n, y_n)$, called the learning set;
- we intend to find a function built from $(x_1, y_1), \ldots, (x_n, y_n)$, $\Psi_n : \mathcal{X} \to \{-1, 1\}$, that minimizes $P(\Psi_n(X) \neq Y)$.
First remarks on the aim
1. $\inf_{\Psi : \mathcal{X} \to \{-1,1\}} P(\Psi(X) \neq Y)$ is the "target" for the expectation of the error. This lower bound on the expected error is called the Bayes risk and is denoted by $L^*$.
2. Generally, $\Psi_n$ is chosen in a restricted class $\mathcal{C}$ of functions from $\mathcal{X}$ to $\{-1, 1\}$; the performance of $\Psi_n$ can then be quantified by:
$$P(\Psi_n(X) \neq Y) - L^* = \underbrace{P(\Psi_n(X) \neq Y) - \inf_{\Psi \in \mathcal{C}} P(\Psi(X) \neq Y)}_{\text{error due to the training method}} + \underbrace{\inf_{\Psi \in \mathcal{C}} P(\Psi(X) \neq Y) - L^*}_{\text{error due to the choice of } \mathcal{C}}$$
Consistency
From this last remark, we can define:
Definition: Weak consistency
An algorithm leading to the classifier $\Psi_n$ is said to be (weakly universally) consistent if, for every distribution of the random pair $(X, Y)$, we have
$$E(L\Psi_n) \xrightarrow{n \to +\infty} L^*$$
where $L\Psi_n := P(\Psi_n(X) \neq Y \mid (x_i, y_i)_i)$.
Definition: Strong consistency
Moreover, it is said to be strongly (universally) consistent if, for every distribution of the random pair $(X, Y)$, we have
$$L\Psi_n \xrightarrow{n \to +\infty} L^* \quad \text{a.s.}$$
Choice of $\mathcal{C}$ and of $\Psi_n$
1. The choice of $\mathcal{C}$ is of main importance for obtaining a good performance of $\Psi_n$:
- a too small (not rich enough) $\mathcal{C}$ has a poor value of $\inf_{\Psi \in \mathcal{C}} P(\Psi(X) \neq Y) - L^*$,
- but a too rich $\mathcal{C}$ has a poor value of $P(\Psi_n(X) \neq Y) - \inf_{\Psi \in \mathcal{C}} P(\Psi(X) \neq Y)$ because the learning algorithm tends to overfit the data.
2. A naive approach to find a good $\Psi_n$ over the class $\mathcal{C}$ is to minimize the empirical risk over $\mathcal{C}$:
$$\Psi_n := \arg\min_{\Psi \in \mathcal{C}} L_n\Psi \quad \text{where} \quad L_n\Psi := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{\Psi(x_i) \neq y_i\}}.$$
The work of [Vapnik, 1995, Vapnik, 1998] links the choice of $\mathcal{C}$ to the accuracy of the empirical risk.
VC-dimension
A way to quantify the "richness" of a class of functions is to compute its VC-dimension:
Definition: VC-dimension
A class of classifiers (functions from $\mathcal{X}$ to $\{-1, 1\}$), $\mathcal{C}$, is said to shatter a set of data points $z_1, z_2, \ldots, z_d \in \mathcal{X}$ if, for all assignments of labels to those points, $m_1, m_2, \ldots, m_d \in \{-1, 1\}$, there exists a $\Psi \in \mathcal{C}$ such that $\forall\, i = 1, \ldots, d$, $\Psi(z_i) = m_i$.
The VC-dimension of a class of functions $\mathcal{C}$ is the maximum number of points that can be shattered by $\mathcal{C}$.
Example: VC-dimension of hyperplanes
Suppose that $\mathcal{X} = \mathbb{R}^2$ and $\mathcal{C} = \left\{ \Psi : x \in \mathbb{R}^2 \mapsto \pm\,\mathrm{Sign}(a^T x + b),\ a \in \mathbb{R}^2 \text{ and } b \in \mathbb{R} \right\}$. Then,
- 2 points are shattered by $\mathcal{C}$;
- 3 points are shattered by $\mathcal{C}$;
- 4 points cannot be shattered by $\mathcal{C}$: no $\Psi \in \mathcal{C}$ can take the value 1 on the red circles and $-1$ on the black ones in the slide figure.
Hence, the VC-dimension of $\mathcal{C}$ is 3. More generally, the VC-dimension of hyperplanes in $\mathbb{R}^d$ is $d + 1$.
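To check these shattering claims numerically, here is a minimal sketch (not from the slides) that tests, for every labelling of a small point set, whether some affine classifier realizes it, via a feasibility linear program; scipy and a margin-1 formulation are assumptions of the sketch.

```python
# A minimal numerical sketch: can a set of points in R^2 be shattered by the
# class of affine hyperplane classifiers? One feasibility LP per labelling.
import itertools

import numpy as np
from scipy.optimize import linprog


def separable(points, labels):
    """Return True if some (a, b) satisfies y_i (a^T x_i + b) >= 1 for all i."""
    # Variables z = (a_1, a_2, b); constraints -y_i (a^T x_i + b) <= -1.
    A_ub = np.array([-y * np.append(x, 1.0) for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success


def shattered(points):
    """True if every labelling in {-1, 1}^n is realized by some hyperplane."""
    return all(separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))


three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # not collinear
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # "XOR" layout
print(shattered(three))  # True: 3 points in general position are shattered
print(shattered(four))   # False: the XOR labelling is not linearly separable
```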
Relationship between VC-dimension and empirical error
Theorem [Vapnik, 1995, Vapnik, 1998]
With probability at least $1 - \eta$,
$$\sup_{\Psi \in \mathcal{C}} \left( E(L\Psi) - L_n\Psi \right) \leq \sqrt{\frac{VC(\mathcal{C}) - \log(\eta/4)}{n}}.$$
An alternative to VC-dimension
Remark: In most cases, the VC-dimension is not precise enough. Another quantity can then be considered:
Definition: Shatter coefficient
The n-th shatter coefficient of the set of functions $\mathcal{C}$ is the maximum number of partitions of $n$ points into two sets that can be obtained from $\mathcal{C}$. This number, denoted by $S(\mathcal{C}, n)$, is at most equal to $2^n$.
Example: If $\mathcal{C}$ is the space of hyperplanes in $\mathbb{R}^d$,
$$S(\mathcal{C}, n) = \begin{cases} 2^n & \text{if } n \leq d \\ 2^{d+1} = 2^{VC(\mathcal{C})} & \text{if } n \geq d + 1 \end{cases}$$
Remark: For all $n > 2$, $S(\mathcal{C}, n) \leq n^{VC(\mathcal{C})}$.
Vapnik-Chervonenkis inequality
Theorem [Vapnik, 1995, Vapnik, 1998]
$$P\left( \sup_{\Psi \in \mathcal{C}} \left| L_n\Psi - E(L\Psi) \right| > \epsilon \right) \leq S(\mathcal{C}, n)\, e^{-n\epsilon^2/32}.$$
Consequences for the learning error on $\mathcal{C}$: if $\Psi_n$ has been chosen by minimizing the empirical risk, i.e.,
$$\Psi_n := \arg\min_{\Psi \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{\Psi(x_i) \neq y_i\}},$$
then, since
$$P\left( E(L\Psi_n) - \inf_{\Psi \in \mathcal{C}} E(L\Psi) > \epsilon \right) \leq P\left( 2 \sup_{\Psi \in \mathcal{C}} \left| L_n\Psi - E(L\Psi) \right| > \epsilon \right),$$
we obtain
$$P\left( E(L\Psi_n) - \inf_{\Psi \in \mathcal{C}} E(L\Psi) > \epsilon \right) \leq S(\mathcal{C}, n)\, e^{-n\epsilon^2/128}.$$
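To get a feel for the order of magnitude of this bound, here is a small numerical check (not from the slides), assuming we plug the upper bound $S(\mathcal{C}, n) \leq n^{VC(\mathcal{C})}$ into the last inequality for hyperplanes in $\mathbb{R}^2$ (VC-dimension 3):

```python
# A small numerical illustration (assumption: S(C, n) <= n^VC is substituted into
# the last inequality); it shows how large n must be before the bound on the
# excess risk probability becomes non-trivial, i.e. smaller than 1.
import numpy as np

vc, eps = 3, 0.1  # hyperplanes in R^2, tolerance on the excess risk
for n in [10**3, 10**4, 10**5, 10**6]:
    bound = n**vc * np.exp(-n * eps**2 / 128)
    print(f"n = {n:>7d}   bound on P(excess risk > {eps}) <= {bound:.3e}")
```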
Additional notes for the regression case
The same theory can be developed for the regression case under additional assumptions. To summarize, let $(X, Y)$ be a random pair taking its values in $\mathcal{X} \times \mathbb{R}$ and $(x_1, y_1), \ldots, (x_n, y_n)$ a training set of $n$ i.i.d. realizations of $(X, Y)$. Then, we can introduce:
- the risk, for example the mean square error: for $\Psi : \mathcal{X} \to \mathbb{R}$, $L\Psi = E\left( (\Psi(X) - Y)^2 \mid (x_i, y_i)_i \right)$;
- the Bayes risk: $L^* = \inf_{\Psi : \mathcal{X} \to \mathbb{R}} E\left( (\Psi(X) - Y)^2 \right)$. In this case, $L^* = E(L\Psi^*)$ where $\Psi^* = E(Y \mid X)$;
- the empirical risk: for $\Psi : \mathcal{X} \to \mathbb{R}$, $L_n\Psi = \frac{1}{n} \sum_{i=1}^{n} (y_i - \Psi(x_i))^2$.
Hence, in this case, a consistent regression scheme $\Psi_n$ satisfies $\lim_{n \to +\infty} E(L\Psi_n) = L^*$, and a strongly consistent regression scheme $\Psi_n$ satisfies $\lim_{n \to +\infty} L\Psi_n = L^*$ a.s.
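As an illustration (not from the slides), the following sketch uses a simulated model where the regression function is known, estimates the risk $L\Psi_n$ of a k-nearest-neighbors regressor, and compares it with the Bayes risk $L^* = E(\mathrm{Var}(Y \mid X))$; the risk should approach $L^*$ as $n$ grows.

```python
# A minimal simulation sketch (assumed model: Y = sin(2*pi*X) + Gaussian noise,
# so Psi*(x) = sin(2*pi*x) and L* = noise variance); it illustrates weak
# consistency by estimating the risk of a k-NN regressor for growing n.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
sigma2 = 0.25  # noise variance = Bayes risk L*

def sample(n):
    x = rng.uniform(0.0, 1.0, size=(n, 1))
    y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return x, y

x_test, y_test = sample(20_000)  # large test set to estimate the risk
for n in [100, 1_000, 10_000]:
    x_train, y_train = sample(n)
    k = max(1, int(n ** 0.5))    # a simple choice k ~ sqrt(n)
    model = KNeighborsRegressor(n_neighbors=k).fit(x_train, y_train)
    risk = np.mean((model.predict(x_test) - y_test) ** 2)
    print(f"n = {n:>6d}   estimated risk = {risk:.3f}   (L* = {sigma2})")
```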
2 Examples of consistent methods for FDA
Reminder on the functional multilayer perceptron by the projection approach
Data: Suppose that we are given a random pair $(X, Y)$ taking its values in $\mathcal{X} \times \mathbb{R}$ where $(\mathcal{X}, \langle \cdot, \cdot \rangle_{\mathcal{X}})$ is a Hilbert space. Suppose also that we have $n$ i.i.d. observations of $(X, Y)$, $(x_1, y_1), \ldots, (x_n, y_n)$.
Functional MLP: The projection approach is based on the knowledge of a Hilbert basis of $\mathcal{X}$, denoted by $(\phi_k)_{k \geq 1}$. The data $(x_i)_i$ and also the weights of the MLP are projected on this basis truncated at $q$:
$$\mathcal{C}^n_q = \left\{ \Psi : \mathcal{X} \to \mathbb{R} \ : \ \forall\, x \in \mathcal{X},\ \Psi(x) = \sum_{l=1}^{p_n} w^{(2)}_l\, G\!\left( w^{(0)}_l + \sum_{k=1}^{q} \beta^{(1)}_{lk} (P_q(x))_k \right),\ \sum_{l=1}^{p_n} |w^{(2)}_l| \leq \alpha_n \right\}$$
where $(p_n)_n$ is a sequence of integers, $(\alpha_n)_n$ is a sequence of positive real numbers, $G$ is a given continuous function and the real-valued weights $(w^{(2)}_l)_l$, $(w^{(0)}_l)_l$ and $(\beta^{(1)}_{lk})_{l,k}$ have to be learned from the data set (see Presentation 2 for further details).
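As a rough illustration (not the constrained class $\mathcal{C}^n_q$ of the slides), the sketch below projects simulated curves on a truncated Fourier basis and fits a one-hidden-layer perceptron on the $q$ coefficients; scikit-learn's MLPRegressor is used as a stand-in for the network described above, and the grid, basis and data are assumptions of the sketch.

```python
# A minimal sketch of the projection approach (assumptions: curves observed on a
# regular grid of [0, 1], projection approximated by discrete inner products with
# a Fourier basis, and sklearn's MLPRegressor standing in for the class C^n_q).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)            # sampling grid of the curves
n, q = 300, 5                             # sample size and truncation order

# Simulated functional data: random curves, scalar response = a functional of x.
freqs = rng.uniform(1.0, 3.0, size=(n, 1))
x_curves = np.sin(2 * np.pi * freqs * t) + 0.1 * rng.normal(size=(n, t.size))
y = freqs[:, 0] + 0.05 * rng.normal(size=n)

# Truncated Fourier basis (1, cos, sin, ...) and projection coefficients P_q(x).
basis = [np.ones_like(t)]
for k in range(1, (q - 1) // 2 + 1):
    basis += [np.cos(2 * np.pi * k * t), np.sin(2 * np.pi * k * t)]
basis = np.array(basis[:q])                       # shape (q, len(t))
coeffs = x_curves @ basis.T / t.size              # approximate <x_i, phi_k>_X

mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
mlp.fit(coeffs, y)
print("training MSE on the projected data:", np.mean((mlp.predict(coeffs) - y) ** 2))
```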
Assumptions for consistency of the functional MLP
Note
$$\Psi^p_n = \arg\min_{\Psi \in \mathcal{C}^n_q} L_n\Psi$$
and suppose that:
(A1) $G : \mathbb{R} \to [0, 1]$ is monotone, non-decreasing, with $\lim_{t \to +\infty} G(t) = 1$ and $\lim_{t \to -\infty} G(t) = 0$;
(A2) $\lim_{n \to +\infty} \frac{p_n \alpha_n \log(p_n \alpha_n)}{n} = 0$ and $\exists\, \delta > 0$: $\lim_{n \to +\infty} \frac{\alpha_n^2}{n^{1-\delta}} = 0$;
(A3) $Y$ is square integrable.
Strong consistency of the projection-based functional MLP
Theorem [Rossi and Conan-Guez, 2006]
Under assumptions (A1)-(A3),
$$\lim_{p \to +\infty} \lim_{n \to +\infty} L\Psi^p_n = L^* \quad \text{a.s.}$$
Sketch of the proof: The proof is divided into two parts:
1. The first one shows that
$$L^*_p = \inf_{\Psi : \mathbb{R}^p \to \mathbb{R}} E\left( (\Psi(P_p(X)) - Y)^2 \right) \xrightarrow{p \to +\infty} L^*.$$
2. The second one shows that, for any fixed $p$,
$$\lim_{n \to +\infty} L\Psi^p_n = L^*_p \quad \text{a.s.}$$
Remark 1: The limitation of this result lies in the fact that it is a double limit and that no indication is given on the way $n$ and $p$ should be linked.
Remark 2: The principle of the proof is very general and can be applied to any other consistent method in $\mathbb{R}^p$.
Presentation of k-nearest neighbors for functional classification
This method was introduced in [Biau et al., 2005] for the binary classification case and a regression version exists in the work of [Laloë, 2008].
Context: We are given a random pair $(X, Y)$ taking its values in $\mathcal{X} \times \{-1, 1\}$ where $(\mathcal{X}, \langle \cdot, \cdot \rangle_{\mathcal{X}})$ is a Hilbert space. Moreover, we are given $n$ i.i.d. observations of $(X, Y)$ denoted $(x_1, y_1), \ldots, (x_n, y_n)$.
Functional k-nearest neighbors also consists in using the projection of the data on a Hilbert basis $(\phi_j)_{j \geq 1}$: denote $x^d_i = (x_{i1}, \ldots, x_{id})$ where $\forall\, i = 1, \ldots, n$ and $\forall\, j = 1, \ldots, d$, $x_{ij} = \langle x_i, \phi_j \rangle_{\mathcal{X}}$.
k-nearest neighbors for d-dimensional data is then performed on the dataset $(x^d_1, y_1), \ldots, (x^d_n, y_n)$: if, for all $u \in \mathbb{R}^d$,
$$V_k(u) := \left\{ i \in [\![1, n]\!] \ : \ \| x^d_i - u \|_{\mathbb{R}^d} \text{ belongs to the } k \text{ smallest of these values} \right\},$$
then
$$\Psi_n : x \in \mathcal{X} \mapsto \begin{cases} -1 & \text{if } \sum_{i \in V_k(x^d)} \mathbb{I}_{\{y_i = -1\}} > \sum_{i \in V_k(x^d)} \mathbb{I}_{\{y_i = 1\}} \\ +1 & \text{otherwise} \end{cases}$$
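A minimal sketch of this classifier (not from the slides): project the curves on the first $d$ basis functions and run ordinary k-NN on the coefficients; scikit-learn's KNeighborsClassifier plays the role of the finite-dimensional rule above, and the simulated curves and Fourier basis are assumptions of the sketch.

```python
# A minimal sketch of functional k-NN (assumptions: curves on a regular grid of
# [0, 1], a Fourier basis, and sklearn's KNeighborsClassifier as the R^d rule).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 200)
n, d, k = 200, 5, 7

# Two classes of simulated curves: different dominant frequencies plus noise.
y = rng.choice([-1, 1], size=n)
freq = np.where(y == 1, 2.0, 3.0)[:, None]
x_curves = np.sin(2 * np.pi * freq * t) + 0.3 * rng.normal(size=(n, t.size))

# Coefficients on the first d Fourier basis functions (discrete inner products).
basis = [np.ones_like(t)]
for j in range(1, (d - 1) // 2 + 1):
    basis += [np.cos(2 * np.pi * j * t), np.sin(2 * np.pi * j * t)]
coeffs = x_curves @ np.array(basis[:d]).T / t.size   # x^d_i, shape (n, d)

knn = KNeighborsClassifier(n_neighbors=k).fit(coeffs, y)
print("training error:", np.mean(knn.predict(coeffs) != y))
```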
Selection of the dimension of projection and of the parameter k
$d$ and $k$ are then automatically selected from the dataset by a validation strategy:
1. For all $k \in \mathbb{N}^*$ and all $d \in \mathbb{N}^*$, compute the k-nearest neighbors classifier, $\Psi^{d,l,k}_n$, from the data $\{(x^d_i, y_i)\}_{i=1,\ldots,l}$.
2. Choose
$$(d_n, k_n) = \arg\min_{k \in \mathbb{N}^*,\, d \in \mathbb{N}^*} \left\{ \frac{1}{n-l} \sum_{i=l+1}^{n} \mathbb{I}_{\{\Psi^{d,l,k}_n(x_i) \neq y_i\}} + \frac{\lambda_d}{\sqrt{n-l}} \right\}$$
where $\lambda_d$ is a penalization term to avoid the selection of (possibly overfitting) very large dimensions.
Then, define $\Psi_n = \Psi^{d_n, l, k_n}_n$.
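Below is a minimal sketch of this validation step (not from the slides): assuming the projection coefficients `coeffs` and labels `y` from the previous sketch, the first $l$ observations form the training part and the remaining $n - l$ the validation part; $(d, k)$ is chosen by minimizing the penalized validation error, with an arbitrary illustrative choice $\lambda_d = \sqrt{d}$.

```python
# A minimal sketch of the (d, k) selection by penalized validation (assumptions:
# `coeffs` holds projection coefficients on a larger basis and `y` the labels, as
# in the previous sketch; lambda_d = sqrt(d) is an arbitrary illustrative choice).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

l = len(y) // 2                      # size of the training part
train, valid = slice(0, l), slice(l, len(y))

best, best_score = None, np.inf
for d in range(1, coeffs.shape[1] + 1):
    for k in range(1, 16):
        clf = KNeighborsClassifier(n_neighbors=k).fit(coeffs[train, :d], y[train])
        err = np.mean(clf.predict(coeffs[valid, :d]) != y[valid])
        score = err + np.sqrt(d) / np.sqrt(len(y) - l)   # penalized validation error
        if score < best_score:
            best, best_score = (d, k), score

d_n, k_n = best
print("selected dimension and number of neighbors:", d_n, k_n)
# Psi_n is the k_n-NN rule on the first d_n coefficients, trained on the first l points.
psi_n = KNeighborsClassifier(n_neighbors=k_n).fit(coeffs[train, :d_n], y[train])
```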
An oracle inequality
Oracle inequality [Biau et al., 2005]
Note $\Delta = \sum_{d=1}^{+\infty} e^{-2\lambda_d^2} < +\infty$. Then there exists $C > 0$, depending only on $\Delta$, such that $\forall\, l > 1/\Delta$,
$$E(L\Psi_n) - L^* \leq \inf_{d \geq 1} \left\{ (L^*_d - L^*) + \inf_{1 \leq k \leq l} \left( E(L\Psi^{l,k,d}_n) - L^*_d \right) + \frac{\lambda_d}{\sqrt{n-l}} \right\} + C \sqrt{\frac{\log l}{n-l}}.$$
Then, we have:
- by a martingale property: $\lim_{d \to +\infty} L^*_d = L^*$;
- by consistency of k-nearest neighbors in $\mathbb{R}^d$: for all $d \geq 1$, $\inf_{1 \leq k \leq l} E(L\Psi^{l,k,d}_n) - L^*_d \xrightarrow{l \to +\infty} 0$;
- the rest of the right-hand side of the inequality can be made to converge to 0 as $n$ grows to infinity, for suitable choices of $n$, $l$ and $\lambda_d$.
Consistency of functional k-nearest neighbors
Theorem [Biau et al., 2005]
Suppose that
$$\lim_{n \to +\infty} l = +\infty, \qquad \lim_{n \to +\infty} (n - l) = +\infty, \qquad \lim_{n \to +\infty} \frac{\log l}{n - l} = 0;$$
then
$$\lim_{n \to +\infty} E(L\Psi_n) = L^*.$$
3 SVM
A binary classification problem
- Suppose that we are given a random pair of variables $(X, Y)$ where $X$ takes its values in $\mathbb{R}^d$ and $Y$ takes its values in $\{-1, 1\}$.
- Moreover, we know $n$ i.i.d. realizations of the random pair $(X, Y)$ that we denote by $(x_1, y_1), \ldots, (x_n, y_n)$.
- We try to learn a classification machine, $\Psi_n$, of the form $x \mapsto \mathrm{Sign}(\langle x, w \rangle_{\mathbb{R}^d} + b)$, or, more precisely, of the form $x \mapsto \mathrm{Sign}(\langle \phi(x), w \rangle_{\mathcal{X}} + b)$, where the exact nature of $\phi$ and $\mathcal{X}$ will be discussed later.
Linear discrimination with optimal margin
Learn $\Psi_n : x \mapsto \mathrm{Sign}(\langle x, w \rangle_{\mathbb{R}^d} + b)$.
[Figure: separating hyperplane with normal vector $w$, the margin and the support vectors.]
$w$ is such that:
$$\min_{w, b} \| w \|_{\mathbb{R}^d} \quad \text{such that } y_i (w^T x_i + b) \geq 1,\ 1 \leq i \leq n.$$
Linear discrimination with soft margin
Learn $\Psi_n : x \mapsto \mathrm{Sign}(\langle x, w \rangle_{\mathbb{R}^d} + b)$.
[Figure: separating hyperplane with normal vector $w$, the margin and the support vectors.]
$w$ is such that:
$$\min_{w, b, \xi} \| w \|_{\mathbb{R}^d} + C \sum_{i=1}^{n} \xi_i, \quad \text{where } y_i (w^T x_i + b) \geq 1 - \xi_i, \ \xi_i \geq 0, \ 1 \leq i \leq n.$$
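A minimal sketch (not from the slides) of this soft-margin linear classifier using scikit-learn, where the parameter C plays the role described above and the toy data are an assumption of the sketch:

```python
# A minimal sketch of a soft-margin linear SVM (assumption: toy 2-class data in
# R^2; sklearn's SVC with a linear kernel solves the problem stated above).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n = 200
y = rng.choice([-1, 1], size=n)
x = rng.normal(size=(n, 2)) + 1.5 * y[:, None]   # two overlapping Gaussian clouds

clf = SVC(kernel="linear", C=1.0).fit(x, y)
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("number of support vectors:", clf.n_support_.sum())
print("training error:", np.mean(clf.predict(x) != y))
```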
Mapping the data onto a high dimensional space
Learn $\Psi_n : x \mapsto \mathrm{Sign}(\langle \phi(x), w \rangle_{\mathcal{X}} + b)$.
[Figure: the nonlinear map $\phi$ sends the original space $\mathbb{R}^d$ to the feature space $\mathcal{X}$.]
$w$ is such that:
$$(P_{C,\mathcal{X}}) \quad \min_{w, b, \xi} \| w \|_{\mathcal{X}} + C \sum_{i=1}^{n} \xi_i, \quad \text{where } y_i \left( \langle w, \phi(x_i) \rangle_{\mathcal{X}} + b \right) \geq 1 - \xi_i, \ \xi_i \geq 0, \ 1 \leq i \leq n.$$
Details about the feature space: a regularization framework
Regularization framework: $(P_{C,\mathcal{X}}) \Leftrightarrow$
$$(R_{\lambda,\mathcal{X}}) \quad \min_{F \in \mathcal{X}} \frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i F(x_i)\right) + \lambda \| F \|_{\mathcal{X}}.$$
Dual problem: $(P_{C,\mathcal{X}}) \Leftrightarrow$
$$(D_{C,\mathcal{X}}) \quad \max_{\alpha} \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{X}} \quad \text{where } \sum_{i=1}^{n} \alpha_i y_i = 0, \ 0 \leq \alpha_i \leq C, \ 1 \leq i \leq n.$$
Inner product in $\mathcal{X}$:
$$\forall\, u, v, \quad K(u, v) = \langle \phi(u), \phi(v) \rangle_{\mathcal{X}}$$
Examples of useful kernels
Provided that
$$\forall\, m \in \mathbb{N}^*,\ (u_i)_{i=1,\ldots,m} \in \mathbb{R}^d,\ (\alpha_i)_{i=1,\ldots,m} \in \mathbb{R}, \quad \sum_{i,j=1}^{m} \alpha_i \alpha_j K(u_i, u_j) \geq 0,$$
$K$ can be used as a kernel mapping the original data onto a high dimensional feature space [Aronszajn, 1950].
- The Gaussian kernel: $K(u, v) = e^{-\sigma^2 \| u - v \|^2_{\mathbb{R}^d}}$ for $\sigma > 0$;
- The exponential kernel: $K(u, v) = e^{\langle u, v \rangle_{\mathbb{R}^d}}$;
- Vovk's real infinite polynomial: $K(u, v) = (1 - \langle u, v \rangle_{\mathbb{R}^d})^{-\alpha}$ for $\alpha > 0$;
- ...
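As a small illustration (not from the slides), the positivity condition can be checked numerically on a sample Gram matrix, and scikit-learn's SVC accepts such a kernel directly as a callable; the Gaussian kernel with the $\sigma^2$ parametrization used above and the toy labelling are assumptions of the sketch.

```python
# A minimal sketch (assumptions: toy data in R^2; the Gaussian kernel is written
# with the sigma^2 parametrization of the slide). It checks that a sample Gram
# matrix is positive semi-definite and plugs the kernel into sklearn's SVC.
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(U, V, sigma2=1.0):
    """K(u, v) = exp(-sigma^2 * ||u - v||^2), evaluated for all pairs of rows."""
    sq_dists = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sigma2 * sq_dists)

rng = np.random.default_rng(3)
x = rng.normal(size=(100, 2))
y = np.where(x[:, 0] * x[:, 1] > 0, 1, -1)   # a nonlinear labelling of the plane

gram = gaussian_kernel(x, x)
print("smallest eigenvalue of the Gram matrix:", np.linalg.eigvalsh(gram).min())

clf = SVC(kernel=gaussian_kernel, C=1.0).fit(x, y)
print("training error:", np.mean(clf.predict(x) != y))
```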
Assumptions for consistency of SVM in $\mathbb{R}^d$
Suppose that:
(A1) $X$ takes its values in a compact subset $W$ of $\mathbb{R}^d$;
(A2) the kernel $K$ is universal on $W$ (i.e., the set of all functions $\{u \in W \mapsto \langle w, \phi(u) \rangle_{\mathcal{X}},\ w \in \mathcal{X}\}$ is dense in $C^0(W)$);
(A3) $\forall\, \epsilon > 0$, the $\epsilon$-covering number of $\phi(W)$, that is, the minimum number of balls of radius $\epsilon$ needed to cover $\phi(W)$, satisfies $\mathcal{N}(K, \epsilon) = O(\epsilon^{-\alpha})$ for some $\alpha > 0$;
(A4) the regularization parameter $C$ depends on $n$ through: $\lim_{n \to +\infty} n C_n = +\infty$ and $C_n = O(n^{\beta - 1})$ for some $0 < \beta < 1/\alpha$.
Remark: The Gaussian kernel satisfies all these assumptions, with $\mathcal{N}(K, \epsilon) = O(\epsilon^{-d})$.
Consistency of SVM in $\mathbb{R}^d$
Theorem [Steinwart, 2002]
Under assumptions (A1)-(A4), SVM are consistent.
Why SVM can’t be directly applied to functional data?
Suppose now that X takes its values in a Hilbert space (X, ., . X).
Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 30 / 39
Why SVM can’t be directly applied to functional data?
Suppose now that X takes its values in a Hilbert space (X, ., . X).
1 We already talk about the advantages of regularization or
projection of the functional data as a pre-processing;
Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 30 / 39
Why SVM can’t be directly applied to functional data?
Suppose now that X takes its values in a Hilbert space (X, ., . X).
1 We already talk about the advantages of regularization or
projection of the functional data as a pre-processing;
2 The consistency result can’t be directly applied with infinite
dimensional data because the condition of covering number for
infinite dimensional Gaussian kernel is not valid.
Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 30 / 39
A consistent approach based on the ideas of [Biau et al., 2005]
1. $(\psi_j)_j$ is a Hilbert basis of $\mathcal{X}$: projection on $(\psi_j)_{j=1,\ldots,d}$;
2. Choice of the parameters $a \equiv (d \in \mathbb{N},\ K \in \mathcal{J}_d,\ C \in [0; C_d])$:
- Splitting the data: $B_1 = (x_1, y_1), \ldots, (x_l, y_l)$ and $B_2 = (x_{l+1}, y_{l+1}), \ldots, (x_n, y_n)$;
- Learn a SVM on $B_1$: $\Psi^{l,a}_n$;
- Validation on $B_2$:
$$a^* = \arg\min_{a} \left\{ L_{n-l}\Psi^{l,a}_n + \frac{\lambda_d}{\sqrt{n-l}} \right\} \quad \text{with} \quad L_{n-l}\Psi^{l,a}_n = \frac{1}{n-l} \sum_{i=l+1}^{n} \mathbb{I}_{\{\Psi^{l,a}_n(x_i) \neq y_i\}}.$$
⇒ The obtained classifier is denoted $\Psi_n$.
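Below is a minimal sketch of this split/validation procedure (not the exact implementation of [Rossi and Villa, 2006]): the candidate parameters are a projection dimension $d$, a Gaussian kernel width, and a cost $C$ bounded by an assumed $C_d$; the penalized validation error selects $a^*$. The grids, $C_d = 10$ and $\lambda_d = \sqrt{d}$ are arbitrary illustrative choices, and `coeffs`, `y` are assumed from the earlier functional k-NN sketch.

```python
# A minimal sketch of the consistent SVM procedure (assumptions: `coeffs` holds
# projection coefficients of the curves on a Hilbert basis and `y` the labels, as
# in the earlier functional k-NN sketch; J_d is a small grid of Gaussian kernel
# widths, C_d = 10 and lambda_d = sqrt(d) are arbitrary illustrative choices).
import numpy as np
from sklearn.svm import SVC

l = len(y) // 2
train, valid = slice(0, l), slice(l, len(y))

best_a, best_score = None, np.inf
for d in range(1, coeffs.shape[1] + 1):            # projection dimension
    for gamma in [0.1, 1.0, 10.0]:                 # kernel parameter, K in J_d
        for C in [0.1, 1.0, 10.0]:                 # cost in [0, C_d] with C_d = 10
            clf = SVC(kernel="rbf", gamma=gamma, C=C)
            clf.fit(coeffs[train, :d], y[train])
            err = np.mean(clf.predict(coeffs[valid, :d]) != y[valid])
            score = err + np.sqrt(d) / np.sqrt(len(y) - l)
            if score < best_score:
                best_a, best_score = (d, gamma, C), score

d_star, gamma_star, C_star = best_a
print("selected parameters a* =", best_a)
psi_n = SVC(kernel="rbf", gamma=gamma_star, C=C_star).fit(coeffs[train, :d_star], y[train])
```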
Assumptions
Assumption on X:
(A1) $X$ takes its values in a bounded subset of $\mathcal{X}$.
Assumptions on the parameters: $\forall\, d \geq 1$,
(A2) $\mathcal{J}_d$ is a finite set;
(A3) $\exists\, K_d \in \mathcal{J}_d$ such that $K_d$ is universal on any compact of $\mathbb{R}^d$ and $\exists\, \nu_d > 0 : \mathcal{N}(K_d, \epsilon) = O(\epsilon^{-\nu_d})$;
(A4) $C_d > 1$;
(A5) $\sum_{d \geq 1} |\mathcal{J}_d|\, e^{-2\lambda_d^2} < +\infty$.
Assumptions on the training/validation sets:
(A6) $\lim_{n \to +\infty} l = +\infty$;
(A7) $\lim_{n \to +\infty} (n - l) = +\infty$;
(A8) $\lim_{n \to +\infty} \frac{l \log(n-l)}{n-l} = 0$.
Consistency
Theorem [Rossi and Villa, 2006]
Under assumptions (A1)-(A8), $\Psi_n$ is consistent:
$$E(L\Psi_n) \xrightarrow{n \to +\infty} L^*.$$
Ideas of the proof: The proof follows a sketch similar to the one in [Biau et al., 2005], but the result allows the use of a continuous parameter (the regularization parameter $C$), based on the shatter coefficient of a class of functions that includes SVM.
Application 1: Voice recognition
Description of the data and methods
- 3 problems and, for each problem, 100 records sampled at 82 192 points;
- consistent approach:
  - projection on a trigonometric basis;
  - splitting the data base into 50 curves (training) / 49 curves (validation);
  - performances calculated by leave-one-out.
Results (classification error rates)

Prob.       k-nn   QDA    SVM gau. (proj)   SVM lin. (proj)   SVM lin. (direct)
yes/no      10%    7%     10%               19%               58%
boat/goat   21%    35%    8%                29%               46%
sh/ao       16%    19%    12%               25%               47%
Regression by SVM
- Suppose that we are given a random pair of variables $(X, Y)$ where $X$ takes its values in $\mathbb{R}^d$ and $Y$ takes its values in $\mathbb{R}$.
- Moreover, we know $n$ i.i.d. realizations of the random pair $(X, Y)$ that we denote by $(x_1, y_1), \ldots, (x_n, y_n)$.
- Once again, we try to learn a regression machine, $\Psi_n$, of the form $x \mapsto \langle \phi(x), w \rangle_{\mathcal{X}} + b$, where the exact nature of $\phi$ and $\mathcal{X}$ will be discussed later.
Generalization of the classification case to regression
$w$ and $b$ minimize
$$C \| w \|^2_{\mathcal{X}} + \sum_{i=1}^{n} L^k_\epsilon(x_i, y_i, w)$$
where $L^k_\epsilon$, for $k = 1, 2$ and $\epsilon \geq 0$, is the $\epsilon$-insensitive loss function:
$$L^k_\epsilon(x_i, y_i, w) = \max\left( 0,\ | y_i - \langle \phi(x_i), w \rangle_{\mathcal{X}} |^k - \epsilon \right),$$
or any other loss function.
Remark: A dual version, which is a quadratic optimization problem in $\mathbb{R}^n$, also exists.
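A minimal sketch (not from the slides) of support vector regression with this loss, using scikit-learn's SVR, whose epsilon parameter corresponds to $\epsilon$ and which implements the $k = 1$ case; the toy data are an assumption of the sketch.

```python
# A minimal sketch of SVM regression with the epsilon-insensitive loss
# (assumptions: toy 1-D data; sklearn's SVR implements the k = 1 case above).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
x = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sinc(x[:, 0]) + 0.1 * rng.normal(size=200)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(x, y)
print("number of support vectors:", reg.support_.size)
print("training MSE:", np.mean((reg.predict(x) - y) ** 2))
```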
A kernel ridge regression
When $\epsilon$ is equal to 0 and $k = 2$, the previous problem becomes: find $w$ and $b$ that minimize
$$\Upsilon \| w \|^2_{\mathcal{X}} + \sum_{i=1}^{n} \left( y_i - \langle \phi(x_i), w \rangle_{\mathcal{X}} \right)^2,$$
which can be viewed as a kernel ridge regression. This method is also known under the name of Least Squares SVM (LS-SVM).
A multidimensional consistency result is available in [Christmann and Steinwart, 2007]: the same method as for SVM classifiers can then be used for the regression case!
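A minimal sketch (not from the slides) of this special case with scikit-learn's KernelRidge, whose alpha parameter plays the role of the regularization weight $\Upsilon$ above, up to the normalization conventions of the library:

```python
# A minimal sketch of kernel ridge regression, i.e. the epsilon = 0, k = 2 case
# above (assumptions: same toy data as in the SVR sketch; KernelRidge's `alpha`
# is the regularization weight, standing in for Upsilon).
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(4)
x = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sinc(x[:, 0]) + 0.1 * rng.normal(size=200)

krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(x, y)
print("training MSE:", np.mean((krr.predict(x) - y) ** 2))
```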
References
Further details for the references are given in the joint document.
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 

FDA and Statistical learning theory

  • 1. FDA and Statistical learning theory Nathalie Villa-Vialaneix - nathalie.villa@math.univ-toulouse.fr http://www.nathalievilla.org Institut de Mathématiques de Toulouse - IUT de Carcassonne, Université de Perpignan France La Havane, September 17th, 2008 Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 1 / 39
  • 2. Table of contents 1 Basics in statistical learning theory 2 Examples of consistent methods for FDA 3 SVM 4 References Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 2 / 39
  • 3. Purpose of statistical learning theory In the previous presentations, the aim was to find an estimator that is “close” to the model. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 3 / 39
  • 4. Purpose of statistical learning theory In the previous presentations, the aim was to find an estimator that is “close” to the model. The aim of statistical learning theory is slightly different: find a regression function that has a small error. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 3 / 39
  • 5. Purpose of statistical learning theory In the previous presentations, the aim was to find an estimator that is “close” to the model. The aim of statistical learning theory is slightly different: find a regression function that has a small error. More precisely, binary classification case: we are given a pair of random variable, (X, Y) from X × {−1, 1} where X is any topological space; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 3 / 39
  • 6. Purpose of statistical learning theory In the previous presentations, the aim was to find an estimator that is “close” to the model. The aim of statistical learning theory is slightly different: find a regression function that has a small error. More precisely, binary classification case: we are given a pair of random variable, (X, Y) from X × {−1, 1} where X is any topological space; we observe n i.i.d. realizations of (X, Y), (x1, y1), . . . , (xn, yn), called the learning set; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 3 / 39
  • 7. Purpose of statistical learning theory In the previous presentations, the aim was to find an estimator that is “close” to the model. The aim of statistical learning theory is slightly different: find a regression function that has a small error. More precisely, binary classification case: we are given a pair of random variable, (X, Y) from X × {−1, 1} where X is any topological space; we observe n i.i.d. realizations of (X, Y), (x1, y1), . . . , (xn, yn), called the learning set; we intend to find a function, built from (x1, y1), . . . , (xn, yn), Ψn : X → {−1, 1} that minimizes P (Ψn (X) Y) . Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 3 / 39
  • 8. First remarks on the aim 1 infΨ:X→{−1,1} P (Ψ(X) Y) is the “target” for the expectancy of the error. This lower bound for error expectancy is called Bayes risk, denoted by L∗. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 4 / 39
  • 9. First remarks on the aim 1 infΨ:X→{−1,1} P (Ψ(X) Y) is the “target” for the expectancy of the error. This lower bound for error expectancy is called Bayes risk, denoted by L∗. 2 Generally, Ψn is chosen in a restricted class of functions from X to {−1, 1}, C; then the performance of Ψn can be quantified by: P (Ψn (X) Y) − L∗ Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 4 / 39
  • 10. First remarks on the aim 1 infΨ:X→{−1,1} P (Ψ(X) Y) is the “target” for the expectancy of the error. This lower bound for error expectancy is called Bayes risk, denoted by L∗. 2 Generally, Ψn is chosen in a restricted class of functions from X to {−1, 1}, C; then the performance of Ψn can be quantified by: P (Ψn (X) Y) − L∗ = P (Ψn (X) Y) − inf Ψ∈C P (Ψ(X) Y) + inf Ψ∈C P (Ψ(X) Y) − L∗ Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 4 / 39
  • 11. First remarks on the aim 1 infΨ:X→{−1,1} P (Ψ(X) Y) is the “target” for the expectancy of the error. This lower bound for error expectancy is called Bayes risk, denoted by L∗. 2 Generally, Ψn is chosen in a restricted class of functions from X to {−1, 1}, C; then the performance of Ψn can be quantified by: P (Ψn (X) Y) − L∗ = P (Ψn (X) Y) − inf Ψ∈C P (Ψ(X) Y) Error due to the training method + inf Ψ∈C P (Ψ(X) Y) − L∗ Error due to the choice of C Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 4 / 39
  • 12. Consistency From this last remark, we can define: Definition: Weak consistency An algorithm leading to the classifier Ψn is said to be (weakly universally) consistent if, for every distribution of the random pair (X, Y), we have E(LΨn) → L∗ as n → +∞, where LΨn := P(Ψn(X) ≠ Y | (xi, yi)i). Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 5 / 39
  • 13. Consistency From this last remark, we can define: Definition: Weak consistency An algorithm leading to the classifier Ψn is said to be (weakly universally) consistent if, for every distribution of the random pair (X, Y), we have E(LΨn) → L∗ as n → +∞, where LΨn := P(Ψn(X) ≠ Y | (xi, yi)i). Definition: Strong consistency Moreover, it is said to be strongly (universally) consistent if, for every distribution of the random pair (X, Y), we have LΨn → L∗ a.s. as n → +∞. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 5 / 39
  • 14. Choice of C and of Ψn 1 The choice of C is of major importance to obtain good performance of Ψn: a class C that is too small (not rich enough) gives a poor value of inf_{Ψ∈C} P(Ψ(X) ≠ Y) − L∗, but a class that is too rich gives a poor value of P(Ψn(X) ≠ Y) − inf_{Ψ∈C} P(Ψ(X) ≠ Y) because the learning algorithm tends to overfit the data. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 6 / 39
  • 15. Choice of C and of Ψn 1 The choice of C is of major importance to obtain good performance of Ψn: a class C that is too small (not rich enough) gives a poor value of inf_{Ψ∈C} P(Ψ(X) ≠ Y) − L∗, but a class that is too rich gives a poor value of P(Ψn(X) ≠ Y) − inf_{Ψ∈C} P(Ψ(X) ≠ Y) because the learning algorithm tends to overfit the data. 2 A naive approach to find a good Ψn over the class C could be to minimize the empirical risk over C: Ψn := arg min_{Ψ∈C} LnΨ where LnΨ := (1/n) Σ_{i=1}^n I{Ψ(xi) ≠ yi}. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 6 / 39
  • 16. Choice of C and of Ψn 1 The choice of C is of major importance to obtain good performance of Ψn: a class C that is too small (not rich enough) gives a poor value of inf_{Ψ∈C} P(Ψ(X) ≠ Y) − L∗, but a class that is too rich gives a poor value of P(Ψn(X) ≠ Y) − inf_{Ψ∈C} P(Ψ(X) ≠ Y) because the learning algorithm tends to overfit the data. 2 A naive approach to find a good Ψn over the class C could be to minimize the empirical risk over C: Ψn := arg min_{Ψ∈C} LnΨ where LnΨ := (1/n) Σ_{i=1}^n I{Ψ(xi) ≠ yi}. The work of [Vapnik, 1995, Vapnik, 1998] links the choice of C to the accuracy of the empirical risk. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 6 / 39
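To make the empirical risk minimization principle concrete, here is a minimal Python sketch (an illustration added to this transcript, not taken from the slides): it draws a toy learning set in R^2 and picks, inside a small finite class of random hyperplanes, the classifier with the smallest empirical 0-1 risk. The toy data and the candidate class are assumptions made only for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy learning set: n points in R^2 with labels in {-1, 1}
    n = 200
    X = rng.normal(size=(n, 2))
    y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

    # A small finite class C of linear classifiers x -> sign(a^T x + b)
    candidates = [(rng.normal(size=2), rng.normal()) for _ in range(500)]

    def empirical_risk(a, b):
        # L_n(Psi) = (1/n) sum_i 1{Psi(x_i) != y_i}
        predictions = np.where(X @ a + b > 0, 1, -1)
        return np.mean(predictions != y)

    a_star, b_star = min(candidates, key=lambda ab: empirical_risk(*ab))
    print("empirical risk of the selected classifier:", empirical_risk(a_star, b_star))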
  • 17. VC-dimension A way to quantify the “richness” of a class of functions is to calculate its VC-dimension: Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 7 / 39
  • 18. VC-dimension A way to quantify the “richness” of a class of functions is to calculate its VC-dimension: Definition: VC-dimension A class of classifiers (functions from X to {−1, 1}), C, is said to shatter a set of data points z1, z2, . . . , zd ∈ X if, for all assignments of labels to those points, m1, m2, . . . , md ∈ {−1, 1}, there exists a Ψ ∈ C such that: ∀ i = 1, . . . , d, Ψ(zi) = mi. The VC-dimension of a class of functions C is the maximum number of points that can be shattered by C. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 7 / 39
  • 19. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 20. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 2 points are shattered by C: Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 21. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 2 points are shattered by C: Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 22. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 2 points are shattered by C: Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 23. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 3 points are shattered by C: Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 24. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 3 points are shattered by C: Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 25. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 3 points are shattered by C: Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 26. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 3 points are shattered by C: Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 27. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 3 points are shattered by C: Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 28. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 4 points cannot be shattered by C: Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 29. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 4 points cannot be shattered by C: no Ψ ∈ C can have value 1 on the red circles and −1 on the black ones. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 30. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 4 points cannot be shattered by C: hence, the VC-dimension of C is 3. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
  • 31. Example: VC-dimension of hyperplanes Suppose that X = R^2 and C = {Ψ : x ∈ R^2 → ±Sign(aᵀx + b), a ∈ R^2 and b ∈ R}. Then, 4 points cannot be shattered by C: More generally, the VC-dimension of hyperplanes in R^d is d + 1. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
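The shattering argument above can be checked numerically. The sketch below (an added illustration, not from the slides) enumerates every labeling of a point configuration and tests whether some linear classifier realizes it; a linear SVM with a very large C is used only as a numerical proxy for exact linear separability, and the helpers is_shattered, three_pts and four_pts are illustrative assumptions.

    import itertools
    import numpy as np
    from sklearn.svm import SVC

    def is_shattered(points):
        # The points are shattered if every labeling in {-1, 1}^n is realized
        # by some x -> sign(a^T x + b); a large-C linear SVM approximates this test.
        for labels in itertools.product([-1, 1], repeat=len(points)):
            labels = np.array(labels)
            if len(set(labels)) == 1:
                continue  # constant labelings are trivially realizable (take a = 0)
            clf = SVC(kernel="linear", C=1e6).fit(points, labels)
            if (clf.predict(points) != labels).any():
                return False
        return True

    three_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    four_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # XOR-type configuration
    print(is_shattered(three_pts))   # expected True: 3 points can be shattered
    print(is_shattered(four_pts))    # expected False: the VC-dimension of lines in R^2 is 3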
  • 32. Relationship between VC-dimension and empirical error Theorem [Vapnik, 1995, Vapnik, 1998] With probability at least 1 − η, sup_{Ψ∈C} |E(LΨ) − LnΨ| ≤ √((VC(C) − log(η/4)) / n). Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 9 / 39
  • 33. An alternative to VC-dimension Remark: In most cases, the VC-dimension is not precise enough. Then, another quantity can also be considered: Definition: Shatter coefficient The n-th shatter coefficient of the set of functions C is the maximum number of partitions into two sets of n points that can be obtained from C. This number, denoted by S(C, n), is at most equal to 2^n. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 10 / 39
  • 34. An alternative to VC-dimension Remark: In most cases, the VC-dimension is not precise enough. Then, another quantity can also be considered: Definition: Shatter coefficient The n-th shatter coefficient of the set of functions C is the maximum number of partitions into two sets of n points that can be obtained from C. This number, denoted by S(C, n), is at most equal to 2^n. Example: If C is the space of hyperplanes in R^d, S(C, n) = 2^n if n ≤ d, and S(C, n) = 2^{d+1} = 2^{VC(C)} if n ≥ d + 1. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 10 / 39
  • 35. An alternative to VC-dimension Remark: In most cases, the VC-dimension is not precise enough. Then, another quantity can also be considered: Definition: Shatter coefficient The n-th shatter coefficient of the set of functions C is the maximum number of partitions into two sets of n points that can be obtained from C. This number, denoted by S(C, n), is at most equal to 2^n. Example: If C is the space of hyperplanes in R^d, S(C, n) = 2^n if n ≤ d, and S(C, n) = 2^{d+1} = 2^{VC(C)} if n ≥ d + 1. Remark: For all n > 2, S(C, n) ≤ n^{VC(C)}. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 10 / 39
  • 36. Vapnik-Chervonenkis inequality Theorem [Vapnik, 1995, Vapnik, 1998] P(sup_{Ψ∈C} |LnΨ − E(LΨ)| > ε) ≤ S(C, n) e^{−nε²/32}. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 11 / 39
  • 37. Vapnik-Chervonenkis inequality Theorem [Vapnik, 1995, Vapnik, 1998] P(sup_{Ψ∈C} |LnΨ − E(LΨ)| > ε) ≤ S(C, n) e^{−nε²/32}. Consequences for the learning error on C: If Ψn has been chosen by minimizing the empirical risk, i.e., Ψn := arg min_{Ψ∈C} (1/n) Σ_{i=1}^n I{Ψ(xi) ≠ yi} Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 11 / 39
  • 38. Vapnik-Chervonenkis inequality Theorem [Vapnik, 1995, Vapnik, 1998] P(sup_{Ψ∈C} |LnΨ − E(LΨ)| > ε) ≤ S(C, n) e^{−nε²/32}. Consequences for the learning error on C: If Ψn has been chosen by minimizing the empirical risk, i.e., Ψn := arg min_{Ψ∈C} (1/n) Σ_{i=1}^n I{Ψ(xi) ≠ yi}, then, as P(E(LΨn) − inf_{Ψ∈C} E(LΨ) > ε) ≤ P(2 sup_{Ψ∈C} |LnΨ − E(LΨ)| > ε), Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 11 / 39
  • 39. Vapnik-Chervonenkis inequality Theorem [Vapnik, 1995, Vapnik, 1998] P(sup_{Ψ∈C} |LnΨ − E(LΨ)| > ε) ≤ S(C, n) e^{−nε²/32}. Consequences for the learning error on C: If Ψn has been chosen by minimizing the empirical risk, i.e., Ψn := arg min_{Ψ∈C} (1/n) Σ_{i=1}^n I{Ψ(xi) ≠ yi}, then P(E(LΨn) − inf_{Ψ∈C} E(LΨ) > ε) ≤ S(C, n) e^{−nε²/128}. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 11 / 39
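As a purely numerical illustration of this last bound (added here, using the estimate S(C, n) ≤ n^{VC(C)} stated earlier; the sample sizes and the value of ε are arbitrary choices), one can see how large n must be before the right-hand side becomes informative for hyperplanes in R^2 (VC-dimension 3):

    import numpy as np

    def vc_bound(n, vc_dim, eps):
        # S(C, n) e^{-n eps^2 / 128}, with S(C, n) bounded by n^{VC(C)} (valid for n > 2)
        return n ** vc_dim * np.exp(-n * eps ** 2 / 128)

    for n in [10_000, 100_000, 1_000_000]:
        print(n, vc_bound(n, vc_dim=3, eps=0.1))  # the bound drops below 1 only for large n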
  • 40. Additional notes for the regression case The same theory can be developed for the regression case under additional assumptions. To summarize, let (X, Y) be a random pair taking its values in X × R and (x1, y1), . . . , (xn, yn) a training set of n i.i.d. realizations of (X, Y). Then, we can introduce the risk as, for example, the mean square error: for Ψ : X → R, LΨ = E((Ψ(X) − Y)² | (xi, yi)i); Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 12 / 39
  • 41. Additional notes for the regression case The same theory can be developed for the regression case under additional assumptions. To summarize, let (X, Y) be a random pair taking its values in X × R and (x1, y1), . . . , (xn, yn) a training set of n i.i.d. realizations of (X, Y). Then, we can introduce the risk as, for example, the mean square error: for Ψ : X → R, LΨ = E((Ψ(X) − Y)² | (xi, yi)i); the Bayes risk is: L∗ = inf_{Ψ:X→R} E((Ψ(X) − Y)²). In this case, L∗ = E(LΨ∗) where Ψ∗ = E(Y | X); Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 12 / 39
  • 42. Additional notes for the regression case The same theory can be developed for the regression case under additional assumptions. To summarize, let (X, Y) be a random pair taking its values in X × R and (x1, y1), . . . , (xn, yn) a training set of n i.i.d. realizations of (X, Y). Then, we can introduce the risk as, for example, the mean square error: for Ψ : X → R, LΨ = E((Ψ(X) − Y)² | (xi, yi)i); the Bayes risk is: L∗ = inf_{Ψ:X→R} E((Ψ(X) − Y)²). In this case, L∗ = E(LΨ∗) where Ψ∗ = E(Y | X); the empirical risk: for Ψ : X → R, LnΨ = (1/n) Σ_{i=1}^n (yi − Ψ(xi))². Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 12 / 39
  • 43. Additional notes for the regression case The same theory can be developed for the regression case under additional assumptions. To summarize, let (X, Y) be a random pair taking its values in X × R and (x1, y1), . . . , (xn, yn) a training set of n i.i.d. realizations of (X, Y). Then, we can introduce the risk as, for example, the mean square error: for Ψ : X → R, LΨ = E((Ψ(X) − Y)² | (xi, yi)i); the Bayes risk is: L∗ = inf_{Ψ:X→R} E((Ψ(X) − Y)²). In this case, L∗ = E(LΨ∗) where Ψ∗ = E(Y | X); the empirical risk: for Ψ : X → R, LnΨ = (1/n) Σ_{i=1}^n (yi − Ψ(xi))². Hence, in this case, a consistent regression scheme, Ψn, satisfies: lim_{n→+∞} E(LΨn) = L∗; and a strongly consistent regression scheme, Ψn, satisfies: lim_{n→+∞} LΨn = L∗ a.s. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 12 / 39
  • 44. Table of contents 1 Basics in statistical learning theory 2 Examples of consistent methods for FDA 3 SVM 4 References Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 13 / 39
  • 45. Reminder on the functional multilayer perceptron (projection approach) Data: Suppose that we are given a random pair (X, Y) taking its values in X × R where (X, ⟨., .⟩_X) is a Hilbert space. Suppose also that we have n i.i.d. observations of (X, Y), (x1, y1), . . . , (xn, yn). Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 14 / 39
  • 46. Reminder on the functional multilayer perceptron (projection approach) Data: Suppose that we are given a random pair (X, Y) taking its values in X × R where (X, ⟨., .⟩_X) is a Hilbert space. Suppose also that we have n i.i.d. observations of (X, Y), (x1, y1), . . . , (xn, yn). Functional MLP: The projection approach is based on the knowledge of a Hilbert basis of X, denoted by (φk)_{k≥1}. The data (xi)i and also the weights of the MLP are projected on this basis truncated at q: C^n_q = {Ψ : X → R : ∀ x ∈ X, Ψ(x) = Σ_{l=1}^{pn} w^{(2)}_l G(w^{(0)}_l + Σ_{k=1}^q β^{(1)}_{lk} (Pq(x))_k), Σ_{l=1}^{pn} |w^{(2)}_l| ≤ αn} where (pn)n is a sequence of integers, (αn)n is a sequence of positive real numbers, G is a given continuous function and the weights (w^{(2)}_l)_l, (w^{(0)}_l)_l and (β^{(1)}_{lk})_{l,k} have to be learned from the data set in R (see Presentation 2 for further details). Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 14 / 39
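A minimal Python sketch of this projection idea follows (my own illustration, not the authors' implementation): toy curves are reduced to their first q coefficients on a trigonometric basis, and a one-hidden-layer perceptron with a sigmoid activation stands in for the constrained class C^n_q. The simulated curves, the basis and all parameter values are assumptions made for the example.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(1)
    n, n_grid, q = 300, 100, 5
    t = np.linspace(0, 1, n_grid)

    # Toy functional data: noisy combinations of sines, with a scalar response
    coefs = rng.normal(size=(n, 2))
    curves = coefs @ np.array([np.sin(2 * np.pi * t), np.sin(4 * np.pi * t)])
    curves += 0.1 * rng.normal(size=curves.shape)
    y = coefs[:, 0] ** 2 + coefs[:, 1]

    # Projection P_q: coefficients on the truncated basis (numerical quadrature)
    basis = np.array([np.sin(2 * np.pi * (j + 1) * t) for j in range(q)])
    projected = curves @ basis.T / n_grid

    mlp = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                       max_iter=5000, random_state=0)
    mlp.fit(projected, y)
    print("training MSE:", np.mean((mlp.predict(projected) - y) ** 2))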
  • 47. Assumptions for consistency of functional MLP Note Ψ^p_n = arg min_{Ψ∈C^n_q} LnΨ. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 15 / 39
  • 48. Assumptions for consistency of functional MLP Note Ψ^p_n = arg min_{Ψ∈C^n_q} LnΨ and suppose that: (A1) G : R → [0, 1] is monotone, non-decreasing, with lim_{t→+∞} G(t) = 1 and lim_{t→−∞} G(t) = 0; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 15 / 39
  • 49. Assumptions for consistency of functional MLP Note Ψ^p_n = arg min_{Ψ∈C^n_q} LnΨ and suppose that: (A1) G : R → [0, 1] is monotone, non-decreasing, with lim_{t→+∞} G(t) = 1 and lim_{t→−∞} G(t) = 0; (A2) lim_{n→+∞} pn αn log(pn αn)/n = 0 and ∃ δ > 0: lim_{n→+∞} α²n/n^{1−δ} = 0; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 15 / 39
  • 50. Assumptions for consistency of functional MLP Note Ψ^p_n = arg min_{Ψ∈C^n_q} LnΨ and suppose that: (A1) G : R → [0, 1] is monotone, non-decreasing, with lim_{t→+∞} G(t) = 1 and lim_{t→−∞} G(t) = 0; (A2) lim_{n→+∞} pn αn log(pn αn)/n = 0 and ∃ δ > 0: lim_{n→+∞} α²n/n^{1−δ} = 0; (A3) Y is square integrable. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 15 / 39
  • 51. Strong consistency of the projection-based functional MLP Theorem [Rossi and Conan-Guez, 2006] Under assumptions (A1)-(A3), lim_{p→+∞} lim_{n→+∞} LΨ^p_n = L∗ a.s. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 16 / 39
  • 52. Strong consistency of the projection-based functional MLP Theorem [Rossi and Conan-Guez, 2006] Under assumptions (A1)-(A3), lim_{p→+∞} lim_{n→+∞} LΨ^p_n = L∗ a.s. Sketch of the proof: The proof is divided into two parts: 1 The first one shows that L∗_p = inf_{Ψ:R^p→R} E((Ψ(Pp(X)) − Y)² | (xi, yi)i) → L∗ a.s. as p → +∞. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 16 / 39
  • 53. Strong consistency of the projection-based functional MLP Theorem [Rossi and Conan-Guez, 2006] Under assumptions (A1)-(A3), lim_{p→+∞} lim_{n→+∞} LΨ^p_n = L∗ a.s. Sketch of the proof: The proof is divided into two parts: 1 The first one shows that L∗_p = inf_{Ψ:R^p→R} E((Ψ(Pp(X)) − Y)² | (xi, yi)i) → L∗ a.s. as p → +∞. 2 The second one shows that, for any fixed p, lim_{n→+∞} LΨ^p_n = L∗_p. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 16 / 39
  • 54. Strong consistency of the projection-based functional MLP Theorem [Rossi and Conan-Guez, 2006] Under assumptions (A1)-(A3), lim_{p→+∞} lim_{n→+∞} LΨ^p_n = L∗ a.s. Sketch of the proof: The proof is divided into two parts: 1 The first one shows that L∗_p = inf_{Ψ:R^p→R} E((Ψ(Pp(X)) − Y)² | (xi, yi)i) → L∗ a.s. as p → +∞. 2 The second one shows that, for any fixed p, lim_{n→+∞} LΨ^p_n = L∗_p. Remark: The limitation of this result lies in the fact that it is a double limit and that no indication on the way n and p should be linked is given. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 16 / 39
  • 55. Strong consistency of the projection-based functional MLP Theorem [Rossi and Conan-Guez, 2006] Under assumptions (A1)-(A3), lim_{p→+∞} lim_{n→+∞} LΨ^p_n = L∗ a.s. Sketch of the proof: The proof is divided into two parts: 1 The first one shows that L∗_p = inf_{Ψ:R^p→R} E((Ψ(Pp(X)) − Y)² | (xi, yi)i) → L∗ a.s. as p → +∞. 2 The second one shows that, for any fixed p, lim_{n→+∞} LΨ^p_n = L∗_p. Remark 2: The principle of the proof is very general and can be applied to any other consistent method in R^p. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 16 / 39
  • 56. Presentation of k-nearest neighbors for functional classification This method has been introduced in [Biau et al., 2005] for the binary classification case and a regression version exists in the work of [Laloë, 2008]. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 17 / 39
  • 57. Presentation of k-nearest neighbors for functional classification This method has been introduced in [Biau et al., 2005] for the binary classification case and a regression version exists in the work of [Laloë, 2008]. Context: We are given a random pair (X, Y) taking its values in X × {−1, 1} where (X, ⟨., .⟩_X) is a Hilbert space. Moreover, we are given n i.i.d. observations of (X, Y), denoted (x1, y1), . . . , (xn, yn). Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 17 / 39
  • 58. Presentation of k-nearest neighbors for functional classification This method has been introduced in [Biau et al., 2005] for the binary classification case and a regression version exists in the work of [Laloë, 2008]. Context: We are given a random pair (X, Y) taking its values in X × {−1, 1} where (X, ⟨., .⟩_X) is a Hilbert space. Moreover, we are given n i.i.d. observations of (X, Y), denoted (x1, y1), . . . , (xn, yn). Functional k-nearest neighbors also consists in using the projection of the data on a Hilbert basis, (φj)_{j≥1}: denote x^d_i = (xi1, . . . , xid) where ∀ i = 1, . . . , n and ∀ j = 1, . . . , d, xij = ⟨xi, φj⟩_X. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 17 / 39
  • 59. Presentation of k-nearest neighbors for functional classification This method has been introduced in [Biau et al., 2005] for the binary classification case and a regression version exists in the work of [Laloë, 2008]. Context: We are given a random pair (X, Y) taking its values in X × {−1, 1} where (X, ⟨., .⟩_X) is a Hilbert space. Moreover, we are given n i.i.d. observations of (X, Y), denoted (x1, y1), . . . , (xn, yn). Functional k-nearest neighbors also consists in using the projection of the data on a Hilbert basis, (φj)_{j≥1}: denote x^d_i = (xi1, . . . , xid) where ∀ i = 1, . . . , n and ∀ j = 1, . . . , d, xij = ⟨xi, φj⟩_X. k-nearest neighbors for d-dimensional data is then performed on the dataset (x^d_1, y1), . . . , (x^d_n, yn): if, for all u ∈ R^d, Vk(u) := {i ∈ [[1, n]] : ‖x^d_i − u‖_{R^d} belongs to the k smallest of these values}, then Ψn : x ∈ X → −1 if Σ_{i∈Vk(x^d)} I{yi=−1} > Σ_{i∈Vk(x^d)} I{yi=1}, +1 otherwise. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 17 / 39
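The following short Python sketch (an added illustration; the simulated curves, the basis and the parameter values are assumptions) mimics this functional k-nearest neighbors rule: each curve is replaced by its first d coefficients on a trigonometric basis, and an ordinary k-NN classifier is run on those coefficients.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(2)
    n, n_grid, d, k = 200, 100, 4, 7
    t = np.linspace(0, 1, n_grid)

    # Toy curves from two classes that differ in their low-frequency content
    y = rng.choice([-1, 1], size=n)
    curves = np.outer((y == 1) + 0.5, np.sin(2 * np.pi * t)) + 0.3 * rng.normal(size=(n, n_grid))

    # x_i^d = (<x_i, phi_1>, ..., <x_i, phi_d>), approximated by quadrature
    basis = np.array([np.sin(2 * np.pi * (j + 1) * t) for j in range(d)])
    coefficients = curves @ basis.T / n_grid

    knn = KNeighborsClassifier(n_neighbors=k).fit(coefficients, y)
    print("training error:", np.mean(knn.predict(coefficients) != y))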
  • 60. Selection of the dimension of projection and of the parameter k d and k are then automatically selected from the dataset by a validation strategy: 1 For all k ∈ N∗ and all d ∈ N∗, . Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 18 / 39
  • 61. Selection of the dimension of projection and of the parameter k d and k are then automatically selected from the dataset by a validation strategy: 1 For all k ∈ N∗ and all d ∈ N∗, compute the k-nearest neighbors classifier, Ψd,l,k n , from data {(xd i , yi)}i=1,...,l. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 18 / 39
  • 62. Selection of the dimension of projection and of the parameter k d and k are then automatically selected from the dataset by a validation strategy: 1 For all k ∈ N∗ and all d ∈ N∗, compute the k-nearest neighbors classifier, Ψ^{d,l,k}_n, from data {(x^d_i, yi)}_{i=1,...,l}. 2 Choose (dn, kn) = arg min_{k∈N∗, d∈N∗} (1/(n−l)) Σ_{i=l+1}^n I{Ψ^{d,l,k}_n(xi) ≠ yi} + λd/√(n−l), where λd is a penalization term to avoid the selection of (possibly overfitting) very large dimensions. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 18 / 39
  • 63. Selection of the dimension of projection and of the parameter k d and k are then automatically selected from the dataset by a validation strategy: 1 For all k ∈ N∗ and all d ∈ N∗, compute the k-nearest neighbors classifier, Ψ^{d,l,k}_n, from data {(x^d_i, yi)}_{i=1,...,l}. 2 Choose (dn, kn) = arg min_{k∈N∗, d∈N∗} (1/(n−l)) Σ_{i=l+1}^n I{Ψ^{d,l,k}_n(xi) ≠ yi} + λd/√(n−l), where λd is a penalization term to avoid the selection of (possibly overfitting) very large dimensions. Then, define Ψn = Ψ^{dn,l,kn}_n. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 18 / 39
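A sketch of this penalized split-sample selection of (d, k) is given below (an added illustration; the grids over d and k, the penalty λd and the toy data are all assumptions chosen for the example):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(3)
    n, l, n_grid = 200, 120, 100
    t = np.linspace(0, 1, n_grid)
    y = rng.choice([-1, 1], size=n)
    curves = np.outer((y == 1) + 0.5, np.sin(2 * np.pi * t)) + 0.3 * rng.normal(size=(n, n_grid))

    def project(c, d):
        # coefficients of the curves on the first d basis functions
        basis = np.array([np.sin(2 * np.pi * (j + 1) * t) for j in range(d)])
        return c @ basis.T / n_grid

    best, best_criterion = None, np.inf
    for d in range(1, 11):
        lam_d = np.sqrt(np.log(d + 1))      # penalty lambda_d (illustrative choice)
        coefficients = project(curves, d)
        for k in range(1, 16):
            clf = KNeighborsClassifier(n_neighbors=k).fit(coefficients[:l], y[:l])
            validation_error = np.mean(clf.predict(coefficients[l:]) != y[l:])
            criterion = validation_error + lam_d / np.sqrt(n - l)
            if criterion < best_criterion:
                best, best_criterion = (d, k), criterion
    print("selected (d, k):", best)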
  • 64. An oracle inequality Oracle inequality [Biau et al., 2005] Note ∆ = Σ_{d=1}^{+∞} e^{−2λ²_d} < +∞. Then, there exists C > 0, depending only on ∆, such that ∀ l > 1/∆, E(LΨn) − L∗ ≤ inf_{d≥1} { (L∗_d − L∗) + inf_{1≤k≤l} [E(LΨ^{l,k,d}_n) − L∗_d] + λd/√(n−l) } + C log l/(n − l). Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 19 / 39
  • 65. An oracle inequality Oracle inequality [Biau et al., 2005] Note ∆ = Σ_{d=1}^{+∞} e^{−2λ²_d} < +∞. Then, there exists C > 0, depending only on ∆, such that ∀ l > 1/∆, E(LΨn) − L∗ ≤ inf_{d≥1} { (L∗_d − L∗) + inf_{1≤k≤l} [E(LΨ^{l,k,d}_n) − L∗_d] + λd/√(n−l) } + C log l/(n − l). Then, we have: by a martingale property: lim_{d→+∞} L∗_d = L∗; by consistency of k-nearest neighbors in R^d: for all d ≥ 1, inf_{1≤k≤l} E(LΨ^{l,k,d}_n) − L∗_d → 0 as l → +∞; the rest of the right-hand side of the inequality can be made to converge to 0 as n grows to infinity, for suitable choices of n, l and λd. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 19 / 39
  • 66. Consistency of functional k-nearest neighbors Theorem [Biau et al., 2005] Suppose that lim_{n→+∞} l = +∞, lim_{n→+∞} (n − l) = +∞ and lim_{n→+∞} log l/(n − l) = 0; then lim_{n→+∞} E(LΨn) = L∗. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 20 / 39
  • 67. Table of contents 1 Basics in statistical learning theory 2 Examples of consistent methods for FDA 3 SVM 4 References Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 21 / 39
  • 68. A binary classification problem Suppose that we are given a random pair of variables (X, Y) where X takes its values in R^d and Y takes its values in {−1, 1}. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 22 / 39
  • 69. A binary classification problem Suppose that we are given a random pair of variables (X, Y) where X takes its values in R^d and Y takes its values in {−1, 1}. Moreover, we know n i.i.d. realizations of the random pair (X, Y) that we denote by (x1, y1), . . . , (xn, yn). Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 22 / 39
  • 70. A binary classification problem Suppose that we are given a random pair of variables (X, Y) where X takes its values in R^d and Y takes its values in {−1, 1}. Moreover, we know n i.i.d. realizations of the random pair (X, Y) that we denote by (x1, y1), . . . , (xn, yn). We try to learn a classification machine, Ψn, of the form x → Sign(⟨x, w⟩_{R^d} + b), or, more precisely, of the form x → Sign(⟨φ(x), w⟩_X + b) where the exact nature of φ and X will be discussed later. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 22 / 39
  • 71. Linear discrimination with optimal margin Learn Ψn : x → Sign ( x, w Rd + b) Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 23 / 39
  • 72. Linear discrimination with optimal margin Learn Ψn : x → Sign ( x, w Rd + b) Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 23 / 39
  • 73. Linear discrimination with optimal margin Learn Ψn : x → Sign ( x, w Rd + b) w margin: 1 w 2 Rd Support Vector Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 23 / 39
  • 74. Linear discrimination with optimal margin Learn Ψn : x → Sign(⟨x, w⟩_{R^d} + b) [figure: separating hyperplane, margin and support vectors] w is such that: min_{w,b} ‖w‖_{R^d} such that yi(wᵀxi + b) ≥ 1, 1 ≤ i ≤ n. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 23 / 39
  • 75. Linear discrimination with soft margin Learn Ψn : x → Sign ( x, w Rd + b) Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 24 / 39
  • 76. Linear discrimination with soft margin Learn Ψn : x → Sign ( x, w Rd + b) Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 24 / 39
  • 77. Linear discrimination with soft margin Learn Ψn : x → Sign ( x, w Rd + b) w margin: 1 w 2 Rd Support Vector Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 24 / 39
  • 78. Linear discrimination with soft margin Learn Ψn : x → Sign(⟨x, w⟩_{R^d} + b) [figure: separating hyperplane, margin, support vectors and slack variables] w is such that: min_{w,b,ξ} ‖w‖_{R^d} + C Σ_{i=1}^n ξi, where: yi(wᵀxi + b) ≥ 1 − ξi, 1 ≤ i ≤ n, and ξi ≥ 0, 1 ≤ i ≤ n. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 24 / 39
  • 79. Mapping the data onto a high dimensional space Learn Ψn : x → Sign ( φ(x), w X + b) Original space Rd Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 25 / 39
  • 80. Mapping the data onto a high dimensional space Learn Ψn : x → Sign ( φ(x), w X + b) Original space Rd Feature space X φ (nonlinear) Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 25 / 39
  • 81. Mapping the data onto a high dimensional space Learn Ψn : x → Sign ( φ(x), w X + b) Original space Rd Feature space X φ (nonlinear) Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 25 / 39
  • 82. Mapping the data onto a high dimensional space Learn Ψn : x → Sign(⟨φ(x), w⟩_X + b) [figure: nonlinear mapping φ from the original space R^d to the feature space X] w is such that: (P_{C,X}) min_{w,b,ξ} ‖w‖_X + C Σ_{i=1}^n ξi, where: yi(⟨w, φ(xi)⟩_X + b) ≥ 1 − ξi, 1 ≤ i ≤ n, and ξi ≥ 0, 1 ≤ i ≤ n. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 25 / 39
  • 83. Details about the feature space: a regularization framework Regularization framework: (P_{C,X}) ⇔ (R_{λ,X}) min_{F∈X} (1/n) Σ_{i=1}^n max(0, 1 − yi F(xi)) + λ‖F‖_X. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 26 / 39
  • 84. Details about the feature space: a regularization framework Regularization framework: (P_{C,X}) ⇔ (R_{λ,X}) min_{F∈X} (1/n) Σ_{i=1}^n max(0, 1 − yi F(xi)) + λ‖F‖_X. Dual problem: (P_{C,X}) ⇔ (D_{C,X}) max_α Σ_{i=1}^n αi − Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj ⟨φ(xi), φ(xj)⟩_X where Σ_{i=1}^n αi yi = 0 and 0 ≤ αi ≤ C, 1 ≤ i ≤ n. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 26 / 39
  • 85. Details about the feature space: a regularization framework Regularization framework: (P_{C,X}) ⇔ (R_{λ,X}) min_{F∈X} (1/n) Σ_{i=1}^n max(0, 1 − yi F(xi)) + λ‖F‖_X. Dual problem: (P_{C,X}) ⇔ (D_{C,X}) max_α Σ_{i=1}^n αi − Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj ⟨φ(xi), φ(xj)⟩_X where Σ_{i=1}^n αi yi = 0 and 0 ≤ αi ≤ C, 1 ≤ i ≤ n. Inner product in X: ∀ u, v ∈ X, K(u, v) = ⟨φ(u), φ(v)⟩_X. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 26 / 39
  • 86. Examples of useful kernels Provided that ∀ m ∈ N∗, (ui)_{i=1,...,m} ∈ R^d, (αi)_{i=1,...,m} ∈ R, Σ_{i,j=1}^m αi αj K(ui, uj) ≥ 0, K can be used as a kernel mapping the original data onto a high dimensional feature space: [Aronszajn, 1950]. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 27 / 39
  • 87. Examples of useful kernels Provided that ∀ m ∈ N∗, (ui)_{i=1,...,m} ∈ R^d, (αi)_{i=1,...,m} ∈ R, Σ_{i,j=1}^m αi αj K(ui, uj) ≥ 0, K can be used as a kernel mapping the original data onto a high dimensional feature space: [Aronszajn, 1950]. The Gaussian kernel: K(u, v) = e^{−σ²‖u−v‖²_{R^d}} for σ > 0; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 27 / 39
  • 88. Examples of useful kernels Provided that ∀ m ∈ N∗, (ui)_{i=1,...,m} ∈ R^d, (αi)_{i=1,...,m} ∈ R, Σ_{i,j=1}^m αi αj K(ui, uj) ≥ 0, K can be used as a kernel mapping the original data onto a high dimensional feature space: [Aronszajn, 1950]. The Gaussian kernel: K(u, v) = e^{−σ²‖u−v‖²_{R^d}} for σ > 0; The exponential kernel: K(u, v) = e^{⟨u,v⟩_{R^d}}; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 27 / 39
  • 89. Examples of useful kernels Provided that ∀ m ∈ N∗, (ui)_{i=1,...,m} ∈ R^d, (αi)_{i=1,...,m} ∈ R, Σ_{i,j=1}^m αi αj K(ui, uj) ≥ 0, K can be used as a kernel mapping the original data onto a high dimensional feature space: [Aronszajn, 1950]. The Gaussian kernel: K(u, v) = e^{−σ²‖u−v‖²_{R^d}} for σ > 0; The exponential kernel: K(u, v) = e^{⟨u,v⟩_{R^d}}; Vovk's real infinite polynomial: K(u, v) = (1 − ⟨u, v⟩_{R^d})^{−α} for α > 0; . . . Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 27 / 39
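As a small added check (not from the slides), the positivity condition above can be verified numerically for the Gaussian kernel on a random cloud of points: the kernel matrix must be positive semidefinite. The helper gaussian_kernel and the random points are illustrative.

    import numpy as np

    def gaussian_kernel(U, V, sigma=1.0):
        # K(u, v) = exp(-sigma^2 ||u - v||^2)
        squared_distances = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
        return np.exp(-sigma ** 2 * squared_distances)

    rng = np.random.default_rng(4)
    U = rng.normal(size=(30, 5))
    K = gaussian_kernel(U, U)
    # sum_ij alpha_i alpha_j K(u_i, u_j) >= 0 for all alpha iff K is positive semidefinite
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # expected: True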
  • 90. Assumptions for consistency of SVM in R^d Suppose that (A1) X takes its values in a compact subset W of R^d; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 28 / 39
  • 91. Assumptions for consistency of SVM in R^d Suppose that (A1) X takes its values in a compact subset W of R^d; (A2) the kernel K is universal on W (i.e., the set of all functions {u ∈ W → ⟨w, φ(u)⟩_X, w ∈ X} is dense in C⁰(W)); Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 28 / 39
  • 92. Assumptions for consistency of SVM in R^d Suppose that (A1) X takes its values in a compact subset W of R^d; (A2) the kernel K is universal on W (i.e., the set of all functions {u ∈ W → ⟨w, φ(u)⟩_X, w ∈ X} is dense in C⁰(W)); (A3) ∀ ε > 0, the ε-covering number of φ(W), that is, the minimum number of balls of radius ε needed to cover φ(W), is such that N(K, ε) = O(ε^{−α}) for some α > 0; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 28 / 39
  • 93. Assumptions for consistency of SVM in R^d Suppose that (A1) X takes its values in a compact subset W of R^d; (A2) the kernel K is universal on W (i.e., the set of all functions {u ∈ W → ⟨w, φ(u)⟩_X, w ∈ X} is dense in C⁰(W)); (A3) ∀ ε > 0, the ε-covering number of φ(W), that is, the minimum number of balls of radius ε needed to cover φ(W), is such that N(K, ε) = O(ε^{−α}) for some α > 0; (A4) the regularization parameter C depends on n with lim_{n→+∞} nCn = +∞ and Cn = O(n^{β−1}) for some 0 < β < 1/α. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 28 / 39
  • 94. Assumptions for consistency of SVM in R^d Suppose that (A1) X takes its values in a compact subset W of R^d; (A2) the kernel K is universal on W (i.e., the set of all functions {u ∈ W → ⟨w, φ(u)⟩_X, w ∈ X} is dense in C⁰(W)); (A3) ∀ ε > 0, the ε-covering number of φ(W), that is, the minimum number of balls of radius ε needed to cover φ(W), is such that N(K, ε) = O(ε^{−α}) for some α > 0; (A4) the regularization parameter C depends on n with lim_{n→+∞} nCn = +∞ and Cn = O(n^{β−1}) for some 0 < β < 1/α. Remark: The Gaussian kernel satisfies all these assumptions, with N(K, ε) = O(ε^{−d}). Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 28 / 39
  • 95. Consistency of SVM in Rd Theorem [Steinwart, 2002] Under assumptions (A1)-(A4), SVM are consistent. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 29 / 39
  • 96. Why can't SVM be directly applied to functional data? Suppose now that X takes its values in a Hilbert space (X, ⟨., .⟩_X). Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 30 / 39
  • 97. Why can't SVM be directly applied to functional data? Suppose now that X takes its values in a Hilbert space (X, ⟨., .⟩_X). 1 We have already discussed the advantages of regularizing or projecting the functional data as a pre-processing step; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 30 / 39
  • 98. Why can't SVM be directly applied to functional data? Suppose now that X takes its values in a Hilbert space (X, ⟨., .⟩_X). 1 We have already discussed the advantages of regularizing or projecting the functional data as a pre-processing step; 2 The consistency result cannot be directly applied to infinite dimensional data because the covering number condition is not valid for the infinite dimensional Gaussian kernel. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 30 / 39
  • 99. A consistent approach based on the ideas of [Biau et al., 2005] 1 (ψj)j is a Hilbert basis of X: Projection on (ψj)j=1,...,d ; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 31 / 39
  • 100. A consistent approach based on the ideas of [Biau et al., 2005] 1 (ψj)j is a Hilbert basis of X: Projection on (ψj)j=1,...,d ; 2 Choice of the parameters: a ≡ d ∈ N, K ∈ Jd, C ∈ [0; Cd] Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 31 / 39
  • 101. A consistent approach based on the ideas of [Biau et al., 2005] 1 (ψj)j is a Hilbert basis of X: Projection on (ψj)j=1,...,d ; 2 Choice of the parameters: a ≡ d ∈ N, K ∈ Jd, C ∈ [0; Cd] Splitting the data : B1 = (x1, y1), . . . , (xl, yl) and B2 = (xl+1, yl+1), . . . , (xn, yn); Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 31 / 39
  • 102. A consistent approach based on the ideas of [Biau et al., 2005] 1 (ψj)j is a Hilbert basis of X: Projection on (ψj)j=1,...,d ; 2 Choice of the parameters: a ≡ d ∈ N, K ∈ Jd, C ∈ [0; Cd] Splitting the data : B1 = (x1, y1), . . . , (xl, yl) and B2 = (xl+1, yl+1), . . . , (xn, yn); Learn a SVM on B1: Ψl,a n ; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 31 / 39
  • 103. A consistent approach based on the ideas of [Biau et al., 2005] 1 (ψj)j is a Hilbert basis of X: Projection on (ψj)_{j=1,...,d}; 2 Choice of the parameters: a ≡ (d, K, C) with d ∈ N, K ∈ Jd, C ∈ [0; Cd]. Splitting the data: B1 = (x1, y1), . . . , (xl, yl) and B2 = (xl+1, yl+1), . . . , (xn, yn); Learn an SVM on B1: Ψ^{l,a}_n; Validation on B2: a∗ = arg min_a L_{n−l}Ψ^{l,a}_n + λd/√(n−l) with L_{n−l}Ψ^{l,a}_n = (1/(n−l)) Σ_{i=l+1}^n I{Ψ^{l,a}_n(xi) ≠ yi}. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 31 / 39
  • 104. A consistent approach based on the ideas of [Biau et al., 2005] 1 (ψj)j is a Hilbert basis of X: Projection on (ψj)_{j=1,...,d}; 2 Choice of the parameters: a ≡ (d, K, C) with d ∈ N, K ∈ Jd, C ∈ [0; Cd]. Splitting the data: B1 = (x1, y1), . . . , (xl, yl) and B2 = (xl+1, yl+1), . . . , (xn, yn); Learn an SVM on B1: Ψ^{l,a}_n; Validation on B2: a∗ = arg min_a L_{n−l}Ψ^{l,a}_n + λd/√(n−l) with L_{n−l}Ψ^{l,a}_n = (1/(n−l)) Σ_{i=l+1}^n I{Ψ^{l,a}_n(xi) ≠ yi}. ⇒ The obtained classifier is denoted Ψn. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 31 / 39
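The sketch below is an added Python illustration of the scheme just described (with toy curves, and with arbitrary grids over d, the Gaussian kernel parameter and C standing in for Jd and [0; Cd]): it splits the sample, trains an SVM on the projected coefficients for each candidate parameter value, and keeps the one minimizing the penalized validation error.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(5)
    n, l, n_grid = 200, 120, 100
    t = np.linspace(0, 1, n_grid)
    y = rng.choice([-1, 1], size=n)
    curves = np.outer((y == 1) + 0.5, np.sin(2 * np.pi * t)) + 0.3 * rng.normal(size=(n, n_grid))

    def project(c, d):
        basis = np.array([np.sin(2 * np.pi * (j + 1) * t) for j in range(d)])
        return c @ basis.T / n_grid

    best, best_criterion = None, np.inf
    for d in range(1, 8):                        # projection dimension
        lam_d = np.sqrt(np.log(d + 1))           # penalty (illustrative choice)
        coefficients = project(curves, d)
        for gamma in [0.1, 1.0, 10.0]:           # Gaussian kernel parameters (a finite set J_d)
            for C in [0.1, 1.0, 10.0]:           # regularization constants in [0; C_d]
                clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(coefficients[:l], y[:l])
                validation_error = np.mean(clf.predict(coefficients[l:]) != y[l:])
                criterion = validation_error + lam_d / np.sqrt(n - l)
                if criterion < best_criterion:
                    best, best_criterion = (d, gamma, C), criterion
    print("selected (d, gamma, C):", best)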
  • 105. Assumptions Assumptions on X (A1) X takes its values in a bounded subset of X. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 32 / 39
  • 106. Assumptions Assumption on X: (A1) X takes its values in a bounded subset of X. Assumptions on the parameters: ∀ d ≥ 1, (A2) Jd is a finite set; (A3) ∃ Kd ∈ Jd such that Kd is universal on any compact of R^d and ∃ νd > 0: N(Kd, ε) = O(ε^{−νd}); (A4) Cd > 1; (A5) Σ_{d≥1} |Jd| e^{−2λ²_d} < +∞. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 32 / 39
  • 107. Assumptions Assumption on X: (A1) X takes its values in a bounded subset of X. Assumptions on the parameters: ∀ d ≥ 1, (A2) Jd is a finite set; (A3) ∃ Kd ∈ Jd such that Kd is universal on any compact of R^d and ∃ νd > 0: N(Kd, ε) = O(ε^{−νd}); (A4) Cd > 1; (A5) Σ_{d≥1} |Jd| e^{−2λ²_d} < +∞. Assumptions on training/validation sets: (A6) lim_{n→+∞} l = +∞; (A7) lim_{n→+∞} (n − l) = +∞; (A8) lim_{n→+∞} l log(n−l)/(n−l) = 0. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 32 / 39
  • 108. Consistency Theorem [Rossi and Villa, 2006] Under assumptions (A1)-(A8), Ψn is consistent: E (LΨn) n→+∞ −−−−−−→ L∗ . Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 33 / 39
  • 109. Consistency Theorem [Rossi and Villa, 2006] Under assumptions (A1)-(A8), Ψn is consistent: E (LΨn) n→+∞ −−−−−−→ L∗ . Ideas of the proof: The proof is based on a similar sketch as in the work of [Biau et al., 2005] but the result allows the use of a continuous parameter (the regularization parameter C), based on the shatter coefficient of a class of functions that includes SVM. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 33 / 39
  • 110. Application 1: Voice recognition Description of the data and methods 3 problems and for each problem, 100 records sampled at 82 192 points; Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 34 / 39
  • 111. Application 1: Voice recognition Description of the data and methods 3 problems and for each problem, 100 records sampled at 82 192 points; consistent approach: Projection on a trigonometric basis; Splitting the data base into 50 curves (training) / 49 (validation); Performances calculated by leave-one-out. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 34 / 39
  • 112. Application 1: Voice recognition Description of the data and methods 3 problems and for each problem, 100 records sampled at 82 192 points; consistent approach: Projection on a trigonometric basis; Splitting the data base into 50 curves (training) / 49 (validation); Performances calculated by leave-one-out. Results
    Prob.       k-nn    QDA    SVM gau. (proj)   SVM lin. (proj)   SVM lin. (direct)
    yes/no      10%     7%     10%               19%               58%
    boat/goat   21%     35%    8%                29%               46%
    sh/ao       16%     19%    12%               25%               47%
    Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 34 / 39
  • 113. Regression by SVM Suppose that we are given a random pair of variables (X, Y) where X takes its values in R^d and Y takes its values in R. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 35 / 39
  • 114. Regression by SVM Suppose that we are given a random pair of variables (X, Y) where X takes its values in R^d and Y takes its values in R. Moreover, we know n i.i.d. realizations of the random pair (X, Y) that we denote by (x1, y1), . . . , (xn, yn). Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 35 / 39
  • 115. Regression by SVM Suppose that we are given a random pair of variables (X, Y) where X takes its values in R^d and Y takes its values in R. Moreover, we know n i.i.d. realizations of the random pair (X, Y) that we denote by (x1, y1), . . . , (xn, yn). Once again, we try to learn a regression machine, Ψn, of the form x → ⟨φ(x), w⟩_X + b where the exact nature of φ and X will be discussed later. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 35 / 39
  • 116. Generalization of the classification case to regression w and b minimize C‖w‖²_X + Σ_{i=1}^n L^k_ε(xi, yi, w) where L^k_ε, for k = 1, 2 and ε ≥ 0, is the ε-sensitive loss function: L^k_ε(xi, yi, w) = max(0, |yi − ⟨φ(xi), w⟩_X|^k − ε), or any other loss function. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 36 / 39
  • 117. Generalization of the classification case to regression w and b minimize C‖w‖²_X + Σ_{i=1}^n L^k_ε(xi, yi, w) where L^k_ε, for k = 1, 2 and ε ≥ 0, is the ε-sensitive loss function: L^k_ε(xi, yi, w) = max(0, |yi − ⟨φ(xi), w⟩_X|^k − ε), or any other loss function. Remark: A dual version, which is a quadratic optimization problem in R^n, also exists. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 36 / 39
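To illustrate (an added example, not from the slides), the ε-insensitive loss can be written in a few lines and a standard SVM regression can be fitted with scikit-learn's SVR; the toy data and the parameter values are assumptions.

    import numpy as np
    from sklearn.svm import SVR

    def eps_insensitive_loss(residuals, eps=0.1, k=1):
        # L^k_eps = max(0, |y_i - <phi(x_i), w>_X|^k - eps)
        return np.maximum(0.0, np.abs(residuals) ** k - eps)

    rng = np.random.default_rng(6)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

    svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
    print("mean eps-insensitive loss on the training set:",
          eps_insensitive_loss(y - svr.predict(X)).mean())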
  • 118. A kernel ridge regression When ε is equal to 0 and k = 2, the previous problem becomes: Find w and b that minimize Υ‖w‖²_X + Σ_{i=1}^n (yi − ⟨φ(xi), w⟩_X)², which can be viewed as a kernel ridge regression. This method is also known under the name of Least-Squares SVM or LS-SVM. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 37 / 39
  • 119. A kernel ridge regression When ε is equal to 0 and k = 2, the previous problem becomes: Find w and b that minimize Υ‖w‖²_X + Σ_{i=1}^n (yi − ⟨φ(xi), w⟩_X)², which can be viewed as a kernel ridge regression. This method is also known under the name of Least-Squares SVM or LS-SVM. A multidimensional consistency result is available in [Christmann and Steinwart, 2007]: the same method as for SVM classifiers can then be used for the regression case! Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 37 / 39
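A minimal sketch of this kernel ridge regression is given below (added here; it uses the representer-theorem form of the solution, omits the intercept b for brevity, and the Gaussian kernel, the toy data and the value of Υ are assumptions):

    import numpy as np

    def gaussian_kernel(U, V, sigma=1.0):
        squared_distances = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
        return np.exp(-sigma ** 2 * squared_distances)

    rng = np.random.default_rng(7)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

    upsilon = 1.0                                # regularization parameter (Upsilon in the slide)
    K = gaussian_kernel(X, X)
    alpha = np.linalg.solve(K + upsilon * np.eye(len(X)), y)   # kernel ridge solution in dual form

    def predict(X_new):
        return gaussian_kernel(X_new, X) @ alpha

    print("training MSE:", np.mean((predict(X) - y) ** 2))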
  • 120. Table of contents 1 Basics in statistical learning theory 2 Examples of consistent methods for FDA 3 SVM 4 References Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 38 / 39
  • 121. References Further details for the references are given in the joint document. Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 39 / 39