Lecture 4: kernels and associated functions
Stéphane Canu
stephane.canu@litislab.eu
Sao Paulo 2014
March 4, 2014
Plan
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
Introducing non-linearities through the feature map

The SVM decision function is
$f(x) = \sum_{j=1}^{d} x_j w_j + b = \sum_{i=1}^{n} \alpha_i\, (x_i^\top x) + b$

For an input $t = (t_1, t_2) \in \mathbb{R}^2$, the feature map
$\phi(t) = (t_1,\ t_1^2,\ t_2,\ t_2^2,\ t_1 t_2) = (x_1, x_2, x_3, x_4, x_5)$
gives a function that is linear in $x \in \mathbb{R}^5$ and quadratic in $t \in \mathbb{R}^2$.

The feature map
$\phi : \mathbb{R}^2 \longrightarrow \mathbb{R}^5, \quad t \longmapsto \phi(t) = x$
so that $x_i^\top x = \phi(t_i)^\top \phi(t)$.
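To make the feature map concrete, here is a small numerical sketch (mine, not from the slides; numpy and the function name phi are my own choices) checking that the dot product $x_i^\top x$ computed in $\mathbb{R}^5$ matches the expression that is quadratic in the original $t \in \mathbb{R}^2$:

```python
import numpy as np

def phi(t):
    """Feature map IR^2 -> IR^5: phi(t) = (t1, t1^2, t2, t2^2, t1*t2)."""
    t1, t2 = t
    return np.array([t1, t1**2, t2, t2**2, t1 * t2])

ti = np.array([1.0, 2.0])    # a training point t_i in IR^2
t  = np.array([0.5, -1.0])   # a new point t in IR^2

x_i, x = phi(ti), phi(t)     # x = phi(t) lives in IR^5
print(x_i @ x)               # linear in x: plain dot product in IR^5

# the same number, written as a quadratic function of t (expanded by hand)
print(ti[0]*t[0] + (ti[0]*t[0])**2 + ti[1]*t[1] + (ti[1]*t[1])**2
      + ti[0]*ti[1]*t[0]*t[1])
```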
Introducing non-linearities through the feature map

[figure: illustration of a non-linear mapping to a feature space, from A. Lorena & A. de Carvalho, Uma Introducão às Support Vector Machines, 2007]
Non-linear case: dictionary vs. kernel

In the non-linear case, use a dictionary of functions $\phi_j(x),\ j = 1, \ldots, p$, with possibly $p = \infty$, for instance polynomials, wavelets...
$f(x) = \sum_{j=1}^{p} w_j \phi_j(x) \quad \text{with} \quad w_j = \sum_{i=1}^{n} \alpha_i y_i \phi_j(x_i)$
so that
$f(x) = \sum_{i=1}^{n} \alpha_i y_i \underbrace{\sum_{j=1}^{p} \phi_j(x_i)\, \phi_j(x)}_{k(x_i,\, x)}$

$p \ge n$: so what? since $k(x_i, x) = \sum_{j=1}^{p} \phi_j(x_i)\, \phi_j(x)$ can be computed without touching the dictionary.
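The equivalence between the dictionary (primal) form and the kernel (dual) form above can be checked numerically; the sketch below assumes a small polynomial dictionary of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 3
X = rng.normal(size=(n, d))              # training points x_i
y = rng.choice([-1.0, 1.0], size=n)      # labels y_i
alpha = rng.normal(size=n)               # coefficients alpha_i

def dictionary(x):
    """A small polynomial dictionary phi_j(x), j = 1..p (here p = 1 + 2d)."""
    return np.concatenate(([1.0], x, x**2))

Phi = np.array([dictionary(xi) for xi in X])   # n x p matrix of features
w = Phi.T @ (alpha * y)                        # w_j = sum_i alpha_i y_i phi_j(x_i)

x_new = rng.normal(size=d)
f_primal = dictionary(x_new) @ w               # f(x) = sum_j w_j phi_j(x)
k = Phi @ dictionary(x_new)                    # k(x_i, x) = sum_j phi_j(x_i) phi_j(x)
f_dual = (alpha * y) @ k                       # f(x) = sum_i alpha_i y_i k(x_i, x)
print(np.isclose(f_primal, f_dual))            # True
```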
closed form kernel: the quadratic kernel

The quadratic dictionary in $\mathbb{R}^d$:
$\Phi : \mathbb{R}^d \to \mathbb{R}^{p},\ p = 1 + d + \frac{d(d+1)}{2}$
$s \mapsto \Phi(s) = \big(1,\ s_1, s_2, \ldots, s_d,\ s_1^2, s_2^2, \ldots, s_d^2,\ \ldots, s_i s_j, \ldots\big)$

In this case
$\Phi(s)^\top \Phi(t) = 1 + s_1 t_1 + s_2 t_2 + \cdots + s_d t_d + s_1^2 t_1^2 + \cdots + s_d^2 t_d^2 + \cdots + s_i s_j t_i t_j + \cdots$

The quadratic kernel
$s, t \in \mathbb{R}^d, \quad k(s, t) = \big(s^\top t + 1\big)^2 = 1 + 2\, s^\top t + \big(s^\top t\big)^2$
computes the dot product of the reweighted dictionary
$\Phi : \mathbb{R}^d \to \mathbb{R}^{p},\ p = 1 + d + \frac{d(d+1)}{2}$
$s \mapsto \Phi(s) = \big(1,\ \sqrt{2}\, s_1, \sqrt{2}\, s_2, \ldots, \sqrt{2}\, s_d,\ s_1^2, s_2^2, \ldots, s_d^2,\ \ldots, \sqrt{2}\, s_i s_j, \ldots\big)$

$p = 1 + d + \frac{d(d+1)}{2}$ multiplications vs. $d + 1$: use the kernel to save computation.
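A minimal sketch of the computational point above, assuming the $\sqrt{2}$-reweighted dictionary written out explicitly; it checks that $(s^\top t + 1)^2$ equals the $p$-dimensional dot product while only needing $d+1$ multiplications:

```python
import numpy as np
from itertools import combinations

def quad_features(s):
    """Reweighted quadratic dictionary: (1, sqrt(2) s_i, s_i^2, sqrt(2) s_i s_j for i<j)."""
    cross = [np.sqrt(2) * s[i] * s[j] for i, j in combinations(range(len(s)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * s, s**2, cross))

rng = np.random.default_rng(1)
d = 4
s, t = rng.normal(size=d), rng.normal(size=d)

explicit = quad_features(s) @ quad_features(t)   # p = 1 + d + d(d+1)/2 multiplications
kernel   = (s @ t + 1.0) ** 2                    # d + 1 multiplications
print(np.isclose(explicit, kernel))              # True: same value, far fewer operations
```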
kernel: features through pairwise comparisons

$x \mapsto \phi(x)$, e.g. a text mapped to its bag-of-words (BOW) representation

$\Phi$: the $n \times p$ matrix of features ($n$ examples, $p$ features)
$K$: the $n \times n$ matrix of pairwise comparisons, with
$K_{ij} = k(x_i, x_j) = \sum_{\ell=1}^{p} \phi_\ell(x_i)\, \phi_\ell(x_j)$

$K$, the matrix of pairwise comparisons, has $O(n^2)$ entries.
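A short sketch (data and names are mine) of building the $n \times n$ comparison matrix $K$ from an $n \times p$ feature matrix $\Phi$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 7
Phi = rng.normal(size=(n, p))        # n examples described by p features phi_l(x_i)

# K_ij = sum_l phi_l(x_i) phi_l(x_j): the n x n matrix of pairwise comparisons
K = Phi @ Phi.T
print(K.shape)                       # (5, 5): O(n^2) entries, whatever p is
print(np.allclose(K, K.T))           # True: K is symmetric
```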
Kernel machine

kernel as a dictionary:
$f(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)$
$\alpha_i$: the influence of example $i$; it depends on $y_i$
$k(x, x_i)$: the kernel; it does NOT depend on $y_i$

Definition (Kernel)
Let $\mathcal{X}$ be a non-empty set (the input space). A kernel is a function $k$ from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}$:
$k : \mathcal{X} \times \mathcal{X} \longrightarrow \mathbb{R}, \quad (s, t) \longmapsto k(s, t)$

semi-parametric version: given a family $q_j(x),\ j = 1, \ldots, p$,
$f(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i) + \sum_{j=1}^{p} \beta_j q_j(x)$
Kernel Machine

Definition (Kernel machine)
$A_{(x_i, y_i)_{i=1,n}}(x) = \psi\Big( \sum_{i=1}^{n} \alpha_i k(x, x_i) + \sum_{j=1}^{p} \beta_j q_j(x) \Big)$
$\alpha$ and $\beta$: parameters to be estimated.

Examples
$A(x) = \sum_{i=1}^{n} \alpha_i (x - x_i)_+^3 + \beta_0 + \beta_1 x$  (splines)
$A(x) = \mathrm{sign}\Big( \sum_{i \in I} \alpha_i \exp\big(-\tfrac{\|x - x_i\|^2}{b}\big) + \beta_0 \Big)$  (SVM)
$\mathbb{P}(y \mid x) = \tfrac{1}{Z} \exp\Big( \sum_{i \in I} \alpha_i \mathbb{1}_{\{y = y_i\}} (x^\top x_i + b)^2 \Big)$  (exponential family)
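As an illustration of the second example above, here is a hedged sketch of a Gaussian kernel machine decision rule; the data and coefficients are random placeholders, not estimated parameters:

```python
import numpy as np

def gaussian_kernel_machine(x, X, alpha, beta0, b):
    """A(x) = sign( sum_i alpha_i exp(-||x - x_i||^2 / b) + beta_0 )."""
    k = np.exp(-np.sum((X - x) ** 2, axis=1) / b)
    return np.sign(alpha @ k + beta0)

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))          # support points x_i
alpha = rng.normal(size=20)           # coefficients (random here, for illustration only)
print(gaussian_kernel_machine(np.zeros(2), X, alpha, beta0=0.1, b=1.0))
```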
Plan
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
In the beginning was the kernel...

Definition (Kernel)
a function of two variables $k$ from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}$

Definition (Positive kernel)
A kernel $k(s, t)$ on $\mathcal{X}$ is said to be positive
if it is symmetric: $k(s, t) = k(t, s)$
and if for any finite positive integer $n$:
$\forall \{\alpha_i\}_{i=1,n} \in \mathbb{R},\ \forall \{x_i\}_{i=1,n} \in \mathcal{X}, \quad \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) \ge 0$

It is strictly positive if, for $\alpha \neq 0$,
$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) > 0$
Examples of positive kernels

the linear kernel: $s, t \in \mathbb{R}^d,\ k(s, t) = s^\top t$
symmetric: $s^\top t = t^\top s$
positive:
$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, x_i^\top x_j = \Big( \sum_{i=1}^{n} \alpha_i x_i \Big)^\top \Big( \sum_{j=1}^{n} \alpha_j x_j \Big) = \Big\| \sum_{i=1}^{n} \alpha_i x_i \Big\|^2 \ \ge\ 0$

the product kernel: $k(s, t) = g(s)\, g(t)$ for some $g : \mathbb{R}^d \to \mathbb{R}$
symmetric by construction
positive:
$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, g(x_i) g(x_j) = \Big( \sum_{i=1}^{n} \alpha_i g(x_i) \Big) \Big( \sum_{j=1}^{n} \alpha_j g(x_j) \Big) = \Big( \sum_{i=1}^{n} \alpha_i g(x_i) \Big)^2 \ \ge\ 0$

$k$ is positive $\Leftrightarrow$ (its square root exists) $\Leftrightarrow$ $k(s, t) = \langle \phi_s, \phi_t \rangle$

J.P. Vert, 2006
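The positivity argument for the linear kernel can be checked numerically; a minimal sketch, with random data of my own:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 8, 3
X = rng.normal(size=(n, d))
alpha = rng.normal(size=n)

K = X @ X.T                                  # linear kernel k(x_i, x_j) = x_i' x_j
quad_form = alpha @ K @ alpha                # sum_ij alpha_i alpha_j k(x_i, x_j)
norm_sq = np.linalg.norm(X.T @ alpha) ** 2   # || sum_i alpha_i x_i ||^2
print(np.isclose(quad_form, norm_sq))        # True: the two expressions coincide
print(quad_form >= 0)                        # True: the quadratic form is non-negative
```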
Example: finite kernel

let $\phi_j,\ j = 1, \ldots, p$, be a finite dictionary of functions from $\mathcal{X}$ to $\mathbb{R}$ (polynomials, wavelets...)

the feature map and linear kernel
feature map:
$\Phi : \mathcal{X} \to \mathbb{R}^p, \quad s \mapsto \Phi(s) = \big(\phi_1(s), \ldots, \phi_p(s)\big)$
linear kernel in the feature space:
$k(s, t) = \big(\phi_1(s), \ldots, \phi_p(s)\big)^\top \big(\phi_1(t), \ldots, \phi_p(t)\big)$

e.g. the quadratic kernel: $s, t \in \mathbb{R}^d,\ k(s, t) = \big(s^\top t + b\big)^2$
feature map:
$\Phi : \mathbb{R}^d \to \mathbb{R}^{p},\ p = 1 + d + \frac{d(d+1)}{2}$
$s \mapsto \Phi(s) = \big(1,\ \sqrt{2}\, s_1, \ldots, \sqrt{2}\, s_j, \ldots, \sqrt{2}\, s_d,\ s_1^2, \ldots, s_j^2, \ldots, s_d^2,\ \ldots, \sqrt{2}\, s_i s_j, \ldots\big)$
Positive definite kernel (PDK) algebra (closure)

if $k_1(s, t)$ and $k_2(s, t)$ are two positive kernels, then
PDK form a convex cone: $\forall a_1 \in \mathbb{R}^+,\ a_1 k_1(s, t) + k_2(s, t)$ is a PDK
the product kernel $k_1(s, t)\, k_2(s, t)$ is a PDK

proofs
by linearity:
$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \big( a_1 k_1(i, j) + k_2(i, j) \big) = a_1 \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k_1(i, j) + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k_2(i, j) \ \ge\ 0$
assuming $\exists \psi$ s.t. $k_1(s, t) = \psi(s)^\top \psi(t)$:
$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k_1(x_i, x_j)\, k_2(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, \psi(x_i)^\top \psi(x_j)\, k_2(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} \big(\alpha_i \psi(x_i)\big)^\top \big(\alpha_j \psi(x_j)\big)\, k_2(x_i, x_j)$

N. Cristianini and J. Shawe-Taylor, Kernel Methods for Pattern Analysis, 2004
Kernel engineering: building PDK

for any polynomial $\varphi$ from $\mathbb{R}$ to $\mathbb{R}$ with positive coefficients, $\varphi\big(k(s, t)\big)$ is a PDK
if $\Psi$ is a function from $\mathbb{R}^d$ to $\mathbb{R}^d$, $k\big(\Psi(s), \Psi(t)\big)$ is a PDK
if $\varphi$, from $\mathbb{R}^d$ to $\mathbb{R}^+$, has its minimum at $0$, then $k(s, t) = \varphi(s + t) - \varphi(s - t)$ is a PDK
the convolution of two positive kernels is a positive kernel: $K_1 \star K_2$

Example: the Gaussian kernel is a PDK
$\exp(-\|s - t\|^2) = \exp(-\|s\|^2 - \|t\|^2 + 2 s^\top t) = \exp(-\|s\|^2)\, \exp(-\|t\|^2)\, \exp(2 s^\top t)$
$s^\top t$ is a PDK and $\exp$ is the limit of a series expansion with positive coefficients, so $\exp(2 s^\top t)$ is a PDK
$\exp(-\|s\|^2)\, \exp(-\|t\|^2)$ is a PDK as a product kernel
the product of two PDKs is a PDK
an attempt at classifying PD kernels

stationary kernels (also called translation invariant): $k(s, t) = k_s(s - t)$
radial (isotropic), e.g. gaussian: $\exp\big(-\tfrac{r^2}{b}\big),\ r = \|s - t\|$
with compact support, e.g. c.s. Matérn: $\max\big(0, 1 - \tfrac{r}{b}\big)^\kappa \big(\tfrac{r}{b}\big)^k B_k\big(\tfrac{r}{b}\big),\ \kappa \ge (d + 1)/2$

locally stationary kernels: $k(s, t) = k_1(s + t)\, k_s(s - t)$, where $k_1$ is a non-negative function and $k_s$ a radial kernel

non-stationary (projective) kernels: $k(s, t) = k_p(s^\top t)$

separable kernels: $k(s, t) = k_1(s)\, k_2(t)$ with $k_1$ and $k_2$ PDK; in this case $K = k_1 k_2^\top$ where $k_1 = \big(k_1(x_1), \ldots, k_1(x_n)\big)$

MG Genton, Classes of Kernels for Machine Learning: A Statistics Perspective, JMLR, 2002
some examples of PD kernels...

type        name          $k(s, t)$
radial      gaussian      $\exp(-\tfrac{r^2}{b}),\ r = \|s - t\|$
radial      laplacian     $\exp(-r/b)$
radial      rational      $1 - \tfrac{r^2}{r^2 + b}$
radial      loc. gauss.   $\max\big(0, 1 - \tfrac{r}{3b}\big)^d \exp(-\tfrac{r^2}{b})$
non stat.   $\chi^2$      $\exp(-r/b),\ r = \sum_k \tfrac{(s_k - t_k)^2}{s_k + t_k}$
projective  polynomial    $(s^\top t)^p$
projective  affine        $(s^\top t + b)^p$
projective  cosine        $s^\top t / (\|s\|\, \|t\|)$
projective  correlation   $\exp\big(\tfrac{s^\top t}{\|s\|\, \|t\|} - b\big)$

Most of these kernels depend on a quantity $b$ called the bandwidth.
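A few of the kernels in the table, written out as plain functions (a sketch; the parameter names b and p are my own defaults):

```python
import numpy as np

def gaussian(s, t, b=1.0):
    return np.exp(-np.sum((s - t) ** 2) / b)

def laplacian(s, t, b=1.0):
    return np.exp(-np.linalg.norm(s - t) / b)

def polynomial(s, t, p=3):
    return (s @ t) ** p

def affine(s, t, b=1.0, p=3):
    return (s @ t + b) ** p

def cosine(s, t):
    return (s @ t) / (np.linalg.norm(s) * np.linalg.norm(t))

s, t = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (gaussian, laplacian, polynomial, affine, cosine):
    print(k.__name__, k(s, t))
```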
the importance of the kernel bandwidth

for the affine kernel, bandwidth = bias:
$k(s, t) = (s^\top t + b)^p = b^p \Big( \tfrac{s^\top t}{b} + 1 \Big)^p$

for the gaussian kernel, bandwidth = influence zone:
$k(s, t) = \tfrac{1}{Z} \exp\Big( -\tfrac{\|s - t\|^2}{2\sigma^2} \Big), \quad b = 2\sigma^2$

Illustration: 1-d density estimation with $b = 1$ and $b = 2$
+ data $(x_1, x_2, \ldots, x_n)$
– Parzen estimate $\mathbb{P}(x) = \tfrac{1}{Z} \sum_{i=1}^{n} k(x, x_i)$
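A sketch of the Parzen estimate above for 1-d data, assuming a Gaussian kernel and a normalization constant Z chosen so the estimate integrates to one; the data and bandwidth values are my own:

```python
import numpy as np

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(-2, 0.5, 30), rng.normal(2, 0.5, 30)])  # 1-d sample

def parzen(x, data, b):
    """IP(x) = (1/Z) sum_i k(x, x_i) with a gaussian kernel of bandwidth b."""
    k = np.exp(-(x[:, None] - data[None, :]) ** 2 / b)
    Z = len(data) * np.sqrt(np.pi * b)       # makes each gaussian bump integrate to 1/n
    return k.sum(axis=1) / Z

x = np.linspace(-5, 5, 201)
for b in (0.1, 1.0, 2.0):                    # small b: spiky estimate, large b: oversmoothed
    print(f"b = {b}: estimate at x = 0 is {parzen(x, data, b)[100]:.4f}")
```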
kernels for objects and structures

kernels on histograms and probability distributions
kernels on strings
spectral string kernel, using subsequences: $k(s, t) = \sum_u \phi_u(s)\, \phi_u(t)$
similarities by alignments: $k(s, t) = \sum_\pi \exp\big(\beta(s, t, \pi)\big)$
kernels on graphs
the pseudo-inverse of the (regularized) graph Laplacian $L = D - A$, where $A$ is the adjacency matrix and $D$ the degree matrix
diffusion kernels: $\tfrac{1}{Z(b)} \exp(bL)$
subgraph kernel convolution (using random walks)
and kernels on HMMs, automata, dynamical systems...

Shawe-Taylor & Cristianini's book, 2004; JP Vert, 2006
Multiple kernel
M. Cuturi, Positive Definite Kernels in Machine Learning, 2009
Gram matrix

Definition (Gram matrix)
Let $k(s, t)$ be a positive kernel on $\mathcal{X}$ and $(x_i)_{i=1,n}$ a sequence on $\mathcal{X}$. The Gram matrix is the square matrix $K$ of dimension $n$ with general term $K_{ij} = k(x_i, x_j)$.

practical trick to check kernel positivity:
$K$ is positive $\Leftrightarrow$ its eigenvalues are positive: if $K u_i = \lambda_i u_i,\ i = 1, \ldots, n$, then $u_i^\top K u_i = \lambda_i u_i^\top u_i = \lambda_i \ge 0$

matrix $K$ is the one to be used in practice
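The practical trick above, as a sketch: build a Gram matrix and inspect its spectrum (the counter-example with $-\|s - t\|^2$ is my own addition):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 2))

def gram(X, b=1.0):
    """Gaussian Gram matrix K_ij = exp(-||x_i - x_j||^2 / b)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / b)

K = gram(X, b=1.0)
eig = np.linalg.eigvalsh(K)              # real eigenvalues of the symmetric matrix K
print(eig.min() >= -1e-10)               # True: all eigenvalues are (numerically) >= 0

# a symmetric function that is NOT a positive kernel fails the test:
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
print(np.linalg.eigvalsh(-D).min())      # < 0: k(s, t) = -||s - t||^2 is not positive
```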
Examples of Gram matrices with different bandwidths

[figures: raw data and the corresponding Gram matrices for b = 2, b = .5 and b = 10]
different points of view about kernels

kernel and scalar product:
$k(s, t) = \langle \phi(s), \phi(t) \rangle_\mathcal{H}$

kernel and distance:
$d(s, t)^2 = k(s, s) + k(t, t) - 2 k(s, t)$

kernel and covariance: a positive matrix is a covariance matrix
$\mathbb{P}(f) = \tfrac{1}{Z} \exp\Big( -\tfrac{1}{2} (f - f_0)^\top K^{-1} (f - f_0) \Big)$
if $f_0 = 0$ and $f = K\alpha$, then $\mathbb{P}(\alpha) = \tfrac{1}{Z} \exp\big( -\tfrac{1}{2}\, \alpha^\top K \alpha \big)$

kernel and regularity (Green's function):
$k(s, t) = P^* P\, \delta_{s-t}$ for some operator $P$ (e.g. a differential operator)
Let's summarize

positive kernels: there are a lot of them, and they can be rather complex
2 classes: radial / projective
the bandwidth matters (more than the kernel itself)
the Gram matrix summarizes the pairwise comparisons
Roadmap
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
From kernel to functions

$\mathcal{H}_0 = \Big\{ f \ \Big|\ m_f < \infty;\ f_j \in \mathbb{R};\ t_j \in \mathcal{X},\ f(x) = \sum_{j=1}^{m_f} f_j k(x, t_j) \Big\}$

let us define the bilinear form (with $g(x) = \sum_{i=1}^{m_g} g_i k(x, s_i)$):
$\forall f, g \in \mathcal{H}_0, \quad \langle f, g \rangle_{\mathcal{H}_0} = \sum_{j=1}^{m_f} \sum_{i=1}^{m_g} f_j g_i k(t_j, s_i)$

Evaluation functional: $\forall x \in \mathcal{X}$,
$f(x) = \langle f(\bullet), k(x, \bullet) \rangle_{\mathcal{H}_0}$

from $k$ to $\mathcal{H}$: for any positive kernel, a hypothesis set can be constructed, $\mathcal{H} = \overline{\mathcal{H}_0}$ (the completion of $\mathcal{H}_0$ with respect to this metric).
RKHS

Definition (reproducing kernel Hilbert space (RKHS))
A Hilbert space $\mathcal{H}$ endowed with the inner product $\langle \bullet, \bullet \rangle_\mathcal{H}$ is said to be with reproducing kernel if there exists a positive kernel $k$ such that
$\forall s \in \mathcal{X},\ k(\bullet, s) \in \mathcal{H}$
$\forall f \in \mathcal{H},\ f(s) = \langle f(\bullet), k(s, \bullet) \rangle_\mathcal{H}$

Beware: $f = f(\bullet)$ is a function while $f(s)$ is the real value of $f$ at point $s$.

positive kernel $\Leftrightarrow$ RKHS
any function in $\mathcal{H}$ is pointwise defined
the kernel defines the inner product
it defines the regularity (smoothness) of the hypothesis set

Exercise: let $f(\bullet) = \sum_{i=1}^{n} \alpha_i k(\bullet, x_i)$. Show that $\|f\|^2_\mathcal{H} = \alpha^\top K \alpha$.
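A numerical sketch of the exercise, assuming a Gaussian kernel of my own choice: the squared norm of $f(\bullet) = \sum_i \alpha_i k(\bullet, x_i)$ is $\alpha^\top K \alpha$, and the reproducing property gives $f(s)$ back from the same expansion:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(10, 2))                   # the points x_i
alpha = rng.normal(size=10)

def k(s, t, b=1.0):
    return np.exp(-np.sum((s - t) ** 2) / b)   # a Gaussian kernel (my choice)

K = np.array([[k(xi, xj) for xj in X] for xi in X])

# ||f||_H^2 = <f, f>_H = sum_ij alpha_i alpha_j k(x_i, x_j) = alpha' K alpha
norm_sq = alpha @ K @ alpha
print(norm_sq >= 0)                            # True: K is a positive matrix

# reproducing property: f(s) = <f(.), k(s, .)>_H = sum_i alpha_i k(x_i, s)
s = rng.normal(size=2)
print(sum(a * k(xi, s) for a, xi in zip(alpha, X)))
```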
Other kernels (what really matters)

finite kernels
$k(s, t) = \big(\phi_1(s), \ldots, \phi_p(s)\big)^\top \big(\phi_1(t), \ldots, \phi_p(t)\big)$
Mercer kernels
positive on a compact set $\Leftrightarrow$ $k(s, t) = \sum_{j=1}^{p} \lambda_j \phi_j(s) \phi_j(t)$
positive kernels
positive semi-definite
conditionally positive (for some functions $p_j$)
$\forall \{x_i\}_{i=1,n},\ \forall \alpha_i$ with $\sum_{i=1}^{n} \alpha_i p_j(x_i) = 0,\ j = 1, \ldots, p$: $\quad \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) \ge 0$
symmetric, non-positive
$k(s, t) = \tanh(s^\top t + \alpha_0)$
non-symmetric, non-positive
the key property $\nabla J_t(f) = k(t, \cdot)$ holds

C. Ong et al., ICML, 2004
The kernel map

observation: $x = (x_1, \ldots, x_j, \ldots, x_d)$
$f(x) = w^\top x = \langle w, x \rangle_{\mathbb{R}^d}$

feature map: $x \longmapsto \Phi(x) = \big(\phi_1(x), \ldots, \phi_j(x), \ldots, \phi_p(x)\big)$, with $\Phi : \mathbb{R}^d \longrightarrow \mathbb{R}^p$
$f(x) = w^\top \Phi(x) = \langle w, \Phi(x) \rangle_{\mathbb{R}^p}$

kernel dictionary: $x \longmapsto k(x) = \big(k(x, x_1), \ldots, k(x, x_i), \ldots, k(x, x_n)\big)$, with $k : \mathbb{R}^d \longrightarrow \mathbb{R}^n$
$f(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i) = \langle \alpha, k(x) \rangle_{\mathbb{R}^n}$

kernel map: $x \longmapsto k(\bullet, x)$ (here $p = \infty$)
$f(x) = \langle f(\bullet), k(\bullet, x) \rangle_\mathcal{H}$
Roadmap
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
Functional differentiation in RKHS

Let $J$ be a functional
$J : \mathcal{H} \to \mathbb{R}, \quad f \mapsto J(f)$
examples: $J_1(f) = \|f\|^2_\mathcal{H}$, $J_2(f) = f(x)$

directional derivative of $J$ in direction $g$ at point $f$:
$dJ(f, g) = \lim_{\varepsilon \to 0} \frac{J(f + \varepsilon g) - J(f)}{\varepsilon}$

Gradient $\nabla J(f)$:
$\nabla J : \mathcal{H} \to \mathcal{H}, \quad f \mapsto \nabla J(f) \quad \text{if} \quad dJ(f, g) = \langle \nabla J(f), g \rangle_\mathcal{H}$

exercise: find $\nabla J_1(f)$ and $\nabla J_2(f)$

Hint:
$dJ(f, g) = \dfrac{d\, J(f + \varepsilon g)}{d\varepsilon}\Big|_{\varepsilon = 0}$
Solution

$dJ_1(f, g) = \lim_{\varepsilon \to 0} \frac{\|f + \varepsilon g\|^2 - \|f\|^2}{\varepsilon} = \lim_{\varepsilon \to 0} \frac{\|f\|^2 + \varepsilon^2 \|g\|^2 + 2\varepsilon \langle f, g \rangle_\mathcal{H} - \|f\|^2}{\varepsilon} = \lim_{\varepsilon \to 0} \big( \varepsilon \|g\|^2 + 2 \langle f, g \rangle_\mathcal{H} \big) = \langle 2f, g \rangle_\mathcal{H}$
$\Leftrightarrow \nabla J_1(f) = 2f$

$dJ_2(f, g) = \lim_{\varepsilon \to 0} \frac{f(x) + \varepsilon g(x) - f(x)}{\varepsilon} = g(x) = \langle k(x, \cdot), g \rangle_\mathcal{H}$
$\Leftrightarrow \nabla J_2(f) = k(x, \cdot)$

$\displaystyle \min_{f \in \mathcal{H}} J(f) \ \Leftrightarrow\ \forall g \in \mathcal{H},\ dJ(f, g) = 0 \ \Leftrightarrow\ \nabla J(f) = 0$
Subdifferential in a RKHS $\mathcal{H}$

Definition (Subgradient)
A subgradient of $J : \mathcal{H} \longrightarrow \mathbb{R}$ at $f_0$ is any function $g \in \mathcal{H}$ such that
$\forall f \in \mathcal{V}(f_0), \quad J(f) \ge J(f_0) + \langle g, (f - f_0) \rangle_\mathcal{H}$

Definition (Subdifferential)
$\partial J(f)$, the subdifferential of $J$ at $f$, is the set of all subgradients of $J$ at $f$.

$\mathcal{H} = \mathbb{R}$:  $J_3(x) = |x|$,  $\partial J_3(0) = \{ g \in \mathbb{R} \mid -1 \le g \le 1 \}$
$\mathcal{H} = \mathbb{R}$:  $J_4(x) = \max(0, 1 - x)$,  $\partial J_4(1) = \{ g \in \mathbb{R} \mid -1 \le g \le 0 \}$

Theorem (Chain rule for the subdifferential with a linear operator)
Let $T$ be a linear operator $\mathcal{H} \longrightarrow \mathbb{R}$ and $\varphi$ a function from $\mathbb{R}$ to $\mathbb{R}$.
If $J(f) = \varphi(Tf)$, then $\partial J(f) = \{ T^* g \mid g \in \partial \varphi(Tf) \}$, where $T^*$ denotes $T$'s adjoint operator.
example of subdifferential in $\mathcal{H}$

evaluation operator and its adjoint:
$T : \mathcal{H} \longrightarrow \mathbb{R}^n, \quad f \longmapsto Tf = \big(f(x_1), \ldots, f(x_n)\big)$
$T^* : \mathbb{R}^n \longrightarrow \mathcal{H}, \quad \alpha \longmapsto T^*\alpha = \sum_{i=1}^{n} \alpha_i k(\bullet, x_i)$

to build the adjoint, use $\langle Tf, \alpha \rangle_{\mathbb{R}^n} = \langle f, T^*\alpha \rangle_\mathcal{H}$:
$\langle Tf, \alpha \rangle_{\mathbb{R}^n} = \sum_{i=1}^{n} f(x_i)\, \alpha_i = \sum_{i=1}^{n} \langle f(\bullet), k(\bullet, x_i) \rangle_\mathcal{H}\, \alpha_i = \Big\langle f(\bullet), \underbrace{\sum_{i=1}^{n} \alpha_i k(\bullet, x_i)}_{T^*\alpha} \Big\rangle_\mathcal{H}$

moreover
$TT^* : \mathbb{R}^n \longrightarrow \mathbb{R}^n, \quad \alpha \longmapsto TT^*\alpha = \Big( \sum_{j=1}^{n} \alpha_j k(x_j, x_i) \Big)_{i=1,n} = K\alpha$
Example of subdifferentials

for a given $x$: $J_5(f) = |f(x)|$,  $\partial J_5(f_0) = \big\{ g(\bullet) = \alpha\, k(\bullet, x)\ ;\ -1 \le \alpha \le 1 \big\}$
for a given $x$: $J_6(f) = \max\big(0, 1 - f(x)\big)$,  $\partial J_6(f_1) = \big\{ g(\bullet) = \alpha\, k(\bullet, x)\ ;\ -1 \le \alpha \le 0 \big\}$
Optimality conditions

Theorem (Fermat optimality criterion)
When $J(f)$ is convex, $f^\star$ is a stationary point of the problem $\min_{f \in \mathcal{H}} J(f)$ if and only if $0 \in \partial J(f^\star)$.

[figure: the subdifferential $\partial J(f)$ around the minimizer]

exercise: for a given $y \in \mathbb{R}$, find (from Obozinski)
$\min_{x \in \mathbb{R}}\ \tfrac{1}{2} (x - y)^2 + \lambda |x|$
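For reference, the minimizer of this exercise is the well-known soft-thresholding operator $x^\star = \mathrm{sign}(y)\, \max(|y| - \lambda, 0)$; a quick numerical check (my own) against a grid search:

```python
import numpy as np

def soft_threshold(y, lam):
    """argmin_x 0.5*(x - y)**2 + lam*|x| = sign(y) * max(|y| - lam, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y, lam = 1.3, 0.5
grid = np.linspace(-3, 3, 100001)
objective = 0.5 * (grid - y) ** 2 + lam * np.abs(grid)
print(soft_threshold(y, lam), grid[np.argmin(objective)])   # both close to 0.8
```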
Let's summarize

positive kernels $\Leftrightarrow$ RKHS $= \mathcal{H}$ $\Leftrightarrow$ regularity $\|f\|^2_\mathcal{H}$
the key property $\nabla J_t(f) = k(t, \cdot)$ holds not only for positive kernels
$f(x_i)$ exists (pointwise defined functions)
universal consistency in RKHS
the Gram matrix summarizes the pairwise comparisons