1. Kernel methods for data integration in systems biology
Nathalie Vialaneix
nathalie.vialaneix@inra.fr
http://www.nathalievialaneix.eu
KIM Seminar
October 18th, 2019 - Montpellier
Nathalie Vialaneix | Kernel methods for data integration in systems biology 1/48
2. A primer on kernel methods for biology
4. Before we start: context and motivations
Data characteristics
- a few (paired) samples
- information at various levels
- ... but of heterogeneous types
- and, when numeric, with a large dimension
What we want to achieve
- integrative analysis
- to predict a phenotype, to understand the typology of the samples, ...
8. In short: what are kernels?
Data we are used to...
n samples on which p variables are measured: $(x_i)_{i=1,\dots,n}$ with $x_i \in \mathbb{R}^p$.
From that, we can compute:
- centers of gravity: $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$
- distances and dot products: $d(x_i, x_{i'}) = \sqrt{\sum_{j=1}^p (x_{ij} - x_{i'j})^2}$ and $\langle x_i, x_{i'} \rangle = \sum_{j=1}^p x_{ij} x_{i'j}$
Kernels...
The characteristics of the n samples $(x_i)_i$ are summarized by pairwise similarities. More formally: an $n \times n$ matrix $K$, such that $K$ is symmetric and positive definite.
Moore-Aronszajn theorem [Aronszajn, 1950]
$\exists!$ Hilbert space $\mathcal{H}$ and $\phi : \mathcal{X} \to \mathcal{H}$ such that $K_{ii'} = \langle \phi(x_i), \phi(x_{i'}) \rangle_{\mathcal{H}}$
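The definition above can be checked numerically. A minimal numpy sketch (the helper name `gaussian_kernel` is mine, not from the talk): build a Gaussian kernel matrix and verify that it is symmetric and positive semi-definite.

```python
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    """Gaussian kernel matrix K_ii' = exp(-gamma * ||x_i - x_i'||^2)."""
    sq = np.sum(X**2, axis=1)
    # squared Euclidean distances, clipped to avoid tiny negative round-off
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0, None)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))          # n = 6 samples in R^3
K = gaussian_kernel(X, gamma=0.5)

assert np.allclose(K, K.T)                     # symmetric
assert np.all(np.linalg.eigvalsh(K) > -1e-10)  # positive (semi-)definite
```

The eigenvalue check is the numerical counterpart of the positive-definiteness requirement in the definition.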
10. Why are kernels interesting?
1 because they can reduce high-dimensional data to small similarity matrices
2 because they are not restricted to data in $\mathbb{R}^p$ (kernels on graphs, between graphs, on text, ...) some examples to come
3 because they can embed expert knowledge (e.g., phylogeny between taxa) some examples to come
4 because they offer a rigorous framework to extend many statistical methods basic principles to come just after
5 because they offer a clean and common framework for data integration extension 1
but:
1 the choice of the relevant kernel is still up to you...
2 they can strongly increase computational time when n is large... extension 2
13. Kernel examples
1 $\mathbb{R}^p$ observations: Gaussian kernel $K_{ii'} = e^{-\gamma \|x_i - x_{i'}\|^2}$
2 nodes of a graph: [Kondor and Lafferty, 2002]
3 sequence kernels (used to compute similarities between proteins, for instance): spectrum kernel [Jaakkola et al., 2000] (with HMM), convolution kernel [Saigo et al., 2004]
4 kernels between graphs (or "structured data"; used in metabolomics to compute similarities between metabolites based on their fragmentation trees): [Shen et al., 2014, Brouard et al., 2016]
More examples: [Mariette and Vialaneix, 2019]
14. Principles for learning from kernels
Start from any statistical method (PCA, regression, k-means clustering) and rewrite all quantities using:
- K to compute distances and dot products: the dot product is $K_{ii'}$ and the distance is $\sqrt{K_{ii} + K_{i'i'} - 2K_{ii'}}$
- (implicit) linear or convex combinations of $(\phi(x_i))_i$ to describe all unobserved elements (centers of gravity and so on...)
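The distance rewriting above can be sketched in a few lines (the helper `kernel_distance` is a name of mine): with a linear kernel $K = XX^\top$, where $\phi$ is the identity, the kernel-induced distance must coincide with the plain Euclidean one.

```python
import numpy as np

def kernel_distance(K, i, j):
    """||phi(x_i) - phi(x_j)||_H = sqrt(K_ii + K_jj - 2 K_ij)."""
    return np.sqrt(K[i, i] + K[j, j] - 2.0 * K[i, j])

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))
K = X @ X.T                           # linear kernel: phi is the identity
assert np.isclose(kernel_distance(K, 0, 1), np.linalg.norm(X[0] - X[1]))
```

For a non-linear kernel the same formula gives the distance in the (implicit) feature space, with no access to $\phi$ needed.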
16. A simple example: k-means
1: Initialization: random initialization of P centers $\bar{x}_{C_j^0} \in \mathbb{R}^p$
2: for t = 1 to T do
3: Assignment step: $\forall\, i = 1, \dots, n$, $f^{t+1}(x_i) = \arg\min_{j=1,\dots,P} d(x_i, \bar{x}_{C_j^t})$
4: Representation step: $\forall\, j = 1, \dots, P$, $\bar{x}_{C_j^t} = \frac{1}{|C_j^t|} \sum_{x_l \in C_j^t} x_l$
5: end for (until convergence)
6: return partition
20. A simple example: k-means
1: Initialization: random initialization of a partition of $(x_i)_i$ and $\bar{x}_{C_j^1} = \frac{1}{|C_j^1|} \sum_{x_i \in C_j^1} \phi(x_i)$
2: for t = 1 to T do
3: Assignment step: $f^{t+1}(x_i) = \arg\min_{j=1,\dots,P} \|\phi(x_i) - \bar{x}_{C_j^t}\|^2_{\mathcal{H}} = \arg\min_{j=1,\dots,P} \left( K_{ii} - \frac{2}{|C_j^t|} \sum_{x_l \in C_j^t} K_{il} + \frac{1}{|C_j^t|^2} \sum_{x_l, x_{l'} \in C_j^t} K_{ll'} \right)$
4: Representation step: $\forall\, j = 1, \dots, P$, $\bar{x}_{C_j^t} = \frac{1}{|C_j^t|} \sum_{x_l \in C_j^t} \phi(x_l)$
5: end for (until convergence)
6: return partition
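The assignment step above only touches K. A minimal sketch (the helper `dist2_to_clusters` is my naming), sanity-checked with a linear kernel, where the feature-space distances to cluster means must match the explicit Euclidean ones:

```python
import numpy as np

def dist2_to_clusters(K, labels, P):
    """Squared feature-space distance of every sample to every cluster mean,
    computed from the kernel matrix alone (kernel k-means assignment step)."""
    n = K.shape[0]
    d2 = np.empty((n, P))
    for j in range(P):
        m = labels == j
        nj = m.sum()
        d2[:, j] = (np.diag(K)
                    - 2.0 * K[:, m].sum(axis=1) / nj          # cross term
                    + K[np.ix_(m, m)].sum() / nj**2)          # within-cluster term
    return d2

# sanity check with a linear kernel (phi(x) = x):
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])
d2 = dist2_to_clusters(X @ X.T, labels, 2)
explicit = np.array([[np.sum((x - X[labels == j].mean(axis=0))**2)
                      for j in range(2)] for x in X])
assert np.allclose(d2, explicit)
```

Iterating `labels = dist2_to_clusters(K, labels, P).argmin(axis=1)` until stabilization gives the full kernel k-means loop.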
23. Beyond kernels: relational data
DNA barcoding (Astraptes fulgerator): optimal matching (edit) distances to differentiate species
Hi-C data: pairwise measure (similarity) related to the physical 3D distance between loci in the cell, at genome scale [Ambroise et al., 2019, Randriamihamison et al., 2019]
Metagenomics: dissimilarity between samples is better captured when the phylogeny between species is taken into account (Unifrac distances)
27. Formally, relational data are:
- Euclidean distances or (non-Euclidean) dissimilarities between n entities: a symmetric $(n \times n)$ matrix $D$ with positive entries and null diagonal
- kernels: a symmetric and positive definite $(n \times n)$ matrix $K$ that measures a "relation" between n entities in $\mathcal{X}$ (an arbitrary space): $K(x, x') = \langle \phi(x), \phi(x') \rangle$
- networks/graphs: groups of n entities (nodes/vertices) linked by a (potentially weighted) relation (edges) ⇒ a symmetric $(n \times n)$ matrix $W$ with positive entries and null diagonal
- similarities between n entities: a symmetric $(n \times n)$ matrix $S$ (usually with positive entries) but not necessarily positive definite
29. Different relational data types are related to each other
- a kernel induces a Euclidean distance: $D(x, x') := \sqrt{K(x, x) + K(x', x') - 2K(x, x')}$
- from a dissimilarity, similarities can be computed: $S(x, x) := a(x)$ (arbitrary) and $S(x, x') = \frac{1}{2}\left( a(x) + a(x') - D^2(x, x') \right)$
- various kernels have been proposed for graphs (e.g., based on the graph Laplacian): [Kondor and Lafferty, 2002]
In summary
A useful simplification: "is the framework Euclidean or not?" (e.g., kernel vs non-Euclidean dissimilarity)
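The two conversions above can be sketched directly (helper names are mine). A pleasant round trip checks them: starting from a kernel, taking its induced distance and then rebuilding similarities with $a(x) = K(x,x)$ recovers the kernel.

```python
import numpy as np

def kernel_to_distance(K):
    """Euclidean distance induced by a kernel: D_ij = sqrt(K_ii + K_jj - 2 K_ij)."""
    d = np.diag(K)
    return np.sqrt(np.clip(d[:, None] + d[None, :] - 2.0 * K, 0.0, None))

def dissimilarity_to_similarity(D, a=None):
    """S(x, x') = (a(x) + a(x') - D(x, x')^2) / 2, with arbitrary self-terms a."""
    if a is None:
        a = np.zeros(D.shape[0])
    return 0.5 * (a[:, None] + a[None, :] - D**2)

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))
K = X @ X.T
S = dissimilarity_to_similarity(kernel_to_distance(K), a=np.diag(K))
assert np.allclose(S, K)      # choosing a(x) = K(x, x) recovers the kernel
```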
31. Principles for learning from relational data
Euclidean case (kernel K): rewrite all quantities using
- K to compute distances and dot products
- linear or convex combinations of $(\phi(x_i))_i$ to describe all unobserved elements (centers of gravity and so on...)
Works for: PCA, k-means, linear regression, ...
Non-Euclidean case (non-Euclidean dissimilarity D): do almost the same using a pseudo-Euclidean framework [Goldfarb, 1984]: there exist two Euclidean spaces $E_+$ and $E_-$ and two mappings $\phi_+$ and $\phi_-$ such that
$D(x, x') = \|\phi_+(x) - \phi_+(x')\|^2_{E_+} - \|\phi_-(x) - \phi_-(x')\|^2_{E_-}$
32. And now?
1 integrate multiple data sources with kernels (with application to metagenomic datasets) extension 1
2 reduce the complexity of kernel methods extension 2
33. Combining relational data in an unsupervised setting
36. What are metagenomic data?
Source: [Sommer et al., 2010]
- abundance data: sparse $n \times p$ matrices with count data, samples in rows and descriptors (species, OTUs, KEGG groups, k-mers, ...) in columns. Generally $p \gg n$.
- phylogenetic tree (evolutionary history between species, OTUs, ...): one tree with p leaves, built from the sequences collected in the n samples.
38. What are metagenomic data used for?
- produce a profile of the diversity of a given sample ⇒ allows comparing diversity between various conditions
- used in various fields: environmental science, microbiota studies, ...
Processed by computing a relevant dissimilarity between samples (the standard Euclidean distance is not relevant) and by using this dissimilarity in subsequent analyses.
39. β-diversity data: dissimilarities between count data
Compositional dissimilarities ($n_{ig}$: count of species g in sample i)
Jaccard, the fraction of species specific to either sample i or j:
$d_{jac} = \frac{\sum_g \left( \mathbb{I}_{\{n_{ig}>0,\, n_{jg}=0\}} + \mathbb{I}_{\{n_{jg}>0,\, n_{ig}=0\}} \right)}{\sum_g \mathbb{I}_{\{n_{ig}+n_{jg}>0\}}}$
Bray-Curtis, the fraction of the sample that is specific to either sample i or j:
$d_{BC} = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})}$
Other dissimilarities are available in the R package phyloseq; most of them are not Euclidean.
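Both formulas above are one-liners on a pair of count vectors. A minimal sketch (function names are mine), with a tiny worked example:

```python
import numpy as np

def jaccard(n_i, n_j):
    """Fraction of (present) species specific to one of the two samples."""
    present = (n_i + n_j) > 0
    only_one = ((n_i > 0) & (n_j == 0)) | ((n_j > 0) & (n_i == 0))
    return only_one.sum() / present.sum()

def bray_curtis(n_i, n_j):
    """Fraction of the total counts specific to one of the two samples."""
    return np.abs(n_i - n_j).sum() / (n_i + n_j).sum()

a = np.array([5, 0, 3, 0])      # counts of 4 species in sample i
b = np.array([0, 2, 3, 0])      # counts in sample j
assert jaccard(a, b) == 2 / 3                      # species 1, 2 sample-specific; 3 shared
assert np.isclose(bray_curtis(a, b), 7 / 13)       # (|5-0|+|0-2|+|3-3|) / (5+2+6)
```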
43. β-diversity data: phylogenetic dissimilarities
For each branch e, denote by $l_e$ its length and by $p_{ei}$ the fraction of counts in sample i corresponding to species below branch e.
Unifrac, the fraction of the tree specific to either sample i or sample j:
$d_{UF} = \frac{\sum_e l_e \left( \mathbb{I}_{\{p_{ei}>0,\, p_{ej}=0\}} + \mathbb{I}_{\{p_{ej}>0,\, p_{ei}=0\}} \right)}{\sum_e l_e\, \mathbb{I}_{\{p_{ei}+p_{ej}>0\}}}$
Weighted Unifrac, the fraction of the diversity specific to sample i or to sample j:
$d_{wUF} = \frac{\sum_e l_e\, |p_{ei} - p_{ej}|}{\sum_e l_e\, (p_{ei} + p_{ej})}$
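Given the per-branch lengths $l_e$ and fractions $p_{ei}$ (pre-computed from a tree, which is assumed here), both Unifrac variants reduce to weighted sums. A minimal sketch (function names are mine):

```python
import numpy as np

def unifrac(l, p_i, p_j):
    """Unweighted Unifrac from branch lengths l and per-branch count fractions."""
    only_one = ((p_i > 0) & (p_j == 0)) | ((p_j > 0) & (p_i == 0))
    return (l * only_one).sum() / (l * ((p_i + p_j) > 0)).sum()

def weighted_unifrac(l, p_i, p_j):
    """Weighted Unifrac: branch lengths weighted by the fraction differences."""
    return (l * np.abs(p_i - p_j)).sum() / (l * (p_i + p_j)).sum()

l   = np.array([1.0, 2.0, 1.0])            # three branches
p_i = np.array([0.5, 0.5, 0.0])
p_j = np.array([0.5, 0.0, 0.5])
assert np.isclose(unifrac(l, p_i, p_j), 3 / 4)          # branches 2 and 3 are specific
assert np.isclose(weighted_unifrac(l, p_i, p_j), 0.6)   # 1.5 / 2.5
```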
44. TARA Oceans datasets
The 2009-2013 expedition, co-directed by Étienne Bourgois and Éric Karsenti.
7,012 datasets collected from 35,000 samples of plankton and water (11,535 Gb of data).
Study of the plankton: bacteria, protists, metazoans and viruses, representing more than 90% of the biomass in the ocean.
45. TARA Oceans datasets
Science (May 2015) - Studies on:
- eukaryotic plankton diversity [de Vargas et al., 2015],
- ocean viral communities [Brum et al., 2015],
- global plankton interactome [Lima-Mendez et al., 2015],
- global ocean microbiome [Sunagawa et al., 2015],
- ...
→ datasets of different types and from different sources, analyzed separately.
50. TARA Oceans datasets that we used
[Sunagawa et al., 2015, de Vargas et al., 2015, Brum et al., 2015]
Datasets used
- environmental dataset: 22 numeric features (temperature, salinity, ...).
- bacterial phylogenomic tree: computed from ∼ 35,000 OTUs.
- bacterial functional composition: ∼ 63,000 KEGG orthologous groups.
- eukaryotic plankton composition, split into 4 size groups: pico (0.8-5 µm), nano (5-20 µm), micro (20-180 µm) and meso (180-2000 µm).
- virus composition: ∼ 867 virus clusters based on shared gene content.
51. TARA Oceans datasets that we used
Common samples
- 48 samples,
- 2 depth layers: surface (SRF) and deep chlorophyll maximum (DCM),
- 31 different sampling stations.
55. From multiple kernels to an integrated kernel
How to combine multiple kernels?
- naive approach: $K^* = \frac{1}{M} \sum_m K^m$
- supervised framework: $K^* = \sum_m \beta_m K^m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$, with $\beta_m$ chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011]
- unsupervised framework, but with input space $\mathbb{R}^p$ [Zhuang et al., 2011]: $K^* = \sum_m \beta_m K^m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$, with $\beta_m$ chosen so as to minimize the distortion between all training data, $\sum_{ij} K^*(x_i, x_j) \|x_i - x_j\|^2$, AND the approximation error of the original data by the kernel embedding, $\sum_i \|x_i - \sum_j K^*(x_i, x_j) x_j\|^2$.
Our proposal: 2 UMKL frameworks which do not require the data to have values in $\mathbb{R}^p$.
56. Multi-kernel/distances integration
How to "optimally" combine several relational datasets in an unsupervised setting?
For kernels $K^1, \dots, K^M$ obtained on the same n objects, search for $K_\beta = \sum_{m=1}^M \beta_m K^m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$ [Mariette and Villa-Vialaneix, 2018].
R package mixKernel: https://cran.r-project.org/package=mixKernel
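The convex combination above is easy to sketch; note this is a toy helper of mine, not the mixKernel API. A convex combination of symmetric positive semi-definite matrices stays symmetric positive semi-definite, which is why $K_\beta$ is again a valid kernel.

```python
import numpy as np

def combine_kernels(kernels, beta=None):
    """Convex combination K* = sum_m beta_m K^m (naive average if beta is None)."""
    M = len(kernels)
    if beta is None:
        beta = np.full(M, 1.0 / M)          # naive approach: equal weights
    assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0)
    return sum(b * K for b, K in zip(beta, kernels))

rng = np.random.default_rng(4)
K1 = (lambda X: X @ X.T)(rng.normal(size=(6, 2)))   # two toy kernels
K2 = (lambda X: X @ X.T)(rng.normal(size=(6, 3)))
Kc = combine_kernels([K1, K2])
assert np.allclose(Kc, Kc.T)
assert np.all(np.linalg.eigvalsh(Kc) > -1e-10)      # still a valid kernel
```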
59. STATIS-like framework
[L'Hermier des Plantes, 1976, Lavit et al., 1994]
Similarities between kernels:
$C_{mm'} = \frac{\langle K^m, K^{m'} \rangle_F}{\|K^m\|_F \|K^{m'}\|_F} = \frac{\mathrm{Trace}(K^m K^{m'})}{\sqrt{\mathrm{Trace}((K^m)^2)\, \mathrm{Trace}((K^{m'})^2)}}.$
($C_{mm'}$ is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework)
$\text{maximize}_v \sum_{m=1}^M \left\langle K^*(v), \frac{K^m}{\|K^m\|_F} \right\rangle_F = v^\top C v$
for $K^*(v) = \sum_{m=1}^M v_m K^m$ and $v \in \mathbb{R}^M$ such that $\|v\|_2 = 1$.
Solution: the first eigenvector of C ⇒ set $\beta = \frac{v}{\sum_{m=1}^M v_m}$ (consensual kernel).
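The consensus weights above take a few lines of numpy (the helper `statis_weights` is my naming, a sketch rather than the mixKernel implementation): build C, take its leading eigenvector, and normalize it to sum to one.

```python
import numpy as np

def statis_weights(kernels):
    """STATIS-like consensus weights: beta is the first eigenvector of the
    kernel-similarity matrix C, rescaled so that the weights sum to one."""
    M = len(kernels)
    C = np.empty((M, M))
    for m in range(M):
        for mp in range(M):
            C[m, mp] = (np.trace(kernels[m] @ kernels[mp])
                        / np.sqrt(np.trace(kernels[m] @ kernels[m])
                                  * np.trace(kernels[mp] @ kernels[mp])))
    vals, vecs = np.linalg.eigh(C)
    v = vecs[:, -1]                # leading eigenvector (eigh sorts ascending)
    v = v * np.sign(v.sum())       # fix the arbitrary eigenvector sign
    return v / v.sum()

# identical kernels must receive equal weights:
K = np.array([[2.0, 1.0], [1.0, 2.0]])
assert np.allclose(statis_weights([K, K, K]), 1 / 3)
```

Since C has positive entries, Perron-Frobenius guarantees a leading eigenvector with entries of a single sign, so the rescaled weights are non-negative.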
62. A kernel preserving the original topology of the data I
Similarly to [Lin et al., 2010], preserve the local geometry of the data in the feature space.
Proxy of the local geometry
$K^m \longrightarrow G^m_k$ (k-nearest-neighbor graph) $\longrightarrow A^m_k$ (adjacency matrix)
$\Rightarrow W = \sum_m \mathbb{I}_{\{A^m_k > 0\}}$ or $W = \sum_m A^m_k$
Feature-space geometry measured by
$\Delta_i(\beta) = \left\langle \phi^*_\beta(x_i), \begin{pmatrix} \phi^*_\beta(x_1) \\ \vdots \\ \phi^*_\beta(x_n) \end{pmatrix} \right\rangle = \begin{pmatrix} K^*_\beta(x_i, x_1) \\ \vdots \\ K^*_\beta(x_i, x_n) \end{pmatrix}$
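The local-geometry proxy above starts from a k-nearest-neighbor graph built in each kernel's feature space. A minimal sketch of that step (the helper name is mine), using the kernel-induced distance from the earlier slides:

```python
import numpy as np

def knn_adjacency_from_kernel(K, k):
    """Adjacency matrix of the k-nearest-neighbor graph, with neighbors
    measured by the kernel-induced distance ||phi(x_i) - phi(x_j)||_H."""
    d = np.diag(K)
    D = np.sqrt(np.clip(d[:, None] + d[None, :] - 2.0 * K, 0.0, None))
    n = K.shape[0]
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        nn = np.argsort(D[i])[1:k + 1]     # skip the point itself (distance 0)
        A[i, nn] = 1
    return np.maximum(A, A.T)              # symmetrize

# W then counts in how many of the M kernels two samples are neighbors:
# W = sum(knn_adjacency_from_kernel(Km, k) for Km in kernels)
X = np.array([[0.0], [1.0], [10.0], [11.0]])
A = knn_adjacency_from_kernel(X @ X.T, k=1)
assert A[0, 1] == 1 and A[2, 3] == 1 and A[0, 2] == 0
```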
63. A kernel preserving the original topology of the data II
Sparse version
$\text{minimize}_\beta \sum_{i,j=1}^n W_{ij} \|\Delta_i(\beta) - \Delta_j(\beta)\|^2$
for $K^*_\beta = \sum_{m=1}^M \beta_m K^m$ and $\beta \in \mathbb{R}^M$ such that $\beta_m \geq 0$ and $\sum_{m=1}^M \beta_m = 1$.
Non-sparse version
$\text{minimize}_v \sum_{i,j=1}^n W_{ij} \|\Delta_i(v) - \Delta_j(v)\|^2$
for $K^*_v = \sum_{m=1}^M v_m K^m$ and $v \in \mathbb{R}^M$ such that $v_m \geq 0$ and $\|v\|_2 = 1$.
64. A kernel preserving the original topology of the data II
Sparse version: equivalent to a standard QP problem with linear constraints (e.g., package quadprog in R)
Non-sparse version: equivalent to a QCQP problem (harder to solve), solved with the Alternating Direction Method of Multipliers (ADMM [Boyd et al., 2011])
65. Application to TARA Oceans
Similarity between datasets (STATIS): phychem and small-size organisms are the most similar (confirmed by [de Vargas et al., 2015] and [Sunagawa et al., 2015]).
66. Application to TARA Oceans
Important variables: Rhizaria abundance strongly structures the differences between samples (analyses restricted to some organisms found differences mostly based on water depth), and waters from the Arctic and Pacific Oceans differ in terms of Rhizaria abundance.
67. Reducing complexity of kernel methods
68. Large scale kernel methods
Standard complexity (number of elementary operations) of kernel learning methods: $O(n^2)$ or even $O(n^3)$.
Examples
- K-PCA: the spectral decomposition of K (equivalent to PCA in the feature space) is $O(n^3)$, as compared to $O(\min(p, n)^3)$ for standard PCA
- kernel k-means: the complexity of naive kernel k-means is $O(Tkn^2)$, as compared to $O(Tkpn)$ for naive standard k-means
69. Low rank approximation solutions
Aim: approximate K with a low-rank matrix (a matrix with rank $r \ll n$). Then, use the approximation to train your predictor and correct it to "re-scale" it to n. Typical computational cost: $O(nr^2)$.
71. Sketch of Nyström approximation
[Williams and Seeger, 2000, Drineas and Mahoney, 2005]
Pick at random m observations in $\{1, \dots, n\}$ (without loss of generality, suppose that the first m ones have been chosen).
Re-write
$K = \begin{pmatrix} K^{(m)} & K^{(m,n-m)} \\ K^{(n-m,m)} & K^{(n-m,n-m)} \end{pmatrix}$ and $K^{(n,m)} = \begin{pmatrix} K^{(m)} \\ K^{(n-m,m)} \end{pmatrix}$,
with $K^{(n-m,m)} = (K^{(m,n-m)})^\top$, and use $K^{(m)}$ instead of K.
74. Approximate spectral decomposition of K
Notations: for K, eigenvectors $(v_j)_{j=1,\dots,n}$ and eigenvalues $(\lambda_j)_{j=1,\dots,n}$ (positive, in decreasing order); for $K^{(m)}$, eigenvectors $(v^{(m)}_j)_{j=1,\dots,m}$ and eigenvalues $(\lambda^{(m)}_j)_{j=1,\dots,m}$ (positive, in decreasing order).
$\forall\, j = 1, \dots, m, \quad \lambda_j \simeq \frac{n}{m} \lambda^{(m)}_j \quad\text{and}\quad v_j \simeq \sqrt{\frac{m}{n}}\, \frac{1}{\lambda^{(m)}_j}\, K^{(n,m)} v^{(m)}_j$
- complexity of the direct calculation: $O(n^3)$
- complexity of the approximate solution: $O(m^3) + O(nm^2)$
Remark: when the rank of K is < m, the approximation is exact.
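The scaling formulas above can be sketched directly (the helper `nystrom_eig` is my naming). The exactness for rank(K) < m shows up as an exact reconstruction of K from the approximate eigenpairs:

```python
import numpy as np

def nystrom_eig(K, m, seed=0):
    """Nystrom approximation of the eigen-decomposition of an n x n kernel
    matrix from an m x m sub-block: O(m^3 + n m^2) instead of O(n^3)."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=False)    # the m sampled observations
    Km = K[np.ix_(idx, idx)]                      # K^(m)
    Knm = K[:, idx]                               # K^(n,m)
    lam_m, v_m = np.linalg.eigh(Km)
    lam_m, v_m = lam_m[::-1], v_m[:, ::-1]        # decreasing eigenvalue order
    keep = lam_m > 1e-10 * lam_m[0]               # drop numerically null eigenvalues
    lam_m, v_m = lam_m[keep], v_m[:, keep]
    lam = (n / m) * lam_m                         # lambda_j ~ (n/m) lambda_j^(m)
    v = np.sqrt(m / n) * (Knm @ v_m) / lam_m      # v_j ~ sqrt(m/n) K^(n,m) v_j^(m) / lambda_j^(m)
    return lam, v

# when rank(K) < m, reconstructing K from the approximate pairs is exact:
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))                      # K = X X^T has rank 2
K = X @ X.T
lam, v = nystrom_eig(K, m=5, seed=1)
assert np.allclose((v * lam) @ v.T, K, atol=1e-6)
```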
76. What can be obtained from that...
- an approximation of the original kernel K [Cortes et al., 2010, Bach, 2013]
- an approximation of the kernel (ridge) regression with a control of the estimation error [Cortes et al., 2010, Bach, 2013] (similar methods exist to use the Nyström approximation in SVMs)
- various derived extensions [Mariette et al., 2017a] (online Self-Organizing Maps)
77. Basics on other approaches in an online framework
Online learning: deal with samples one by one and update the model (at low cost).
How to use online learning to reduce complexity? Cache some operations in memory + (not always) impose sparsity on representers (centers of gravity and so on...)
[Rossi et al., 2007] for kernel k-means - [Mariette et al., 2017b] for kernel SOM
78. Basics on (standard) stochastic SOM
[Kohonen, 2001]
$(x_i)_{i=1,\dots,n} \subset \mathbb{R}^p$ are affected to a unit $f(x_i) \in \{1, \dots, U\}$; the grid is equipped with a "distance" between units, $d(u, u')$, and observations affected to close units are close in $\mathbb{R}^p$; every unit u corresponds to a prototype $p_u$ in $\mathbb{R}^p$.
Iterative learning (assignment step): $x_i$ is picked at random within $(x_k)_k$ and affected to the best matching unit:
$f^t(x_i) = \arg\min_u \|x_i - p^t_u\|^2$
Iterative learning (representation step): all prototypes in neighboring units are updated with a gradient-descent-like step:
$p^{t+1}_u \longleftarrow p^t_u + \mu(t)\, H^t(d(f(x_i), u))\, (x_i - p^t_u)$
81. Extension of SOM to data described by a kernel
[Villa and Rossi, 2007]
Data: (xi)i=1,...,n ∈ Rp
1: Initialization:
randomly set p0
1
, ..., p0
U
in Rd
2: for t = 1 → T do
3: pick at random i ∈ {1, . . . , n}
4: Assignment
ft
(xi) = arg min
u=1,...,U
xi − pt
u
2
5: for all u = 1 → U do Representation
6:
pt+1
u = pt
u + µ(t)Ht
(d(ft
(xi), u)) xi − pt
u
7: end for
8: end for
The general relational variant is implemented in SOMbrero.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
82. Extension of SOM to data described by a kernel
[Villa and Rossi, 2007]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: randomly set p^0_1, ..., p^0_U in R^d
2: for t = 1 → T do
3:   pick at random i ∈ {1, . . . , n}
4:   Assignment: f^t(x_i) = arg min_{u=1,...,U} ||x_i − p^t_u||^2
5:   for all u = 1 → U do (Representation)
6:     p^{t+1}_u = p^t_u + µ(t) H^t(d(f^t(x_i), u)) (x_i − p^t_u)
7:   end for
8: end for
The general relational variant is implemented in SOMbrero.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
83. Extension of SOM to data described by a kernel
[Villa and Rossi, 2007]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: for t = 1 → T do
3:   pick at random i ∈ {1, . . . , n}
4:   Assignment: f^t(x_i) = arg min_{u=1,...,U} ||x_i − p^t_u||^2
5:   for all u = 1 → U do (Representation)
6:     p^{t+1}_u = p^t_u + µ(t) H^t(d(f^t(x_i), u)) (x_i − p^t_u)
7:   end for
8: end for
The general relational variant is implemented in SOMbrero.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
84. Extension of SOM to data described by a kernel
[Villa and Rossi, 2007]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: for t = 1 → T do
3:   pick at random i ∈ {1, . . . , n}
4:   Assignment: f^t(x_i) = arg min_{u=1,...,U} ||φ(x_i) − p^t_u||^2_X
5:   for all u = 1 → U do (Representation)
6:     p^{t+1}_u = p^t_u + µ(t) H^t(d(f^t(x_i), u)) (x_i − p^t_u)
7:   end for
8: end for
The general relational variant is implemented in SOMbrero.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
85. Extension of SOM to data described by a kernel
[Villa and Rossi, 2007]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: for t = 1 → T do
3:   pick at random i ∈ {1, . . . , n}
4:   Assignment: f^t(x_i) = arg min_{u=1,...,U} ||φ(x_i) − p^t_u||^2_X
5:   for all u = 1 → U do (Representation)
6:     p^{t+1}_u = p^t_u + µ(t) H^t(d(f^t(x_i), u)) (φ(x_i) − p^t_u)
7:   end for
8: end for
The general relational variant is implemented in SOMbrero.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
86. Extension of SOM to data described by a kernel
[Villa and Rossi, 2007]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: for t = 1 → T do
3:   pick at random i ∈ {1, . . . , n}
4:   Assignment: f^t(x_i) = arg min_{u=1,...,U} (β^t_u)^T K β^t_u − 2 (β^t_u)^T K_{·i}
5:   for all u = 1 → U do (Representation)
6:     β^{t+1}_u = β^t_u + µ(t) H^t(d(f^t(x_i), u)) (1_i − β^t_u)
7:   end for
8: end for
The general relational variant is implemented in SOMbrero.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
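The fully kernelized version only manipulates the Gram matrix K = (k(x_i, x_j))_{ij} and the coefficients β, which a short NumPy sketch makes concrete (illustrative schedules and neighborhood again; the constant K_ii term is dropped from the assignment criterion since it does not depend on u):

```python
import numpy as np

def kernel_som(K, grid, T=500, mu0=0.5, sigma0=1.0, seed=0):
    """Kernel SOM: prototypes are convex combinations p_u = sum_i beta_ui phi(x_i)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    U = grid.shape[0]
    beta = rng.random((U, n))
    beta /= beta.sum(axis=1, keepdims=True)      # convex combinations
    for t in range(T):
        i = rng.integers(n)
        # assignment: ||phi(x_i) - p_u||^2 = K_ii + beta_u' K beta_u - 2 beta_u' K_.i
        crit = np.einsum('uj,jk,uk->u', beta, K, beta) - 2 * beta @ K[:, i]
        f = np.argmin(crit)
        # representation: move beta_u toward the indicator vector 1_i
        mu = mu0 * (1 - t / T)
        d = np.abs(grid - grid[f]).sum(axis=1)
        H = np.exp(-(d ** 2) / (2 * sigma0 ** 2))
        one_i = np.zeros(n)
        one_i[i] = 1.0
        beta += (mu * H)[:, None] * (one_i - beta)   # stays a convex combination
    return beta
```

Since the step µ(t)H^t is at most mu0 ≤ 1, each row of beta remains a convex combination throughout.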
87. Example: SOM for a typology of Astraptes fulgerator from DNA barcoding
Edit distances between DNA sequences [Olteanu and Villa-Vialaneix, 2015]
Almost perfect clustering (identifying a possible label error on one sample), with additional information on the relations between species.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 41/48
88. Problems with KSOM
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: for t = 1 → γn do
3:   pick at random i ∈ {1, . . . , n}
4:   Assignment: f^t(x_i) = arg min_{u=1,...,U} Σ_{j,j′=1}^n β^t_{uj} β^t_{uj′} K_{jj′} − 2 Σ_{j=1}^n β^t_{uj} K_{ji}   → O(n^2 U)
5:   for all u = 1 → U do (Representation)
6:     β^{t+1}_u = β^t_u + µ(t) H^t(d(f^t(x_i), u)) (1_i − β^t_u)   → O(nU)
7:   end for
8: end for
→ algorithm complexity: O(γn^3 U) (compared to O(γUpn) for numeric data)
Nathalie Vialaneix | Kernel methods for data integration in systems biology 42/48
89. Reducing the stochastic K-SOM complexity
[Mariette et al., 2017a]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: for t = 1 → γn do
3:   pick at random i ∈ {1, . . . , n}
4:   Assignment: f^t(x_i) = arg min_{u=1,...,U} Σ_{j,j′=1}^n β^t_{uj} β^t_{uj′} K_{jj′} − 2 Σ_{j=1}^n β^t_{uj} K_{ji}
5:   for all u = 1 → U do (Representation)
6:     β^{t+1}_u = β^t_u + µ(t) H^t(d(f^t(x_i), u)) (1_i − β^t_u)
7:   end for
8: end for
Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
90. Reducing the stochastic K-SOM complexity
[Mariette et al., 2017a]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: for t = 1 → γn do
3:   pick at random i ∈ {1, . . . , n}
4:   Assignment: f^t(x_i) = arg min_{u=1,...,U} Σ_{j,j′=1}^n β^t_{uj} β^t_{uj′} K_{jj′} − 2 Σ_{j=1}^n β^t_{uj} K_{ji}, where the quadratic term is denoted A^t_u and the second sum B^t_{ui}
5:   for all u = 1 → U do (Representation)
6:     β^{t+1}_u = β^t_u + µ(t) H^t(d(f^t(x_i), u)) (1_i − β^t_u)
7:   end for
8: end for
Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
91. Reducing the stochastic K-SOM complexity
[Mariette et al., 2017a]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: for t = 1 → γn do
3:   pick at random i ∈ {1, . . . , n}
4:   Assignment: f^t(x_i) = arg min_{u=1,...,U} A^t_u − 2 B^t_{ui}
5:   for all u = 1 → U do (Representation)
6:     β^{t+1}_u = β^t_u + µ(t) H^t(d(f^t(x_i), u)) (1_i − β^t_u)
7:   end for
8: end for
Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
92. Reducing the stochastic K-SOM complexity
[Mariette et al., 2017a]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: for t = 1 → γn do
3:   pick at random i ∈ {1, . . . , n}
4:   Assignment: f^t(x_i) = arg min_{u=1,...,U} A^t_u − 2 B^t_{ui}
5:   for all u = 1 → U do (Representation)
6:     β^{t+1}_u = β^t_u + µ(t) H^t(d(f^t(x_i), u)) (1_i − β^t_u), where λ_u(t) denotes µ(t) H^t(d(f^t(x_i), u))
7:   end for
8: end for
Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
93. Reducing the stochastic K-SOM complexity
[Mariette et al., 2017a]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: for t = 1 → γn do
3:   pick at random i ∈ {1, . . . , n}
4:   Assignment: f^t(x_i) = arg min_{u=1,...,U} A^t_u − 2 B^t_{ui}
5:   for all u = 1 → U do (Representation)
6:     β^{t+1}_u = (1 − λ_u(t)) β^t_u + λ_u(t) 1_i
7:   end for
8: end for
Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
96. Reducing the stochastic K-SOM complexity
[Mariette et al., 2017a]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: A^0_u = Σ_{j,j′=1}^n β^0_{uj} β^0_{uj′} K_{jj′}
3: B^0_{ui} = Σ_{j=1}^n β^0_{uj} K_{ji}
4: for t = 1 → γn do
5:   pick at random i ∈ {1, . . . , n}
6:   Assignment: f^t(x_i) = arg min_{u=1,...,U} A^t_u − 2 B^t_{ui}
7:   for all u = 1 → U do (Representation)
8:     β^{t+1}_u = (1 − λ_u(t)) β^t_u + λ_u(t) 1_i
       B^{t+1}_{ui′} = Σ_{j=1}^n β^{t+1}_{uj} K_{i′j} = (1 − λ_u(t)) B^t_{ui′} + λ_u(t) K_{ii′}
       A^{t+1}_u = Σ_{j,j′=1}^n β^{t+1}_{uj} β^{t+1}_{uj′} K_{jj′} = (1 − λ_u(t))^2 A^t_u + λ_u(t)^2 K_{ii} + 2 λ_u(t)(1 − λ_u(t)) B^t_{ui}
9:   end for
10: end for
Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
100. Reducing the stochastic K-SOM complexity
[Mariette et al., 2017a]
Data: (x_i)_{i=1,...,n} ∈ X
1: Initialization: p^0_u = Σ_{i=1}^n β^0_{ui} φ(x_i) (convex combination)
2: A^0_u = Σ_{j,j′=1}^n β^0_{uj} β^0_{uj′} K_{jj′}   → O(n^2 U)
3: B^0_{ui} = Σ_{j=1}^n β^0_{uj} K_{ji}   → O(nU)
4: for t = 1 → γn do
5:   pick at random i ∈ {1, . . . , n}
6:   Assignment: f^t(x_i) = arg min_{u=1,...,U} A^t_u − 2 B^t_{ui}   → does not depend on n
7:   for all u = 1 → U do (Representation)
8:     β^{t+1}_u = (1 − λ_u(t)) β^t_u + λ_u(t) 1_i
       B^{t+1}_{ui′} = Σ_{j=1}^n β^{t+1}_{uj} K_{i′j} = (1 − λ_u(t)) B^t_{ui′} + λ_u(t) K_{ii′}   → O(nU)
       A^{t+1}_u = Σ_{j,j′=1}^n β^{t+1}_{uj} β^{t+1}_{uj′} K_{jj′} = (1 − λ_u(t))^2 A^t_u + λ_u(t)^2 K_{ii} + 2 λ_u(t)(1 − λ_u(t)) B^t_{ui}   → O(U)
9:   end for
10: end for
Final complexity: O(γn^2 U), with additional storage of O(U) and O(Un).
Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
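The two recursions for B and A can be checked numerically: updating the cached quantities gives exactly the values obtained by recomputing them from scratch (a small NumPy sanity check; the Gram matrix, coefficients and step sizes are arbitrary illustrative values):

```python
import numpy as np

def check_cached_updates(n=6, U=3, i=2, lam_val=0.3, seed=1):
    """Verify the O(nU)/O(U) recursions for B and A against full recomputation."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, 2))
    K = Z @ Z.T                                   # a valid (symmetric PSD) Gram matrix
    beta = rng.random((U, n))
    beta /= beta.sum(axis=1, keepdims=True)       # convex combinations
    A = np.einsum('uj,jk,uk->u', beta, K, beta)   # A_u = beta_u' K beta_u
    B = beta @ K                                  # B_{ui'} = sum_j beta_{uj} K_{ji'}
    lam = np.full(U, lam_val)                     # per-unit steps lambda_u(t)
    one_i = np.eye(n)[i]
    new_beta = (1 - lam)[:, None] * beta + lam[:, None] * one_i
    # recursive updates: no sum over j, j' is needed
    B_new = (1 - lam)[:, None] * B + lam[:, None] * K[i]
    A_new = (1 - lam) ** 2 * A + lam ** 2 * K[i, i] + 2 * lam * (1 - lam) * B[:, i]
    ok_B = np.allclose(B_new, new_beta @ K)
    ok_A = np.allclose(A_new, np.einsum('uj,jk,uk->u', new_beta, K, new_beta))
    return ok_B and ok_A
```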
101. Conclusions
Kernel methods are useful for:
dealing with different types of data
even when they are high-dimensional
combining them
However, they can be:
computationally intensive to train
not easy to interpret (work in progress with Jérôme Mariette and Céline Brouard on variable selection in an unsupervised setting)
Nathalie Vialaneix | Kernel methods for data integration in systems biology 44/48
102. SOMbrero
SOMbrero: Madalina Olteanu, Fabrice Rossi, Marie Cottrell, Laura Bendhaïba and Julien Boelaert
SOMbrero and mixKernel: Jérôme Mariette
adjclust: Pierre Neuvial, Nathanaël Randriamihamison, Guillem Rigaill, Christophe Ambroise and Shubham Chaturvedi
Nathalie Vialaneix | Kernel methods for data integration in systems biology 45/48
103. Credits for pictures
Slide 3: image based on ENCODE project, by Darryl Leja (NHGRI), Ian Dunham
(EBI) and Michael Pazin (NHGRI)
Slide 8: k-means image from Wikimedia Commons by Weston.pace
Slide 10: Astraptes picture is from
https://www.flickr.com/photos/39139121@N00/2045403823/ by Anne Toal
(CC BY-SA 2.0), Hi-C experiment is taken from the article Matharu et al., 2015
DOI:10.1371/journal.pgen.1005640 (CC BY-SA 4.0) and metagenomics illustration is
taken from the article Sommer et al., 2010 DOI:10.1038/msb.2010.16 (CC BY-NC-SA
3.0)
Other pictures are from articles that I co-authored.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 46/48
104. References
Ambroise, C., Dehman, A., Neuvial, P., Rigaill, G., and Vialaneix, N. (2019).
Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics.
arXiv preprint arXiv:1902.01596.
Aronszajn, N. (1950).
Theory of reproducing kernels.
Transactions of the American Mathematical Society, 68(3):337–404.
Bach, F. (2013).
Sharp analysis of low-rank kernel matrix approximations.
Journal of Machine Learning Research, Workshop and Conference Proceedings, 30:185–209.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011).
Distributed optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122.
Brouard, C., Shen, H., Dührkop, K., d'Alché Buc, F., Böcker, S., and Rousu, J. (2016).
Fast metabolite identification with input output kernel regression.
Bioinformatics, 32(12):i28–i36.
Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J.,
Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S.,
Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker,
P., Karsenti, E., and Sullivan, M. (2015).
Patterns and ecological drivers of ocean viral communities.
Science, 348(6237).
Cortes, C., Mohri, M., and Talwalkar, A. (2010).
On the impact of kernel approximation on learning accuracy.
Journal of Machine Learning Research, Workshop and Conference Proceedings, 9:113–120.
Crone, L. and Crosby, D. (1995).
Nathalie Vialaneix | Kernel methods for data integration in systems biology 46/48
105. Statistical applications of a metric on subspaces to satellite meteorology.
Technometrics, 37(3):324–328.
de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I.,
Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O.,
Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F.,
Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C.,
Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S.,
Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015).
Eukaryotic plankton diversity in the sunlit ocean.
Science, 348(6237).
Drineas, P. and Mahoney, M. (2005).
On the Nyström method for approximating a Gram matrix for improved kernel-based learning.
Journal of Machine Learning Research, 6:2153–2175.
Goldfarb, L. (1984).
A unified approach to pattern recognition.
Pattern Recognition, 17(5):575–582.
Gönen, M. and Alpaydin, E. (2011).
Multiple kernel learning algorithms.
Journal of Machine Learning Research, 12:2211–2268.
Jaakkola, T., Diekhans, M., and Haussler, D. (2000).
A discriminative framework for detecting remote protein homologies.
Journal of Computational Biology, 7(1-2):95–114.
Kohonen, T. (2001).
Self-Organizing Maps, 3rd Edition, volume 30.
Springer, Berlin, Heidelberg, New York.
Kondor, R. and Lafferty, J. (2002).
Diffusion kernels on graphs and other discrete structures.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 46/48
106. In Sammut, C. and Hoffmann, A., editors, Proceedings of the 19th International Conference on Machine Learning, pages
315–322, Sydney, Australia. Morgan Kaufmann Publishers Inc. San Francisco, CA, USA.
Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994).
The ACT (STATIS method).
Computational Statistics and Data Analysis, 18(1):97–119.
L’Hermier des Plantes, H. (1976).
Structuration des tableaux à trois indices de la statistique.
PhD thesis, Université de Montpellier.
Thèse de troisième cycle.
Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F.,
Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d’Oviedo, F.,
de Meester, L., Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P.,
Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G.,
Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M.,
Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015).
Determinants of community structure in the global plankton interactome.
Science, 348(6237).
Lin, Y., Liu, T., and Fuh, C. (2010).
Multiple kernel learning for dimensionality reduction.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160.
Mariette, J., Olteanu, M., and Villa-Vialaneix, N. (2017a).
Efficient interpretable variants of online SOM for large dissimilarity data.
Neurocomputing, 225:31–48.
Mariette, J., Rossi, F., Olteanu, M., and Villa-Vialaneix, N. (2017b).
Accelerating stochastic kernel SOM.
In Verleysen, M., editor, XXVth European Symposium on Artificial Neural Networks, Computational Intelligence and Machine
Learning (ESANN 2017), pages 269–274, Bruges, Belgium. i6doc.
Mariette, J. and Vialaneix, N. (2019).
Nathalie Vialaneix | Kernel methods for data integration in systems biology 46/48
107. Approches à noyau pour l’analyse et l’intégration de données omiques en biologie des systèmes.
Forthcoming (book chapter).
Mariette, J. and Villa-Vialaneix, N. (2018).
Unsupervised multiple kernel learning for heterogeneous data integration.
Bioinformatics, 34(6):1009–1015.
Olteanu, M. and Villa-Vialaneix, N. (2015).
On-line relational and multiple relational SOM.
Neurocomputing, 147:15–30.
Randriamihamison, N., Vialaneix, N., and Neuvial, P. (2019).
Applicability and interpretability of hierarchical agglomerative clustering with or without contiguity constraints.
Submitted for publication. Preprint arXiv:1909.10923.
Robert, P. and Escoufier, Y. (1976).
A unifying tool for linear multivariate statistical methods: the RV-coefficient.
Applied Statistics, 25(3):257–265.
Rossi, F., Hasenfuss, A., and Hammer, B. (2007).
Accelerating relational clustering algorithms with sparse prototype representation.
In Proceedings of the 6th Workshop on Self-Organizing Maps (WSOM 07), Bielefeld, Germany. Neuroinformatics Group,
Bielefeld University.
Saigo, H., Vert, J.-P., Ueda, N., and Akutsu, T. (2004).
Protein homology detection using string alignment kernels.
Bioinformatics, 20(11):1682–1689.
Shen, H., Dührkop, K., Böcker, S., and Rousu, J. (2014).
Metabolite identification through multiple kernel learning on fragmentation trees.
Bioinformatics, 30(12):i157–i164.
Sommer, M., Church, G., and Dantas, G. (2010).
A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 46/48
108. Molecular Systems Biology, 6(360).
Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A.,
Cornejo-Castillo, F., Costea, P., Cruaud, C., d’Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka,
F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral,
M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P.,
Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P.,
Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015).
Structure and function of the global ocean microbiome.
Science, 348(6237).
Villa, N. and Rossi, F. (2007).
A comparison between dissimilarity SOM and kernel SOM for clustering the vertices of a graph.
In 6th International Workshop on Self-Organizing Maps (WSOM 2007), Bielefeld, Germany. Neuroinformatics Group, Bielefeld
University.
Williams, C. and Seeger, M. (2000).
Using the Nyström method to speed up kernel machines.
In Leen, T., Dietterich, T., and Tresp, V., editors, Advances in Neural Information Processing Systems (Proceedings of NIPS
2000), volume 13, Denver, CO, USA. Neural Information Processing Systems Foundation.
Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011).
Unsupervised multiple kernel clustering.
Journal of Machine Learning Research: Workshop and Conference Proceedings, 20:129–144.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 47/48
111. Optimization issues
The sparse version writes min_β β^T S β s.t. β ≥ 0 and ||β||_1 = Σ_m β_m = 1 ⇒ a standard QP problem with linear constraints (e.g., package quadprog in R).
The non-sparse version writes min_β β^T S β s.t. β ≥ 0 and ||β||_2 = 1 ⇒ a QPQC problem (hard to solve).
It is solved using the Alternating Direction Method of Multipliers (ADMM, [Boyd et al., 2011]) by replacing the previous optimization problem with
min_{x,z} x^T S x + 1_{{x ≥ 0}}(x) + 1_{{||z||_2^2 ≥ 1}}(z)
under the constraint x − z = 0.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 47/48
112. Optimization issues
The sparse version writes min_β β^T S β s.t. β ≥ 0 and ||β||_1 = Σ_m β_m = 1 ⇒ a standard QP problem with linear constraints (e.g., package quadprog in R).
The non-sparse version writes min_β β^T S β s.t. β ≥ 0 and ||β||_2 = 1 ⇒ a QPQC problem (hard to solve).
It is solved using the Alternating Direction Method of Multipliers (ADMM, [Boyd et al., 2011]):
1. min_x x^T S x + y^T(x − z) + (λ/2)||x − z||^2 under the constraint x ≥ 0 (standard QP problem)
2. project onto the unit ball: z = x / min{||x||_2, 1}
3. update the auxiliary variable: y = y + λ(x − z)
Nathalie Vialaneix | Kernel methods for data integration in systems biology 47/48
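The three ADMM steps can be sketched directly (hypothetical Python; the x-update QP is solved here with a bound-constrained quasi-Newton method from scipy.optimize rather than a dedicated QP solver such as quadprog, which is enough for a sketch):

```python
import numpy as np
from scipy.optimize import minimize

def admm_mkl_weights(S, lam=1.0, iters=200):
    """min_x x'Sx s.t. x >= 0 and ||x||_2 = 1, via the ADMM splitting above."""
    m = S.shape[0]
    x = np.full(m, 1.0 / np.sqrt(m))
    z = x.copy()
    y = np.zeros(m)
    for _ in range(iters):
        # step 1: x-update, a QP in x with the positivity constraint
        obj = lambda v: v @ S @ v + y @ (v - z) + 0.5 * lam * np.sum((v - z) ** 2)
        jac = lambda v: (S + S.T) @ v + y + lam * (v - z)
        x = minimize(obj, x, jac=jac, method="L-BFGS-B",
                     bounds=[(0.0, None)] * m).x
        # step 2: z-update, projection z = x / min(||x||_2, 1)
        nx = np.linalg.norm(x)
        z = x / max(min(nx, 1.0), 1e-12)
        # step 3: dual variable update
        y = y + lam * (x - z)
    return z / np.linalg.norm(z)
```

On a toy similarity matrix S this returns nonnegative weights of unit Euclidean norm; the sparse (ℓ1) variant would instead be handed to a standard QP solver.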
115. A proposal to improve interpretability of K-PCA in our framework
Issue: how to assess the importance of a given species in the K-PCA?
our datasets are either numeric (environmental) or built from an n × p count matrix
⇒ for a given species, randomly permute its counts and re-do the analysis (kernel computation, with the same optimized weights, and K-PCA)
the influence of a given species in a given dataset on a given PC subspace is assessed by computing the Crone-Crosby distance between the two PCA subspaces [Crone and Crosby, 1995] (∼ Frobenius norm between the projectors)
Nathalie Vialaneix | Kernel methods for data integration in systems biology 48/48
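This proposal can be sketched end to end (hypothetical NumPy code: a plain linear kernel and a 1/√2 scaling of the projector distance stand in for the actual combined kernel, with its optimized weights, used in the talk):

```python
import numpy as np

def pc_projector(K, d=2):
    """Projector onto the span of the first d kernel principal components."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering operator
    Kc = J @ K @ J
    w, V = np.linalg.eigh(Kc)
    U = V[:, np.argsort(w)[::-1][:d]]             # top-d eigenvectors
    return U @ U.T

def species_influence(counts, kernel, d=2, seed=0):
    """Permutation importance of each species (column) on the K-PCA subspace."""
    rng = np.random.default_rng(seed)
    P0 = pc_projector(kernel(counts), d)
    scores = []
    for j in range(counts.shape[1]):
        perm = counts.copy()
        perm[:, j] = rng.permutation(perm[:, j])  # permute one species' counts
        Pj = pc_projector(kernel(perm), d)
        # Crone-Crosby-type distance between the two PC subspaces
        scores.append(np.linalg.norm(P0 - Pj) / np.sqrt(2))
    return np.array(scores)
```

A large score for a species means that permuting its counts strongly rotates the retained PC subspace, i.e. that the species drives the K-PCA representation.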