Kernel methods for data integration in systems biology
Nathalie Vialaneix
nathalie.vialaneix@inra.fr
http://www.nathalievialaneix.eu
KIM Seminar
October 18th, 2019 - Montpellier
Nathalie Vialaneix | Kernel methods for data integration in systems biology 1/48
A primer on kernel methods for biology
Before we start: context and motivations
Data characteristics
a few (paired) samples
information at various levels
... but of heterogeneous types
and, when numeric, with a large
dimension
What we want to achieve
integrative analysis
to predict a phenotype, to
understand the typology of the
samples, ...
In short: what are kernels?
Data we are used to...
n samples on which p variables are measured: $(x_i)_{i=1,\ldots,n}$ with $x_i \in \mathbb{R}^p$.
From that, we can compute:
- the center of gravity: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
- distances and dot products: $d(x_i, x_{i'}) = \sqrt{\sum_{j=1}^p (x_{ij} - x_{i'j})^2}$ and $\langle x_i, x_{i'}\rangle = \sum_{j=1}^p x_{ij} x_{i'j}$
Kernels...
The characteristics of the n samples $(x_i)_i$ are summarized by pairwise similarities.
More formally: an $n \times n$ matrix K, such that K is symmetric and positive definite.
[Aronszajn, 1950] (Moore-Aronszajn theorem): there exist a unique Hilbert space H and a mapping $\varphi : \mathcal{X} \to H$ such that $K_{ii'} = \langle\varphi(x_i), \varphi(x_{i'})\rangle_H$
Why are kernels interesting?
1 because they summarize high dimensional data into small similarity matrices
2 because they are not restricted to data in $\mathbb{R}^p$ (kernels on graphs, between graphs, on text, ...) some examples to come
3 because they can embed expert knowledge (e.g., phylogeny between taxa) some examples to come
4 because they offer a rigorous framework to extend many statistical methods basic principles to come just after
5 because they offer a clean and common framework for data integration extension 1
but:
1 the choice of the relevant kernel is still up to you...
2 they can strongly increase computational time when n is large... extension 2
Kernel examples
1 observations in $\mathbb{R}^p$: Gaussian kernel $K_{ii'} = e^{-\gamma \|x_i - x_{i'}\|^2}$
2 nodes of a graph: [Kondor and Lafferty, 2002]
3 sequence kernels (used to compute similarities between proteins for instance): spectrum kernel [Jaakkola et al., 2000] (with HMM), convolution kernel [Saigo et al., 2004]
4 kernels between graphs (or "structured data"; used in metabolomics to compute similarities between metabolites based on their fragmentation trees): [Shen et al., 2014, Brouard et al., 2016]
More examples: [Mariette and Vialaneix, 2019]
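As an illustration of example 1, a minimal pure-Python sketch of the Gaussian kernel matrix (the function name and the toy data are ours, not from the talk):

```python
import math

def gaussian_kernel(X, gamma=1.0):
    """Gaussian kernel matrix: K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = math.exp(-gamma * sq)
    return K

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gaussian_kernel(X, gamma=0.5)  # symmetric, with 1s on the diagonal
```

The resulting matrix is symmetric and positive definite, as required by the definition above.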
Principles for learning from kernels
Start from any statistical method (PCA, regression, k-means clustering) and rewrite all quantities using:
- K to compute distances and dot products: the dot product is $K_{ii'}$ and the distance is $\sqrt{K_{ii} + K_{i'i'} - 2K_{ii'}}$
- (implicit) linear or convex combinations of $(\varphi(x_i))_i$ to describe all unobserved elements (centers of gravity and so on...)
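The distance formula above uses only kernel entries, never coordinates. A short sketch (helper name is ours), checked on a linear kernel where the feature-space distance must coincide with the usual one:

```python
import math

def kernel_distance(K, i, j):
    """Feature-space distance from the kernel matrix alone:
    d(x_i, x_j) = sqrt(K_ii + K_jj - 2 K_ij)."""
    return math.sqrt(K[i][i] + K[j][j] - 2 * K[i][j])

# Linear kernel on 1D points 0, 3, 4: K[i][j] = x_i * x_j
xs = [0.0, 3.0, 4.0]
K = [[a * b for b in xs] for a in xs]
# kernel_distance(K, 1, 2) recovers |3 - 4| = 1
```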
A simple example: k-means
1: Initialization: randomly set P centers $\bar{x}_{C_j^1} \in \mathbb{R}^p$
2: for t = 1 to T do
3:   Assignment step: $\forall\, i = 1, \ldots, n$, $f^{t+1}(x_i) = \arg\min_{j=1,\ldots,P} d(x_i, \bar{x}_{C_j^t})$
4:   Representation step: $\forall\, j = 1, \ldots, P$, $\bar{x}_{C_j^t} = \frac{1}{|C_j^t|} \sum_{x_l \in C_j^t} x_l$
5: end for (convergence)
6: return the partition
A simple example: k-means (kernelized)
1: Initialization: random initialization of a partition of $(x_i)_i$ and $\bar{x}_{C_j^1} = \frac{1}{|C_j^1|} \sum_{x_i \in C_j^1} \varphi(x_i)$
2: for t = 1 to T do
3:   Assignment step: $f^{t+1}(x_i) = \arg\min_{j=1,\ldots,P} \left( K_{ii} - \frac{2}{|C_j^t|} \sum_{x_l \in C_j^t} K_{il} + \frac{1}{|C_j^t|^2} \sum_{x_l, x_{l'} \in C_j^t} K_{ll'} \right)$
4:   Representation step: $\forall\, j = 1, \ldots, P$, $\bar{x}_{C_j^t} = \frac{1}{|C_j^t|} \sum_{x_l \in C_j^t} \varphi(x_l)$ (implicit)
5: end for (convergence)
6: return the partition
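The kernelized assignment step above can be sketched directly from K, without ever computing the centers. A minimal implementation (function name and the deterministic initialization are ours):

```python
def kernel_kmeans(K, P, T=20):
    """Kernel k-means: the distance to a cluster center expands to
    K_ii - (2/|C_j|) sum_{l in C_j} K_il + (1/|C_j|^2) sum_{l,l' in C_j} K_ll'."""
    n = len(K)
    labels = [i % P for i in range(n)]  # deterministic initial partition
    for _ in range(T):
        clusters = [[l for l in range(n) if labels[l] == j] for j in range(P)]
        # third term of the expansion, constant within each cluster
        intra = [sum(K[l][lp] for l in C for lp in C) / len(C) ** 2
                 if C else float("inf") for C in clusters]
        new = []
        for i in range(n):
            costs = [K[i][i] - 2 * sum(K[i][l] for l in C) / len(C) + intra[j]
                     if C else float("inf")
                     for j, C in enumerate(clusters)]
            new.append(min(range(P), key=costs.__getitem__))
        if new == labels:
            break
        labels = new
    return labels

# Linear kernel on two well separated 1D groups of points
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
K = [[a * b for b in xs] for a in xs]
labels = kernel_kmeans(K, P=2)  # separates the two groups
```

With a linear kernel this reduces exactly to standard k-means, which is an easy way to sanity-check the expansion.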
Beyond kernels: relational data
DNA barcoding (Astraptes fulgerator): optimal matching (edit) distances to differentiate species
Hi-C data: pairwise measure (similarity) related to the physical 3D distance between loci in the cell, at genome scale [Ambroise et al., 2019, Randriamihamison et al., 2019]
Metagenomics: dissimilarity between samples is better captured when the phylogeny between species is taken into account (UniFrac distances)
Formally, relational data are:
- Euclidean distances or (non Euclidean) dissimilarities between n entities: a symmetric $(n \times n)$-matrix D with positive entries and null diagonal
- kernels: a symmetric and positive definite $(n \times n)$-matrix K that measures a "relation" between n entities in $\mathcal{X}$ (arbitrary space): $K(x, x') = \langle\varphi(x), \varphi(x')\rangle$
- networks/graphs: groups of n entities (nodes/vertices) linked by a (potentially weighted) relation (edges) ⇒ a symmetric $(n \times n)$-matrix W with positive entries and null diagonal
- similarities between n entities: a symmetric $(n \times n)$-matrix S (usually with positive entries) but not necessarily positive definite
Different relational data types are related to each other
- a kernel is equivalent to a Euclidean distance: $D(x, x') := \sqrt{K(x, x) + K(x', x') - 2K(x, x')}$
- from a dissimilarity, similarities can be computed: $S(x, x) := a(x)$ (arbitrary) and $S(x, x') := \frac{1}{2}\left(a(x) + a(x') - D^2(x, x')\right)$
- various kernels have been proposed for graphs (e.g., based on the graph Laplacian): [Kondor and Lafferty, 2002]
in summary
useful simplification: "is the framework Euclidean or not?" (e.g., kernel vs non Euclidean dissimilarity)
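Both conversions can be sketched in a few lines (function names are ours). With $a(x) = K(x, x)$, going kernel → distance → similarity recovers the kernel, a convenient consistency check:

```python
import math

def kernel_to_distance(K, i, j):
    """Euclidean distance induced by a kernel: sqrt(K_ii + K_jj - 2 K_ij)."""
    return math.sqrt(K[i][i] + K[j][j] - 2 * K[i][j])

def distance_to_similarity(D, a):
    """Similarity from a dissimilarity, with arbitrary self-similarities a:
    S_ij = (a_i + a_j - D_ij^2) / 2."""
    n = len(D)
    return [[0.5 * (a[i] + a[j] - D[i][j] ** 2) for j in range(n)]
            for i in range(n)]

xs = [1.0, 2.0, 4.0]
K = [[a * b for b in xs] for a in xs]            # linear kernel
D = [[kernel_to_distance(K, i, j) for j in range(3)] for i in range(3)]
S = distance_to_similarity(D, a=[K[i][i] for i in range(3)])  # recovers K
```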
Principles for learning from relational data
Euclidean case (kernel K): rewrite all quantities using
- K to compute distances and dot products
- linear or convex combinations of $(\varphi(x_i))_i$ to describe all unobserved elements (centers of gravity and so on...)
Works for: PCA, k-means, linear regression, ...
non Euclidean case (non Euclidean dissimilarity D): do almost the same using a pseudo-Euclidean framework [Goldfarb, 1984]: there exist two Euclidean spaces $E_+$ and $E_-$ and two mappings $\varphi_+$ and $\varphi_-$ such that
$D(x, x') = \|\varphi_+(x) - \varphi_+(x')\|^2_{E_+} - \|\varphi_-(x) - \varphi_-(x')\|^2_{E_-}$
And now?
1 integrate multiple data sources with kernels (with application to
metagenomic datasets) extension 1
2 reduce complexity of kernel methods extension 2
Combining relational data in an unsupervised setting
What are metagenomic data?
Source: [Sommer et al., 2010]
- abundance data: sparse $n \times p$-matrices with count data, samples in rows and descriptors (species, OTUs, KEGG groups, k-mers, ...) in columns. Generally $p \gg n$.
- phylogenetic tree (evolution history between species, OTUs...): one tree with p leaves, built from the sequences collected in the n samples.
What are metagenomic data used for?
- producing a profile of the diversity of a given sample ⇒ allows diversity to be compared between various conditions
- used in various fields: environmental science, microbiota studies, ...
Processed by computing a relevant dissimilarity between samples (the standard Euclidean distance is not relevant) and by using this dissimilarity in subsequent analyses.
β-diversity data: dissimilarities between count data
Compositional dissimilarities ($n_{ig}$: count of species g in sample i):
- Jaccard: the fraction of species specific to either sample i or j:
$d_{jac} = \frac{\sum_g \left( I_{\{n_{ig}>0,\, n_{jg}=0\}} + I_{\{n_{jg}>0,\, n_{ig}=0\}} \right)}{\sum_g I_{\{n_{ig}+n_{jg}>0\}}}$
- Bray-Curtis: the fraction of the sample which is specific to either sample i or j:
$d_{BC} = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})}$
Other dissimilarities are available in the R package phyloseq, most of them not Euclidean.
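The two formulas above translate directly into code. A minimal sketch on toy count vectors (function names and data are ours, not from phyloseq):

```python
def jaccard(ni, nj):
    """Jaccard dissimilarity on counts: species present in exactly one of
    the two samples, over species present in at least one."""
    only_one = sum((a > 0) != (b > 0) for a, b in zip(ni, nj))
    present = sum((a + b) > 0 for a, b in zip(ni, nj))
    return only_one / present

def bray_curtis(ni, nj):
    """Bray-Curtis: sum of absolute count differences over total counts."""
    return (sum(abs(a - b) for a, b in zip(ni, nj))
            / sum(a + b for a, b in zip(ni, nj)))

ni = [10, 0, 5, 1]
nj = [10, 4, 0, 1]
# jaccard: 2 species out of 4 present are specific to one sample -> 0.5
# bray_curtis: (0 + 4 + 5 + 0) / (20 + 4 + 5 + 2) = 9/31
```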
β-diversity data: phylogenetic dissimilarities
For each branch e, denote $l_e$ its length and $p_{ei}$ the fraction of counts in sample i corresponding to species below branch e.
- UniFrac: the fraction of the tree specific to either sample i or sample j:
$d_{UF} = \frac{\sum_e l_e \left( I_{\{p_{ei}>0,\, p_{ej}=0\}} + I_{\{p_{ej}>0,\, p_{ei}=0\}} \right)}{\sum_e l_e I_{\{p_{ei}+p_{ej}>0\}}}$
- Weighted UniFrac: the fraction of the diversity specific to sample i or to sample j:
$d_{wUF} = \frac{\sum_e l_e |p_{ei} - p_{ej}|}{\sum_e l_e (p_{ei} + p_{ej})}$
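A sketch of the two UniFrac formulas above, taking the tree as a flat list of branches, each given as a (length, p_i, p_j) triple (this representation and the function names are ours, for illustration only):

```python
def unifrac(branches):
    """Unweighted UniFrac: total length of branches leading to only one of
    the two samples, over total length of branches leading to at least one."""
    specific = sum(l for l, pi, pj in branches if (pi > 0) != (pj > 0))
    covered = sum(l for l, pi, pj in branches if pi + pj > 0)
    return specific / covered

def weighted_unifrac(branches):
    """Weighted UniFrac: branch lengths weighted by abundance differences."""
    num = sum(l * abs(pi - pj) for l, pi, pj in branches)
    den = sum(l * (pi + pj) for l, pi, pj in branches)
    return num / den

# One shared branch, one specific to sample i, one specific to sample j
branches = [(1.0, 0.5, 0.5), (2.0, 0.5, 0.0), (1.0, 0.0, 0.5)]
```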
TARA Oceans datasets
The 2009-2013 expedition
Co-directed by Étienne Bourgois
and Éric Karsenti.
7,012 datasets collected from
35,000 samples of plankton and
water (11,535 Gb of data).
Study the plankton: bacteria,
protists, metazoans and viruses
representing more than 90% of the
biomass in the ocean.
TARA Oceans datasets
Science (May 2015) - Studies on:
eukaryotic plankton diversity
[de Vargas et al., 2015],
ocean viral communities
[Brum et al., 2015],
global plankton interactome
[Lima-Mendez et al., 2015],
global ocean microbiome
[Sunagawa et al., 2015],
. . . .
→ datasets of different types and from different sources, analyzed separately.
TARA Oceans datasets that we used
Datasets used
- environmental dataset: 22 numeric features (temperature, salinity, ...)
- bacteria phylogenomic tree: computed from ∼35,000 OTUs [Sunagawa et al., 2015]
- bacteria functional composition: ∼63,000 KEGG orthologous groups [Sunagawa et al., 2015]
- eukaryotic plankton composition, split into 4 size groups [de Vargas et al., 2015]: pico (0.8-5 µm), nano (5-20 µm), micro (20-180 µm) and meso (180-2000 µm)
- virus composition: ∼867 virus clusters based on shared gene content [Brum et al., 2015]
TARA Oceans datasets that we used
Common samples
48 samples,
2 depth layers: surface
(SRF) and deep chlorophyll
maximum (DCM),
31 different sampling
stations.
From multiple kernels to an integrated kernel
How to combine multiple kernels?
- naive approach: $K^* = \frac{1}{M}\sum_m K^m$
- supervised framework: $K^* = \sum_m \beta_m K^m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$, with $\beta_m$ chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011]
- unsupervised framework, but the input space is $\mathbb{R}^p$ [Zhuang et al., 2011]: $K^* = \sum_m \beta_m K^m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$, with $\beta_m$ chosen so as to minimize the distortion between all training data, $\sum_{ij} K^*(x_i, x_j) \|x_i - x_j\|^2$, AND the approximation of the original data by the kernel embedding, $\sum_i \|x_i - \sum_j K^*(x_i, x_j) x_j\|^2$
Our proposal: 2 UMKL frameworks which do not require the data to have values in $\mathbb{R}^p$.
Multi-kernel/distances integration
How to "optimally" combine several relational datasets in an unsupervised setting?
For kernels $K^1, \ldots, K^M$ obtained on the same n objects, search: $K_\beta = \sum_{m=1}^M \beta_m K^m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$ [Mariette and Villa-Vialaneix, 2018]
R package mixKernel: https://cran.r-project.org/package=mixKernel
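The convex combination itself is straightforward, and it preserves positive definiteness (a convex combination of positive definite matrices is positive definite). A minimal sketch (function name is ours, not the mixKernel API):

```python
def combine_kernels(kernels, beta):
    """Convex combination K_beta = sum_m beta_m K^m of M kernel matrices.
    beta must be non-negative and sum to 1."""
    assert all(b >= 0 for b in beta) and abs(sum(beta) - 1.0) < 1e-12
    n = len(kernels[0])
    return [[sum(b * K[i][j] for b, K in zip(beta, kernels))
             for j in range(n)] for i in range(n)]

K1 = [[1.0, 0.2], [0.2, 1.0]]
K2 = [[1.0, 0.8], [0.8, 1.0]]
Kstar = combine_kernels([K1, K2], beta=[0.5, 0.5])  # the naive average
```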
STATIS like framework [L'Hermier des Plantes, 1976, Lavit et al., 1994]
Similarities between kernels:
$C_{mm'} = \frac{\langle K^m, K^{m'}\rangle_F}{\|K^m\|_F \, \|K^{m'}\|_F} = \frac{\mathrm{Trace}(K^m K^{m'})}{\sqrt{\mathrm{Trace}((K^m)^2)\,\mathrm{Trace}((K^{m'})^2)}}$
($C_{mm'}$ is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework)
$\mathrm{maximize}_v \; \sum_{m=1}^M \left\langle K^*(v), \frac{K^m}{\|K^m\|_F} \right\rangle_F = v^\top C v$
for $K^*(v) = \sum_{m=1}^M v_m K^m$ and $v \in \mathbb{R}^M$ such that $\|v\|_2 = 1$.
Solution: the first eigenvector of C ⇒ set $\beta = \frac{v}{\sum_{m=1}^M v_m}$ (consensual kernel).
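The STATIS-like weights above can be sketched with a few lines of numpy (function name is ours, not the mixKernel API; since C has non-negative entries, the leading eigenvector can be taken entrywise non-negative and rescaled to sum to 1):

```python
import numpy as np

def statis_weights(kernels):
    """Consensus weights: C is the matrix of Frobenius cosine similarities
    between kernels; take its leading eigenvector, rescale to sum to 1."""
    M = len(kernels)
    Ks = [np.asarray(K) for K in kernels]
    norms = [np.linalg.norm(K) for K in Ks]  # Frobenius norms
    C = np.array([[np.sum(Ks[m] * Ks[mp]) / (norms[m] * norms[mp])
                   for mp in range(M)] for m in range(M)])
    vals, vecs = np.linalg.eigh(C)
    v = vecs[:, -1]        # eigenvector of the largest eigenvalue
    if v.sum() < 0:        # fix the arbitrary sign of the eigenvector
        v = -v
    return v / v.sum()

# Two identical kernels: the consensus weights must be equal
K = np.array([[2.0, 1.0], [1.0, 2.0]])
beta = statis_weights([K, K])
```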
A kernel preserving the original topology of the data I
Similarly to [Lin et al., 2010], preserve the local geometry of the data in the feature space.
Proxy of the local geometry:
$K^m \longrightarrow G_k^m$ (k-nearest neighbors graph) $\longrightarrow A_k^m$ (adjacency matrix)
⇒ $W = \sum_m I_{\{A_k^m > 0\}}$ or $W = \sum_m A_k^m$
Feature space geometry measured by
$\Delta_i(\beta) = \left( \langle \varphi^*_\beta(x_i), \varphi^*_\beta(x_1) \rangle, \ldots, \langle \varphi^*_\beta(x_i), \varphi^*_\beta(x_n) \rangle \right)^\top = \left( K^*_\beta(x_i, x_1), \ldots, K^*_\beta(x_i, x_n) \right)^\top$
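Each $A_k^m$ can be built from its kernel alone, through the kernel-induced distance. A minimal sketch of one such adjacency matrix (function name is ours):

```python
import math

def knn_adjacency(K, k):
    """k-nearest-neighbor adjacency built from a kernel matrix, using the
    kernel-induced distance d_ij = sqrt(K_ii + K_jj - 2 K_ij)."""
    n = len(K)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        d = [(math.sqrt(K[i][i] + K[j][j] - 2 * K[i][j]), j)
             for j in range(n) if j != i]
        for _, j in sorted(d)[:k]:
            A[i][j] = 1
            A[j][i] = 1  # symmetrize
    return A

# Linear kernel on 1D points 0, 1, 5: each point's nearest neighbor
xs = [0.0, 1.0, 5.0]
K = [[a * b for b in xs] for a in xs]
A = knn_adjacency(K, k=1)
```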
A kernel preserving the original topology of the data II
Sparse version:
$\mathrm{minimize}_\beta \; \sum_{i,j=1}^N W_{ij} \left\| \Delta_i(\beta) - \Delta_j(\beta) \right\|^2$
for $K^*_\beta = \sum_{m=1}^M \beta_m K^m$ and $\beta \in \mathbb{R}^M$ such that $\beta_m \geq 0$ and $\sum_{m=1}^M \beta_m = 1$.
Non sparse version:
$\mathrm{minimize}_v \; \sum_{i,j=1}^N W_{ij} \left\| \Delta_i(v) - \Delta_j(v) \right\|^2$
for $K^*_v = \sum_{m=1}^M v_m K^m$ and $v \in \mathbb{R}^M$ such that $v_m \geq 0$ and $\|v\|_2 = 1$.
Sparse version: equivalent to a standard QP problem with linear constraints (e.g., package quadprog in R)
Non sparse version: equivalent to a QCQP problem (harder to solve), solved with the "Alternating Direction Method of Multipliers" (ADMM, [Boyd et al., 2011])
Application to TARA oceans
Similarity between datasets (STATIS): phychem and small size organisms are the most similar (confirmed by [de Vargas et al., 2015] and [Sunagawa et al., 2015]).
Application to TARA oceans
Important variables: Rhizaria abundance strongly structures the differences between samples (analyses restricted to some organisms found differences mostly based on water depth), and waters from the Arctic and Pacific Oceans differ in terms of Rhizaria abundance.
Reducing complexity of kernel methods
Large scale kernel methods
Standard complexity (number of elementary operations) of kernel learning methods: $O(n^2)$ or even $O(n^3)$
Examples:
- K-PCA: the spectral decomposition of K (equivalent to PCA in the feature space) is $O(n^3)$, as compared to $O(\min(p, n)^3)$ for standard PCA
- kernel k-means: the complexity of naive kernel k-means is $O(Tkn^2)$, as compared to $O(Tkpn)$ for naive standard k-means
Low rank approximation solutions
Aim: approximate K with a low rank matrix (a matrix with rank $r \ll n$). Then, use the approximation to train your predictor and correct it to "re-scale" it to n. Typical computational cost: $O(nr^2)$.
Sketch of Nyström approximation [Williams and Seeger, 2000, Drineas and Mahoney, 2005]
Pick at random m observations in $\{1, \ldots, n\}$ (without loss of generality, suppose that the first m ones have been chosen).
Re-write:
$K = \begin{pmatrix} K^{(m)} & K^{(m,n-m)} \\ K^{(n-m,m)} & K^{(n-m,n-m)} \end{pmatrix}$ and $K^{(n,m)} = \begin{pmatrix} K^{(m)} \\ K^{(n-m,m)} \end{pmatrix}$,
with $K^{(n-m,m)} = (K^{(m,n-m)})^\top$, and use $K^{(m)}$ instead of K.
Approximate spectral decomposition of K
Notations:
- K: eigenvectors $(v_j)_{j=1,\ldots,n}$, eigenvalues $(\lambda_j)_{j=1,\ldots,n}$ (positive, in decreasing order)
- $K^{(m)}$: eigenvectors $(v_j^{(m)})_{j=1,\ldots,m}$, eigenvalues $(\lambda_j^{(m)})_{j=1,\ldots,m}$ (positive, in decreasing order)
Then:
$\forall\, j = 1, \ldots, m, \quad \lambda_j \simeq \frac{n}{m}\,\lambda_j^{(m)} \quad\text{and}\quad v_j \simeq \sqrt{\frac{m}{n}}\,\frac{1}{\lambda_j^{(m)}}\,K^{(n,m)} v_j^{(m)}$
complexity of the direct calculation: $O(n^3)$
complexity of the approximate solution: $O(m^3) + O(nm^2)$
Remark: when the rank of K is < m, the approximation is exact.
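A minimal numpy sketch of this approximation (function name is ours; near-zero eigenvalues are dropped to avoid dividing by them). On a rank-1 kernel with m = 2 the reconstruction is exact, illustrating the remark above:

```python
import numpy as np

def nystrom_eig(K, m, seed=0):
    """Nystrom approximation of the spectral decomposition of K:
    lambda_j ~ (n/m) lambda_j^(m) and
    v_j ~ sqrt(m/n) K^(n,m) v_j^(m) / lambda_j^(m).
    Cost: O(m^3) + O(n m^2) instead of O(n^3)."""
    n = K.shape[0]
    idx = np.random.default_rng(seed).choice(n, size=m, replace=False)
    Km = K[np.ix_(idx, idx)]
    lam_m, V_m = np.linalg.eigh(Km)            # ascending order
    lam_m, V_m = lam_m[::-1], V_m[:, ::-1]     # decreasing order
    keep = lam_m > 1e-10 * lam_m[0]            # drop numerically null values
    lam_m, V_m = lam_m[keep], V_m[:, keep]
    lam = (n / m) * lam_m
    V = np.sqrt(m / n) * (K[:, idx] @ (V_m / lam_m))
    return lam, V

# Rank-1 kernel: rank < m, so V diag(lam) V^T reconstructs K exactly
x = np.arange(1.0, 7.0)
K = np.outer(x, x)
lam, V = nystrom_eig(K, m=2)
```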
What can be obtained from that...
Approximation of the original kernel K [Cortes et al., 2010, Bach, 2013]
Approximation of the kernel (ridge) regression with a control of the
estimation error [Cortes et al., 2010, Bach, 2013] (similar methods exist to
use Nyström approximation in SVM).
Various derived extensions [Mariette et al., 2017a] (online
Self-Organizing Maps)
Basics on other approaches in an online framework
Online learning: deal with samples one by one and update the model at low cost.
How to use online learning to reduce complexity? Cache some operations in memory + (not always) impose sparsity on representers (centers of gravity and so on...)
[Rossi et al., 2007] for kernel k-means - [Mariette et al., 2017b] for kernel SOM
Basics on (standard) stochastic SOM [Kohonen, 2001]
- $(x_i)_{i=1,\ldots,n} \subset \mathbb{R}^p$ are affected to a unit $f(x_i) \in \{1, \ldots, U\}$
- the grid is equipped with a "distance" between units, $d(u, u')$, and observations affected to close units are close in $\mathbb{R}^p$
- every unit u corresponds to a prototype $p_u \in \mathbb{R}^p$
Iterative learning (assignment step): $x_i$ is picked at random within $(x_k)_k$ and affected to the best matching unit:
$f^t(x_i) = \arg\min_u \|x_i - p_u^t\|^2$
Iterative learning (representation step): all prototypes in neighboring units are updated with a gradient descent like step:
$p_u^{t+1} \longleftarrow p_u^t + \mu(t)\, H^t(d(f^t(x_i), u))\,(x_i - p_u^t)$
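One stochastic SOM iteration (assignment then representation) can be sketched as follows; the function name, the Gaussian neighborhood choice for H, and the fixed learning rate are ours, for illustration only:

```python
import math

def som_step(prototypes, grid_pos, x, mu=0.3, sigma=1.0):
    """One stochastic SOM iteration: assign x to its best matching unit,
    then pull every prototype toward x with a strength decreasing with the
    grid distance to the winning unit (Gaussian neighborhood H)."""
    dists = [sum((xi - pi) ** 2 for xi, pi in zip(x, p)) for p in prototypes]
    best = min(range(len(prototypes)), key=dists.__getitem__)
    for u, p in enumerate(prototypes):
        g = sum((a - b) ** 2 for a, b in zip(grid_pos[u], grid_pos[best]))
        h = math.exp(-g / (2 * sigma ** 2))
        prototypes[u] = [pi + mu * h * (xi - pi) for pi, xi in zip(p, x)]
    return best

# Two units on a 1D grid, 1D data: x = 0.9 wins unit 1, which moves toward it
protos = [[0.0], [1.0]]
grid = [(0,), (1,)]
best = som_step(protos, grid, x=[0.9])
```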
Extension of SOM to data described by a kernel
[Villa and Rossi, 2007]
Data: (xi)i=1,...,n ∈ X
1: Initialization:
p0
u = n
i=1 β0
ui
φ(xi) (convex combination)
2: for t = 1 → T do
3: pick at random i ∈ {1, . . . , n}
4: Assignment
ft
(xi) = arg min
u=1,...,U
(βt
u) Kβt
u − 2(βt
u) K.i
5: for all u = 1 → U do Representation
6:
βt+1
u = βt
u + µ(t)Ht
(d(ft
(xi), u)) 1i − βt
u
7: end for
8: end for
The general relational variant is implemented in SOMbrero.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
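One iteration of the kernelized algorithm, written only with the coefficients $\beta$ and the kernel matrix $K$, can be sketched as below. This is a naive single-step illustration, not the SOMbrero code; the linear kernel, the grid layout and the Gaussian neighborhood are assumptions for the example.

```python
import numpy as np

def ksom_step(K, beta, i, coords, mu, sigma):
    """One stochastic kernel-SOM iteration expressed only with K and beta."""
    U, n = beta.shape
    # assignment: argmin_u  beta_u' K beta_u - 2 beta_u' K[:, i]
    dist = np.einsum('uj,jk,uk->u', beta, K, beta) - 2 * beta @ K[:, i]
    bmu = np.argmin(dist)
    # representation: beta_u <- beta_u + mu H(d(bmu, u)) (1_i - beta_u)
    d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
    H = np.exp(-d2 / (2 * sigma ** 2))
    one_i = np.zeros(n); one_i[i] = 1.0
    beta = beta + (mu * H)[:, None] * (one_i[None, :] - beta)
    return beta, bmu

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = X @ X.T                                  # linear kernel, for the sketch only
coords = np.array([(i, j) for i in range(2) for j in range(2)], float)
beta = rng.dirichlet(np.ones(20), size=4)    # rows sum to 1 (convex combination)
beta, bmu = ksom_step(K, beta, 5, coords, mu=0.3, sigma=1.0)
print(np.allclose(beta.sum(axis=1), 1.0))    # the update preserves the convex combination
```

Since $\beta^{t+1}_u = (1 - \mu H)\beta^t_u + \mu H \mathbb{1}_i$ with $\mu H \in [0, 1]$, every row of $\beta$ stays a convex combination.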
Example: SOM for typology of Astraptes fulgerator from
DNA barcoding
Edit distances between DNA sequences [Olteanu and Villa-Vialaneix, 2015]
Almost perfect clustering (identifying a possible label error on one sample), with additional information on the relations between species.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 41/48
Problems with KSOM
Data: $(x_i)_{i=1,\dots,n} \in \mathcal{X}$
1: Initialization: $p^0_u = \sum_{i=1}^n \beta^0_{ui} \varphi(x_i)$ (convex combination)
2: for $t = 1 \to \gamma n$ do
3: pick randomly $i \in \{1, \dots, n\}$
4: Assignment: $f^t(x_i) = \arg\min_{u=1,\dots,U} \sum_{j,j'=1}^n \beta^t_{uj} \beta^t_{uj'} K_{jj'} - 2 \sum_{j=1}^n \beta^t_{uj} K_{ji}$ → $O(n^2 U)$
5: for all $u = 1 \to U$ do (Representation)
6: $\beta^{t+1}_u = \beta^t_u + \mu(t)\, H^t(d(f^t(x_i), u))\, (\mathbb{1}_i - \beta^t_u)$ → $O(nU)$
7: end for
8: end for
→ algorithm complexity: $O(\gamma n^3 U)$ (compared to $O(\gamma U p n)$ for numeric data)
Nathalie Vialaneix | Kernel methods for data integration in systems biology 42/48
Reducing the stochastic K-SOM complexity
[Mariette et al., 2017a]
Data: $(x_i)_{i=1,\dots,n} \in \mathcal{X}$. Writing $\lambda_u(t) = \mu(t)\, H^t(d(f^t(x_i), u))$, the representation step becomes the convex update $\beta^{t+1}_u = (1 - \lambda_u(t)) \beta^t_u + \lambda_u(t) \mathbb{1}_i$, and the quantities $A^t_u = (\beta^t_u)^\top K \beta^t_u$ and $B^t_{ui} = \sum_j \beta^t_{uj} K_{ji}$ can be maintained incrementally:
1: Initialization: $p^0_u = \sum_{i=1}^n \beta^0_{ui} \varphi(x_i)$ (convex combination)
2: $A^0_u = \sum_{j,j'=1}^n \beta^0_{uj} \beta^0_{uj'} K_{jj'}$ → $O(n^2 U)$
3: $B^0_{ui} = \sum_{j=1}^n \beta^0_{uj} K_{ji}$ → $O(nU)$
4: for $t = 1 \to \gamma n$ do
5: pick at random $i \in \{1, \dots, n\}$
6: Assignment: $f^t(x_i) = \arg\min_{u=1,\dots,U} A^t_u - 2 B^t_{ui}$ → does not depend on $n$
7: for all $u = 1 \to U$ do (Representation)
8: $\beta^{t+1}_u = (1 - \lambda_u(t)) \beta^t_u + \lambda_u(t) \mathbb{1}_i$
$B^{t+1}_{ui'} = \sum_{j=1}^n \beta^{t+1}_{uj} K_{ji'} = (1 - \lambda_u(t)) B^t_{ui'} + \lambda_u(t) K_{ii'}$ → $O(nU)$
$A^{t+1}_u = \sum_{j,j'=1}^n \beta^{t+1}_{uj} \beta^{t+1}_{uj'} K_{jj'} = (1 - \lambda_u(t))^2 A^t_u + \lambda_u(t)^2 K_{ii} + 2 \lambda_u(t)(1 - \lambda_u(t)) B^t_{ui}$ → $O(U)$
9: end for
10: end for
Final complexity: $O(\gamma n^2 U)$, with additional storage memory of $O(U)$ and $O(Un)$.
Back to choice - Jump to conclusion
Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
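The incremental bookkeeping of $A$ and $B$ can be checked against the direct computation in a few lines (a toy verification with a linear kernel, not the SOMbrero implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
n, U = 15, 4
X = rng.normal(size=(n, 3))
K = X @ X.T                                   # any symmetric PSD kernel matrix works
beta = rng.dirichlet(np.ones(n), size=U)      # each beta[u] sums to 1

A = np.einsum('uj,jk,uk->u', beta, K, beta)   # A_u = beta_u' K beta_u
B = beta @ K                                  # B_{ui} = sum_j beta_{uj} K_{ji}

i, lam = 7, rng.uniform(0.0, 1.0, size=U)     # picked sample and lambda_u(t)
one_i = np.zeros(n); one_i[i] = 1.0
beta_new = (1 - lam)[:, None] * beta + lam[:, None] * one_i

# O(U) and O(nU) incremental updates from the slide
A_new = (1 - lam) ** 2 * A + lam ** 2 * K[i, i] + 2 * lam * (1 - lam) * B[:, i]
B_new = (1 - lam)[:, None] * B + lam[:, None] * K[i, :]

# they coincide with the direct O(n^2 U) recomputation
print(np.allclose(A_new, np.einsum('uj,jk,uk->u', beta_new, K, beta_new)))  # True
print(np.allclose(B_new, beta_new @ K))                                     # True
```

Expanding $(\beta^{t+1}_u)^\top K \beta^{t+1}_u$ with $\beta^{t+1}_u = (1-\lambda)\beta^t_u + \lambda \mathbb{1}_i$ gives exactly the three terms of the $A$ update, which is why the recursion is exact and not an approximation.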
Conclusions
Kernel methods are useful for:
dealing with different types of data
even when they are high-dimensional
combining them
However, they can be:
computationally intensive to train
not easy to interpret (work in progress with Jérôme Mariette and Céline Brouard on variable selection in an unsupervised setting)
Nathalie Vialaneix | Kernel methods for data integration in systems biology 44/48
SOMbrero
Madalina Olteanu,
Fabrice Rossi, Marie Cottrell,
Laura Bendhaïba and
Julien Boelaert
SOMbrero and mixKernel
Jérôme Mariette
adjclust
Pierre Neuvial, Nathanaël Randriamihamison
Guillem Rigaill, Christophe Ambroise and
Shubham Chaturvedi
Nathalie Vialaneix | Kernel methods for data integration in systems biology 45/48
Credits for pictures
Slide 3: image based on ENCODE project, by Darryl Leja (NHGRI), Ian Dunham
(EBI) and Michael Pazin (NHGRI)
Slide 8: k-means image from Wikimedia Commons by Weston.pace
Slide 10: Astraptes picture is from
https://www.flickr.com/photos/39139121@N00/2045403823/ by Anne Toal
(CC BY-SA 2.0), Hi-C experiment is taken from the article Matharu et al., 2015
DOI:10.1371/journal.pgen.1005640 (CC BY-SA 4.0) and metagenomics illustration is
taken from the article Sommer et al., 2010 DOI:10.1038/msb.2010.16 (CC BY-NC-SA
3.0)
Other pictures are from articles that I co-authored.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 46/48
References
Ambroise, C., Dehman, A., Neuvial, P., Rigaill, G., and Vialaneix, N. (2019).
Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics.
arXiv preprint arXiv:1902.01596.
Aronszajn, N. (1950).
Theory of reproducing kernels.
Transactions of the American Mathematical Society, 68(3):337–404.
Bach, F. (2013).
Sharp analysis of low-rank kernel matrix approximations.
Journal of Machine Learning Research, Workshop and Conference Proceedings, 30:185–209.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011).
Distributed optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122.
Brouard, C., Shen, H., Dührkop, K., d’Alché-Buc, F., Böcker, S., and Rousu, J. (2016).
Fast metabolite identification with input output kernel regression.
Bioinformatics, 32(12):i28–i36.
Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J.,
Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S.,
Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker,
P., Karsenti, E., and Sullivan, M. (2015).
Patterns and ecological drivers of ocean viral communities.
Science, 348(6237).
Cortes, C., Mohri, M., and Talwalkar, A. (2010).
On the impact of kernel approximation on learning accuracy.
Journal of Machine Learning Research, Workshop and Conference Proceedings, 9:113–120.
Crone, L. and Crosby, D. (1995).
Statistical applications of a metric on subspaces to satellite meteorology.
Technometrics, 37(3):324–328.
de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I.,
Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O.,
Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F.,
Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C.,
Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S.,
Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015).
Eukaryotic plankton diversity in the sunlit ocean.
Science, 348(6237).
Drineas, P. and Mahoney, M. (2005).
On the Nyström method for approximating a Gram matrix for improved kernel-based learning.
Journal of Machine Learning Research, 6:2153–2175.
Goldfarb, L. (1984).
A unified approach to pattern recognition.
Pattern Recognition, 17(5):575–582.
Gönen, M. and Alpaydin, E. (2011).
Multiple kernel learning algorithms.
Journal of Machine Learning Research, 12:2211–2268.
Jaakkola, T., Diekhans, M., and Haussler, D. (2000).
A discriminative framework for detecting remote protein homologies.
Journal of Computational Biology, 7(1-2):95–114.
Kohonen, T. (2001).
Self-Organizing Maps, 3rd Edition, volume 30.
Springer, Berlin, Heidelberg, New York.
Kondor, R. and Lafferty, J. (2002).
Diffusion kernels on graphs and other discrete structures.
In Sammut, C. and Hoffmann, A., editors, Proceedings of the 19th International Conference on Machine Learning, pages
315–322, Sydney, Australia. Morgan Kaufmann Publishers Inc. San Francisco, CA, USA.
Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994).
The ACT (STATIS method).
Computational Statistics and Data Analysis, 18(1):97–119.
L’Hermier des Plantes, H. (1976).
Structuration des tableaux à trois indices de la statistique.
PhD thesis, Université de Montpellier.
Thèse de troisième cycle.
Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F.,
Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d’Oviedo, F.,
de Meester, L., Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P.,
Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G.,
Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M.,
Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015).
Determinants of community structure in the global plankton interactome.
Science, 348(6237).
Lin, Y.-Y., Liu, T.-L., and Fuh, C.-S. (2010).
Multiple kernel learning for dimensionality reduction.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160.
Mariette, J., Olteanu, M., and Villa-Vialaneix, N. (2017a).
Efficient interpretable variants of online SOM for large dissimilarity data.
Neurocomputing, 225:31–48.
Mariette, J., Rossi, F., Olteanu, M., and Villa-Vialaneix, N. (2017b).
Accelerating stochastic kernel SOM.
In Verleysen, M., editor, XXVth European Symposium on Artificial Neural Networks, Computational Intelligence and Machine
Learning (ESANN 2017), pages 269–274, Bruges, Belgium. i6doc.
Mariette, J. and Vialaneix, N. (2019).
Approches à noyau pour l’analyse et l’intégration de données omiques en biologie des systèmes.
Forthcoming (book chapter).
Mariette, J. and Villa-Vialaneix, N. (2018).
Unsupervised multiple kernel learning for heterogeneous data integration.
Bioinformatics, 34(6):1009–1015.
Olteanu, M. and Villa-Vialaneix, N. (2015).
On-line relational and multiple relational SOM.
Neurocomputing, 147:15–30.
Randriamihamison, N., Vialaneix, N., and Neuvial, P. (2019).
Applicability and interpretability of hierarchical agglomerative clustering with or without contiguity constraints.
Submitted for publication. Preprint arXiv 1909.10923.
Robert, P. and Escoufier, Y. (1976).
A unifying tool for linear multivariate statistical methods: the rv-coefficient.
Applied Statistics, 25(3):257–265.
Rossi, F., Hasenfuss, A., and Hammer, B. (2007).
Accelerating relational clustering algorithms with sparse prototype representation.
In Proceedings of the 6th Workshop on Self-Organizing Maps (WSOM 07), Bielefeld, Germany. Neuroinformatics Group, Bielefeld University.
Saigo, H., Vert, J.-P., Ueda, N., and Akutsu, T. (2004).
Protein homology detection using string alignment kernels.
Bioinformatics, 20(11):1682–1689.
Shen, H., Dührkop, K., Böcker, S., and Rousu, J. (2014).
Metabolite identification through multiple kernel learning on fragmentation trees.
Bioinformatics, 30(12):i157–i164.
Sommer, M., Church, G., and Dantas, G. (2010).
A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion.
Molecular Systems Biology, 6(360).
Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A.,
Cornejo-Castillo, F., Costea, P., Cruaud, C., d’Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka,
F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral,
M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P.,
Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P.,
Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015).
Structure and function of the global ocean microbiome.
Science, 348(6237).
Villa, N. and Rossi, F. (2007).
A comparison between dissimilarity SOM and kernel SOM for clustering the vertices of a graph.
In 6th International Workshop on Self-Organizing Maps (WSOM 2007), Bielefeld, Germany. Neuroinformatics Group, Bielefeld University.
Williams, C. and Seeger, M. (2000).
Using the Nyström method to speed up kernel machines.
In Leen, T., Dietterich, T., and Tresp, V., editors, Advances in Neural Information Processing Systems (Proceedings of NIPS
2000), volume 13, Denver, CO, USA. Neural Information Processing Systems Foundation.
Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011).
Unsupervised multiple kernel clustering.
Journal of Machine Learning Research: Workshop and Conference Proceedings, 20:129–144.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 47/48
Optimization issues
The sparse version writes $\min_\beta \beta^\top S \beta$ s.t. $\beta \geq 0$ and $\|\beta\|_1 = \sum_m \beta_m = 1$ ⇒ a standard QP problem with linear constraints (ex: package quadprog in R).
The non-sparse version writes $\min_\beta \beta^\top S \beta$ s.t. $\beta \geq 0$ and $\|\beta\|_2 = 1$ ⇒ a QPQC problem (hard to solve).
It is solved using the Alternating Direction Method of Multipliers (ADMM, [Boyd et al., 2011]), replacing the previous optimization problem with
$$\min_{x,z}\; x^\top S x + \mathbb{1}_{\{x \geq 0\}}(x) + \mathbb{1}_{\{\|z\|_2^2 \geq 1\}}(z) \quad \text{subject to } x - z = 0,$$
which is handled by iterating:
1. $\min_x\; x^\top S x + y^\top (x - z) + \frac{\lambda}{2} \|x - z\|^2$ under the constraint $x \geq 0$ (standard QP problem)
2. project on the unit ball: $z = \frac{x}{\min\{\|x\|_2, 1\}}$
3. update the auxiliary variable: $y = y + \lambda (x - z)$
Nathalie Vialaneix | Kernel methods for data integration in systems biology 47/48
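The three ADMM steps can be sketched numerically. In this illustration the exact QP solve of step 1 is replaced by a projected-gradient inner loop, which is an assumption for the sketch (a real implementation would call a QP solver such as quadprog):

```python
import numpy as np

def admm_nonsparse(S, lam=1.0, n_iter=200, inner=50, lr=0.01, seed=0):
    """ADMM sketch for min x'Sx s.t. x >= 0 and ||x||_2 = 1 (QPQC problem)."""
    m = S.shape[0]
    rng = np.random.default_rng(seed)
    x = np.abs(rng.normal(size=m)); x /= np.linalg.norm(x)
    z, y = x.copy(), np.zeros(m)
    for _ in range(n_iter):
        # step 1: min_x x'Sx + y'(x - z) + lam/2 ||x - z||^2  s.t. x >= 0
        # (projected gradient instead of an exact QP solve, for illustration)
        for _ in range(inner):
            grad = 2 * S @ x + y + lam * (x - z)
            x = np.maximum(x - lr * grad, 0.0)
        # step 2: project on the unit ball, z = x / min{||x||_2, 1}
        nx = np.linalg.norm(x)
        if nx > 0:
            z = x / min(nx, 1.0)
        # step 3: update the auxiliary variable
        y = y + lam * (x - z)
    return x

S = np.array([[2.0, 0.5, 0.1], [0.5, 1.0, 0.2], [0.1, 0.2, 3.0]])
beta = admm_nonsparse(S)
print(beta.shape)  # (3,)
```

The splitting isolates the two awkward constraints: the positivity constraint stays in the x-subproblem, while the (non-convex) norm constraint reduces to a closed-form projection on z.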
A proposal to improve interpretability of K-PCA in our framework
Issue: How to assess the importance of a given species in the K-PCA?
our datasets are either numeric (environmental) or built from an $n \times p$ count matrix
⇒ for a given species, randomly permute its counts and re-do the analysis (kernel computation - with the same optimized weights - and K-PCA)
the influence of a given species in a given dataset on a given PC subspace is assessed by computing the Crone-Crosby distance between the two PCA subspaces [Crone and Crosby, 1995] (∼ Frobenius norm between the projectors)
Nathalie Vialaneix | Kernel methods for data integration in systems biology 48/48
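A sketch of this permutation strategy on a single toy dataset (plain kernel PCA with a linear kernel instead of the weighted multiple-kernel K-PCA of mixKernel; both are simplifying assumptions for the illustration):

```python
import numpy as np

def pc_subspace(K, d=2):
    """Orthonormal basis of the first d principal axes of a centered kernel matrix."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                  # double centering
    w, V = np.linalg.eigh(Kc)
    return V[:, np.argsort(w)[::-1][:d]]            # top-d eigenvectors

def crone_crosby(U1, U2):
    """Distance between two subspaces: Frobenius norm of the projector difference."""
    P1, P2 = U1 @ U1.T, U2 @ U2.T
    return np.linalg.norm(P1 - P2) / np.sqrt(2)

rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(30, 8)).astype(float)   # toy n x p count matrix
U_ref = pc_subspace(counts @ counts.T)                  # reference K-PCA subspace
importance = []
for j in range(counts.shape[1]):                        # permute one species at a time
    perm = counts.copy()
    perm[:, j] = rng.permutation(perm[:, j])
    importance.append(crone_crosby(U_ref, pc_subspace(perm @ perm.T)))
print(len(importance))  # one score per species
```

Working with projectors rather than the eigenvectors themselves makes the score invariant to sign flips and rotations within the PC subspace, which is why a subspace distance is needed here.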
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
 
Grouping techniques for facing Volume and Velocity in the Big Data
Grouping techniques for facing Volume and Velocity in the Big DataGrouping techniques for facing Volume and Velocity in the Big Data
Grouping techniques for facing Volume and Velocity in the Big Data
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation source
 
Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
 
About functional SIR
About functional SIRAbout functional SIR
About functional SIR
 
Analysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data MiningAnalysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data Mining
 

Plus de tuxette

Racines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathsRacines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathstuxette
 
Méthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènesMéthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènestuxette
 
Projets autour de l'Hi-C
Projets autour de l'Hi-CProjets autour de l'Hi-C
Projets autour de l'Hi-Ctuxette
 
Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?tuxette
 
ASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquesASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquestuxette
 
Autour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeanAutour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeantuxette
 
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...tuxette
 
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquesApprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquestuxette
 
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...tuxette
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...tuxette
 
Journal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation dataJournal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation datatuxette
 
Overfitting or overparametrization?
Overfitting or overparametrization?Overfitting or overparametrization?
Overfitting or overparametrization?tuxette
 
Selective inference and single-cell differential analysis
Selective inference and single-cell differential analysisSelective inference and single-cell differential analysis
Selective inference and single-cell differential analysistuxette
 
SOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricesSOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricestuxette
 
Graph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype PredictionGraph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype Predictiontuxette
 
A short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelsA short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelstuxette
 
Explanable models for time series with random forest
Explanable models for time series with random forestExplanable models for time series with random forest
Explanable models for time series with random foresttuxette
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICStuxette
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICStuxette
 
A review on structure learning in GNN
A review on structure learning in GNNA review on structure learning in GNN
A review on structure learning in GNNtuxette
 

Plus de tuxette (20)

Racines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathsRacines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en maths
 
Méthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènesMéthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènes
 
Projets autour de l'Hi-C
Projets autour de l'Hi-CProjets autour de l'Hi-C
Projets autour de l'Hi-C
 
Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?
 
ASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquesASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiques
 
Autour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeanAutour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWean
 
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
 
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquesApprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
 
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
 
Journal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation dataJournal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation data
 
Overfitting or overparametrization?
Overfitting or overparametrization?Overfitting or overparametrization?
Overfitting or overparametrization?
 
Selective inference and single-cell differential analysis
Selective inference and single-cell differential analysisSelective inference and single-cell differential analysis
Selective inference and single-cell differential analysis
 
SOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricesSOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatrices
 
Graph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype PredictionGraph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype Prediction
 
A short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelsA short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction models
 
Explanable models for time series with random forest
Explanable models for time series with random forestExplanable models for time series with random forest
Explanable models for time series with random forest
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICS
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICS
 
A review on structure learning in GNN
A review on structure learning in GNNA review on structure learning in GNN
A review on structure learning in GNN
 

Dernier

Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 

Dernier (20)

Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 

Kernel methods for data integration in systems biology

  • 1. Kernel methods for data integration in systems biology Nathalie Vialaneix nathalie.vialaneix@inra.fr http://www.nathalievialaneix.eu KIM Seminar October 18th, 2019 - Montpellier Nathalie Vialaneix | Kernel methods for data integration in systems biology 1/48
  • 2. A primer on kernel methods for biology Nathalie Vialaneix | Kernel methods for data integration in systems biology 2/48
  • 3-4. Before we start: context and motivations. Data characteristics: a few (paired) samples; information at various levels... but of heterogeneous types and, when numeric, with a large dimension. What we want to achieve: integrative analysis to predict a phenotype, to understand the typology of the samples, ...
  • 5-8. In short: what are kernels? Data we are used to: n samples on which p variables are measured, $(x_i)_{i=1,\dots,n}$ with $x_i \in \mathbb{R}^p$. From that, we can compute: centers of gravity, $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$; distances and dot products, $d(x_i, x_{i'}) = \sqrt{\sum_{j=1}^p (x_{ij} - x_{i'j})^2}$ and $\langle x_i, x_{i'}\rangle = \sum_{j=1}^p x_{ij} x_{i'j}$. Kernels: the characteristics of the n samples $(x_i)_i$ are summarized by pairwise similarities. More formally: an $n \times n$ matrix $K$ such that $K$ is symmetric and positive definite. Theorem [Aronszajn, 1950]: there exist a unique Hilbert space $H$ and a map $\phi : X \to H$ such that $K_{ii'} = \langle \phi(x_i), \phi(x_{i'})\rangle_H$.
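The defining property above (a symmetric, positive definite matrix of pairwise similarities) can be checked numerically. A minimal sketch in NumPy with the linear kernel $K_{ii'} = \langle x_i, x_{i'}\rangle$ (the simplest case, where $\phi$ is the identity); the toy data and sizes are illustrative only, not from the talk:

```python
import numpy as np

# Toy data: n = 6 samples described by p = 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Linear kernel: K[i, i'] = <x_i, x_i'>.
K = X @ X.T

# A valid kernel matrix is symmetric and positive (semi-)definite:
# all eigenvalues are non-negative (up to numerical precision).
assert np.allclose(K, K.T)
eigenvalues = np.linalg.eigvalsh(K)
assert np.all(eigenvalues >= -1e-10)
```

With n > p the linear kernel matrix is only positive semi-definite (rank at most p), which is why the eigenvalue check allows a small numerical tolerance around zero.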
  • 9-10. Why are kernels interesting? 1. because they can reduce high-dimensional data to small similarity matrices; 2. because they are not restricted to data in $\mathbb{R}^p$ (kernels on graphs, between graphs, on text, ...); 3. because they can embed expert knowledge (e.g., the phylogeny between taxa); 4. because they offer a rigorous framework to extend many statistical methods; 5. because they offer a clean and common framework for data integration. But: 1. the choice of the relevant kernel is still up to you... 2. they can strongly increase computational time when n is large...
  • 11-13. Kernel examples: 1. for observations in $\mathbb{R}^p$: the Gaussian kernel $K_{ii'} = e^{-\gamma \|x_i - x_{i'}\|^2}$; 2. for nodes of a graph: [Kondor and Lafferty, 2002]; 3. sequence kernels (used to compute similarities between proteins, for instance): spectrum kernel [Jaakkola et al., 2000] (with HMM), convolution kernel [Saigo et al., 2004]; 4. kernels between graphs (or "structured data"; used in metabolomics to compute similarities between metabolites based on their fragmentation trees): [Shen et al., 2014, Brouard et al., 2016]. More examples: [Mariette and Vialaneix, 2019]
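The Gaussian kernel from the first example can be computed for a whole sample at once from norms and dot products. A short sketch (the function name and toy data are ours, not from the talk):

```python
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    """K[i, i'] = exp(-gamma * ||x_i - x_i'||^2) for the rows of X."""
    sq_norms = (X ** 2).sum(axis=1)
    # ||x_i - x_i'||^2 = ||x_i||^2 + ||x_i'||^2 - 2 <x_i, x_i'>
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    # Clip tiny negative values caused by floating-point error.
    return np.exp(-gamma * np.clip(sq_dists, 0, None))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
K = gaussian_kernel(X, gamma=0.5)
# The diagonal is exp(0) = 1 and the matrix is symmetric.
assert np.allclose(np.diag(K), 1.0)
assert np.allclose(K, K.T)
```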
  • 14. Principles for learning from kernels: start from any statistical method (PCA, regression, k-means clustering) and rewrite all quantities using: (i) $K$ to compute distances and dot products (the dot product is $K_{ii'}$ and the distance is $\sqrt{K_{ii} + K_{i'i'} - 2K_{ii'}}$); (ii) (implicit) linear or convex combinations of $(\phi(x_i))_i$ to describe all unobserved elements (centers of gravity and so on...).
  • 15-16. A simple example: k-means. The standard algorithm:
1: Initialization: random initialization of P centers $\bar{x}_{C_j} \in \mathbb{R}^p$;
2: for t = 1 to T do
3: Affectation step: $\forall\, i = 1, \dots, n$, $f^{t+1}(x_i) = \arg\min_{j=1,\dots,P} d(x_i, \bar{x}_{C_j^t})$;
4: Representation step: $\forall\, j = 1, \dots, P$, $\bar{x}_{C_j^t} = \frac{1}{|C_j^t|} \sum_{x_l \in C_j^t} x_l$;
5: end for (convergence); 6: return partition.
  • 17-20. The kernelized version:
1: Initialization: random initialization of a partition of $(x_i)_i$ and $\bar{x}_{C_j^1} = \frac{1}{|C_j^1|} \sum_{x_i \in C_j^1} \phi(x_i)$;
2: for t = 1 to T do
3: Affectation step: $f^{t+1}(x_i) = \arg\min_{j=1,\dots,P} \|\phi(x_i) - \bar{x}_{C_j^t}\|_H^2 = \arg\min_{j} \left( K_{ii} - \frac{2}{|C_j^t|} \sum_{x_l \in C_j^t} K_{il} + \frac{1}{|C_j^t|^2} \sum_{x_l, x_{l'} \in C_j^t} K_{ll'} \right)$;
4: Representation step: $\forall\, j = 1, \dots, P$, $\bar{x}_{C_j^t} = \frac{1}{|C_j^t|} \sum_{x_l \in C_j^t} \phi(x_l)$;
5: end for (convergence); 6: return partition.
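The kernelized affectation step above can be turned into working code. The following is an illustrative sketch (ours, not the speaker's implementation) of kernel k-means that only ever touches the kernel matrix $K$, never the mapped points $\phi(x_i)$:

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=20, seed=0):
    """Kernel k-means on a precomputed (n x n) kernel matrix K.

    Squared distance to the implicit center of cluster C_j uses only K:
    ||phi(x_i) - c_j||^2 = K_ii - 2/|C_j| * sum_{l in C_j} K_il
                           + 1/|C_j|^2 * sum_{l, l' in C_j} K_ll'
    """
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=n)  # random initial partition
    for _ in range(n_iter):
        dist = np.zeros((n, n_clusters))
        for j in range(n_clusters):
            members = labels == j
            size = members.sum()
            if size == 0:  # empty cluster: never the argmin
                dist[:, j] = np.inf
                continue
            dist[:, j] = (np.diag(K)
                          - 2 * K[:, members].sum(axis=1) / size
                          + K[np.ix_(members, members)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # convergence
            break
        labels = new_labels
    return labels
```

With a linear kernel $K = X X^T$ this reduces to standard k-means; with a Gaussian kernel it clusters in the implicit feature space, which is the whole point of the kernel trick.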
  • 21-23. Beyond kernels: relational data. DNA barcoding (Astraptes fulgerator): optimal matching (edit) distances to differentiate species. Hi-C data: a pairwise measure (similarity) related to the physical 3D distance between loci in the cell, at genome scale [Ambroise et al., 2019, Randriamihamison et al., 2019]. Metagenomics: dissimilarity between samples is better captured when the phylogeny between species is taken into account (UniFrac distances).
  • 24-27. Formally, relational data are: Euclidean distances or (non-Euclidean) dissimilarities between n entities: a symmetric $(n \times n)$ matrix $D$ with positive entries and null diagonal; kernels: a symmetric and positive definite $(n \times n)$ matrix $K$ that measures a "relation" between n entities in $X$ (an arbitrary space), $K(x, x') = \langle \phi(x), \phi(x')\rangle$; networks/graphs: groups of n entities (nodes/vertices) linked by a (potentially weighted) relation (edges), i.e., a symmetric $(n \times n)$ matrix $W$ with positive entries and null diagonal; similarities between n entities: a symmetric $(n \times n)$ matrix $S$ (usually with positive entries) but not necessarily positive definite.
  • 28. Different relational data types are related to each other a kernel is equivalent to a Euclidean distance: D²(x, x') := K(x, x) + K(x', x') − 2K(x, x') from a dissimilarity, similarities can be computed: S(x, x) := a(x) (arbitrary), S(x, x') = (1/2) (a(x) + a(x') − D²(x, x')) various kernels have been proposed for graphs (e.g., based on the graph Laplacian): [Kondor and Lafferty, 2002] Nathalie Vialaneix | Kernel methods for data integration in systems biology 12/48
  • 29. Different relational data types are related to each other a kernel is equivalent to a Euclidean distance: D²(x, x') := K(x, x) + K(x', x') − 2K(x, x') from a dissimilarity, similarities can be computed: S(x, x) := a(x) (arbitrary), S(x, x') = (1/2) (a(x) + a(x') − D²(x, x')) various kernels have been proposed for graphs (e.g., based on the graph Laplacian): [Kondor and Lafferty, 2002] in summary useful simplification: “is the framework Euclidean or not?” (e.g., kernel vs non Euclidean dissimilarity) Nathalie Vialaneix | Kernel methods for data integration in systems biology 12/48
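The kernel-to-distance identity on this slide is easy to check numerically. A minimal numpy sketch (assuming a valid positive definite kernel matrix; the linear kernel on toy points is a hypothetical example chosen so that the induced distance is the ordinary Euclidean one):

```python
import numpy as np

def kernel_to_distance(K):
    """Distance induced by a kernel matrix K:
    D[i, j]^2 = K[i, i] + K[j, j] - 2 K[i, j]."""
    d = np.diag(K)
    D2 = d[:, None] + d[None, :] - 2 * K
    return np.sqrt(np.maximum(D2, 0))  # clip tiny negatives from rounding

# A linear kernel on toy points reproduces plain Euclidean distances
X = np.array([[0.0, 0.0], [3.0, 4.0]])
K = X @ X.T
D = kernel_to_distance(K)
# D[0, 1] is 5.0 (the 3-4-5 triangle)
```

With a linear kernel the identity is exact, which makes it a convenient sanity check before applying the same conversion to non-trivial kernels.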
  • 30. Principles for learning from relational data Euclidean case (kernel K) rewrite all quantities using: K to compute distances and dot products linear or convex combinations of (φ(xi))i to describe all unobserved elements (centers of gravity and so on...) Works for: PCA, k-means, linear regression, ... Nathalie Vialaneix | Kernel methods for data integration in systems biology 13/48
  • 31. Principles for learning from relational data Euclidean case (kernel K) rewrite all quantities using: K to compute distances and dot products linear or convex combinations of (φ(xi))i to describe all unobserved elements (centers of gravity and so on...) Works for: PCA, k-means, linear regression, ... non Euclidean case (non Euclidean dissimilarity D): do almost the same using a pseudo-Euclidean framework [Goldfarb, 1984]: ∃ two Euclidean spaces E+ and E− and two mappings φ+ and φ− st: D(x, x') = ‖φ+(x) − φ+(x')‖²E+ − ‖φ−(x) − φ−(x')‖²E− Nathalie Vialaneix | Kernel methods for data integration in systems biology 13/48
  • 32. And now? 1 integrate multiple data sources with kernels (with application to metagenomic datasets) extension 1 2 reduce complexity of kernel methods extension 2 Nathalie Vialaneix | Kernel methods for data integration in systems biology 14/48
  • 33. Combining relational data in an unsupervised setting Nathalie Vialaneix | Kernel methods for data integration in systems biology 15/48
  • 34. What are metagenomic data? Source: [Sommer et al., 2010] Nathalie Vialaneix | Kernel methods for data integration in systems biology 16/48
  • 35. What are metagenomic data? Source: [Sommer et al., 2010] abundance data sparse n × p-matrices with count data of samples in rows and descriptors (species, OTUs, KEGG groups, k-mer, ...) in columns. Generally p ≫ n. Nathalie Vialaneix | Kernel methods for data integration in systems biology 16/48
  • 36. What are metagenomic data? Source: [Sommer et al., 2010] abundance data sparse n × p-matrices with count data of samples in rows and descriptors (species, OTUs, KEGG groups, k-mer, ...) in columns. Generally p ≫ n. phylogenetic tree (evolution history between species, OTUs...). One tree with p leaves built from the sequences collected in the n samples. Nathalie Vialaneix | Kernel methods for data integration in systems biology 16/48
  • 37. What are metagenomic data used for? produce a profile of the diversity of a given sample ⇒ allows comparing diversity between various conditions used in various fields: environmental science, microbiota, ... Nathalie Vialaneix | Kernel methods for data integration in systems biology 17/48
  • 38. What are metagenomic data used for? produce a profile of the diversity of a given sample ⇒ allows comparing diversity between various conditions used in various fields: environmental science, microbiota, ... Processed by computing a relevant dissimilarity between samples (the standard Euclidean distance is not relevant) and by using this dissimilarity in subsequent analyses. Nathalie Vialaneix | Kernel methods for data integration in systems biology 17/48
  • 39. β-diversity data: dissimilarities between count data Compositional dissimilarities: (nig) count of species g for sample i Jaccard: the fraction of species specific to either sample i or j: djac = Σg (I{nig>0, njg=0} + I{njg>0, nig=0}) / Σg I{nig+njg>0} Bray-Curtis: the fraction of the sample which is specific to either sample i or j: dBC = Σg |nig − njg| / Σg (nig + njg) Other dissimilarities are available in the R package phyloseq, most of them not Euclidean. Nathalie Vialaneix | Kernel methods for data integration in systems biology 18/48
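Both compositional dissimilarities can be sketched in a few lines of numpy (the count table below is a hypothetical toy example; the phyloseq package provides these and many other dissimilarities):

```python
import numpy as np

def jaccard(ni, nj):
    """Fraction of the species present in either sample that are
    specific to only one of the two samples."""
    present = (ni + nj) > 0
    exclusive = ((ni > 0) & (nj == 0)) | ((nj > 0) & (ni == 0))
    return exclusive.sum() / present.sum()

def bray_curtis(ni, nj):
    """Fraction of the counts that are specific to either sample."""
    return np.abs(ni - nj).sum() / (ni + nj).sum()

# toy counts for 4 species in two samples
ni = np.array([5, 0, 3, 2])
nj = np.array([5, 4, 0, 2])
d_jac = jaccard(ni, nj)     # species 2 and 3 are exclusive: 2/4 = 0.5
d_bc = bray_curtis(ni, nj)  # |differences| = 7, total counts = 21: 1/3
```

Note that Jaccard only uses presence/absence while Bray-Curtis weights species by their abundances, which is why the two can rank the same pair of samples differently.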
  • 40. β-diversity data: phylogenetic dissimilarities Phylogenetic dissimilarities Nathalie Vialaneix | Kernel methods for data integration in systems biology 19/48
  • 41. β-diversity data: phylogenetic dissimilarities Phylogenetic dissimilarities For each branch e, note le its length and pei the fraction of counts in sample i corresponding to species below branch e. Nathalie Vialaneix | Kernel methods for data integration in systems biology 19/48
  • 42. β-diversity data: phylogenetic dissimilarities Phylogenetic dissimilarities For each branch e, note le its length and pei the fraction of counts in sample i corresponding to species below branch e. Unifrac: the fraction of the tree specific to either sample i or sample j. dUF = e le(I{pei>0,pej=0} + I{pej>0,pei=0}) e leI{pei+pej>0} Nathalie Vialaneix | Kernel methods for data integration in systems biology 19/48
  • 43. β-diversity data: phylogenetic dissimilarities Phylogenetic dissimilarities For each branch e, note le its length and pei the fraction of counts in sample i corresponding to species below branch e. Unifrac: the fraction of the tree specific to either sample i or sample j: dUF = Σe le (I{pei>0, pej=0} + I{pej>0, pei=0}) / Σe le I{pei+pej>0} Weighted Unifrac: the fraction of the diversity specific to sample i or to sample j: dwUF = Σe le |pei − pej| / Σe le (pei + pej) Nathalie Vialaneix | Kernel methods for data integration in systems biology 19/48
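Both phylogenetic dissimilarities only require, for each branch, its length and the per-sample fractions of counts below it. A small sketch on a hypothetical three-branch tree (the function names and the toy values are illustrative, not from the slides):

```python
import numpy as np

def unifrac(le, pi, pj):
    """Unweighted Unifrac: share of branch length specific to one sample."""
    exclusive = ((pi > 0) & (pj == 0)) | ((pj > 0) & (pi == 0))
    covered = (pi + pj) > 0
    return (le * exclusive).sum() / (le * covered).sum()

def weighted_unifrac(le, pi, pj):
    """Weighted Unifrac: branch lengths weighted by abundance differences."""
    return (le * np.abs(pi - pj)).sum() / (le * (pi + pj)).sum()

le = np.array([1.0, 2.0, 1.0])  # branch lengths
pi = np.array([0.5, 0.5, 0.0])  # fractions of sample i below each branch
pj = np.array([0.5, 0.0, 0.5])  # fractions of sample j below each branch
d_uf = unifrac(le, pi, pj)            # (2 + 1) / (1 + 2 + 1) = 0.75
d_wuf = weighted_unifrac(le, pi, pj)  # (0 + 1 + 0.5) / (1 + 1 + 0.5) = 0.6
```

As with Jaccard versus Bray-Curtis, the unweighted variant only sees presence/absence below each branch, while the weighted one accounts for abundances.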
  • 44. TARA Oceans datasets The 2009-2013 expedition Co-directed by Étienne Bourgois and Éric Karsenti. 7,012 datasets collected from 35,000 samples of plankton and water (11,535 Gb of data). Study the plankton: bacteria, protists, metazoans and viruses representing more than 90% of the biomass in the ocean. Nathalie Vialaneix | Kernel methods for data integration in systems biology 20/48
  • 45. TARA Oceans datasets Science (May 2015) - Studies on: eukaryotic plankton diversity [de Vargas et al., 2015], ocean viral communities [Brum et al., 2015], global plankton interactome [Lima-Mendez et al., 2015], global ocean microbiome [Sunagawa et al., 2015], . . . → datasets of different types and from different sources, analyzed separately. Nathalie Vialaneix | Kernel methods for data integration in systems biology 21/48
  • 46. TARA Oceans datasets that we used [Sunagawa et al., 2015] Datasets used environmental dataset: 22 numeric features (temperature, salinity, . . . ). Nathalie Vialaneix | Kernel methods for data integration in systems biology 22/48
  • 47. TARA Oceans datasets that we used [Sunagawa et al., 2015] Datasets used environmental dataset: 22 numeric features (temperature, salinity, . . . ). bacteria phylogenomic tree: computed from ∼ 35,000 OTUs. Nathalie Vialaneix | Kernel methods for data integration in systems biology 22/48
  • 48. TARA Oceans datasets that we used [Sunagawa et al., 2015] Datasets used environmental dataset: 22 numeric features (temperature, salinity, . . . ). bacteria phylogenomic tree: computed from ∼ 35,000 OTUs. bacteria functional composition: ∼ 63,000 KEGG orthologous groups. Nathalie Vialaneix | Kernel methods for data integration in systems biology 22/48
  • 49. TARA Oceans datasets that we used [de Vargas et al., 2015] Datasets used environmental dataset: 22 numeric features (temperature, salinity, . . . ). bacteria phylogenomic tree: computed from ∼ 35,000 OTUs. bacteria functional composition: ∼ 63,000 KEGG orthologous groups. eukaryotic plankton composition split into 4 groups: pico (0.8 − 5µm), nano (5 − 20µm), micro (20 − 180µm) and meso (180 − 2000µm). Nathalie Vialaneix | Kernel methods for data integration in systems biology 22/48
  • 50. TARA Oceans datasets that we used [Brum et al., 2015] Datasets used environmental dataset: 22 numeric features (temperature, salinity, . . . ). bacteria phylogenomic tree: computed from ∼ 35,000 OTUs. bacteria functional composition: ∼ 63,000 KEGG orthologous groups. eukaryotic plankton composition split into 4 groups: pico (0.8 − 5µm), nano (5 − 20µm), micro (20 − 180µm) and meso (180 − 2000µm). virus composition: ∼ 867 virus clusters based on shared gene content. Nathalie Vialaneix | Kernel methods for data integration in systems biology 22/48
  • 51. TARA Oceans datasets that we used Common samples 48 samples, 2 depth layers: surface (SRF) and deep chlorophyll maximum (DCM), 31 different sampling stations. Nathalie Vialaneix | Kernel methods for data integration in systems biology 23/48
  • 52. From multiple kernels to an integrated kernel How to combine multiple kernels? naive approach: K* = (1/M) Σm Km Nathalie Vialaneix | Kernel methods for data integration in systems biology 24/48
  • 53. From multiple kernels to an integrated kernel How to combine multiple kernels? naive approach: K* = (1/M) Σm Km supervised framework: K* = Σm βm Km with βm ≥ 0 and Σm βm = 1, with βm chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011] Nathalie Vialaneix | Kernel methods for data integration in systems biology 24/48
  • 54. From multiple kernels to an integrated kernel How to combine multiple kernels? naive approach: K* = (1/M) Σm Km supervised framework: K* = Σm βm Km with βm ≥ 0 and Σm βm = 1, with βm chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011] unsupervised framework but input space is Rp [Zhuang et al., 2011]: K* = Σm βm Km with βm ≥ 0 and Σm βm = 1, with βm chosen so as to minimize the distortion between all training data, Σi,j K*(xi, xj) ‖xi − xj‖²; AND minimize the approximation of the original data by the kernel embedding, Σi ‖xi − Σj K*(xi, xj) xj‖². Nathalie Vialaneix | Kernel methods for data integration in systems biology 24/48
  • 55. From multiple kernels to an integrated kernel How to combine multiple kernels? naive approach: K* = (1/M) Σm Km supervised framework: K* = Σm βm Km with βm ≥ 0 and Σm βm = 1, with βm chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011] unsupervised framework but input space is Rp [Zhuang et al., 2011]: K* = Σm βm Km with βm ≥ 0 and Σm βm = 1, with βm chosen so as to minimize the distortion between all training data, Σi,j K*(xi, xj) ‖xi − xj‖²; AND minimize the approximation of the original data by the kernel embedding, Σi ‖xi − Σj K*(xi, xj) xj‖². Our proposal: 2 UMKL frameworks which do not require the data to take values in a Euclidean space. Nathalie Vialaneix | Kernel methods for data integration in systems biology 24/48
  • 56. Multi-kernel/distances integration How to “optimally” combine several relational datasets in an unsupervised setting? for kernels K1, . . . , KM obtained on the same n objects, search: Kβ = Σm=1..M βm Km with βm ≥ 0 and Σm βm = 1 [Mariette and Villa-Vialaneix, 2018] R package mixKernel https://cran.r-project.org/package=mixKernel Nathalie Vialaneix | Kernel methods for data integration in systems biology 25/48
  • 57. STATIS like framework [L’Hermier des Plantes, 1976, Lavit et al., 1994] Similarities between kernels: Cmm' = ⟨Km, Km'⟩F / (‖Km‖F ‖Km'‖F) = Trace(Km Km') / √(Trace((Km)²) Trace((Km')²)). (Cmm' is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework) Nathalie Vialaneix | Kernel methods for data integration in systems biology 26/48
  • 58. STATIS like framework [L’Hermier des Plantes, 1976, Lavit et al., 1994] Similarities between kernels: Cmm' = ⟨Km, Km'⟩F / (‖Km‖F ‖Km'‖F) = Trace(Km Km') / √(Trace((Km)²) Trace((Km')²)). (Cmm' is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework) maximize over v: Σm=1..M ⟨K*(v), Km/‖Km‖F⟩F = v⊤Cv for K*(v) = Σm=1..M vm Km and v ∈ RM such that ‖v‖₂ = 1. Nathalie Vialaneix | Kernel methods for data integration in systems biology 26/48
  • 59. STATIS like framework [L’Hermier des Plantes, 1976, Lavit et al., 1994] Similarities between kernels: Cmm' = ⟨Km, Km'⟩F / (‖Km‖F ‖Km'‖F) = Trace(Km Km') / √(Trace((Km)²) Trace((Km')²)). (Cmm' is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework) maximize over v: Σm=1..M ⟨K*(v), Km/‖Km‖F⟩F = v⊤Cv for K*(v) = Σm=1..M vm Km and v ∈ RM such that ‖v‖₂ = 1. Solution: first eigenvector of C ⇒ set β = v / (Σm=1..M vm) (consensual kernel). Nathalie Vialaneix | Kernel methods for data integration in systems biology 26/48
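The STATIS-like combination above reduces to one eigendecomposition of the small M × M matrix C. A minimal numpy sketch (the toy kernels are hypothetical; the mixKernel package implements the full method):

```python
import numpy as np

def statis_weights(kernels):
    """Consensus weights: leading eigenvector of the matrix C of cosine
    (Frobenius) similarities between kernels, normalized to sum to 1."""
    M = len(kernels)
    C = np.empty((M, M))
    for a in range(M):
        for b in range(M):
            C[a, b] = np.sum(kernels[a] * kernels[b]) / (
                np.linalg.norm(kernels[a]) * np.linalg.norm(kernels[b]))
    _, vecs = np.linalg.eigh(C)       # ascending eigenvalue order
    v = np.abs(vecs[:, -1])           # leading eigenvector, non-negative
    return v / v.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K1 = X @ X.T
K2 = 2 * K1        # same geometry as K1, different scale
K3 = np.eye(10)    # unrelated kernel
beta = statis_weights([K1, K2, K3])
K_star = sum(b * K for b, K in zip(beta, [K1, K2, K3]))
```

Since K1 and K2 carry the same geometry, they agree with each other more than with K3 and therefore receive equal, larger weights in the consensus kernel.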
  • 60. A kernel preserving the original topology of the data I Similarly to [Lin et al., 2010], preserve the local geometry of the data in the feature space. Nathalie Vialaneix | Kernel methods for data integration in systems biology 27/48
  • 61. A kernel preserving the original topology of the data I Similarly to [Lin et al., 2010], preserve the local geometry of the data in the feature space. Proxy of the local geometry: Km → Gmk (k-nearest neighbors graph) → Amk (adjacency matrix) ⇒ W = Σm I{Amk > 0} or W = Σm Amk Nathalie Vialaneix | Kernel methods for data integration in systems biology 27/48
  • 62. A kernel preserving the original topology of the data I Similarly to [Lin et al., 2010], preserve the local geometry of the data in the feature space. Proxy of the local geometry: Km → Gmk (k-nearest neighbors graph) → Amk (adjacency matrix) ⇒ W = Σm I{Amk > 0} or W = Σm Amk Feature space geometry measured by Δi(β) = (⟨φ*β(xi), φ*β(x1)⟩, . . . , ⟨φ*β(xi), φ*β(xn)⟩) = (K*β(xi, x1), . . . , K*β(xi, xn)) Nathalie Vialaneix | Kernel methods for data integration in systems biology 27/48
  • 63. A kernel preserving the original topology of the data II Sparse version minimize over β: Σi,j=1..n Wij ‖Δi(β) − Δj(β)‖² for K*β = Σm=1..M βm Km and β ∈ RM st βm ≥ 0 and Σm=1..M βm = 1. Non sparse version minimize over v: Σi,j=1..n Wij ‖Δi(v) − Δj(v)‖² for K*v = Σm=1..M vm Km and v ∈ RM st vm ≥ 0 and ‖v‖₂ = 1. Nathalie Vialaneix | Kernel methods for data integration in systems biology 28/48
  • 64. A kernel preserving the original topology of the data II Sparse version equivalent to a standard QP problem with linear constraints (ex: package quadprog in R) Non sparse version equivalent to a QPQC problem (harder to solve) solved with “Alternating Direction Method of Multipliers” (ADMM [Boyd et al., 2011]) Nathalie Vialaneix | Kernel methods for data integration in systems biology 29/48
  • 65. Application to TARA oceans Similarity between datasets (STATIS) phychem and small size organisms are the most similar (confirmed by [de Vargas et al., 2015] and [Sunagawa et al., 2015]). Nathalie Vialaneix | Kernel methods for data integration in systems biology 30/48
  • 66. Application to TARA oceans Important variables Rhizaria abundance strongly structures the differences between samples (analyses restricted to some organisms found differences mostly based on water depth) and waters from the Arctic and Pacific Oceans differ in terms of Rhizaria abundance Nathalie Vialaneix | Kernel methods for data integration in systems biology 31/48
  • 67. Reducing complexity of kernel methods Nathalie Vialaneix | Kernel methods for data integration in systems biology 32/48
  • 68. Large scale kernel methods Standard complexity (number of elementary operations) of kernel learning methods: O(n²) or even O(n³) Examples K-PCA: spectral decomposition of K (equivalent to PCA in the feature space) is O(n³), as compared to O(min(p, n)³) for standard PCA kernel k-means: complexity of naive kernel k-means is O(Tkn²), as compared to O(Tkpn) for naive standard k-means Nathalie Vialaneix | Kernel methods for data integration in systems biology 33/48
  • 69. Low rank approximation solutions Aim: approximate K with a low rank matrix (a matrix with rank r ≪ n). Then, use the approximation to train the predictor and correct it to “re-scale” the result to the full n samples. Typical computational cost: O(nr²). Nathalie Vialaneix | Kernel methods for data integration in systems biology 34/48
  • 70. Sketch of Nyström approximation [Williams and Seeger, 2000, Drineas and Mahoney, 2005] Pick at random m observations in {1, . . . , n} (without loss of generality, suppose that the first m ones have been chosen). Nathalie Vialaneix | Kernel methods for data integration in systems biology 35/48
  • 71. Sketch of Nyström approximation [Williams and Seeger, 2000, Drineas and Mahoney, 2005] Pick at random m observations in {1, . . . , n} (without loss of generality, suppose that the first m ones have been chosen). Re-write K in blocks: K = [K(m) K(m,n−m); K(n−m,m) K(n−m,n−m)] and K(n,m) = [K(m); K(n−m,m)], with K(n−m,m) = (K(m,n−m))⊤, and use K(m) instead of K. Nathalie Vialaneix | Kernel methods for data integration in systems biology 35/48
  • 72. Approximate spectral decomposition of K Notations: K eigenvectors: (vj)j=1,...,n, eigenvalues: (λj)j=1,...,n (positive, in decreasing order) K(m) eigenvectors: (v(m)j)j=1,...,m, eigenvalues: (λ(m)j)j=1,...,m (positive, in decreasing order) Nathalie Vialaneix | Kernel methods for data integration in systems biology 36/48
  • 73. Approximate spectral decomposition of K Notations: K eigenvectors: (vj)j=1,...,n, eigenvalues: (λj)j=1,...,n (positive, in decreasing order) K(m) eigenvectors: (v(m)j)j=1,...,m, eigenvalues: (λ(m)j)j=1,...,m (positive, in decreasing order) ∀ j = 1, . . . , m, λj ≈ (n/m) λ(m)j and vj ≈ √(m/n) (1/λ(m)j) K(n,m) v(m)j Nathalie Vialaneix | Kernel methods for data integration in systems biology 36/48
  • 74. Approximate spectral decomposition of K Notations: K eigenvectors: (vj)j=1,...,n, eigenvalues: (λj)j=1,...,n (positive, in decreasing order) K(m) eigenvectors: (v(m)j)j=1,...,m, eigenvalues: (λ(m)j)j=1,...,m (positive, in decreasing order) ∀ j = 1, . . . , m, λj ≈ (n/m) λ(m)j and vj ≈ √(m/n) (1/λ(m)j) K(n,m) v(m)j complexity of the direct calculation: O(n³) complexity of the approximate solution: O(m³) + O(nm²) Nathalie Vialaneix | Kernel methods for data integration in systems biology 36/48
  • 75. Approximate spectral decomposition of K Notations: K eigenvectors: (vj)j=1,...,n, eigenvalues: (λj)j=1,...,n (positive, in decreasing order) K(m) eigenvectors: (v(m)j)j=1,...,m, eigenvalues: (λ(m)j)j=1,...,m (positive, in decreasing order) ∀ j = 1, . . . , m, λj ≈ (n/m) λ(m)j and vj ≈ √(m/n) (1/λ(m)j) K(n,m) v(m)j Remark: when the rank of K is < m, the approximation is exact. Nathalie Vialaneix | Kernel methods for data integration in systems biology 36/48
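A numpy sketch of the Nyström approximation (notation as on the slides; the low-rank kernel is a hypothetical toy example chosen so that rank(K) < m, the case where the remark above says the reconstruction is exact):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m, r = 200, 40, 5
A = rng.normal(size=(n, r))
K = A @ A.T                                 # PSD kernel of rank r < m

idx = rng.choice(n, size=m, replace=False)  # m landmark samples
K_mm = K[np.ix_(idx, idx)]                  # K(m)
K_nm = K[:, idx]                            # K(n,m)

# Approximate spectral decomposition, as on the slide:
lam_m, V_m = np.linalg.eigh(K_mm)
lam_m, V_m = lam_m[::-1], V_m[:, ::-1]      # decreasing eigenvalue order
lam_approx = (n / m) * lam_m[:r]            # ~ top eigenvalues of K
V_approx = np.sqrt(m / n) * K_nm @ (V_m[:, :r] / lam_m[:r])

# Nyström reconstruction of the full kernel: exact here since rank(K) < m
K_tilde = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T
err = np.abs(K_tilde - K).max()             # ~ 0 up to rounding
```

Note that only K_mm and K_nm are ever decomposed or multiplied, matching the O(m³) + O(nm²) cost instead of O(n³).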
  • 76. What can be obtained from that... Approximation of the original kernel K [Cortes et al., 2010, Bach, 2013] Approximation of the kernel (ridge) regression with a control of the estimation error [Cortes et al., 2010, Bach, 2013] (similar methods exist to use Nyström approximation in SVM). Various derived extensions [Mariette et al., 2017a] (online Self-Organizing Maps) Nathalie Vialaneix | Kernel methods for data integration in systems biology 37/48
  • 77. Basics on other approaches in an online framework Online learning: deal with samples one by one and update (at low cost) the model How to use online learning for reducing complexity?: cache some operations in memory + (not always) impose sparsity on representers (centers of gravity or so...) [Rossi et al., 2007] for kernel k-means - [Mariette et al., 2017b] for kernel SOM Nathalie Vialaneix | Kernel methods for data integration in systems biology 38/48
  • 78. Basics on (standard) stochastic SOM [Kohonen, 2001] x x x (xi)i=1,...,n ⊂ Rp are affected to a unit f(xi) ∈ {1, . . . , U} the grid is equipped with a “distance” between units: d(u, u ) and observations affected to close units are close in Rp every unit u corresponds to a prototype, pu (x) in Rp Nathalie Vialaneix | Kernel methods for data integration in systems biology 39/48
  • 79. Basics on (standard) stochastic SOM [Kohonen, 2001] x x x Iterative learning (assignment step): xi is picked at random within (xk )k and affected to best matching unit: ft (xi) = arg min u xi − pt u 2 Nathalie Vialaneix | Kernel methods for data integration in systems biology 39/48
  • 80. Basics on (standard) stochastic SOM [Kohonen, 2001] x x x Iterative learning (representation step): all prototypes in neighboring units are updated with a gradient descent like step: pt+1 u ←− pt u + µ(t)Ht (d(f(xi), u))(xi − pt u) Nathalie Vialaneix | Kernel methods for data integration in systems biology 39/48
  • 81. Extension of SOM to data described by a kernel [Villa and Rossi, 2007] Data: (xi)i=1,...,n ∈ Rp 1: Initialization: randomly set p0 1 , ..., p0 U in Rd 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U xi − pt u 2 5: for all u = 1 → U do Representation 6: pt+1 u = pt u + µ(t)Ht (d(ft (xi), u)) xi − pt u 7: end for 8: end for The general relational variant is implemented in SOMbrero. Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
  • 82. Extension of SOM to data described by a kernel [Villa and Rossi, 2007] Data: (xi)i=1,...,n ∈ X 1: Initialization: randomly set p0 1 , ..., p0 U in Rd 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U xi − pt u 2 5: for all u = 1 → U do Representation 6: pt+1 u = pt u + µ(t)Ht (d(ft (xi), u)) xi − pt u 7: end for 8: end for The general relational variant is implemented in SOMbrero. Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
  • 83. Extension of SOM to data described by a kernel [Villa and Rossi, 2007] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 β0 ui φ(xi) (convex combination) 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U xi − pt u 2 5: for all u = 1 → U do Representation 6: pt+1 u = pt u + µ(t)Ht (d(ft (xi), u)) xi − pt u 7: end for 8: end for The general relational variant is implemented in SOMbrero. Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
  • 84. Extension of SOM to data described by a kernel [Villa and Rossi, 2007] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 β0 ui φ(xi) (convex combination) 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U φ(xi) − pt u 2 X 5: for all u = 1 → U do Representation 6: pt+1 u = pt u + µ(t)Ht (d(ft (xi), u)) xi − pt u 7: end for 8: end for The general relational variant is implemented in SOMbrero. Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
  • 85. Extension of SOM to data described by a kernel [Villa and Rossi, 2007] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 β0 ui φ(xi) (convex combination) 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U φ(xi) − pt u 2 X 5: for all u = 1 → U do Representation 6: pt+1 u = pt u + µ(t)Ht (d(ft (xi), u)) φ(xi) − pt u 7: end for 8: end for The general relational variant is implemented in SOMbrero. Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
  • 86. Extension of SOM to data described by a kernel [Villa and Rossi, 2007] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 β0 ui φ(xi) (convex combination) 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U (βt u) Kβt u − 2(βt u) K.i 5: for all u = 1 → U do Representation 6: βt+1 u = βt u + µ(t)Ht (d(ft (xi), u)) 1i − βt u 7: end for 8: end for The general relational variant is implemented in SOMbrero. Nathalie Vialaneix | Kernel methods for data integration in systems biology 40/48
  • 87. Example: SOM for typology of Astraptes fulgerator from DNA barcoding Edit distances between DNA sequences [Olteanu and Villa-Vialaneix, 2015] Almost perfect clustering (identifying a possible label error on one sample) with (in addition) information on relations between species. Nathalie Vialaneix | Kernel methods for data integration in systems biology 41/48
  • 88. Problems with KSOM Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 β0 ui φ(xi) (convex combination) 2: for t = 1 → γn do 3: pick randomly i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U n j,j =1 βt ujβt uj Kjj − 2 n j=1 βt ujKji → O(n2 U) 5: for all u = 1 → U do Representation 6: βt+1 u = βt u + µ(t)Ht (d(ft (xi), u))(1i − βt u) → O(nU) 7: end for 8: end for → algorithm complexity: O(γn3 U) (compared to O(γUpn) for numeric) Nathalie Vialaneix | Kernel methods for data integration in systems biology 42/48
  • 89. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: for t = 1 → γn do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U n j,j =1 βt ujβt uj Kjj − 2 n j=1 βt ujKji 5: for all u = 1 → U do Representation 6: βt+1 u = βt u + µ(t)Ht (d(ft (xi), u))(1i − βt u) 7: end for 8: end for Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
  • 90. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: for t = 1 → γn do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U n j,j =1 βt ujβt uj Kjj At u −2 n j=1 βt ujKji Bt ui 5: for all u = 1 → U do Representation 6: βt+1 u = βt u + µ(t)Ht (d(ft (xi), u))(1i − βt u) 7: end for 8: end for Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
  • 91. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: for t = 1 → γn do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U At u − 2Bt ui 5: for all u = 1 → U do Representation 6: βt+1 u = βt u + µ(t)Ht (d(ft (xi), u))(1i − βt u) 7: end for 8: end for Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
  • 92. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: for t = 1 → γn do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U At u − 2Bt ui 5: for all u = 1 → U do Representation 6: βt+1 u = βt u + µ(t)Ht (d(ft (xi), u)) λu(t) (1i − βt u) 7: end for 8: end for Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
  • 93. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: for t = 1 → γn do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U At u − 2Bt ui 5: for all u = 1 → U do Representation 6: βt+1 u = (1 − λu(t))βt u + λu(t)1i 7: end for 8: end for Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
  • 94. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: A0 u = n j,j =1 β0 uj β0 uj Kjj 3: B0 ui = n j=1 β0 uj Kji 4: for t = 1 → γn do 5: pick at random i ∈ {1, . . . , n} 6: Assignment ft (xi) = arg min u=1,...,U At u − 2Bt ui 7: for all u = 1 → U do Representation 8: βt+1 u = (1 − λu(t))βt u + λu(t)1i 9: end for 10: end for Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
  • 95. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: A0 u = n j,j =1 β0 uj β0 uj Kjj 3: B0 ui = n j=1 β0 uj Kji 4: for t = 1 → γn do 5: pick at random i ∈ {1, . . . , n} 6: Assignment ft (xi) = arg min u=1,...,U At u − 2Bt ui 7: for all u = 1 → U do Representation 8: βt+1 u = (1 − λu(t))βt u + λu(t)1i Bt+1 ui = n j=1 βt+1 uj Ki j = (1 − λu(t))Bt ui + λu(t)Kii 9: end for 10: end for Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
  • 96. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: A0 u = n j,j =1 β0 uj β0 uj Kjj 3: B0 ui = n j=1 β0 uj Kji 4: for t = 1 → γn do 5: pick at random i ∈ {1, . . . , n} 6: Assignment ft (xi) = arg min u=1,...,U At u − 2Bt ui 7: for all u = 1 → U do Representation 8: βt+1 u = (1 − λu(t))βt u + λu(t)1i Bt+1 ui = n j=1 βt+1 uj Ki j = (1 − λu(t))Bt ui + λu(t)Kii At+1 u = n j,j =1 βt+1 uj βt+1 uj Kjj = (1−λu(t))2 At u+λu(t)2 Kii +2λu(t)(1−λu(t))Bt ui 9: end for 10: end for Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
  • 97. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: A0 u = n j,j =1 β0 uj β0 uj Kjj → O(n2 U) 3: B0 ui = n j=1 β0 uj Kji → O(nU) 4: for t = 1 → γn do 5: pick at random i ∈ {1, . . . , n} 6: Assignment ft (xi) = arg min u=1,...,U At u − 2Bt ui 7: for all u = 1 → U do Representation 8: βt+1 u = (1 − λu(t))βt u + λu(t)1i Bt+1 ui = n j=1 βt+1 uj Ki j = (1 − λu(t))Bt ui + λu(t)Kii At+1 u = n j,j =1 βt+1 uj βt+1 uj Kjj = (1−λu(t))2 At u+λu(t)2 Kii +2λu(t)(1−λu(t))Bt ui 9: end for 10: end for Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
  • 98. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: A0 u = n j,j =1 β0 uj β0 uj Kjj → O(n2 U) 3: B0 ui = n j=1 β0 uj Kji → O(nU) 4: for t = 1 → γn do 5: pick at random i ∈ {1, . . . , n} 6: Assignment ft (xi) = arg min u=1,...,U At u − 2Bt ui → does not depend on n 7: for all u = 1 → U do Representation 8: βt+1 u = (1 − λu(t))βt u + λu(t)1i Bt+1 ui = n j=1 βt+1 uj Ki j = (1 − λu(t))Bt ui + λu(t)Kii At+1 u = n j,j =1 βt+1 uj βt+1 uj Kjj = (1−λu(t))2 At u+λu(t)2 Kii +2λu(t)(1−λu(t))Bt ui 9: end for 10: end for Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
  • 99. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: A0 u = n j,j =1 β0 uj β0 uj Kjj → O(n2 U) 3: B0 ui = n j=1 β0 uj Kji → O(nU) 4: for t = 1 → γn do 5: pick at random i ∈ {1, . . . , n} 6: Assignment ft (xi) = arg min u=1,...,U At u − 2Bt ui → does not depend on n 7: for all u = 1 → U do Representation 8: βt+1 u = (1 − λu(t))βt u + λu(t)1i Bt+1 ui = n j=1 βt+1 uj Ki j = (1 − λu(t))Bt ui + λu(t)Kii → O(nU) At+1 u = n j,j =1 βt+1 uj βt+1 uj Kjj = (1−λu(t))2 At u+λu(t)2 Kii +2λu(t)(1−λu(t))Bt ui → O(U) 9: end for 10: end for Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
  • 100. Reducing the stochastic K-SOM complexity [Mariette et al., 2017a] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u = n i=1 βuiφ(xi) (convex combination) 2: A0 u = n j,j =1 β0 uj β0 uj Kjj → O(n2 U) 3: B0 ui = n j=1 β0 uj Kji → O(nU) 4: for t = 1 → γn do 5: pick at random i ∈ {1, . . . , n} 6: Assignment ft (xi) = arg min u=1,...,U At u − 2Bt ui → does not depend on n 7: for all u = 1 → U do Representation 8: βt+1 u = (1 − λu(t))βt u + λu(t)1i Bt+1 ui = n j=1 βt+1 uj Ki j = (1 − λu(t))Bt ui + λu(t)Kii → O(nU) At+1 u = n j,j =1 βt+1 uj βt+1 uj Kjj = (1−λu(t))2 At u+λu(t)2 Kii +2λu(t)(1−λu(t))Bt ui → O(U) 9: end for 10: end for Final complexity: O(γn2 U) with additional storage memory of O(U) and O(Un). Nathalie Vialaneix | Kernel methods for data integration in systems biology 43/48
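The cached-update algorithm above can be sketched as follows (a hypothetical minimal implementation on a 1D grid with a Gaussian neighborhood and linearly decreasing rates; SOMbrero provides the full version). Only the caches A and B enter the assignment and representation steps, so each iteration costs O(nU) instead of O(n²U):

```python
import numpy as np

def kernel_som(K, U=4, n_iter=None, rng=None):
    """Stochastic kernel SOM with cached A[u] = beta_u' K beta_u and
    B[u, i] = (beta_u' K)_i, updated in O(nU) per iteration."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = K.shape[0]
    n_iter = n_iter if n_iter is not None else 5 * n
    beta = rng.dirichlet(np.ones(n), size=U)     # convex combinations
    A = np.einsum('ui,ij,uj->u', beta, K, beta)  # computed once, O(n^2 U)
    B = beta @ K                                 # computed once, O(n^2 U)
    grid = np.arange(U)                          # 1D grid coordinates
    for t in range(n_iter):
        i = rng.integers(n)
        f = np.argmin(A - 2 * B[:, i])           # assignment, O(U)
        # learning rate x neighborhood: lam_u = mu(t) H_t(d(f, u))
        mu = 0.5 * (1 - t / n_iter)
        sigma = max(1.0, (U / 2) * (1 - t / n_iter))
        lam = mu * np.exp(-((grid - f) ** 2) / (2 * sigma ** 2))
        # representation step with cached updates (A first: it uses B^t)
        A = (1 - lam) ** 2 * A + lam ** 2 * K[i, i] + 2 * lam * (1 - lam) * B[:, i]
        B = (1 - lam)[:, None] * B + lam[:, None] * K[i][None, :]
        onehot = np.zeros(n); onehot[i] = 1.0
        beta = (1 - lam)[:, None] * beta + lam[:, None] * onehot[None, :]
    clusters = np.argmin(A[:, None] - 2 * B, axis=0)
    return clusters, beta, A, B

# two well-separated blobs with a linear kernel
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
K = X @ X.T
clusters, beta, A, B = kernel_som(K, U=4, rng=rng)
```

The caches can be checked against their definitions at any point: the returned A must equal βᵤ⊤Kβᵤ and B must equal βK for the final β, which verifies that the incremental updates reproduce the naive computations.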
  • 101. Conclusions Kernel methods are useful for: dealing with different types of data even when they are high-dimensional combining them However, they can be: computationally intensive to train not easy to interpret (work-in-progress with Jérôme Mariette and Céline Brouard on variable selection in unsupervised setting) Nathalie Vialaneix | Kernel methods for data integration in systems biology 44/48
• 102. SOMbrero: Madalina Olteanu, Fabrice Rossi, Marie Cottrell, Laura Bendhaïba and Julien Boelaert
SOMbrero and mixKernel: Jérôme Mariette
adjclust: Pierre Neuvial, Nathanaël Randriamihamison, Guillem Rigaill, Christophe Ambroise and Shubham Chaturvedi
Nathalie Vialaneix | Kernel methods for data integration in systems biology 45/48
  • 103. Credits for pictures Slide 3: image based on ENCODE project, by Darryl Leja (NHGRI), Ian Dunham (EBI) and Michael Pazin (NHGRI) Slide 8: k-means image from Wikimedia Commons by Weston.pace Slide 10: Astraptes picture is from https://www.flickr.com/photos/39139121@N00/2045403823/ by Anne Toal (CC BY-SA 2.0), Hi-C experiment is taken from the article Matharu et al., 2015 DOI:10.1371/journal.pgen.1005640 (CC BY-SA 4.0) and metagenomics illustration is taken from the article Sommer et al., 2010 DOI:10.1038/msb.2010.16 (CC BY-NC-SA 3.0) Other pictures are from articles that I co-authored. Nathalie Vialaneix | Kernel methods for data integration in systems biology 46/48
• 104. References Ambroise, C., Dehman, A., Neuvial, P., Rigaill, G., and Vialaneix, N. (2019). Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics. arXiv preprint arXiv:1902.01596. Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404. Bach, F. (2013). Sharp analysis of low-rank kernel matrix approximations. Journal of Machine Learning Research, Workshop and Conference Proceedings, 30:185–209. Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122. Brouard, C., Shen, H., Dührkop, K., d'Alché-Buc, F., Böcker, S., and Rousu, J. (2016). Fast metabolite identification with input output kernel regression. Bioinformatics, 32(12):i28–i36. Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J., Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S., Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker, P., Karsenti, E., and Sullivan, M. (2015). Patterns and ecological drivers of ocean viral communities. Science, 348(6237). Cortes, C., Mohri, M., and Talwalkar, A. (2010). On the impact of kernel approximation on learning accuracy. Journal of Machine Learning Research, Workshop and Conference Proceedings, 9:113–120. Crone, L. and Crosby, D. (1995). Nathalie Vialaneix | Kernel methods for data integration in systems biology 46/48
  • 105. Statistical applications of a metric on subspaces to satellite meteorology. Technometrics, 37(3):324–328. de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I., Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O., Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F., Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S., Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015). Eukaryotic plankton diversity in the sunlit ocean. Science, 348(6237). Drineas, P. and Mahoney, M. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175. Goldfarb, L. (1984). A unified approach to pattern recognition. Pattern Recognition, 17(5):575–582. Gönen, M. and Alpaydin, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268. Jaakkola, T., Diekhans, M., and Haussler, D. (2000). A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7(1-2):95–114. Kohonen, T. (2001). Self-Organizing Maps, 3rd Edition, volume 30. Springer, Berlin, Heidelberg, New York. Kondor, R. and Lafferty, J. (2002). Diffusion kernels on graphs and other discrete structures. Nathalie Vialaneix | Kernel methods for data integration in systems biology 46/48
• 106. In Sammut, C. and Hoffmann, A., editors, Proceedings of the 19th International Conference on Machine Learning, pages 315–322, Sydney, Australia. Morgan Kaufmann Publishers Inc. San Francisco, CA, USA. Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18(1):97–119. L'Hermier des Plantes, H. (1976). Structuration des tableaux à trois indices de la statistique. PhD thesis, Université de Montpellier. Thèse de troisième cycle. Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F., Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d'Oviedo, F., de Meester, L., Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P., Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G., Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M., Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015). Determinants of community structure in the global plankton interactome. Science, 348(6237). Lin, Y.-Y., Liu, T.-L., and Fuh, C.-S. (2010). Multiple kernel learning for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160. Mariette, J., Olteanu, M., and Villa-Vialaneix, N. (2017a). Efficient interpretable variants of online SOM for large dissimilarity data. Neurocomputing, 225:31–48. Mariette, J., Rossi, F., Olteanu, M., and Villa-Vialaneix, N. (2017b). Accelerating stochastic kernel SOM. In Verleysen, M., editor, XXVth European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2017), pages 269–274, Bruges, Belgium. i6doc. Mariette, J. and Vialaneix, N. (2019). 
Nathalie Vialaneix | Kernel methods for data integration in systems biology 46/48
• 107. Approches à noyau pour l'analyse et l'intégration de données omiques en biologie des systèmes. Forthcoming (book chapter). Mariette, J. and Villa-Vialaneix, N. (2018). Unsupervised multiple kernel learning for heterogeneous data integration. Bioinformatics, 34(6):1009–1015. Olteanu, M. and Villa-Vialaneix, N. (2015). On-line relational and multiple relational SOM. Neurocomputing, 147:15–30. Randriamihamison, N., Vialaneix, N., and Neuvial, P. (2019). Applicability and interpretability of hierarchical agglomerative clustering with or without contiguity constraints. Submitted for publication. Preprint arXiv 1909.10923. Robert, P. and Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Applied Statistics, 25(3):257–265. Rossi, F., Hasenfuss, A., and Hammer, B. (2007). Accelerating relational clustering algorithms with sparse prototype representation. In Proceedings of the 6th Workshop on Self-Organizing Maps (WSOM 07), Bielefield, Germany. Neuroinformatics Group, Bielefield University. Saigo, H., Vert, J.-P., Ueda, N., and Akutsu, T. (2004). Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689. Shen, H., Dührkop, K., Böcker, S., and Rousu, J. (2014). Metabolite identification through multiple kernel learning on fragmentation trees. Bioinformatics, 30(12):i157–i164. Sommer, M., Church, G., and Dantas, G. (2010). A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion. Nathalie Vialaneix | Kernel methods for data integration in systems biology 46/48
  • 108. Molecular Systems Biology, 6(360). Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A., Cornejo-Castillo, F., Costea, P., Cruaud, C., d’Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka, F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P., Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015). Structure and function of the global ocean microbiome. Science, 348(6237). Villa, N. and Rossi, F. (2007). A comparison between dissimilarity SOM and kernel SOM for clustering the vertices of a graph. In 6th International Workshop on Self-Organizing Maps (WSOM 2007), Bielefield, Germany. Neuroinformatics Group, Bielefield University. Williams, C. and Seeger, M. (2000). Using the Nyström method to speed up kernel machines. In Leen, T., Dietterich, T., and Tresp, V., editors, Advances in Neural Information Processing Systems (Proceedings of NIPS 2000), volume 13, Denver, CO, USA. Neural Information Processing Systems Foundation. Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011). Unsupervised multiple kernel clustering. Journal of Machine Learning Research: Workshop and Conference Proceedings, 20:129–144. Nathalie Vialaneix | Kernel methods for data integration in systems biology 47/48
• 111. Optimization issues
Sparse version: min_β β^T S β s.t. β ≥ 0 and ||β||_1 = Σ_m β_m = 1 ⇒ standard QP problem with linear constraints (e.g., package quadprog in R).
Non-sparse version: min_β β^T S β s.t. β ≥ 0 and ||β||_2 = 1 ⇒ QCQP problem (hard to solve).
Solved using the Alternating Direction Method of Multipliers (ADMM [Boyd et al., 2011]) by replacing the previous optimization problem with
min_{x,z} x^T S x + 1_{{x ≥ 0}}(x) + 1_{{||z||_2² ≥ 1}}(z) under the constraint x − z = 0.
Nathalie Vialaneix | Kernel methods for data integration in systems biology 47/48
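For the sparse version, any QP solver works. Below is a hedged Python sketch using scipy's SLSQP in place of R's quadprog; the function name `sparse_kernel_weights` is an illustration, not mixKernel's API, and the gradient assumes S is symmetric (which holds for a similarity matrix between kernels).

```python
import numpy as np
from scipy.optimize import minimize

def sparse_kernel_weights(S):
    """Solve min_beta beta^T S beta  s.t.  beta >= 0 and sum(beta) = 1.

    Generic sketch with scipy's SLSQP standing in for R's quadprog;
    assumes S symmetric (so the gradient of b^T S b is 2 S b)."""
    M = S.shape[0]
    res = minimize(
        fun=lambda b: b @ S @ b,
        x0=np.full(M, 1.0 / M),                                    # uniform start
        jac=lambda b: 2 * S @ b,
        bounds=[(0, None)] * M,                                    # beta >= 0
        constraints={'type': 'eq', 'fun': lambda b: b.sum() - 1},  # ||beta||_1 = 1
        method='SLSQP',
    )
    return res.x
```

As a sanity check, for a diagonal S = diag(s) the KKT conditions give weights proportional to 1/s_m.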
• 112. Optimization issues: the ADMM iterations
1. min_x x^T S x + y^T(x − z) + (λ/2)||x − z||² under the constraint x ≥ 0 (standard QP problem)
2. project on the unit ball: z = x / min{||x||_2, 1}
3. update the auxiliary variable: y = y + λ(x − z)
Nathalie Vialaneix | Kernel methods for data integration in systems biology 47/48
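The three ADMM steps can be sketched directly. Again a hedged illustration, not mixKernel's implementation: the inner QP is solved with L-BFGS-B plus bound constraints, and the penalty parameter `lam` is kept fixed.

```python
import numpy as np
from scipy.optimize import minimize

def admm_kernel_weights(S, lam=1.0, n_iter=300):
    """ADMM sketch for min_beta beta^T S beta  s.t.  beta >= 0, ||beta||_2 = 1.

    Follows the three steps above; lam is the augmented-Lagrangian parameter."""
    M = S.shape[0]
    x = np.full(M, 1.0 / np.sqrt(M))
    z = x.copy()
    y = np.zeros(M)
    for _ in range(n_iter):
        # 1. x-update: QP  min_x x^T S x + y^T (x - z) + lam/2 ||x - z||^2,  x >= 0
        x = minimize(
            fun=lambda v: v @ S @ v + y @ (v - z) + 0.5 * lam * np.sum((v - z) ** 2),
            x0=x,
            jac=lambda v: 2 * S @ v + y + lam * (v - z),
            bounds=[(0, None)] * M,
            method='L-BFGS-B',
        ).x
        # 2. z-update: z = x / min(||x||_2, 1), i.e. rescale x onto the unit
        #    sphere whenever it falls inside the unit ball
        z = x / max(min(np.linalg.norm(x), 1.0), 1e-12)  # guard against x = 0
        # 3. dual update
        y = y + lam * (x - z)
    return z
```

For a diagonal S the constrained minimizer is the coordinate axis with the smallest diagonal entry, which gives a quick sanity check of the iterations.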
• 115. A proposal to improve interpretability of K-PCA in our framework
Issue: how to assess the importance of a given species in the K-PCA?
- our datasets are either numeric (environmental) or built from an n × p count matrix ⇒ for a given species, randomly permute its counts and redo the analysis (kernel computation, with the same optimized weights, and K-PCA)
- the influence of a given species in a given dataset on a given PC subspace is assessed by computing the Crone-Crosby distance between these two PCA subspaces [Crone and Crosby, 1995] (∼ Frobenius norm between the projectors)
Nathalie Vialaneix | Kernel methods for data integration in systems biology 48/48
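The permutation scheme can be sketched with a plain linear PCA standing in for the weighted K-PCA; the function names below are hypothetical stand-ins (the real pipeline would recompute the combined kernel with the same optimized weights before re-running K-PCA).

```python
import numpy as np

def pca_axes(X, d=2):
    """Orthonormal basis (columns) of the first d PCA axes of X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T

def crone_crosby(V1, V2):
    """Crone-Crosby distance between the subspaces spanned by the orthonormal
    columns of V1 and V2: Frobenius norm of the projector difference, over sqrt(2)."""
    return np.linalg.norm(V1 @ V1.T - V2 @ V2.T) / np.sqrt(2)

def species_influence(counts, j, d=2, rng=None):
    """Permute the counts of species (column) j across samples, redo the
    (here: linear) PCA, and measure how far the d-dimensional PC subspace moved."""
    rng = np.random.default_rng(rng)
    V = pca_axes(counts, d)
    permuted = counts.copy()
    permuted[:, j] = rng.permutation(permuted[:, j])
    return crone_crosby(V, pca_axes(permuted, d))
```

Identical subspaces give distance 0 and orthogonal one-dimensional subspaces give 1, so the score is directly comparable across species and datasets.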