Multiple kernel learning applied to the integration of Tara Oceans datasets
Nathalie Villa-Vialaneix
Joint work with Jérôme Mariette

Séminaire de probabilités et statistique, Institut Élie Cartan de Lorraine
http://www.iecl.univ-lorraine.fr
Nancy, France
February 9th, 2017


  1. Multiple kernel learning applied to the integration of Tara Oceans datasets
     Nathalie Villa-Vialaneix, joint work with Jérôme Mariette
     http://www.nathalievilla.org
     7 February 2017, Institut Élie Cartan, Université de Lorraine
  2-3. Outline
     1 Metagenomic datasets and associated questions
     2 A typical (and rich) case study: TARA Oceans datasets
     3 A UMKL framework for integrating multiple metagenomic data
     4 Application to TARA Oceans datasets
  4-6. What are metagenomic data? (figure; source: [Sommer et al., 2010])
     abundance data: sparse n × p matrices of count data, with samples in rows and
     descriptors (species, OTUs, KEGG groups, k-mers, ...) in columns; generally p ≫ n.
     phylogenetic tree (evolutionary history between species, OTUs, ...): one tree with
     p leaves, built from the sequences collected in the n samples.
  7-8. What are metagenomic data used for?
     produce a profile of the diversity of a given sample ⇒ allows diversity to be
     compared between various conditions;
     used in various fields: environmental science, microbiota, ...
     Processed by computing a relevant dissimilarity between samples (the standard
     Euclidean distance is not relevant) and by using this dissimilarity in subsequent
     analyses.
  9. β-diversity data: dissimilarities between count data
     Compositional dissimilarities (n_{ig}: count of species g in sample i):
     Jaccard, the fraction of species specific to either sample i or j:
         d_{jac} = \frac{\sum_g \left( I_{\{n_{ig}>0,\, n_{jg}=0\}} + I_{\{n_{jg}>0,\, n_{ig}=0\}} \right)}{\sum_g I_{\{n_{ig}+n_{jg}>0\}}}
     Bray-Curtis, the fraction of the sample which is specific to either sample i or j:
         d_{BC} = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})}
     Other dissimilarities are available in the R package phyloseq, most of them non
     Euclidean. (A minimal R sketch of these two dissimilarities follows.)
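To make the two compositional dissimilarities concrete, here is a minimal base-R sketch on a toy count matrix; the object and function names are illustrative, and this is not the phyloseq implementation mentioned on the slide.

```r
# Toy abundance matrix: n samples (rows) x p species (columns).
set.seed(1)
counts <- matrix(rpois(5 * 10, lambda = 2), nrow = 5,
                 dimnames = list(paste0("sample", 1:5), paste0("sp", 1:10)))

# Bray-Curtis dissimilarity between samples i and j:
# sum_g |n_ig - n_jg| / sum_g (n_ig + n_jg)
bray_curtis <- function(x, i, j) {
  sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
}

# Jaccard-type dissimilarity: fraction of the observed species that are
# present in only one of the two samples.
jaccard <- function(x, i, j) {
  pres_i <- x[i, ] > 0
  pres_j <- x[j, ] > 0
  sum(xor(pres_i, pres_j)) / sum(pres_i | pres_j)
}

# Full n x n dissimilarity matrices.
n <- nrow(counts)
D_bc  <- outer(1:n, 1:n, Vectorize(function(i, j) bray_curtis(counts, i, j)))
D_jac <- outer(1:n, 1:n, Vectorize(function(i, j) jaccard(counts, i, j)))
```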
  10-13. β-diversity data: phylogenetic dissimilarities (figure)
     For each branch e, denote l_e its length and p_{ei} the fraction of counts in
     sample i corresponding to species below branch e.
     Unifrac, the fraction of the tree specific to either sample i or sample j:
         d_{UF} = \frac{\sum_e l_e \left( I_{\{p_{ei}>0,\, p_{ej}=0\}} + I_{\{p_{ej}>0,\, p_{ei}=0\}} \right)}{\sum_e l_e I_{\{p_{ei}+p_{ej}>0\}}}
     Weighted Unifrac, the fraction of the diversity specific to sample i or to sample j:
         d_{wUF} = \frac{\sum_e l_e |p_{ei} - p_{ej}|}{\sum_e (p_{ei} + p_{ej})}
     (An R sketch of both follows.)
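A small sketch of the two UniFrac-type dissimilarities above, assuming the branch lengths l_e and the per-sample branch proportions p_{ei} have already been extracted from a phylogenetic tree; the numbers below are made up for illustration.

```r
# Assumed inputs (illustrative): one length per branch, and for each sample the
# fraction of its counts found below each branch.
le <- c(0.4, 0.1, 0.3, 0.2, 0.5)
P  <- rbind(sample1 = c(0.5, 0.2, 0.3, 0.0, 0.6),
            sample2 = c(0.1, 0.0, 0.4, 0.5, 0.2))

# Unweighted Unifrac: branch-length weighted fraction of the tree
# specific to one of the two samples.
unifrac <- function(le, p_i, p_j) {
  only_one <- (p_i > 0 & p_j == 0) | (p_j > 0 & p_i == 0)
  sum(le * only_one) / sum(le * (p_i + p_j > 0))
}

# Weighted Unifrac, as written on the slide:
# sum_e le |p_ei - p_ej| / sum_e (p_ei + p_ej)
weighted_unifrac <- function(le, p_i, p_j) {
  sum(le * abs(p_i - p_j)) / sum(p_i + p_j)
}

unifrac(le, P["sample1", ], P["sample2", ])
weighted_unifrac(le, P["sample1", ], P["sample2", ])
```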
  14. Outline
     1 Metagenomic datasets and associated questions
     2 A typical (and rich) case study: TARA Oceans datasets
     3 A UMKL framework for integrating multiple metagenomic data
     4 Application to TARA Oceans datasets
  15. TARA Oceans datasets: the 2009-2013 expedition
     Co-directed by Étienne Bourgois and Éric Karsenti.
     7,012 datasets collected from 35,000 samples of plankton and water (11,535 Gb of data).
     Study of the plankton: bacteria, protists, metazoans and viruses, representing more
     than 90% of the biomass in the ocean.
  16. TARA Oceans datasets
     Science (May 2015), studies on: eukaryotic plankton diversity [de Vargas et al., 2015],
     ocean viral communities [Brum et al., 2015], the global plankton interactome
     [Lima-Mendez et al., 2015], the global ocean microbiome [Sunagawa et al., 2015], ...
     → datasets of different types and from different sources, analyzed separately.
  17. Background of this talk
     Objectives. Until now: many papers using many methods, but no integrated analysis
     performed. What do the datasets reveal if integrated in a single analysis?
     Our purpose: develop a generic method to integrate phylogenetic, taxonomic and
     functional community composition together with environmental factors.
  18-22. TARA Oceans datasets that we used
     environmental dataset: 22 numeric features (temperature, salinity, ...);
     bacteria phylogenomic tree: computed from ∼35,000 OTUs [Sunagawa et al., 2015];
     bacteria functional composition: ∼63,000 KEGG orthologous groups [Sunagawa et al., 2015];
     eukaryotic plankton composition, split into 4 size groups: pico (0.8-5 µm),
     nano (5-20 µm), micro (20-180 µm) and meso (180-2000 µm) [de Vargas et al., 2015];
     virus composition: ∼867 virus clusters based on shared gene content [Brum et al., 2015].
  23. TARA Oceans datasets that we used: common samples
     48 samples, 2 depth layers: surface (SRF) and deep chlorophyll maximum (DCM),
     31 different sampling stations.
  24. Outline
     1 Metagenomic datasets and associated questions
     2 A typical (and rich) case study: TARA Oceans datasets
     3 A UMKL framework for integrating multiple metagenomic data
     4 Application to TARA Oceans datasets
  25-26. Kernel methods
     Kernel viewed as the dot product in an implicit Hilbert space:
     K : X × X → R such that K(x_i, x_j) = K(x_j, x_i) and, for all m ∈ N,
     x_1, ..., x_m ∈ X and α_1, ..., α_m ∈ R,
         \sum_{i,j=1}^m \alpha_i \alpha_j K(x_i, x_j) \geq 0.
     ⇒ [Aronszajn, 1950] there exist a unique (H, ⟨., .⟩) and a map φ : X → H such that
         K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.
  27-31. Exploratory analysis with kernels
     A well-known example: kernel PCA [Schölkopf et al., 1998]. The PCA is performed in
     the feature space induced by the kernel K.
     In practice:
     K is centered, K ← K − (1/N) 1_N K − (1/N) K 1_N + (1/N²) 1_N K 1_N, with 1_N the
     N × N matrix of ones;
     K-PCA is performed by the eigen-decomposition of the (centered) K.
     If (α_k)_{k=1,...,N} ∈ R^N and (λ_k)_{k=1,...,N} are the eigenvectors and eigenvalues,
     the PC axes are a_k = \sum_{i=1}^N α_{ki} φ(x_i), and the (a_k) are orthonormal in the
     feature space induced by the kernel:
         ∀ k, k',  ⟨a_k, a_{k'}⟩ = α_k^T K α_{k'} = δ_{kk'},  with δ_{kk'} = 1 if k = k' and 0 otherwise.
     Coordinates of the projections of the observations (φ(x_i))_i:
         ⟨a_k, φ(x_i)⟩ = \sum_{j=1}^N α_{kj} K_{ji} = K_{i.} α_k = λ_k α_{ki},
     where K_{i.} is the i-th row of K. No representation for the variables (no real
     variables...).
     Other unsupervised kernel methods: kernel SOM
     [Olteanu and Villa-Vialaneix, 2015, Mariette et al., 2017].
     (A minimal K-PCA sketch in R follows.)
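A minimal K-PCA sketch in base R following the steps above (centering, eigen-decomposition, sample projections). The kernel here is a toy linear kernel and the scaling convention follows the slide; this is only an illustration, not the authors' implementation.

```r
# K: an N x N kernel (Gram) matrix; here a linear kernel on toy data.
set.seed(2)
X <- matrix(rnorm(20 * 3), nrow = 20)
K <- X %*% t(X)
N <- nrow(K)

# Center the kernel in feature space:
# K <- K - (1/N) 1 K - (1/N) K 1 + (1/N^2) 1 K 1, with 1 the N x N matrix of ones.
ones <- matrix(1 / N, N, N)
Kc <- K - ones %*% K - K %*% ones + ones %*% K %*% ones

# Eigen-decomposition of the centered kernel.
eig    <- eigen(Kc, symmetric = TRUE)
lambda <- eig$values
alpha  <- eig$vectors      # columns alpha_k

# Coordinates of the projected samples on the first two PC axes,
# following the slide: <a_k, phi(x_i)> = lambda_k * alpha_ki.
scores <- sweep(alpha[, 1:2], 2, lambda[1:2], "*")
plot(scores, xlab = "PC1", ylab = "PC2")
```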
  32. Usefulness of K-PCA: non linear PCA (figure)
     Source: Petter Strandmark, own work, CC BY 3.0,
     https://commons.wikimedia.org/w/index.php?curid=3936753
  33. Usefulness of K-PCA
     [Mariette et al., 2017] K-PCA for non numeric datasets, here a quantitative time
     series: job trajectories after graduation, from the French survey "Generation 98"
     [Cottrell and Letrémy, 2005] (figure; color is the mode of the trajectories).
  34. From multiple dissimilarities to multiple kernels
     1 several (non Euclidean) dissimilarities D^1, ..., D^M are transformed into
       similarities with [Lee and Verleysen, 2007]:
         K^m(x_i, x_j) = -\frac{1}{2} \left( D^m(x_i, x_j) - \frac{1}{N} \sum_{k=1}^N D^m(x_i, x_k) - \frac{1}{N} \sum_{k=1}^N D^m(x_j, x_k) + \frac{1}{N^2} \sum_{k,k'=1}^N D^m(x_k, x_{k'}) \right)
     2 if the result is not positive, clipping or flipping (removing the negative part of
       the eigenvalue decomposition, or taking its opposite) produces kernels
       [Chen et al., 2009].
     (An R sketch of this transformation follows.)
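A sketch of the dissimilarity-to-kernel step, assuming the transformation above amounts to double centering of the dissimilarity matrix, followed by spectrum clipping or flipping when the result is not positive semi-definite. Function names and the toy dissimilarity matrix are illustrative.

```r
# D: an N x N symmetric dissimilarity matrix with zero diagonal.
dissim_to_kernel <- function(D, method = c("clip", "flip")) {
  method <- match.arg(method)
  N <- nrow(D)
  J <- diag(N) - matrix(1 / N, N, N)     # centering operator
  S <- -0.5 * J %*% D %*% J              # double-centered similarity
  e <- eigen(S, symmetric = TRUE)
  v <- e$values
  # Repair the spectrum if S is not positive semi-definite.
  if (min(v) < 0) {
    v <- if (method == "clip") pmax(v, 0) else abs(v)
  }
  e$vectors %*% diag(v) %*% t(e$vectors)
}

set.seed(3)
A <- matrix(runif(36), 6, 6)
D <- (A + t(A)) / 2; diag(D) <- 0
K <- dissim_to_kernel(D, "clip")
min(eigen(K, symmetric = TRUE)$values) >= -1e-12   # now PSD, up to rounding
```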
  35-38. From multiple kernels to an integrated kernel: how to combine multiple kernels?
     naive approach: K^* = \frac{1}{M} \sum_m K^m;
     supervised framework: K^* = \sum_m \beta_m K^m with \beta_m \geq 0 and \sum_m \beta_m = 1,
     the \beta_m being chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011];
     unsupervised framework, but with input space R^d [Zhuang et al., 2011]:
     K^* = \sum_m \beta_m K^m with \beta_m \geq 0 and \sum_m \beta_m = 1, the \beta_m being
     chosen so as to minimize the distortion between all training data,
         \sum_{i,j} K^*(x_i, x_j) \, \|x_i - x_j\|^2,
     AND to minimize the approximation of the original data by the kernel embedding,
         \sum_i \| x_i - \sum_j K^*(x_i, x_j)\, x_j \|^2.
     Our proposal: 2 UMKL frameworks which do not require the data to take values in R^d.
  39-41. STATIS-like framework [L'Hermier des Plantes, 1976, Lavit et al., 1994]
     Similarities between kernels:
         C_{mm'} = \frac{\langle K^m, K^{m'} \rangle_F}{\|K^m\|_F \, \|K^{m'}\|_F} = \frac{\mathrm{Trace}(K^m K^{m'})}{\sqrt{\mathrm{Trace}((K^m)^2)\,\mathrm{Trace}((K^{m'})^2)}}
     (C_{mm'} is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the
     kernel framework.)
     Maximize
         \sum_{m=1}^M \left\langle K^*(v), \frac{K^m}{\|K^m\|_F} \right\rangle_F = v^\top C v
     for K^*(v) = \sum_{m=1}^M v_m K^m and v \in R^M such that \|v\|_2 = 1.
     Solution: the first eigenvector of C ⇒ set \beta = \frac{v}{\sum_{m=1}^M v_m}
     (consensual kernel). (An R sketch follows.)
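A sketch of the STATIS-like consensus weights: compute the Frobenius cosine matrix C between precomputed kernels, take its first eigenvector and rescale it into weights that sum to one. Km_list is a hypothetical list of (already centered) kernel matrices; this is not the authors' code.

```r
statis_weights <- function(Km_list) {
  M <- length(Km_list)
  # Cosine-type similarity between kernels (Frobenius inner products).
  C <- matrix(0, M, M)
  for (m in 1:M) for (mp in 1:M) {
    C[m, mp] <- sum(Km_list[[m]] * Km_list[[mp]]) /
      (sqrt(sum(Km_list[[m]]^2)) * sqrt(sum(Km_list[[mp]]^2)))
  }
  # First eigenvector of C, rescaled so that the weights sum to one.
  v <- eigen(C, symmetric = TRUE)$vectors[, 1]
  v <- abs(v)    # C is nonnegative, so the leading eigenvector has entries of one sign
  beta <- v / sum(v)
  K_star <- Reduce(`+`, Map(`*`, beta, Km_list))
  list(C = C, beta = beta, K_star = K_star)
}
```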
  42-44. A kernel preserving the original topology of the data I
     From an idea similar to that of [Lin et al., 2010]: find a kernel such that the local
     geometry of the data in the feature space is similar to that of the original data.
     Proxy of the local geometry: K^m → G^m_k, its k-nearest-neighbour graph, → A^m_k, the
     adjacency matrix of G^m_k, and
         W = \sum_m I_{\{A^m_k > 0\}}   or   W = \sum_m A^m_k.
     (Adjacency matrix image: S. Mohammad H. Oloomi, CC BY-SA 3.0,
     https://commons.wikimedia.org/w/index.php?curid=35313532)
     Feature space geometry measured by
         \Delta_i(\beta) = \left\langle \phi^*_\beta(x_i), \big( \phi^*_\beta(x_1), \dots, \phi^*_\beta(x_N) \big) \right\rangle = \big( K^*_\beta(x_i, x_1), \dots, K^*_\beta(x_i, x_N) \big).
     (An R sketch for building W follows.)
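A sketch of the local-geometry proxy: a k-nearest-neighbour adjacency matrix is built for each kernel, using the feature-space distances it induces, and the adjacencies are summed into W. The symmetrisation and the handling of ties are arbitrary choices made for the sketch, not the authors' exact construction.

```r
# Distances in feature space: d^2(x_i, x_j) = K_ii + K_jj - 2 K_ij.
knn_adjacency <- function(K, k = 5) {
  d2 <- outer(diag(K), diag(K), "+") - 2 * K
  N <- nrow(K)
  A <- matrix(0, N, N)
  for (i in 1:N) {
    nn <- order(d2[i, ])[2:(k + 1)]   # skip i itself
    A[i, nn] <- 1
  }
  pmax(A, t(A))                        # symmetrise the graph
}

# W counts, for each pair (i, j), in how many of the M kernels they are neighbours.
W_from_kernels <- function(Km_list, k = 5) {
  Reduce(`+`, lapply(Km_list, knn_adjacency, k = k))
}
```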
  45-46. A kernel preserving the original topology of the data II: sparse version
     minimize \sum_{i,j=1}^N W_{ij} \| \Delta_i(\beta) - \Delta_j(\beta) \|^2
     for K^*_\beta = \sum_{m=1}^M \beta_m K^m and \beta \in R^M such that \beta_m \geq 0 and \sum_{m=1}^M \beta_m = 1,
     ⇔ minimize \sum_{m,m'=1}^M \beta_m \beta_{m'} S_{mm'} over \beta \in R^M such that
     \beta_m \geq 0 and \sum_{m=1}^M \beta_m = 1, with
         S_{mm'} = \sum_{i,j=1}^N W_{ij} \left\langle \Delta^m_i - \Delta^m_j, \Delta^{m'}_i - \Delta^{m'}_j \right\rangle   and   \Delta^m_i = \big( K^m(x_i, x_1), \dots, K^m(x_i, x_N) \big).
  47. A kernel preserving the original topology of the data II: non sparse version
     minimize \sum_{i,j=1}^N W_{ij} \| \Delta_i(v) - \Delta_j(v) \|^2
     for K^*_v = \sum_{m=1}^M v_m K^m and v \in R^M such that v_m \geq 0 and \|v\|_2 = 1,
     ⇔ minimize \sum_{m,m'=1}^M v_m v_{m'} S_{mm'} over v \in R^M such that v_m \geq 0 and
     \|v\|_2 = 1, with S_{mm'} and \Delta^m_i as above.
  48. Optimization issues
     The sparse version writes min_\beta \beta^\top S \beta such that \beta \geq 0 and
     \|\beta\|_1 = \sum_m \beta_m = 1 ⇒ a standard QP problem with linear constraints
     (e.g., package quadprog in R; a sketch follows).
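A sketch of the sparse version as a quadratic program: S is built from W and the Δ^m_i vectors as above and the QP is solved with quadprog::solve.QP (mentioned on the slide). build_S is written for readability, not speed, and the small ridge added to S is an assumption to keep the solver's matrix positive definite.

```r
library(quadprog)

# S_{mm'} = sum_ij W_ij <Delta^m_i - Delta^m_j, Delta^m'_i - Delta^m'_j>,
# where Delta^m_i is the i-th row of K^m.
build_S <- function(Km_list, W) {
  M <- length(Km_list)
  S <- matrix(0, M, M)
  idx <- which(W > 0, arr.ind = TRUE)
  for (m in 1:M) for (mp in m:M) {
    s <- 0
    for (r in seq_len(nrow(idx))) {
      i <- idx[r, 1]; j <- idx[r, 2]
      s <- s + W[i, j] * sum((Km_list[[m]][i, ]  - Km_list[[m]][j, ]) *
                             (Km_list[[mp]][i, ] - Km_list[[mp]][j, ]))
    }
    S[m, mp] <- S[mp, m] <- s
  }
  S
}

# min_beta beta' S beta  s.t.  beta >= 0, sum(beta) = 1.
solve_sparse_umkl <- function(S, eps = 1e-8) {
  M <- ncol(S)
  Dmat <- 2 * S + diag(eps, M)           # ridge so Dmat is positive definite
  dvec <- rep(0, M)
  Amat <- cbind(rep(1, M), diag(M))      # sum(beta) = 1 (equality), then beta_m >= 0
  bvec <- c(1, rep(0, M))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}
```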
  49-51. Optimization issues (continued)
     The non sparse version writes min_\beta \beta^\top S \beta such that \beta \geq 0 and
     \|\beta\|_2 = 1 ⇒ a QPQC problem (hard to solve).
     Solved using the Alternating Direction Method of Multipliers (ADMM, [Boyd et al., 2011])
     by replacing the previous optimization problem with
         min_{x,z} \; x^\top S x + 1_{\{x \geq 0\}}(x) + 1_{\{\|z\|_2^2 \geq 1\}}(z)   subject to   x - z = 0.
     ADMM iterations:
     1 min_x x^\top S x + y^\top (x - z) + \frac{\lambda}{2} \|x - z\|^2 under the constraint
       x \geq 0 (a standard QP problem);
     2 project onto the constraint set \{\|z\|_2 \geq 1\}: z = \frac{x}{\min\{\|x\|_2, 1\}};
     3 update the auxiliary variable: y = y + \lambda (x - z).
     (An ADMM sketch in R follows.)
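The conclusion lists the ADMM solution as still to be implemented, so the following is only an illustrative sketch of the three steps above (the x-update reuses quadprog). The penalty lambda, the fixed iteration count and the final normalisation are arbitrary choices, not the authors' settings.

```r
library(quadprog)

solve_nonsparse_umkl <- function(S, lambda = 1, n_iter = 200, eps = 1e-8) {
  M <- ncol(S)
  x <- rep(1 / sqrt(M), M); z <- x; y <- rep(0, M)
  for (it in 1:n_iter) {
    # 1. x-update: min_x x'Sx + y'(x - z) + lambda/2 ||x - z||^2  s.t.  x >= 0
    Dmat <- 2 * S + diag(lambda + eps, M)
    dvec <- lambda * z - y
    x <- solve.QP(Dmat, dvec, diag(M), rep(0, M))$solution
    # 2. z-update: project x onto {||z||_2 >= 1}
    z <- x / min(sqrt(sum(x^2)), 1)
    # 3. dual update
    y <- y + lambda * (x - z)
  }
  z / sqrt(sum(z^2))   # return a unit-norm, nonnegative weight vector
}
```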
  52-54. A proposal to improve the interpretability of K-PCA in our framework
     Issue: how to assess the importance of a given species in the K-PCA?
     Our datasets are either numeric (environmental) or built from an n × p count matrix
     ⇒ for a given species, randomly permute its counts and redo the analysis (kernel
     computation, with the same optimized weights, and K-PCA).
     The influence of a given species of a given dataset on a given PC subspace is then
     assessed by computing the Crone-Crosby distance between the two PCA subspaces
     [Crone and Crosby, 1995] (∼ the Frobenius norm between the projectors).
     (A sketch follows.)
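A sketch of the permutation-based importance just described: permute one species' counts, rebuild the corresponding kernel with a user-supplied helper (make_kernel is hypothetical), combine the kernels with the same weights beta, and compare the leading K-PCA subspaces through the Frobenius distance between projectors (the Crone-Crosby distance, up to its normalising constant).

```r
# Frobenius distance between the projectors on two subspaces spanned by the
# (orthonormal) columns of U1 and U2.
proj_dist <- function(U1, U2) {
  P1 <- U1 %*% t(U1)
  P2 <- U2 %*% t(U2)
  sqrt(sum((P1 - P2)^2))
}

# counts: n x p count matrix with species names as columns; kernel m is the one
# built from `counts` via the hypothetical helper make_kernel().
species_importance <- function(counts, m, beta, Km_list, make_kernel, ndim = 2) {
  K_star <- Reduce(`+`, Map(`*`, beta, Km_list))
  U_ref  <- eigen(K_star, symmetric = TRUE)$vectors[, 1:ndim]
  sapply(colnames(counts), function(sp) {
    perm <- counts
    perm[, sp] <- sample(perm[, sp])          # permute the counts of one species
    Km_perm      <- Km_list
    Km_perm[[m]] <- make_kernel(perm)         # recompute only kernel m
    K_perm <- Reduce(`+`, Map(`*`, beta, Km_perm))
    U_perm <- eigen(K_perm, symmetric = TRUE)$vectors[, 1:ndim]
    proj_dist(U_ref, U_perm)
  })
}
```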
  55. Outline
     1 Metagenomic datasets and associated questions
     2 A typical (and rich) case study: TARA Oceans datasets
     3 A UMKL framework for integrating multiple metagenomic data
     4 Application to TARA Oceans datasets
  56. Integrating 'omics data using kernels
     M TARA Oceans datasets (x^m_i)_{i=1,...,N, m=1,...,M}, measured on the same ocean
     samples (1, ..., N) and taking values in arbitrary spaces (X^m)_m: environmental
     dataset, bacteria phylogenomic tree, bacteria functional composition, eukaryote
     pico-plankton composition, ..., virus composition.
  57. Integrating 'omics data using kernels
     Environmental dataset: standard Euclidean geometry, with the linear kernel
     K(x_i, x_j) = x_i^\top x_j.
  58. Integrating 'omics data using kernels
     Bacteria phylogenomic tree: the weighted Unifrac distance,
         d_{wUF}(x_i, x_j) = \frac{\sum_e l_e |p_{ei} - p_{ej}|}{\sum_e (p_{ei} + p_{ej})}.
  59. Integrating 'omics data using kernels
     All composition-based datasets (bacteria functional composition, eukaryote (pico,
     nano, micro, meso)-plankton composition and virus composition): the Bray-Curtis
     dissimilarity,
         d_{BC}(x_i, x_j) = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})},
     with n_{ig} the abundance of gene g, summarized at the KEGG orthologous group level,
     in sample i.
  60. Integrating 'omics data using kernels
     Combination of the M kernels by a weighted sum:
         K^* = \sum_{m=1}^M \beta_m K^m,   where \beta_m \geq 0 and \sum_{m=1}^M \beta_m = 1.
  61. Integrating 'omics data using kernels
     Apply standard data mining methods (clustering, linear models, PCA, ...) in the
     feature space.
  62-64. Correlation between kernels (STATIS) (figure)
     Low correlations between the bacteria functional composition and the other datasets.
     Strong correlations between the environmental variables and the small organisms
     (bacteria, eukaryote pico-plankton and viruses).
  65. Influence of k (number of neighbours) on (β_m)_m (figure)
     k ≥ 5 provides stable results.
  66-68. (β_m)_m values returned by graph-MKL (figure)
     The dataset least correlated with the others, the bacteria functional composition,
     has the highest coefficient.
     Three kernels have a weight equal to 0 (sparse version).
  69. Proof of concept: using [Sunagawa et al., 2015]
     Datasets: 139 samples, 3 layers (SRF, DCM and MES);
     kernels: phychem, pro-OTUs and pro-OGs.
  70-74. Proof of concept: using [Sunagawa et al., 2015] (figures)
  75. Proof of concept: using [Sunagawa et al., 2015]
     Proteobacteria (clades SAR11 (Alphaproteobacteria) and SAR86) dominate the sampled
     areas of the ocean in terms of relative abundance and taxonomic richness.
  76. K-PCA on K^* (figure)
  77-78. K-PCA on K^* - environmental dataset (figures)
  79. Conclusion and perspectives
     Summary: an integrative exploratory method, particularly well suited for multiple
     metagenomic datasets, with enhanced interpretability.
     Perspectives: implement the ADMM solution and test it; improve the biological
     interpretation; a soon-to-be-released R package.
  80. Questions?
  81-83. References
     Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404.
     Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122.
     Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J., Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S., Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker, P., Karsenti, E., and Sullivan, M. (2015). Patterns and ecological drivers of ocean viral communities. Science, 348(6237).
     Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. (2009). Similarity-based classification: concepts and algorithms. Journal of Machine Learning Research, 10:747-776.
     Cottrell, M. and Letrémy, P. (2005). How to use the Kohonen algorithm to simultaneously analyse individuals in a survey. Neurocomputing, 63:193-207.
     Crone, L. and Crosby, D. (1995). Statistical applications of a metric on subspaces to satellite meteorology. Technometrics, 37(3):324-328.
     de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I., Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O., Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F., Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S., Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015). Eukaryotic plankton diversity in the sunlit ocean. Science, 348(6237).
     Gönen, M. and Alpaydin, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211-2268.
     Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18(1):97-119.
     Lee, J. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction. Information Science and Statistics. Springer, New York; London.
     L'Hermier des Plantes, H. (1976). Structuration des tableaux à trois indices de la statistique. PhD thesis, Université de Montpellier. Thèse de troisième cycle.
     Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F., Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d'Oviedo, F., de Meester, L., Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P., Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G., Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M., Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015). Determinants of community structure in the global plankton interactome. Science, 348(6237).
     Lin, Y., Liu, T., and Fuh, C. S. (2010). Multiple kernel learning for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147-1160.
     Mariette, J., Olteanu, M., and Villa-Vialaneix, N. (2017). Efficient interpretable variants of online SOM for large dissimilarity data. Neurocomputing, 225:31-48.
     Olteanu, M. and Villa-Vialaneix, N. (2015). On-line relational and multiple relational SOM. Neurocomputing, 147:15-30.
     Robert, P. and Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Applied Statistics, 25(3):257-265.
     Schölkopf, B., Smola, A., and Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319.
     Sommer, M., Church, G., and Dantas, G. (2010). A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion. Molecular Systems Biology, 6(360).
     Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A., Cornejo-Castillo, F., Costea, P., Cruaud, C., d'Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka, F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P., Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015). Structure and function of the global ocean microbiome. Science, 348(6237).
     Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011). Unsupervised multiple kernel clustering. Journal of Machine Learning Research: Workshop and Conference Proceedings, 20:129-144.
