Our aim is to predict Multiple System Atrophy (MSA), a rare neurodegenerative disorder, using multiple omics datasets in cell lines.
We develop a probabilistic data integration method, POPLS-DA, to identify consistent molecular biomarkers across high dimensional and correlated omics layers.
Omics data integration for MSA | International Society for Clinical Biostatistics 2020
1. Statistical integration of methylation, transcriptome and proteome in cell lines
Said el Bouhaddani (1), Hae-Won Uh (1), Jeanine Houwing-Duistermaat (1,2)
(1) Department of Biostatistics and Research Support, Julius Center, University Medical Center Utrecht, Netherlands
(2) Department of Statistics, University of Leeds, UK
2. Background
Multiple System Atrophy (MSA) is a rare neurodegenerative disorder. Almost 80% of patients
are disabled within 5 years of disease onset. The key pathogenic event when developing MSA
is an abnormal accumulation of harmful proteins. Molecular causes and consequences of this
aggregation need to be elucidated, e.g. using multiple omics datasets.
We have access to DNA-methylome, transcriptome, and proteome data, measured in cell
lines that show harmful protein aggregation and in negative controls. Standard sequential
analysis of these data shows no overlap of the significant genes.
Our aim is to develop a data integration method to identify consistent molecular biomarkers
that can classify cells with protein aggregation across all datasets. Apart from the high
dimensionality (p > N), platform-specific heterogeneity between the omics datasets also needs to
be considered.
3. Motivating data & challenges
Methylome
- 850k sites on 4 cases, 4 controls
Transcriptome
- 25k probes on 3 cases, 3 controls
Proteome
- 2k proteins on 9 cases, 9 controls
Preprocessing: normalize data and map all IDs
to gene IDs
Final dataset: 1732 overlapping genes on 16
cases and 16 controls
Challenges
- High dimensional (p>N)
- Highly correlated
- Different platforms
4. Methods
Underlying general model
Several estimation methods have been proposed. The general model is written as
$$x_k = t_k W^\top + t_{s,k} W_{s,k}^\top + e_k$$
$$y_k = t_k B + h_k$$
For each omics dataset $k = 1, 2, 3$, we introduce
- Joint latent variables $t$ underlying the omics data $x$ and the MSA outcome $y$
- Omic-specific latent variables $t_s$ for each omics dataset
5. Methods
Three methods considered
- Sparse PLS-DA: $t_s = 0$; algorithmic, sequential estimation
- Sparse OPLS-DA: algorithmic, sequential estimation
- Probabilistic OPLS-DA: likelihood-based, simultaneous estimation
Estimation methods
Sparse PLS-DA (sPLS-DA) [1]
1. Convert binary $y$ to a numerical 'dummy' $y$
2. Maximize $w^\top X^\top y$ with an L1 penalty on $w$
3. Calculate $\hat{y} = xWB$ and obtain class predictions
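The three steps above can be sketched for a single component as follows. This is a minimal illustration under stated assumptions, not the implementation of [1]: the unit-norm constraint on $w$, the rule that picks the penalty so that `n_keep` features survive, the 0.5 cut-off, and the toy data are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np


def soft_threshold(z, lam):
    """Shrink each entry of z towards zero by lam (the L1 proximal step)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def spls_da_one_component(X, labels, n_keep):
    """Fit one sparse PLS-DA component for a binary outcome (illustrative)."""
    # Step 1: dummy-code the binary outcome and centre X and y.
    y = (labels == 1).astype(float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean

    # Step 2: with ||w|| = 1, the L1-penalised maximiser of w' X' y is the
    # soft-thresholded, normalised vector X' y. The penalty is chosen here
    # so that n_keep features keep a nonzero weight (requires n_keep < p).
    z = Xc.T @ yc
    lam = np.sort(np.abs(z))[-(n_keep + 1)]
    w = soft_threshold(z, lam)
    w /= np.linalg.norm(w)

    # Step 3: regress y on the score t = Xw and threshold the fitted values.
    t = Xc @ w
    b = (t @ yc) / (t @ t)

    def predict(X_new):
        y_hat = y_mean + ((X_new - x_mean) @ w) * b
        return (y_hat > 0.5).astype(int)

    return w, b, predict


# Toy data (hypothetical): 40 samples, 200 features, 5 of them informative.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 200))
X[labels == 1, :5] += 3.0

w, b, predict = spls_da_one_component(X, labels, n_keep=5)
train_accuracy = np.mean(predict(X) == labels)
```

The accuracy computed here is in-sample and therefore optimistic; with p > N it should be checked on held-out data.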
Sparse OPLS-DA (sOPLS-DA) [2]
1. Obtain estimates for $t_s W_s^\top$ using OPLS
2. Subtract these parts from the original data matrix $X$
3. Follow the steps in sparse PLS-DA using the corrected $X$
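Steps 1 and 2 can be illustrated with the classical single-response OPLS filter, which removes variation in $X$ that is orthogonal to $y$. This is a sketch under assumptions: the function name `opls_filter`, the number of orthogonal components, and the toy data are hypothetical, and the exact variant used on the poster may differ.

```python
import numpy as np


def opls_filter(X, y, n_orth=1):
    """Remove n_orth y-orthogonal components from X (single-response OPLS)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    # Predictive weight vector: direction of maximal covariance with y.
    # It is unchanged by the deflation below, so it is computed once.
    w = Xc.T @ yc
    w /= np.linalg.norm(w)

    orth_scores = []
    for _ in range(n_orth):
        t = Xc @ w                      # predictive score
        p = Xc.T @ t / (t @ t)          # predictive loading
        w_o = p - (w @ p) * w           # part of the loading orthogonal to w
        w_o /= np.linalg.norm(w_o)
        t_o = Xc @ w_o                  # y-orthogonal score (estimate of t_s)
        p_o = Xc.T @ t_o / (t_o @ t_o)  # y-orthogonal loading (estimate of W_s)
        Xc = Xc - np.outer(t_o, p_o)    # step 2: subtract the specific part
        orth_scores.append(t_o)

    return Xc, np.column_stack(orth_scores)


# Toy data (hypothetical): one y-related and one unrelated structured component.
rng = np.random.default_rng(1)
n, p = 20, 50
y = np.repeat([0.0, 1.0], n // 2)
X = (np.outer(y - 0.5, rng.normal(size=p))
     + np.outer(rng.normal(size=n), rng.normal(size=p))
     + 0.1 * rng.normal(size=(n, p)))

X_corr, T_o = opls_filter(X, y, n_orth=2)
```

The corrected matrix `X_corr` would then be passed to a sparse PLS-DA fit, as in step 3. Each removed component costs extra parameters per dataset, which matters when the sample size is small.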
Probabilistic OPLS-DA (POPLS-DA)
1. Formulate the observed likelihood $f(x, y)$
2. Formulate the complete likelihood $f(x, y, t) = f(x \mid t) f(y \mid t) f(t)$
   - Each term can be optimized in a computationally efficient way
3. Apply the EM algorithm to $f(x, y, t)$ to obtain maximizers of $f(x, y)$
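The link between the complete and the observed likelihood can be written out in generic EM notation. The symbols $\theta$ (all model parameters) and $m$ (iteration index) are notation introduced here for illustration and do not appear on the poster; the block only restates the standard EM argument for the factorization in step 2.

```latex
\begin{align}
\log f(x, y, t; \theta)
  &= \log f(x \mid t; \theta) + \log f(y \mid t; \theta) + \log f(t; \theta), \\
Q(\theta \mid \theta^{(m)})
  &= \mathbb{E}\left[ \log f(x, y, t; \theta) \,\middle|\, x, y; \theta^{(m)} \right], \\
\theta^{(m+1)}
  &= \arg\max_{\theta} \, Q(\theta \mid \theta^{(m)}), \\
\log f(x, y; \theta^{(m+1)})
  &\ge \log f(x, y; \theta^{(m)}).
\end{align}
```

Because the complete log-likelihood is a sum of three terms, the expectation $Q$ splits into three parts that can be maximized separately, and each EM iteration does not decrease the observed likelihood $f(x, y)$.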
6. Simulation study
Conclusions
- POPLS-DA scores highest on accuracy, even with small sample sizes
- Sparse OPLS-DA is likely to overfit: it estimates omics-specific parts in each dataset while the sample size is low
Setup
- Simulate $X$ and $y$ from the underlying model
- Setup close to the real data:
   - 1000 features
   - 3 data types with 8, 6 and 18 samples, respectively
   - Two joint, two specific components
- Calculate prediction accuracy using large simulated test data:
   - 500 × {8, 6, 18} samples
- Compare sPLS-DA, sOPLS-DA, POPLS-DA
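A data-generating step matching this setup could look like the sketch below. The dimensions come from the setup; everything else is an assumption made for illustration: standard normal scores, loadings and noise, joint loadings $W$ and coefficients $B$ shared across datasets, and a median split of the continuous outcome to obtain balanced classes. The function names are hypothetical.

```python
import numpy as np


def draw_parameters(rng, n_datasets=3, p=1000, r_joint=2, r_spec=2):
    """Draw joint loadings W, coefficients B and specific loadings W_{s,k}."""
    return {
        "W": rng.normal(size=(p, r_joint)),
        "B": rng.normal(size=r_joint),
        "W_s": [rng.normal(size=(p, r_spec)) for _ in range(n_datasets)],
    }


def simulate(params, n_samples, rng, noise_sd=1.0):
    """Simulate (X_k, y_k) per dataset from x = tW' + t_s W_s' + e, y = tB + h."""
    W, B = params["W"], params["B"]
    datasets = []
    for n, W_s in zip(n_samples, params["W_s"]):
        T = rng.normal(size=(n, W.shape[1]))        # joint scores t_k
        T_s = rng.normal(size=(n, W_s.shape[1]))    # omic-specific scores t_{s,k}
        E = noise_sd * rng.normal(size=(n, W.shape[0]))
        X = T @ W.T + T_s @ W_s.T + E
        y_cont = T @ B + noise_sd * rng.normal(size=n)
        # Assumption: dichotomise at the median to get balanced classes.
        y = (y_cont > np.median(y_cont)).astype(int)
        datasets.append((X, y))
    return datasets


rng = np.random.default_rng(2020)
params = draw_parameters(rng)
train = simulate(params, n_samples=(8, 6, 18), rng=rng)
```

A test set with the same parameters would be drawn by calling `simulate(params, n_samples=(500 * 8, 500 * 6, 500 * 18), rng=rng)`, so that accuracy is measured on data the fitted model has not seen.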
7. Data analysis
Results
- Two joint, two specific components
- Sparsity level: 50 genes retained (not for POPLS-DA)
- All methods separate MSA cases from controls
Conclusions
- sOPLS-DA clusters are more homogeneous
- POPLS-DA shows more spread and is less certain about its predictions
- The top ten genes are directly involved in harmful protein aggregation
8. Conclusions
- POPLS-DA discriminates MSA based on multiple omics data and performs best for small sample sizes
- Simulation: the algorithmic methods sPLS-DA and sOPLS-DA are likely to overfit and need larger sample sizes
- MSA cases are separated from controls based on 3 omics datasets; the top genes are biologically important
- POPLS-DA will be added to the OmicsPLS package (cran.r-project.org/package=OmicsPLS)
9. Acknowledgments
Günter Höglinger
Jörg Tost
Matthias Höllerhage
E-Rare EU project: MSAomics
H2020 project: IMFORFUTURE
Contact: s.elbouhaddani@umcutrecht.nl
References
[1] Lê Cao, K., Boitard, S. and Besse, P. (2011), Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics, 12: 253. doi:10.1186/1471-2105-12-253
[2] Bylesjö, M., Rantalainen, M., Cloarec, O., Nicholson, J.K., Holmes, E. and Trygg, J. (2006), OPLS discriminant analysis: combining the strengths of PLS-DA and SIMCA classification. J. Chemometrics, 20: 341-351. doi:10.1002/cem.1006