Introduction: Since the introduction of the LASSO, computational approaches to variable selection have been rigorously developed in the statistical literature. The need for such methods has become increasingly important with the advent of high-throughput technologies in genomics and brain imaging studies, where it is believed that the number of truly important variables is small relative to the total number of variables. While the focus of these methods has been on additive models, there are several applications where interaction models can reflect biological phenomena and improve statistical power. For example, genome-wide association studies (GWAS) have been unable to explain a large proportion of heritability (the variance in phenotype attributable to genetic variants), and it has been suggested that this missing heritability may in part be due to gene-environment interactions. Furthermore, diseases are now thought to be the result of entire biological networks whose states are affected by environmental factors. These systemic changes can induce or eliminate strong correlations between elements in a network without necessarily affecting their mean levels.
Methods: We therefore propose a multivariate penalization procedure for detecting interactions between high dimensional data ($p \gg n$) and an environmental factor, where the effect of this environmental factor on the high dimensional data is widespread and plays a role in predicting the response. Our approach improves on existing procedures for detecting such interactions in several ways: 1) it simultaneously performs model selection and estimation; 2) it automatically enforces the strong heredity property, i.e., an interaction term can only be included in the model if the corresponding main effects are in the model; 3) it reduces the dimensionality of the problem and leverages the high correlations by transforming the input feature space using network connectivity measures; and 4) it leads to interpretable models which are biologically meaningful.
Results: An extensive simulation study shows that our method outperforms LASSO, Elastic Net and Group LASSO in terms of both prediction accuracy and feature selection. We apply our methods to the NIH pediatric brain development study to refine estimates of which regions of the frontal cortex are associated with intelligence scores, and to a sample of mother-child pairs from a prospective birth cohort to identify epigenetic marks observed at birth that help predict childhood obesity. Our method is implemented in an \texttt{R} package: http://sahirbhatnagar.com/eclust/
A model for interpretable high dimensional interactions
1. A Model for Interpretable High Dimensional Interactions
Sahir Rai Bhatnagar
Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood
Poster Number 67
17. formal statement of initial problem
• $n$: number of subjects
• $p$: number of predictor variables
• $X_{n \times p}$: high dimensional data set ($p \gg n$)
• $Y_{n \times 1}$: phenotype
• $E_{n \times 1}$: environmental factor that has a widespread effect on X and can modify the relation between X and Y
Objective
• Which of the elements of X associated with Y have effects that depend on E?
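24. ECLUST - our proposed method: 3 phases
[Figure: workflow diagram starting from the Original Data, with panels for E = 0 and E = 1]
1) Gene Similarity (panels for E = 0 and E = 1)
2) Cluster Representation (n × 1 summaries)
3) Penalized Regression: Y (n × 1) ∼ cluster representations + cluster representations × E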
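The three phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the eclust implementation: the function name `eclust_sketch`, the use of Pearson correlation as the similarity, the threshold-based connected-component clustering, the cluster average as the n × 1 representative, and the simulated data are all choices made here for brevity. The output of phase 3 is only the design matrix that a penalized regression would take as input.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def corr_matrix(X, rows):
    """p x p correlation matrix using only the subjects listed in `rows`."""
    p = len(X[0])
    cols = [[X[i][j] for i in rows] for j in range(p)]
    return [[pearson(cols[j], cols[k]) for k in range(p)] for j in range(p)]


def eclust_sketch(X, E, threshold=0.5):
    """Hypothetical helper: returns cluster labels and a design matrix."""
    n, p = len(X), len(X[0])
    # Phase 1: similarity within each exposure group, then their difference.
    r0 = corr_matrix(X, [i for i in range(n) if E[i] == 0])
    r1 = corr_matrix(X, [i for i in range(n) if E[i] == 1])
    diff = [[abs(r1[j][k] - r0[j][k]) for k in range(p)] for j in range(p)]
    # Phase 2a (assumed clustering rule): connected components of the graph
    # that links j and k whenever their correlation changes by > threshold.
    labels = [-1] * p
    n_clusters = 0
    for start in range(p):
        if labels[start] != -1:
            continue
        labels[start] = n_clusters
        stack = [start]
        while stack:
            j = stack.pop()
            for k in range(p):
                if labels[k] == -1 and diff[j][k] > threshold:
                    labels[k] = n_clusters
                    stack.append(k)
        n_clusters += 1
    # Phase 2b: one n x 1 representative per cluster (average of its members).
    reps = []
    for c in range(n_clusters):
        members = [j for j in range(p) if labels[j] == c]
        reps.append([sum(X[i][j] for j in members) / len(members) for i in range(n)])
    # Phase 3 input: columns are [representatives, E, representatives * E].
    design = [
        [rep[i] for rep in reps] + [E[i]] + [rep[i] * E[i] for rep in reps]
        for i in range(n)
    ]
    return labels, design


# Simulated data (an assumption): variables 0-2 share a latent factor only when E = 1.
random.seed(0)
n, p = 400, 6
E = [i % 2 for i in range(n)]
X = []
for i in range(n):
    latent = random.gauss(0, 1)
    row = [random.gauss(0, 1) for _ in range(p)]
    if E[i] == 1:
        for j in range(3):
            row[j] = latent + 0.3 * row[j]
    X.append(row)

labels, design = eclust_sketch(X, E)
print(labels)
```

Variables whose correlations change with the exposure end up in the same cluster, so the regression in phase 3 works with a handful of cluster summaries and their interactions with E rather than all p original variables.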
25. "the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data."
- Sir R. A. Fisher, 1922
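28. Model
$$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$$
Reparametrization [1]: $\alpha_{jE} = \gamma_{jE} \beta_j \beta_E$.
Strong heredity principle [2]: $\hat{\alpha}_{jE} \neq 0 \Rightarrow \hat{\beta}_j \neq 0$ and $\hat{\beta}_E \neq 0$
[1] Choi et al. 2010, JASA
[2] Chipman 1996, Canadian Journal of Statistics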
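A small numeric check shows why the reparametrization enforces strong heredity: because each interaction coefficient is the product of gamma_jE, beta_j and beta_E, it is forced to zero whenever either main effect is zero. The function names and coefficient values below are illustrative assumptions, and the identity link is assumed for g.

```python
def interaction_coefs(beta, beta_E, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for each predictor j."""
    return [g * b * beta_E for g, b in zip(gamma, beta)]


def linear_predictor(x, e, beta0, beta, beta_E, gamma):
    """g(mu) for one subject: intercept + main effects + interactions."""
    alpha = interaction_coefs(beta, beta_E, gamma)
    main = sum(b * xj for b, xj in zip(beta, x)) + beta_E * e
    inter = sum(a * xj * e for a, xj in zip(alpha, x))
    return beta0 + main + inter


beta = [1.5, 0.0, 2.0]   # beta_2 = 0: the second predictor has no main effect
gamma = [0.5, 3.0, 0.0]  # gamma_2E is nonzero, but cannot produce an interaction
beta_E = 2.0

alpha = interaction_coefs(beta, beta_E, gamma)
eta = linear_predictor([1.0, 1.0, 1.0], 1, 0.0, beta, beta_E, gamma)
print(alpha, eta)
```

Only the first predictor keeps an interaction: the second is dropped because its main effect is zero, and the third because its gamma is zero, which is allowed since heredity constrains interactions, not main effects.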
32. Open source software
• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user-specified interaction terms
• Automatically determines the optimal tuning parameters through cross-validation
• Can also be applied to genetic data
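The idea of choosing a tuning parameter by cross-validation can be illustrated with a generic sketch. This is not the eclust API: it uses a one-predictor ridge estimator with a single penalty `lam`, a fixed grid, and simulated data, all of which are assumptions made here to keep the example short.

```python
import random


def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for a no-intercept, one-predictor model."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(xs, ys, lam, k=5):
    """Mean squared prediction error for `lam`, pooled over k held-out folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        b = ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        total += sum((ys[i] - b * xs[i]) ** 2 for i in test)
    return total / n


# Simulated data (an assumption): true slope 2 with unit-variance noise.
random.seed(1)
xs = [random.gauss(0, 1) for _ in range(50)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_error(xs, ys, lam) for lam in grid}
best_lam = min(errors, key=errors.get)
print(best_lam, errors[best_lam])
```

With more than one tuning parameter the same procedure applies over a grid of parameter combinations, at a correspondingly higher computational cost.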
39. Conclusions and Contributions
• Large system-wide changes are observed in many environments
• These changes can potentially be exploited to aid the analysis of large data sets
• We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data ($p \gg n$) and an environmental factor
• Dimension reduction is achieved by leveraging the environmental-class-conditional correlations
• We also develop and implement a strong heredity framework within the penalized model
• R software: http://sahirbhatnagar.com/eclust/
43. Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning parameters
• Need more samples . . . Got data? (Poster 67)
44. acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, André Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest
• Greg Voisin, Dr. Forgetta, Dr. Klein
• Mothers and children from the study