Single Nucleotide Polymorphism Analysis
Predictive Analytics and Data Science Conference May 27-28
Asst. Prof. Vitara Pungpapong, Ph.D.
Department of Statistics
Faculty of Commerce and Accountancy
Chulalongkorn University
2. Outline
• What is a SNP array?
• Typical SNP analysis
• Challenges
• The ICM/M Method
• Results
Vitara Pungpapong 2
3. Microarray
• Usually known as ChIP-chip.
• First publication in 1999.
• Each known gene occupies one spot on the chip.
• Laser-induced fluorescence (LIF) is used to obtain the color and intensity of each gene.
• Varying colors show varying levels of gene
activity.
• A microarray chip can contain 10,000 –
20,000 genes.
4. Single Nucleotide Polymorphism
• Usually called ChIP-seq or SNP.
(https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)
5. Microarray vs SNPs
• Microarray is more suitable for small genomes
• More bias in microarray
• SNPs generally produce profiles with a better signal-to-noise ratio and allow detection of more peaks and narrower peaks.
• SNPs generate more high-throughput data (> 1 Tb), which requires more effort in analysis.
6. 1000 Genomes Project
• http://www.1000genomes.org/
• The 1000 Genomes Project provides the largest public catalog of human genetic variation.
• The project ran from 2008 to 2015.
• The human genome consists of approximately 3 billion DNA
base pairs and is estimated to carry around 20,000 protein
coding genes.
• The samples for the 1000 Genomes Project are anonymous
and have no associated medical or phenotype data.
• The project holds self-reported ethnicity and gender.
• All participants declared themselves to be healthy at the time
the samples were collected.
12. Preprocessing Data in GWAS
• SNP Call Rate (98-99%)
• Sample Call Rate (98-99%)
• Data Imputation
• Minor Allele Frequency (remove rare SNPs, e.g., those with frequency < 5%)
• Hardy-Weinberg Equilibrium
• Recode SNPs to the count of the minor allele (0, 1, 2)
• For more information, refer to Turner et al. (2011).
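The preprocessing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the procedure of Turner et al. (2011): the function name `qc_filter`, the default thresholds, the imputation rule (most common genotype), and the chi-square cutoff of 3.84 (1 degree of freedom, 5% level) for the Hardy-Weinberg check are all choices made for the example. Real GWAS pipelines typically use a much stricter Hardy-Weinberg threshold.

```python
def qc_filter(genotypes, snp_call_min=0.98, sample_call_min=0.98,
              maf_min=0.05, hwe_chi2_max=3.84):
    """Filter and recode a samples-by-SNPs genotype table (illustrative only).

    genotypes[i][j] is the count (0, 1, 2) of an arbitrary reference allele
    for sample i at SNP j, or None when the call is missing.
    Returns (kept_sample_indices, kept_snp_indices, recoded_rows).
    """
    n, p = len(genotypes), len(genotypes[0])

    # Step 1: SNP call rate = fraction of samples with a non-missing call.
    snps = [j for j in range(p)
            if sum(row[j] is not None for row in genotypes) / n >= snp_call_min]
    if not snps:
        return [], [], []

    # Step 2: sample call rate, computed over the SNPs that passed step 1.
    samples = [i for i in range(n)
               if sum(genotypes[i][j] is not None for j in snps) / len(snps)
               >= sample_call_min]

    kept_snps, columns = [], []
    for j in snps:
        calls = [genotypes[i][j] for i in samples if genotypes[i][j] is not None]
        if not calls:
            continue
        m = len(calls)
        freq = sum(calls) / (2 * m)          # reference-allele frequency
        maf = min(freq, 1 - freq)            # minor-allele frequency
        if maf < maf_min or maf == 0.0:
            continue

        # Step 3: Hardy-Weinberg check via a chi-square goodness-of-fit test.
        expected = [m * (1 - freq) ** 2, 2 * m * freq * (1 - freq), m * freq ** 2]
        observed = [calls.count(g) for g in (0, 1, 2)]
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        if chi2 > hwe_chi2_max:
            continue

        # Step 4: impute missing calls with the most common genotype (assumption),
        # then recode so that the value is the count of the MINOR allele.
        mode = max((0, 1, 2), key=calls.count)
        flip = freq > 0.5                    # reference allele is the major one
        column = []
        for i in samples:
            g = genotypes[i][j]
            g = mode if g is None else g
            column.append(2 - g if flip else g)
        kept_snps.append(j)
        columns.append(column)

    recoded = [[column[r] for column in columns] for r in range(len(samples))]
    return samples, kept_snps, recoded
```

The order of the filters (SNPs first, then samples) is itself a design choice; some pipelines iterate the two call-rate filters or apply them in the opposite order.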
14. Challenges in GWAS
• Want to incorporate biological pathway information into GWAS
• Want to analyze all SNPs at once
15. High-dimensional Regression
• Regression with n < p
• Challenges in high-dimensional regression
– Large p small n problem
– Multicollinearity
– Sparsity
16. Bayesian Model Setup
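Consider a normal regression model:
$$\mathbf{Y} = \mathbf{X}\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n)$$
Prior to capture sparsity in the regression coefficients:
$$\beta_j \mid \tau_j \sim (1 - \tau_j)\,\delta_0(\beta_j) + \tau_j\,\gamma_\alpha(\beta_j \mid \sigma),$$
where $\delta_0(\cdot)$ is a Dirac delta function at zero, $\tau_j = 1_{\beta_j \neq 0}$, and
$$\gamma_\alpha(\beta_j \mid \sigma) = \frac{\alpha\sqrt{n-1}}{2\sigma} \exp\left(-\frac{\alpha\sqrt{n-1}}{\sigma}\,|\beta_j|\right).$$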
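To make the spike-and-slab prior concrete, here is a minimal sketch that evaluates the slab density and draws a coefficient given its indicator. It assumes the slab is a Laplace density with rate alpha * sqrt(n - 1) / sigma, which is one reading of the formula on this slide; the function names `laplace_slab_density` and `sample_beta` are hypothetical and not taken from any package.

```python
import math
import random


def laplace_slab_density(beta, alpha, sigma, n):
    """Slab density gamma_alpha(beta | sigma), assumed Laplace with
    rate alpha * sqrt(n - 1) / sigma."""
    rate = alpha * math.sqrt(n - 1) / sigma
    return 0.5 * rate * math.exp(-rate * abs(beta))


def sample_beta(tau, alpha, sigma, n, rng=random):
    """Draw beta_j given tau_j: exactly zero (spike) when tau_j == 0,
    a Laplace draw (slab) when tau_j == 1."""
    if tau == 0:
        return 0.0
    rate = alpha * math.sqrt(n - 1) / sigma
    # A Laplace variate is an exponential variate with a random sign.
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return sign * rng.expovariate(rate)
```

Because the spike places all of its mass at zero, a coefficient with tau_j = 0 drops out of the model entirely, which is how the prior encodes sparsity.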
18. Bayesian Model Setup
• The Ising model is employed to model the relationships among SNPs.
• The Ising model assumes that the relationship lies in an
undirected graph G = (V, E) where V is a set of vertices and E is
a set of edges.
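• The Ising prior for $\tau = (\tau_1, \dots, \tau_p)^t$, where $\tau_j = 1_{\beta_j \neq 0}$:
$$P(\tau) = \frac{1}{Z(a, b)} \exp\left( a \sum_{j} \tau_j + b \sum_{\langle j,k \rangle \in E} \tau_j \tau_k \right)$$
[Figure: example undirected graph on nodes $\tau_1, \dots, \tau_5$]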
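The Ising prior can be explored numerically on a tiny graph. The sketch below computes the unnormalized log-probability, the log normalizing constant Z(a, b) by brute force, and the conditional inclusion probability of one node given the rest, which follows from the prior because each tau_j is 0 or 1. The function names and the example edge set used to try it out are hypothetical; the edges of the graph in the slide figure are not recoverable here.

```python
import itertools
import math


def ising_log_unnormalized(tau, edges, a, b):
    """a * sum_j tau_j + b * sum over edges <j,k> of tau_j * tau_k."""
    return a * sum(tau) + b * sum(tau[j] * tau[k] for j, k in edges)


def ising_log_partition(p, edges, a, b):
    """Brute-force log Z(a, b); only feasible for very small p (2**p terms)."""
    return math.log(sum(math.exp(ising_log_unnormalized(t, edges, a, b))
                        for t in itertools.product((0, 1), repeat=p)))


def conditional_inclusion_prob(j, tau, edges, a, b):
    """P(tau_j = 1 | all other tau_k) = logistic(a + b * sum of neighbours' tau).

    Only the neighbours of node j in the graph enter the sum.
    """
    neighbour_sum = (sum(tau[k] for u, k in edges if u == j)
                     + sum(tau[u] for u, k in edges if k == j))
    return 1.0 / (1.0 + math.exp(-(a + b * neighbour_sum)))
```

With b > 0, a SNP whose graph neighbours are already in the model gets a higher prior inclusion probability, which is how pathway structure enters the prior. For example, on the chain `[(0, 1), (1, 2), (2, 3), (3, 4)]` with a = -1 and b = 0.5, node 1 with both neighbours included has conditional probability logistic(-1 + 2 * 0.5) = 0.5.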
19. The ICM/M Algorithm
• Pungpapong et al. (2015).
• Idea: conditional distributions are used to obtain the parameter estimates
• The ICM/M consists of two main parts:
– Conditional median for each regression coefficient
– Conditional mode for hyperparameters and auxiliary parameters
24. Extension of the ICM/M to GLMs
• Borrows the idea of iteratively reweighted least squares (IRLS).
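For readers unfamiliar with IRLS, the following is a minimal sketch of the standard algorithm for logistic regression with an intercept and a single predictor (for instance one SNP coded 0, 1, 2). It illustrates only the generic IRLS idea that the extension borrows; it is not the ICM/M extension itself, and the function name `irls_logistic` is hypothetical.

```python
import math


def irls_logistic(x, y, n_iter=25, tol=1e-10):
    """Fit logit P(y = 1) = b0 + b1 * x by iteratively reweighted least squares.

    Each iteration solves a weighted least-squares problem for the working
    response z = eta + (y - mu) / w with weights w = mu * (1 - mu).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate the entries of the 2x2 weighted normal equations.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)          # assumes mu is not numerically 0 or 1
            z = eta + (yi - mu) / w      # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        # Solve [[s_w, s_wx], [s_wx, s_wxx]] @ [b0, b1] = [s_wz, s_wxz].
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1
```

The appeal of IRLS is that each step is an ordinary weighted linear regression, so a method built for the normal linear model can be reused on the working response and weights.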
25. Simulation Studies
• A total of 1,782 SNPs were randomly selected from the Framingham dataset (Cupples et al., 2007).
• 24 human regulatory pathways, involving 1,502 genes, were retrieved from the KEGG database.
• 311 SNPs involved in 5 pathways were assumed to have nonzero effects, with effect sizes randomly generated from Uniform[0.5, 3].
• Phenotypes were simulated from the normal regression model with error variance 5.
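The simulation design can be mimicked with synthetic data. In the sketch below only the counts (1,782 SNPs, 311 causal SNPs), the Uniform[0.5, 3] effect sizes, and the error variance of 5 come from the slide. Everything else is a placeholder: the sample size `n`, the randomly generated genotypes (the Framingham genotypes are not reproduced here), and the random choice of causal SNPs in place of the 5 KEGG pathways.

```python
import math
import random


def simulate_phenotype(genotypes, causal_idx, rng,
                       effect_range=(0.5, 3.0), error_var=5.0):
    """Simulate y = X @ beta + eps with eps ~ N(0, error_var)."""
    p = len(genotypes[0])
    beta = [0.0] * p
    for j in causal_idx:
        beta[j] = rng.uniform(*effect_range)   # nonzero effect sizes
    sd = math.sqrt(error_var)
    y = [sum(beta[j] * row[j] for j in causal_idx) + rng.gauss(0.0, sd)
         for row in genotypes]
    return y, beta


rng = random.Random(2015)
n, p, n_causal = 200, 1782, 311                # n is a placeholder sample size

# Placeholder genotypes: minor-allele counts drawn as Binomial(2, maf).
mafs = [rng.uniform(0.05, 0.5) for _ in range(p)]
genotypes = [[(rng.random() < m) + (rng.random() < m) for m in mafs]
             for _ in range(n)]

# Placeholder for "SNPs in 5 pathways": a random subset of 311 SNPs.
causal_idx = sorted(rng.sample(range(p), n_causal))
y, beta = simulate_phenotype(genotypes, causal_idx, rng)
```

Drawing the causal SNPs at random ignores the pathway structure that motivates the Ising prior, so this sketch shows only the data-generating model, not the graph-aware part of the study design.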
27. Framingham Data Analysis
• Dataset: Framingham Heart Study (Cupples et al., 2007)
• Phenotype: log transformation of vitamin D level
• Sample size: 952 for training set and 519 for test set
• The gene-pathway information relevant to vitamin D
level is obtained from the KEGG database
• There are 84,834 SNPs residing in 2,167 genetic regions in 112 pathways.
• Univariate tests were applied for screening, leaving 7,824 SNPs for the analysis.
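A univariate screening step of this kind can be sketched as follows. The slide does not say which test or cutoff was used, so this example assumes a t-test on the marginal regression slope of the phenotype on each SNP and a user-chosen threshold on |t|; the function names `marginal_t_statistics` and `screen` are hypothetical.

```python
import math


def marginal_t_statistics(X, y):
    """t-statistic of the slope in a simple regression of y on each column of X.

    Uses t = r * sqrt((n - 2) / (1 - r^2)), where r is the sample correlation.
    Columns with no variation get t = 0.
    """
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    syy = sum((v - y_mean) ** 2 for v in y)
    stats = []
    for j in range(p):
        col = [row[j] for row in X]
        x_mean = sum(col) / n
        sxx = sum((v - x_mean) ** 2 for v in col)
        if sxx == 0.0 or syy == 0.0:
            stats.append(0.0)
            continue
        sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(col, y))
        r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))
        if abs(r) == 1.0:
            stats.append(math.copysign(math.inf, r))   # perfect linear fit
        else:
            stats.append(r * math.sqrt((n - 2) / (1.0 - r * r)))
    return stats


def screen(X, y, t_threshold):
    """Indices of SNPs whose marginal |t| reaches the threshold."""
    return [j for j, t in enumerate(marginal_t_statistics(X, y))
            if abs(t) >= t_threshold]
```

Screening each SNP on its own is cheap but can miss SNPs that matter only jointly with others, which is one reason the subsequent joint analysis is still needed.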
28. Framingham Data Analysis
• Prediction errors and no. of identified SNPs

Method            Prediction Error    No. of Identified SNPs
Lasso             0.2560              14
Adaptive Lasso    0.2085              5
ICM/M             0.2121              5
30. Parkinson’s Disease Data Analysis
• Data come from 3 different studies on PD
– Autopsy-Confirmed Parkinson Disease GWAS Consortium
(APDGC) (dbGaP Study Accession: phs000394.v1.p1)
– Genome-Wide Association Study of Parkinson Disease:
Genes and Environment (dbGaP Study Accession:
phs000196.v2.p1)
– NINDS-Genome-Wide Genotyping in Parkinson's Disease: First Stage Analysis and Public Release of Data (n=1741) (dbGaP Study Accession: phs000089.v3.p2)
• Combined the three data sets and kept the overlapping SNPs (n = 6,704, p = 888,398)
32. Parkinson’s Disease Data Analysis
• ICM/M found 46 SNPs having nonzero
regression coefficients across 22
chromosomes.
• 8 genes known to be associated with PD were identified (e.g., TLR4, TNF, …).
33. References
• Cupples, L. A. et al. (2007). The Framingham Heart Study 100K SNP genome-wide association study resource: Overview of 17 phenotype working group reports. BMC Medical Genetics, 8(Suppl 1):S1.
• Ho et al. (2011). ChIP-chip versus ChIP-seq: Lessons for experimental design and data analysis. BMC Genomics, 12:134.
• Meinshausen et al. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association, 104:1671–1681.
• Pungpapong et al. (2015). Selecting massive variables using an iterated conditional modes/medians algorithm. Electronic Journal of Statistics, 9:1243–1266.
• Turner, S. et al. (2011). Quality control procedures for genome-wide association studies. Curr Protoc Hum Genet, 68:1–19.1.18.