Data analysis

Interrogating Methylation
Possible impacts on DNA methylation:
• Goal: capturing biologically meaningful variation
Examples by scale:
• Site-specific: Joubert et al. 2012, EHP
• Regional: Jaffe et al. 2012, IJE
• Global: Christensen et al. 2012, PLoS Genet

Interrogating Methylation
• When might we expect:
• Global methylation changes?
• Regional methylation changes?
• Site-specific methylation changes?
Any good examples that people can think of?

Confidence in Methylation Association
Michels et al. (2013) Nature Methods
• Statistical significance
  Increases confidence: reaches genome-wide significance
  Decreases confidence: does not meet a predefined significance threshold that takes multiple testing into account
• Effect size
  Increases confidence: large (>10% difference)
  Decreases confidence: small (<5% difference)
• Bias and confounding
  Increases confidence: bias reduced by design or controlled for in the analyses
  Decreases confidence: bias or uncontrolled confounding may exist and explain the differences observed
• Genomic location
  Increases confidence: differential methylation is in a region that may impact regulation of transcription
  Decreases confidence: current knowledge cannot explain the influence of the observed difference in methylation at that locus on transcription
• Functional relevance
  Increases confidence: affects expression
  Decreases confidence: does not affect expression
• Biological relevance
  Increases confidence: gene codes for known biological function
  Decreases confidence: biological relevance of DMR location unknown
• Validation
  Increases confidence: replicated in an independent human cohort or animal model using a different technique
  Decreases confidence: no validation of results attempted, or results are not replicated in a validation study

Data Considerations: Methylation Distribution
[Figure: example histograms with normal density curves]
● Distributions may be normal, skewed, or multimodal
○ important at analysis stage

Data Issues: Heteroscedasticity
[Figures: Illumina 27k chip (Du et al. BMC Bioinformatics, 2010) and Illumina 450k chip]
● Variance across the spectrum of methylation is not equal
● Extreme values of methylation show reduced variance compared to intermediate values

Beta-value Transformation
● Many researchers opt for the M-value
$M = \log_2\left(\frac{\beta}{1 - \beta}\right)$
Du et al. BMC Bioinformatics (2010)
● Simple to convert in R
● Some R analysis packages have this option built in, e.g. CpGassoc
● Arcsine (variance-stabilising transformation)
$Y = \arcsin\left(\sqrt{\beta}\right)$
Lin et al. Nucleic Acids Research (2008)
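As a rough illustration of the two transformations, the sketch below converts a few made-up beta-values to M-values and back in base R. The helper names `beta_to_m` and `m_to_beta` and the `eps` guard are assumptions for this example, not functions from any package mentioned in the slides.

```r
# Minimal sketch: beta-value <-> M-value conversion (hypothetical helpers).
# `eps` is an assumed guard that keeps beta away from exactly 0 or 1,
# where the log ratio would be infinite.
beta_to_m <- function(beta, eps = 1e-6) {
  beta <- pmin(pmax(beta, eps), 1 - eps)
  log2(beta / (1 - beta))
}
m_to_beta <- function(m) 2^m / (2^m + 1)

beta <- c(0.05, 0.20, 0.50, 0.80, 0.95)   # made-up beta-values
m <- beta_to_m(beta)
print(round(m, 2))

# Arcsine square-root transformation of the same beta-values
arcsine <- asin(sqrt(beta))
print(round(arcsine, 3))
```

Equal steps in beta near 0 or 1 map to larger steps in M than the same steps near 0.5, which is how the M-value spreads out the compressed extremes.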

What to Use?
• Beta-value has a more intuitive
biological interpretation
• M-value is more statistically valid
for the differential analysis of
methylation levels
• One possibility is to use M-value
method for conducting
differential methylation analysis
and including the Beta-value
statistics when reporting the
results to investigators
• Drawback of the M-value is that it may
capture statistically significant but very
small absolute changes in
methylation

Global Variation in
Methylation
Exploration of the Major Determinants of Methylation

Clustering Analysis
• Purpose of clustering is to organize objects into groups such
that the objects in a group are more similar to each other than
objects in different groups
• Unsupervised clustering of DNA methylation data is often used
to identify:
• Methylation subgroups
• Groups of samples with a similar methylation profile across a collection
of CpGs
• Some options often used for methylation data
• Non-parametric clustering
• K-means (requires pre-specification of the number of classes)
• Principal Component Analysis (PCA)
• Semi-parametric
• Recursively partitioned mixture model (RPMM) (Houseman et al. 2008)

Clustering Analysis
• Clustering can use various linkage and distance
methods
• Distance quantifies dissimilarity between sample data
• The linkage method is used when deciding the distance
for observations that have already been merged
together
• i.e. choosing what point in a cluster to measure the inter-
cluster distance from

Distance Metrics
Distance quantifies dissimilarity
between sample data
• Euclidean: square root of the sum
of squared attribute
differences
• Shortest (straight-line) distance between two points
• Manhattan: sum of the
absolute differences of their
corresponding components
• The distance that would be
traveled to get from one data
point to the other if a grid-like
path is followed
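A small worked example may help here. The sketch below uses base R's `dist()` on two made-up samples measured at three CpGs; the values are chosen only so the arithmetic is easy to follow.

```r
# Two hypothetical samples, three CpGs each (made-up beta-values)
x <- rbind(sampleA = c(0.10, 0.80, 0.50),
           sampleB = c(0.40, 0.40, 0.50))

# Euclidean: sqrt(0.3^2 + 0.4^2 + 0^2)
d_euclidean <- dist(x, method = "euclidean")
# Manhattan: |0.3| + |0.4| + |0|
d_manhattan <- dist(x, method = "manhattan")

print(d_euclidean)
print(d_manhattan)
```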

Linkage
Types of linkage:
• Complete - defines the
cluster distance between
two clusters to be the
maximum distance between
their individual components
• Average – the mean
similarity of one cluster to
another
• Median – the median
similarity of one cluster to
another; results tend to be
similar to those from
“average” linkage
[Figure: dendrograms under average, complete, and single linkage; a dendrogram displays a hierarchical relationship]
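To see how the linkage choice enters a hierarchical clustering, the sketch below runs base R's `hclust()` with three linkage methods on simulated data. The two simulated groups and their beta-distribution parameters are assumptions for illustration only.

```r
set.seed(1)
# Simulated data: 6 samples x 50 CpGs, two clearly different groups
beta <- rbind(matrix(rbeta(3 * 50, 2, 8), nrow = 3),   # mostly unmethylated
              matrix(rbeta(3 * 50, 8, 2), nrow = 3))   # mostly methylated
rownames(beta) <- paste0("s", 1:6)

d <- dist(beta, method = "euclidean")

# Same distance matrix, three different linkage rules
methods <- c(single = "single", complete = "complete", average = "average")
fits <- lapply(methods, function(m) hclust(d, method = m))

# Cut each tree into two clusters and compare the assignments
groups <- sapply(fits, cutree, k = 2)
print(groups)
```

Calling `plot(fits$average)` would draw the corresponding dendrogram. With groups this well separated the linkage rules agree; they tend to differ more when clusters overlap.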

Evaluating Classifier Performance
• compare the labeled outcome of the supervised
classification algorithm with the known labeled targets
• e.g. Area under the curve, sensitivity and specificity
• How well have we labelled the input data according to the
target labels?
• Measure the relation between elements of each class
and not to the given labels
• Adjusted Rand Index (ARI) evaluates how well an algorithm
separates the elements belonging to different classes
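• Adjusted Rand indices near 1 indicate high agreement
• Adjusted Rand indices near 0 indicate agreement no better than chance; negative values indicate less agreement than expected by chance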
• Can have any number of groups (unlike sensitivity and
specificity)
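The Adjusted Rand Index can be computed directly from the contingency table of two partitions. The sketch below is a base R implementation written for this example; `adjusted_rand_index` is a hypothetical helper name and the label vectors are made up.

```r
# Adjusted Rand Index from the contingency table of two label vectors
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)
  comb2 <- function(n) n * (n - 1) / 2          # number of pairs, choose(n, 2)
  sum_cells <- sum(comb2(tab))                  # pairs grouped together in both
  sum_rows <- sum(comb2(rowSums(tab)))
  sum_cols <- sum(comb2(colSums(tab)))
  expected <- sum_rows * sum_cols / comb2(length(a))   # chance agreement
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

truth     <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
relabeled <- c(3, 3, 3, 1, 1, 1, 2, 2, 2)   # same grouping, different label names
scattered <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)   # every true class split apart

ari_relabeled <- adjusted_rand_index(truth, relabeled)
ari_scattered <- adjusted_rand_index(truth, scattered)
print(c(relabeled = ari_relabeled, scattered = ari_scattered))
```

The index depends only on which elements are grouped together, so renaming the classes leaves it unchanged, and it works for any number of groups.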

Principal Component Analysis
• Principal components are
found by calculating the
eigenvectors and
eigenvalues of the data
covariance matrix
• eigenvector with the
largest eigenvalue is the
direction of greatest
variation, the one with the
second largest eigenvalue
is the (orthogonal)
direction with the next
highest variation
[Figure: the directions U and V are the principal components]
• Orthogonal to each other
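The eigen-decomposition view of PCA can be checked numerically. The sketch below, on simulated data, compares `eigen()` applied to the covariance matrix with base R's `prcomp()`; the data dimensions and the induced correlation are assumptions for illustration.

```r
set.seed(42)
x <- matrix(rnorm(100 * 5), nrow = 100)   # 100 samples x 5 features (simulated)
x[, 2] <- x[, 1] + 0.3 * x[, 2]           # make two features correlated

# Eigen-decomposition of the covariance matrix
eig <- eigen(cov(x))

# The same analysis through prcomp (centered, unscaled)
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# Eigenvalues are the variances along each principal component
print(cbind(eigenvalue = eig$values, pc_variance = pca$sdev^2))

# Proportion of total variance captured by each component
var_explained <- eig$values / sum(eig$values)
print(round(var_explained, 3))

# The first two directions are orthogonal (dot product is numerically zero)
dot_12 <- sum(eig$vectors[, 1] * eig$vectors[, 2])
```

The eigenvectors agree with `pca$rotation` up to sign, since the sign of a principal direction is arbitrary.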

Principal Component Analysis
Genes mirror geography within Europe. Novembre et al. Nature (2008)
[Figure: first two principal components of genetic variation in a sample of 3,000 European individuals genotyped at over half a million variable DNA sites]

Recursively Partitioned Mixture
Model (RPMM)
• General procedure: divide samples based on
methylation profile using a mixture of beta distributions
to recursively split samples via 2-class models with
Bayesian information criterion (BIC) used at each
potential split to decide whether the split was to be
maintained or abandoned
• Result: K classes, representing K terminal nodes, and posterior
probabilities of class membership for the samples
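library(RPMM)
# Example data shipped with RPMM: IllumBeta (beta-values) and tissue (labels)
data(IlluminaMethylation)
# Fit the beta-mixture RPMM
rpmm <- blcTree(IllumBeta)
# Posterior class-membership probabilities for the terminal classes
ProbClassMembership = blcTreeLeafMatrix(rpmm)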

Recursively Partitioned Mixture Model (RPMM)
• Note: No closed-form MLE for the parameters of the beta distribution -
computationally intensive
• M-value transformation followed by fitting a Gaussian RPMM tends to produce similar results
# Base-2 logit (M-value) transform; requires 0 < x < 1
expit2 <- function(x) log2(x)-log2(1-x)
# Gaussian RPMM on the transformed values
RPMMSolution = glcTree(expit2(IllumBeta))
par(mfrow=c(2,1))
plotTree.blcTree(rpmm,
labelFunction=function(u,digits) table(as.character(tissue[u$index])))
title("Dendrogram using Beta Distribution")
# The Gaussian fit is a glcTree object, so use the matching plot function
plotTree.glcTree(RPMMSolution,
labelFunction=function(u,digits) table(as.character(tissue[u$index])))
title("Dendrogram using Gaussian Distribution")

Recursively Partitioned Mixture Model (RPMM)
Clustering samples: identify similar global methylation profiles
[Figure: heatmap, yellow for unmethylated and blue for methylated. Tissue-specific DNA methylation dependent upon CpG island context. Christensen et al. PLoS Genetics (2009)]
• Methylation profile classes significantly differentiate all normal tissue types (n = 217, P < 0.0001)

Recursively Partitioned Mixture Model (RPMM)
Clustering CpGs: examine classes of CpGs with similar methylation profiles
[Figure: mean regression coefficients for age-associated methylation (by decade), with 95% confidence intervals from GEE, for each CpG RPMM class. Tissue-specific DNA methylation dependent upon CpG island context. Christensen et al. PLoS Genetics (2009)]
• CpGs clustered with RPMM into eight classes for each group of samples
• The bottom plot indicates the CpG island status for each locus

Summary
• Benefits to analysis of global changes
• Identify major determinant of methylation profile
• Suggests something about the nature of epigenetic
regulation associated with the exposure or phenotype of
interest
• Drawbacks
• Major determinants may not be of biological interest
• Sources of major variation tend to be batch effects and
tissue-specific differences

Objectives of Model Building
• Interest may be on the association between a
response and one or two important risk factors
• The estimates are not subject to confounding
• We are not oversimplifying these associations by ignoring
important effect modification
• Interest may be prediction
• The set of regressors that best minimize the prediction error
• Identifying the important independent predictors of an
outcome

Association Models
• Many ways to analyze methylation in R
• Linear models, generalized linear models (logistic, Poisson, etc.),
mixed models, failure time models, Cox PH, etc.
• Not going to expand on these models and modeling
assumptions, beyond the scope of this workshop
• For association models, covariates included should be
based on subject matter knowledge
• Goal is to reduce bias
• R packages have been developed for efficient linear
analysis of microarray data

Linear Models for each Gene
Model: $E(y_j) = \beta_{j0} + \beta_{j1} X$
• The linear model for CpG $j$ has residual variance $\sigma_j^2$, with sample value $s_j^2$ and degrees of freedom $f_j$
• The unscaled standard deviation for the $k$th covariate (the covariate of interest) is $u_{jk}$
• Standard t-statistic:
• $t_{jk} = \frac{\hat\beta_{jk}}{u_{jk} s_j} = \frac{\hat\beta_{jk}}{se(\hat\beta_{jk})}$, with $f_j$ degrees of freedom

Limma and Empirical Bayes
• Limma is a package in R that takes advantage of the
information across genes (or CpGs) to calculate a
moderated t-statistic
• Uses a Bayesian approach to shrinkage of the estimated
sample variances towards a pooled estimate
• Results in more stable inference when the sample
size is small
• However, must consider the potential impact of a sample
that appears to be a global outlier
• The microarray may have failed

• Uses a Bayesian approach to shrink the estimated
sample variances towards a pooled estimate
• Posterior residual variance:
• $\tilde{s}_j^2 = \frac{f_0 s_0^2 + f_j s_j^2}{f_0 + f_j}$
• Prior sample value $s_0^2$ and degrees of freedom $f_0$
• Moderated t-statistic:
• $\tilde{t}_{jk} = \frac{\hat\beta_{jk}}{u_{jk} \tilde{s}_j}$, with $f_0 + f_j$ degrees of freedom
• The extra $f_0$ degrees of freedom represent the extra information
borrowed from all the interrogated sites for inference about each
individual gene (or CpG)
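The shrinkage formula can be traced by hand on simulated data. The sketch below is base R only and is not limma's implementation: limma estimates the prior values from the data, whereas here `f0` and `s0_sq` are simply assumed for illustration.

```r
set.seed(7)
n_cpg <- 200                                  # simulated CpGs
group <- rep(c(0, 1), each = 3)               # 3 vs 3 samples (small study)
n <- length(group)
y <- matrix(rnorm(n_cpg * n, sd = 0.5), nrow = n_cpg)   # rows = CpGs

# Least-squares fit for every CpG at once
X <- cbind(intercept = 1, group = group)
XtX_inv <- solve(crossprod(X))
beta_hat <- y %*% X %*% XtX_inv               # n_cpg x 2 coefficient matrix
resid <- y - beta_hat %*% t(X)

f_j <- n - ncol(X)                            # residual degrees of freedom
s2 <- rowSums(resid^2) / f_j                  # sample variances s_j^2
u <- sqrt(XtX_inv[2, 2])                      # unscaled SD for the group effect

# Ordinary t-statistic
t_ord <- beta_hat[, 2] / (u * sqrt(s2))

# ASSUMED prior values (limma would estimate these from all CpGs)
f0 <- 4
s0_sq <- mean(s2)

# Posterior variance and moderated t-statistic
s2_post <- (f0 * s0_sq + f_j * s2) / (f0 + f_j)
t_mod <- beta_hat[, 2] / (u * sqrt(s2_post))
p_mod <- 2 * pt(-abs(t_mod), df = f0 + f_j)

print(summary(s2))
print(summary(s2_post))
```

The posterior variances are pulled toward the pooled value, so CpGs with an unusually small sample variance no longer produce extreme t-statistics by chance alone.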

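Running limma in R:
library(limma)
# Design matrix: intercept plus group indicator(s) from the phenotype data
tempmod<-model.matrix(~group,data=pheno)
# methyldata: CpGs in rows, samples in columns (same sample order as pheno)
fit <- lmFit(methyldata, tempmod)
# Empirical Bayes moderation of the CpG-wise variances
fit <- eBayes(fit)
# Top-ranked CpGs for the group effect (coefficient 2, assuming a two-level group)
topTable(fit, coef=2)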

Multiple Testing
• Assume we have 450,000 tests, each test is
independent, and we specify our type 1 error to be 0.05
• Type I Error: probability of rejecting H0 given H0 is true
• i.e. a false positive
• Under the null we would expect 450,000*0.05 loci to
have p<0.05
• This is 22,500 CpG loci
• Need to correct for multiple comparisons
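The arithmetic above is easy to confirm by simulation. The sketch below draws p-values for 450,000 independent tests where the null hypothesis is true everywhere; uniform p-values are the standard model for true nulls, and the seed is arbitrary.

```r
m <- 450000       # number of tests
alpha <- 0.05     # per-test type I error

# Expected number of false positives when every null is true
expected_fp <- m * alpha
print(expected_fp)

# Simulate p-values under the null: uniform on (0, 1)
set.seed(1)
p_null <- runif(m)
observed_fp <- sum(p_null < alpha)
print(observed_fp)

# Per-test threshold if alpha is instead divided by the number of tests
bonferroni_threshold <- alpha / m
print(bonferroni_threshold)
```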

Types of Errors
                         Null True   Alternative True   Total
Not called significant   U           T                  m - R
Called significant       V           S                  R
Total                    m0          m - m0             m
V = # of Type I errors (false positives)

Controlling for the Family Wise
Error Rate (FWER)
• Let $T_1, \dots, T_K$ be K independent tests of the null hypotheses
$H_1, \dots, H_K$
• Family-wise error rate (FWER) = the probability of
rejecting at least one null hypothesis $H_i$ that is true
• $FWER = P(V \ge 1)$
• Bonferroni procedure: use significance level α/K
• Very conservative: can increase number of false negatives
• More efficient methods possible
adjustedpvals = p.adjust(Pvalues, method = "bonferroni")

False Discovery Rate (FDR)
• False discovery rate (FDR): the expected proportion
of Type I errors among the rejected hypotheses
• $FDR = E\left[\frac{V}{R} \mid R > 0\right] P(R > 0)$
• To control the FDR at level δ = 0.05
• Order the unadjusted p-values: $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$
• Then find the test with the highest rank, j, for which the p-value is less than or equal to (j/m) × δ
• $p_{(j)} \le \delta \frac{j}{m}$
• Declare the tests of rank 1, 2, …, j as significant
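The step-up procedure can be written out in a few lines and compared with base R's `p.adjust()`. In the sketch below, `bh_reject` is a hypothetical helper written for this example and the p-values are made up.

```r
# Step-up procedure as described: order p-values, find the largest rank j
# with p_(j) <= delta * j / m, and reject ranks 1..j
bh_reject <- function(p, delta = 0.05) {
  m <- length(p)
  ord <- order(p)
  below <- which(p[ord] <= delta * seq_len(m) / m)
  reject <- rep(FALSE, m)
  if (length(below) > 0) {
    reject[ord[seq_len(max(below))]] <- TRUE
  }
  reject
}

# Made-up p-values for ten tests
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)

manual <- bh_reject(p, delta = 0.05)
builtin <- p.adjust(p, method = "BH") <= 0.05

print(rbind(p = p, manual = manual, builtin = builtin))
```

Here only the two smallest p-values pass their rank-specific thresholds (0.005 and 0.010). Five of the ten p-values are below 0.05, so rejecting at an unadjusted 0.05 would have called five tests significant.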

False Discovery Rate (FDR)
• q-value: the minimum FDR that can be attained
when calling that “feature” significant
• Expected proportion of false positives incurred when
calling that feature significant
• If a CpG has a q-value of 0.04, we expect about 4% of the CpGs
with a p-value at least as small as that CpG's to be false
positives
adjustedpvals = p.adjust(Pvalues, method = "fdr")

Permutation Tests
• Does not assume that tests are independent
• Procedure:
1. For the tests of the M CpG loci, order the unadjusted p-values: $p_1 \le p_2 \le \dots \le p_M$
2. Permute outcome within “exchangeable” sets, refit the
regression models
• Must permute within strata if there are any stratifying variables
3. Order the p-values from the M regression models: $p_1^r \le p_2^r \le \dots \le p_M^r$
4. Repeat steps 2 and 3 R times (e.g. R = 1000 permutations)
5. Adjusted p-value: $\tilde{p}_j = \frac{\#\{r : p_j^r \le p_j\}}{R}$
• NOTE: very computationally intensive
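A minimal sketch of the five steps on simulated data follows. The data, the single planted signal, the small number of permutations, and the helper `slope_pvals` are all assumptions for illustration. There are no strata here, so labels are permuted freely.

```r
set.seed(123)
n_cpg <- 100; n <- 20; R <- 200               # kept small so the sketch runs quickly
group <- rep(c(0, 1), each = n / 2)
methyl <- matrix(rnorm(n_cpg * n), nrow = n_cpg)          # rows = CpGs
methyl[1, group == 1] <- methyl[1, group == 1] + 3        # one planted signal

# p-value for the slope at every CpG from a simple linear regression
slope_pvals <- function(y, x) {
  X <- cbind(1, x)
  fit <- lm.fit(X, t(y))                      # matrix response: all CpGs at once
  dof <- fit$df.residual
  sigma <- sqrt(colSums(fit$residuals^2) / dof)
  se <- sigma * sqrt(solve(crossprod(X))[2, 2])
  2 * pt(-abs(fit$coefficients[2, ] / se), dof)
}

# Step 1: ordered observed p-values
obs_sorted <- sort(slope_pvals(methyl, group))

# Steps 2-4: permute the outcome, refit, order; repeat R times
perm_sorted <- replicate(R, sort(slope_pvals(methyl, sample(group))))

# Step 5: share of permutations whose j-th smallest p-value is <= the observed one
adj_p <- rowMeans(perm_sorted <= obs_sorted)
print(head(adj_p))
```

The resolution of the adjusted p-values is 1/R, so more permutations are needed to resolve small values.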

Additional Considerations: Influential Points
• Leverage points: x-outliers with the potential to exert undue influence on regression coefficient estimates
• Influential points: points that have exerted undue influence on the regression coefficient estimates

Additional Considerations: Influential Points
• Important to check global distributions to see if there is
a potential outlier
• Among site-specific tests
• When plotting the association, does it appear to be
driven by only a few samples?
• If a parametric test was used, is the non-parametric test
significant?
• Is the association still significant when the outliers are
removed?
• Important follow-up question: does it seem to be a
technical outlier or a biological outlier?
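These checks can be run with base R diagnostics. The sketch below simulates a CpG with no true association plus one extreme sample; all data values are assumptions for illustration.

```r
set.seed(10)
exposure <- rnorm(30)
methylation <- 0.5 + rnorm(30, sd = 0.05)     # no true association
exposure[30] <- 6                             # one sample far out in x ...
methylation[30] <- 0.95                       # ... and in y

fit <- lm(methylation ~ exposure)

# Leverage (potential for influence) and Cook's distance (realized influence)
lev <- hatvalues(fit)
cd <- cooks.distance(fit)
flagged <- which.max(cd)

# Is the association still significant without the flagged sample?
p_all <- summary(fit)$coefficients[2, 4]
fit_drop <- lm(methylation[-flagged] ~ exposure[-flagged])
p_drop <- summary(fit_drop)$coefficients[2, 4]

# Rank-based (non-parametric) check on the full data
p_rank <- cor.test(exposure, methylation, method = "spearman")$p.value

print(c(all_samples = p_all, outlier_removed = p_drop, spearman = p_rank))
```

Whether the flagged sample is a technical or a biological outlier cannot be settled by these diagnostics; that needs information about the assay and the subject.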

Prediction models
Goals
• Want the most parsimonious model
• Variance of the predictions increases as the number of
regressors increases
• Estimation problems may occur with too many variables
(multicollinearity)
• Do not want an overly simplistic model, which gives biased
estimates

Prediction Models in the Context
of DNA Methylation Studies
• Number of CpGs >> number of individuals
• If trying to predict outcome based on methylation
profiles, there are too many sites to model each individually
• Many sites are correlated, either due to proximity
or shared regulation of a biological pathway
• Correlated sites contribute redundant information to a prediction model

Bias-variance tradeoff
As the model complexity increases the model becomes
more specific to the training set and less generalizable to
the test set
Hastie, Tibshirani & Friedman. Elements of
Statistical Learning (2013 ed.10)
$MSE(\hat\mu) = E\left[(\hat\mu - \mu)^2\right]$
$MSE(\hat\mu) = Var(\hat\mu) + Bias(\hat\mu)^2$
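The decomposition can be checked by simulation. The sketch below estimates a mean with a deliberately shrunken sample mean; the true mean, noise level, sample size, and shrinkage factor are all assumed values chosen for illustration.

```r
set.seed(2024)
mu <- 2; sigma <- 3; n <- 10       # assumed true mean, noise SD, sample size
n_sim <- 20000
shrink <- 0.7                      # hypothetical shrinkage factor (adds bias)

# Shrunken estimator: shrink * sample mean, repeated over many simulated data sets
est <- replicate(n_sim, shrink * mean(rnorm(n, mu, sigma)))

mse <- mean((est - mu)^2)
variance <- mean((est - mean(est))^2)
bias_sq <- (mean(est) - mu)^2

print(c(mse = mse, variance = variance, bias_sq = bias_sq,
        variance_plus_bias_sq = variance + bias_sq))

# MSE of the unbiased sample mean for comparison: sigma^2 / n
mse_unbiased <- sigma^2 / n
print(mse_unbiased)
```

In this setting the biased estimator has the lower MSE, which is the trade that shrinkage methods aim to exploit.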

Variable Selection
• Forward Selection
• Start with model with only an intercept
• Add variable that lowers AIC the most
• Repeat until none of the remaining variables meet the minimum
requirement for inclusion
• Once a variable is included, it cannot be removed
• Backward Elimination
• Start with all of the variables in the model
• Eliminate the least statistically significant variable which does
not meet the criteria for remaining in the model
• Largest p-value or largest decrease in AIC
• Repeat until all the remaining variables meet the criterion for
inclusion
• Once a variable is excluded, it stays out of the model

Variable Selection
• Stepwise Selection
• Start with a model with only an intercept
• Include the variable with the smallest p-value below
the specified significance level
• or the largest decrease in AIC
• Re-evaluate the contribution of each variable after
each step and delete any that no longer meet the
minimum criteria for staying in the model
• The stepwise selection process ends if:
• No further variable can be added to the model
• Or the variable just entered into the model is the only
one eliminated in the subsequent backward elimination
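Base R's `step()` implements AIC-based versions of all three strategies. The sketch below runs them on simulated data in which only `x1` and `x2` truly affect the outcome; the data and variable names are assumptions for illustration.

```r
set.seed(5)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)   # x3 and x4 are pure noise

null_fit <- lm(y ~ 1, data = dat)                   # intercept only
full_fit <- lm(y ~ x1 + x2 + x3 + x4, data = dat)   # all candidate variables

# Forward selection: start from the intercept, add by AIC
fwd <- step(null_fit, scope = formula(full_fit), direction = "forward", trace = 0)

# Backward elimination: start from the full model, drop by AIC
bwd <- step(full_fit, direction = "backward", trace = 0)

# Stepwise: add by AIC, re-evaluating earlier choices after each step
both <- step(null_fit, scope = formula(full_fit), direction = "both", trace = 0)

selected <- lapply(list(forward = fwd, backward = bwd, stepwise = both),
                   function(f) attr(terms(f), "term.labels"))
print(selected)
```

`step()` uses AIC rather than p-value thresholds, so it corresponds to the AIC variants of the rules described above.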

Shrinkage Methods
• Forward/backward/stepwise selection includes or
excludes a variable completely
• Provides a more interpretable model, and possibly lower
prediction error than the full model
• Shrinkage methods constrain the size of regression
coefficients by imposing a penalty
• Penalty introduces bias into analysis to reduce variance

Shrinkage Methods
Ridge regression
$\hat\beta^{ridge} = \arg\min_\beta \left\{ \|y - X\beta\|^2 + \lambda \|\beta\|_2^2 \right\}$
• Bias increases as 𝜆𝜆 increases
• Variance decreases as 𝜆𝜆 increases
• Theory is that there is a 𝜆𝜆 such that
the MSE of the ridge regression is
less than MSE of the linear model
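Ridge regression has a closed-form solution, which makes the effect of λ easy to see. The sketch below is base R on simulated, standardized data; `ridge_coef` is a hypothetical helper and the λ grid is arbitrary.

```r
set.seed(11)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))               # standardized predictors
y <- drop(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n))
y <- y - mean(y)                                     # centered, so no intercept needed

# Closed-form ridge solution: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

lambdas <- c(0, 1, 10, 100)
coefs <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", lambdas)
print(round(coefs, 3))

# Overall size of the coefficient vector shrinks as lambda grows
l2_norm <- sqrt(colSums(coefs^2))
print(round(l2_norm, 3))
```

With λ = 0 the solution is ordinary least squares. Larger λ pulls the coefficients toward zero without setting any of them exactly to zero.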

Shrinkage Methods
Lasso regression
$\hat\beta^{lasso} = \arg\min_\beta \left\{ \|y - X\beta\|^2 + \lambda \|\beta\|_1 \right\}$
• Similar to ridge except that the penalty is the sum of the
absolute values of the parameters ($\ell_1$ penalty), whereas
ridge uses the sum of squared parameters ($\ell_2$ penalty)
• For ridge, none of the parameter estimates go to zero (unless $\lambda = \infty$),
so there is no variable selection
• Parameters can go to exactly zero in lasso, so there is variable selection
• The number of selected parameters is at most the number of subjects
• Does not perform grouped selection: tends to select only one of a set of correlated
variables

Shrinkage Methods
Elastic Net regression
$\hat\beta^{enet} = \arg\min_\beta \left\{ \|y - X\beta\|^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \right\}$
• Combination of the ridge and lasso penalties
• Removes the limitation on the number of selected variables
• Allows correlated variables to enter together

Estimating Prediction Error
• If enough data, ideally split the data so there is a training
and test data set
• Often instead use cross-validation to estimate the prediction
error
• K-fold cross-validation splits the data into K approximately
equal parts
• Fits the model to the other K-1 parts and then calculates the
prediction error of the model when predicting the held-out kth part of
the data
• Performed on all K parts, and the prediction error estimates are combined
Hastie, Tibshirani & Friedman. Elements of
Statistical Learning (2013 ed.10)
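K-fold cross-validation is short to write by hand. The sketch below uses base R on simulated data with a simple linear model; the data, K = 5, and the squared-error loss are assumptions for illustration.

```r
set.seed(99)
n <- 100; K <- 5
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)        # simulated outcome, noise variance 1

# Randomly assign each observation to one of K equally sized folds
fold <- sample(rep(seq_len(K), length.out = n))

# For each fold: fit on the other K-1 folds, predict the held-out fold
fold_mse <- sapply(seq_len(K), function(k) {
  fit <- lm(y ~ x, data = dat[fold != k, ])
  pred <- predict(fit, newdata = dat[fold == k, ])
  mean((dat$y[fold == k] - pred)^2)
})

# Combine the K estimates into one cross-validated prediction error
cv_mse <- mean(fold_mse)
print(round(fold_mse, 3))
print(round(cv_mse, 3))
```

When a penalty parameter such as λ is being tuned, the same loop is repeated for each candidate value and the one with the smallest cross-validated error is kept.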

Estimating Cellular Age with DNA
Methylation Data
• DNA methylation age of human tissues and cell types.
Horvath. Genome Biology (2013)
• In a training data set, chronological age was regressed on the CpGs
using elastic net
• Difference between methylation-predicted age and
chronological age (Δage) put forth as an index of
disproportionate ‘biological’ aging
• Δage has been found to be associated with all cause mortality
• Marioni et al. Genome Biology (2015)

Estimating Cellular Age with DNA
Methylation Data
Marioni et al. Genome Biology (2015)

Final Considerations for Prediction
Modeling
• Standard errors of regression coefficients are biased low
• Not taking into account that we sorted through other variable choices
• p-values that are too small
• severe multiple testing problems
• Choice of variables in the final model are heavily influenced by
sampling
• When variables are correlated, which one enters model (and which is
excluded) is relatively random
• HOWEVER, the purpose of prediction models is not causal
inference
• Should not care about the estimates of specific parameters or if a causal
variable or a correlated surrogate enters the model

Notes on Efficiency in R
• If there is a closed form for a model, vectorize rather than loop
• Slow method
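# One lm() call per CpG (rows of methyl); stores the p-value for exposure
storing<-matrix(NA,ncol=1,nrow=nrow(methyl))
for(i in 1:nrow(methyl)){
storing[i,]<-
summary(lm(methyl[i,]~exposure))$coef[2,4]
}
• Fast method
# Fit all CpGs at once: lm() accepts a matrix response (samples in rows)
obj<-lm(t(methyl)~exposure)
modelM<-model.matrix(~exposure)
XXI<-solve(t(modelM)%*%modelM)   # (X'X)^-1
dof<-obj$df.residual
sigma<-sqrt(colSums(residuals(obj)^2)/dof)   # residual SD per CpG
est<-coef(obj)[2,]
# Standard error of the slope is sigma * sqrt(XXI[2,2])
Pval<-2*pt(-abs(est/(sigma*sqrt(XXI[2,2]))),dof)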

Regional Analysis
Taking advantage of the correlation structure between loci

Differentially Methylated Regions
• Loci in close proximity tend to have
correlated levels of methylation
• We can exploit this correlation structure to
increase our power to detect changes in
methylation
• May have more confidence in our results if
we see a regional change vs a very site-
specific association
Why might this be?
PMID: 2147404

Summarizing Methylation Across
Region
• Implemented in IMA
• For each specific region, IMA will collect all the targeted loci
within it and derive an index of overall region-level methylation
value
• three different index metrics implemented in IMA: mean, median
and Tukey's Biweight robust average
• Between group comparisons then performed on region-level
methylation estimates
beta = dataf@bmatrix
betar = indexregionfunc(indexlist=dataf@TSS1500Ind, beta=beta, indexmethod="median")
TSS1500testALL = testfunc(eset = betar, testmethod="limma", Padj="BH", concov="OFF",
groupinfo = dataf2@groupinfo, gcase ="g2", gcontrol=c("g1","g3"), paired = FALSE)
TSS1500test = outputDMfunc(TSS1500testALL, rawpcut=0.05, adjustpcut=0.05, betadiffcut=0.14)
TSS1500test[10:20,]

Aggregate P-values Across
Predefined Regions
• Uncorrected, CpG-specific P values within a given
region are combined using an extension of Fisher's
method
• Uses weighted inverse chi-square method for correlated
significance tests
• Results in a single aggregate P value for each region
• Aggregate P values are subjected to multiple-testing
correction using the FDR method
• Implemented in RnBeads

“Bump Hunting”
Jaffe et al 2012 IJE

“Bump hunting”
[Figure: the general workflow of bump hunting]

Identifying Candidate Regions

Significance of Bumps

Probe Lasso
Butcher and Beck (2015)
Methods
• Probe Lasso utilises a
flexible window
(“probe-lasso”) based
on probe density to
gather neighbouring
significant-signals to
define clear DMR
boundaries
• Motivation?
• Implemented in ChAMP
Probe Lasso calculates probe spacing for each
probe in the dataset; these data are binned
into one of the 28 genetic/epigenetic
categories (i.e., 7 gene features × 4 CGI
relations)

Probe Lasso
• Specify lassoStyle and
lassoRadius
• If lassoStyle = max, the
probe-lasso sizes will be at
most 2 × lassoRadius bp
• If lassoStyle = min, the
probe-lassos will be at least
2 × lassoRadius bp
• Probe Lasso identifies the
genetic/epigenetic category
that conforms to user-
specified maximum (or
minimum) lassoRadius and
derives the quantile at
which it occurs
• Derived quantile is then
applied to each
genetic/epigenetic
distribution of probe
spacings to create probe-
lassos that vary according
to genetic/epigenetic-
feature
An example quantile distribution of probe spacing for each gene/CGI
feature. The black horizontal and vertical dashed lines indicate the
quantile (43rd) that results from choosing a maximum lasso size of
2000 bp

Probe Lasso
• Results in 28 dynamic window
sizes (‘probe-lassos’) that are
thrown around each
significantly-associated probe
• If these lassos capture a user-
specified number of significant
probes, that probe’s lasso
boundaries are retained
(minSigProbesLasso)
• Overlapping- and neighbouring-
lasso boundaries less than a
user-specified distance apart
are then merged to define DMR
boundaries (minDmrSep)
• All probes in the dataset are
then binned into the DMRs and
their p-values combined for the
DMR, weighted by the
underlying correlation structure
of probe methylation values
