Data Analysis
Day 3
Interrogating Methylation
Possible impacts on DNA methylation:
• Goal: capturing biologically meaningful variation
• Site-specific (Joubert et al 2012 EHP)
• Regional (Jaffe et al 2012 IJE)
• Global (Christensen et al 2012 PLoS Genet)
Interrogating Methylation
• When might we expect:
• Global methylation changes?
• Regional methylation changes?
• Site-specific methylation changes?
Any good examples that people can think of?
Confidence in Methylation Association
Statistical significance
• Increases confidence: reaches genome-wide significance
• Decreases confidence: does not meet predefined significance threshold that takes into account multiple testing
Effect size
• Increases confidence: large (>10% difference)
• Decreases confidence: small (<5% difference)
Bias and confounding
• Increases confidence: bias reduced by design or controlled for in the analyses
• Decreases confidence: bias or uncontrolled confounding may exist and explain the differences observed
Genomic location
• Increases confidence: differential methylation is in a region that may impact regulation of transcription
• Decreases confidence: current knowledge cannot explain the influence of the observed difference in methylation at that locus on transcription
Functional relevance
• Increases confidence: affects expression
• Decreases confidence: does not affect expression
Biological relevance
• Increases confidence: gene codes for known biological function
• Decreases confidence: biological relevance of DMR location unknown
Validation
• Increases confidence: replicated in an independent human cohort or animal model using a different technique
• Decreases confidence: no validation of results attempted or results are not replicated in a validation study
Michels et al. (2013) Nature Methods Reviews 4
Data Considerations: Methylation
Distribution
Example
histograms with
normal density
curves
● Distributions may be normal, skewed, or multimodal
○ important at the analysis stage
Illumina 27k Chip
Du et al. BMC Bioinformatics. 2010
Illumina 450k Chip
● Variance across the spectrum of methylation is not equal
● Extreme values of methylation show reduced variance
compared to intermediate values
Data Issues: Heteroscedasticity
Many Researchers opt for the M-value
M = log2(β / (1 − β))
Du et al. BMC Bioinformatics. (2010)
● Simple to convert in R
● Some R analysis packages have this option built in e.g. CpGassoc
Arcsine (variance-stabilising transformation)
Y = arcsin(√β)
Lin et al. Nucleic Acids Research (2008)
Beta-value Transformation
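The conversions above can be sketched in a few lines of base R. This is a minimal illustration, not the implementation used by any particular package; the function names and the `offset` guard against Beta-values of exactly 0 or 1 are assumptions added here.

```r
# Convert Beta-values to M-values: M = log2(beta / (1 - beta)).
# `offset` is an illustrative safeguard (not from the slides) that keeps
# Beta-values of exactly 0 or 1 from producing infinite M-values.
beta_to_m <- function(beta, offset = 1e-6) {
  beta <- pmin(pmax(beta, offset), 1 - offset)
  log2(beta / (1 - beta))
}

# Inverse transformation, M-values back to Beta-values
m_to_beta <- function(m) 2^m / (2^m + 1)

# Arcsine square-root (variance-stabilising) transformation
arcsine_transform <- function(beta) asin(sqrt(beta))

beta <- c(0.05, 0.20, 0.50, 0.80, 0.95)
m <- beta_to_m(beta)
print(round(m, 3))
print(m_to_beta(m))
```

Equal steps on the Beta scale map to larger steps on the M scale near 0 and 1, which is how the transformation counteracts the reduced variance at extreme methylation values.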
What to Use?
• Beta-value has a more intuitive
biological interpretation
• M-value is more statistically valid
for the differential analysis of
methylation levels
• One possibility is to use M-value
method for conducting
differential methylation analysis
and including the Beta-value
statistics when reporting the
results to investigators
• A drawback of the M-value is that it
may capture statistically significant but
very small absolute changes in
methylation
Global Variation in
Methylation
Exploration of the Major Determinants of Methylation
Clustering Analysis
• Purpose of clustering is to organize objects into groups such
that the objects in a group are more similar to each other than
objects in different groups
• Unsupervised clustering of DNA methylation data is often used
to identify:
• Methylation subgroups
• Groups of samples with a similar methylation profile across a collection
of CpGs
• Some options often used for methylation data
• Non-parametric clustering
• K-means (requires pre-specification of the number of classes)
• Principal Component Analysis (PCA)
• Semi-parametric
• Recursively partitioned mixture model (RPMM) (Houseman et al. 2008)
Clustering Analysis
• Clustering can use various linkage and distance
methods
• Distance quantifies dissimilarity between sample data
• The linkage method is used when deciding the distance
for observations that have already been merged
together
• i.e. choosing what point in a cluster to measure the inter-
cluster distance from
Distance Metrics
Distance quantifies dissimilarity
between sample data
• Euclidean: square root of the sum
of squared attribute
differences
• Straight-line (shortest) distance
• Manhattan: sum of the absolute
differences of their
corresponding components
• the distance that would be
traveled to get from one data
point to the other if a grid-like
path is followed
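As a small worked example, both metrics can be computed with base R's `dist()`. The two samples and their Beta-values below are made up purely for illustration.

```r
# Two hypothetical samples measured at three CpGs (Beta-values)
x <- rbind(sample1 = c(0.10, 0.80, 0.50),
           sample2 = c(0.40, 0.40, 0.50))

# Euclidean: sqrt(0.3^2 + 0.4^2 + 0^2)
euclid <- as.numeric(dist(x, method = "euclidean"))

# Manhattan: |0.3| + |0.4| + |0|
manhat <- as.numeric(dist(x, method = "manhattan"))

print(c(euclidean = euclid, manhattan = manhat))
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of samples, since the grid-like path cannot be shorter than the straight line.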
Linkage
Types of linkage:
• Complete - defines the
cluster distance between
two clusters to be the
maximum distance between
their individual components
• Average – the mean
distance from one cluster to
another
• Median – the median
distance from one cluster to
another; results tend to be
relatively similar to those from
“average” linkage
[Figure: average, complete, and single linkage, with a dendrogram that displays a hierarchical relationship]
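A minimal sketch of hierarchical clustering with different linkage methods, using base R's `dist()`, `hclust()`, and `cutree()`. The Beta-values are simulated here (two groups of three samples), so the data and sample names are assumptions for illustration only.

```r
set.seed(1)
# Simulated Beta-values: 6 samples x 50 CpGs, two groups of three samples
low  <- matrix(rbeta(3 * 50, 2, 8), nrow = 3)   # mostly unmethylated
high <- matrix(rbeta(3 * 50, 8, 2), nrow = 3)   # mostly methylated
beta <- rbind(low, high)
rownames(beta) <- paste0("s", 1:6)

# Distance between samples, then agglomerative clustering with three linkages
d <- dist(beta, method = "euclidean")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
hc_single   <- hclust(d, method = "single")

# Cut the complete-linkage tree into two groups
clusters <- cutree(hc_complete, k = 2)
print(clusters)
```

Calling `plot(hc_complete)` draws the dendrogram; with real data the three linkage methods can give noticeably different trees, so it is worth comparing them.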
Evaluating Classifier Performance
• compare the labeled outcome of the supervised
classification algorithm with the known labeled targets
• e.g. Area under the curve, sensitivity and specificity
• How well have we labelled the input data according to the
target labels?
• For unsupervised clustering, measure the relation between elements
of each class rather than agreement with the given labels
• Adjusted Rand Index (ARI) evaluates how well an algorithm
separates the elements belonging to different classes
• ARI values near 1 indicate high agreement
• ARI values near 0 indicate agreement no better than chance; negative
values indicate less agreement than expected by chance
• Can have any number of groups (unlike sensitivity and
specificity)
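The Adjusted Rand Index can be computed directly from the contingency table of the two partitions. The function below is a minimal base R sketch of the standard formula; the function name and the toy label vectors are assumptions, and degenerate cases (for example, a single cluster) are not handled.

```r
# Adjusted Rand Index from the contingency table of two partitions
adjusted_rand_index <- function(truth, cluster) {
  tab <- table(truth, cluster)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))            # pairs placed together in both
  sum_rows  <- sum(choose(rowSums(tab), 2))   # pairs together in `truth`
  sum_cols  <- sum(choose(colSums(tab), 2))   # pairs together in `cluster`
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

truth     <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
relabeled <- c(3, 3, 3, 1, 1, 1, 2, 2, 2)   # same grouping, different labels
mixed     <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)   # every true class split evenly

ari_relabeled <- adjusted_rand_index(truth, relabeled)
ari_mixed     <- adjusted_rand_index(truth, mixed)
print(c(relabeled = ari_relabeled, mixed = ari_mixed))
```

Because the index only looks at which elements are grouped together, relabeling the clusters does not change it.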
Principal Component Analysis
• Principal components are
found by calculating the
eigenvectors and
eigenvalues of the data
covariance matrix
• eigenvector with the
largest eigenvalue is the
direction of greatest
variation, the one with the
second largest eigenvalue
is the (orthogonal)
direction with the next
highest variation
The directions U and V are
principal components
• Orthogonal to each other
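A short sketch of the eigen-decomposition view of PCA, compared against base R's `prcomp()`. The data are simulated and the variable names are illustrative assumptions.

```r
set.seed(1)
# Simulated data: 100 observations of 3 variables, two of them correlated
X <- matrix(rnorm(100 * 3), ncol = 3)
X[, 2] <- X[, 1] + 0.3 * X[, 2]

# Principal components from the eigen-decomposition of the covariance matrix
eig <- eigen(cov(X))

# The same decomposition via prcomp (centered, not scaled)
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Eigenvalues are the variances along each principal component
print(rbind(eigen = eig$values, prcomp = pca$sdev^2))

# The first two directions are orthogonal (inner product is numerically zero)
inner <- sum(eig$vectors[, 1] * eig$vectors[, 2])
print(inner)
```

The eigenvectors from the two approaches agree up to sign, which is arbitrary for a principal component.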
Principal Component Analysis
Genes mirror geography
within Europe.
Novembre et al. Nature
(2008)
First two principal
components of genetic
variation in a sample of
3,000 European
individuals genotyped at
over half a million
variable DNA sites
Recursively Partitioned Mixture
Model (RPMM)
• General procedure: divide samples based on
methylation profile using a mixture of beta distributions
to recursively split samples via 2-class models with
Bayesian information criterion (BIC) used at each
potential split to decide whether the split was to be
maintained or abandoned
• Result: K classes, representing K terminal nodes, and posterior
probabilities of class membership for the samples
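library(RPMM)
data(IlluminaMethylation)   # example data: IllumBeta (Beta-values) and tissue labels
# Fit the beta-mixture RPMM
rpmm <- blcTree(IllumBeta)
# Posterior probabilities of class membership for the samples
ProbClassMembership <- blcTreeLeafMatrix(rpmm)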
Recursively Partitioned Mixture
Model (RPMM)
• Note: No closed form MLE for parameters for the beta-distribution -
computationally intensive
• An M-value transformation followed by fitting a Gaussian RPMM tends to produce similar results
# Despite its name, expit2 is the base-2 logit: Beta-values to M-values
expit2 <- function(x) log2(x) - log2(1 - x)
RPMMSolution <- glcTree(expit2(IllumBeta))
par(mfrow = c(2, 1))
plotTree.blcTree(rpmm,
  labelFunction = function(u, digits) table(as.character(tissue[u$index])))
title("Dendrogram using Beta Distribution")
# RPMMSolution is a Gaussian tree, so use the glcTree plotting function
plotTree.glcTree(RPMMSolution,
  labelFunction = function(u, digits) table(as.character(tissue[u$index])))
title("Dendrogram using Gaussian Distribution")
Recursively Partitioned Mixture
Model (RPMM)
Clustering Samples- identify similar
global methylation profiles
Tissue-specific DNA methylation
dependent upon CpG island context.
Christensen et al. PLoS Genetics (2009)
Methylation is yellow for unmethylated
and blue for methylated
Methylation profile classes significantly
differentiate all normal tissue types
(n = 217, P<0.0001)
Recursively Partitioned Mixture
Model (RPMM)
Clustering CpGs - examine classes of CpGs
with similar methylation profiles
Tissue-specific DNA methylation
dependent upon CpG island context.
Christensen et al. PLoS Genetics (2009)
Mean regression coefficients for age
associated methylation (by decade), and its
95% confidence interval from GEE for
each CpG RPMM class
• CpGs clustered with RPMM into eight
classes for each group of samples
• The bottom plot indicates the CpG
island status for each locus
Summary
• Benefits to analysis of global changes
• Identify major determinant of methylation profile
• Suggests something about the nature of epigenetic
regulation associated with the exposure or phenotype of
interest
• Drawbacks
• Major determinants may not be of biological interest
• Sources of major variation tend to be batch effects and
tissue-specific differences
Site-Specific Analysis
Objectives of Model Building
• Interest may be on the association between a
response and one or two important risk factors
• The estimates are not subject to confounding
• We are not oversimplifying these associations by ignoring
important effect modification
• Interest may be prediction
• The set of regressors that best minimize the prediction error
• Identifying the important independent predictors of an
outcome
Association Models
• Many ways to analyze methylation in R
• Linear and generalized linear models (logistic, Poisson, etc.),
mixed models, failure-time models, Cox PH, etc.
• Not going to expand on these models and modeling
assumptions, beyond the scope of this workshop
• For association models, covariates included should be
based on subject matter knowledge
• Goal is to reduce bias
• R packages have been developed for efficient linear
analysis of microarray data
Linear Models for each Gene
Model: $E(y_j) = \beta_{j0} + \beta_{j1} X$
• Consider the linear model for CpG $j$, with residual variance $\sigma_j^2$, sample value $s_j^2$, and degrees of freedom $f_j$
• The unscaled standard deviation for the kth covariate (the covariate of interest) is $u_{jk}$
• Standard t-statistic:
• $t_{jk} = \frac{\hat{\beta}_{jk}}{u_{jk} s_j} = \frac{\hat{\beta}_{jk}}{se(\hat{\beta}_{jk})}$ with $f_j$ degrees of freedom
Limma and Empirical Bayes
• Limma is a package in R that takes advantage of the
information across genes (or CpGs) to calculate a
moderated t-statistic
• Uses a Bayesian approach to shrinkage of the estimated
sample variances towards a pooled estimate
• Results in more stable inference when the sample
size is small
• However, must consider the potential impact of any sample
that seems to be a global outlier
• Microarray may have failed
Limma and Empirical Bayes
• Uses a Bayesian approach to shrinkage of the estimated
sample variances towards a pooled estimate
• Posterior residual standard deviations $\tilde{s}_j$:
• $\tilde{s}_j^2 = \frac{f_0 s_0^2 + f_j s_j^2}{f_0 + f_j}$
• Prior sample value $s_0$ and degrees of freedom $f_0$
• Moderated t-statistic:
• $\tilde{t}_{jk} = \frac{\hat{\beta}_{jk}}{u_{jk} \tilde{s}_j}$ with $f_0 + f_j$ degrees of freedom
• The extra $f_0$ degrees of freedom represent the extra information
borrowed from all the interrogated sites for inference about each
individual gene
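The shrinkage formula can be illustrated in base R on simulated data. This is a sketch under a strong simplifying assumption: the prior values `f0` and `s2_0` are fixed by hand here, whereas limma estimates them from the data. All variable names are illustrative.

```r
set.seed(1)
n_cpg <- 1000
group <- rep(0:1, each = 3)                   # 6 samples, two groups
n <- length(group)
y <- matrix(rnorm(n_cpg * n), nrow = n_cpg)   # simulated null data, CpGs in rows

# Ordinary least squares for every CpG at once
X <- cbind(1, group)
XtX_inv <- solve(crossprod(X))
coef_hat <- y %*% X %*% XtX_inv               # n_cpg x 2 matrix of estimates
resid <- y - coef_hat %*% t(X)
f_j <- n - ncol(X)                            # residual degrees of freedom
s2_j <- rowSums(resid^2) / f_j                # sample variances s_j^2
u_jk <- sqrt(XtX_inv[2, 2])                   # unscaled standard deviation
t_ord <- coef_hat[, 2] / (u_jk * sqrt(s2_j))  # standard t-statistic

# ASSUMPTION: prior values chosen by hand for illustration only
f0 <- 4
s2_0 <- mean(s2_j)

# Posterior variances and moderated t-statistics
s2_tilde <- (f0 * s2_0 + f_j * s2_j) / (f0 + f_j)
t_mod <- coef_hat[, 2] / (u_jk * sqrt(s2_tilde))
p_mod <- 2 * pt(-abs(t_mod), df = f0 + f_j)

print(c(var_s2 = var(s2_j), var_s2_tilde = var(s2_tilde)))
```

The posterior variances are less variable than the raw sample variances, which is what stabilises inference when the sample size is small.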
Limma and Empirical Bayes
Running limma in R:
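library(limma)
# Design matrix: intercept plus group effect
tempmod <- model.matrix(~group, data = pheno)
# Fit a linear model to each CpG (rows of methyldata)
fit <- lmFit(methyldata, tempmod)
# Empirical Bayes moderation of the variances
fit <- eBayes(fit)
# Top-ranked CpGs for the group coefficient
topTable(fit, coef = 2)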
Multiple Testing
• Assume we have 450,000 tests, each test is
independent, and we specify our type 1 error to be 0.05
• Type I Error: probability of rejecting H0 given H0 is true
• i.e. a false positive
• Under the null we would expect 450,000*0.05 loci to
have p<0.05
• This is 22,500 CpG loci
• Need to correct for multiple comparisons
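The arithmetic above is easy to check by simulation. The sketch below draws 450,000 independent p-values under the null (uniform on [0, 1]) and counts how many fall below 0.05; the variable names are illustrative.

```r
set.seed(1)
m <- 450000      # number of tests (CpG loci)
alpha <- 0.05    # per-test type I error

# Expected number of false positives when every null hypothesis is true
expected_false_pos <- m * alpha

# Simulate independent p-values under the null and count those below alpha
p_null <- runif(m)
observed_false_pos <- sum(p_null < alpha)

# A Bonferroni-style per-test threshold for comparison
bonferroni_threshold <- alpha / m

print(c(expected = expected_false_pos, observed = observed_false_pos))
print(bonferroni_threshold)
```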
Types of Errors
                          Null true   Alternative true   Total
Not called significant    U           T                  m − R
Called significant        V           S                  R
Total                     m0          m − m0             m
V = number of Type I errors (false positives)
Controlling for the Family Wise
Error Rate (FWER)
• Let T1…TK be K independent tests of the null hypotheses
H1…HK
• Family-wise error rate (FWER) = the probability of
rejecting at least one true null hypothesis
• $\mathrm{FWER} = P(V \geq 1)$
• Bonferroni procedure: use significance level α/K
• Very conservative: can increase number of false negatives
• More efficient methods possible
adjustedpvals = p.adjust(Pvalues, method = "bonferroni")
False Discovery Rate (FDR)
• False discovery rate (FDR): the expected proportion
of Type I errors among the rejected hypotheses
• $\mathrm{FDR} = E\left[\frac{V}{R} \mid R > 0\right] P(R > 0)$
• To control the FDR at level δ = 0.05
• Order the unadjusted p-values: $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$
• Then find the test with the highest rank, j, for which the p-value,
$p_{(j)}$, is less than or equal to (j/m) × δ
• $p_{(j)} \leq \delta \frac{j}{m}$
• Declare the tests of rank 1, 2, …, j as significant
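The step-up procedure above can be written out by hand and compared with base R's `p.adjust(..., method = "BH")`. The function name `bh_reject` and the simulated p-values are assumptions for illustration.

```r
# Step-up procedure: find the largest rank j with p_(j) <= (j / m) * delta,
# then declare the tests of rank 1..j significant
bh_reject <- function(p, delta = 0.05) {
  m <- length(p)
  ord <- order(p)
  passes <- p[ord] <= seq_len(m) / m * delta
  reject <- rep(FALSE, m)
  if (any(passes)) {
    j <- max(which(passes))
    reject[ord[seq_len(j)]] <- TRUE
  }
  reject
}

set.seed(1)
# Simulated p-values: 900 nulls plus 100 tests with small p-values
p <- c(runif(900), rbeta(100, 0.5, 50))

manual  <- bh_reject(p, delta = 0.05)
builtin <- p.adjust(p, method = "BH") <= 0.05

print(c(manual = sum(manual), builtin = sum(builtin)))
```

Note that a p-value above its own threshold can still be declared significant if a larger p-value further down the ranking passes, which is why the procedure looks for the highest passing rank.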
False Discovery Rate (FDR)
• q-value: the minimum FDR that can be attained
when calling that “feature” significant
• Expected proportion of false positives incurred when
calling that feature significant
• If a CpG has a q-value of 0.04, then about 4% of the CpGs
with a p-value at least as small as that CpG's are expected to be false
positives
adjustedpvals = p.adjust(Pvalues, method = "fdr")
Permutation Tests
• Does not assume that tests are independent
• Procedure:
1. For the tests of the M CpG loci, order the unadjusted p-values:
$p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(M)}$
2. Permute outcome within “exchangeable” sets, refit the
regression models
• Must permute within strata if any stratifying variables
3. Order the p-values from the M regression models:
$p^{r}_{(1)} \leq p^{r}_{(2)} \leq \dots \leq p^{r}_{(M)}$
4. Repeat steps 2 and 3 R times (e.g. R = 1000 permutations)
5. Adjusted p-value: $\tilde{p}_{(j)} = \frac{\#\{r : p^{r}_{(j)} \leq p_{(j)}\}}{R}$
• NOTE: very computationally intensive
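The procedure above can be sketched in base R on simulated data. Several things here are assumptions made to keep the example small and fast: the data are simulated, there are no strata (so the outcome is permuted freely), only R = 200 permutations are used, and the per-CpG p-values come from a simple linear regression computed through the correlation coefficient (the helper `row_pvals` is hypothetical).

```r
set.seed(1)
n <- 40; M <- 200; R <- 200
group <- rep(0:1, each = n / 2)
methyl <- matrix(rnorm(M * n), nrow = M)                # CpGs in rows
methyl[1:5, group == 1] <- methyl[1:5, group == 1] + 1.5  # 5 true signals

# p-value for the slope of a simple linear regression of each row on x,
# obtained from the correlation coefficient (equivalent to the t-test)
row_pvals <- function(Y, x) {
  r <- as.vector(cor(t(Y), x))
  df <- length(x) - 2
  tstat <- r * sqrt(df / (1 - r^2))
  2 * pt(-abs(tstat), df)
}

# Step 1: ordered observed p-values
obs_sorted <- sort(row_pvals(methyl, group))

# Steps 2-4: permute the outcome, recompute and order the p-values, R times
perm_sorted <- replicate(R, sort(row_pvals(methyl, sample(group))))

# Step 5: proportion of permutations whose j-th ordered p-value is
# at least as small as the observed j-th ordered p-value
adj_sorted <- rowMeans(perm_sorted <= obs_sorted)

print(head(cbind(observed = obs_sorted, adjusted = adj_sorted)))
```

With stratifying variables, `sample(group)` would need to be replaced by a permutation carried out within each stratum.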
Additional Considerations:
Influential Points
• Leverage points: x-outliers with the potential to exert undue
influence on regression coefficient estimates
• Influential points: points that have exerted undue influence on
the regression coefficient estimates
• Important to check global distributions to see if there is
a potential outlier
• Among site specific tests
• Plotting the association, does the association appear to be
driven by only a few samples?
• If a parametric test was used, is the non-parametric test
significant?
• Is the association still significant when the outliers are
removed?
• Important follow-up question: does it seem to be a
technical outlier or biologic outlier?
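The checks listed above can be sketched with base R diagnostics. The data are simulated with one planted outlier (sample 30), so the numbers carry no biological meaning; `hatvalues()` and `cooks.distance()` are used here as one common way to quantify leverage and influence, not necessarily the approach the authors had in mind.

```r
set.seed(1)
# 29 unremarkable samples with no true association, plus one extreme sample
x <- c(seq(-1, 1, length.out = 29), 6)
y <- c(rnorm(29, sd = 0.1), 5)

fit <- lm(y ~ x)
leverage <- hatvalues(fit)        # x-outliers have high leverage
cooks_d  <- cooks.distance(fit)   # influence on the fitted coefficients

# Is the association still significant without the suspect sample?
fit_trimmed <- lm(y[-30] ~ x[-30])
p_all     <- summary(fit)$coefficients[2, 4]
p_trimmed <- summary(fit_trimmed)$coefficients[2, 4]

# Non-parametric check of the same association
p_spearman <- cor.test(x, y, method = "spearman")$p.value

print(c(most_leverage = which.max(leverage), most_influence = which.max(cooks_d)))
print(c(p_all = p_all, p_trimmed = p_trimmed, p_spearman = p_spearman))
```

A scatter plot of `y` against `x` makes the same point visually: the fitted line is driven by a single sample.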
Additional Considerations:
Influential Points
Prediction models
Goals
• Want the most parsimonious model
• Variance of the predictions increases as the number of
regressors increase
• Estimation problems may occur with too many variables
(multicollinearity)
• Do not want the model overly simplistic = biased
estimates
Prediction Models in the Context
of DNA Methylation Studies
• Number of CpGs>>>the number of individuals
• If trying to predict outcome based on methylation
profiles, too many sites to model each individually
• Many sites are correlated – either due to proximity
or shared regulation of biological pathway
• Become redundant information in a prediction model
Bias-variance tradeoff
As the model complexity increases the model becomes
more specific to the training set and less generalizable to
the test set
Hastie, Tibshirani & Friedman. Elements of
Statistical Learning (2013 ed.10)
$\mathrm{MSE}(\hat{\mu}) = E\left[(\hat{\mu} - \mu)^2\right]$
$\mathrm{MSE}(\hat{\mu}) = \mathrm{Var}(\hat{\mu}) + \mathrm{Bias}(\hat{\mu})^2$
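The decomposition can be verified numerically. The sketch below uses a deliberately biased estimator, a sample mean multiplied by a hypothetical shrinkage factor of 0.8, and checks that the simulated MSE equals variance plus squared bias.

```r
set.seed(1)
mu <- 2          # true mean
n <- 10          # sample size per simulated data set
n_sim <- 10000   # number of simulated data sets
shrink <- 0.8    # hypothetical shrinkage factor (introduces bias)

# Shrunken sample mean for each simulated data set
est <- replicate(n_sim, shrink * mean(rnorm(n, mean = mu)))

mse      <- mean((est - mu)^2)
bias     <- mean(est) - mu
variance <- mean((est - mean(est))^2)

print(c(mse = mse, variance_plus_bias_sq = variance + bias^2))
```

Shrinking lowers the variance of the estimator (by a factor of 0.8 squared) at the cost of a bias of about (0.8 − 1) × 2 = −0.4.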
Variable Selection
• Forward Selection
• Start with model with only an intercept
• Add variable that lowers AIC the most
• Repeat until none of the remaining variables meet the minimum
requirement for inclusion
• Once a variable is included, it cannot be removed
• Backward Elimination
• Start with all of the variables in the model
• Eliminate the least statistically significant variable which does
not meet the criteria for remaining in the model
• Largest p-value or largest decrease in AIC
• Repeat until all the remaining variables meet the criterion for
inclusion
• Once a variable is excluded, it stays out of the model
Variable Selection
• Stepwise Selection
• Start with a model with only an intercept
• Include the variable with the smallest p-value less than
the specified significance level
• or largest decrease in AIC
• Re-evaluate the contribution of each variable after
each step and delete any which no longer meet the
minimum criteria for staying in the model
• The stepwise selection process ends if:
• No further variable can be added to the model
• Or if the variable just entered into the model is the only one
eliminated in the subsequent backward elimination
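Base R's `step()` implements AIC-based forward, backward, and stepwise selection. The sketch below uses simulated data in which only `x1` truly affects the outcome; the data frame and variable names are assumptions for illustration.

```r
set.seed(42)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * dat$x1 + rnorm(n)   # only x1 is truly associated with y

null_model <- lm(y ~ 1, data = dat)              # intercept only
full_model <- lm(y ~ x1 + x2 + x3, data = dat)   # all candidate variables

# Forward selection: start from the intercept, add the variable that lowers AIC most
forward <- step(null_model, scope = formula(full_model),
                direction = "forward", trace = 0)

# Backward elimination: start from the full model and drop variables
backward <- step(full_model, direction = "backward", trace = 0)

# Stepwise: additions are re-evaluated and may be removed again
stepwise <- step(null_model, scope = formula(full_model),
                 direction = "both", trace = 0)

print(formula(forward))
print(formula(backward))
print(formula(stepwise))
```

Setting `trace = 1` prints the AIC of each candidate move, which is helpful for following the procedure step by step.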
Shrinkage Methods
• Forward/backward/stepwise selection includes or
excludes a variable completely
• Provides a more interpretable model, and possibly lower
prediction error than the full model
• Shrinkage methods constrain the size of regression
coefficients by imposing a penalty
• Penalty introduces bias into analysis to reduce variance
Shrinkage Methods
Ridge regression
$\hat{\beta}^{ridge} = \underset{\beta}{\arg\min} \left\{ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \right\}$
• Bias increases as 𝜆𝜆 increases
• Variance decreases as 𝜆𝜆 increases
• Theory is that there is a 𝜆𝜆 such that
the MSE of the ridge regression is
less than MSE of the linear model
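Ridge regression has a closed-form solution, $(X^T X + \lambda I)^{-1} X^T y$, which makes the effect of $\lambda$ easy to see. The sketch below is a minimal base R illustration on simulated, standardized data without an intercept; the function name `ridge_coef` and the simulated coefficients are assumptions.

```r
# Closed-form ridge solution: (X'X + lambda * I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

set.seed(1)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))            # standardized predictors
y <- drop(X %*% c(2, -1, 0, 0, 0.5) + rnorm(n))
y <- y - mean(y)                                  # centered outcome, no intercept

lambdas <- c(0, 1, 10, 100)
coefs <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", lambdas)

# Coefficients shrink towards zero as lambda grows, but none become exactly zero
print(round(coefs, 3))
coef_norms <- sqrt(colSums(coefs^2))
print(round(coef_norms, 3))
```

With `lambda = 0` the solution is the ordinary least squares fit.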
Shrinkage Methods
Lasso regression
$\hat{\beta}^{lasso} = \underset{\beta}{\arg\min} \left\{ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right\}$
• Similar to ridge except that penalty is the sum of the
absolute value of the parameters ($l_1$ penalty), whereas
the ridge uses the sum of squared parameters ($l_2$ penalty)
• For ridge none of the parameter estimates go to zero (unless $\lambda = \infty$),
so there is no variable selection
• Parameters go to zero in lasso, so there is variable selection
• The number of selected parameters is at most equal to the number of subjects
• Does not perform grouped selection: tends to select only one of a group of correlated
variables
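To show how the $l_1$ penalty sets coefficients exactly to zero, here is a minimal coordinate-descent sketch in base R for the objective $\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$. It is an illustration on simulated data, not the algorithm of any specific package; `lasso_cd`, `soft_threshold`, the fixed iteration count, and the chosen λ values are all assumptions.

```r
# Soft-thresholding operator: shrink z towards zero by g, truncating at zero
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Coordinate descent for ||y - X b||^2 + lambda * ||b||_1 (no intercept)
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  p <- ncol(X)
  beta <- numeric(p)
  col_ss <- colSums(X^2)
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # residual with predictor j left out of the fit
      partial_resid <- y - X[, -j, drop = FALSE] %*% beta[-j]
      rho <- sum(X[, j] * partial_resid)
      beta[j] <- soft_threshold(rho, lambda / 2) / col_ss[j]
    }
  }
  beta
}

set.seed(1)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X %*% c(2, -1, 0, 0, 0.5) + rnorm(n))
y <- y - mean(y)

beta_ols   <- lasso_cd(X, y, lambda = 0)     # no penalty: least squares
beta_lasso <- lasso_cd(X, y, lambda = 40)    # moderate penalty
lambda_big <- 2 * max(abs(crossprod(X, y))) * 1.01
beta_null  <- lasso_cd(X, y, lambda = lambda_big)  # penalty large enough to zero everything

print(round(rbind(ols = beta_ols, lasso = beta_lasso, null = beta_null), 3))
```

Unlike ridge, a sufficiently large λ gives coefficients that are exactly zero rather than merely small.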
Shrinkage Methods
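Elastic Net regression
$\hat{\beta}^{elastic\ net} = \underset{\beta}{\arg\min} \left\{ \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 \right\}$
• Combination of the ridge and lasso penalties
• Removes limitation on the number of selected variables
• Allows correlated variables to enter together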
Estimating Prediction Error
• If enough data, ideally split the data so there is a training
and test data set
• Often instead use cross-validation to estimate the prediction
error
• K-fold cross-validation splits the data into K approximately
equal parts
• For each part k, fit the model to the other K-1 parts and then calculate the
prediction error of the model when predicting the kth part of
the data
• Performed on all K parts and the prediction error estimates combined
Hastie, Tibshirani & Friedman. Elements of
Statistical Learning (2013 ed.10)
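K-fold cross-validation is straightforward to write by hand. The sketch below uses simulated data and a simple linear model with K = 5; the data, the model, and the use of mean squared error as the prediction error are assumptions for illustration.

```r
set.seed(1)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)   # simulated outcome with noise variance 1

K <- 5
# Randomly assign each observation to one of K approximately equal folds
fold <- sample(rep(seq_len(K), length.out = n))

# For each fold: fit on the other K-1 folds, predict the held-out fold
fold_mse <- sapply(seq_len(K), function(k) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]
  fit <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

# Combine the K estimates into one cross-validated prediction error
cv_error <- mean(fold_mse)
print(round(fold_mse, 3))
print(cv_error)
```

When tuning a penalty such as λ, the same loop is repeated for each candidate value and the value with the lowest cross-validated error is chosen.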
Estimating Cellular Age with DNA
Methylation Data
• DNA methylation age of human tissues and cell types.
Horvath. Genome Biology (2013)
• In a training data set, chronological age was regressed on the CpGs
using elastic net
• Difference between methylation-predicted age and
chronological age (Δage) put forth as an index of
disproportionate ‘biological’ aging
• Δage has been found to be associated with all cause mortality
• Marioni et al. Genome Biology (2015)
Estimating Cellular Age with DNA
Methylation Data
Marioni et al. Genome Biology (2015)
Final Considerations for Prediction
Modeling
• Standard errors of regression coefficients are biased low
• Not taking into account that we sorted through other variable choices
• p-values that are too small
• severe multiple testing problems
• Choice of variables in the final model are heavily influenced by
sampling
• When variables are correlated, which one enters model (and which is
excluded) is relatively random
• HOWEVER, the purpose of prediction models is not causal
inference
• Should not care about the estimates of specific parameters or if a causal
variable or a correlated surrogate enters the model
Notes on Efficiency in R
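• If there is a closed form for a model, vectorize rather than loop
• Slow method
# Fit one linear model per CpG and store the p-value for exposure
storing <- matrix(NA, ncol = 1, nrow = nrow(methyl))
for (i in seq_len(nrow(methyl))) {
  storing[i, ] <- summary(lm(methyl[i, ] ~ exposure))$coefficients[2, 4]
}
• Fast method
# Fit all CpGs in one call, then compute the p-values in closed form
obj <- lm(t(methyl) ~ exposure)
modelM <- model.matrix(~exposure)
XXI <- solve(t(modelM) %*% modelM)
dof <- obj$df.residual
sigma <- sqrt(colSums(obj$residuals^2) / dof)   # residual SD per CpG
est <- obj$coefficients[2, ]
# standard error = sigma * sqrt(XXI[2, 2])
Pval <- 2 * pt(-abs(est / (sigma * sqrt(XXI[2, 2]))), dof)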
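The loop and the closed-form calculation can be checked against each other on simulated data. In this sketch the matrix `methyl` (CpGs in rows) and the covariate `exposure` are simulated, and the standard error of each slope is taken as the residual standard deviation times the square root of the corresponding diagonal element of $(X^T X)^{-1}$.

```r
set.seed(1)
n_cpg <- 200; n <- 20
exposure <- rnorm(n)
methyl <- matrix(rnorm(n_cpg * n), nrow = n_cpg)   # simulated, CpGs in rows

# Slow: one lm() call per CpG
slow <- numeric(n_cpg)
for (i in seq_len(n_cpg)) {
  slow[i] <- summary(lm(methyl[i, ] ~ exposure))$coefficients[2, 4]
}

# Fast: a single multi-response fit plus closed-form standard errors
obj <- lm(t(methyl) ~ exposure)
modelM <- model.matrix(~ exposure)
XXI <- solve(crossprod(modelM))
dof <- obj$df.residual
sigma <- sqrt(colSums(residuals(obj)^2) / dof)     # residual SD per CpG
est <- coef(obj)[2, ]
fast <- 2 * pt(-abs(est / (sigma * sqrt(XXI[2, 2]))), dof)

print(all.equal(slow, unname(fast)))
```

Wrapping each version in `system.time()` shows the difference in speed, which grows with the number of CpGs.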
Regional Analysis
Taking advantage of the correlation structure between loci
Differentially Methylated Regions
• Loci in close proximity tend to have
correlated levels of methylation
• We can exploit this correlation structure to
increase our power to detect changes in
methylation
• May have more confidence in our results if
we see a regional change vs a very site-
specific association
Why might this be?
PMID: 2147404
Summarizing Methylation Across
Region
• Implemented in IMA
• For each specific region, IMA will collect all the targeted loci
within it and derive an index of overall region-level methylation
value
• three different index metrics implemented in IMA: mean, median
and Tukey's Biweight robust average
• Between group comparisons then performed on region-level
methylation estimates
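# Beta-value matrix from the IMA data object
beta = dataf@bmatrix
# Region-level index: median Beta-value across the loci in each TSS1500 region
betar = indexregionfunc(indexlist = dataf@TSS1500Ind,
  beta = beta, indexmethod = "median")
# Between-group comparison on the region-level values
TSS1500testALL = testfunc(eset = betar,
  testmethod = "limma", Padj = "BH", concov = "OFF",
  groupinfo = dataf2@groupinfo, gcase = "g2", gcontrol = c("g1", "g3"),
  paired = FALSE)
# Keep regions passing the p-value and Beta-difference cutoffs
TSS1500test = outputDMfunc(TSS1500testALL,
  rawpcut = 0.05, adjustpcut = 0.05, betadiffcut = 0.14)
TSS1500test[10:20, ]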
Aggregate P-values Across
Predefined Regions
• Uncorrected, CpG-specific P values within a given
region are combined using an extension of Fisher's
method
• Uses weighted inverse chi-square method for correlated
significance tests
• Results in a single aggregate P value for each region
• Aggregate P values are subjected to multiple-testing
correction using the FDR method
• Implemented in RnBeads
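As a simplified illustration of the idea, the sketch below applies the basic Fisher's method to CpG-level p-values within predefined regions and then applies an FDR correction to the aggregate p-values. This is not the RnBeads implementation: plain Fisher's method assumes independent tests and uses no weights, whereas the approach described above is a weighted extension for correlated tests. The region labels and p-values are made up.

```r
# Basic (unweighted) Fisher's method: -2 * sum(log p) ~ chi-square with 2k df
# ASSUMPTION: tests within a region are treated as independent
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# Hypothetical CpG-level p-values in two predefined regions
region <- c("r1", "r1", "r1", "r2", "r2", "r2")
p_cpg  <- c(0.01, 0.02, 0.03, 0.40, 0.60, 0.80)

# One aggregate p-value per region, then FDR correction across regions
p_region     <- tapply(p_cpg, region, fisher_combine)
p_region_fdr <- p.adjust(p_region, method = "BH")

print(signif(p_region, 3))
print(signif(p_region_fdr, 3))
```

Because neighbouring CpGs are positively correlated, treating them as independent tends to make the aggregate p-values too small, which is the motivation for the correlation-aware extension.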
“Bump Hunting”
[Figure: the general workflow of bump hunting. Jaffe et al 2012 IJE]
Identifying Candidate Regions
Jaffe et al 2012 IJE
Significance of Bumps
Jaffe et al 2012 IJE
Probe Lasso
Butcher and Beck (2015)
Methods
• Probe Lasso utilises a
flexible window
(“probe-lasso”) based
on probe density to
gather neighbouring
significant-signals to
define clear DMR
boundaries
• Motivation?
• Implemented in ChAMP
Probe Lasso calculates probe spacing for each
probe in the dataset; these data are binned
into one of the 28 genetic/epigenetic
categories (i.e., 7 gene features × 4 CGI
relations)
Probe Lasso
• Specify lassoStyle and
lassoRadius
• If lassoStyle = max, the
probe-lasso sizes will be at
most 2 × lassoRadius bp
• If lassoStyle = min, the
probe-lassos will be at least
2 × lassoRadius bp
• Probe Lasso identifies the
genetic/epigenetic category
that conforms to user-
specified maximum (or
minimum) lassoRadius and
derives the quantile at
which it occurs
• Derived quantile is then
applied to each
genetic/epigenetic
distribution of probe
spacings to create probe-
lassos that vary according
to genetic/epigenetic-
feature
An example quantile distribution of probe spacing for each gene/CGI
feature. The black horizontal and vertical dashed lines indicate the
quantile (43rd) that results from choosing a maximum lasso size of
2000 bp
Probe Lasso
• Results in 28 dynamic window
sizes (‘probe-lassos’) that are
thrown around each
significantly-associated probe
• If these lassos capture a user-
specified number of significant
probes, that probe’s lasso
boundaries are retained
(minSigProbesLasso)
• Overlapping- and neighbouring-
lasso boundaries less than a
user-specified distance apart
are then merged to define DMR
boundaries (minDmrSep)
• All probes in the dataset are
then binned into the DMRs and
their p-values combined for the
DMR, weighted by the
underlying correlation structure
of probe methylation values

 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 

Dernier (20)

Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 

Data analysis

  • 2. Interrogating Methylation Possible impacts on DNA methylation: • Goal: capturing biologically meaningful variation • Site-specific (Joubert et al 2012 EHP) • Regional (Jaffe et al 2012 IJE) • Global (Christensen et al 2012 PLoS Genet)
  • 3. Interrogating Methylation • When might we expect: • Global methylation changes? • Regional methylation changes? • Site-specific methylation changes? Any good examples that people can think of?
  • 4. Confidence in Methylation Association (Michels et al. (2013) Nature Methods Reviews)
      Criterion | Increase confidence | Decrease confidence
      Statistical significance | Reaches genome-wide significance | Does not meet predefined significance threshold that takes into account multiple testing
      Effect size | Large (>10% difference) | Small (<5% difference)
      Bias and confounding | Bias reduced by design or controlled for in the analyses | Bias or uncontrolled confounding may exist and explain the differences observed
      Genomic location | Differential methylation is in a region that may impact regulation of transcription | Current knowledge cannot explain the influence of the observed difference in methylation at that locus on transcription
      Functional relevance | Affects expression | Does not affect expression
      Biological relevance | Gene codes for known biological function | Biological relevance of DMR location unknown
      Validation | Replicated in an independent human cohort or animal model using a different technique | No validation of results attempted or results are not replicated in a validation study
  • 5. Data Considerations: Methylation Distribution Example histograms with normal density curves ● Distributions may be normal, skewed or other-modal ○ important at analysis stage
  • 6. Data Issues: Heteroscedasticity ● Variance across the spectrum of methylation is not equal ● Extreme values of methylation show reduced variance compared to intermediate values [Figure panels: Illumina 27k Chip; Illumina 450k Chip. Du et al. BMC Bioinformatics (2010)]
  • 7. Beta-value Transformation ● Many researchers opt for the M-value: $M = \log_2\left(\frac{\beta}{1-\beta}\right)$ (Du et al. BMC Bioinformatics (2010)) ● Simple to convert in R ● Some R analysis packages have this option built in, e.g. CpGassoc ● Alternative: arcsine (variance-stabilising) transformation, $Y = \arcsin(\sqrt{\beta})$ (Lin et al. Nucleic Acids Research (2008))
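As a minimal sketch of the conversion described on this slide, the base-R snippet below applies the M-value and arcsine transformations to a handful of made-up beta-values. The vector `beta` and the helper names `beta_to_m` and `m_to_beta` are illustrative assumptions, not part of the deck or of any package.

```r
# Hypothetical beta-values for five CpGs (illustrative numbers, not real data)
beta <- c(0.02, 0.25, 0.50, 0.75, 0.98)

# M-value: log2 ratio of the methylated to the unmethylated proportion
beta_to_m <- function(b) log2(b / (1 - b))

# Inverse transform, handy for reporting results back on the beta scale
m_to_beta <- function(m) 2^m / (1 + 2^m)

m <- beta_to_m(beta)

# Arcsine square-root transformation as an alternative variance stabiliser
y_arcsine <- asin(sqrt(beta))

round(m, 2)
```

In practice beta-values of exactly 0 or 1 give infinite M-values, so a small offset or truncation is usually needed before converting.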
  • 8. What to Use? • Beta-value has a more intuitive biological interpretation • M-value is more statistically valid for the differential analysis of methylation levels • One possibility is to use the M-value for conducting the differential methylation analysis and to include the Beta-value statistics when reporting the results to investigators • A drawback of the M-value is that it may flag statistically significant but very small absolute changes in methylation
  • 9. Global Variation in Methylation Exploration of the Major Determinants of Methylation
  • 10. Clustering Analysis • Purpose of clustering is to organize objects into groups such that the objects in a group are more similar to each other than to objects in different groups • Unsupervised clustering of DNA methylation data is often used for the identification of: • Methylation subgroups • Groups of samples with a similar methylation profile across a collection of CpGs • Some options often used for methylation data • Non-parametric clustering • K-means (requires pre-specification of the number of classes) • Principal Component Analysis (PCA) • Semi-parametric • Recursively partitioned mixture model (RPMM) (Houseman et al. 2008)
  • 11. Clustering Analysis • Clustering can use various linkage and distance methods • Distance quantifies dissimilarity between sample data • The linkage method is used when deciding the distance for observations that have already been merged together • i.e. choosing what point in a cluster to measure the inter-cluster distance from
  • 12. Distance Metrics Distance quantifies dissimilarity between sample data • Euclidean: square root of the sum of squared attribute differences • Shortest (straight-line) distance • Manhattan: sum of the absolute differences of the corresponding components • the distance that would be traveled to get from one data point to the other if a grid-like path is followed
  • 13. Linkage Types of linkage: • Complete: defines the distance between two clusters to be the maximum distance between their individual components • Average: the mean similarity of one cluster to another • Median: the median similarity of one cluster to another; results are usually similar to those from average linkage [Figure: dendrograms under average, complete, and single linkage; a dendrogram displays a hierarchical relationship]
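The distance and linkage choices above can be tried out with base R's `dist()` and `hclust()`. This is only a sketch on simulated numbers; the matrix `x`, its dimensions, and the choice of two clusters are assumptions made for illustration.

```r
set.seed(1)
# Toy matrix: 6 samples (rows) by 4 CpGs (columns), values in [0, 1]
x <- matrix(runif(24), nrow = 6, dimnames = list(paste0("s", 1:6), NULL))

# Two distance metrics between samples
d_euc <- dist(x, method = "euclidean")
d_man <- dist(x, method = "manhattan")

# Same distances, different linkage rules
hc_complete <- hclust(d_euc, method = "complete")
hc_average  <- hclust(d_euc, method = "average")
hc_single   <- hclust(d_euc, method = "single")

# Cut the average-linkage tree into two groups
groups <- cutree(hc_average, k = 2)

# plot(hc_average) would draw the dendrogram
```

Comparing `cutree()` results across the three linkage objects is a quick way to see how sensitive the grouping is to the linkage rule.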
  • 14. Evaluating Classifier Performance • Supervised classification: compare the labels assigned by the algorithm with the known target labels • e.g. area under the curve, sensitivity and specificity • How well have we labelled the input data according to the target labels? • Clustering: measure the relation between the elements of each class rather than agreement with the given label names • Adjusted Rand Index (ARI) evaluates how well an algorithm separates the elements belonging to different classes • ARI near 1 indicates high agreement • ARI near 0 indicates agreement no better than chance; negative values indicate less agreement than expected by chance • Can have any number of groups (unlike sensitivity and specificity)
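To make the Adjusted Rand Index concrete, here is a small base-R sketch computed from the contingency table of two labelings. The function name `adjusted_rand_index` and the toy label vectors are assumptions for illustration; packaged implementations exist, but none is assumed here.

```r
# Adjusted Rand Index from the contingency table of two labelings
adjusted_rand_index <- function(labels_a, labels_b) {
  tab <- table(labels_a, labels_b)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))            # pairs placed together in both
  sum_rows  <- sum(choose(rowSums(tab), 2))   # pairs together in labeling A
  sum_cols  <- sum(choose(colSums(tab), 2))   # pairs together in labeling B
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

truth <- c(1, 1, 1, 2, 2, 2)

# Same partition with different label names: perfect agreement
ari_same <- adjusted_rand_index(truth, c("b", "b", "b", "a", "a", "a"))

# A partition that cuts across the true classes: slightly below zero
ari_mixed <- adjusted_rand_index(truth, c(1, 2, 1, 2, 1, 2))
```

Because the index works on pairs of samples, the label names themselves do not matter, which is why it suits unsupervised clustering.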
  • 15. Principal Component Analysis • Principal components are found by calculating the eigenvectors and eigenvalues of the data covariance matrix • The eigenvector with the largest eigenvalue is the direction of greatest variation; the one with the second largest eigenvalue is the (orthogonal) direction with the next highest variation • The directions U and V are principal components • Orthogonal to each other
  • 16. Principal Component Analysis Genes mirror geography within Europe. Novembre et al. Nature (2008) First two principal components of genetic variation in a sample of 3,000 European individuals genotyped at over half a million variable DNA sites
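The eigen-decomposition view of PCA can be checked against `prcomp()` in base R. The sketch below uses simulated data; the matrix `x` and the induced correlation between its first two columns are assumptions chosen only so that one direction clearly dominates.

```r
set.seed(42)
# Simulated data: 50 samples by 4 variables, with columns 1 and 2 correlated
x <- matrix(rnorm(200), nrow = 50, ncol = 4)
x[, 2] <- x[, 1] + rnorm(50, sd = 0.3)

# Principal components as eigenvectors of the covariance matrix
eig <- eigen(cov(x))

# The same analysis through prcomp (centered, not scaled)
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# Eigenvalues equal the variances of the principal components
variances_match <- isTRUE(all.equal(eig$values, pca$sdev^2))

# The first two directions are orthogonal (inner product near zero)
inner_12 <- sum(eig$vectors[, 1] * eig$vectors[, 2])
```

Eigenvectors are only defined up to sign, so the loadings from `eigen()` and `prcomp()` may differ by a factor of -1.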
  • 17. Recursively Partitioned Mixture Model (RPMM) • General procedure: divide samples based on methylation profile using a mixture of beta distributions to recursively split samples via 2-class models, with the Bayesian information criterion (BIC) used at each potential split to decide whether the split is maintained or abandoned • Result: K classes, representing K terminal nodes, and posterior probabilities of class membership for the samples
      library(RPMM)
      data(IlluminaMethylation)
      rpmm <- blcTree(IllumBeta)
      ProbClassMembership = blcTreeLeafMatrix(rpmm)
  • 18. Recursively Partitioned Mixture Model (RPMM) • Note: No closed form MLE for the parameters of the beta distribution, so fitting is computationally intensive • M-value transformation and fitting a Gaussian RPMM tends to produce similar results
      expit2 <- function(x) log2(x)-log2(1-x)
      RPMMSolution = glcTree(expit2(IllumBeta))
      par(mfrow=c(2,1))
      plotTree.blcTree(rpmm, labelFunction=function(u,digits) table(as.character(tissue[u$index])))
      title("Dendrogram using Beta Distribution")
      plotTree.blcTree(RPMMSolution, labelFunction=function(u,digits) table(as.character(tissue[u$index])))
      title("Dendrogram using Gaussian Distribution")
  • 20. Recursively Partitioned Mixture Model (RPMM) Clustering Samples- identify similar global methylation profiles Tissue-specific DNA methylation dependent upon CpG island context. Christensen et al. PLoS Genetics (2009) Methylation is yellow for unmethylated and blue for methylated Methylation profile classes significantly differentiate all normal tissue types (n = 217, P<0.0001)
  • 21. Recursively Partitioned Mixture Model (RPMM) Clustering CpGs - examine classes of CpGs with similar methylation profiles Tissue-specific DNA methylation dependent upon CpG island context. Christensen et al. PLoS Genetics (2009) Mean regression coefficients for age associated methylation (by decade), and its 95% confidence interval from GEE for each CpG RPMM class • CpGs clustered with RPMM into eight classes for each group of samples • The bottom plot indicates the CpG island status for each locus
  • 22. Summary • Benefits to analysis of global changes • Identify major determinant of methylation profile • Suggests something about the nature of epigenetic regulation associated with the exposure or phenotype of interest • Drawbacks • Major determinants may not be of biological interest • Sources of major variation tend to be batch effects and tissue-specific differences
  • 24. Objectives of Model Building • Interest may be on the association between a response and one or two important risk factors • The estimates are not subject to confounding • We are not oversimplifying these associations by ignoring important effect modification • Interest may be prediction • The set of regressors that best minimize the prediction error • Identifying the important independent predictors of an outcome
  • 25. Association Models • Many ways to analyze methylation in R • Linear, generalized linear models (logistic, Poisson, etc), mixed models, failure time models, Cox PH, etc • Not going to expand on these models and modeling assumptions, beyond the scope of this workshop • For association models, covariates included should be based on subject matter knowledge • Goal is to reduce bias • R packages have been developed for efficient linear analysis of microarray data
  • 26. Linear Models for each Gene
      Model: $E(y_j) = \beta_{j0} + \beta_{j1} X$
      • The linear model for CpG $j$ has residual variance $\sigma_j^2$, with sample value $s_j^2$ and degrees of freedom $f_j$
      • The unscaled standard deviation for the $k$th covariate (the covariate of interest) is $u_{jk}$
      • Standard t-statistic: $t_{jk} = \frac{\hat{\beta}_{jk}}{u_{jk} s_j} = \frac{\hat{\beta}_{jk}}{se(\hat{\beta}_{jk})}$, with $f_j$ degrees of freedom
  • 27. Limma and Empirical Bayes • Limma is a package in R that takes advantage of the information across genes (or CpGs) to calculate a moderated t-statistic • Uses a Bayesian approach to shrink the estimated sample variances towards a pooled estimate • Results in more stable inference when the sample size is small • However, must consider the potential impact of samples that seem to be global outliers • Microarray may have failed
  • 28. Limma and Empirical Bayes • Uses a Bayesian approach to shrink the estimated sample variances towards a pooled estimate
      • Posterior residual variance: $\tilde{s}_j^2 = \frac{f_0 s_0^2 + f_j s_j^2}{f_0 + f_j}$
      • Prior value $s_0^2$ and prior degrees of freedom $f_0$
      • Moderated t-statistic: $\tilde{t}_{jk} = \frac{\hat{\beta}_{jk}}{u_{jk} \tilde{s}_j}$, with $f_0 + f_j$ degrees of freedom
      • The extra $f_0$ degrees of freedom represent the extra information borrowed from all the interrogated sites for inference about each individual gene
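A worked example may help show what the shrinkage does. The sketch below plugs made-up numbers into the posterior-variance and moderated-t formulas on this slide. The prior values `s0_2` and `f0` are simply assumed here; limma estimates them from the data, and nothing below calls limma itself.

```r
# Hypothetical per-CpG sample variances from a 3 vs 3 comparison
s2  <- c(0.010, 0.200, 0.050)
f_j <- 4                      # residual df: 6 samples minus 2 coefficients

# Assumed prior variance and prior degrees of freedom (illustrative only)
s0_2 <- 0.05
f0   <- 3

# Posterior variance: weighted average of prior and sample variance
s2_post <- (f0 * s0_2 + f_j * s2) / (f0 + f_j)

# Same estimated effect for each CpG; unscaled SD for two groups of three
beta_hat <- c(0.3, 0.3, 0.3)
u <- sqrt(1 / 3 + 1 / 3)

t_ordinary  <- beta_hat / (u * sqrt(s2))
t_moderated <- beta_hat / (u * sqrt(s2_post))

# Moderated p-values use the augmented degrees of freedom
p_moderated <- 2 * pt(-abs(t_moderated), df = f0 + f_j)
```

The CpG with the unusually small variance has its t-statistic pulled down, and the one with the unusually large variance has it pulled up, which is the stabilising effect described above.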
  • 29. Limma and Empirical Bayes Running limma in R:
      library(limma)
      tempmod<-model.matrix(~group,data=pheno)
      fit <- lmFit(methyldata, tempmod)
      fit <- eBayes(fit)
      topTable(fit)
  • 30. Multiple Testing • Assume we have 450,000 tests, each test is independent, and we specify our type 1 error to be 0.05 • Type I Error: probability of rejecting H0 given H0 is true • i.e. a false positive • Under the null we would expect 450,000*0.05 loci to have p<0.05 • This is 22,500 CpG loci • Need to correct for multiple comparisons
  • 31. Types of Errors
      | Null True | Alternative True | Total
      Not called Significant | U | T | m − R
      Called Significant | V | S | R
      Total | m0 | m − m0 | m
      V = # of Type 1 errors (false positives)
  • 32. Controlling for the Family Wise Error Rate (FWER) • Let T1…TK be K independent tests of the null hypotheses H1…HK • Family wise error rate (FWER) = the probability of rejecting at least one null hypothesis Hi that is true • $FWER = P(V \geq 1)$ • Bonferroni procedure: use significance level α/K • Very conservative: can increase number of false negatives • More efficient methods possible
      adjustedpvals = p.adjust(Pvalues, method = "bonferroni")
  • 33. False Discovery Rate (FDR) • False discovery rate (FDR): the expected proportion of Type I errors among the rejected hypotheses • $FDR = E\left[\frac{V}{R} \mid R > 0\right] P(R > 0)$ • To control the FDR at level δ=0.05 • Order the unadjusted p-values: p1 ≤ p2 ≤ … ≤ pm • Then find the test with the highest rank, j, for which the p-value, pj, is less than or equal to (j/m) × δ • $p_{(j)} \leq \delta \frac{j}{m}$ • Declare the tests of rank 1, 2, …, j as significant
  • 34. False Discovery Rate (FDR) • q-value: the minimum FDR that can be attained when calling that “feature” significant • Expected proportion of false positives incurred when calling that feature significant • If a CpG has a q-value of 0.04, about 4% of the CpGs with a p-value at least as small as that CpG's are expected to be false positives
      adjustedpvals = p.adjust(Pvalues, method = "fdr")
  • 35. Permutation Tests • Does not assume that tests are independent • Procedure:
      1. For the tests of the M CpG loci, order the unadjusted p-values: $p_1 \leq p_2 \leq \dots \leq p_M$
      2. Permute the outcome within “exchangeable” sets and refit the regression models • Must permute within strata if there are any stratifying variables
      3. Order the p-values from the M regression models: $p_1^r \leq p_2^r \leq \dots \leq p_M^r$
      4. Repeat steps 2 and 3 R times (e.g. R=1000 permutations)
      5. Adjusted p-value: $\tilde{p}_j = \frac{\#\{r : p_j^r \leq p_j\}}{R}$
      • NOTE: very computationally intensive
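The step-up rule above can be written out by hand and compared with `p.adjust()`. This is a sketch with ten invented p-values; the vector `pvals` and the variable names are assumptions for illustration. In base R, `method = "BH"` and `method = "fdr"` refer to the same procedure.

```r
# Ten hypothetical unadjusted p-values
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)
delta <- 0.05
m <- length(pvals)

# Rank the p-values and compare each with its threshold (j / m) * delta
ord <- order(pvals)
thresholds <- delta * seq_len(m) / m
passing <- which(pvals[ord] <= thresholds)

# Highest rank that passes; every test up to that rank is declared significant
j <- if (length(passing) > 0) max(passing) else 0
significant <- rep(FALSE, m)
if (j > 0) significant[ord[seq_len(j)]] <- TRUE

# The same decisions via adjusted p-values
same_decisions <- identical(significant, p.adjust(pvals, method = "BH") <= delta)
```

Note that several raw p-values below 0.05 are not declared significant once the rank-dependent thresholds are applied.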
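The procedure above might be sketched as follows on simulated data. To keep the example short it uses a two-sample t-test per CpG instead of refitting regression models, only R = 200 permutations, and no stratification; the data, the planted effect at the first CpG, and all object names are assumptions for illustration.

```r
set.seed(7)
n <- 20; M <- 50; R <- 200
group <- rep(c(0, 1), each = n / 2)

# Simulated methylation matrix (M CpGs by n samples) with one planted effect
methyl <- matrix(rnorm(M * n), nrow = M)
methyl[1, group == 1] <- methyl[1, group == 1] + 4

# Per-CpG p-values for a given group assignment
pvals_for <- function(g) {
  apply(methyl, 1, function(y) t.test(y[g == 1], y[g == 0])$p.value)
}

# Step 1: observed p-values, ordered
obs <- pvals_for(group)
ord <- order(obs)
obs_sorted <- obs[ord]

# Steps 2 to 4: permute the outcome, recompute, order, and compare rank-wise
exceed <- numeric(M)
for (r in seq_len(R)) {
  perm_sorted <- sort(pvals_for(sample(group)))
  exceed <- exceed + (perm_sorted <= obs_sorted)
}

# Step 5: adjusted p-values, mapped back to the original CpG order
adj <- numeric(M)
adj[ord] <- exceed / R
```

Each permutation keeps the correlation among CpGs intact, which is why the approach does not need the tests to be independent.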
  • 36. Additional Considerations: Influential Points • Leverage points: x-outliers with the potential to exert undue influence on regression coefficient estimates • Influential points: points that have exerted undue influence on the regression coefficient estimates
  • 37. Additional Considerations: Influential Points • Important to check global distributions to see if there is a potential outlier • Among site specific tests • Plotting the association, does the association appear to be driven by only a few samples? • If a parametric test was used, is the non-parametric test significant? • Is the association still significant when the outliers are removed? • Important follow-up question: does it seem to be a technical outlier or biologic outlier?
  • 38. Prediction models Goals • Want the most parsimonious model • Variance of the predictions increases as the number of regressors increases • Estimation problems may occur with too many variables (multicollinearity) • Do not want the model overly simplistic = biased estimates
  • 39. Prediction Models in the Context of DNA Methylation Studies • Number of CpGs>>>the number of individuals • If trying to predict outcome based on methylation profiles, too many sites to model each individually • Many sites are correlated – either due to proximity or shared regulation of biological pathway • Become redundant information in a prediction model
  • 40. Bias-variance tradeoff As the model complexity increases the model becomes more specific to the training set and less generalizable to the test set Hastie, Tibshirani & Friedman. Elements of Statistical Learning (2013 ed.10)
      $MSE(\hat{\mu}) = E\left[(\hat{\mu} - \mu)^2\right]$
      $MSE(\hat{\mu}) = Var(\hat{\mu}) + Bias(\hat{\mu})^2$
  • 41. Variable Selection • Forward Selection • Start with model with only an intercept • Add variable that lowers AIC the most • Repeat until none of the remaining variables meet the minimum requirement for inclusion • Once a variable is included, it cannot be removed • Backward Elimination • Start with all of the variables in the model • Eliminate the least statistically significant variable which does not meet the criteria for remaining in the model • Largest p-value or largest decrease in AIC • Repeat until all the remaining variables meet the criterion for inclusion • Once a variable is excluded, it stays out of the model
  • 42. Variable Selection • Stepwise Selection • Start with a model with only an intercept • Include the variable with the smallest p-value less than the specified significance level • or largest decrease in AIC • Re-evaluate the contribution of each variable after each step and delete any which no longer meet the minimum criteria for staying in the model • The stepwise selection process ends if: • No further variable can be added to the model • Or if the variable just entered into the model is the only one eliminated in the subsequent backward elimination
  • 43. Shrinkage Methods • Forward/backward/stepwise selection includes or excludes a variable completely • Provides a more interpretable model, and possibly lower prediction error than the full model • Shrinkage methods constrain the size of regression coefficients by imposing a penalty • Penalty introduces bias into analysis to reduce variance
  • 44. Shrinkage Methods Ridge regression
      $\hat{\beta}^{ridge} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
      • Bias increases as $\lambda$ increases • Variance decreases as $\lambda$ increases • Theory is that there is a $\lambda$ such that the MSE of the ridge regression is less than the MSE of the linear model
  • 45. Shrinkage Methods Lasso regression
      $\hat{\beta}^{lasso} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
      • Similar to ridge except that the penalty is the sum of the absolute values of the parameters ($l_1$ penalty), whereas ridge uses the sum of squared parameters ($l_2$ penalty) • For ridge none of the parameter estimates go to zero (unless $\lambda = \infty$), so there is no variable selection • Parameters go to zero in lasso, so there is variable selection • Selects at most as many variables as there are subjects • Does not perform grouped selection: will only select one of a set of correlated variables
  • 46. Shrinkage Methods Elastic Net regression
      $\hat{\beta}^{enet} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$
      • Combination of the ridge and lasso penalties • Removes the limitation on the number of selected variables • Allows correlated variables to enter together
  • 47. Estimating Prediction Error • If there is enough data, ideally split it into a training and a test data set • Often cross-validation is used instead to estimate the prediction error • K-fold cross validation splits the data into K approximately equal parts • Fits the model to the other K-1 parts and then calculates the prediction error of the model when predicting the kth part of the data • Performed on all K parts and the prediction error estimates combined Hastie, Tibshirani & Friedman. Elements of Statistical Learning (2013 ed.10)
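To illustrate how the penalty shrinks coefficients, the sketch below uses the standard closed-form ridge solution, $(X^T X + \lambda I)^{-1} X^T y$, on simulated data in base R. The data, the true coefficients, and the function name `ridge_coef` are assumptions for illustration; predictors are standardised and the outcome centered so that no intercept is needed.

```r
set.seed(3)
n <- 40; p <- 5

# Simulated, standardised predictors and a centered outcome
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.numeric(X %*% c(2, -1, 0, 0, 0.5) + rnorm(n))
y <- y - mean(y)

# Closed-form ridge solution for a given penalty lambda
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_ols   <- ridge_coef(X, y, 0)      # lambda = 0 is ordinary least squares
b_small <- ridge_coef(X, y, 10)
b_large <- ridge_coef(X, y, 1000)

# Squared length of the coefficient vector shrinks as lambda grows
coef_norms <- sapply(list(b_ols, b_small, b_large), function(b) sum(b^2))
```

Even with a large penalty the coefficients are shrunk towards zero but not set exactly to zero, which is the contrast with lasso drawn on the following slide.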
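A minimal K-fold cross-validation loop in base R is sketched below. The simulated data frame `dat`, the simple linear model, and the choice K = 5 are assumptions made for illustration; a penalised model would slot into the same loop in place of `lm()`.

```r
set.seed(11)
n <- 60

# Simulated data with one predictor
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Randomly assign each observation to one of K folds of equal size
K <- 5
fold <- sample(rep(seq_len(K), length.out = n))

fold_mse <- numeric(K)
for (k in seq_len(K)) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]
  fit <- lm(y ~ x, data = train)                 # fit on the other K-1 parts
  pred <- predict(fit, newdata = test)           # predict the held-out part
  fold_mse[k] <- mean((test$y - pred)^2)
}

# Combine the fold-level estimates into one prediction-error estimate
cv_error <- mean(fold_mse)
```

Any variable selection or tuning of a penalty should happen inside the loop on the training folds only, otherwise the error estimate is optimistic.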
  • 48. Estimating Cellular Age with DNA Methylation Data • DNA methylation age of human tissues and cell types. Horvath. Genome Biology (2013) • In a training data set, chronological age was regressed on the CpGs using elastic net • Difference between methylation-predicted age and chronological age (Δage) put forth as an index of disproportionate ‘biological’ aging • Δage has been found to be associated with all cause mortality • Marioni et al. Genome Biology (2015)
  • 49. Estimating Cellular Age with DNA Methylation Data Marioni et al. Genome Biology (2015)
  • 50. Final Considerations for Prediction Modeling • Standard errors of regression coefficients are biased low • Not taking into account that we sorted through other variable choices • p-values that are too small • severe multiple testing problems • Choice of variables in the final model are heavily influenced by sampling • When variables are correlated, which one enters model (and which is excluded) is relatively random • HOWEVER, the purpose of prediction models is not causal inference • Should not care about the estimates of specific parameters or if a causal variable or a correlated surrogate enters the model
  • 51. Notes on Efficiency in R • If there is a closed form for a model, vectorize rather than loop
      • Slow method
      # one lm() fit per CpG (row of methyl); keep the p-value for exposure
      storing <- matrix(NA, ncol=1, nrow=nrow(methyl))
      for(i in 1:nrow(methyl)){
        storing[i,] <- summary(lm(methyl[i,]~exposure))$coef[2,4]
      }
      • Fast method
      # a single multi-response fit; standard errors from the closed form
      obj <- lm(t(methyl)~exposure)
      modelM <- model.matrix(~exposure)
      XXI <- solve(t(modelM)%*%modelM)
      dof <- obj$df.residual
      sigma <- sqrt(colSums(obj$residuals^2)/dof)   # residual SD per CpG
      est <- obj$coefficients[2,]
      # se = sqrt(XXI[2,2]) * sigma
      Pval <- 2*pt(-abs(est/(sqrt(XXI[2,2])*sigma)),dof)
  • 52. Regional Analysis Taking advantage of the correlation structure between loci
  • 53. Differentially Methylated Regions • Loci in close proximity tend to have correlated levels of methylation • We can exploit this correlation structure to increase our power to detect changes in methylation • May have more confidence in our results if we see a regional change vs a very site-specific association Why might this be? PMID: 2147404
  • 54. Summarizing Methylation Across Region • Implemented in IMA • For each specific region, IMA will collect all the targeted loci within it and derive an index of overall region-level methylation value • Three different index metrics implemented in IMA: mean, median and Tukey's Biweight robust average • Between group comparisons then performed on region-level methylation estimates
      beta = dataf@bmatrix
      betar = indexregionfunc(indexlist=dataf@TSS1500Ind, beta=beta, indexmethod="median")
      TSS1500testALL = testfunc(eset = betar, testmethod="limma", Padj="BH", concov="OFF", groupinfo = dataf2@groupinfo, gcase ="g2", gcontrol=c("g1","g3"), paired = FALSE)
      TSS1500test = outputDMfunc(TSS1500testALL, rawpcut=0.05, adjustpcut=0.05, betadiffcut=0.14)
      TSS1500test[10:20,]
  • 55. Aggregate P-values Across Predefined Regions • Uncorrected, CpG-specific P values within a given region are combined using an extension of Fisher's method • Uses weighted inverse chi-square method for correlated significance tests • Results in a single aggregate P value for each region • Aggregate P values are subjected to multiple-testing correction using the FDR method • Implemented in RnBeads
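As a simplified illustration of combining CpG-level p-values within a region, the sketch below applies the plain Fisher method in base R. It assumes independent tests, so it is not the weighted extension for correlated tests mentioned above, and it does not call RnBeads. The region names, p-values, and the function name `fisher_combine` are invented for the example.

```r
# Plain Fisher combination: -2 * sum(log p) ~ chi-square with 2k df
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# Hypothetical CpG-level p-values grouped by predefined region
region_p <- list(
  regionA = c(0.01, 0.03, 0.02, 0.04),
  regionB = c(0.40, 0.75, 0.22)
)

# One aggregate p-value per region, then FDR correction across regions
agg <- sapply(region_p, fisher_combine)
agg_fdr <- p.adjust(agg, method = "BH")
```

With correlated neighbouring CpGs the plain method is anti-conservative, which is the motivation for the weighted extension named on the slide.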
  • 57. “Bump hunting” Jaffe et al 2012 IJE The general workflow for bump-hunting
  • 59. Significance of Bumps Jaffe et al 2012 IJE
  • 60. Probe Lasso Butcher and Beck (2015) Methods • Probe Lasso utilises a flexible window (“probe-lasso”) based on probe density to gather neighbouring significant-signals to define clear DMR boundaries • Motivation? • Implemented in ChAMP Probe Lasso calculates probe spacing for each probe in the dataset; these data are binned into one of the 28 genetic/epigenetic categories (i.e., 7 gene features × 4 CGI relations)
  • 61. Probe Lasso • Specify lassoStyle and lassoRadius • If lassoStyle = max, the probe-lasso sizes will be at most 2 × lassoRadius bp • If lassoStyle = min, the probe-lassos will be at least 2 × lassoRadius bp • Probe Lasso identifies the genetic/epigenetic category that conforms to the user-specified maximum (or minimum) lassoRadius and derives the quantile at which it occurs • Derived quantile is then applied to each genetic/epigenetic distribution of probe spacings to create probe-lassos that vary according to genetic/epigenetic feature An example quantile distribution of probe spacing for each gene/CGI feature. The black horizontal and vertical dashed lines indicate the quantile (43rd) that results from choosing a maximum lasso size of 2000 bp
  • 62. Probe Lasso • Results in 28 dynamic window sizes (‘probe-lassos’) that are thrown around each significantly-associated probe • If these lassos capture a user-specified number of significant probes, that probe’s lasso boundaries are retained (minSigProbesLasso) • Overlapping and neighbouring lasso boundaries less than a user-specified distance apart are then merged to define DMR boundaries (minDmrSep) • All probes in the dataset are then binned into the DMRs and their p-values combined for the DMR, weighted by the underlying correlation structure of probe methylation values