O.M.GSEA
An introduction to ‘classic’
Gene Set Enrichment Analysis methodology
Shana White
2016
Overview
• Preliminaries
• Genes & gene sets
• Gene expression and enrichment
• GSEA
• Introduction & Example from Publication
• Experimental conditions
• Purpose of GSEA
• Background on methodology
• Gene ranking
• Enrichment Scores & Plots
• Assessing significance
Fundamentals of Genes, Genomes
• Genome = Collection of DNA sequences [across all
chromosomes of a species].
• Genes = Subset of genome that ‘codes’ for a protein
Image: http://cs.stanford.edu/people/eroberts/cs181/projects/2010-11/Genomics/accuracy.html
Genes & Genotype
• Individuals of the same species share the
same genome; have the same set of genes
• SNPs (single nucleotide polymorphisms) allow for
genetic variation
• Genotype = the DNA (nucleotide) sequence an individual carries at a gene
• “Central Dogma” of genetics: DNA → RNA → protein
Image 1: http://cs.stanford.edu/people/eroberts/cs181/projects/2010-11/Genomics/accuracy.html
Image 2: http://www.lhsc.on.ca/Patients_Families_Visitors/Genetics/Inherited_Metabolic/Mitochondria/DiseasesattheMolecularLevel.htm
Genotype vs. Gene Expression
• Genotype is the same in all somatic cells of an organism
• Gene expression [rate at which genes are transcribed] is different to varying degrees
• Random fluctuations within same types of cells are expected
• Significant differences → significantly different cells
Gene Expression Data
• RNA-seq (“Next-Generation sequencing”) simultaneously
generates data for both genotype and gene expression
• Experimental set-up is typically cases vs. controls
• Cases: phenotype is induced experimentally [pure experiment]
• Controls: represent ‘baseline’ gene expression for comparison
• Quantify transcriptome of each sample
• Compare ‘fold change’ for expression of cases compared to controls
• FC = (expression in cases)/(expression in controls)
• Determine significance of individual gene up/down regulation
• Results of differential expression analysis or additional analyses often
visualized with a “heatmap”…
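The fold-change bullet above can be sketched in a few lines of R. This is a minimal illustration on made-up numbers; the matrix `expr`, the `group` vector, and the gene names are hypothetical and not taken from the experiment discussed later.

```r
# Toy expression matrix: rows = genes, columns = samples (values are made up).
expr <- matrix(
  c(10, 12, 11, 40, 44, 42,   # GeneA: higher in cases
    30, 28, 32, 15, 14, 16,   # GeneB: lower in cases
    20, 21, 19, 20, 22, 18),  # GeneC: unchanged
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB", "GeneC"),
                  c("ctrl1", "ctrl2", "ctrl3", "case1", "case2", "case3"))
)
group <- c("control", "control", "control", "case", "case", "case")

# Mean expression per gene within each group.
mean.cases    <- rowMeans(expr[, group == "case"])
mean.controls <- rowMeans(expr[, group == "control"])

# FC = (expression in cases) / (expression in controls).
# Taking log2 makes up- and down-regulation symmetric around zero.
fc     <- mean.cases / mean.controls
log2fc <- log2(fc)

round(cbind(fc, log2fc), 3)
```

A fold change of 0.5 and a fold change of 2 are equally large effects in opposite directions, which is why log2 fold changes (here -1 and +1) are usually what gets plotted.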
Quick teaching note on heatmaps…
• Very common visualization tool in genetic research
• Incorporate genotype and/or gene expression data
• Support and/or generate hypotheses
• Organization of rows/columns determined by experimental design
• Range from relatively simple to complex [in terms of interpretation]
• Heatmap for purposes of this discussion:
• Columns → Samples (cases, then controls)
• Rows → Genes
• Color → Direction/Intensity of expression
• Red: Higher than row average
• Blue [or green]: Lower than row average
“Enrichment” & GSEA
• Results of individual genes
• Dictionary(.com) definition of enrichment:
• “act of making fuller or more meaningful or rewarding”
• Gene set enrichment
• Gene sets are predefined in the literature and/or in databases:
• Grouped by information regarding gene function, pathway membership, etc.
• Gene sets are ‘enriched’ if experimental findings are in accordance with
the set of interest [with hope of adding meaning to results]
• Definition not always obvious
• Good resource: “An introduction to effective use of enrichment analysis software”
• Gene Set Enrichment Analysis
• Statistical methods determine significance of enrichment for gene set by
comparing distribution of genes in set to ‘background distribution’
First Use in Publication (2003)
Paper describing methodology (2005)
Citations as of:
November 2015 - 8,349
November 2016 - 10,057
Quick review…
• GSEA is a common ‘secondary analysis’ after gene
expression data has been collected
• Gene sets can be determined a-priori specific to an experiment (as
in example that follows) or
• Multiple gene-sets from databases can be used in a data-mining
fashion to support or generate hypotheses
• Implications of multiple testing (beyond scope of presentation)
• Good to know the basics
• GSEA still a common request of bioinformaticians
• “Newer/better” methods build on or refer to GSEA
• Goal for remainder of presentation:
• Use example from recent publication to elucidate basic concepts
and terminology
• Go into further detail for statistical methodology related to GSEA
SPEM Article (Published Sep. 2016)
• Experimental setup:
• 2 groups of mice, balanced design ($n_i = 4$, $i = 1, 2$)
• Mice are sacrificed, samples have been collected/processed, and
RNAseq data is available.
• Hypothesis: The stomachs of the mice in Group 1 (the treatment
group) are undergoing SPEM-mediated repair
• Our lab is asked to conduct GSEA of SPEM genes to support hypothesis
GSEA for SPEM
Group 1 = Ulcerated (x 4)
Group 2 = Uninjured (x 4)
SPEM Lists
• There is no curated SPEM pathway per se, but there are gene
sets corresponding to SPEM that have been published in the
literature.
• I have 2 gene-set lists as follows:
• SPEM [as generally observed] (list name = “SPEM”)
• SPEM in response to inflammation (list name = “SPEM_with_Inflammation”)
• Both of these lists contain genes that were previously found to
be up-regulated during SPEM
• GSEA will support the research hypothesis if upregulated expression of
SPEM-related genes is evident in the ulcerated samples compared to the
uninjured control samples.
• **Note: this presentation is intended to shed light on basic features of
GSEA and does not consider the effects of cross-talk between pathways.
Summary of basic analysis
• Methods are specific to 2-group experiment
• Inputs:
• 1 – Gene expression for ALL genes
• 2 – Phenotype information (define groups for comparison)
• 3 – List(s) of genes of interest
• Intermediary step:
• Rank genes based on differential expression
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
A note on ranking…
• Default ranking mechanism is “signal to noise ratio” (s2n)
• Formula: $S2N_i = \dfrac{\mu_i^{\mathrm{Group1}} - \mu_i^{\mathrm{Group2}}}{\sigma_i^{\mathrm{Group1}} + \sigma_i^{\mathrm{Group2}}}$
• Reflects correlation (association) of gene with phenotype in terms of size and direction
• Rank ≠ Significance of differential expression
• Likely that genes ranking very high or very low are significantly differentially expressed
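As a rough sketch of the signal-to-noise ranking, the function below computes the per-gene ratio (difference in group means over the sum of group standard deviations) and sorts the genes. The function name `s2n`, the toy matrix, and the gene names are hypothetical; the Broad implementation may apply additional adjustments to the standard deviations that this sketch leaves out.

```r
# Signal-to-noise ratio for every gene (row) of an expression matrix.
# `expr` has genes in rows and samples in columns; `class.v` marks each
# sample as belonging to group 1 or group 2.
s2n <- function(expr, class.v) {
  g1 <- expr[, class.v == 1, drop = FALSE]
  g2 <- expr[, class.v == 2, drop = FALSE]
  (rowMeans(g1) - rowMeans(g2)) / (apply(g1, 1, sd) + apply(g2, 1, sd))
}

# Toy data (made up): 3 genes, 4 samples per group.
expr <- rbind(
  GeneA = c(8, 9, 10, 9,  4, 5, 6, 5),  # higher in group 1
  GeneB = c(5, 6, 4, 5,   5, 4, 6, 5),  # no difference
  GeneC = c(2, 3, 2, 3,   7, 8, 7, 8)   # higher in group 2
)
class.v <- c(1, 1, 1, 1, 2, 2, 2, 2)

metric <- s2n(expr, class.v)
ranked <- sort(metric, decreasing = TRUE)  # the ranked list R
ranked
```

Genes associated with group 1 land at the top of the ranked list, genes associated with group 2 at the bottom, and uninformative genes in the middle.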
Summary of basic analysis (cntd.)
• S = List of genes belonging to defined gene set (independent of data)
• R = Ranked list of genes (dependent on data & method of ranking genes)
• “Given an a priori defined set of genes S, …, the goal of GSEA is to
determine whether the members of S are randomly distributed throughout
R or primarily found at the top or bottom. We expect that sets related to
the phenotypic distinction will tend to show the latter distribution.”
• Null hypothesis: Membership in S ↛ Location in R (members of S are randomly distributed throughout R)
• Alternative: Membership in S → Location in R (high or low)
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
What would ideal rank for SPEM genes ‘look’ like?
[Figure: three example gene lists (Example List 1, 2, 3) placed along the gene-rank axis. The axis runs from high expression in cases (high ‘correlation’ with ulcerated group) to low expression in cases (high ‘correlation’ with uninjured group); the middle of the ranking is uninformative.]
Summary of basic analysis
• Outputs:
• 1 – Enrichment scores and p-values (summary information)
• 2 – Enrichment plots (graphical summary)
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
Enrichment Score (ES)
(A) An expression data set sorted
by correlation with phenotype,
the corresponding heat map,
and the “gene tags,” i.e.,
location of genes from a set S
within the sorted list.
(B) Plot of the running sum for S in
the data set, including the
location of the maximum
enrichment score (ES) and the
leading-edge subset.
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
Calculation of Running Enrichment Score (RES)
• Ranked genes: $R_1, \dots, R_N$
• Ranking metric for each gene: $m_1, \dots, m_N$
• Gene set S with k elements: $s_1, \dots, s_k$
• $\mathrm{Tag}_i = 1$ if $R_i \in S$, else 0; $\mathrm{No.Tag}_i = 1$ if $R_i \notin S$, else 0
• $M = \sum_{i=1}^{N} m_i \,\mathrm{Tag}_i$ (sum of ranking metric for genes in set)
• $T = \sum_{i=1}^{N} \mathrm{No.Tag}_i$ ($N - k$; # of genes in R but not S)
• Start at $R_1$. If $\mathrm{Tag}_1 = 1$, then $\mathrm{RES}_1 = m_1 \cdot \frac{1}{M}$, else $\mathrm{RES}_1 = -\frac{1}{T}$
• Move to $R_2$. If $\mathrm{Tag}_2 = 1$, then $\mathrm{RES}_2 = \mathrm{RES}_1 + m_2 \cdot \frac{1}{M}$, else $\mathrm{RES}_2 = \mathrm{RES}_1 - \frac{1}{T}$
• For a given $R_j$,
$\mathrm{RES}_j = \sum_{i=1}^{j} \left( m_i \cdot \frac{1}{M} \right) \mathrm{Tag}_i - \sum_{i=1}^{j} \frac{1}{T}\, \mathrm{No.Tag}_i$
* At the final $R_N$, we have $\frac{M}{M} - \frac{T}{T} = 0$
GENE_ID  S2N   TAG  NO.TAG
Gene1    5.4   0    1
Gene2    5.1   1    0
Gene3    4.7   0    1
Gene4    4.3   1    0
Gene5    3.6   1    0
Gene6    3.3   0    1
Gene7    2.9   1    0
Gene8    2.2   0    1
Gene9    1.6   0    1
Gene10   0.9   0    1
Gene11   -0.1  0    1
Gene12   -0.8  0    1
Gene13   -1.2  0    1
Gene14   -1.9  0    1
Gene15   -2.6  0    1
Gene16   -2.8  0    1
Gene17   -3.3  0    1
Gene18   -3.7  0    1
Gene19   -4.2  0    1
Gene20   -4.5  0    1
M = 15.9
T = 16
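The running sum for the 20-gene example can be reproduced with a short R sketch. The metric values and the gene set are the ones from the example table; the variable names (`m`, `S`, `n.miss`, `res`) are chosen here for illustration.

```r
# Ranking metric (S2N) for the 20 ranked genes in the worked example.
m <- c(5.4, 5.1, 4.7, 4.3, 3.6, 3.3, 2.9, 2.2, 1.6, 0.9,
       -0.1, -0.8, -1.2, -1.9, -2.6, -2.8, -3.3, -3.7, -4.2, -4.5)
names(m) <- paste0("Gene", 1:20)
S <- c("Gene2", "Gene4", "Gene5", "Gene7")  # gene set from the example

tag    <- as.numeric(names(m) %in% S)  # 1 if the gene is in S
no.tag <- 1 - tag                      # 1 if the gene is not in S

M      <- sum(m * tag)  # sum of ranking metric over genes in S (15.9)
n.miss <- sum(no.tag)   # T in the slides: genes in R but not in S (16)

# Step up by m_i / M at a gene in S, step down by 1 / T otherwise.
res <- cumsum(m * tag / M - no.tag / n.miss)
round(res, 4)
```

The running sum starts at -0.0625 (Gene1 is not in S), peaks at 0.8125 once the last set member (Gene7) has been passed, and returns to 0 at the end of the list.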
Determining ES
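• $\mathrm{RES}_j = \sum_{i=1}^{j} \left( m_i \cdot \frac{1}{M} \right) \mathrm{Tag}_i - \sum_{i=1}^{j} \frac{1}{T}\, \mathrm{No.Tag}_i$
• After going through all N ranked genes, you are left with a vector of RES values. Then, for a given gene set, ES is the RES value (sign included) at the largest deviation from zero:
$\mathrm{ES} = \mathrm{RES}_{j^*}, \quad j^* = \arg\max_j |\mathrm{RES}_j|$
• The first (hit) sum depends on the weighting, each version normalized by its total over all N genes:
• $\sum_{i=1}^{j} \mathrm{Tag}_i$ → Unweighted
• $\sum_{i=1}^{j} m_i \,\mathrm{Tag}_i$ → Weighted; α = 1
• $\sum_{i=1}^{j} m_i^{\alpha} \,\mathrm{Tag}_i$ → Weighted; α = α*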
“The enrichment score is the maximum deviation from zero
encountered in the random walk; it corresponds to a weighted
Kolmogorov–Smirnov-like statistic”
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
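A minimal sketch of the ES calculation with a weighting exponent is given below. The function name `enrichment.score` is hypothetical. It uses the absolute value of the metric as the weight, as in Subramanian et al.; that is an assumption relative to the formulas in these slides, which write the weight as m_i (the two agree in the worked example because all set members have positive metrics).

```r
# Enrichment score for gene set S in a named, ranked metric vector m.
# alpha = 0 gives the unweighted statistic; alpha = 1 weights hits by |m|.
enrichment.score <- function(m, S, alpha = 1) {
  tag <- names(m) %in% S
  hit <- ifelse(tag, abs(m)^alpha, 0)
  res <- cumsum(hit / sum(hit) - (!tag) / sum(!tag))
  res[which.max(abs(res))]  # signed RES at the largest deviation from zero
}

m <- c(5.4, 5.1, 4.7, 4.3, 3.6, 3.3, 2.9, 2.2, 1.6, 0.9,
       -0.1, -0.8, -1.2, -1.9, -2.6, -2.8, -3.3, -3.7, -4.2, -4.5)
names(m) <- paste0("Gene", 1:20)
S <- c("Gene2", "Gene4", "Gene5", "Gene7")

es.weighted   <- enrichment.score(m, S, alpha = 1)
es.unweighted <- enrichment.score(m, S, alpha = 0)

# Flip the sign of every metric and re-rank: the set now sits at the bottom
# of the list, so the largest deviation from zero is negative.
m.flipped  <- sort(-m, decreasing = TRUE)
es.flipped <- enrichment.score(m.flipped, S, alpha = 1)

c(weighted = es.weighted, unweighted = es.unweighted, flipped = es.flipped)
```

In this example the weighted and unweighted scores coincide because the peak falls after every set member has been passed; with set members scattered further down the list the two would generally differ.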
GENE_ID  S2N   TAG  NO.TAG  α  S2N*TAG*α/M  No.Tag/T  RES
Gene1    5.4   0    1       1  0            0.0625    -0.0625
Gene2    5.1   1    0       1  0.320754717  0         0.258255
Gene3    4.7   0    1       1  0            0.0625    0.195755
Gene4    4.3   1    0       1  0.270440252  0         0.466195
Gene5    3.6   1    0       1  0.226415094  0         0.69261
Gene6    3.3   0    1       1  0            0.0625    0.63011
Gene7    2.9   1    0       1  0.182389937  0         0.8125
Gene8    2.2   0    1       1  0            0.0625    0.75
Gene9    1.6   0    1       1  0            0.0625    0.6875
Gene10   0.9   0    1       1  0            0.0625    0.625
Gene11   -0.1  0    1       1  0            0.0625    0.5625
Gene12   -0.8  0    1       1  0            0.0625    0.5
Gene13   -1.2  0    1       1  0            0.0625    0.4375
Gene14   -1.9  0    1       1  0            0.0625    0.375
Gene15   -2.6  0    1       1  0            0.0625    0.3125
Gene16   -2.8  0    1       1  0            0.0625    0.25
Gene17   -3.3  0    1       1  0            0.0625    0.1875
Gene18   -3.7  0    1       1  0            0.0625    0.125
Gene19   -4.2  0    1       1  0            0.0625    0.0625
Gene20   -4.5  0    1       1  0            0.0625    0
M = 15.9
T = 16
[Plot: RES (y-axis, -0.2 to 0.9) against gene rank (x-axis, 1 to 20)]
Published Results of GSEA for SPEM…
• “The positive ES values and low P
values suggest that these genes, as a
set, are up-regulated significantly in
the ulcerated samples.”
• Both lists show ‘enrichment’
• General comparison:
• SPEM_WITH_INFLAMMATION (SWI) gene
set is ‘more enriched’ than SPEM gene set
• Higher enrichment score
• Lower p-value
• Better defined ‘leading edge’ on plot
• Corresponds with accompanying differential
expression analysis…
SPEM: Upregulated in class ulcerated; Enrichment Score (ES) 0.554442; p-value (random genes) < 0.05
SPEM_WITH_INFLAMMATION: Upregulated in class ulcerated; Enrichment Score (ES) 0.7697754; p-value (random genes) < 0.001
• For the SWI gene set:
• P-values for individual
genes tend to be lower
• Fold changes tend to be
higher
• Colors more intense on
corresponding heatmap
Where do the p-values come from?
• Permutation-based calculations are implemented in order
to assess the significance of a particular gene set.
• Based on samples
• Based on genes
Sample-based Permutation
[Figure: heatmaps of the original samples and of permuted sample 1, genes ordered from high positive to high negative signal-to-noise]
• Original samples: calculation of ES based on ‘observed’ S2N of predefined genes
• Randomly ‘shuffle’ samples and recalculate S2N [shown re-ranked]
• Permuted samples: calculation of ES based on ‘permuted’ S2N of predefined genes
• Repeat for Permutation 1, Permutation 2, Permutation 3, Permutation 4, …
[Figure: histogram of permuted ES scores; x-axis: Enrichment Score (-1.0 to 1.0), y-axis: Density]
Nominal P-value
0.975 quantile for
permuted scores
Original ES Score
y <- permuted.ES.scores
obs<- 0.76
sum(y>obs)/(1000+1)
[1] 0.01598402
Problem with Sample-based
• Limited number of available orderings.
• Different orderings can result in identical groupings
• For example, if we have:
Order for permutation #1: T1 T2 C3 C4 | C1 C2 T3 T4
Order for permutation #2: T1 C3 T2 C4 | C1 T3 C2 T4
• Then ESp1 = ESp2 (both orderings place T1, T2, C3, C4 in the first group)
• Reduced ability to estimate true variability in sampling distribution [of ES]
• If either group has n < 7, it is advised that gene permutation is carried out instead
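To put a number on the limited-orderings problem, the sketch below counts orderings versus distinct groupings for the 4-vs-4 design. It reads the two example orderings as rearrangements of T1–T4 and C1–C4, which is an assumption where the slide repeats a label.

```r
n1 <- 4  # samples in group 1
n2 <- 4  # samples in group 2

orderings <- factorial(n1 + n2)   # ways to order 8 samples
groupings <- choose(n1 + n2, n1)  # distinct splits into group 1 / group 2

# Two different orderings that produce the same split: the first four
# samples form group 1, the last four form group 2.
perm1 <- c("T1", "T2", "C3", "C4", "C1", "C2", "T3", "T4")
perm2 <- c("T1", "C3", "T2", "C4", "C1", "T3", "C2", "T4")
same.grouping <- setequal(perm1[1:4], perm2[1:4])

c(orderings = orderings, groupings = groupings)
same.grouping
1 / groupings  # resolution of a p-value built from all distinct groupings
```

With only 70 distinct groupings, even an exhaustive sample-based permutation cannot resolve a p-value more finely than about 0.014, which is one way to see why gene permutation is recommended for small groups.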
Gene-based Permutation
[Figure: two heatmaps, genes ordered from high positive to high negative signal-to-noise]
• Original: ES based on S2N of original genes
• Permuted gene labels: ES based on S2N of RANDOM gene set of same size
Problem with Gene-based
• Does not take into account the correlation structure of genes.
• Gene coexpression:
• Groups of genes with underlying similarity (for example, genes
associated with common transcription factors or biological
processes) should move up/down in rank together
Permuting labels at random
does not represent outcomes
that biologically make sense
Example coexpression network from: http://bioinfow.dep.usal.es/coexpression/
According to the authors…
• “Genes may be ranked based on the differences seen in a
small data set, with too few samples to allow rigorous
evaluation of significance levels by permuting the class
labels. In these cases, a P value can be estimated by
permuting the genes, with the result that genes are
randomly assigned to the sets while maintaining their
size. This approach is not strictly accurate: because it
ignores gene-gene correlations, it will overestimate the
significance levels and may lead to false positives.”
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
Recap on permutations…
• Permutation-based calculations are implemented in order to
assess the significance of a particular gene set.
• Based on samples (need large sample size)
• 1) Permute samples and re-compute ranked list of all genes
• Variability in ranking is dependent on variability of samples
• 2) Re-calculate ES score for gene set based on new rank
• Based on genes (may not preserve correlation structure of genes)
• 1) Permute genes to get a random set of genes
• 2) Re-calculate ES score for gene set with ‘random ranks’
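The gene-based recipe can be sketched on the 20-gene toy ranking used earlier. The helper `es`, the number of permutations, and the seed are illustrative choices. The p-value below adds 1 to numerator and denominator, a common convention that keeps the estimate above zero; it is not necessarily what any of the implementations discussed here do.

```r
set.seed(1)  # reproducible sketch

m <- c(5.4, 5.1, 4.7, 4.3, 3.6, 3.3, 2.9, 2.2, 1.6, 0.9,
       -0.1, -0.8, -1.2, -1.9, -2.6, -2.8, -3.3, -3.7, -4.2, -4.5)
names(m) <- paste0("Gene", 1:20)
S <- c("Gene2", "Gene4", "Gene5", "Gene7")

# Signed enrichment score for a logical membership vector `in.set`.
es <- function(m, in.set) {
  hit <- ifelse(in.set, abs(m), 0)
  res <- cumsum(hit / sum(hit) - (!in.set) / sum(!in.set))
  res[which.max(abs(res))]
}

obs <- es(m, names(m) %in% S)  # observed ES for the real gene set

# Null distribution: ES for random gene sets of the same size as S.
# The ranking itself is never recomputed, which is why this is fast.
n.perm  <- 1000
null.es <- replicate(n.perm, {
  random.set <- sample(names(m), length(S))
  es(m, names(m) %in% random.set)
})

# One-sided nominal p-value for enrichment at the top of the list.
p.value <- (sum(null.es >= obs) + 1) / (n.perm + 1)
c(ES = obs, p = p.value)
```

A sample-based version would instead shuffle the class labels, recompute the ranking metric for all genes, and recalculate ES for the same set S on each new ranking.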
Effect of permutation type on ES distribution for SPEM example
Sample-Based Permutation: ES: 0.7697, P-val: 0.2262
Gene-Based Permutation: ES: 0.7697, P-val: “0”
What if I reverse the direction of my hypothesis
– ie my genes are downregulated?
Upregulated in class ulcerated
Enrichment Score (ES) 0.76985
Normalized ES (NES) 1.5405532
Nominal p-value 0.16831683
Upregulated in class uninjured
Enrichment Score (ES) -0.76985
Normalized ES (NES) -1.5026467
Nominal p-value 0.17635658
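[Figure: four simulated rank patterns with panel titles ‘Typical’, Up-and-Down-Regulated, Random, and Center-Clustered; ES and p-values as shown with the panels:]
ES 0.80795, P-value 3.99E-11
ES 0.565999, P-value 0.055373
ES -0.46202, P-value 0.012155
ES -0.25653, P-value 0.770885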
Revisiting SPEM results…
• The published ES p-values were generated via
permutation of gene-label
• Follows guidelines based on sample-size
• As a bioinformatician, one should work with, rather than work around, sample size limitations, and be clear when writing methods.
• GSEA could not be conducted for different part of experiment
• Each ‘group’ consisted of one sample
• Fair to say that results of GSEA and differential
expression analysis support hypothesis of SPEM-
mediated processes
• Both suggest significant up-regulation of SPEM-related genes in
the treatment (ulcerated) group
GSEA Take-aways
• Quantitative measurements and visual output
• Data may already be out there, just needs to be analyzed!
• Variety of R-packages implement GSEA; also a GUI software
application developed by BROAD institute
• Databases with gene expression data and gene lists are becoming
increasingly common and even user friendly… such as ilincs.org
• Interpretation and follow-up specific to experiment.
• Pay attention to sample size
• Comprehensive statistical analysis should be included if results are
intended to be published.
• More room for interpretation in exploratory setting.
Questions?
Image: https://www.buzzfeed.com/christianzamora/jeans-or-genes?utm_term=.mn3A55J0Pk#.ldO3LLm7gA
References
• Planet E (2013). phenoTest: Tools to test association between gene expression and phenotype in a way that is efficient, structured, fast and scalable. We also provide tools to do GSEA (Gene set enrichment analysis) and copy number variation. R package version 1.16.0.
• Mootha VK et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 34, 267-273 (2003).
• Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 2005 102 (43) 15545-15550; published ahead of print September 30, 2005.
Additional Slides
• Different Implementations
• Explaining example heatmaps
• References
Normalized Enrichment Score (NES)
• Takes into account the size of each gene set list and adjusts the original ES
• Important when many lists are taken into consideration
• Used to calculate false discovery rate (FDR)
• “The FDR is the estimated probability that a set with a given NES represents a false positive finding; it is computed by comparing the tails of the observed and null distributions for the NES”
• The older method used the family-wise error rate (FWER), but:
• “Because our primary goal is to generate hypotheses, we chose to use the FDR to focus on controlling the probability that each reported result is a false positive.”
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
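One way to sketch the normalization, following the description in Subramanian et al., is to divide the observed ES by the mean of the permuted ES values that share its sign. The function name `normalized.es` and the permuted scores below are made up for illustration; real implementations may differ in detail.

```r
# NES: observed ES divided by the mean magnitude of the permuted ES values
# with the same sign, so that gene sets of different sizes are comparable.
normalized.es <- function(obs, null.es) {
  same.sign <- if (obs >= 0) null.es[null.es >= 0] else null.es[null.es < 0]
  obs / abs(mean(same.sign))
}

# Made-up permuted scores for one gene set.
null.es <- c(0.42, 0.55, -0.38, 0.47, -0.51, 0.60, -0.44, 0.36)

nes.up   <- normalized.es(0.77, null.es)   # positive ES vs. positive null
nes.down <- normalized.es(-0.77, null.es)  # negative ES vs. negative null
c(up = nes.up, down = nes.down)
```

Because the positive and negative halves of the null distribution are handled separately, the NES magnitudes for the two directions need not match, as in the up/down comparison shown earlier.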
A note on ranking…
• Pre-ranked gene-lists may be used
• Should incorporate magnitude and direction for purposes of
interpretation
[Figure: enrichment plots for the default ranking and for a list pre-ranked by edgeR p-value]
Example 1: SNP data
Title: “Using Environmental Correlations to Identify Loci Underlying Local Adaptation”
• From article:
• [This study will demonstrate]
“covariance in allele
frequencies between
populations from a set of
markers”
• “These matrices reveal the
close genetic relationship of
populations from the same
geographic region”
• Personal observation:
• “broad geographic area” not
defined
• Color key would be nice
“The matrices are displayed as heat maps with lighter colors corresponding to
higher values. The rows and columns of these matrices have been arranged by
broad geographic label.”
Graham Coop, David Witonsky, Anna Di Rienzo, Jonathan K. Pritchard. Genetics. August 1, 2010 vol. 185 no. 4, 1411-1423;
DOI: 10.1534/genetics.110.114819
Example 2: Gene Expression and SNP data
Title: “A recurrent inactivating mutation in RHOA GTPase in angioimmunoblastic T cell lymphoma”
• From article:
• “To explore the functional and mechanistic implications of the somatic mutations, we performed pathway analysis by integrating mutation and gene expression data from AITL cases”
• “After controlling for platform differences, the tumor and normal samples separated into distinct clusters”
• Notes:
• Color key would still be nice
“Heat map from hierarchical clustering of differentially expressed genes. […] Gene clusters from hierarchical classification were subjected to the DAVID web server for Gene Ontology analysis, and the most enriched term for each cluster was determined using the q value from the FDR test”.
Yoo HY et al. Nature Genetics 46, 371–375 (2014) DOI:10.1038/ng.2916
3 Methods for Discussion
• 1: R implementation of GSEA (BROAD)
• Is supplied as a function, not a package
• 2: Java desktop application (BROAD)
• Graphical user interface (point and click)
• 3: R package “phenoTest” (Planet, E (2013))
• An R package for carrying out GSEA
GSEA – R function
• Many analytical tools meant for R implementation are
provided as R packages…but not this one!
• Download file from BROAD institute website, unzip, open text file
“GSEA.1.0.R” in R and run code to get functions.
• Problem #1: code is not maintained, some operating systems will not
tolerate particular syntax and code must be edited
• Solution = Jacek
• Problem #2: the data ‘needs’ to be in a very specific format:
• .res or .gct for expression data
• .cls for phenotype/class data
• .gmt or .gmx for gene lists
GSEA – R: getGSEAready functions
• Solution 2a: Expression Data
• Just need a correctly formatted data.frame object to pass to the function.
# New function converts 'all.gene.result' to the necessary format
getGSEAready.expression <- function(expression.data, list.format) {
  dataset <- read.delim(expression.data, stringsAsFactors = FALSE)
  # Use gene IDs or gene symbols as row names
  if (list.format == "geneid") {
    row.names(dataset) <- dataset$geneid
  } else if (list.format == "symbol") {
    # Change NA symbols to 'unnamed[#]' so that every row has a name
    # (seq_along also handles the case of no NA symbols)
    nas <- which(is.na(dataset$symbol))
    dataset$symbol[nas] <- sprintf("unnamed%d", seq_along(nas))
    row.names(dataset) <- dataset$symbol
  }
  # Remove all columns except for expression data
  # (the first 10 columns are dropped)
  dataset <- dataset[, -(1:10)]
  return(dataset)
}
New data.frame object
GSEA – R: getGSEAready functions
• Solution 2b: Phenotype data
• Similar story; R needs a list object with 2 vectors –
• phen: a character vector with the class labels
• class.v: a numeric [1,2] vector to indicate class for each sample
# New function uses 'Sample.Info' to generate phenotype data
# (simple design: treatment vs. control)
getGSEAready.phenotype <- function(phenotype.data) {
  pheno <- read.delim(phenotype.data)
  phen <- as.character(unique(pheno$Group))
  # Label your 'group of interest' (the first group listed) as 1, the other as 2
  class.v <- ifelse(pheno$Group == phen[1], 1, 2)
  classdata <- list(phen = phen, class.v = class.v)
  return(classdata)
}
New ‘classdata’ list object
SampleInfo="E:SampleInfo.txt"
classdata = getGSEAready.phenotype(SampleInfo)
classdata
## $phen
## [1] "ulcerated" "uninjured"
##
## $class.v
## [1] 1 1 1 1 2 2 2 2
GSEA – R function
• Solution 2c: Gene Lists
• Set up in Excel with specified format and save as .gmt file
(use quotation marks around filename when saving)
• na’s may be replaced with description of gene list
• Note: There are ‘pre-packaged’ gene lists available from
MSigDB but that is for another discussion
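As an alternative to preparing the file in Excel, a .gmt file can be written directly from R. The sketch below assumes the usual GMT layout of one tab-delimited line per gene set (set name, description, then gene identifiers); the function name `write.gmt` and the gene symbols are hypothetical.

```r
# Write a named list of gene sets to a .gmt file: one tab-delimited line per
# set holding the set name, a description, then the gene identifiers.
write.gmt <- function(gene.sets, file,
                      descriptions = rep("na", length(gene.sets))) {
  lines <- mapply(function(name, desc, genes) {
    paste(c(name, desc, genes), collapse = "\t")
  }, names(gene.sets), descriptions, gene.sets)
  writeLines(unname(lines), file)
}

# Hypothetical gene symbols, for illustration only.
gene.sets <- list(
  SPEM = c("GeneA", "GeneB", "GeneC"),
  SPEM_with_Inflammation = c("GeneB", "GeneD")
)

gmt.file <- tempfile(fileext = ".gmt")
write.gmt(gene.sets, gmt.file)
gmt.lines <- readLines(gmt.file)
gmt.lines
```

Writing the file from code avoids the quoting step needed when saving from Excel and keeps the gene lists under version control alongside the analysis.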
GSEA – R function
• 1: Get data ready
• 2: Input parameters for main function
• 3: Run main function
• 4: Check output
• Note: I asked for data to be output to “/home/vandersm/Documents/gseaRexample/zavros” but it gets output to the parent directory (with the intended directory name prepended to the file-name prefix)
• Example output for one gene list (1), (2)
• Example of global plots (all lists considered)
• Example of report for one gene list: is this gene in the leading edge subset?
• Need to check the parameters you used?
GSEA – R function
Pros:
• Once the data is formatted correctly, the analysis is rather straight-
forward.
• Provides graphical output for individual lists as well as global reports
Cons:
• Lacks options that are easy to change in GUI version
• No option for presorted list
• No options for how to sort list
• Difficult to obtain sorted list
• Creates a matrix of gene ranks corresponding to permuted sample
labels regardless of whether or not sample permutation is chosen
• P-values have potential to be zero [at least if permuting by gene]
GSEA with JAVA GUI
• Similar to the R script/functions, much of the hassle concerns correctly formatted data. If anyone is interested in those formats, let me know; for now, let’s not dwell on them
file:///C:/Users/Shana/gsea_home/output/aug17/my_analysis.Gsea.1439827638553/index.html
• Pros:
• Once data is loaded, the ‘point-and-click’ environment is generally
convenient and flexible
• Graphs are somewhat ‘prettier’ than from the R-code
• Results can be used from an easily-navigable page
• Very easy to obtain ranked list of genes
• Cons
• Not necessarily an option for incorporation into our pipeline
• Cannot test functionality in interactive mode
• Crashes if #permutations increases 10-fold
• May yield p-values of 0.
GSEA with phenoTest
• phenoTest – R package by Planet E (2013)
• Implements GSEA in a manner that is rather flexible
(once formatting issues are taken care of!)
• Input dataset is in form of an eset
• Use Jacek’s code/my function for creating an eset from the
raw.data and SampleInfo file obtained during analysis
• Should # of genes match that of ‘processed’ data?
• EX) In my experiment, my ‘all_genes_result’ has data for
16,400 genes but my eset has info for 20,640
• Has modest effect on results
• To compare across implementations, I restricted my eset to
only include genes that were in my ‘all_genes_result’
GSEA with phenoTest
• Calculation of ‘observed’ ES score is the same
but ONLY permutes genes
• Calls permuted ES scores ‘Simulated Scores’; these can
easily be accessed after the analysis is run
• Automatically creates an NES plot – could be better
alternative for publication if many lists are taken into
consideration.
• Has option to create plot using Wilcoxon test as discussed by Virtaneva (2001)
GSEA with phenoTest –
Example Graphs
ES Plot NES Plot Wilcoxon-ES Plot
GSEA with phenoTest
• Creation of epheno object makes ‘playing around’
with the data much easier
• See how GSEA will behave when particular patterns among
the ranked genes are artificially created
• Obviously, once you re-create the gene-ranks as they are in the
GSEA BROAD implementation, this can be done with the
aforementioned options; but this saves some computational steps.
GSEA with phenoTest
• Pros:
• Has a few different options as far as plots are concerned
• Function is relatively easy to run [compared to the BROAD script]
and the epheno object is useful
• Run-time is fast (doesn’t ever re-compute permuted S2N)
• Cons
• No option to permute by sample
• Some of the functions that are described in the reference paper do
not work (not a big deal, just annoying)
• Have not figured out how to implement different ranking
mechanism.
Comparison of Output
Is there a better, perhaps more
‘statistical’ approach?

Algorithm research project neighbor joining
 
Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
 
Genomics, Transcriptomics, Proteomics, Metabolomics - Basic concepts for clin...
Genomics, Transcriptomics, Proteomics, Metabolomics - Basic concepts for clin...Genomics, Transcriptomics, Proteomics, Metabolomics - Basic concepts for clin...
Genomics, Transcriptomics, Proteomics, Metabolomics - Basic concepts for clin...
 
Structural genomics
Structural genomicsStructural genomics
Structural genomics
 
Dotplots for Bioinformatics
Dotplots for BioinformaticsDotplots for Bioinformatics
Dotplots for Bioinformatics
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
SNP Genotyping Technologies
SNP Genotyping TechnologiesSNP Genotyping Technologies
SNP Genotyping Technologies
 
Introduction to systems biology
Introduction to systems biologyIntroduction to systems biology
Introduction to systems biology
 
Structural Variation Detection
Structural Variation DetectionStructural Variation Detection
Structural Variation Detection
 

Similaire à GSEA of SPEM Genes

RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)
RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)
RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)r-kor
 
Multivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic DataMultivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic DataUC Davis
 
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013Functional Genomics Data Society
 
Bioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxBioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxRanjan Jyoti Sarma
 
Mixed Models: How to Effectively Account for Inbreeding and Population Struct...
Mixed Models: How to Effectively Account for Inbreeding and Population Struct...Mixed Models: How to Effectively Account for Inbreeding and Population Struct...
Mixed Models: How to Effectively Account for Inbreeding and Population Struct...Golden Helix Inc
 
150224 giab 30 min generic slides
150224 giab 30 min generic slides150224 giab 30 min generic slides
150224 giab 30 min generic slidesGenomeInABottle
 
Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedMicrobial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedJonathan Eisen
 
Prote-OMIC Data Analysis and Visualization
Prote-OMIC Data Analysis and VisualizationProte-OMIC Data Analysis and Visualization
Prote-OMIC Data Analysis and VisualizationDmitry Grapov
 
GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GenomeInABottle
 
EST Clustering.ppt
EST Clustering.pptEST Clustering.ppt
EST Clustering.pptMedhavi27
 
Omic Data Integration Strategies
Omic Data Integration StrategiesOmic Data Integration Strategies
Omic Data Integration StrategiesDmitry Grapov
 
Analysis of gene expression microarray data of patients with Spinal Muscular ...
Analysis of gene expression microarray data of patients with Spinal Muscular ...Analysis of gene expression microarray data of patients with Spinal Muscular ...
Analysis of gene expression microarray data of patients with Spinal Muscular ...Anton Yuryev
 
Evolutionary Algorithms
Evolutionary AlgorithmsEvolutionary Algorithms
Evolutionary AlgorithmsReem Alattas
 
Bioinformatics t8-go-hmm wim-vancriekinge_v2013
Bioinformatics t8-go-hmm wim-vancriekinge_v2013Bioinformatics t8-go-hmm wim-vancriekinge_v2013
Bioinformatics t8-go-hmm wim-vancriekinge_v2013Prof. Wim Van Criekinge
 

Similaire à GSEA of SPEM Genes (20)

RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)
RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)
RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)
 
Lecture 7 gwas full
Lecture 7 gwas fullLecture 7 gwas full
Lecture 7 gwas full
 
Multivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic DataMultivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic Data
 
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013
 
Bioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxBioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptx
 
Mixed Models: How to Effectively Account for Inbreeding and Population Struct...
Mixed Models: How to Effectively Account for Inbreeding and Population Struct...Mixed Models: How to Effectively Account for Inbreeding and Population Struct...
Mixed Models: How to Effectively Account for Inbreeding and Population Struct...
 
genomic comparison
genomic comparison genomic comparison
genomic comparison
 
150224 giab 30 min generic slides
150224 giab 30 min generic slides150224 giab 30 min generic slides
150224 giab 30 min generic slides
 
Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedMicrobial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
 
Prote-OMIC Data Analysis and Visualization
Prote-OMIC Data Analysis and VisualizationProte-OMIC Data Analysis and Visualization
Prote-OMIC Data Analysis and Visualization
 
GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517
 
EST Clustering.ppt
EST Clustering.pptEST Clustering.ppt
EST Clustering.ppt
 
Omic Data Integration Strategies
Omic Data Integration StrategiesOmic Data Integration Strategies
Omic Data Integration Strategies
 
Analysis of gene expression microarray data of patients with Spinal Muscular ...
Analysis of gene expression microarray data of patients with Spinal Muscular ...Analysis of gene expression microarray data of patients with Spinal Muscular ...
Analysis of gene expression microarray data of patients with Spinal Muscular ...
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
phy prAC.pptx
phy prAC.pptxphy prAC.pptx
phy prAC.pptx
 
ANOVA 2023 aa 2564896.pptx
ANOVA 2023  aa 2564896.pptxANOVA 2023  aa 2564896.pptx
ANOVA 2023 aa 2564896.pptx
 
Evolutionary Algorithms
Evolutionary AlgorithmsEvolutionary Algorithms
Evolutionary Algorithms
 
Bioinformatics t8-go-hmm wim-vancriekinge_v2013
Bioinformatics t8-go-hmm wim-vancriekinge_v2013Bioinformatics t8-go-hmm wim-vancriekinge_v2013
Bioinformatics t8-go-hmm wim-vancriekinge_v2013
 
Bioinformatics t8-go-hmm v2014
Bioinformatics t8-go-hmm v2014Bioinformatics t8-go-hmm v2014
Bioinformatics t8-go-hmm v2014
 

Dernier

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Servicejennyeacort
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computationsit20ad004
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 

Dernier (20)

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 

GSEA of SPEM Genes

  • 1. O.M.GSEA An introduction to ‘classic’ Gene Set Enrichment Analysis methodology Shana White 2016
  • 2. Overview • Preliminaries • Genes & gene sets • Gene expression and enrichment • GSEA • Introduction & Example from Publication • Experimental conditions • Purpose of GSEA • Background on methodology • Gene ranking • Enrichment Scores & Plots • Assessing significance
  • 3. Fundamentals of Genes, Genomes • Genome = Collection of DNA sequences [across all chromosomes of a species]. • Genes = Subset of genome that ‘codes’ for a protein Image: http://cs.stanford.edu/people/eroberts/cs181/projects/2010-11/Genomics/accuracy.html
  • 4. Genes & Genotype • Individuals of the same species share the same genome; have the same set of genes • SNPs (single nucleotide polymorphisms) allow for genetic variation • Genotype = the DNA (nucleotide) sequence of an individual's copy of a gene • “Central Dogma” of genetics: Image 1: http://cs.stanford.edu/people/eroberts/cs181/projects/2010-11/Genomics/accuracy.html Image 2: http://www.lhsc.on.ca/Patients_Families_Visitors/Genetics/Inherited_Metabolic/Mitochondria/DiseasesattheMolecularLevel.htm
  • 5. Genotype vs. Gene Expression • Genotype is the same in all somatic cells of an organism
  • 6. Genotype vs. Gene Expression • Genotype is the same in all somatic cells of an organism • Gene expression [rate at which genes are transcribed] is different to varying degrees • Random fluctuations within same types of cells are expected • Significant differences → significantly different cells
  • 8. Gene Expression Data • RNA-seq (“Next-Generation sequencing”) simultaneously generates data for both genotype and gene expression • Experimental set-up is typically cases vs. controls • Cases: phenotype is induced experimentally [pure experiment] • Controls: represent ‘baseline’ gene expression for comparison • Quantify transcriptome of each sample • Compare ‘fold change’ for expression of cases compared to controls • FC = (expression in cases)/(expression in controls) • Determine significance of individual gene up/down regulation • Results of differential expression analysis or additional analyses often visualized with a “heatmap”…
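The fold-change formula on this slide can be illustrated with a minimal sketch. The expression values are hypothetical, and summarizing each group by its mean before taking the ratio is an assumption; the slide does not say how replicates are combined.

```python
from math import log2
from statistics import mean


def fold_change(cases, controls):
    """FC = (expression in cases) / (expression in controls).

    Assumption: each group is summarized by its mean across samples.
    """
    return mean(cases) / mean(controls)


# Hypothetical expression values for one gene (4 cases, 4 controls).
cases = [40.0, 44.0, 36.0, 40.0]
controls = [10.0, 9.0, 11.0, 10.0]

fc = fold_change(cases, controls)
# log2 puts up- and down-regulation on a symmetric scale:
# positive means higher in cases, negative means lower in cases.
log2_fc = log2(fc)
print(fc, log2_fc)
```

Whether a given fold change is significant is a separate question that the ratio alone does not answer.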
  • 9. Quick teaching note on heatmaps… • Very common visualization tool in genetic research • Incorporate genotype and/or gene expression data • Support and/or generate hypotheses • Organization of rows/columns determined by experimental design • Range from relatively simple to complex [in terms of interpretation]
  • 12. Quick teaching note on heatmaps… • Very common visualization tool in genetic research • Incorporate genotype and/or gene expression data • Support and/or generate hypotheses • Organization of rows/columns determined by experimental design • Range from relatively simple to complex [in terms of interpretation] • Heatmap for purposes of this discussion [column groups: Cases, Controls]: • Columns → Samples • Rows → Genes • Color → Direction/Intensity of expression • Red: Higher than row average • Blue [or green]: Lower than row average
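The colour rule for this heatmap (red above the row average, blue below) amounts to centering each gene's row on its own mean. A minimal sketch, using a hypothetical expression matrix and hypothetical helper names:

```python
from statistics import mean


def center_rows(matrix):
    """Subtract each row's (gene's) average from its values."""
    centered = []
    for row in matrix:
        row_mean = mean(row)
        centered.append([value - row_mean for value in row])
    return centered


def colour(value):
    """Red: higher than row average; blue: lower; white: equal."""
    if value > 0:
        return "red"
    if value < 0:
        return "blue"
    return "white"


# Hypothetical matrix: rows are genes, columns are samples
# (first two columns are cases, last two are controls).
expression = [
    [8.0, 10.0, 2.0, 4.0],
    [5.0, 5.0, 5.0, 5.0],
]
centered = center_rows(expression)
colours = [[colour(v) for v in row] for row in centered]
print(colours)
```

Real heatmap tools usually also scale each row and map values to a continuous colour gradient; this sketch only shows the direction of the colour.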
  • 13. “Enrichment” & GSEA • Results of individual genes • Dictionary(.com) definition of enrichment: • “act of making fuller or more meaningful or rewarding” • Gene set enrichment • Gene sets are predefined in the literature and/or in database: • Grouped by information regarding gene function, pathway membership, etc. • Gene sets are ‘enriched’ if experimental findings are in accordance with the set of interest [with hope of adding meaning to results] • Definition not always obvious • Good resource: “An introduction to effective use of enrichment analysis software” • Gene Set Enrichment Analysis • Statistical methods determine significance of enrichment for gene set by comparing distribution of genes in set to ‘background distribution’
  • 14. First Use in Publication (2003)
  • 15. Paper describing methodology (2005) Citations as of: November 2015 - 8,349 November 2016 - 10,057
  • 16. Quick review… • GSEA is a common ‘secondary analysis’ after gene expression data has been collected • Gene sets can be determined a-priori specific to an experiment (as in example that follows) or • Multiple gene-sets from databases can be used in a data-mining fashion to support or generate hypotheses • Implications of multiple testing (beyond scope of presentation) • Good to know the basics • GSEA still a common request of bioinformaticians • “Newer/better” methods build on or refer to GSEA • Goal for remainder of presentation: • Use example from recent publication to elucidate basic concepts and terminology • Go into further detail for statistical methodology related to GSEA
  • 18. GSEA for SPEM • Experimental setup: • 2 groups of mice, balanced design ($n_i = 4$, $i = 1, 2$): Group 1 = Ulcerated (× 4), Group 2 = Uninjured (× 4) • Mice are sacrificed, samples have been collected/processed, and RNA-seq data is available. • Hypothesis: The stomachs of the mice in Group 1 (the treatment group) are undergoing SPEM-mediated repair • Our lab is asked to conduct GSEA of SPEM genes to support hypothesis
  • 19. SPEM Lists • There is no curated SPEM pathway per se, but there are gene sets corresponding to SPEM that have been published in the literature. • I have 2 gene-set lists as follows: • SPEM [as generally observed] (list name = “SPEM”) • SPEM in response to inflammation (list name = “SPEM_with_Inflammation”) • Both of these lists contain genes that were previously found to be up-regulated during SPEM • GSEA will support the research hypothesis if upregulated expression of SPEM-related genes is evident in the ulcerated samples compared to the uninjured control samples. • **Note: this presentation is intended to shed light on basic features of GSEA and does not consider the effects of cross-talk between pathways.
  • 20. Summary of basic analysis • Methods are specific to 2-group experiment • Inputs: • 1 – Gene expression for ALL genes • 2 – Phenotype information (define groups for comparison) • 3 – List(s) of genes of interest • Intermediary step: • Rank genes based on differential expression Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
  • 21. A note on ranking… • Default ranking mechanism is “signal to noise ratio” (s2n) • Reflects correlation of gene with phenotype (size and direction)
  • 22. A note on ranking… • Default ranking mechanism (signal to noise ratio, s2n) • Formula: $S2N_i = \dfrac{\mu_i^{Group1} - \mu_i^{Group2}}{\sigma_i^{Group1} + \sigma_i^{Group2}}$ • Reflects correlation (association) of gene with phenotype in terms of size and direction • Rank ≠ Significance of differential expression • It is likely, though not guaranteed, that genes ranking very high or very low are significantly differentially expressed
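As a rough illustration of the signal-to-noise ranking, the sketch below computes S2N for three hypothetical genes and sorts them. The gene names and values are invented, and using the sample standard deviation is an assumption; GSEA software may handle the standard deviation differently.

```python
from statistics import mean, stdev


def signal_to_noise(group1, group2):
    """S2N = (mean1 - mean2) / (sd1 + sd2) for one gene.

    Assumption: sample standard deviation (statistics.stdev).
    """
    return (mean(group1) - mean(group2)) / (stdev(group1) + stdev(group2))


# Hypothetical expression values: gene -> (Group 1 samples, Group 2 samples).
expression = {
    "GeneA": ([8.0, 9.0, 10.0, 9.0], [4.0, 5.0, 6.0, 5.0]),
    "GeneB": ([5.0, 6.0, 5.0, 6.0], [5.0, 6.0, 6.0, 5.0]),
    "GeneC": ([2.0, 3.0, 2.0, 3.0], [7.0, 8.0, 7.0, 8.0]),
}

s2n = {gene: signal_to_noise(g1, g2) for gene, (g1, g2) in expression.items()}

# Ranked list R: most positively associated with Group 1 first.
ranked = sorted(s2n, key=s2n.get, reverse=True)
print(ranked)
```

The sign gives the direction (higher in Group 1 or in Group 2) and the magnitude gives the size of the association, which is what places a gene at the top or bottom of the ranked list.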
  • 23. Summary of basic analysis (cntd.) • S = List of genes belonging to defined gene set (independent of data) • R = Ranked list of genes (dependent on data & method of ranking genes) • “Given an a priori defined set of genes S, …, the goal of GSEA is to determine whether the members of S are randomly distributed throughout R or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.” • Null hypothesis: Membership in S is unrelated to location in R (members are randomly distributed throughout R) • Alternative: Membership in S is related to location in R (members are concentrated high or low) Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
  • 24. What would ideal rank for SPEM genes ‘look’ like? [Figure: three example gene lists (Example List1, Example List2, Example List3) placed along the gene rank, labeled: Uninformative; High ‘correlation’ with uninjured group; High ‘correlation’ with ulcerated group. Axis: Gene Rank, running from low expression in cases to high expression in cases]
  • 25. Summary of basic analysis • Methods are specific to 2-group experiment • Inputs: • 1 – Gene expression for ALL genes • 2 – Phenotype information (define groups for comparison) • 3 – List(s) of genes of interest • Intermediary step: • Rank genes based on differential expression • Outputs: • 1 – Enrichment scores and p-values (summary information) • 2 – Enrichment plots (graphical summary) Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
  • 26. Enrichment Score (ES) (A) An expression data set sorted by correlation with phenotype, the corresponding heat map, and the “gene tags,” i.e., location of genes from a set S within the sorted list. (B) Plot of the running sum for S in the data set, including the location of the maximum enrichment score (ES) and the leading-edge subset. Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
  • 27. Calculation of Running Enrichment Score (RES) • Ranked genes: $R_1, \dots, R_N$
  • 29. Calculation of Running Enrichment Score (RES) • Ranked genes: $R_1, \dots, R_N$ • Ranking metric for each gene: $m_1, \dots, m_N$
  • 30.
    GENE_ID   S2N
    Gene1      5.4
    Gene2      5.1
    Gene3      4.7
    Gene4      4.3
    Gene5      3.6
    Gene6      3.3
    Gene7      2.9
    Gene8      2.2
    Gene9      1.6
    Gene10     0.9
    Gene11    -0.1
    Gene12    -0.8
    Gene13    -1.2
    Gene14    -1.9
    Gene15    -2.6
    Gene16    -2.8
    Gene17    -3.3
    Gene18    -3.7
    Gene19    -4.2
    Gene20    -4.5
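  • 31. Calculation of Running Enrichment Score (RES) • Ranked genes: $R_1, \dots, R_N$ • Ranking metric for each gene: $m_1, \dots, m_N$ • Gene set S with k elements: $s_1, \dots, s_k$ • $\mathrm{Tag}_i = \begin{cases} 1 & \text{if } R_i \in S \\ 0 & \text{else} \end{cases}$, $\mathrm{No.Tag}_i = \begin{cases} 1 & \text{if } R_i \notin S \\ 0 & \text{else} \end{cases}$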
  • 32.
    GENE_ID   S2N   TAG  NO.TAG
    Gene1      5.4   0    1
    Gene2      5.1   1    0
    Gene3      4.7   0    1
    Gene4      4.3   1    0
    Gene5      3.6   1    0
    Gene6      3.3   0    1
    Gene7      2.9   1    0
    Gene8      2.2   0    1
    Gene9      1.6   0    1
    Gene10     0.9   0    1
    Gene11    -0.1   0    1
    Gene12    -0.8   0    1
    Gene13    -1.2   0    1
    Gene14    -1.9   0    1
    Gene15    -2.6   0    1
    Gene16    -2.8   0    1
    Gene17    -3.3   0    1
    Gene18    -3.7   0    1
    Gene19    -4.2   0    1
    Gene20    -4.5   0    1
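  • 33. Calculation of Running Enrichment Score (RES) • Ranked genes: $R_1, \dots, R_N$ • Ranking metric for each gene: $m_1, \dots, m_N$ • Gene set S with k elements: $s_1, \dots, s_k$ • $\mathrm{Tag}_i = \begin{cases} 1 & \text{if } R_i \in S \\ 0 & \text{else} \end{cases}$, $\mathrm{No.Tag}_i = \begin{cases} 1 & \text{if } R_i \notin S \\ 0 & \text{else} \end{cases}$ • $M = \sum_{i:\,\mathrm{Tag}_i = 1} m_i$ (sum of ranking metric over the k genes in the set) • $T = \sum_{i=1}^{N} \mathrm{No.Tag}_i$ ($= N - k$; # of genes in R but not S)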
  • 34.
    GENE_ID   S2N   TAG  NO.TAG
    Gene1      5.4   0    1
    Gene2      5.1   1    0
    Gene3      4.7   0    1
    Gene4      4.3   1    0
    Gene5      3.6   1    0
    Gene6      3.3   0    1
    Gene7      2.9   1    0
    Gene8      2.2   0    1
    Gene9      1.6   0    1
    Gene10     0.9   0    1
    Gene11    -0.1   0    1
    Gene12    -0.8   0    1
    Gene13    -1.2   0    1
    Gene14    -1.9   0    1
    Gene15    -2.6   0    1
    Gene16    -2.8   0    1
    Gene17    -3.3   0    1
    Gene18    -3.7   0    1
    Gene19    -4.2   0    1
    Gene20    -4.5   0    1
    M = 15.9, T = 16
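  • 35. Calculation of Running Enrichment Score (RES) • Ranked genes: $R_1, \dots, R_N$ • Ranking metric for each gene: $m_1, \dots, m_N$ • Gene set S with k elements: $s_1, \dots, s_k$ • $\mathrm{Tag}_i = \begin{cases} 1 & \text{if } R_i \in S \\ 0 & \text{else} \end{cases}$, $\mathrm{No.Tag}_i = \begin{cases} 1 & \text{if } R_i \notin S \\ 0 & \text{else} \end{cases}$ • $M = \sum_{i:\,\mathrm{Tag}_i = 1} m_i$ (sum of ranking metric over the k genes in the set) • $T = \sum_{i=1}^{N} \mathrm{No.Tag}_i$ ($= N - k$; # of genes in R but not S) • Start at $R_1$. If $\mathrm{Tag}_1 = 1$, then $RES_1 = m_1 \cdot \frac{1}{M}$, else $RES_1 = -\frac{1}{T}$ • Move to $R_2$. If $\mathrm{Tag}_2 = 1$, then $RES_2 = RES_1 + m_2 \cdot \frac{1}{M}$, else $RES_2 = RES_1 - \frac{1}{T}$ • For a given $R_j$, $RES_j = \sum_{i=1}^{j} \left( m_i \cdot \frac{1}{M} \right) \mathrm{Tag}_i - \sum_{i=1}^{j} \frac{1}{T}\, \mathrm{No.Tag}_i$ • At the final $R_N$, we have $\frac{M}{M} - \frac{T}{T} = 0$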
  • 36. Determining ES • $RES_j = \sum_{i=1}^{j} \left( m_i \cdot \frac{1}{M} \right) \mathrm{Tag}_i - \sum_{i=1}^{j} \frac{1}{T}\, \mathrm{No.Tag}_i$ • After going through all N ranked genes, you are left with a vector of RES’s. Then, for a given gene set, $ES = \max_j |RES_j|$, the maximum deviation from zero, reported with the sign of that RES • Variants of the running sum over tagged genes: $\sum_{i=1}^{j} \mathrm{Tag}_i$ → Unweighted; $\sum_{i=1}^{j} m_i$ → Weighted, α = 1; $\sum_{i=1}^{j} m_i \cdot \alpha$ → Weighted, α = α* • “The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic” Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
  • 37.
      GENE_ID   S2N    TAG   NO.TAG   α   S2N^α·TAG/M   No.Tag/T   RES
      Gene1      5.4   0     1        1   0             0.0625     -0.0625
      Gene2      5.1   1     0        1   0.320754717   0           0.258255
      Gene3      4.7   0     1        1   0             0.0625      0.195755
      Gene4      4.3   1     0        1   0.270440252   0           0.466195
      Gene5      3.6   1     0        1   0.226415094   0           0.69261
      Gene6      3.3   0     1        1   0             0.0625      0.63011
      Gene7      2.9   1     0        1   0.182389937   0           0.8125
      Gene8      2.2   0     1        1   0             0.0625      0.75
      Gene9      1.6   0     1        1   0             0.0625      0.6875
      Gene10     0.9   0     1        1   0             0.0625      0.625
      Gene11    -0.1   0     1        1   0             0.0625      0.5625
      Gene12    -0.8   0     1        1   0             0.0625      0.5
      Gene13    -1.2   0     1        1   0             0.0625      0.4375
      Gene14    -1.9   0     1        1   0             0.0625      0.375
      Gene15    -2.6   0     1        1   0             0.0625      0.3125
      Gene16    -2.8   0     1        1   0             0.0625      0.25
      Gene17    -3.3   0     1        1   0             0.0625      0.1875
      Gene18    -3.7   0     1        1   0             0.0625      0.125
      Gene19    -4.2   0     1        1   0             0.0625      0.0625
      Gene20    -4.5   0     1        1   0             0.0625      0
      M = 15.9, T = 16
      [Plot: RES (y-axis) against gene rank 1 to 20 (x-axis)]
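The worked table can be reproduced in a few lines of R. This is a minimal sketch, not the Broad implementation: `running_es()` is a hypothetical helper name, the S2N values and gene-set membership are copied from the example table, and the weights use the absolute value of the metric raised to α (an assumption that matches the table because every gene in the set has a positive S2N).

```r
# S2N values and gene-set membership copied from the 20-gene example
s2n <- c(5.4, 5.1, 4.7, 4.3, 3.6, 3.3, 2.9, 2.2, 1.6, 0.9,
         -0.1, -0.8, -1.2, -1.9, -2.6, -2.8, -3.3, -3.7, -4.2, -4.5)
tag <- c(0, 1, 0, 1, 1, 0, 1, rep(0, 13))   # Gene2, Gene4, Gene5, Gene7 are in the set

# Hypothetical helper: running enrichment score for an already-ranked list
running_es <- function(metric, tag, alpha = 1) {
  no_tag <- 1 - tag
  w      <- abs(metric)^alpha * tag          # weight of each gene in the set
  M      <- sum(w)                           # total weight of genes in the set
  T_miss <- sum(no_tag)                      # number of genes not in the set (N - k)
  res    <- cumsum(w / M) - cumsum(no_tag / T_miss)
  list(res = res,
       es  = res[which.max(abs(res))],       # furthest deviation from zero, sign kept
       M   = M,
       T   = T_miss)
}

out <- running_es(s2n, tag)
print(round(out$res, 4))   # the RES column of the table
print(out$es)              # the enrichment score for this gene set
```

The walk peaks at Gene7, the last gene of the set, and returns to zero at Gene20, as in the table.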
  • 38. Published Results of GSEA for SPEM…
  • 39. • “The positive ES values and low P values suggest that these genes, as a set, are up-regulated significantly in the ulcerated samples.” • Both lists show ‘enrichment’ • General comparison: • The SPEM_WITH_INFLAMMATION (SWI) gene set is ‘more enriched’ than the SPEM gene set • Higher enrichment score • Lower p-value • Better defined ‘leading edge’ on plot • Corresponds with accompanying differential expression analysis… [Plot annotations. SPEM: upregulated in class ulcerated, Enrichment Score (ES) 0.554442, p-value (random genes) < 0.05. SWI: upregulated in class ulcerated, Enrichment Score (ES) 0.7697754, p-value (random genes) < 0.001.]
  • 40. • For the SWI gene set: • P-values for individual genes tend to be lower • Fold changes tend to be higher • Colors are more intense on the corresponding heatmap [Plot annotations. SPEM: upregulated in class ulcerated, Enrichment Score (ES) 0.554442, p-value (random genes) < 0.05. SWI: upregulated in class ulcerated, Enrichment Score (ES) 0.7697754, p-value (random genes) < 0.001.]
  • 41. Where do the p-values come from? • Permutation-based calculations are implemented in order to assess the significance of a particular gene set. • Based on samples • Based on genes
  • 43. [Figure: sample-based permutation. Both panels order genes along a signal-to-noise axis from high negative to high positive. Original samples: calculation of ES based on the ‘observed’ S2N of the predefined genes. Permuted sample 1: calculation of ES based on the ‘permuted’ S2N of the predefined genes.]
  • 44. [Figure panels: Permutation 1, Permutation 2, Permutation 3, Permutation 4]
  • 45. Histogram of EScores [Figure: density histogram of the permuted enrichment scores (x-axis: Enrichment Score, -1.0 to 1.0; y-axis: Density), annotated with the 0.975 quantile of the permuted scores, the original ES score and the nominal p-value.]
      y <- permuted.ES.scores
      obs <- 0.76
      sum(y > obs)/(1000 + 1)
      [1] 0.01598402
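A nominal p-value of this kind is simply the share of permuted scores at least as extreme as the observed one. The sketch below is one common convention rather than the slide's exact code: it adds 1 to both numerator and denominator so that the estimate can never be exactly zero. `perm_pvalue()` is a hypothetical helper, and `null_es` is a placeholder null distribution standing in for `permuted.ES.scores`, not real GSEA output.

```r
# Hypothetical helper: one-sided permutation p-value with the +1 correction
perm_pvalue <- function(obs, null_es) {
  b <- sum(null_es >= obs)            # permuted scores at least as large as the observed ES
  (b + 1) / (length(null_es) + 1)
}

set.seed(1)
null_es <- runif(1000, min = -0.9, max = 0.9)   # placeholder null scores (assumption)

p     <- perm_pvalue(0.76, null_es)
p_min <- perm_pvalue(2, null_es)                # no permuted score can reach 2

print(p)
print(p_min)   # smallest attainable value with 1000 permutations: 1/1001
```

With this convention the smallest reportable p-value is 1/(number of permutations + 1), which makes the resolution limit of the permutation test explicit.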
  • 46. Problem with Sample-based • Limited number of available orderings. • Different orderings can result in identical groupings • For example, if the first four samples are assigned to one group and the last four to the other, and we have: Order for permutation #1: T1 T2 C3 C4 C1 C2 T3 T4; Order for permutation #2: T1 C3 T2 C4 C1 T3 C2 T4 • Then ESp1 = ESp2 • Reduced ability to estimate true variability in sampling distribution [of ES] • If either group has n < 7, it is advised that gene permutation is carried out instead
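The gap between orderings and distinct groupings is easy to quantify. The sketch below assumes the 4 vs. 4 design of the example and uses only base R; the two sample orders are illustrative labels written out for this sketch.

```r
n_treat <- 4
n_ctrl  <- 4
n_total <- n_treat + n_ctrl

n_orderings <- factorial(n_total)          # ways to order 8 samples
n_groupings <- choose(n_total, n_treat)    # distinct ways to split them into two groups of 4

# Two different orderings that put the same four samples in the first group
order1 <- c("T1", "T2", "C3", "C4", "C1", "C2", "T3", "T4")
order2 <- c("T1", "C3", "T2", "C4", "C1", "T3", "C2", "T4")
same_grouping <- setequal(order1[1:4], order2[1:4])

print(c(orderings = n_orderings, groupings = n_groupings))
print(same_grouping)
print(choose(14, 7))   # distinct groupings available with 7 samples per group
```

Only the groupings matter for the ES, so a small design offers far fewer distinct permuted scores than the number of orderings suggests.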
  • 47. Gene-based Permutation [Figure: both panels order genes along a signal-to-noise axis from high negative to high positive. Original: ES based on S2N of the original genes. Permuted gene labels: ES based on S2N of a RANDOM gene set of the same size.]
  • 48. Problem with Gene-based • Does not take into account the correlation structure of genes. • Gene coexpression: • Groups of genes with underlying similarity (for example, genes associated with common transcription factors or biological processes) should move up/down in rank together • Permuting labels at random does not represent outcomes that make biological sense [Figure: example coexpression network from http://bioinfow.dep.usal.es/coexpression/]
  • 49. According to the authors… • “Genes may be ranked based on the differences seen in a small data set, with too few samples to allow rigorous evaluation of significance levels by permuting the class labels. In these cases, a P value can be estimated by permuting the genes, with the result that genes are randomly assigned to the sets while maintaining their size. This approach is not strictly accurate: because it ignores gene-gene correlations, it will overestimate the significance levels and may lead to false positives.” Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
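Gene-based permutation can be sketched on the 20-gene toy example: the ranked list stays fixed and random gene sets of the same size are drawn. This is an illustration under stated assumptions, not the Broad or phenoTest code; `es_for_set()` is a hypothetical helper, the weights use the absolute S2N, and the number of permutations and the seed are arbitrary.

```r
# Ranked S2N values from the toy example; the gene set sits at ranks 2, 4, 5 and 7
s2n     <- c(5.4, 5.1, 4.7, 4.3, 3.6, 3.3, 2.9, 2.2, 1.6, 0.9,
             -0.1, -0.8, -1.2, -1.9, -2.6, -2.8, -3.3, -3.7, -4.2, -4.5)
set_idx <- c(2, 4, 5, 7)

# Hypothetical helper: ES of the genes at positions `idx` in an already-ranked list
es_for_set <- function(metric, idx) {
  tag  <- seq_along(metric) %in% idx
  miss <- !tag
  w    <- abs(metric) * tag
  res  <- cumsum(w / sum(w)) - cumsum(miss / sum(miss))
  res[which.max(abs(res))]
}

set.seed(42)
n_perm  <- 1000
obs     <- es_for_set(s2n, set_idx)
# Null distribution: random gene sets of the same size, ranks left untouched
null_es <- replicate(n_perm, es_for_set(s2n, sample(length(s2n), length(set_idx))))
p_gene  <- (sum(null_es >= obs) + 1) / (n_perm + 1)

print(obs)
print(p_gene)
```

Because every random set is drawn independently of how genes co-vary, this null distribution ignores gene-gene correlation, which is the limitation the authors describe.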
  • 50. Recap on permutations… • Permutation-based calculations are implemented in order to assess the significance of a particular gene set. • Based on samples (need large sample size) • 1) Permute samples and re-compute ranked list of all genes • Variability in ranking is dependent on variability of samples • 2) Re-calculate ES score for gene set based on new rank • Based on genes (may not preserve correlation structure of genes) • 1) Permute genes to get a random set of genes • 2) Re-calculate ES score for gene set with ‘random ranks’
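The sample-based procedure can be sketched the same way: permute the class labels, recompute the ranking metric for every gene, re-rank, and recompute the ES. Everything below is simulated and hypothetical (the expression matrix, the gene set `set_idx`, and the helper names `s2n()` and `es_from_metric()`); the signal-to-noise ratio is taken as the difference in group means divided by the sum of group standard deviations, without the minimum-variance adjustments a full implementation may apply.

```r
set.seed(7)
n_genes     <- 200
n_per_group <- 8
labels      <- rep(c(1, 2), each = n_per_group)

# Simulated expression matrix (genes x samples); genes 1-15 are shifted up in class 1
expr    <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
set_idx <- 1:15
expr[set_idx, labels == 1] <- expr[set_idx, labels == 1] + 1.5

# Signal-to-noise ratio per gene: (mean1 - mean2) / (sd1 + sd2)
s2n <- function(x, lab) {
  a <- x[, lab == 1, drop = FALSE]
  b <- x[, lab == 2, drop = FALSE]
  (rowMeans(a) - rowMeans(b)) / (apply(a, 1, sd) + apply(b, 1, sd))
}

# Rank genes by the metric, then walk down the list to get the ES
es_from_metric <- function(metric, idx) {
  ord  <- order(metric, decreasing = TRUE)
  m    <- metric[ord]
  tag  <- ord %in% idx
  miss <- !tag
  w    <- abs(m) * tag
  res  <- cumsum(w / sum(w)) - cumsum(miss / sum(miss))
  res[which.max(abs(res))]
}

obs      <- es_from_metric(s2n(expr, labels), set_idx)
# 1) permute sample labels and re-rank all genes; 2) recompute ES for the same gene set
null_es  <- replicate(200, es_from_metric(s2n(expr, sample(labels)), set_idx))
p_sample <- (sum(null_es >= obs) + 1) / (length(null_es) + 1)

print(obs)
print(p_sample)
```

Here the gene set never changes; all variability in the null scores comes from re-ranking under shuffled labels, so correlated genes still move together.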
  • 51. Effect of permutation type on ES distribution for SPEM example [Figure: two null distributions. Sample-based permutation: ES 0.7697, p-value 0.2262. Gene-based permutation: ES 0.7697, p-value “0”.]
  • 52. What if I reverse the direction of my hypothesis, i.e. my genes are downregulated? [Report panels. Upregulated in class ulcerated: Enrichment Score (ES) 0.76985, Normalized ES (NES) 1.5405532, nominal p-value 0.16831683. Upregulated in class uninjured: Enrichment Score (ES) -0.76985, Normalized ES (NES) -1.5026467, nominal p-value 0.17635658.]
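The sign flip of the ES can be checked on the 20-gene toy example. Reversing the comparison negates every ranking metric and therefore reverses the ranked list. This is a sketch under the assumption that the weights use the absolute value of the metric; `es_walk()` is a hypothetical helper.

```r
s2n <- c(5.4, 5.1, 4.7, 4.3, 3.6, 3.3, 2.9, 2.2, 1.6, 0.9,
         -0.1, -0.8, -1.2, -1.9, -2.6, -2.8, -3.3, -3.7, -4.2, -4.5)
tag <- c(0, 1, 0, 1, 1, 0, 1, rep(0, 13))

# Hypothetical helper: ES for a list that is already sorted from high to low
es_walk <- function(metric, tag) {
  miss <- 1 - tag
  w    <- abs(metric) * tag
  res  <- cumsum(w / sum(w)) - cumsum(miss / sum(miss))
  res[which.max(abs(res))]
}

es_forward  <- es_walk(s2n, tag)
# Swap the classes: metrics change sign, so the ranked list is read in reverse
es_reversed <- es_walk(rev(-s2n), rev(tag))

print(c(forward = es_forward, reversed = es_reversed))
```

The ES changes sign but not magnitude. The NES and nominal p-value need not mirror exactly, because they depend on a permuted null distribution that is generated separately for each run.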
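  • 53. [Figure: four rank patterns, titled ‘Typical’, Up-and-Down-Regulated, Random and Center-Clustered, each reported with an ES and a p-value. Reported values, in the order listed: ES 0.80795, p-value 3.99E-11; ES 0.565999, p-value 0.055373; ES -0.46202, p-value 0.012155; ES -0.25653, p-value 0.770885.]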
  • 54. Revisiting SPEM results… • The published ES p-values were generated via permutation of gene labels • Follows guidelines based on sample size • As a bioinformatician, one should work with, rather than work around, sample-size limitations, and be clear when writing methods. • GSEA could not be conducted for a different part of the experiment • Each ‘group’ consisted of one sample • Fair to say that the results of GSEA and differential expression analysis support the hypothesis of SPEM-mediated processes • Both suggest significant up-regulation of SPEM-related genes in the treatment (ulcerated) group
  • 55. GSEA Take-aways • Quantitative measurements and visual output • Data may already be out there and just need to be analyzed! • A variety of R packages implement GSEA; there is also a GUI software application developed by the Broad Institute • Databases with gene expression data and gene lists are becoming increasingly common and even user friendly, such as ilincs.org • Interpretation and follow-up are specific to the experiment. • Pay attention to sample size • Comprehensive statistical analysis should be included if results are intended to be published. • More room for interpretation in an exploratory setting.
  • 57. References • Planet E (2013). phenoTest: Tools to test association between gene expression and phenotype in a way that is efficient, structured, fast and scalable. We also provide tools to do GSEA (Gene set enrichment analysis) and copy number variation. R package version 1.16.0. • Mootha VK et al. PGC-1-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 34, 267-273 (2003). • Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 2005 102 (43) 15545-15550; published ahead of print September 30, 2005.
  • 58. Additional Slides • Different Implementations • Explaining example heatmaps • References
  • 59. Normalized Enrichment Score (NES) • Takes into account the size of each gene set and adjusts the original ES • Important when many lists are taken into consideration • Used to calculate the false discovery rate (FDR) • “The FDR is the estimated probability that a set with a given NES represents a false positive finding; it is computed by comparing the tails of the observed and null distributions for the NES” • The older method used the family-wise error rate (FWER), but: • “Because our primary goal is to generate hypotheses, we chose to use the FDR to focus on controlling the probability that each reported result is a false positive.” Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
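As a rough sketch of the normalization step, following the description in Subramanian et al. (2005), the observed ES is divided by the mean of the permuted scores that share its sign. `normalize_es()` is a hypothetical helper, and the null scores below are made-up numbers chosen so the arithmetic is easy to follow.

```r
# Hypothetical helper: rescale an ES by the mean of the same-sign null scores
normalize_es <- function(es, null_es) {
  if (es >= 0) {
    es / mean(null_es[null_es >= 0])
  } else {
    es / abs(mean(null_es[null_es < 0]))
  }
}

# Made-up permuted scores for one gene set (illustration only)
null_es <- c(0.5, 0.4, 0.6, -0.5, -0.3, -0.4)

nes_pos <- normalize_es(0.75, null_es)   # 0.75 divided by mean(0.5, 0.4, 0.6)
nes_neg <- normalize_es(-0.60, null_es)  # -0.60 divided by |mean(-0.5, -0.3, -0.4)|

print(c(nes_pos, nes_neg))
```

Because each gene set is scaled by its own null distribution, NES values from sets of different sizes can be placed on a common scale before the FDR is estimated.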
  • 60. A note on ranking… • Pre-ranked gene lists may be used • Should incorporate magnitude and direction for purposes of interpretation [Figure panels: Default; Pre-ranked by edgeR p-value]
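One way to build a pre-ranked list that keeps both magnitude and direction is to sign the log-transformed p-value by the direction of the fold change. The sketch below is an illustration, not the ranking used in the slides: the data frame `de` and its values are invented, and the column names `logFC` and `PValue` are assumed to follow the usual layout of a differential expression results table.

```r
# Invented differential expression results (illustration only)
de <- data.frame(
  gene   = c("GeneA", "GeneB", "GeneC", "GeneD"),
  logFC  = c(2.1, -1.8, 0.3, -0.2),
  PValue = c(1e-6, 1e-5, 0.40, 0.70),
  stringsAsFactors = FALSE
)

# Magnitude from the p-value, direction from the fold change
de$rank_metric <- sign(de$logFC) * -log10(de$PValue)

# Strongly up-regulated genes first, strongly down-regulated genes last
ranked <- de[order(de$rank_metric, decreasing = TRUE), c("gene", "rank_metric")]
print(ranked)
```

Ranking by the raw p-value alone would place strongly up- and down-regulated genes together at the top of the list, which makes the direction of enrichment hard to interpret.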
  • 61. Example 1: SNP data • Title: “Using Environmental Correlations to Identify Loci Underlying Local Adaptation” • From article: • [This study will demonstrate] “covariance in allele frequencies between populations from a set of markers” • “These matrices reveal the close genetic relationship of populations from the same geographic region” • Figure caption: “The matrices are displayed as heat maps with lighter colors corresponding to higher values. The rows and columns of these matrices have been arranged by broad geographic label.” • Personal observation: • “broad geographic area” not defined • Color key would be nice • Graham Coop, David Witonsky, Anna Di Rienzo, Jonathan K. Pritchard. Genetics. August 1, 2010 vol. 185 no. 4, 1411-1423; DOI: 10.1534/genetics.110.114819
  • 62. Example 2: Gene Expression and SNP data • Title: “A recurrent inactivating mutation in RHOA GTPase in angioimmunoblastic T cell lymphoma” • From article: • “To explore the functional and mechanistic implications of the somatic mutations, we performed pathway analysis by integrating mutation and gene expression data from AITL cases” • “After controlling for platform differences, the tumor and normal samples separated into distinct clusters” • Figure caption: “Heat map from hierarchical clustering of differentially expressed genes. […] Gene clusters from hierarchical classification were subjected to the DAVID web server for Gene Ontology analysis, and the most enriched term for each cluster was determined using the q value from the FDR test”. • Notes: • Color key would still be nice • Yoo HY et al. Nature Genetics 46, 371–375 (2014) DOI:10.1038/ng.2916
  • 63. 3 Methods for Discussion • 1: R implementation of GSEA (BROAD) • Is supplied as a function, not a package • 2: Java desktop application (BROAD) • Graphical user interface (point and click) • 3: R package “phenoTest” (Planet, E (2013)) • An R package for carrying out GSEA
  • 64. GSEA – R function • Many analytical tools meant for R implementation are provided as R packages… but not this one! • Download the file from the Broad Institute website, unzip it, open the text file “GSEA.1.0.R” in R and run the code to get the functions. • Problem #1: the code is not maintained; some operating systems will not tolerate particular syntax, and the code must be edited • Solution = Jacek • Problem #2: the data ‘needs’ to be in a very specific format: • .res or .gct for expression data • .cls for phenotype/class data • .gmt or .gmx for gene lists
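For orientation, two of these formats are simple enough to write by hand. The sketch below writes a tab-delimited .gmt file (set name, description, then the genes) and a three-line categorical .cls file (counts, class names, one label per sample) to temporary files. The gene sets, sample labels and file locations are hypothetical, and the layout follows the Broad file-format descriptions as understood here; check it against the current documentation before relying on it.

```r
# Hypothetical gene sets; "na" is a placeholder description field
gene_sets <- list(
  SET_A = c("Gene2", "Gene4", "Gene5", "Gene7"),
  SET_B = c("Gene1", "Gene3")
)
gmt_lines <- vapply(
  names(gene_sets),
  function(nm) paste(c(nm, "na", gene_sets[[nm]]), collapse = "\t"),
  character(1)
)
gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Hypothetical sample labels, in the same order as the expression columns
classes   <- rep(c("ulcerated", "uninjured"), each = 4)
cls_lines <- c(
  paste(length(classes), length(unique(classes)), 1),   # n samples, n classes, 1
  paste("#", paste(unique(classes), collapse = " ")),   # class names
  paste(classes, collapse = " ")                        # one label per sample
)
cls_file <- tempfile(fileext = ".cls")
writeLines(cls_lines, cls_file)

cat(readLines(gmt_file), sep = "\n")
cat(readLines(cls_file), sep = "\n")
```

Writing the files from R rather than by hand keeps the sample order in the .cls file tied to the column order of the expression data.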
  • 65. GSEA – R: getGSEAready functions • Solution 2a: Expression Data • Just need a correctly formatted data.frame object to pass to the function.
      # New function converts 'all.gene.result' to the necessary format
      getGSEAready.expression <- function(expression.data, list.format) {
        # Use gene IDs or symbols as row names
        dataset <- read.delim(expression.data, stringsAsFactors = FALSE)
        if (list.format == "geneid") {
          row.names(dataset) <- dataset$geneid
        } else if (list.format == "symbol") {
          nas <- dataset[is.na(dataset$symbol), ]
          # Change NA to 'unnamed[#]'; seq_len() also handles the case of no missing symbols
          nas$symbol <- sprintf("unnamed%d", seq_len(nrow(nas)))
          notna <- dataset[!is.na(dataset$symbol), ]
          dataset <- rbind(notna, nas)
          row.names(dataset) <- dataset$symbol
        }
        # Remove all columns except for expression data (the first 10 columns are dropped)
        dataset <- dataset[, -(1:10)]
        return(dataset)
      }
  • 67. GSEA – R: getGSEAready functions • Solution 2b: Phenotype data • Similar story; R needs a list object with 2 vectors: • phen: a character vector with the class labels • class.v: a numeric [1,2] vector to indicate class for each sample
      # New function uses 'Sample.Info' to generate phenotype data
      # (simple design: treatment vs. control)
      getGSEAready.phenotype <- function(phenotype.data) {
        pheno <- read.delim(phenotype.data, stringsAsFactors = FALSE)
        phen <- as.character(unique(pheno$Group))
        # Label your 'group of interest' (the first group listed) as 1, the other as 2
        class.v <- ifelse(pheno$Group == phen[1], 1, 2)
        classdata <- list(phen = phen, class.v = class.v)
        return(classdata)
      }
  • 68. New ‘classdata’ list object
      SampleInfo = "E:SampleInfo.txt"
      classdata = getGSEAready.phenotype(SampleInfo)
      classdata
      ## $phen
      ## [1] "ulcerated" "uninjured"
      ##
      ## $class.v
      ## [1] 1 1 1 1 2 2 2 2
  • 69. GSEA – R function • Solution 2c: Gene Lists • Set up in Excel with specified format and save as .gmt file (use quotation marks around filename when saving) • na’s may be replaced with description of gene list • Note: There are ‘pre-packaged’ gene lists available from MSigDB but that is for another discussion
  • 70. GSEA– R function 1: Get data ready
  • 71. GSEA – R function 2: Input parameters for main function
  • 72. GSEA – R function 3: Run main function
  • 73. GSEA – R function 4: Check output Note: I asked for data to be output to “/home/vandersm/Documents/gseaRexample/zavros” but it gets output to the parent directory (with the intended directory name prefixed to the prefix)
  • 74. GSEA – R function Example output for one gene list (1)
  • 75. GSEA – R function Example output for one gene list (2)
  • 76. GSEA – R function Example of global plots (all lists considered)
  • 77. GSEA – R function Example of report for one gene list Is this gene in the leading edge subset?
  • 78. GSEA – R function Need to check the parameters you used?
  • 79. GSEA – R function Pros: • Once the data is formatted correctly, the analysis is rather straightforward. • Provides graphical output for individual lists as well as global reports Cons: • Lacks options that are easy to change in GUI version • No option for presorted list • No options for how to sort list • Difficult to obtain sorted list • Creates a matrix of gene ranks corresponding to permuted sample labels regardless of whether or not sample permutation is chosen • P-values have potential to be zero [at least if permuting by gene]
  • 81. GSEA with JAVA GUI • Similar to the R script/functions, much of the hassle concerns correctly formatted data. If anyone is interested in those formats, let me know; for now, let’s not dwell on them
  • 88. GSEA with JAVA GUI • Pros: • Once data is loaded, the ‘point-and-click’ environment is generally convenient and flexible • Graphs are somewhat ‘prettier’ than from the R-code • Results can be used from an easily-navigable page • Very easy to obtain ranked list of genes • Cons • Not necessarily an option for incorporation into our pipeline • Cannot test functionality in interactive mode • Crashes if #permutations increases 10-fold • May yield p-values of 0.
  • 89. GSEA with phenoTest • phenoTest: R package by Planet E (2013) • Implements GSEA in a manner that is rather flexible (once formatting issues are taken care of!) • Input dataset is in the form of an eset • Use Jacek’s code/my function for creating an eset from the raw.data and SampleInfo file obtained during analysis • Should the # of genes match that of the ‘processed’ data? • Example: in my experiment, my ‘all_genes_result’ has data for 16,400 genes but my eset has info for 20,640 • Has a modest effect on results • To compare across implementations, I restricted my eset to only include genes that were in my ‘all_genes_result’
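Restricting one data set to the genes present in another is a row-subsetting step. The sketch below shows the idea on a plain matrix with invented gene names; `expr` and `processed_genes` are hypothetical stand-ins for the eset's expression values and the gene IDs in ‘all_genes_result’, and the same logical index would be expected to work for row-subsetting an eset.

```r
set.seed(3)
# Stand-in for the expression values of the eset (genes x samples)
expr <- matrix(rnorm(5 * 4), nrow = 5,
               dimnames = list(paste0("Gene", 1:5), paste0("S", 1:4)))

# Stand-in for the gene IDs found in the processed results
processed_genes <- c("Gene1", "Gene3", "Gene5", "Gene9")

keep            <- rownames(expr) %in% processed_genes
expr_restricted <- expr[keep, , drop = FALSE]

print(dim(expr))
print(rownames(expr_restricted))
```

Genes listed in the processed results but absent from the expression data (here the invented Gene9) are simply ignored by this filter.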
  • 90. GSEA with phenoTest • Calculation of ‘observed’ ES score is the same but ONLY permutes genes • Calls permuted ES scores ‘Simulated Scores’; these can easily be accessed after the analysis is run • Automatically creates an NES plot, which could be a better alternative for publication if many lists are taken into consideration. • Has option to create plot using Wilcoxon test as discussed by Virtaneva (2001)
  • 91. GSEA with phenoTest – Example Graphs ES Plot NES Plot Wilcoxon-ES Plot
  • 92. GSEA with phenoTest • Creation of epheno object makes ‘playing around’ with the data much easier • See how GSEA will behave when particular patterns among the ranked genes are artificially created • Obviously, once you re-create the gene-ranks as they are in the GSEA BROAD implementation, this can be done with the aforementioned options; but this saves some computational steps.
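  • 93. [Figure: four rank patterns, titled ‘Typical’, Up-and-Down-Regulated, Random and Center-Clustered, each reported with an ES and a p-value. Reported values, in the order listed: ES 0.80795, p-value 3.99E-11; ES 0.565999, p-value 0.055373; ES -0.46202, p-value 0.012155; ES -0.25653, p-value 0.770885.]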
  • 94. GSEA with phenoTest • Pros: • Has a few different options as far as plots are concerned • Function is relatively easy to run [compared to the BROAD script] and the epheno object is useful • Run-time is fast (doesn’t ever re-compute permuted S2N) • Cons • No option to permute by sample • Some of the functions that are described in the reference paper do not work (not a big deal, just annoying) • Have not figured out how to implement different ranking mechanism.
  • 96. Is there a better, perhaps more ‘statistical’ approach?

Editor's notes

  1. Chances are that you have seen a plot like the one in the background of this slide during a Wednesday or Thursday seminar. After this presentation, my hope is that you will have a slightly better idea of what those plots mean and when they are used.
  2. mention
  3. Mention that there is no specific SPEM gene
  4. Zero-crossing: where the correlation (signal to noise) crosses zero - correlations with genes are now negative