The document provides an introduction to gene set enrichment analysis (GSEA) methodology. It describes how GSEA analyzes gene expression data to determine whether a particular set of genes, defined a priori, shows statistically significant differences between biological conditions (e.g. case vs. control groups). The key steps are: ranking genes based on their correlation with the conditions, calculating a running enrichment score to quantify overrepresentation of the gene set at the top or bottom of the ranking, and assessing significance through permutation testing. An example analysis compares gene expression from ulcerated vs. uninjured mouse stomachs to test for enrichment of genes related to stomach epithelial metaplasia.
2. Overview
• Preliminaries
• Genes & gene sets
• Gene expression and enrichment
• GSEA
• Introduction & Example from Publication
• Experimental conditions
• Purpose of GSEA
• Background on methodology
• Gene ranking
• Enrichment Scores & Plots
• Assessing significance
3. Fundamentals of Genes, Genomes
• Genome = Collection of DNA sequences [across all
chromosomes of a species].
• Genes = Subset of genome that ‘codes’ for a protein
Image: http://cs.stanford.edu/people/eroberts/cs181/projects/2010-11/Genomics/accuracy.html
4. Genes & Genotype
• Individuals of the same species share the
same genome; have the same set of genes
• SNPs (single nucleotide polymorphisms) allow for
genetic variation
• Genotype = the DNA sequence
corresponding to a gene in a given individual
• “Central Dogma” of genetics:
Image 1: http://cs.stanford.edu/people/eroberts/cs181/projects/2010-11/Genomics/accuracy.html
Image 2: http://www.lhsc.on.ca/Patients_Families_Visitors/Genetics/Inherited_Metabolic/Mitochondria/DiseasesattheMolecularLevel.htm
6. Genotype vs. Gene Expression
• Genotype is the same in all somatic cells of an organism
• Gene expression [rate at which genes are transcribed] is
different to varying degrees
• Random fluctuations within same types of cells are expected
• Significant differences are expected between significantly different cells
8. Gene Expression Data
• RNA-seq (“Next-Generation sequencing”) simultaneously
generates data for both genotype and gene expression
• Experimental set-up is typically cases vs. controls
• Cases: phenotype is induced experimentally [pure experiment]
• Controls: represent ‘baseline’ gene expression for comparison
• Quantify transcriptome of each sample
• Compare ‘fold change’ for expression of cases compared to controls
• FC = (expression in cases)/(expression in controls)
• Determine significance of individual gene up/down regulation
• Results of differential expression analysis or additional analyses often
visualized with a “heatmap”…
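As a rough illustration of the fold-change bullets above, the following sketch computes FC and log2(FC) for a toy expression matrix. The gene names, sample names, and values are made up for illustration and do not come from the experiment discussed later.

```r
# Toy expression matrix (made-up values): rows = genes, columns = samples
expr <- matrix(
  c(80, 120, 100, 100,  20, 30, 25, 25,   # geneA: 4 cases, then 4 controls
    10,  10,  10,  10,  40, 40, 40, 40),  # geneB: 4 cases, then 4 controls
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"),
                  c(paste0("case", 1:4), paste0("ctrl", 1:4)))
)
is.case <- grepl("^case", colnames(expr))

# FC = (mean expression in cases) / (mean expression in controls)
fc <- rowMeans(expr[, is.case]) / rowMeans(expr[, !is.case])

# log2(FC) is symmetric around 0, which makes up- and down-regulation comparable
log2fc <- log2(fc)

print(data.frame(fc = fc, log2fc = log2fc))
```

Here geneA has FC = 4 (log2 FC = 2) and geneB has FC = 0.25 (log2 FC = -2). Whether a change of that size is significant is a separate question that needs a statistical test.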
9. Quick teaching note on heatmaps…
• Very common visualization tool in genetic research
• Incorporate genotype and/or gene expression data
• Support and/or generate hypotheses
• Organization of rows/columns determined by experimental design
• Range from relatively simple to complex [in terms of interpretation]
• Heatmap for purposes of this discussion:
• Columns → Samples (cases, then controls)
• Rows → Genes
• Color → Direction/intensity of expression
• Red: Higher than row average
• Blue [or green]: Lower than row average
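The color convention described above (red for higher than the row average, blue for lower) amounts to centering each gene on its own mean before plotting. A minimal sketch with simulated data follows; the matrix, the sample names, and the palette are assumptions chosen only for illustration.

```r
set.seed(1)

# Simulated expression values (hypothetical): 6 genes x 8 samples
expr <- matrix(rnorm(6 * 8, mean = 10), nrow = 6,
               dimnames = list(paste0("gene", 1:6),
                               c(paste0("case", 1:4), paste0("ctrl", 1:4))))

# Center each row (gene) on its own average:
# positive values = higher than row average, negative = lower
centered <- sweep(expr, 1, rowMeans(expr))

# Draw the heatmap only in an interactive session; rows and columns keep
# their original order (no clustering), blue = low, red = high
if (interactive()) {
  heatmap(centered, Rowv = NA, Colv = NA, scale = "none",
          col = colorRampPalette(c("blue", "white", "red"))(21))
}
```

Keeping the columns in their design order (cases, then controls) is a choice made here so that group differences show up as blocks of color.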
13. “Enrichment” & GSEA
• Results of individual genes
• Dictionary(.com) definition of enrichment:
• “act of making fuller or more meaningful or rewarding”
• Gene set enrichment
• Gene sets are predefined in the literature and/or in database:
• Grouped by information regarding gene function, pathway membership, etc.
• Gene sets are ‘enriched’ if experimental findings are in accordance with
the set of interest [with hope of adding meaning to results]
• Definition not always obvious
• Good resource: “An introduction to effective use of enrichment analysis software”
• Gene Set Enrichment Analysis
• Statistical methods determine significance of enrichment for gene set by
comparing distribution of genes in set to ‘background distribution’
16. Quick review…
• GSEA is a common ‘secondary analysis’ after gene
expression data has been collected
• Gene sets can be determined a-priori specific to an experiment (as
in example that follows) or
• Multiple gene-sets from databases can be used in a data-mining
fashion to support or generate hypotheses
• Implications of multiple testing (beyond scope of presentation)
• Good to know the basics
• GSEA still a common request of bioinformaticians
• “Newer/better” methods build on or refer to GSEA
• Goal for remainder of presentation:
• Use example from recent publication to elucidate basic concepts
and terminology
• Go into further detail for statistical methodology related to GSEA
18. • Experimental setup:
• 2 groups of mice, balanced design (n_i = 4, i = 1, 2)
• Mice are sacrificed, samples have been collected/processed, and
RNAseq data is available.
• Hypothesis: The stomachs of the mice in Group 1 (the treatment
group) are undergoing SPEM-mediated repair
• Our lab is asked to conduct GSEA of SPEM genes to support hypothesis
GSEA for SPEM
[Figure: Group 1 = Ulcerated (x 4 mice); Group 2 = Uninjured (x 4 mice)]
19. SPEM Lists
• There is no curated SPEM pathway per se, but there are gene
sets corresponding to SPEM that have been published in the
literature.
• I have 2 gene-set lists as follows:
• SPEM [as generally observed] (list name = “SPEM”)
• SPEM in response to inflammation (list name = “SPEM_with_Inflammation”)
• Both of these lists contain genes that were previously found to
be up-regulated during SPEM
• GSEA will support the research hypothesis if upregulated expression of
SPEM-related genes is evident in the ulcerated samples compared to the
uninjured control samples.
• **Note: this presentation is intended to shed light on basic features of
GSEA and does not consider the effects of cross-talk between pathways.
20. Summary of basic analysis
• Methods are specific to 2-group experiment
• Inputs:
• 1 – Gene expression for ALL genes
• 2 – Phenotype information (define groups for comparison)
• 3 – List(s) of genes of interest
• Intermediary step:
• Rank genes based on differential expression
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
22. A note on ranking…
• Default ranking mechanism (signal to noise ratio, s2n)
• Formula: $\mathrm{S2N}_i = \dfrac{\mu_i^{\mathrm{Group1}} - \mu_i^{\mathrm{Group2}}}{\sigma_i^{\mathrm{Group1}} + \sigma_i^{\mathrm{Group2}}}$
• Reflects correlation (association) of gene with phenotype in
terms of size and direction
• Rank ≠ Significance of differential expression
• Genes ranking very high or very low are likely to be
significantly differentially expressed
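The signal-to-noise ranking can be sketched in a few lines of R. This is a simplified illustration, not the Broad implementation (which, for example, also adjusts very small standard deviations); the function name `s2n` and the toy values are assumptions.

```r
# Signal-to-noise ratio per gene: (mean1 - mean2) / (sd1 + sd2)
s2n <- function(expr, group) {
  g1 <- expr[, group == 1, drop = FALSE]
  g2 <- expr[, group == 2, drop = FALSE]
  (rowMeans(g1) - rowMeans(g2)) / (apply(g1, 1, sd) + apply(g2, 1, sd))
}

# Toy data (made-up values): 3 genes, 4 samples per group
expr <- rbind(
  geneA = c(10, 12, 11, 13,  4, 6, 5, 7),  # higher in group 1
  geneB = c( 5,  6,  5,  6,  5, 6, 5, 6),  # no difference
  geneC = c( 2,  3,  2,  3,  9, 8, 9, 8)   # higher in group 2
)
group <- c(1, 1, 1, 1, 2, 2, 2, 2)

metric <- s2n(expr, group)
ranked <- sort(metric, decreasing = TRUE)  # top = associated with group 1
print(ranked)
```

The sign of the metric gives the direction (which group is higher) and its magnitude gives the size relative to the within-group spread, so geneA lands at the top of the ranking, geneB in the middle, and geneC at the bottom.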
23. Summary of basic analysis (cntd.)
• S = List of genes belonging to defined gene set (independent of data)
• R = Ranked list of genes (dependent on data & method of ranking genes)
• “Given an a priori defined set of genes S, …, the goal of GSEA is to
determine whether the members of S are randomly distributed throughout
R or primarily found at the top or bottom. We expect that sets related to
the phenotypic distinction will tend to show the latter distribution.”
• Null hypothesis: Membership in S ⇏ Location in R (members of S are randomly distributed throughout R)
• Alternative: Membership in S ⇒ Location in R (high or low)
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
24. What would ideal rank for SPEM genes ‘look’ like?
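[Figure: gene-rank axis running from high expression in cases (high ‘correlation’ with ulcerated group) to low expression in cases (high ‘correlation’ with uninjured group), with the positions of set genes shown for Example List 1, Example List 2, and Example List 3. Annotation: “Uninformative”.]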
25. Summary of basic analysis
• Methods are specific to 2-group experiment
• Inputs:
• 1 – Gene expression for ALL genes
• 2 – Phenotype information (define groups for comparison)
• 3 – List(s) of genes of interest
• Intermediary step:
• Rank genes based on differential expression
• Outputs:
• 1 – Enrichment scores and p-values (summary information)
• 2 – Enrichment plots (graphical summary)
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
26. Enrichment Score (ES)
(A) An expression data set sorted
by correlation with phenotype,
the corresponding heat map,
and the “gene tags,” i.e.,
location of genes from a set S
within the sorted list.
(B) Plot of the running sum for S in
the data set, including the
location of the maximum
enrichment score (ES) and the
leading-edge subset.
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
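35. Calculation of Running Enrichment Score (RES)
• Ranked genes: $R_1, \dots, R_N$
• Ranking metric for each gene: $m_1, \dots, m_N$
• Gene set S with k elements: $s_1, \dots, s_k$
• $\mathrm{Tag}_i = 1$ if $R_i \in S$, else 0; $\mathrm{No.Tag}_i = 1$ if $R_i \notin S$, else 0
• $M = \sum_{i=1}^{N} m_i \cdot \mathrm{Tag}_i$ (sum of ranking metric for the k genes in set)
• $T = \sum_{i=1}^{N} \mathrm{No.Tag}_i = N - k$ (# of genes in R but not S)
• Start at $R_1$. If $\mathrm{Tag}_1 = 1$, then $\mathrm{RES}_1 = m_1 \cdot \frac{1}{M}$,
else $\mathrm{RES}_1 = -\frac{1}{T}$
• Move to $R_2$. If $\mathrm{Tag}_2 = 1$, then $\mathrm{RES}_2 = \mathrm{RES}_1 + m_2 \cdot \frac{1}{M}$,
else $\mathrm{RES}_2 = \mathrm{RES}_1 - \frac{1}{T}$
• For a given $R_j$,
$\mathrm{RES}_j = \sum_{i=1}^{j} \left( m_i \cdot \frac{1}{M} \cdot \mathrm{Tag}_i \right) - \sum_{i=1}^{j} \left( \frac{1}{T} \cdot \mathrm{No.Tag}_i \right)$
* At the final $R_N$, we have $\frac{M}{M} - \frac{T}{T} = 0$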
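The running sum can be traced step by step with a small example. The sketch below follows the slide's definitions literally, using m_i / M for genes in the set and 1 / T for genes outside it. It assumes the metric is positive for every gene in the set (Subramanian et al. use the absolute value of the metric so that each hit is always a step up). The function name, the toy metric values, and the set membership are hypothetical.

```r
# Running enrichment score for a list that is already sorted by rank
running.es <- function(metric, in.set) {
  M <- sum(metric[in.set])     # sum of ranking metric over genes in S
  n.miss <- sum(!in.set)       # T = N - k, genes in R but not in S
  step <- ifelse(in.set, metric / M, -1 / n.miss)
  cumsum(step)                 # RES_1, ..., RES_N
}

# Toy ranked list (made-up metric): genes 1 and 3 belong to the set
metric <- c(3, 2, 1, 0.5, -1, -2)
in.set <- c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)

res <- running.es(metric, in.set)
print(res)
```

With M = 4 and T = 4, the steps are +0.75, -0.25, +0.25, -0.25, -0.25, -0.25, so the running sum is 0.75, 0.5, 0.75, 0.5, 0.25, 0 and returns to zero at the last gene, as stated on the slide.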
36. Determining ES
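• $\mathrm{RES}_j = \sum_{i=1}^{j} \left( m_i \cdot \frac{1}{M} \cdot \mathrm{Tag}_i \right) - \sum_{i=1}^{j} \left( \frac{1}{T} \cdot \mathrm{No.Tag}_i \right)$
• After going through all N ranked genes, you are left with a
vector of RES’s. Then, for a given gene set, ES is the RES with the
largest absolute value (sign retained):
$\mathrm{ES} = \mathrm{RES}_{j^*}$, where $j^* = \arg\max_j |\mathrm{RES}_j|$
• Hit term accumulated up to position j (before normalization), by weighting:
Unweighted: $\sum_{i=1}^{j} \mathrm{Tag}_i$
Weighted; α = 1: $\sum_{i=1}^{j} m_i \cdot \mathrm{Tag}_i$
Weighted; α = α*: $\sum_{i=1}^{j} m_i^{\alpha^*} \cdot \mathrm{Tag}_i$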
“The enrichment score is the maximum deviation from zero
encountered in the random walk; it corresponds to a weighted
Kolmogorov–Smirnov-like statistic”
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
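To see how the weighting changes the ES, the sketch below computes the score as the signed maximum deviation of the running sum, with an exponent `alpha` on the ranking metric. It uses the absolute value of the metric, following Subramanian et al.; the function name `enrichment.score` and the toy data are assumptions for illustration.

```r
# ES = signed maximum deviation from zero of the running sum
enrichment.score <- function(metric, in.set, alpha = 1) {
  w <- abs(metric)^alpha                      # alpha = 0: every hit counts equally
  hit  <- ifelse(in.set, w / sum(w[in.set]), 0)
  miss <- ifelse(in.set, 0, 1 / sum(!in.set))
  res <- cumsum(hit - miss)
  res[which.max(abs(res))]                    # keep the sign of the deviation
}

# Toy ranked list (made-up metric), sorted from high to low
metric <- c(3, 2, 1, 0.5, -1, -2)

top.set    <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)   # mostly at the top
bottom.set <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)  # at the bottom

es.weighted   <- enrichment.score(metric, top.set, alpha = 1)   # 5/7
es.unweighted <- enrichment.score(metric, top.set, alpha = 0)   # 2/3
es.bottom     <- enrichment.score(metric, bottom.set)           # -1
print(c(es.weighted, es.unweighted, es.bottom))
```

A set concentrated at the bottom of the ranking gives a negative ES, which is why the sign has to be kept rather than reporting only the absolute maximum.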
39. • “The positive ES values and low P
values suggest that these genes, as a
set, are up-regulated significantly in
the ulcerated samples.”
• Both lists show ‘enrichment’
• General comparison:
• SPEM_WITH_INFLAMMATION (SWI) gene
set is ‘more enriched’ than SPEM gene set
• Higher enrichment score
• Lower p-value
• Better defined ‘leading edge’ on plot
• Corresponds with accompanying differential
expression analysis…
SPEM: Upregulated in class ulcerated; Enrichment Score (ES) 0.554442; p-value (random genes) < 0.05
SWI: Upregulated in class ulcerated; Enrichment Score (ES) 0.7697754; p-value (random genes) < 0.001
40. • For the SWI gene set:
• P-values for individual
genes tend to be lower
• Fold changes tend to be
higher
• Colors more intense on
corresponding heatmap
41. Where do the p-values come from?
• Permutation-based calculations are implemented in order
to assess the significance of a particular gene set.
• Based on samples
• Based on genes
43. [Figure: heatmaps with genes ordered by signal to noise, from high positive to high negative. Original Samples: calculation of ES based on ‘observed’ S2N of predefined genes. Permuted Sample 1: calculation of ES based on ‘permuted’ S2N of predefined genes.]
45. Histogram of ES scores
[Figure: density histogram of permuted enrichment scores (x-axis: Enrichment Score, -1.0 to 1.0; y-axis: Density), marking the 0.975 quantile for permuted scores, the original ES, and the nominal p-value.]
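y <- permuted.ES.scores    # ES values from the permutations
obs <- 0.76                # observed (original) ES
sum(y > obs) / (1000 + 1)  # share of permuted ES values above the observed ES
[1] 0.01598402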
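The calculation above presupposes a vector of permuted ES values. One way such a vector might be produced by permuting sample labels is sketched below on simulated data. Everything here is an assumption made for illustration: the data, the helper names (`s2n`, `enrichment.score`, `es.for.labels`), the group size of 8 per group, and the choice to add 1 to the numerator as well as the denominator so the estimate can never be exactly zero.

```r
set.seed(42)

# Signal-to-noise ratio per gene: (mean1 - mean2) / (sd1 + sd2)
s2n <- function(expr, group) {
  g1 <- expr[, group == 1, drop = FALSE]
  g2 <- expr[, group == 2, drop = FALSE]
  (rowMeans(g1) - rowMeans(g2)) / (apply(g1, 1, sd) + apply(g2, 1, sd))
}

# Signed maximum deviation of the weighted running sum (alpha = 1)
enrichment.score <- function(metric.sorted, in.set) {
  w <- abs(metric.sorted)
  step <- ifelse(in.set, w / sum(w[in.set]), -1 / sum(!in.set))
  res <- cumsum(step)
  res[which.max(abs(res))]
}

# Simulated data (hypothetical): 200 genes x 16 samples, 8 per group
n.genes <- 200
group <- rep(1:2, each = 8)
expr <- matrix(rnorm(n.genes * length(group)), nrow = n.genes,
               dimnames = list(paste0("g", seq_len(n.genes)), NULL))
gene.set <- paste0("g", 1:20)
expr[gene.set, group == 1] <- expr[gene.set, group == 1] + 2  # set is up in group 1

# Re-rank ALL genes for a given labeling, then score the gene set
es.for.labels <- function(labels) {
  m <- sort(s2n(expr, labels), decreasing = TRUE)
  enrichment.score(m, names(m) %in% gene.set)
}

obs <- es.for.labels(group)                        # observed ES
n.perm <- 1000
perm.es <- replicate(n.perm, es.for.labels(sample(group)))

# Nominal one-sided p-value for a positive ES
p.value <- (sum(perm.es >= obs) + 1) / (n.perm + 1)
print(c(ES = obs, p = p.value))
```

Because the labels are shuffled and the whole ranking is recomputed each time, the correlation among genes is preserved in every permuted data set; only the link between samples and phenotype is broken.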
46. Problem with Sample-based
• Limited number of available orderings.
• Different orderings can result in identical groupings
• For example, if we have:
Order for permutation #1:
T1 T2 C3 C4 C1 C2 T3 T4
Order for permutation #2:
T1 C3 T2 C4 C1 T3 C2 T4
• Then ESp1 = ESp2 (the first four and the last four samples form the same two groups in both orderings)
• Reduced ability to estimate true variability in sampling
distribution [of ES]
• If either group has n < 7, it is advised that gene
permutation is carried out instead
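How limited the orderings are can be counted directly. For the 4 vs. 4 design of the example, the sketch below compares the number of sample orderings with the number of distinct groupings. The variable names are arbitrary, and the last line assumes a one-sided test in which the observed labeling counts as one of the groupings.

```r
n.per.group <- 4
n.samples <- 2 * n.per.group

n.orderings <- factorial(n.samples)           # 40320 ways to order 8 samples
n.groupings <- choose(n.samples, n.per.group) # 70 distinct ways to pick group 1

# Each grouping is hit by 4! * 4! = 576 different orderings
orderings.per.grouping <- factorial(n.per.group)^2

# With only 70 distinct groupings, the permutation p-value cannot go below 1/70
min.p <- 1 / n.groupings

print(c(orderings = n.orderings, groupings = n.groupings, min.p = min.p))
```

With at most 70 distinct ES values in the null distribution, a p-value can never be smaller than about 0.014, no matter how many permutations are requested.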
48. Problem with Gene-based
• Does not take into account the correlation structure of genes.
• Gene coexpression:
• Groups of genes with underlying similarity (for example, genes
associated with common transcription factors or biological
processes) should move up/down in rank together
Permuting labels at random
does not represent outcomes
that biologically make sense
Example coexpression network from: http://bioinfow.dep.usal.es/coexpression/
49. According to the authors…
• “Genes may be ranked based on the differences seen in a
small data set, with too few samples to allow rigorous
evaluation of significance levels by permuting the class
labels. In these cases, a P value can be estimated by
permuting the genes, with the result that genes are
randomly assigned to the sets while maintaining their
size. This approach is not strictly accurate: because it
ignores gene-gene correlations, it will overestimate the
significance levels and may lead to false positives.”
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
50. Recap on permutations…
• Permutation-based calculations are implemented in order to
assess the significance of a particular gene set.
• Based on samples (need large sample size)
• 1) Permute samples and re-compute ranked list of all genes
• Variability in ranking is dependent on variability of samples
• 2) Re-calculate ES score for gene set based on new rank
• Based on genes (may not preserve correlation structure of genes)
• 1) Permute genes to get a random set of genes
• 2) Re-calculate ES score for gene set with ‘random ranks’
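The gene-based variant in the recap can be sketched as follows: the ranked list stays fixed and only the set membership is randomized, keeping the set size. The data are simulated, and the names (`enrichment.score`, `null.es`) and the +1 correction in the p-value are assumptions, not the behavior of any particular GSEA implementation.

```r
set.seed(7)

# Signed maximum deviation of the weighted running sum (alpha = 1)
enrichment.score <- function(metric.sorted, in.set) {
  w <- abs(metric.sorted)
  step <- ifelse(in.set, w / sum(w[in.set]), -1 / sum(!in.set))
  res <- cumsum(step)
  res[which.max(abs(res))]
}

# Simulated ranked list (hypothetical): 500 genes sorted by a made-up metric
n.genes <- 500
metric <- sort(rnorm(n.genes), decreasing = TRUE)

# Gene set of size 12, concentrated near the top of the ranking
in.set <- seq_len(n.genes) %in% c(1:10, 30, 60)
obs <- enrichment.score(metric, in.set)

# 1) Draw random gene sets of the same size; 2) re-calculate ES for each
k <- sum(in.set)
n.perm <- 1000
null.es <- replicate(n.perm, {
  random.set <- seq_len(n.genes) %in% sample(n.genes, k)
  enrichment.score(metric, random.set)
})

p.value <- (sum(null.es >= obs) + 1) / (n.perm + 1)
print(c(ES = obs, p = p.value))
```

No ranking metric is recomputed here, which is why this variant is fast, and the random sets ignore any co-expression among the real set's genes, which is the weakness noted on the earlier slides.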
51. Effect of permutation type on ES distribution for SPEM example
Sample-Based Permutation: ES: 0.7697; P-val: 0.2262
Gene-Based Permutation: ES: 0.7697; P-val: “0”
52. What if I reverse the direction of my hypothesis,
i.e., my genes are downregulated?
Upregulated in class ulcerated: Enrichment Score (ES) 0.76985; Normalized ES (NES) 1.5405532; Nominal p-value 0.16831683
Upregulated in class uninjured: Enrichment Score (ES) -0.76985; Normalized ES (NES) -1.5026467; Nominal p-value 0.17635658
54. Revisiting SPEM results…
• The published ES p-values were generated via
permutation of gene-label
• Follows guidelines based on sample-size
• As a bioinformatician, one should work with, rather than work around, sample
size limitations, and be clear when writing methods.
• GSEA could not be conducted for different part of experiment
• Each ‘group’ consisted of one sample
• Fair to say that results of GSEA and differential
expression analysis support hypothesis of SPEM-
mediated processes
• Both suggest significant up-regulation of SPEM-related genes in
the treatment (ulcerated) group
55. GSEA Take-aways
• Quantitative measurements and visual output
• Data may already be out there, just needs to be analyzed!
• Variety of R-packages implement GSEA; also a GUI software
application developed by BROAD institute
• Databases with gene expression data and gene lists are becoming
increasingly common and even user friendly… such as ilincs.org
• Interpretation and follow-up specific to experiment.
• Pay attention to sample size
• Comprehensive statistical analysis should be included if results are
intended to be published.
• More room for interpretation in exploratory setting.
57. References
• Planet E (2013). phenoTest: Tools to test association between
gene expression and phenotype in a way that is efficient,
structured, fast and scalable. We also provide tools to do
GSEA (Gene set enrichment analysis) and copy number
variation. R package version 1.16.0.
• Mootha VK et al. PGC-1-responsive genes involved in
oxidative phosphorylation are coordinately downregulated in
human diabetes. Nature Genetics 34, 267-273 (2003)
• Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert
BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander
ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based
approach for interpreting genome-wide expression profiles.
PNAS 2005 102 (43) 15545-15550; published ahead of print
September 30, 2005.
59. Normalized Enrichment Score (NES)
• Takes into account the size of each gene set list and
adjusts the original ES
• Important when many lists are taken into consideration
• Used to calculate false discovery rate (FDR)
• “The FDR is the estimated probability that a set with a given NES
represents a false positive finding; it is computed by comparing the
tails of the observed and null distributions for the NES”
• The older method used the family-wise error rate (FWER). BUT:
• “Because our primary goal is to generate hypotheses, we chose to use
the FDR to focus on controlling the probability that each reported result
is a false positive.”
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
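The normalization can be sketched as dividing the observed ES by the mean of the permuted ES values that share its sign, which is how Subramanian et al. describe it. The function name `normalize.es` and the numbers below are made up for illustration.

```r
# NES = ES / mean(|permuted ES with the same sign as ES|)
normalize.es <- function(es, null.es) {
  same.sign <- null.es[sign(null.es) == sign(es)]
  es / mean(abs(same.sign))
}

# Made-up permuted scores for one gene set
null.es <- c(0.4, 0.5, -0.3, 0.6, -0.5)

nes.up   <- normalize.es( 0.76, null.es)  #  0.76 / 0.5 =  1.52
nes.down <- normalize.es(-0.76, null.es)  # -0.76 / 0.4 = -1.90
print(c(nes.up, nes.down))
```

Positive and negative scores are scaled separately, so two ES values of the same magnitude but opposite sign need not give NES values of the same magnitude.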
60. A note on ranking…
• Pre-ranked gene-lists may be used
• Should incorporate magnitude and direction for purposes of
interpretation
[Figure: enrichment plots for the default ranking and for a list pre-ranked by edgeR p-value]
61. Example 1: SNP data
Title: “Using Environmental Correlations to Identify Loci Underlying Local Adaptation”
• From article:
• [This study will demonstrate]
“covariance in allele
frequencies between
populations from a set of
markers”
• “These matrices reveal the
close genetic relationship of
populations from the same
geographic region”
• Personal observation:
• “broad geographic area” not
defined
• Color key would be nice
“The matrices are displayed as heat maps with lighter colors corresponding to
higher values. The rows and columns of these matrices have been arranged by
broad geographic label.”
Graham Coop, David Witonsky, Anna Di Rienzo, Jonathan K. Pritchard. Genetics. August 1, 2010 vol. 185 no. 4, 1411-1423;
DOI: 10.1534/genetics.110.114819
62. Example 2: Gene Expression and SNP data
Title: “A recurrent inactivating mutation in RHOA GTPase in angioimmunoblastic T cell lymphoma”
• From article:
• “To explore the functional and mechanistic
implications of the somatic mutations, we
performed pathway analysis by integrating
mutation and gene expression data from AITL
cases”
• “After controlling for platform differences, the
tumor and normal samples separated into
distinct clusters”
• Notes:
• Color key would still be nice
“Heat map from hierarchical clustering of differentially expressed genes. […]
Gene clusters from hierarchical classification were subjected to the DAVID web
server for Gene Ontology analysis, and the most enriched term for each cluster
was determined using the q value from the FDR test”.
Yoo HY et al. Nature Genetics 46, 371–375 (2014) DOI:10.1038/ng.2916
63. 3 Methods for Discussion
• 1: R implementation of GSEA (BROAD)
• Is supplied as a function, not a package
• 2: Java desktop application (BROAD)
• Graphical user interface (point and click)
• 3: R package “phenoTest” (Planet, E (2013))
• An R package for carrying out GSEA
64. GSEA – R function
• Many analytical tools meant for R implementation are
provided as R packages…but not this one!
• Download file from BROAD institute website, unzip, open text file
“GSEA.1.0.R” in R and run code to get functions.
• Problem #1: code is not maintained, some operating systems will not
tolerate particular syntax and code must be edited
• Solution = Jacek
• Problem #2: the data ‘needs’ to be in a very specific format:
• .res or .gct for expression data
• .cls for phenotype/class data
• .gmt or .gmx for gene lists
65. GSEA – R: getGSEAready functions
• Solution 2a: Expression Data
• Just need a correctly formatted data.frame object to pass to the function.
# New function: converts 'all.gene.result' to the necessary format
getGSEAready.expression <- function(expression.data, list.format) {
  dataset <- read.delim(expression.data, stringsAsFactors = FALSE)
  # Use gene IDs or gene symbols as row names
  if (list.format == "geneid") {
    row.names(dataset) <- dataset$geneid
  } else if (list.format == "symbol") {
    # Rename missing symbols to 'unnamed1', 'unnamed2', ... and move those rows to the end
    is.missing <- is.na(dataset$symbol)
    nas <- dataset[is.missing, ]
    nas$symbol <- sprintf("unnamed%d", seq_len(nrow(nas)))  # also safe when no symbol is missing
    dataset <- rbind(dataset[!is.missing, ], nas)
    row.names(dataset) <- dataset$symbol
  }
  dataset <- dataset[, -(1:10)]  # Drop the first 10 columns so that only expression data remain
  return(dataset)
}
67. GSEA – R: getGSEAready functions
• Solution 2b: Phenotype data
• Similar story; R needs a list object with 2 vectors –
• phen: a character vector with the class labels
• class.v: a numeric [1,2] vector to indicate class for each sample
# New function: uses 'Sample.Info' to generate phenotype data
# (simple design: treatment vs. control)
getGSEAready.phenotype <- function(phenotype.data) {
  pheno <- read.delim(phenotype.data, stringsAsFactors = FALSE)
  phen <- as.character(unique(pheno$Group))
  # Label your 'group of interest' (the first group listed) as 1, the other as 2
  class.v <- ifelse(pheno$Group == phen[1], 1, 2)
  classdata <- list(phen = phen, class.v = class.v)
  return(classdata)
}
69. GSEA – R function
• Solution 2c: Gene Lists
• Set up in Excel with specified format and save as .gmt file
(use quotation marks around filename when saving)
• na’s may be replaced with description of gene list
• Note: There are ‘pre-packaged’ gene lists available from
MSigDB but that is for another discussion
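As an alternative to setting the file up in Excel, a .gmt file can also be written from R. The sketch below relies on the usual .gmt layout, one gene set per line with tab-separated fields (set name, description, then the genes). The gene symbols, the output file name, and the use of "na" as the description are placeholders.

```r
# Hypothetical gene lists (placeholder gene symbols)
gene.lists <- list(
  SPEM = c("GeneA", "GeneB", "GeneC"),
  SPEM_with_Inflammation = c("GeneB", "GeneD")
)

# One tab-separated line per set: name, description ("na" here), genes
gmt.lines <- vapply(names(gene.lists), function(set.name) {
  paste(c(set.name, "na", gene.lists[[set.name]]), collapse = "\t")
}, character(1))

# Written to a temporary directory for this example
gmt.file <- file.path(tempdir(), "spem_lists.gmt")
writeLines(gmt.lines, gmt.file)

cat(readLines(gmt.file), sep = "\n")
```

Writing the file programmatically avoids spreadsheet side effects such as gene symbols being converted to dates.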
73. GSEA – R function
4: Check output
Note: I asked for data to be output to
“/home/vandersm/Documents/gseaRexample/zavros”
but it gets output to the parent directory (with the intended directory name
prepended to the file prefix)
74. GSEA – R function
Example output for one gene list (1)
75. GSEA – R function
Example output for one gene list (2)
76. GSEA – R function
Example of global plots (all lists considered)
77. GSEA – R function
Example of report for one gene list
Is this gene in the
leading edge subset?
78. GSEA – R function
Need to check the parameters you used?
79. GSEA – R function
Pros:
• Once the data is formatted correctly, the analysis is rather straight-
forward.
• Provides graphical output for individual lists as well as global reports
Cons:
• Lacks options that are easy to change in GUI version
• No option for presorted list
• No options for how to sort list
• Difficult to obtain sorted list
• Creates a matrix of gene ranks corresponding to permuted sample
labels regardless of whether or not sample permutation is chosen
• P-values have potential to be zero [at least if permuting by gene]
81. GSEA with JAVA GUI
• Similar to the R script/functions, much of the hassle
concerns correctly formatted data. If anyone is
interested in those formats, let me know, for now let’s
not dwell on them
88. GSEA with JAVA GUI
• Pros:
• Once data is loaded, the ‘point-and-click’ environment is generally
convenient and flexible
• Graphs are somewhat ‘prettier’ than from the R-code
• Results can be used from an easily-navigable page
• Very easy to obtain ranked list of genes
• Cons
• Not necessarily an option for incorporation into our pipeline
• Cannot test functionality in interactive mode
• Crashes if #permutations increases 10-fold
• May yield p-values of 0.
89. GSEA with phenoTest
• phenoTest – R package by Planet E (2013)
• Implements GSEA in a manner that is rather flexible
(once formatting issues are taken care of!)
• Input dataset is in form of an eset
• Use Jacek’s code/my function for creating an eset from the
raw.data and SampleInfo file obtained during analysis
• Should # of genes match that of ‘processed’ data?
• EX) In my experiment, my ‘all_genes_result’ has data for
16,400 genes but my eset has info for 20,640
• Has modest effect on results
• To compare across implementations, I restricted my eset to
only include genes that were in my ‘all_genes_result’
90. GSEA with phenoTest
• Calculation of ‘observed’ ES score is the same
but ONLY permutes genes
• Calls permuted ES scores ‘Simulated Scores’; these can
easily be accessed after the analysis is run
• Automatically creates an NES plot – could be better
alternative for publication if many lists are taken into
consideration.
• Has option to create plot using Wilcoxon test as
discussed by Virtaneva (2001)
92. GSEA with phenoTest
• Creation of epheno object makes ‘playing around’
with the data much easier
• See how GSEA will behave when particular patterns among
the ranked genes are artificially created
• Obviously, once you re-create the gene-ranks as they are in the
GSEA BROAD implementation, this can be done with the
aforementioned options; but this saves some computational steps.
94. GSEA with phenoTest
• Pros:
• Has a few different options as far as plots are concerned
• Function is relatively easy to run [compared to the BROAD script]
and the epheno object is useful
• Run-time is fast (doesn’t ever re-compute permuted S2N)
• Cons
• No option to permute by sample
• Some of the functions that are described in the reference paper do
not work (not a big deal, just annoying)
• Have not figured out how to implement different ranking
mechanism.
96. Is there a better, perhaps more
‘statistical’ approach?
Editor's notes
Chances are that you have seen a plot like the one I have on the background of this slide during a Wednesday or Thursday seminar. After this presentation, my hope is that you will have a slightly better idea of what those plots mean and when they are used.
Mention that there is not a specific SPEM gene
Zero-crossing: where the correlation (signal to noise) crosses zero
- correlations with genes are now negative