GSEA of SPEM Genes

O.M.GSEA
An introduction to ‘classic’
Gene Set Enrichment Analysis methodology
Shana White
2016

Overview
• Preliminaries
• Genes & gene sets
• Gene expression and enrichment
• GSEA
• Introduction & Example from Publication
• Experimental conditions
• Purpose of GSEA
• Background on methodology
• Gene ranking
• Enrichment Scores & Plots
• Assessing significance

Fundamentals of Genes, Genomes
• Genome = Collection of DNA sequences [across all
chromosomes of a species].
• Genes = Subset of genome that ‘codes’ for a protein
Image: http://cs.stanford.edu/people/eroberts/cs181/projects/2010-11/Genomics/accuracy.html

Genes & Genotype
• Individuals of the same species share the
same genome; have the same set of genes
• SNPs (single nucleotide polymorphisms) allow for
genetic variation
• Genotype = the DNA (nucleotide) sequence, i.e. the particular
alleles, that an individual carries for a gene
• “Central Dogma” of genetics:
Image 1: http://cs.stanford.edu/people/eroberts/cs181/projects/2010-11/Genomics/accuracy.html
Image 2: http://www.lhsc.on.ca/Patients_Families_Visitors/Genetics/Inherited_Metabolic/Mitochondria/DiseasesattheMolecularLevel.htm

Genotype vs. Gene Expression
• Genotype is the same in all somatic cells of an organism
• Gene expression [rate at which genes are transcribed] is
different to varying degrees
• Random fluctuations within same types of cells are expected
• Significant differences → significantly different cells

Gene Expression Data
• RNA-seq (“Next-Generation sequencing”) simultaneously
generates data for both genotype and gene expression
• Experimental set-up is typically cases vs. controls
• Cases: phenotype is induced experimentally [pure experiment]
• Controls: represent ‘baseline’ gene expression for comparison
• Quantify transcriptome of each sample
• Compare ‘fold change’ for expression of cases compared to controls
• FC = (expression in cases)/(expression in controls)
• Determine significance of individual gene up/down regulation
• Results of differential expression analysis or additional analyses often
visualized with a “heatmap”…
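
As a minimal sketch of the fold-change step above: the matrix `expr`, the gene names, and the values are made up for illustration, and the per-gene Welch t-test is only a stand-in for a dedicated differential expression method.

```r
# Toy normalized expression: 3 hypothetical genes x 8 samples (4 cases, 4 controls)
expr <- rbind(
  geneA = c(80, 95, 90, 85, 20, 25, 22, 21),
  geneB = c(10, 12, 11,  9, 11, 10, 12, 10),
  geneC = c( 5,  6,  4,  5, 40, 38, 45, 41)
)
group <- rep(c("case", "control"), each = 4)

mean_case    <- rowMeans(expr[, group == "case"])
mean_control <- rowMeans(expr[, group == "control"])

# FC = (expression in cases)/(expression in controls)
fc     <- mean_case / mean_control
# log2 scale is symmetric around 0: up-regulated > 0, down-regulated < 0
log2fc <- log2(fc)

# Per-gene significance; t.test is a placeholder, not the method used in the talk
p_value <- apply(expr, 1, function(x) {
  t.test(x[group == "case"], x[group == "control"])$p.value
})

data.frame(fc = round(fc, 2), log2fc = round(log2fc, 2), p_value = signif(p_value, 2))
```

Here geneA comes out up-regulated in cases, geneC down-regulated, and geneB essentially unchanged.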

Quick teaching note on heatmaps…
• Very common visualization tool in genetic research
• Incorporate genotype and/or gene expression data
• Support and/or generate hypotheses
• Organization of rows/columns determined by experimental design
• Range from relatively simple to complex [in terms of interpretation]
• Heatmap for purposes of this discussion:
• Columns → Samples (grouped as Cases, Controls)
• Rows → Genes
• Color → Direction/Intensity of expression
• Red: Higher than row average
• Blue [or green]: Lower than row average
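
A rough sketch of how such a heatmap can be drawn in base R, assuming simulated data: the matrix `expr`, the sample names, and the colour palette are illustrative choices, not the settings behind the slides' figures.

```r
set.seed(1)

# Simulated expression: 6 hypothetical genes (rows) x 8 samples (columns)
expr <- matrix(rnorm(6 * 8, mean = 10), nrow = 6,
               dimnames = list(paste0("Gene", 1:6),
                               c(paste0("Case", 1:4), paste0("Control", 1:4))))
expr[1:3, 1:4] <- expr[1:3, 1:4] + 3   # first three genes higher in cases

# Colour encodes the value relative to the row (gene) average
centered <- sweep(expr, 1, rowMeans(expr))

# Blue = below row average, red = above row average
blue_to_red <- colorRampPalette(c("blue", "white", "red"))(25)

# Keep the design-driven row/column order (no clustering, no extra scaling)
heatmap(centered, Rowv = NA, Colv = NA, scale = "none", col = blue_to_red)
```

Turning off `Rowv`/`Colv` keeps the columns in the cases-then-controls order set by the experimental design.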

“Enrichment” & GSEA
• Results of individual genes
• Dictionary(.com) definition of enrichment:
• “act of making fuller or more meaningful or rewarding”
• Gene set enrichment
• Gene sets are predefined in the literature and/or in databases:
• Grouped by information regarding gene function, pathway membership, etc.
• Gene sets are ‘enriched’ if experimental findings are in accordance with
the set of interest [with hope of adding meaning to results]
• Definition not always obvious
• Good resource: “An introduction to effective use of enrichment analysis software”
• Gene Set Enrichment Analysis
• Statistical methods determine significance of enrichment for gene set by
comparing distribution of genes in set to ‘background distribution’

First Use in Publication (2003)

Paper describing methodology (2005)
Citations as of:
November 2015 - 8,349
November 2016 - 10,057

Quick review…
• GSEA is a common ‘secondary analysis’ after gene
expression data has been collected
• Gene sets can be determined a-priori specific to an experiment (as
in example that follows) or
• Multiple gene-sets from databases can be used in a data-mining
fashion to support or generate hypotheses
• Implications of multiple testing (beyond scope of presentation)
• Good to know the basics
• GSEA still a common request of bioinformaticians
• “Newer/better” methods build on or refer to GSEA
• Goal for remainder of presentation:
• Use example from recent publication to elucidate basic concepts
and terminology
• Go into further detail for statistical methodology related to GSEA

SPEM Article (Published Sep. 2016)

GSEA for SPEM
• Experimental setup:
• 2 groups of mice, balanced design (n_i = 4, i = 1, 2)
• Group 1 = Ulcerated (x 4)
• Group 2 = Uninjured (x 4)
• Mice are sacrificed, samples have been collected/processed, and
RNAseq data is available.
• Hypothesis: The stomachs of the mice in Group 1 (the treatment
group) are undergoing SPEM-mediated repair
(SPEM = spasmolytic polypeptide-expressing metaplasia)
• Our lab is asked to conduct GSEA of SPEM genes to support hypothesis

SPEM Lists
• There is no curated SPEM pathway per se, but there are gene
sets corresponding to SPEM that have been published in the
literature.
• I have 2 gene-set lists as follows:
• SPEM [as generally observed] (list name = “SPEM”)
• SPEM in response to inflammation (list name = “SPEM_with_Inflammation”)
• Both of these lists contain genes that were previously found to
be up-regulated during SPEM
• GSEA will support the research hypothesis if upregulated expression of
SPEM-related genes is evident in the ulcerated samples compared to the
uninjured control samples.
• **Note: this presentation is intended to shed light on basic features of
GSEA and does not consider the effects of cross-talk between pathways.

Summary of basic analysis
• Methods are specific to 2-group experiment
• Inputs:
• 1 – Gene expression for ALL genes
• 2 – Phenotype information (define groups for comparison)
• 3 – List(s) of genes of interest
• Intermediary step:
• Rank genes based on differential expression
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)

A note on ranking…
• Default ranking mechanism is “signal to noise ratio” (s2n)
• Formula: $S2N_i = \dfrac{\mu_i^{\mathrm{Group1}} - \mu_i^{\mathrm{Group2}}}{\sigma_i^{\mathrm{Group1}} + \sigma_i^{\mathrm{Group2}}}$
• Reflects correlation (association) of gene with phenotype in
terms of size and direction
• Rank ≠ Significance of differential expression
• Likely that genes ranking very high or very low are
significantly differentially expressed
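
The signal-to-noise ranking can be sketched as follows. The function name `s2n`, the toy matrix, and the gene names are assumptions for illustration; real implementations may also guard against very small standard deviations, which is not shown here.

```r
# Signal-to-noise ratio per gene (row):
# difference in group means divided by the sum of group standard deviations
s2n <- function(expr, group, group1) {
  in1 <- group == group1
  mu1 <- rowMeans(expr[, in1, drop = FALSE])
  mu2 <- rowMeans(expr[, !in1, drop = FALSE])
  sd1 <- apply(expr[, in1, drop = FALSE], 1, sd)
  sd2 <- apply(expr[, !in1, drop = FALSE], 1, sd)
  (mu1 - mu2) / (sd1 + sd2)
}

# Toy data: 3 hypothetical genes, 4 ulcerated and 4 uninjured samples
expr <- rbind(
  geneA = c(9, 10, 11, 10, 4, 5, 6, 5),
  geneB = c(5,  6,  4,  5, 5, 4, 6, 5),
  geneC = c(2,  3,  2,  3, 8, 9, 8, 9)
)
group <- rep(c("ulcerated", "uninjured"), each = 4)

scores <- s2n(expr, group, group1 = "ulcerated")

# Ranked list R: highest positive S2N first, most negative last
ranked <- sort(scores, decreasing = TRUE)
ranked
```

geneA ranks first (higher in the ulcerated group), geneB sits in the middle with an S2N of zero, and geneC ranks last.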

Summary of basic analysis (cntd.)
• S = List of genes belonging to defined gene set (independent of data)
• R = Ranked list of genes (dependent on data & method of ranking genes)
• “Given an a priori defined set of genes S, …, the goal of GSEA is to
determine whether the members of S are randomly distributed throughout
R or primarily found at the top or bottom. We expect that sets related to
the phenotypic distinction will tend to show the latter distribution.”
• Null hypothesis: Membership in S ↛ Location in R (members of S are
randomly distributed throughout R)
• Alternative: Membership in S → Location in R (high or low)

What would ideal rank for SPEM genes ‘look’ like?
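[Figure: gene-set positions for Example List 1, Example List 2 and Example
List 3 along the gene rank axis. The axis runs from high expression in cases
(high ‘correlation’ with ulcerated group) to low expression in cases (high
‘correlation’ with uninjured group); one pattern is labelled “Uninformative”.]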

Summary of basic analysis
• Outputs:
• 1 – Enrichment scores and p-values (summary information)
• 2 – Enrichment plots (graphical summary)

Enrichment Score (ES)
(A) An expression data set sorted
by correlation with phenotype,
the corresponding heat map,
and the “gene tags,” i.e.,
location of genes from a set S
within the sorted list.
(B) Plot of the running sum for S in
the data set, including the
location of the maximum
enrichment score (ES) and the
leading-edge subset.

Calculation of Running Enrichment Score (RES)
• Ranked genes: $R_1, \dots, R_N$
• Ranking metric for each gene: $m_1, \dots, m_N$
• Gene set S with k elements: $s_1, \dots, s_k$
• $\mathrm{Tag}_i = 1$ if $R_i \in S$, 0 else; $\mathrm{No.Tag}_i = 1$ if $R_i \notin S$, 0 else
• $M = \sum_{i=1}^{N} m_i \,\mathrm{Tag}_i$ (sum of ranking metric for genes in set)
• $T = \sum_{i=1}^{N} \mathrm{No.Tag}_i$ (N – k; # of genes in R but not S)

GENE_ID S2N TAG NO.TAG
Gene1 5.4 0 1
Gene2 5.1 1 0
Gene3 4.7 0 1
Gene4 4.3 1 0
Gene5 3.6 1 0
Gene6 3.3 0 1
Gene7 2.9 1 0
Gene8 2.2 0 1
Gene9 1.6 0 1
Gene10 0.9 0 1
Gene11 -0.1 0 1
Gene12 -0.8 0 1
Gene13 -1.2 0 1
Gene14 -1.9 0 1
Gene15 -2.6 0 1
Gene16 -2.8 0 1
Gene17 -3.3 0 1
Gene18 -3.7 0 1
Gene19 -4.2 0 1
Gene20 -4.5 0 1
M = 15.9
T = 16

• Start at $R_1$. If $\mathrm{Tag}_1 = 1$, then $\mathrm{RES}_1 = m_1 \cdot \frac{1}{M}$,
else $\mathrm{RES}_1 = -\frac{1}{T}$
• Move to $R_2$. If $\mathrm{Tag}_2 = 1$, then $\mathrm{RES}_2 = \mathrm{RES}_1 + m_2 \cdot \frac{1}{M}$,
else $\mathrm{RES}_2 = \mathrm{RES}_1 - \frac{1}{T}$
• For a given $R_j$,
$\mathrm{RES}_j = \sum_{i=1}^{j} \left( m_i \cdot \frac{1}{M} \right) \mathrm{Tag}_i - \sum_{i=1}^{j} \left( \frac{1}{T} \right) \mathrm{No.Tag}_i$
* At the final $R_N$, we have $\mathrm{RES}_N = \frac{M}{M} - \frac{T}{T} = 0$
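
The step-by-step rule can be traced with a short loop over the toy ranked list. The S2N values and set membership are taken from the table; the variable names (`gene_set`, `T_miss`, `res`) are illustrative.

```r
# Toy ranked list: S2N values for Gene1..Gene20, already sorted
s2n <- c(5.4, 5.1, 4.7, 4.3, 3.6, 3.3, 2.9, 2.2, 1.6, 0.9,
         -0.1, -0.8, -1.2, -1.9, -2.6, -2.8, -3.3, -3.7, -4.2, -4.5)
names(s2n) <- paste0("Gene", 1:20)
gene_set <- c("Gene2", "Gene4", "Gene5", "Gene7")   # the set S (k = 4)

tag    <- as.integer(names(s2n) %in% gene_set)      # 1 if gene is in S
no_tag <- 1L - tag                                  # 1 if gene is not in S

M      <- sum(s2n * tag)    # sum of ranking metric for genes in the set
T_miss <- sum(no_tag)       # number of genes in R but not in S (N - k)

# Walk down the ranked list: step up by m_j / M on a hit, down by 1 / T on a miss
res <- numeric(length(s2n))
running <- 0
for (j in seq_along(s2n)) {
  if (tag[j] == 1) {
    running <- running + unname(s2n[j]) / M
  } else {
    running <- running - 1 / T_miss
  }
  res[j] <- running
}

round(res, 4)
```

The running sum starts at -0.0625, peaks at Gene7, and returns to zero at Gene20.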

Determining ES
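• $\mathrm{RES}_j = \sum_{i=1}^{j} \left( m_i \cdot \frac{1}{M} \right) \mathrm{Tag}_i - \sum_{i=1}^{j} \left( \frac{1}{T} \right) \mathrm{No.Tag}_i$
• After going through all N ranked genes, you are left with a
vector of RES’s. Then, for a given gene set, ES is the RES that
deviates most from zero, i.e. the value attaining $\max_j |\mathrm{RES}_j|$, with its sign kept
• Options for the ‘hit’ sum (each divided by its total over all N genes):
• $\sum_{i=1}^{j} \mathrm{Tag}_i$ → Unweighted
• $\sum_{i=1}^{j} m_i \,\mathrm{Tag}_i$ → Weighted; α = 1
• $\sum_{i=1}^{j} |m_i|^{\alpha} \,\mathrm{Tag}_i$ → Weighted; α = α* (the exponent p in Subramanian et al.)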
“The enrichment score is the maximum deviation from zero
encountered in the random walk; it corresponds to a weighted
Kolmogorov–Smirnov-like statistic”

GENE_ID S2N TAG NO.TAG α S2N*TAG*α/M No.Tag/T RES
Gene1 5.4 0 1 1 0 0.0625 -0.0625
Gene2 5.1 1 0 1 0.320754717 0 0.258255
Gene3 4.7 0 1 1 0 0.0625 0.195755
Gene4 4.3 1 0 1 0.270440252 0 0.466195
Gene5 3.6 1 0 1 0.226415094 0 0.69261
Gene6 3.3 0 1 1 0 0.0625 0.63011
Gene7 2.9 1 0 1 0.182389937 0 0.8125
Gene8 2.2 0 1 1 0 0.0625 0.75
Gene9 1.6 0 1 1 0 0.0625 0.6875
Gene10 0.9 0 1 1 0 0.0625 0.625
Gene11 -0.1 0 1 1 0 0.0625 0.5625
Gene12 -0.8 0 1 1 0 0.0625 0.5
Gene13 -1.2 0 1 1 0 0.0625 0.4375
Gene14 -1.9 0 1 1 0 0.0625 0.375
Gene15 -2.6 0 1 1 0 0.0625 0.3125
Gene16 -2.8 0 1 1 0 0.0625 0.25
Gene17 -3.3 0 1 1 0 0.0625 0.1875
Gene18 -3.7 0 1 1 0 0.0625 0.125
Gene19 -4.2 0 1 1 0 0.0625 0.0625
Gene20 -4.5 0 1 1 0 0.0625 0
M = 15.9
T = 16
[Plot: RES (y-axis, -0.2 to 0.9) against gene rank (x-axis, 1 to 20)]
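
The worked table can be reproduced with a small function. This is a sketch under one assumption: the weighting is applied as an exponent on the absolute ranking metric, as in Subramanian et al., so `alpha = 0` gives the unweighted version and `alpha = 1` the weighted one used in the table. The function name `enrichment_score` is hypothetical.

```r
# Running enrichment score and ES for a ranked metric and a set indicator
enrichment_score <- function(metric, in_set, alpha = 1) {
  hit  <- ifelse(in_set, abs(metric)^alpha, 0)   # weighted step for genes in S
  miss <- ifelse(in_set, 0, 1)                   # unit step for genes not in S
  res  <- cumsum(hit / sum(hit)) - cumsum(miss / sum(miss))
  # ES = running sum at its maximum deviation from zero, sign retained
  list(res = res, es = res[which.max(abs(res))])
}

s2n <- c(5.4, 5.1, 4.7, 4.3, 3.6, 3.3, 2.9, 2.2, 1.6, 0.9,
         -0.1, -0.8, -1.2, -1.9, -2.6, -2.8, -3.3, -3.7, -4.2, -4.5)
in_set <- seq_along(s2n) %in% c(2, 4, 5, 7)

weighted   <- enrichment_score(s2n, in_set, alpha = 1)
unweighted <- enrichment_score(s2n, in_set, alpha = 0)
weighted$es

# Reversing the ranking puts the set at the bottom, so the ES turns negative
reversed <- enrichment_score(rev(s2n), rev(in_set), alpha = 1)
reversed$es
```

For this toy list the weighted and unweighted ES coincide, because all four set members appear before the peak; the two differ in the shape of the running sum on the way up.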

Published Results of GSEA for SPEM…

• “The positive ES values and low P
values suggest that these genes, as a
set, are up-regulated significantly in
the ulcerated samples.”
• Both lists show ‘enrichment’
• General comparison:
• SPEM_WITH_INFLAMMATION (SWI) gene
set is ‘more enriched’ than SPEM gene set
• Higher enrichment score
• Lower p-value
• Better defined ‘leading edge’ on plot
• Corresponds with accompanying differential
expression analysis…
Upregulated in class ulcerated
Enrichment Score (ES) 0.554442
p-value (random genes) < 0.05

• For the SWI gene set:
• P-values for individual
genes tend to be lower
• Fold changes tend to be
higher
• Colors more intense on
corresponding heatmap

Where do the p-values come from?
• Permutation-based calculations are implemented in order
to assess the significance of a particular gene set.
• Based on samples
• Based on genes

Sample-based Permutation
• Randomly ‘shuffle’ samples, then recalculate S2N
[genes shown re-ranked, from high negative to high positive signal to noise]
• Original samples: calculation of ES based on ‘observed’ S2N of
predefined genes
• Permuted sample 1: calculation of ES based on ‘permuted’ S2N of
predefined genes
[Figure: heatmap panels for Permutation 1, Permutation 2, Permutation 3, Permutation 4]

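Nominal P-value
[Figure: “Histogram of EScores” (x-axis: Enrichment Score, -1.0 to 1.0;
y-axis: Density, 0.0 to 0.8), marking the 0.975 quantile for permuted scores
and the original ES score]
# Share of permuted scores above the observed ES (denominator as on the slide)
y <- permuted.ES.scores
obs <- 0.76
sum(y > obs) / (1000 + 1)
## [1] 0.01598402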

Problem with Sample-based
• Limited number of available orderings.
• Different orderings can result in identical grouping
• For example, if we have:
Order for permutation #1: T1 T2 C3 C4 C1 C2 T3 T4
Order for permutation #2: T1 C3 T2 C4 C1 T3 C2 T4
• Then ESp1 = ESp2
• Reduced ability to estimate true variability in sampling
distribution [of ES]
• If either group has n < 7, it is advised that gene
permutation is carried out instead
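
A quick count shows how limited sample permutation is for the 4 vs. 4 design. The sketch assumes that the first four positions of an ordering form Group 1 and that each sample appears exactly once per ordering.

```r
n_total  <- 8   # samples in the experiment
n_group1 <- 4   # samples assigned to Group 1

n_orderings <- factorial(n_total)           # 40320 possible orderings
n_groupings <- choose(n_total, n_group1)    # 70 distinct group assignments

# Two orderings that assign the same samples to Group 1 (first four positions)
perm1 <- c("T1", "T2", "C3", "C4", "C1", "C2", "T3", "T4")
perm2 <- c("T1", "C3", "T2", "C4", "C1", "T3", "C2", "T4")
same_grouping <- setequal(perm1[1:4], perm2[1:4])

c(orderings = n_orderings, groupings = n_groupings)
same_grouping
```

With only 70 distinct groupings, the null distribution of ES has at most 70 distinct values, which is why larger groups are advised for sample-based permutation.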

Gene-based Permutation
• Original: ES based on S2N of original genes
• Permuted gene labels: ES based on S2N of RANDOM gene set of same size
[Figure: both panels show genes ranked from high negative to high positive
signal to noise]

Problem with Gene-based
• Does not take into account correlation structure of genes.
• Gene coexpression:
• Groups of genes with underlying similarity (for example, genes
associated with common transcription factors or biological
processes) should move up/down in rank together
• Permuting labels at random does not represent outcomes that
biologically make sense
Example coexpression network from: http://bioinfow.dep.usal.es/coexpression/

According to the authors…
• “Genes may be ranked based on the differences seen in a
small data set, with too few samples to allow rigorous
evaluation of significance levels by permuting the class
labels. In these cases, a P value can be estimated by
permuting the genes, with the result that genes are
randomly assigned to the sets while maintaining their
size. This approach is not strictly accurate: because it
ignores gene-gene correlations, it will overestimate the
significance levels and may lead to false positives.”

Recap on permutations…
• Permutation-based calculations are implemented in order to
assess the significance of a particular gene set.
• Based on samples (need large sample size)
• 1) Permute samples and re-compute ranked list of all genes
• Variability in ranking is dependent on variability of samples
• 2) Re-calculate ES score for gene set based on new rank
• Based on genes (may not preserve correlation structure of genes)
• 1) Permute genes to get a random set of genes
• 2) Re-calculate ES score for gene set with ‘random ranks’
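
Both permutation schemes can be sketched on simulated data. Everything here is an assumption for illustration: the data are random, the helper names `s2n` and `es` are hypothetical, 200 permutations are used to keep the run short, and the p-value follows the slide's `sum(y > obs) / (n + 1)` form.

```r
set.seed(42)

# Signal-to-noise ratio per gene for a logical case indicator
s2n <- function(expr, is_case) {
  mu_diff <- rowMeans(expr[, is_case]) - rowMeans(expr[, !is_case])
  sd_sum  <- apply(expr[, is_case], 1, sd) + apply(expr[, !is_case], 1, sd)
  mu_diff / sd_sum
}

# Weighted ES: rank genes by the metric, then take the signed maximum deviation
es <- function(metric, in_set) {
  ord  <- order(metric, decreasing = TRUE)
  hit  <- ifelse(in_set[ord], abs(metric[ord]), 0)
  miss <- ifelse(in_set[ord], 0, 1)
  res  <- cumsum(hit / sum(hit)) - cumsum(miss / sum(miss))
  res[which.max(abs(res))]
}

# Simulated data: 200 genes x 20 samples; the first 10 genes form the set
# and are shifted upwards in the cases
n_genes <- 200
n_per_group <- 10
expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
is_case <- rep(c(TRUE, FALSE), each = n_per_group)
in_set <- seq_len(n_genes) <= 10
expr[in_set, is_case] <- expr[in_set, is_case] + 1.5

observed_metric <- s2n(expr, is_case)
observed_es <- es(observed_metric, in_set)

n_perm <- 200

# Sample-based: shuffle sample labels, re-compute the ranking, re-compute ES
null_sample <- replicate(n_perm, es(s2n(expr, sample(is_case)), in_set))

# Gene-based: keep the observed ranking, draw a random gene set of the same size
null_gene <- replicate(n_perm, es(observed_metric, sample(in_set)))

p_sample <- sum(null_sample > observed_es) / (n_perm + 1)
p_gene   <- sum(null_gene > observed_es) / (n_perm + 1)

c(ES = observed_es, p_sample = p_sample, p_gene = p_gene)
```

The gene-based loop never recomputes S2N, which is why it runs faster; it also treats genes as independent, which is the limitation described above.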

Effect of permutation type on ES distribution for SPEM example
Permutation type ES P-val
Sample-Based 0.7697 0.2262
Gene-Based 0.7697 “0”

What if I reverse the direction of my hypothesis
(i.e., my genes are downregulated)?
Normalized ES (NES) 1.5405532
Nominal p-value 0.16831683
Upregulated in class uninjured
Enrichment Score (ES) -0.76985
Normalized ES (NES) -1.5026467
Nominal p-value 0.17635658

[Figure: four example gene-set patterns, titled ‘Typical’,
Up-and-Down-Regulated, Random, and Center-Clustered, each with an ES and
P-value; values listed in the order extracted]
ES P-value
0.80795 3.99E-11
0.565999 0.055373
-0.46202 0.012155
-0.25653 0.770885

Revisiting SPEM results…
• The published ES p-values were generated via
permutation of gene-label
• Follows guidelines based on sample-size
• As a bioinformatician, one should work with, rather than work around,
sample size limitations, and be clear when writing methods.
• GSEA could not be conducted for a different part of the experiment
• Each ‘group’ consisted of one sample
• Fair to say that results of GSEA and differential
expression analysis support hypothesis of SPEM-
mediated processes
• Both suggest significant up-regulation of SPEM-related genes in
the treatment (ulcerated) group

GSEA Take-aways
• Quantitative measurements and visual output
• Data may already be out there, just needs to be analyzed!
• A variety of R packages implement GSEA; there is also a GUI software
application developed by the Broad Institute
• Databases with gene expression data and gene lists are becoming
increasingly common and even user friendly… such as ilincs.org
• Interpretation and follow-up specific to experiment.
• Pay attention to sample size
• Comprehensive statistical analysis should be included if results are
intended to be published.
• More room for interpretation in exploratory setting.

Questions?
Image: https://www.buzzfeed.com/christianzamora/jeans-or-genes?utm_term=.mn3A55J0Pk#.ldO3LLm7gA

References
• Planet E (2013). phenoTest: Tools to test association between
gene expression and phenotype in a way that is efficient,
structured, fast and scalable. We also provide tools to do
GSEA (Gene set enrichment analysis) and copy number
variation. R package version 1.16.0.
• Mootha VK et al. PGC-1α-responsive genes involved in
oxidative phosphorylation are coordinately downregulated in
human diabetes. Nature Genetics 34, 267-273 (2003).
• Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert
BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander
ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based
approach for interpreting genome-wide expression profiles.
PNAS 2005 102 (43) 15545-15550; published ahead of print
September 30, 2005.

Additional Slides
• Different Implementations
• Explaining example heatmaps
• References

Normalized Enrichment Score (NES)
• Takes into account the size of each gene set list and
adjusts the original ES
• Important when many lists are taken into consideration
• Used to calculate false discovery rate (FDR)
• “The FDR is the estimated probability that a set with a given NES
represents a false positive finding; it is computed by comparing the
tails of the observed and null distributions for the NES”
• The older method used the FWER. BUT:
• “Because our primary goal is to generate hypotheses, we chose to use
the FDR to focus on controlling the probability that each reported result
is a false positive.”

• Pre-ranked gene-lists may be used
• Should incorporate magnitude and direction for purposes of
interpretation
[Figure panels: Default ranking; Pre-ranked by edgeR p-value]

Example 1: SNP data
Title: “Using Environmental Correlations to Identify Loci Underlying Local Adaptation”
• From article:
• [This study will demonstrate]
“covariance in allele
frequencies between
populations from a set of
markers”
• “These matrices reveal the
close genetic relationship of
populations from the same
geographic region”
• Personal observation:
• “broad geographic area” not
defined
• Color key would be nice
“The matrices are displayed as heat maps with lighter colors corresponding to
higher values. The rows and columns of these matrices have been arranged by
broad geographic label.”
Graham Coop, David Witonsky, Anna Di Rienzo, Jonathan K. Pritchard. Genetics. August 1, 2010 vol. 185 no. 4, 1411-1423;
DOI: 10.1534/genetics.110.114819

Example 2: Gene Expression and SNP data
Title: “A recurrent inactivating mutation in RHOA GTPase in angioimmunoblastic T cell lymphoma”
• From article:
• “To explore the functional and mechanistic
implications of the somatic mutations, we
performed pathway analysis by integrating
mutation and gene expression data from AITL
cases”
• “After controlling for platform differences, the
tumor and normal samples separated into
distinct clusters”
• Notes:
• Color key would still be nice
“Heat map from hierarchical clustering of differentially expressed genes. […]
Gene clusters from hierarchical classification were subjected to the DAVID web
server for Gene Ontology analysis, and the most enriched term for each cluster
was determined using the q value from the FDR test”.
Yoo HY et al. Nature Genetics 46, 371–375 (2014) DOI:10.1038/ng.2916

3 Methods for Discussion
• 1: R implementation of GSEA (BROAD)
• Is supplied as a function, not a package
• 2: Java desktop application (BROAD)
• Graphical user interface (point and click)
• 3: R package “phenoTest” (Planet, E (2013))
• An R package for carrying out GSEA

GSEA – R function
• Many analytical tools meant for R implementation are
provided as R packages…but not this one!
• Download file from BROAD institute website, unzip, open text file
“GSEA.1.0.R” in R and run code to get functions.
• Problem #1: code is not maintained, some operating systems will not
tolerate particular syntax and code must be edited
• Solution = Jacek
• Problem #2: the data ‘needs’ to be in a very specific format:
• .res or .gct for expression data
• .cls for phenotype/class data
• .gmt or .gmx for gene lists

GSEA – R: getGSEAready functions
• Solution 2a: Expression Data
• Just need a correctly formatted data.frame object to pass to the function.
# New function: converts 'all.gene.result' to the necessary format
getGSEAready.expression <- function(expression.data, list.format) {
  dataset <- read.delim(expression.data, stringsAsFactors = FALSE)
  # Use gene IDs or gene symbols as row names
  if (list.format == "geneid") {
    row.names(dataset) <- dataset$geneid
  } else if (list.format == "symbol") {
    # Change NA symbols to 'unnamed[#]' so that every row gets a name
    is.na.symbol <- is.na(dataset$symbol)
    nas <- dataset[is.na.symbol, ]
    nas$symbol <- sprintf("unnamed%d", seq_len(nrow(nas)))
    notna <- dataset[!is.na.symbol, ]
    dataset <- rbind(notna, nas)
    row.names(dataset) <- dataset$symbol
  }
  # Remove all columns except for expression data (drops the first 10 columns)
  dataset <- dataset[, -(1:10)]
  return(dataset)
}

GSEA – R: getGSEAready functions
• Solution 2b: Phenotype data
• Similar story; R needs a list object with 2 vectors –
• phen: a character vector with the class labels
• class.v: a numeric [1,2] vector to indicate class for each sample
# New function: uses 'Sample.Info' to generate phenotype data
# (simple design: treatment vs. control)
getGSEAready.phenotype <- function(phenotype.data) {
  pheno <- read.delim(phenotype.data, stringsAsFactors = FALSE)
  phen <- as.character(unique(pheno$Group))
  # Label your 'group of interest' (the first group listed) as 1, the other as 2
  class.v <- ifelse(pheno$Group == phen[1], 1, 2)
  classdata <- list(phen = phen, class.v = class.v)
  return(classdata)
}

New ‘classdata’ list object
SampleInfo="E:SampleInfo.txt"
classdata = getGSEAready.phenotype(SampleInfo)
classdata
## $phen
## [1] "ulcerated" "uninjured"
##
## $class.v
## [1] 1 1 1 1 2 2 2 2

GSEA – R function
• Solution 2c: Gene Lists
• Set up in Excel with specified format and save as .gmt file
(use quotation marks around filename when saving)
• na’s may be replaced with description of gene list
• Note: There are ‘pre-packaged’ gene lists available from
MSigDB but that is for another discussion
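
As an alternative to saving from Excel, a .gmt file can be written directly from R. This is a sketch only: the gene identifiers are placeholders rather than actual SPEM genes, the file is written to a temporary location, and it relies on the .gmt layout of one tab-separated line per set (set name, description, then the genes).

```r
# Placeholder gene sets; replace the identifiers with the real lists
gene_sets <- list(
  SPEM                   = c("GeneA", "GeneB", "GeneC"),
  SPEM_with_Inflammation = c("GeneB", "GeneD")
)

# One line per set: name, description ("na" here), then the member genes
gmt_lines <- vapply(names(gene_sets), function(set_name) {
  paste(c(set_name, "na", gene_sets[[set_name]]), collapse = "\t")
}, character(1))

gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Read the file back to check the layout
readLines(gmt_file)
```

Writing the file from code avoids Excel's filename and quoting quirks and keeps the gene lists reproducible alongside the analysis script.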

GSEA– R function
1: Get data ready

GSEA – R function
2: Input parameters for main function

GSEA – R function
3: Run main function

GSEA – R function
4: Check output
Note: I asked for data to be output to
“/home/vandersm/Documents/gseaRexample/zavros”
but it gets output to the parent directory (with the intended directory name
prefixed to the prefix)

GSEA – R function
Example output for one gene list (1)

GSEA – R function
Example output for one gene list (2)

GSEA – R function
Example of global plots (all lists considered)

GSEA – R function
Example of report for one gene list
Is this gene in the
leading edge subset?

GSEA – R function
Need to check the parameters you used?

GSEA – R function
Pros:
• Once the data is formatted correctly, the analysis is rather straight-
forward.
• Provides graphical output for individual lists as well as global reports
Cons:
• Lacks options that are easy to change in GUI version
• No option for presorted list
• No options for how to sort list
• Difficult to obtain sorted list
• Creates a matrix of gene ranks corresponding to permuted sample
labels regardless of whether or not sample permutation is chosen
• P-values have potential to be zero [at least if permuting by gene]

GSEA with JAVA GUI
• Similar to the R script/functions, much of the hassle
concerns correctly formatted data. If anyone is
interested in those formats, let me know, for now let’s
not dwell on them

GSEA with JAVA GUI
• Pros:
• Once data is loaded, the ‘point-and-click’ environment is generally
convenient and flexible
• Graphs are somewhat ‘prettier’ than from the R-code
• Results can be used from an easily-navigable page
• Very easy to obtain ranked list of genes
• Cons
• Not necessarily an option for incorporation into our pipeline
• Cannot test functionality in interactive mode
• Crashes if #permutations increases 10-fold
• May yield p-values of 0.

GSEA with phenoTest
• phenoTest – R package by Planet E (2013)
• Implements GSEA in a manner that is rather flexible
(once formatting issues are taken care of!)
• Input dataset is in form of an eset
• Use Jacek’s code/my function for creating an eset from the
raw.data and SampleInfo file obtained during analysis
• Should # of genes match that of ‘processed’ data?
• EX) In my experiment, my ‘all_genes_result’ has data for
16,400 genes but my eset has info for 20,640
• Has modest effect on results
• To compare across implementations, I restricted my eset to
only include genes that were in my ‘all_genes_result’

GSEA with phenoTest
• Calculation of ‘observed’ ES score is the same
but ONLY permutes genes
• Calls permuted ES scores ‘Simulated Scores’; these can
easily be accessed after the analysis is run
• Automatically creates an NES plot – could be better
alternative for publication if many lists are taken into
consideration.
• Has an option to create a plot using the Wilcoxon test, as
discussed by Virtaneva (2001)

GSEA with phenoTest –
Example Graphs
ES Plot NES Plot Wilcoxon-ES Plot

GSEA with phenoTest
• Creation of epheno object makes ‘playing around’
with the data much easier
• See how GSEA will behave when particular patterns among
the ranked genes are artificially created
• Obviously, once you re-create the gene-ranks as they are in the
GSEA BROAD implementation, this can be done with the
aforementioned options; but this saves some computational steps.

GSEA with phenoTest
• Pros:
• Has a few different options as far as plots are concerned
• Function is relatively easy to run [compared to the BROAD script]
and the epheno object is useful
• Run-time is fast (doesn’t ever re-compute permuted S2N)
• Cons
• No option to permute by sample
• Some of the functions that are described in the reference paper do
not work (not a big deal, just annoying)
• Have not figured out how to implement different ranking
mechanism.

Is there a better, perhaps more
‘statistical’ approach?
