This document discusses how changes over time to the Gene Ontology (GO) and GO annotations can affect genomic data analysis and enrichment results. The author analyzed over 2,500 published gene lists (median age 11 years) and found that enrichment results become less semantically similar over time: for 47% of the lists, the results at the time of publication and at a recent time point were less similar than the 95th percentile of a null distribution built from random pairs of lists. Objective changes may conflict with subjective impressions, because enriched terms are often replaced by closely related ones. Researchers are encouraged to use the GOtrack database to evaluate how changes may affect their own data and results.
Paul Pavlidis at #ICG13: Monitoring changes in the Gene Ontology and their impact on genomic data analysis
1. Monitoring changes in the Gene Ontology and their impact on genomic data analysis
Paul Pavlidis, PhD
University of British Columbia, Vancouver, BC Canada
https://pavlab.msl.ubc.ca
October 25, 2018
GigaScience Prize Track
3. The Gene Ontology in 60 seconds
GO = hierarchy of >45,000 terms describing gene function
Applied by annotators to genes, with evidence codes (“GO annotations” = GOA)
Used in tens of thousands of papers
• Gene description
• Algorithm evaluation
• Enrichment analysis
[Figure: example gene GRIN1]
4. Both GO and GOA change over time
Does it matter?
• Are old enrichment results and other interpretations based on GO still valid?
• Will new results be valid in the future?
There is no easy way for researchers to evaluate the effects on their own data.
5. GOtrack database
• Data for 9 model organisms
• Dating back to 2001
• Over 200,000,000 data points
• Updated monthly
Web app functionality
Track genes and terms
Track enrichment results
10. Evaluating the effect of GO/GOA changes
Inputs: Gene lists from MSigDB
• >2500 Chemical and genetic perturbations (CGP) – “hit lists”
• 0.5-16 years old (median 11)
11. Evaluating the effect on enrichment analysis
• Perform enrichment analysis using GO/GOA from the time of publication (t0) and from a recent time point (tnow)
• Compare the lists of enriched terms at t0 and tnow using semantic similarity measures (Jaccard and others)
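The enrichment step described on this slide can be sketched as a simple over-representation test. This is a minimal illustration, not GOtrack's implementation: the slides do not specify the statistical test or the multiple-testing correction, so the hypergeometric test, the Bonferroni correction, the function names, and the toy gene and annotation sets below are all assumptions.

```python
from math import comb


def hypergeom_tail(k, n_hits, n_annotated, n_background):
    """P(X >= k): probability of seeing at least k annotated genes in a hit
    list of n_hits genes drawn from n_background genes, of which
    n_annotated carry the term."""
    upper = min(n_hits, n_annotated)
    favourable = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_hits - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_background, n_hits)


def enriched_terms(hit_list, annotations, background, alpha=0.05):
    """Return the terms over-represented in hit_list.

    `annotations` maps term -> set of genes for ONE snapshot of GO/GOA
    (hypothetical structure). Assumption: Bonferroni correction over terms.
    """
    hits = set(hit_list) & background
    significant = set()
    for term, genes in annotations.items():
        genes = genes & background
        k = len(hits & genes)
        if k == 0:
            continue  # no overlap, cannot be over-represented
        p = hypergeom_tail(k, len(hits), len(genes), len(background))
        if min(1.0, p * len(annotations)) <= alpha:
            significant.add(term)
    return significant


# Toy data (invented): 20 background genes and two annotation snapshots
# in which the same genes are annotated under a different term name.
background = {f"g{i}" for i in range(20)}
hit_list = ["g0", "g1", "g2", "g3"]
core = {"g0", "g1", "g2", "g3", "g4"}
other = {f"g{i}" for i in range(10, 20)}
goa_t0 = {"mitosis": core, "transport": other}
goa_tnow = {"mitotic nuclear division": core, "transport": other}

terms_t0 = enriched_terms(hit_list, goa_t0, background)
terms_tnow = enriched_terms(hit_list, goa_tnow, background)
print(terms_t0, terms_tnow)
```

In this toy case the same hit list yields a different enriched term at each time point purely because the annotation snapshot changed, which is the situation the comparison at t0 and tnow is meant to detect.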
Define a null distribution: t0-tnow comparisons for randomly selected pairs of hit lists
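The comparison and the null distribution can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the plain Jaccard index on sets of enriched terms (the talk mentions "Jaccard and others", which may account for the GO hierarchy), the convention for two empty sets is an arbitrary choice, and the function names and the two HYPOTHETICAL_* hit lists are invented. Only the BENPORATH_ES_2 terms come from the speaker notes.

```python
import random


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two term sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty results count as identical
    return len(a & b) / len(a | b)


def observed_similarities(results_t0, results_tnow):
    """Similarity of each hit list's own t0 and tnow enrichment results."""
    return {name: jaccard(results_t0[name], results_tnow[name])
            for name in results_t0}


def null_similarities(results_t0, results_tnow, n_pairs, seed=0):
    """t0-tnow comparisons for randomly selected pairs of DIFFERENT hit lists."""
    rng = random.Random(seed)
    names = sorted(results_t0)
    null = []
    for _ in range(n_pairs):
        first, second = rng.sample(names, 2)  # two distinct hit lists
        null.append(jaccard(results_t0[first], results_tnow[second]))
    return null


# Enriched terms per hit list at each time point (mostly invented toy data).
results_t0 = {
    "BENPORATH_ES_2": {"DNA replication", "mitosis", "methylation",
                       "epigenetic regulation of gene expression"},
    "HYPOTHETICAL_LIST_A": {"immune response", "inflammatory response"},
    "HYPOTHETICAL_LIST_B": {"lipid metabolic process"},
}
results_tnow = {
    "BENPORATH_ES_2": {"DNA replication initiation",
                       "mitotic nuclear division", "gene silencing"},
    "HYPOTHETICAL_LIST_A": {"immune response", "inflammatory response",
                            "defense response"},
    "HYPOTHETICAL_LIST_B": {"lipid metabolic process",
                            "fatty acid metabolic process"},
}

observed = observed_similarities(results_t0, results_tnow)
null = null_similarities(results_t0, results_tnow, n_pairs=100)
print(observed)
```

A hit list whose own t0-tnow similarity is no higher than what random pairs achieve has results that are, by this measure, no more stable than chance.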
12. New results tend to have more significant terms
Mean number of significant terms: t0 = 21; tnow = 110.
One point = one hit list
13. Semantic similarity drops over time
[Figure legend: Null (random hit list pairs); All t0-tnow comparisons]
• Overall, 47% have results less similar than the 95th percentile of the null
• Correlation between similarity and age is -0.34
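The two summary statistics on this slide can be computed along these lines. This is a hedged sketch: the slides do not say which percentile interpolation or which correlation coefficient was used, so linear interpolation and Pearson correlation are assumptions, and all numbers below are invented toy values rather than results from the study.

```python
from math import sqrt


def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def fraction_below_null(observed, null, q=95):
    """Share of hit lists whose t0-tnow similarity is below the q-th
    percentile of the null distribution."""
    cutoff = percentile(null, q)
    return sum(score < cutoff for score in observed) / len(observed)


def pearson(x, y):
    """Pearson correlation coefficient (assumption: the slide does not
    name the coefficient used)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented toy values: hit-list age in years and t0-tnow similarity.
ages = [1, 4, 8, 12, 16]
similarities = [0.9, 0.7, 0.5, 0.2, 0.1]
null = [0.0, 0.05, 0.1, 0.15, 0.2, 0.3]

print(fraction_below_null(similarities, null))
print(pearson(ages, similarities))
```

With real data, `similarities` would hold one t0-tnow score per hit list and `null` the scores for random hit list pairs; a negative correlation with age corresponds to older hit lists having less stable results.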
14. Objective changes may conflict with subjective impressions
DNA replication
mitosis
M phase of mitotic cell cycle
DNA modification
biopolymer methylation
methylation
pattern specification process
regulation of gene expression, epigenetic
somatic stem cell population maintenance
stem cell population maintenance
maintenance of cell number
DNA replication initiation
G1/S transition of mitotic cell cycle
cell cycle G1/S phase transition
mitotic nuclear division
gene silencing
cell fate specification
endoderm development
t0
tnow
Example of one hit list as an extreme case: Jaccard similarity = 0.0
17. Sanja Rogic
Shreejoy Tripathy
Lilah Toker
Ogan Mancarci
Marjan Farahbod
Manuel Belmadani
Alex Morin
Margot Gunning
Eric Chu
Nivi Thatra
Nathaniel Lim
Shams Bhuiyan
Simran Rai
Stepan Tesar
Dima Vavilov
Aman Sharma
Calvin Chang
John Phan
Jimmy Liu
Former members
Min Feng
Ellie Hogan
Sophia Ly
Cindy-Lee Crichlow
Brandon Huntington
Ben Callaghan
Matthew Jacobson
Dmitry Tebaykin
James Liu
Patrick Savage
Brenna Li
Justin Leong
Nikolaus Fortelny
Nathan Holmes
Patrick Tan
Kris Anderson
Rachel Edgar
Elodie Portales-Casamar
Adri Sedeño
Jesse Gillis
Leon French
Carolyn Ch’ng
Meeta Mistry
Raymond Lim
Eloi Mercier
Anton Zoubarev
Cameron McDonald
Thea Van Rossum
Nicolas St. George
Frances Lui
Artemis Lai
Gayathiri Charathsandran
Luchia Tseng
John Choi
Fangwen Zhao
Jenni Hantula
Tianna Koreman
Olivia Marais
Hugh Brown
Celia Siu
Cathy Kwok
Willie Kwok
Nathan Eveleigh
Collaborators
Kurt Haas
Doug Allen
Tim O’Connor
Cathy Rankin
Chris Loewen
Chris Overall
Shernaz Bamji
Michael Kobor
Geoff Hicks
Suzanne Lewis
Etienne Sibille
Gustavo Turecki
Editor's notes
Previous work
Previous work only tested small numbers of gene sets and usually over shorter periods of time.
Allows us to also look at the effect of time
Stability analysis of 2,573 published hit lists.
(A) Change in number of significant GO terms. Each point is one CGP hit list. Points are jittered to reduce overplotting.
(B) Similarity of enrichment results, using the complete Jaccard index. The CGP hit lists are binned into most recent (orange), old (green), and oldest (blue). The distribution for the CPs is in black. The blue vertical line indicates the 95th percentile of the null.
This is a worst-case scenario for overlap, but completely typical for how the results look.
“The other side of this situation is whether objectively low scores (compared to the null) match subjective impressions of ‘instability’, as well. The answer is yes, but arguably less convincingly. For example, the hit list BENPORATH_ES_2 (Ben-Porath et al., 2008, 40 genes) has a complete Jaccard similarity between t0 and tnow of 0.0. At t0, the enriched terms included ‘DNA replication’, ‘mitosis’, ‘methylation’, and ‘epigenetic regulation of gene expression’. While none of these terms are enriched at tnow, highly related terms such as ‘DNA replication initiation’, ‘mitotic nuclear division’ and ‘gene silencing’ are enriched (Supplementary File 2).”
Hit list from “An embryonic stem cell–like gene expression signature in poorly differentiated aggressive human tumors”