4. 4
Biological knowledge is growing, rapidly
• More than 22 million articles indexed in PubMed
• Growing at about a million/year and rising
5. 5
Scattered genomic knowledge is a problem
• Scientists faced with new and unfamiliar genes on a daily basis
GNF Robotics
Hits: IFITM3, TFE3, BEX1, ST8SIA1, TFEB, BEX2, SKP1A, ....
• Public faced with unfamiliar genes on a daily basis
6. 6
Knowledge synthesis
“the pulling together of ideas or
information to develop a common
framework for understanding”
7. 7
Knowledge synthesis in biology, aka biocuration
• The production of structured data
Unstructured → Structured

Gene         Property               Value
Fibronectin  Biological Process     Angiogenesis
Fibronectin  Cellular Localization  Extracellular matrix
Fibronectin  Related Disease        Glomerulopathy
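A minimal sketch of what "structured" buys you, assuming each row of the Fibronectin table is stored as a gene/property/value record. The `Annotation` class and `values_for` helper are hypothetical names for illustration, not part of any Gene Wiki or GO tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One structured statement: a gene, a property, and a value."""
    gene: str
    prop: str
    value: str


# The three rows shown on the slide.
ANNOTATIONS = [
    Annotation("Fibronectin", "Biological Process", "Angiogenesis"),
    Annotation("Fibronectin", "Cellular Localization", "Extracellular matrix"),
    Annotation("Fibronectin", "Related Disease", "Glomerulopathy"),
]


def values_for(annotations, gene, prop):
    """Return every value recorded for one gene/property pair."""
    return [a.value for a in annotations if a.gene == gene and a.prop == prop]


if __name__ == "__main__":
    print(values_for(ANNOTATIONS, "Fibronectin", "Related Disease"))
```

Once the statements are records rather than prose, a question like "which diseases are linked to this gene?" becomes a one-line filter.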
8. 8
Gene Ontology
“Tool for the unification of biology”[1]
A shared, controlled vocabulary for describing gene function
Molecular Function, Biological Process, Cellular Component
> 10,550 Citations in Google Scholar
[1] Nature Genetics. 2000 May;25(1):25-9.
9. 9
Gene Ontology Annotation Database (‘GOA’)
• Records gene function using gene ontology terms
• Expert synthesis of the knowledge from thousands of articles

Gene         Property               Value
Fibronectin  Biological Process     Angiogenesis
Fibronectin  Cellular Localization  Extracellular matrix
Fibronectin  Related Disease        Glomerulopathy
10. 10
33k articles become 31 gene annotations
Gene Ontology Curators
31 function annotations for
human gene
14. 14
Many genes are not thoroughly annotated
[Figure: GO annotation counts per gene, genes sorted by decreasing counts; series: Biological Process only, + Electronic annotation (IEA)]
Data: NCBI, February 2013
16. 16
Sooner or later, the
research community will
need to be involved in the
annotation effort to scale
up to the rate of data
generation.
17. 17
The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), showing a short head and a long tail]

Content          Short head        Long tail
News reporting   Newspapers        Blogs
Video            TV/Hollywood      YouTube
Product reviews  Consumer reports  Amazon reviews
Food reviews     Food critics      Yelp
Gene annotation  bio-curators      ????????????
18. 18
Wikipedia successfully harnesses the long tail
• Within top 10 most visited websites
• 14 million+ registered users
[Charts: Wikipedia vs. Britannica Online, compared on articles, words (millions), and words/article]
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
20. 20
The Gene Wiki Hypothesis
“We can harness the
Long Tail of scientists
to directly participate in
the gene annotation
process.”
-Andrew Su
21. 21
Goal of the Gene Wiki project
• Enable the creation of a collaboratively
written, continuously updated, high-quality
review article for every human gene.
23. 23
Success depends on a positive feedback loop
[Diagram: feedback loop between number of contributors, value of service, and number of users]
24. Gene “stubs” seed community contributions
• Protein structure
• Gene summary
• Symbols and identifiers
• Gene Ontology annotations
• Protein interactions
• Tissue expression pattern
• Linked references
• Links to structured databases
25. 25
A review article for every gene is powerful
68 editors, 543 edits (as of July 2010)
References to the literature
Hyperlinks to related concepts
26. 26
The Gene Wiki project – 2010 stats
• 10,300 articles
• 1.2 million words, 67MB text (about 1,000 PLoS Biology research articles)
• 55 million page views
• 3,500 editors
• 17,000 edits
[Diagram: feedback loop between number of contributors, value of service, and number of users]
30. 30
The Gene Wiki hitches a ride on Wikipedia
CC photo by ff137 on flickr
31. 31
Take home messages
• Success depends on a positive feedback loop
• Where possible, try to hitch a ride
[Diagram: feedback loop between contributors, value, and users]
32. 32
But still, many genes lack structured annotation…
[Figure: GO annotation counts per gene, genes sorted by decreasing counts; series: Biological Process only, + Electronic annotation (IEA)]
Data: NCBI, February 2013
33. 33
Can we generate structured annotations from the text of the gene wiki?
Gene Wiki text (great for people to read) → ? → structured annotations (great for building software for people to use)

Gene         Property               Value
Fibronectin  Biological Process     Angiogenesis
Fibronectin  Cellular Localization  Extracellular matrix
Fibronectin  Related Disease        Glomerulopathy
36. 36
Simple text mining for gene annotations
[Diagram: a Gene Wiki article is tied to NCBI Entrez Gene: 334 (Gene Wiki mapping); a wikilink in the article is tied to GO:0006897 (GO exact match); together they form a candidate assertion]
Good, BMC Genomics, 2011.
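The wikilink route on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the published pipeline: the `GO_LABELS` lookup is a hypothetical one-entry stand-in for a full GO label index, the sample wikitext is invented, and `candidate_assertions` is a made-up helper name.

```python
import re

# Hypothetical slice of a GO label index (lower-cased label -> GO id).
# A real run would load every label and synonym from the ontology.
GO_LABELS = {"endocytosis": "GO:0006897"}

# Matches [[Target]], [[Target|shown text]] and [[Target#Section]],
# capturing only the link target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_assertions(entrez_gene_id, wikitext, go_labels):
    """Pair a gene with every GO term whose label exactly matches a wikilink target."""
    seen = set()
    assertions = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().lower()
        go_id = go_labels.get(target)
        if go_id is not None and go_id not in seen:
            seen.add(go_id)
            assertions.append((entrez_gene_id, go_id))
    return assertions


if __name__ == "__main__":
    # Invented wikitext, for illustration only.
    sample = "This protein may take part in [[endocytosis]] and binds [[heparin|heparin sulfate]]."
    print(candidate_assertions(334, sample, GO_LABELS))
```

Exact matching on link targets keeps the approach simple, but each result is only a candidate: nothing here checks that the sentence actually supports the annotation.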
37. Finding concepts
• NCBO Annotator Web Service
– Gene Ontology
– Human Disease Ontology
• Annotator service selected for:
– Speed, easy API, precision
Clement Jonquet, Nigam H Shah, Mark A Musen, (2009) The Open Biomedical
Annotator. AMIA Summit on Translational Bioinformatics. 56-60
http://bioportal.bioontology.org/annotator
39. Results
Compared to current dbs:
DO: exact match 23%; match more specific term 2%; match more general term 5%; no match 70%
GO: exact match 12%; match more specific term 2%; match more general term and no match account for 58% and 28% (the pairing of these two labels is not recoverable from the chart)
Manual evaluation on random sample:
[Bar charts for DO and GO; axis labels and values not recoverable]
40. GO problems
[Bar chart; axis labels and values not recoverable]
False match (e.g., “Olfactory receptors ... are responsible for the
transduction of odorant signals.” The system incorrectly identifies
‘transduction’ (GO:0009293), defined as the transfer of genetic
information to a bacterium from a bacteriophage or between bacterial
or yeast cells mediated by a phage vector.)
No support in sentence (e.g., “The protein is composed ... including
10 sialic acid residues, which are attached to the protein during
posttranslational modification in the Golgi apparatus.” Such
sentences may lead to incorrect annotations of ‘Golgi apparatus’ and
‘Posttranslational modification’.)
41. Applications
• Enrichment analysis
• even with false positives, text-mined annotations can
improve statistical analyses that are tolerant to noise.
• GeneWiki+
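For readers unfamiliar with enrichment analysis, here is a minimal sketch of one common form of it, a one-sided hypergeometric test. The slides do not say which test was used, so treat the choice of test, the function name, and the example numbers as assumptions.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in a sample.

    population: genes in the background set
    annotated:  background genes carrying the term of interest
    sample:     genes in the list being tested
    hits:       genes in that list carrying the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Made-up numbers: the same gene list scored against a smaller and a
    # larger annotation set for one term.
    print(enrichment_p_value(population=1000, annotated=20, sample=50, hits=3))
    print(enrichment_p_value(population=1000, annotated=30, sample=50, hits=6))
```

The test only asks whether a term is over-represented across a whole gene list, which is why a moderate number of wrong individual annotations can be tolerated.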
46. Text mining take home
• Depends a lot on the ontology
• (same text, same algorithm,
completely different results)
• Approach depends on corpus
• concept-centric text has advantages
• Approach depends on purpose
• high false positive rates are common
but may be acceptable – e.g.
enrichment analysis
47. Can we skip text mining?
http://fiehnlab.ucdavis.edu/projects/Rice_metabolome/
49. Wikidata
Reelin (Q414043)
Property:P31 "is a": Protein (Q8054), Glycoprotein (Q187126)
Property:P128 "regulates": Neural development (Q1345738)
Property:P129 "interacts with": VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)
http://www.wikidata.org/wiki/Q414043
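The Reelin example can be read as a handful of subject/property/value statements. The sketch below simply transcribes the identifiers and labels from the slide into plain Python. The grouping of values under each property is a reading of the slide layout, `describe` is a hypothetical helper, and nothing here calls the Wikidata API.

```python
# Labels as they appear on the slide, keyed by the slide's identifiers.
LABELS = {
    "Q414043": "Reelin",
    "Q8054": "Protein",
    "Q187126": "Glycoprotein",
    "Q1345738": "Neural development",
    "Q1979313": "VLDL receptor",
    "Q423510": "Amyloid precursor protein",
    "P31": "is a",
    "P128": "regulates",
    "P129": "interacts with",
}

# (subject, property, value) statements, one per arrow on the slide.
STATEMENTS = [
    ("Q414043", "P31", "Q8054"),
    ("Q414043", "P31", "Q187126"),
    ("Q414043", "P128", "Q1345738"),
    ("Q414043", "P129", "Q1979313"),
    ("Q414043", "P129", "Q423510"),
]


def describe(subject, statements, labels):
    """Render every statement about one subject as a readable sentence."""
    return [
        f"{labels[s]} {labels[p]} {labels[o]}"
        for s, p, o in statements
        if s == subject
    ]


if __name__ == "__main__":
    for line in describe("Q414043", STATEMENTS, LABELS):
        print(line)
```

Because the statements are already structured, no text mining step is needed to get from the page to machine-readable annotations.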
53. 53
“We can harness the
Long Tail of scientists
to directly participate in
the gene annotation
process.”
-Andrew Su
54. 54
Gene Wiki acknowledgements..
http://wordle.com
Many Wikipedia editors
WP:MCB Project
“A gene wiki for community annotation of gene function”
PLoS Biology 2008
“The Gene Wiki: community intelligence applied to human gene annotation”
Nucleic Acids Research 2009
“Mining the Gene Wiki for Functional Genomic Knowledge”
BMC Genomics 2011
“The Gene Wiki in 2011: community intelligence applied to human
gene annotation”
Nucleic Acids Research 2012
“Linking genes to diseases with a SNPedia-Gene Wiki mashup”
Journal of Biomedical Semantics 2012
“Building a biomedical semantic network in Wikipedia with Semantic
Wiki Links”
Database: The Journal of Biological Databases and Curation 2012
55. My sister Erin has a PhD in linguistics, lives in Raleigh
and is looking for work in research or teaching.
Help her out!
bgood@scripps.edu
@bgood
i9606.blogspot.com
slideshare/goodb
Funding and Support
NIH / NIGMS (Gene Wiki: GM089820)
56. 56
Gene Wiki content improves enrichment analysis
[Scatter plot: p-value (PubMed + GW) vs. p-value (PubMed only); one region labelled "More significant PubMed + GW", the other "More significant PubMed only"; highlighted term: Muscle contraction]
Good, BMC Genomics, 2011.
Editor's Notes
645,647 articles have been explicitly linked to human genes within the NCBI Gene database (gene2pubmed). Searching PubMed and Google will unearth many more that are clearly relevant but have not been linked yet.
More is produced every day.
The definition that best met my usage here was ... Oddly, it didn’t come from WordNet or even Wiktionary; it came from the glossary of a document describing a preschool curriculum. Not sure why I chose that one, but he might have had something to do with it.
Manual curation.
We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
Much knowledge is locked up in biomedical literature; our goal is to make it computable. If you believe that more than 1.5% of articles have relevance to gene function, then this says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
In 2008, a group of genome database curators got together and wrote an article about the state of art of biocuration. In it, they expressed deep concern about the amount of data that they were already processing and the knowledge that there would only be more of it coming. One of the things they said in this article was that ‘sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation’. The Gene Wiki and related efforts are an attempt to meet that need.
Now at more than 3.5 million articles
Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
Feb. 14, 2011:
10,290 articles
> 75 megabytes text content
> 1.3 million words
35,997 PubMed citations (about 1 for every two sentences)
In past year:
34,839 edits by 3,599 editors
Increase of 2.2 megabytes
55 million page views
Just looking at the citations in PubMed actually understates the situation dramatically.
Much easier to start from a large community with a very high page rank than it is to start from scratch…
Category 1: Yes, this would lead to a new annotation:
1A: perfect match – the candidate annotation is exactly as it would be from a curator (e.g., Titin Scleroderma)
1B: not specific enough – the candidate annotation is correct but a more specific term should be used instead (e.g., Titin Autoimmune disease)
1C: too specific – the candidate annotation is close to correct, but is too specific given the evidence at hand (e.g., Titin Pulmonary Systemic Sclerosis)
Category 2: Maybe, but insufficient evidence:
2A: evaluator could not find enough supporting evidence in the literature after about 10 minutes of looking (e.g., DUSP7 cellular proliferation; there is literature indicating that DUSP7 is a phosphatase that dephosphorylates MAPK, and hence may play a role in regulating cell proliferation stimulated through MAPK. Although no direct evidence supporting this contention for Human DUSP7 was found, it seems plausible.)
2B: there is disagreement in the literature about the truth of this annotation
Category 3: No, this candidate annotation is incorrect:
3A: incorrect concept recognition (e.g., “Olfactory receptors share a 7-transmembrane domain structure with many neurotransmitter and hormone receptors and are responsible for the recognition and G protein-mediated transduction of odorant signals.” [24] The system incorrectly identifies ‘transduction’ (GO:0009293), which is defined as the transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector - a completely different concept from signal transduction as intended in the sentence.)
3B: incorrect sentence context - the sentence is a negation or otherwise does not support the predicted annotation for the given gene (e.g., “The protein is composed of ~300 amino acid residues and has ~30 carbohydrate residues attached including 10 sialic acid residues, which are attached to the protein during posttranslational modification in the Golgi apparatus.” [25] Such sentences may lead to incorrect candidate annotations of ‘Golgi apparatus’ and ‘Posttranslational modification’.)
3C: this sentence seems factually false (e.g., a hypothetical example: “Insulin injections have been shown to cure Parkinson’s disease and lead to the growth of additional toes”.)
GO terms are more common (we found more than twice as many occurrences), are more prone to polysemy, and are more likely to show up in contexts that don’t indicate a direct annotation.
Combines open editing of a wiki, with the robust community of editors at Wikipedia, with the structured data model of a database
We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores