ISB2012: The Gene Wiki: Crowdsourcing human gene annotation
1. The Gene Wiki: Crowdsourcing human gene annotation
Andrew Su, Ph.D.
Department of Molecular and Experimental Medicine
The Scripps Research Institute
Biocuration 2012
April 2, 2012
2. The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), with a Short Head and a Long Tail]
Domain: Short Head | Long Tail
News: Newspapers | Blogs
Video: TV/Hollywood | YouTube
Product reviews: Consumer reports | Amazon reviews
Food reviews: Food critics | Yelp
Talent judging: Olympics | American Idol
Gene annotation: Manual curation | Gene Wiki
3. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
5. Wikipedia has breadth and depth
[Figure: bar charts comparing Wikipedia and Britannica Online by number of articles and number of words (millions)]
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
7. Wiki success depends on positive feedback
[Figure: feedback loop linking Gene Wiki page utility, number of contributors (1, 2), and number of users (100, 200)]
8. 10,000 gene “stubs” within Wikipedia
[Feedback loop: Utility, Users, Contributors]
Each stub contains:
• Protein structure
• Gene summary
• Symbols and identifiers
• Gene Ontology annotations
• Protein interactions
• Tissue expression pattern
• Linked references
• Links to structured databases
Huss, PLoS Biol, 2008
9. Gene Wiki has a critical mass of readers
[Feedback loop: Utility, Users, Contributors]
Total: ~4.3 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
10. Gene Wiki has a critical mass of editors
[Feedback loop: Utility, Users, Contributors]
~10,000 words added / month
Total 1.42 million words ≈ 230 full-length articles
4.3 million views / month
[Figure: cumulative edits over time, split into productive edits and vandalism; 1000 edits / month]
Good, NAR, 2011
11. A review article for every gene is powerful
Reelin: 68 editors, 543 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
References to the literature
Hyperlinks to related concepts
12. Making the Gene Wiki more computable
Free text → Structured annotations
Filling the gaps in gene annotation
NCBI Entrez Gene: 3362
Gene Wiki
mapping
Wikilink Candidate
assertion
GO:0004993
GO exact
synonym
14. Filling the gaps in gene annotation
[Diagram: NCBI Entrez Gene: 334 → Gene Wiki mapping → Wikilink → GO exact match → GO:0006897, giving a candidate assertion]
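The mapping step on the two slides above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the talk: the lookup table `GO_LABELS`, the function `candidate_assertions`, and the wikilink titles are all hypothetical, and the label attached to GO:0006897 is an assumption that does not appear on the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup: lowercased GO term name or exact synonym -> GO ID.
# Only the ID comes from the slide; the label text is an assumption.
GO_LABELS: Dict[str, str] = {
    "endocytosis": "GO:0006897",
}


def candidate_assertions(entrez_gene_id: int,
                         wikilinks: List[str],
                         go_labels: Dict[str, str]) -> List[Tuple[int, str]]:
    """Return (gene, GO ID) pairs for wikilinks whose title exactly
    matches a GO term name or exact synonym (case-insensitive)."""
    found: List[Tuple[int, str]] = []
    for title in wikilinks:
        go_id = go_labels.get(title.strip().lower())
        if go_id is not None and (entrez_gene_id, go_id) not in found:
            found.append((entrez_gene_id, go_id))
    return found


# Toy article for Entrez Gene 334 with made-up wikilinks.
links = ["Endocytosis", "Protein", "Cell membrane"]
print(candidate_assertions(334, links, GO_LABELS))  # [(334, 'GO:0006897')]
```

Exact title matching keeps the sketch simple; it would miss links whose titles differ from the GO label in wording.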
15. Disease associations mined from the Gene Wiki
Good, BMC Genomics 2011, 12:603
Pipeline: Gene Wiki articles (10,271) → filter out seeded text → NCBO Annotator → matched Disease Ontology terms (2983) → compare to DO database → 2147 candidate annotations
Comparison with DO database:
23% exact match
5% match parent
2% match child
70% have no match
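The comparison step, which sorts each mined term into exact match, match parent, match child, or no match, could look something like the sketch below. It assumes that "match parent" means the mined term is an ancestor of an existing annotation and "match child" means it is a descendant; the slide does not define these labels. The toy ontology and the function names are hypothetical.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All terms reachable by following parent links upward from term."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def classify(candidate: str, known: Set[str],
             parents: Dict[str, Set[str]]) -> str:
    """Compare one mined term for a gene with that gene's known annotations."""
    if candidate in known:
        return "exact match"
    # Assumption: the mined term is more general than a known annotation.
    if any(candidate in ancestors(k, parents) for k in known):
        return "match parent"
    # Assumption: the mined term is more specific than a known annotation.
    if ancestors(candidate, parents) & known:
        return "match child"
    return "no match"


# Toy three-level ontology (child -> parents); the IDs are made up.
toy_parents = {"T:leaf": {"T:mid"}, "T:mid": {"T:root"}}
known_for_gene = {"T:mid"}
for term in ["T:mid", "T:root", "T:leaf", "T:other"]:
    print(term, "->", classify(term, known_for_gene, toy_parents))
```

Terms that fall into "no match" are the ones that would count as candidate annotations.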
16. Disease associations mined from the Gene Wiki
Good, BMC Genomics 2011, 12:603
Expert curation
Correct: 86%
Incorrect: 10%
Maybe: 4%
Overall specificity: 90-93%
17. GO associations mined from the Gene Wiki
Good, BMC Genomics 2011, 12:603
Pipeline: Gene Wiki articles (10,271) → filter out seeded text → NCBO Annotator → matched Gene Ontology terms (11,022) → compare to GO database → 6319 candidate annotations
Comparison with GO database:
17% exact match
26% match parent
2% match child
55% have no match
18. GO associations mined from the Gene Wiki
Good, BMC Genomics 2011, 12:603
Expert curation
Correct
14%
Maybe
60% 26%
Incorrect
Overall specificity: 48-64%
19. Common sources of error in GO associations
Good, BMC Genomics 2011, 12:603
1) Incorrect concept recognition
OR2F1: “Olfactory receptors … are responsible for the recognition and G protein-mediated transduction of odorant signals.”
Signal transduction (GO:0007165): The cellular process in which a signal is conveyed to trigger a change in the activity or state of a cell. Signal transduction begins with reception of a signal, e.g. a ligand binding to a receptor or receptor activation by a stimulus such as light, and ends with regulation of a downstream cellular process…
Transduction (GO:0009293): The transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector.
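To see how this kind of error can arise, here is a small sketch of naive dictionary-based concept recognition applied to the OR2F1 sentence. It is not the NCBO Annotator's actual algorithm; the `recognize` function and its longest-label-first strategy are assumptions made for illustration. Only the two GO terms named on the slide are in the dictionary.

```python
import re
from typing import Dict, List

# The two GO terms named on the slide, keyed by lowercased label.
TERMS: Dict[str, str] = {
    "signal transduction": "GO:0007165",
    "transduction": "GO:0009293",
}


def recognize(text: str, terms: Dict[str, str]) -> List[str]:
    """Naive lookup: try longer labels first and consume each matched span."""
    remaining = text.lower()
    hits: List[str] = []
    for label in sorted(terms, key=len, reverse=True):
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, remaining):
            hits.append(terms[label])
            remaining = re.sub(pattern, " ", remaining)
    return hits


sentence = ("Olfactory receptors … are responsible for the recognition and "
            "G protein-mediated transduction of odorant signals.")
print(recognize(sentence, TERMS))  # ['GO:0009293']
print(recognize("It takes part in signal transduction.", TERMS))  # ['GO:0007165']
```

Because the sentence says "transduction of odorant signals" rather than the contiguous phrase "signal transduction", only the bacteriophage-related term matches, even though the intended meaning is signal transduction.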
20. Common sources of error in GO associations
Good, BMC Genomics 2011, 12:603
2) Incorrect sentence context
MEF2C: “Several post translational modifications have been identified including phosphorylation on serine-59 …”
[Figure: MEF2C linked to Phosphorylation, Neurogenesis, and Myelination, alongside a list of processes: Dephosphorylation, Excretion, Gene expression, Glycosylation, Localization, Methylation, Proteolysis, Secretion, Transport, Transcription, Translation]
21. Novel GO annotations – so what?
[Venn diagram]
11,022 annotations mined from Gene Wiki
~100,000 annotations from GO consortium
4703 (43%) match known annotations
6319 “novel” annotations @ 48-64% specificity
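The counts on this slide fit together arithmetically, as the short check below shows. The last line goes one step beyond the slide: it assumes that "specificity" here can be read as the fraction of mined annotations judged correct, which is an interpretation rather than something the slide states. The helper name `expected_correct` is hypothetical.

```python
mined = 11_022         # annotations mined from the Gene Wiki
known = 4_703          # of those, matching known GO annotations
novel = mined - known  # annotations with no match


def expected_correct(n: int, fraction_correct: float) -> int:
    """Expected number of correct annotations at a given correct fraction."""
    return round(n * fraction_correct)


print(novel)                       # 6319
print(round(100 * known / mined))  # 43
# Assumed reading of the 48-64% range as a correct fraction.
print(expected_correct(novel, 0.48), expected_correct(novel, 0.64))  # 3033 4044
```

Under that reading, roughly 3,000 to 4,000 of the 6319 unmatched annotations would be expected to hold up under expert review.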
22. Gene Wiki content improves enrichment analysis
Workflow: GO term axon guidance (GO:0007411) → gene list (264 genes) → PubMed abstracts (811 articles) → concept recognition → enrichment analysis

                                  GO:0007411
                                  Yes     No
Linked genes through PubMed  Yes   13      2
                             No   251  12033

P = 1.55 E-20
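An enrichment p-value for a 2x2 table like the one above can be computed with a one-sided Fisher's exact test. The slide does not name the test it used, so treat this as an assumption; the function name `fisher_one_sided` is hypothetical, and the implementation uses only the standard library.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(X >= a) under the hypergeometric null for the table [[a, b], [c, d]]."""
    row1 = a + b            # genes linked through PubMed
    col1 = a + c            # genes annotated with the GO term
    total = a + b + c + d   # all genes considered
    upper = min(row1, col1)
    favourable = sum(comb(col1, k) * comb(total - col1, row1 - k)
                     for k in range(a, upper + 1))
    return favourable / comb(total, row1)


# Counts from the slide: rows = linked through PubMed (Yes/No),
# columns = annotated with GO:0007411 (Yes/No).
p = fisher_one_sided(13, 2, 251, 12033)
print(f"P = {p:.2e}")
```

With these counts the result should come out on the order of 1e-20, which is in line with the value on the slide if the same test was used.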
23. Gene Wiki content improves enrichment analysis
Workflow: GO term muscle contraction (GO:0006936) → gene list (87 genes) → PubMed abstracts (251 articles) + Gene Wiki (87 articles) → concept recognition → enrichment analysis
GO:0006936, linked genes through PubMed: P = 1.0
GO:0006936, linked genes through PubMed + Gene Wiki: P = 1.22 E-09
24. Gene Wiki content improves enrichment analysis
[Figure: scatter plot of p-value (PubMed + GW) against p-value (PubMed only); one region is labeled "More significant PubMed only" and the other "More significant PubMed + GW"; the "Muscle contraction" point is labeled]
25. Challenges and future directions
• How to complement and integrate with traditional biocuration workflows?
• How to disseminate and utilize crowdsourced annotations?
26. The Long Tail of scientists is a valuable source of information on gene function
27. Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Erik Clarke
Ian Macleod
Ben Good (*)
Chunlei Wu
Salvatore Loguercio
See poster # 30 for more on the Gene Wiki and crowdsourcing in biology!
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
28. Making the Gene Wiki more reliable
Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …
The company name is derived from old Greek, and means "destroyer of birds".
29. Making the Gene Wiki more reliable
High-trust author (36211 total edits): Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …
Low-trust author (36 total edits): The company name is derived from old Greek, and means "destroyer of birds".
http://www.wikitrust.net/
Editor's notes
Relying on the entire community of scientists to digest the biomedical literature: identification filtering extraction summarization
Transduction accounts for 70% of the concept recognition problems
Tried on 773 GO categories, significant in 356 cases (46%)
We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
We started working with Doug Howe because he helped us learn a lot about biocuration, but clearly we’d need to expand partners. In particular, since GO curation seems to be largely drawn by organisms
Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.