Crowdsourcing is gaining popularity in biology as a mechanism for tackling challenges of massive scale. However, to maximize participation and lower the barriers to entry, contributions to crowdsourcing efforts are typically not well structured, which makes computing on these data difficult. This presentation will discuss strategies for translating this unstructured content into structured data. Three vignettes (in varying degrees of completion) will be described, one each from our Gene Wiki [1], BioGPS [2], and serious gaming [3] initiatives.
[1]: http://en.wikipedia.org/wiki/Portal:Gene_Wiki
[2]: http://biogps.org
[3]: http://genegames.org
4. 4
Few genes are well annotated
[Figure: GO annotation counts per gene for 20,473 protein-coding genes, sorted by decreasing counts. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Callouts: 65%, 41%.]
Data: NCBI, February 2013
5. 5
Few genes are well annotated
[Figure: GO annotation counts per gene, with electronic annotation (IEA) added; genes sorted by decreasing counts.]
Data: NCBI, February 2013
6. 6
Few genes are well annotated
[Figure: GO annotation counts per gene, with electronic annotation (IEA) added, Biological Process only; genes sorted by decreasing counts.]
Data: NCBI, February 2013
11. 11
10,000 gene “stubs” within Wikipedia
• Protein structure
• Gene summary
• Symbols and identifiers
• Gene Ontology annotations
• Protein interactions
• Tissue expression pattern
• Linked references
• Links to structured databases
Huss, PLoS Biol, 2008
12. 12
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
13. 13
Gene Wiki has a critical mass of editors
[Figure: two panels, editor count per editor and edit count per edit.]
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
14. 14
A review article for every gene is powerful
Reelin: 68 editors, 543 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
References to the literature
Hyperlinks to related concepts
23. 23
Wikidata
"Provide a database of the world's knowledge that anyone can edit"
- Denny Vrandečić
24. 24
Wikidata
Reelin (Q414043)
Property:P31 is a: Protein (Q8054), Glycoprotein (Q187126)
Property:P128 regulates: Neural development (Q1345738)
Property:P129 interacts with: VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)
http://www.wikidata.org/wiki/Q414043
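The item, property, and value identifiers on this slide can be read as simple subject-property-value statements. The following is a minimal sketch in Python of that data model, using only the identifiers and labels shown on the slide. The `Statement` type, the `LABELS` table, and the `values_for` helper are hypothetical names chosen for illustration; they are not part of Wikidata or of any client library.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One (item, property, value) statement; all three fields are IDs."""
    subject: str
    prop: str
    value: str


# Human-readable labels for the IDs that appear on the slide.
LABELS = {
    "Q414043": "Reelin",
    "Q8054": "Protein",
    "Q187126": "Glycoprotein",
    "Q1345738": "Neural development",
    "Q1979313": "VLDL receptor",
    "Q423510": "Amyloid precursor protein",
    "P31": "is a",
    "P128": "regulates",
    "P129": "interacts with",
}

# The statements about Reelin, as transcribed from the slide.
STATEMENTS = [
    Statement("Q414043", "P31", "Q8054"),
    Statement("Q414043", "P31", "Q187126"),
    Statement("Q414043", "P128", "Q1345738"),
    Statement("Q414043", "P129", "Q1979313"),
    Statement("Q414043", "P129", "Q423510"),
]


def values_for(subject, prop, statements=STATEMENTS):
    """Return the labels of all values linked to `subject` via `prop`."""
    return [
        LABELS[s.value]
        for s in statements
        if s.subject == subject and s.prop == prop
    ]


if __name__ == "__main__":
    for s in STATEMENTS:
        print(LABELS[s.subject], LABELS[s.prop], LABELS[s.value])
```

Because every field is an identifier rather than free text, a question such as "what does Reelin interact with?" becomes a filter over the statements, for example `values_for("Q414043", "P129")`.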
45. 45
150 billion human hours
per year
http://www.flickr.com/photos/rvp-cw/6243289302/
46. 46
Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
50. 50
No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
51. 51
No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility
52. 52
No good gene-disease annotation database
Query: Apolipoprotein E
? Alzheimer's disease (AD)
? Lipoprotein glomerulopathy
? Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
? Macular degeneration, age-related
? Myocardial infarction susceptibility
HIV
Psoriasis
Vascular Diseases
53. 53
No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD)
Memory
Coronary Artery Disease
Neuropsychological Tests
Hypertension
Cognition Disorders
Mental Status Schedule
Psychiatric Status Rating Scales
Dementia
Cognition
Hyperlipidemias
Atrophy
Disease Progression
Dementia, Vascular
Cardiovascular Diseases
Parkinson Disease
Brain Injuries
Coronary Disease
Myocardial Infarction
Diabetes Mellitus, Type 2
Memory Disorders
…
477 diseases!
54. 54
Play Dizeez to annotate gene-disease links
1. Read the clue (gene)
2. Click the related disease (only one is "right")
3. If it's "right", you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!
55. 55
Dizeez players seem pretty smart…
In total (since Dec 2011):
• 230 unique gamers
• 1045 games played
• 8525 guesses
# Occurrences   Gene     Disease
11              NBPF3    neuroblastoma
11              SOX8     mental retardation
9               ABL1     leukemia
9               SSX1     synovial sarcoma
8               APC      colorectal cancer
8               FES      sarcoma
8               RBP3     retinoblastoma
8               GAST     gastrinoma
8               DCC      colorectal cancer
8               MAP3K5   cancer
Additional columns on the slide: Gene Wiki, OMIM, PharmGKB, PubMed (per-row values not preserved)
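The occurrence counts in this table amount to tallying how often players linked a given gene to a given disease. The sketch below shows one way such a tally could be computed in Python. It is an illustration only: the `candidate_links` function and its `min_votes` threshold are hypothetical, and the example input is a toy list, not the real Dizeez game log.

```python
from collections import Counter


def candidate_links(guesses, min_votes=2):
    """Tally (gene, disease) guesses and keep pairs seen at least `min_votes` times.

    `guesses` is an iterable of (gene, disease) tuples, one per player guess.
    Returns (gene, disease, count) tuples, most frequent first.
    """
    counts = Counter(guesses)
    return [
        (gene, disease, n)
        for (gene, disease), n in counts.most_common()
        if n >= min_votes
    ]


if __name__ == "__main__":
    # Toy input for illustration; these counts are NOT the real game data.
    toy_guesses = (
        [("ABL1", "leukemia")] * 3
        + [("APC", "colorectal cancer")] * 2
        + [("APC", "leukemia")]
    )
    for gene, disease, n in candidate_links(toy_guesses):
        print(n, gene, disease)
```

A vote threshold is one simple way to separate repeated, consistent guesses from one-off clicks; the value used here is arbitrary.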
56. 56
Using games to predict phenotype from genotype?
http://genegames.org
57. 57
Classification problems in genome biology
[Figure: a data matrix of 100s of samples (cancer and normal) by 100,000s of features. Find patterns that separate cancer from normal, then classify new samples.]
Methods: SVM, neural networks, naïve Bayes, KNN, …
58. 58
Random forests
[Figure: from a data matrix of 100s of samples (cancer and normal) by 100,000s of features, sample a subset of cases and features, then train a decision tree.]
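The procedure on this slide (sample a subset of cases and features, train a decision tree) is repeated many times in a random forest, and the trees vote on each new sample. The following is a minimal, standard-library-only sketch of that idea in Python. To keep it short, each "tree" is a one-split decision stump, which is a simplification and not how production random forests are built; all function names, parameters, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels, features):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # No usable split (e.g. all values identical): always predict the majority.
        majority = Counter(labels).most_common(1)[0][0]
        return (features[0], float("inf"), majority, majority)
    return best[1:]


def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train `n_trees` stumps, each on a bootstrap sample and a feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]       # sample cases with replacement
        feats = rng.sample(range(len(rows[0])), n_features)  # sample a subset of features
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Classify `row` by majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy expression-like data: every feature is low in normal, high in cancer.
    rows = [[1, 2, 1], [2, 1, 2], [3, 3, 4], [4, 4, 3],
            [5, 6, 5], [6, 5, 7], [7, 8, 6], [8, 7, 8]]
    labels = ["normal"] * 4 + ["cancer"] * 4
    forest = train_forest(rows, labels)
    print(predict(forest, [0, 0, 0]), predict(forest, [9, 9, 9]))
```

Sampling both cases and features gives each tree a different view of the data, which matters when there are far more features than samples, as in the setting described here.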
73. 73
Results
• 214 registered players
– 50% declared knowledge of cancer biology
– 40% self-identified as having a Ph.D.
• Prediction results
– 70% correct on survival concordance index
– Best scoring model was 76%
– Player registrations still increasing!
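The prediction results above are reported as a survival concordance index. The slide does not say which definition or dataset was used, so the sketch below assumes the common Harrell-style definition: among all comparable pairs of patients, the fraction in which the patient with the shorter observed survival was assigned the higher predicted risk. The function name and the example values are hypothetical.

```python
def concordance_index(times, risk_scores, events):
    """Harrell-style concordance index (an assumed definition, see lead-in).

    times:       observed survival or follow-up times
    risk_scores: predicted risk (higher means shorter expected survival)
    events:      True if the event was observed, False if censored
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had an observed event strictly before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    times = [2, 4, 6, 8]
    events = [True, True, True, True]
    print(concordance_index(times, [0.9, 0.7, 0.4, 0.1], events))  # perfect ranking
    print(concordance_index(times, [0.9, 0.4, 0.7, 0.1], events))  # one pair swapped
```

Under this definition, 0.5 corresponds to random ranking and 1.0 to a perfect ranking.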
74. 74
Crowdsourcing
empowers the entire
scientific community to
directly participate in the
gene annotation process.
75. 75
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Katie Fisch
Max Nanis
Ben Good
Chunlei Wu
Salvatore Loguercio
Key group alumni
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Editor's notes
We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren't biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
Much knowledge is locked up in the biomedical literature; our goal is to make it computable. If you believe that more than 1.5% of articles are relevant to gene function, then this suggests there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
For each resource: briefly describe the unstructured resource, then describe the structuring approach.
Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
Combines the open editing of a wiki and the robust community of editors at Wikipedia with the structured data model of a database.
MODs and portals
Genetics resources
Literature resources
Protein resources
Pathway and expression databases