Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
April 5, 2013
UCSD DBMI Seminar
Few genes are well annotated…
[Figure: GO annotation counts per gene, with the 20,473 protein-coding genes sorted by decreasing counts. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Percentage labels: 41%, 65%. Data: NCBI, February 2013]
… because the literature is sparsely curated?
[Figure: number of PubMed-indexed articles per year, 1979 to 2009 (axis 0 to 1,000,000), with a second series labeled "Average capacity of human scientist" (axis: number of articles read by typical scientist, 0 to 20)]
311,696 articles (1.5% of PubMed) have been cited by GO annotations
Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
The Long Tail is a prolific source of content
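[Figure: content produced vs. contributors (sorted), split into a Short Head and a Long Tail]
News: Newspapers (Short Head), Blogs (Long Tail)
Video: TV/Hollywood (Short Head), YouTube (Long Tail)
Product reviews: Consumer reports (Short Head), Amazon reviews (Long Tail)
Food reviews: Food critics (Short Head), Yelp (Long Tail)
Talent judging: Olympics (Short Head), American Idol (Long Tail)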
Wikipedia is reasonably accurate
Wikipedia has breadth and depth
[Figure: articles and words (millions), Wikipedia vs. Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]
We can harness the Long Tail of scientists to directly participate in the gene annotation process.
From crowdsourcing to structured data
The Gene Wiki
Biological Games
Filtering, extracting, and summarizing PubMed
[Diagram labels: Documents, Concepts, Review article]
Wiki success depends on a positive feedback
Gene wiki page utility
Number of users
Number of contributors
1001
2002
10,000 gene “stubs” within Wikipedia
Page elements: protein structure; symbols and identifiers; tissue expression pattern; Gene Ontology annotations; links to structured databases; gene summary; protein interactions; linked references
Huss, PLoS Biol, 2008
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
[Figure: editor count and edit count]
A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
Making the Gene Wiki more computable
Structured annotations
Free text
Filling the gaps in gene annotation
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
Candidate assertion
GO:0006897
6319 novel GO annotations
2147 novel DO annotations
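The mapping on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming that a candidate assertion is emitted whenever a wikilink on a gene's page exactly matches a GO term name. The lookup tables `GO_TERMS` and `GENE_WIKI_PAGES`, the page title and the function name are hypothetical stand-ins, not the Gene Wiki pipeline itself. Entrez Gene 334 and GO:0006897 (endocytosis) come from the slide.

```python
# Hypothetical lookup tables; real ones would be built from GO and the Gene Wiki.
GO_TERMS = {"endocytosis": "GO:0006897", "axon guidance": "GO:0007411"}
GENE_WIKI_PAGES = {"Example gene page": 334}  # page title -> NCBI Entrez Gene ID


def candidate_assertions(page_title, wikilinks):
    """Return (gene_id, go_id) pairs for wikilinks that exactly match a GO term name."""
    gene_id = GENE_WIKI_PAGES.get(page_title)
    if gene_id is None:
        return []  # page is not mapped to a gene
    found = []
    for link in wikilinks:
        go_id = GO_TERMS.get(link.strip().lower())
        if go_id is not None:
            found.append((gene_id, go_id))
    return found


print(candidate_assertions("Example gene page", ["Endocytosis", "Cell membrane"]))
```

Such pairs are only candidates: a link to a concept does not by itself establish that the gene has that function, so some form of review or filtering would still be needed.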
Gene Wiki content improves enrichment analysis
Workflow labels: GO term, Gene list, Concept recognition, PubMed abstracts, Enrichment analysis
Example: axon guidance (GO:0007411), 264 genes; 811 articles; linked genes through PubMed
P = 1.55E-20
Contingency table (row and column headings not recoverable):
        Yes     No
Yes      13      2
No      251  12033
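The enrichment P value can be approximately reproduced from the 2×2 table. The sketch below assumes a one-sided hypergeometric (Fisher's exact) test, and assumes that the table cross-tabulates "linked through PubMed" against "annotated with the GO term"; neither is stated on the slide, and the function name is hypothetical.

```python
from math import comb


def enrichment_p(both, linked_only, term_only, neither):
    """One-sided P(overlap >= both) under the hypergeometric null."""
    n_linked = both + linked_only   # genes linked through PubMed
    n_term = both + term_only       # genes annotated with the GO term
    n_total = n_linked + term_only + neither
    ways = sum(
        comb(n_term, k) * comb(n_total - n_term, n_linked - k)
        for k in range(both, min(n_term, n_linked) + 1)
    )
    return ways / comb(n_total, n_linked)


p = enrichment_p(both=13, linked_only=2, term_only=251, neither=12033)
print(f"P = {p:.3g}")
```

With the counts from the slide (13 + 251 = 264 genes carry the GO term), the result is on the order of 1.5E-20, which is consistent with the reported P = 1.55E-20.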
Workflow labels: GO term, Gene list, Concept recognition, PubMed abstracts + Gene Wiki, Enrichment analysis
Example: muscle contraction (GO:0006936), 87 genes; 251 articles; 87 articles
Linked genes through PubMed: P = 1.0
Linked genes through PubMed + Gene Wiki: P = 1.22E-09
[Figure: scatter plot of p-value (PubMed only) against p-value (PubMed + GW), with regions marked as more significant with PubMed + GW and more significant with PubMed only; muscle contraction is highlighted]
Making the Gene Wiki more computable
Structured annotations
Free text
Analyses
Databases
Linked Data
The Long Tail of scientists is a valuable source of information on gene function
From crowdsourcing to structured data
The Gene Wiki
Biological Games
Gene databases are numerous and overlapping
… and hundreds more …
Why is there so much redundancy?
[Diagram labels: Users, Requests, Resources, Time, Community development]
BioGPS emphasizes community extensibility
Why do developers define the gene report view?
BioGPS emphasizes user customizability
http://biogps.org
Community extensibility and user customizability
Utility: A simple and universal plugin interface
Each plugin is a URL template; a gene entity fills the placeholder to give the rendered URL:
KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}
STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}
Pubmed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}
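The URL-template idea can be illustrated with a short Python sketch. It assumes that rendering is a plain substitution of `{{Field}}` placeholders with attributes of the gene entity; the function name and the dictionary layout are hypothetical, and the TP53 identifiers are used only as example values. Only the KEGG template is used because its URL is given in full on the slide.

```python
import re


def render_plugin_url(template, gene):
    """Replace each {{Field}} placeholder with the matching gene attribute."""
    def fill(match):
        return str(gene[match.group(1)])
    return re.sub(r"\{\{(\w+)\}\}", fill, template)


KEGG_TEMPLATE = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"

# Example gene entity (illustrative values for TP53).
tp53 = {"Symbol": "TP53", "EntrezGene": 7157, "EnsemblGene": "ENSG00000141510"}

print(render_plugin_url(KEGG_TEMPLATE, tp53))
# prints: http://www.genome.jp/dbget-bin/www_bget?hsa:7157
```

Because a plugin is only a template string, registering a new resource requires no code on the portal side, which fits the "simple and universal" claim in the slide title.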
Total of > 540 gene-centric online databases registered as BioGPS plugins
Users: BioGPS has critical mass
• > 6400 registered users
• 14,000 unique visitors per month
• 155,000 page views per month
Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC
[Figure: daily pageviews]
Contributors: Explicit and implicit knowledge
540 plugins registered (>300 publicly shared) by over 120 users spanning 280+ domains
All resources should provide RDF…
Mining structured content from HTML
Defining a data extraction template
… TP53 TNF APOE IL6 VEGF … EGFR TGFB1
The BioGPS Semantic Annotator
http://54.244.135.254:8080
The Long Tail of bioinformaticians can collaboratively build a gene portal.
From crowdsourcing to structured data
The Gene Wiki
Biological Games
Seven million human hours (image: http://www.flickr.com/photos/archana3k1/4124330493/)
Twenty million human hours (image: http://www.flickr.com/photos/ableman/2171326385/)
150 billion human hours per year (image: http://www.flickr.com/photos/rvp-cw/6243289302/)
Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved the enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
Using games to fold RNAs
http://eterna.cmu.edu/
Using games to align sequences
http://phylo.cs.mcgill.ca
Using games to diagnose malaria infection
http://biogames.ee.ucla.edu/
Using games to map neurons
http://eyewire.org
Using games to annotate genes?
http://genegames.org
No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility
HIV
Psoriasis
Vascular Diseases
? ? ? ? ?
Query: Apolipoprotein E
Alzheimer's disease (AD)
Neuropsychological Tests
Cognition Disorders
Dementia
Cognition
Disease Progression
Cardiovascular Diseases
Coronary Disease
Diabetes Mellitus, Type 2
Memory Disorders
Memory
Coronary Artery Disease
Hypertension
Mental Status Schedule
Psychiatric Status Rating Scales
Hyperlipidemias
Atrophy
Dementia, Vascular
Parkinson Disease
Brain Injuries
Myocardial Infarction
…
477 diseases!
Play Dizeez to annotate gene-disease links
1. Read the clue (gene)
2. Click the related disease (only one is "right")
3. If it's "right", you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!
Dizeez players seem pretty smart…
In total (since Dec 2011):
• 230 unique gamers
• 1045 games played
• 8525 guesses
# Occurrences  Gene    Disease
11             NBPF3   neuroblastoma
11             SOX8    mental retardation
9              ABL1    leukemia
9              SSX1    synovial sarcoma
8              APC     colorectal cancer
8              FES     sarcoma
8              RBP3    retinoblastoma
8              GAST    gastrinoma
8              DCC     colorectal cancer
8              MAP3K5  cancer
[Additional columns: Gene Wiki, OMIM, PharmGKB, PubMed; cell values not recoverable]
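A table like the one above could be produced by tallying player guesses. The following is a minimal sketch, assuming each guess is logged as a (gene, disease) pair and that "# Occurrences" counts how often a pair was chosen; the log format, the threshold and the function name are hypothetical, and the sample guesses are invented for illustration.

```python
from collections import Counter


def tally_guesses(guesses, min_count=2):
    """Count repeated (gene, disease) guesses and keep those seen at least min_count times."""
    counts = Counter(guesses)
    return [
        (n, gene, disease)
        for (gene, disease), n in counts.most_common()
        if n >= min_count
    ]


# Invented example log, not Dizeez data.
log = (
    [("ABL1", "leukemia")] * 3
    + [("APC", "colorectal cancer")] * 2
    + [("TP53", "psoriasis")]
)
for row in tally_guesses(log):
    print(row)
```

Requiring agreement between several independent guesses is one simple way to separate likely associations from one-off clicks.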
Using games to predict phenotype from genotype?
http://genegames.org
Classification problems in genome biology
[Diagram: training data of 100s of samples × 100,000s of features, labeled cancer or normal; find patterns; classify new samples as cancer or normal]
Methods: SVM, neural networks, Naïve Bayes, KNN, …
Random forests
Training data: 100s of samples × 100,000s of features, labeled cancer or normal
Sample subset of cases and features
Train decision tree
Classify new samples (cancer / normal)
How to interject biological knowledge?
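The random forest steps on these slides (sample a subset of cases and features, train a tree, classify new samples by vote) can be sketched in plain Python. This is a toy illustration under stated assumptions: each tree is reduced to a single split (a decision stump) to keep the code short, all function names are hypothetical, and the expression values are invented.

```python
import random
from collections import Counter


def train_stump(rows, labels, features):
    """Fit a one-split tree: pick the feature/threshold with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[f] > t]
            left_vote = Counter(left).most_common(1)[0][0]
            right_vote = Counter(right).most_common(1)[0][0] if right else left_vote
            errors = sum(lab != left_vote for lab in left)
            errors += sum(lab != right_vote for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_vote, right_vote)
    return best[1:]


def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        cases = [rng.randrange(len(rows)) for _ in rows]      # bootstrap sample of cases
        feats = rng.sample(range(len(rows[0])), n_features)   # random subset of features
        forest.append(train_stump([rows[i] for i in cases],
                                  [labels[i] for i in cases], feats))
    return forest


def classify(forest, row):
    """Majority vote over all trees."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Invented toy expression matrix: 8 samples, 3 features.
rows = [[9, 8, 7], [8, 9, 9], [7, 7, 8], [9, 9, 8],
        [1, 2, 1], [2, 1, 3], [3, 2, 2], [1, 1, 1]]
labels = ["cancer"] * 4 + ["normal"] * 4
forest = train_forest(rows, labels)
print(classify(forest, [10, 10, 10]), classify(forest, [0, 0, 0]))
```

In the uniform `rng.sample` call every feature is equally likely to be drawn, which is the step the slide's closing question targets.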
Network-guided forests
Dutkowski & Ideker (2011). PLoS Computational Biology
Sample features by PPI network
Train decision tree
Human-guided forests
Sample features by human intelligence
Train decision tree
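One simple way to read "sample features by PPI network" or "by human intelligence" is to replace the uniform feature draw of a random forest with a weighted one. The sketch below assumes each feature carries a non-negative weight (for example, how often players selected that gene); the weights and the function name are hypothetical, and this is an illustration rather than the procedure of Dutkowski & Ideker or of The Cure.

```python
import random


def sample_guided_features(weights, k, rng):
    """Draw k distinct feature indices with probability proportional to their weight.

    Requires at least k features with a positive weight.
    """
    remaining = dict(enumerate(weights))
    chosen = []
    for _ in range(k):
        candidates = list(remaining)
        pick = rng.choices(candidates, weights=[remaining[c] for c in candidates])[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen


# Hypothetical weights: features 0 and 1 were never chosen by players.
player_votes = [0, 0, 5, 1]
print(sorted(sample_guided_features(player_votes, 2, random.Random(0))))
# prints: [2, 3]
```

Features with zero weight are never drawn, so prior knowledge narrows the search space before any tree is trained.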
The Cure: Genomic predictors for disease
Human-guided forests
Classify new samples (cancer / normal)
“Critical Assessment”-style challenge
Results
• 214 registered players
– 50% declared knowledge of cancer biology
– 40% self-identified as having Ph.D.
• Prediction results
– 70% correct on survival concordance index
– Best scoring model was 76%
– Player registrations still increasing!
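The survival concordance index quoted above can be illustrated with the standard pairwise definition: among pairs of patients whose order of events is known, it is the fraction in which the patient with the shorter survival received the higher predicted risk. This is a minimal sketch with hypothetical names and invented values; the challenge's own scoring code is not shown on the slides.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs ordered correctly by predicted risk.

    times:  observed survival or follow-up times
    events: True if the event was observed, False if censored
    risks:  predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when the earlier time is an observed event.
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties count as half
    return concordant / comparable


print(concordance_index([1, 2, 3], [True, True, True], [3, 2, 1]))  # prints: 1.0
print(concordance_index([1, 2, 3], [True, True, True], [1, 2, 3]))  # prints: 0.0
```

A value of 0.5 corresponds to random ordering, which gives context for the 70% and 76% figures on the slide.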
The Long Tail of gamers can collaboratively build an accurate disease classifier.
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Katie Fisch
Ben Good
Salvatore Loguercio
Max Nanis
Chunlei Wu
Key group alumni
Adriel Carolino
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Doctoral Program in Chemical and Biological Sciences
Office of Graduate Studies
10550 N. Torrey Pines Road
La Jolla, CA 92037
Email: gradprgrm@scripps.edu
Phone: 858.784.8469
http://education.scripps.edu

A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Dernier (20)

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org

  • 1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org. Andrew Su, Ph.D. (@andrewsu, asu@scripps.edu, http://sulab.org). April 5, 2013, UCSD DBMI Seminar.
  • 2. Few genes are well annotated… [Chart: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts; callouts at 41% and 65%. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Data: NCBI, February 2013.]
  • 3. … because the literature is sparsely curated? [Chart: number of PubMed-indexed articles, 1979 to 2009; y-axis from 0 to 1,000,000.]
  • 4. … because the literature is sparsely curated? [Chart, 1979 to 2009, y-axis from 0 to 20: number of articles read by a typical scientist (average capacity of a human scientist).]
  • 5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations.
  • 6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  • 7. The Long Tail is a prolific source of content. [Chart: content produced vs. contributors (sorted), split into a Short Head and a Long Tail.] Short Head vs. Long Tail examples: News: newspapers vs. blogs. Video: TV/Hollywood vs. YouTube. Product reviews: Consumer Reports vs. Amazon reviews. Food reviews: food critics vs. Yelp. Talent judging: Olympics vs. American Idol.
  • 9. Wikipedia has breadth and depth. [Chart: articles and words (millions), Wikipedia vs. Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008.]
  • 10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  • 11. From crowdsourcing to structured data. The Gene Wiki. Biological Games.
  • 12. Filtering, extracting, and summarizing PubMed. [Diagram labels: Documents, Concepts, Review article.]
  • 13. Filtering, extracting, and summarizing PubMed. [Diagram labels: Documents, Concepts.]
  • 14. Wiki success depends on positive feedback. [Diagram labels: Gene Wiki page utility, number of users, number of contributors; values shown: 1001, 2002.]
  • 15. 10,000 gene “stubs” within Wikipedia. Page elements: protein structure; symbols and identifiers; tissue expression pattern; Gene Ontology annotations; links to structured databases; gene summary; protein interactions; linked references. (Huss, PLoS Biol, 2008) [Feedback loop: Utility, Users, Contributors.]
  • 16. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. (Huss, PLoS Biol, 2008; Good, NAR, 2011) [Feedback loop: Utility, Users, Contributors.]
  • 17. Gene Wiki has a critical mass of editors. Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. (Good, NAR, 2011) [Chart axes: editor count (Editors), edit count (Edits). Feedback loop: Utility, Users, Contributors.]
  • 18. A review article for every gene is powerful. References to the literature. Hyperlinks to related concepts. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002.
  • 19. Making the Gene Wiki more computable. [Diagram labels: Structured annotations, Free text.]
  • 20. Filling the gaps in gene annotation. [Diagram labels: Wikilink; GO exact match; Gene Wiki mapping; NCBI Entrez Gene: 334; Candidate assertion; GO:0006897.] Results: 6319 novel GO annotations, 2147 novel DO annotations.
  • 21. Gene Wiki content improves enrichment analysis. [Workflow labels: GO term, Gene list, Concept recognition, PubMed abstracts, Enrichment analysis.] Example: axon guidance (GO:0007411), 264 genes, 811 articles. Linked genes through PubMed: P = 1.55E-20. 2×2 table (column headers Yes / No): row Yes: 13, 2; row No: 251, 12033.
  • 22. Gene Wiki content improves enrichment analysis. [Workflow labels: GO term, Gene list, Concept recognition, PubMed abstracts + Gene Wiki, Enrichment analysis.] Example: muscle contraction (GO:0006936), 87 genes. Linked genes through PubMed: P = 1.0. Linked genes through PubMed + Gene Wiki: P = 1.22E-09. Counts shown: 251 articles, 87 articles.
  • 23. Gene Wiki content improves enrichment analysis. [Scatter plot: p-value (PubMed only) vs. p-value (PubMed + GW), with regions "more significant PubMed + GW" and "more significant PubMed only"; muscle contraction highlighted.]
  • 24. Making the Gene Wiki more computable. [Diagram labels: Structured annotations, Free text, Analyses.]
  • 25. Making the Gene Wiki more computable. [Diagram labels: Structured annotations, Free text, Databases.]
  • 26. Making the Gene Wiki more computable. [Diagram labels: Databases, Linked Data.]
  • 27. The Long Tail of scientists is a valuable source of information on gene function.
  • 28. From crowdsourcing to structured data. The Gene Wiki. Biological Games.
  • 29. Gene databases are numerous and overlapping … and hundreds more …
  • 30. Why is there so much redundancy? [Chart labels: Users, Requests, Resources, Time, Community development.] BioGPS emphasizes community extensibility.
  • 31. Why do developers define the gene report view? BioGPS emphasizes user customizability.
  • 33. Utility: A simple and universal plugin interface. A URL template is combined with a gene entity to give a rendered URL. Example templates: KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}} ; STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}} ; PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}
  • 34. Utility: A simple and universal plugin interface. [Feedback loop: Utility, Users, Contributors.]
  • 35. Utility: A simple and universal plugin interface. [Feedback loop: Utility, Users, Contributors.]
  • 36. Utility: A simple and universal plugin interface. [Feedback loop: Utility, Users, Contributors.]
  • 37. Utility: A simple and universal plugin interface. [Feedback loop: Utility, Users, Contributors.]
  • 38. Utility: A simple and universal plugin interface. [Feedback loop: Utility, Users, Contributors.]
  • 39. Utility: A simple and universal plugin interface. Total of > 540 gene-centric online databases registered as BioGPS plugins. [Feedback loop: Utility, Users, Contributors.]
  • 40. Users: BioGPS has critical mass. > 6400 registered users; 14,000 unique visitors per month; 155,000 page views per month. Top 10 organizations: 1. Harvard, 2. NIH, 3. UCSD, 4. Scripps, 5. MIT, 6. Cambridge, 7. U Penn, 8. Stanford, 9. Wash U, 10. UNC. [Chart: daily pageviews. Feedback loop: Utility, Users, Contributors.]
  • 41. Contributors: Explicit and implicit knowledge. 540 plugins registered (>300 publicly shared) by over 120 users spanning 280+ domains. [Feedback loop: Utility, Users, Contributors.]
  • 42. All resources should provide RDF…
  • 44. Defining a data extraction template. [Gene list shown: … TP53, TNF, APOE, IL6, VEGF, … EGFR, TGFB1.]
  • 45. The BioGPS Semantic Annotator. http://54.244.135.254:8080
  • 46. The Long Tail of bioinformaticians can collaboratively build a gene portal.
  • 47. From crowdsourcing to structured data. The Gene Wiki. Biological Games.
  • 49. Twenty million human hours. (Image: http://www.flickr.com/photos/ableman/2171326385/)
  • 50. 150 billion human hours per year. (Image: http://www.flickr.com/photos/rvp-cw/6243289302/)
  • 51. Using games to fold proteins. Fold.it players have successfully: outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010); solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011); designed an improved protein folding algorithm (Khatib, PNAS, 2011); improved the enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011).
  • 52. Using games to fold RNAs. http://eterna.cmu.edu/
  • 53. Using games to align sequences. http://phylo.cs.mcgill.ca
  • 54. Using games to diagnose malaria infection. http://biogames.ee.ucla.edu/
  • 55. Using games to map neurons. http://eyewire.org
  • 56. Using games to annotate genes? http://genegames.org
  • 57. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease.
  • 58. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility.
  • 59. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases. [Five entries are marked with "?".]
  • 60. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction; … 477 diseases!
  • 61. Play Dizeez to annotate gene-disease links. 1. Read the clue (gene). 2. Click the related disease (only one is “right”). 3. If it's “right”, you get points. 4. Then on to the next question… 5. Hurry! 6. Play to win!
  • 62. Dizeez players seem pretty smart… In total (since Dec 2011): 230 unique gamers; 1045 games played; 8525 guesses. Table (# occurrences, gene, disease): 11, NBPF3, neuroblastoma; 11, SOX8, mental retardation; 9, ABL1, leukemia; 9, SSX1, synovial sarcoma; 8, APC, colorectal cancer; 8, FES, sarcoma; 8, RBP3, retinoblastoma; 8, GAST, gastrinoma; 8, DCC, colorectal cancer; 8, MAP3K5, cancer. [Further table columns: Gene Wiki, OMIM, PharmGKB, PubMed.]
  • 63. Using games to predict phenotype from genotype? http://genegames.org
  • 64. Classification problems in genome biology. Data: 100s of samples, 100,000s of features, labeled cancer or normal. Find patterns, then classify new samples as cancer or normal. Methods: SVM, neural networks, Naïve Bayes, KNN, …
  • 65. Random forests. Sample a subset of cases and features, then train a decision tree (cancer vs. normal). Data: 100s of samples, 100,000s of features.
  • 66. Random forests. [Diagram: cancer vs. normal; 100s of samples, 100,000s of features.]
  • 67. Random forests. Classify new samples as cancer or normal. Data: 100s of samples, 100,000s of features. How to interject biological knowledge?
  • 68. Network-guided forests. Dutkowski & Ideker (2011), PLoS Computational Biology.
  • 69. Network-guided forests. Sample features by PPI network, then train a decision tree (cancer vs. normal). Data: 100s of samples, 100,000s of features.
  • 70. Human-guided forests. Sample features by human intelligence, then train a decision tree (cancer vs. normal). Data: 100s of samples, 100,000s of features.
  • 71. 71
  • 72. The Cure: Genomic predictors for disease
  • 73. The Cure: Genomic predictors for disease
  • 74. The Cure: Genomic predictors for disease
  • 75. The Cure: Genomic predictors for disease
  • 76. The Cure: Genomic predictors for disease
  • 77. The Cure: Genomic predictors for disease
  • 80. Results. 214 registered players: 50% declared knowledge of cancer biology; 40% self-identified as having a Ph.D. Prediction results: 70% correct on survival concordance index; best scoring model was 76%; player registrations still increasing!
  • 81. The Long Tail of gamers can collaboratively build an accurate disease classifier.
  • 82. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Katie Fisch, Ben Good, Salvatore Loguercio, Max Nanis, Chunlei Wu. Key group alumni: Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820). Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su.
  • 83. Doctoral Program in Chemical and Biological Sciences CALIFORNIA Office of Graduate Studies 10550 N. Torrey Pines Road La Jolla, CA 92037 Email: gradprgrm@scripps.edu Phone: 858.784.8469 http://education.scripps.edu
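
The plugin interface on slide 33 (a URL template filled in from a gene entity) can be sketched roughly as follows. This is a minimal illustration, not the BioGPS implementation: the function name `render_plugin_url` and the shape of the gene record are assumptions, and only the KEGG template is used because the other two templates are elided ("...") in the transcript.

```python
import re

# Template copied from slide 33. The STRING and PubMed templates on that
# slide are truncated in the transcript, so they are not reproduced here.
KEGG_TEMPLATE = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"

# Matches placeholders of the form {{Name}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_plugin_url(template, gene):
    """Replace each {{Key}} in the template with the matching gene identifier.

    Hypothetical helper: raises KeyError if the gene record lacks an
    identifier type that the template asks for.
    """
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene record has no identifier of type {key!r}")
        return str(gene[key])

    return PLACEHOLDER.sub(substitute, template)


# Assumed gene record for TP53; the field names mirror the slide's placeholders.
tp53 = {"Symbol": "TP53", "EntrezGene": 7157, "EnsemblGene": "ENSG00000141510"}

print(render_plugin_url(KEGG_TEMPLATE, tp53))
```

Keeping the template as plain text with named placeholders is what lets anyone register a new resource without writing code, which is the point the slide makes about community extensibility.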

Editor's notes

  1. We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. No IEA.
  2. If you believe that greater than 1.5% of articles have relevance to gene function, then it says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  3. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  4. Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
  5. Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
  6. Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  7. Tried on 773 GO categories, significant in 356 cases (46%)
  8. We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  9. Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  10. Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  11. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  12. Developer resources do not scale with usage. Practical effects: core developers’ time is always the rate-limiting step; addition of new features and data always feels slow; eventually, new databases are created to fill the gap. 80% duplication for 20% innovation.
  13. MODs and portals
  14. Genetics resources
  15. Literature resources
  16. Protein resources
  17. Pathway and expression databases
  18. Pathway and expression databases
  19. Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  20. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  21. Empire state building
  22. Question: how to interject biological knowledge in the feature selection process?
  23. Kellogg School slide.pptx
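
The enrichment test behind slide 21 (and notes 7 and 8) can be sketched as a one-sided hypergeometric test on the slide's 2×2 table. This is an illustration under stated assumptions, not the authors' code: the transcript does not label the table's rows and columns, so the reading below (columns = gene is annotated with the GO term, rows = gene is linked to the term through PubMed) is inferred from the fact that 13 + 251 equals the slide's 264 genes. The function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(overlap, linked, in_term, total):
    """One-sided enrichment p-value, P(X >= overlap).

    X is the number of GO-term genes among `linked` genes drawn without
    replacement from `total` genes, of which `in_term` carry the GO term.
    """
    max_overlap = min(linked, in_term)
    favourable = sum(
        comb(in_term, k) * comb(total - in_term, linked - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favourable / comb(total, linked)


# Counts from slide 21 (axon guidance, GO:0007411); cell labels are assumed.
both = 13          # linked through PubMed and annotated with the GO term
linked_only = 2    # linked through PubMed, not annotated
term_only = 251    # annotated, not linked through PubMed
neither = 12033    # neither linked nor annotated

linked = both + linked_only
in_term = both + term_only
total = both + linked_only + term_only + neither

p_value = hypergeom_upper_tail(both, linked, in_term, total)
print(f"p = {p_value:.3g}")
```

Under this reading the result lands on the order of 1e-20, in line with the P = 1.55E-20 reported on the slide. The exact integer arithmetic from `math.comb` avoids the underflow that a naive floating-point product could hit at such small probabilities.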