Crowdsourcing to structure biological knowledge

Andrew Su, Ph.D.
Department of Molecular and Experimental Medicine
The Scripps Research Institute

ISI, USC
August 16, 2012
2
Human genetics underlies human health

[Figure: ~3 billion bases and ~23,000 genes, leading to molecular diagnostics & therapeutics]

Molecular understanding of (“Gene annotation”):
• Biological function
• Genetic variation
• Mutation
• Deletion
• Amplification
• …
3
Structured gene annotations enable computation



         Structured annotations
4
Few genes are well annotated

[Figure: counts per gene from PubMed and Gene Ontology (GO), with genes sorted by decreasing counts, for 23,278 protein-coding genes. Labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE. Curve annotations: 59% and 38%.]

Data: NCBI gene2pubmed, August 2010
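The per-gene counts behind a plot like this can be approximated by counting distinct PubMed IDs for each gene. The sketch below assumes the tab-delimited layout of NCBI's gene2pubmed file (tax_id, GeneID, PubMed_ID). The function name `citation_counts` and the sample rows are hypothetical, and the PubMed IDs in the sample are made up for illustration.

```python
from collections import defaultdict

HUMAN_TAX_ID = "9606"  # NCBI Taxonomy ID for Homo sapiens


def citation_counts(lines, tax_id=HUMAN_TAX_ID):
    """Return (gene_id, n_articles) pairs sorted by decreasing article count."""
    articles = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        tax, gene_id, pubmed_id = line.rstrip("\n").split("\t")[:3]
        if tax == tax_id:
            articles[gene_id].add(pubmed_id)  # a set ignores duplicate rows
    counts = [(gene, len(pmids)) for gene, pmids in articles.items()]
    # Sort by decreasing count; break ties by gene ID for a stable order.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


# Toy rows in gene2pubmed layout (illustrative values, not real data).
sample = [
    "#tax_id\tGeneID\tPubMed_ID",
    "9606\t7157\t1001",
    "9606\t7157\t1002",
    "9606\t7157\t1002",
    "9606\t7124\t1003",
    "10090\t22059\t1004",
]
ranked = citation_counts(sample)
print(ranked)
```

With the real file, the same function would be fed an open file handle instead of the `sample` list.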
5
Biocuration is a key annotation bottleneck

[Figure: number of PubMed-indexed articles; x-axis 1979 to 2009, y-axis 0 to 1,000,000]
6




311,696 articles (1.5% of PubMed)
have been cited by GO annotations
7




Sooner or later, the research community will need to be involved in the
annotation effort to scale up to the rate of data generation.
8
The Long Tail is a prolific source of content

[Figure: content produced versus contributors (sorted), showing a Short Head and a Long Tail]

                    Short Head          Long Tail
News:               Newspapers          Blogs
Video:              TV/Hollywood        YouTube
Product reviews:    Consumer reports    Amazon reviews
Food reviews:       Food critics        Yelp
Talent judging:     Olympics            American Idol
9




We can harness the Long Tail of scientists to directly participate in
the gene annotation process.
10
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
11
10,000 gene “stubs” within Wikipedia

[Figure labels: Utility, Users, Contributors]

[Figure: annotated Gene Wiki page showing:]
• Gene summary
• Protein interactions
• Linked references
• Protein structure
• Symbols and identifiers
• Gene Ontology annotations
• Tissue expression pattern
• Links to structured databases

Huss, PLoS Biol, 2008
12
 Gene Wiki has a critical mass of readers
                                         Total: 4.0 million views / month




Huss, PLoS Biol, 2008; Good, NAR, 2011
13
Gene Wiki has a critical mass of editors

[Figure: editor count (Editors) and edit count (Edits)]

• Increase of ~10,000 words / month from >1,000 edits
• Currently 1.42 million words
• Approximately equal to 230 full-length articles

Good, NAR, 2011
14
A review article for every gene is powerful




      Reelin: 68 editors, 543 edits since July 2002
      Heparin: 175 editors, 320 edits since June 2003
      AMPK: 44 editors, 84 edits since March 2004
      RNAi: 232 editors, 708 edits since October 2002
[Figure labels: References to the literature; Hyperlinks to related concepts]
15
Filtering, extracting, and summarizing PubMed



Documents




 Concepts
16
Document- and concept-centric text mining
[Figure: Subject, Predicate, Object]
17
Simple text mining for gene annotations

[Figure: a Gene Wiki page is mapped to its gene (NCBI Entrez Gene: 334); a wikilink on that page is an exact match to a GO term (GO:0006897); together they form a candidate assertion]

6319 novel Gene Ontology annotations
2147 novel Disease Ontology annotations
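The exact-match step on this slide can be sketched roughly as follows. This is a minimal illustration under assumptions, not the pipeline used for the reported numbers: the function name `candidate_assertions`, the regular expression, and the sample wikitext are all hypothetical. Only the gene ID (334) and the GO term GO:0006897 (endocytosis) come from the slide.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#section]] style wikilinks.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_assertions(entrez_gene_id, wikitext, go_terms):
    """Return (gene, GO ID) pairs for wikilinks whose target exactly matches a GO term name."""
    by_name = {name.lower(): go_id for go_id, name in go_terms.items()}
    found = []
    for match in WIKILINK.finditer(wikitext):
        # Normalize the link target the way wiki titles are usually compared.
        target = match.group(1).strip().replace("_", " ").lower()
        go_id = by_name.get(target)
        if go_id and (entrez_gene_id, go_id) not in found:
            found.append((entrez_gene_id, go_id))
    return found


# Tiny GO lookup; a real run would load the full ontology.
go_terms = {"GO:0006897": "endocytosis"}
# Hypothetical wikitext, written only to exercise the matcher.
wikitext = "Example text linking to [[Endocytosis|vesicle uptake]] and [[heparin]]."
print(candidate_assertions(334, wikitext, go_terms))
```

Each pair returned is only a candidate; the slide's term "candidate assertion" suggests a review step before anything is treated as an annotation.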
18
Gene Wiki+ for integrative queries


                      mwsync




                http://genewikiplus.org
19
Dynamic queries across genes, diseases, SNPs
20
21




TOP 100
GENES
22
Gene Wiki+ for integrative queries


                     mwsync

                                OMIM
                              PharmGKB



                   {{#ask:
                   [[Category:Human_proteins]]
                   [[is_associated_with::<q>[[Category:Breast_cancer]]</q>]]
                   [[HasSNP::
                   …
                   <q>[[is_associated_with::
                http://genewikiplus.org
23
Gene Wiki+ for integrative queries


                      mwsync

                                   OMIM
                                 PharmGKB




                http://genewikiplus.org
24
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
25
Not just the biomedical literature…
26
BioGPS aggregates gene-centric information




                  http://biogps.org
27
The plugin interface is simple and universal

PubMed
   http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}

STRING
   http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}

KEGG
   http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}

[Figure: a URL template combined with a gene entity yields a rendered URL]
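The template mechanism above amounts to simple placeholder substitution. The sketch below shows one way it could work, using the KEGG template from this slide. The function name `render_url` and the dictionary layout of the gene entity are assumptions for illustration, not the BioGPS implementation.

```python
import re

# A placeholder looks like {{Symbol}} or {{EntrezGene}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_url(template, gene):
    """Replace each {{Key}} in the template with the gene entity's value for Key."""
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene entity has no identifier named {key!r}")
        return str(gene[key])
    return PLACEHOLDER.sub(substitute, template)


# Hypothetical gene entity for TP53, keyed by the placeholder names on the slide.
gene = {"Symbol": "TP53", "EntrezGene": 7157, "EnsemblGene": "ENSG00000141510"}
kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
print(render_url(kegg_template, gene))
# http://www.genome.jp/dbget-bin/www_bget?hsa:7157
```

Raising an error on an unknown placeholder, rather than leaving it in the URL, is a design choice of this sketch; it makes a misconfigured plugin fail visibly.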
32
The plugin interface is simple and universal




Total of 389 gene-centric online databases registered as BioGPS plugins
33
BioGPS has a critical mass of users

[Figure: daily pageviews]

  • > 4100 registered users
  • 4000 unique visitors per week
  • 40,000 page views per week

Top 10 organizations
  1.  Harvard
  2.  NIH
  3.  UCSD
  4.  Scripps
  5.  MIT
  6.  Cambridge
  7.  U Penn
  8.  Stanford
  9.  Wash U
  10. UNC
34
All resources should provide RDF…
35
Mining structured content from HTML
36
Defining a data extraction template
        TP53   TNF   APOE   IL6   VEGF EGFR TGFB1   …
  …
37
The BioGPS Semantic Annotator




              http://50.112.124.237
38
All resources should provide flat files…
39
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
40



Seven million human hours




                            http://www.flickr.com/photos/archana3k1/4124330493/
41



Twenty million human hours




                             http://www.flickr.com/photos/ableman/2171326385/
42
150 billion human hours per year




                              http://www.flickr.com/photos/rvp-cw/6243289302/
43
Using games to fold proteins



      Fold.it players have successfully:
      • Outperformed state of the art protein
        folding algorithms (Cooper, Nature, 2010)
      • Solved a previously-intractable crystal
        structure (Khatib, Nat Struct Mol Biol, 2011)
      • Designed an improved protein folding
        algorithm (Khatib, PNAS, 2011)
      • Improved enzyme activity of de novo
        designed enzyme (Eiben, Nat Biotechnol, 2011)
44
Using games to fold RNAs




              http://eterna.cmu.edu/
45
Using games to align sequences




              http://phylo.cs.mcgill.ca
46
Using games to annotate gene-disease links

[Screenshot annotations: Click the related disease. If it's ‘right’, you get points, then on to the next question. Hurry!]




                             http://genegames.org
47
Dizeez players seem pretty smart…

  In total:
  • 207 unique gamers
  • 1045 games played
  • 8525 guesses

# Occurrences   Gene    Disease              PubMed   OMIM   PharmGKB   Gene Wiki

7               GAST    gastrinoma
7               RBP3    retinoblastoma
7               SSX1    synovial sarcoma
6               TG      Graves' disease
6               CRYGC   Cataract
6               SOX8    mental retardation
6               WRN     Werner syndrome
6               ABL1    leukemia
6               MLL3    leukemia
6               SNAI2   breast carcinoma
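A table like this one can be produced by tallying how often players chose each gene-disease pair and checking each pair against existing resources. The sketch below is an assumption about how such a tally might look, not the Dizeez code: the function name `tally_guesses`, the single `known_pairs` reference set, and the placeholder gene and disease names are all hypothetical.

```python
from collections import Counter


def tally_guesses(guesses, known_pairs):
    """Count each (gene, disease) guess and flag pairs already in a reference set."""
    counts = Counter(guesses)
    return [
        (gene, disease, n, (gene, disease) in known_pairs)
        for (gene, disease), n in counts.most_common()
    ]


# Placeholder guesses; these names are not real genes or game data.
guesses = (
    [("GENE_A", "disease X")] * 3
    + [("GENE_B", "disease Y")] * 2
    + [("GENE_A", "disease Z")]
)
known_pairs = {("GENE_A", "disease X")}  # stands in for one curated database

result = tally_guesses(guesses, known_pairs)
for gene, disease, n, known in result:
    print(n, gene, disease, "known" if known else "candidate")
```

Pairs that many players agree on but that are absent from the reference set are the interesting ones, since they are potential new annotations.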
48
Dizeez players seem pretty smart…

# Occurrences   Gene    Disease                  PubMed   OMIM   PharmGKB   Gene Wiki

5               MECOM   sarcoma
4               ATF7    cancer
3               ABCB5   acute myeloid leukemia
3               SART1   glioblastoma
3               NCK1    leukemia
3               NEK1    cancer
49
GenESP: Two-player annotation games
50
COMBO: Genomic predictors for disease


[Figure: find patterns that separate cancer from normal samples, then make predictions (cancer or normal) on new samples]
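The train-then-predict idea on this slide can be illustrated with a very small classifier. The sketch below uses a nearest-centroid rule, chosen here only because it is short; it is not the method used in COMBO. The function names and the two-feature toy data are hypothetical.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train(samples, labels):
    """Find patterns: compute one centroid per class label."""
    by_label = {}
    for row, label in zip(samples, labels):
        by_label.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, row):
    """Make a prediction: return the label whose centroid is closest to the sample."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(model, key=lambda label: sq_dist(model[label]))


# Toy expression values for two genes per sample (made up for illustration).
training = [[8.0, 1.0], [9.0, 2.0], [1.0, 7.0], [2.0, 9.0]]
labels = ["cancer", "cancer", "normal", "normal"]

model = train(training, labels)
print(predict(model, [7.5, 1.5]))  # cancer
print(predict(model, [1.5, 8.0]))  # normal
```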
57




We can harness the Long Tail of scientists to directly participate in
the gene annotation process.
58
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizarro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members
Erik Clarke
Ian Macleod
Ben Good
Chunlei Wu
Salvatore Loguercio

Summer internships for students!
Recruiting graduate students in quantitative biology! See http://education.scripps.edu/

Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su

Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)

Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Crowdsourcing to structure biological knowledge (USC/ISI)

  • 1. Crowdsourcing to structure biological knowledge Andrew Su, Ph.D. Department of Molecular and Experimental Medicine The Scripps Research Institute ISI, USC August 16, 2012
  • 2. Human genetics underlies human health: ~3 billion bases, ~23,000 genes. Molecular understanding (“gene annotation”) of: • Biological function • Genetic variation • Mutation • Deletion • Amplification • … Molecular diagnostics & therapeutics
  • 3. Structured gene annotations enable computation. Structured annotations
  • 4. Few genes are well annotated. [Figure: counts of PubMed references and Gene Ontology (GO) annotations for 23,278 protein-coding genes, sorted by decreasing counts; labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE; marked fractions: 59% and 38%] Data: NCBI gene2pubmed, August 2010
  • 5. Biocuration is a key annotation bottleneck. [Figure: number of PubMed-indexed articles, 1979 to 2009; y-axis from 0 to 1,000,000]
  • 6. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
  • 7. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  • 8. The Long Tail is a prolific source of content. [Figure: content produced vs. contributors (sorted), with the Short Head and the Long Tail marked] Short Head vs. Long Tail examples: News: newspapers vs. blogs; Video: TV/Hollywood vs. YouTube; Product reviews: Consumer Reports vs. Amazon reviews; Food reviews: food critics vs. Yelp; Talent judging: Olympics vs. American Idol
  • 9. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  • 10. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  • 11. 10,000 gene “stubs” within Wikipedia. [Figure: utility, users, contributors] Page contents: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases. Huss, PLoS Biol, 2008
  • 12. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. Huss, PLoS Biol, 2008; Good, NAR, 2011
  • 13. Gene Wiki has a critical mass of editors. [Figure: editor count and edit count] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. Good, NAR, 2011
  • 14. A review article for every gene is powerful. Reelin: 68 editors, 543 edits since July 2002; Heparin: 175 editors, 320 edits since June 2003; AMPK: 44 editors, 84 edits since March 2004; RNAi: 232 editors, 708 edits since October 2002. References to the literature; hyperlinks to related concepts
  • 15. Filtering, extracting, and summarizing PubMed Documents Concepts
  • 16. Document- and concept-centric text mining: Predicate, Subject, Object
  • 17. Simple text mining for gene annotations. NCBI Entrez Gene: 334; Gene Wiki mapping; Wikilink; Candidate assertion; GO:0006897; GO exact match. Results: 6319 novel Gene Ontology annotations; 2147 novel Disease Ontology annotations
  • 18. Gene Wiki+ for integrative queries. mwsync. http://genewikiplus.org
  • 19. Dynamic queries across genes, diseases, SNPs
  • 22. Gene Wiki+ for integrative queries. mwsync; OMIM; PharmGKB. {{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: … <q>[[is_associated_with:: http://genewikiplus.org
  • 23. Gene Wiki+ for integrative queries. mwsync; OMIM; PharmGKB. http://genewikiplus.org
  • 24. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  • 25. Not just the biomedical literature…
  • 26. BioGPS aggregates gene-centric information. http://biogps.org
  • 27. The plugin interface is simple and universal. URL template + gene entity = rendered URL. Pubmed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}; STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}; KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}
  • 28. The plugin interface is simple and universal
  • 29. The plugin interface is simple and universal
  • 30. The plugin interface is simple and universal
  • 31. The plugin interface is simple and universal
  • 32. The plugin interface is simple and universal. Total of 389 gene-centric online databases registered as BioGPS plugins
  • 33. BioGPS has a critical mass of users. [Figure: daily pageviews] • > 4100 registered users • 4000 unique visitors per week • 40,000 page views per week. Top 10 organizations: 1. Harvard 2. NIH 3. UCSD 4. Scripps 5. MIT 6. Cambridge 7. U Penn 8. Stanford 9. Wash U 10. UNC
  • 34. All resources should provide RDF…
  • 36. Defining a data extraction template. TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1, …
  • 37. The BioGPS Semantic Annotator. http://50.112.124.237
  • 38. All resources should provide flat files…
  • 39. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  • 40. Seven million human hours. http://www.flickr.com/photos/archana3k1/4124330493/
  • 41. Twenty million human hours. http://www.flickr.com/photos/ableman/2171326385/
  • 42. - 150 billion human hours per year. http://www.flickr.com/photos/rvp-cw/6243289302/
  • 43. Using games to fold proteins. Fold.it players have successfully: • Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010) • Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011) • Designed an improved protein folding algorithm (Khatib, PNAS, 2011) • Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
  • 44. Using games to fold RNAs. http://eterna.cmu.edu/
  • 45. Using games to align sequences. http://phylo.cs.mcgill.ca
  • 46. Using games to annotate gene-disease links. Click the related disease. If it's ‘right’, you get points. Hurry! Then on to the next question. http://genegames.org
  • 47. Dizeez players seem pretty smart… In total: • 207 unique gamers • 1045 games played • 8525 guesses. Table columns: # Occurrences, Gene, Disease, Pubmed, OMIM, PharmGKB, Gene Wiki. Rows (# occurrences, gene, disease): 7, GAST, gastrinoma; 7, RBP3, retinoblastoma; 7, SSX1, synovial sarcoma; 6, TG, Graves' disease; 6, CRYGC, cataract; 6, SOX8, mental retardation; 6, WRN, Werner syndrome; 6, ABL1, leukemia; 6, MLL3, leukemia; 6, SNAI2, breast carcinoma
  • 48. Dizeez players seem pretty smart… In total: • 207 unique gamers • 1045 games played • 8525 guesses. Table columns: # Occurrences, Gene, Disease, Pubmed, OMIM, PharmGKB, Gene Wiki. Rows (# occurrences, gene, disease): 5, MECOM, sarcoma; 4, ATF7, cancer; 3, ABCB5, acute myeloid leukemia; 3, SART1, glioblastoma; 3, NCK1, leukemia; 3, NEK1, cancer
  • 50. COMBO: Genomic predictors for disease. Find patterns (cancer vs. normal); make predictions on new samples (cancer or normal)
  • 57. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  • 58. Group members: Erik Clarke; Ian Macleod; Ben Good; Chunlei Wu; Salvatore Loguercio. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Summer internships for students! Recruiting graduate students in quantitative biology! See http://education.scripps.edu/ Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)
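
The text-mining idea on slide 17 (take a Gene Wiki page mapped to an Entrez Gene ID, collect its wikilinks, and keep those whose target exactly matches a Gene Ontology term name) can be sketched roughly as below. This is only an illustration under stated assumptions, not the pipeline used in the talk: the function name, the one-entry `GO_NAMES` lookup, and the sample wikitext are all made up for the example, and a real run would load the full ontology and real page text.

```python
import re

# Hypothetical, one-entry lookup from GO term name to GO ID.
# A real run would load every term name from the ontology.
GO_NAMES = {"endocytosis": "GO:0006897"}

# Matches [[Target]], [[Target|label]] and [[Target#section|label]];
# group 1 captures only the link target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_annotations(entrez_gene_id, wikitext, go_names=GO_NAMES):
    """Return (gene ID, GO ID, link target) for each wikilink whose target
    exactly matches a GO term name (case-insensitive), without repeats."""
    hits = []
    seen = set()
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        go_id = go_names.get(target.lower())
        if go_id is not None and go_id not in seen:
            seen.add(go_id)
            hits.append((entrez_gene_id, go_id, target))
    return hits


# Invented sample text, used only to exercise the function.
sample = "The protein takes part in [[endocytosis]] and binds [[heparin|heparin chains]]."
print(candidate_annotations(334, sample))
```

Each returned tuple is only a candidate assertion; whether it is novel or correct would still need to be checked against existing annotations.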

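The plugin interface on slide 27 pairs a URL template with a gene entity to produce a rendered URL. A minimal sketch of that substitution follows, assuming placeholders are written as `{{Name}}` and that a gene is represented as a plain dictionary; the function name and the dictionary layout are assumptions for illustration and are not taken from the BioGPS code base. Only the KEGG template is used because it appears in full on the slide.

```python
import re
from urllib.parse import quote

# Placeholders look like {{Symbol}} or {{EntrezGene}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_plugin_url(template, gene):
    """Fill every {{Key}} in the template with the URL-quoted value gene[Key]."""

    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene entity has no identifier named {key!r}")
        return quote(str(gene[key]), safe="")

    return PLACEHOLDER.sub(substitute, template)


# KEGG template as shown on the slide.
KEGG_TEMPLATE = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"

# Illustrative gene entity (TP53, Entrez Gene ID 7157).
tp53 = {"Symbol": "TP53", "EntrezGene": 7157}

print(render_plugin_url(KEGG_TEMPLATE, tp53))
# prints http://www.genome.jp/dbget-bin/www_bget?hsa:7157
```

Raising an error on a missing identifier, rather than leaving the placeholder in the URL, is a design choice made here so that a plugin never renders a broken link silently.
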
Editor's Notes

  1. We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% of genes have 5 or fewer references; 38% have one or no references.
  2. Much knowledge is locked up in the biomedical literature; our goal is to make it computable. If you believe that more than 1.5% of articles have relevance to gene function, then there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  3. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  4. Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
  5. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  6. MODs and portals
  7. Genetics resources
  8. Literature resources
  9. Protein resources
  10. Pathway and expression databases
  11. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  12. Empire State Building
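
The coverage figures in note 1 (the share of genes with 5 or fewer references, and with one or none) are fractions over a per-gene reference count. A rough sketch of that arithmetic is below, assuming the input is a list of (gene, PubMed ID) pairs in the spirit of NCBI gene2pubmed plus a separate list of all genes, so that genes with no references are still counted. The function names and the toy data are hypothetical; the numbers are not from the talk.

```python
def reference_counts(gene_pubmed_pairs, all_genes):
    """Count distinct PubMed IDs per gene, keeping genes with zero references."""
    counts = {gene: 0 for gene in all_genes}
    for gene, _pmid in set(gene_pubmed_pairs):  # set() drops repeated pairs
        if gene in counts:
            counts[gene] += 1
    return counts


def fraction_with_at_most(counts, k):
    """Fraction of genes whose reference count is k or fewer."""
    return sum(1 for n in counts.values() if n <= k) / len(counts)


# Toy data: gene A has 6 references, B has 1, C has 2, D has none.
pairs = [("A", pmid) for pmid in range(1, 7)] + [("B", 10), ("C", 11), ("C", 12)]
counts = reference_counts(pairs, all_genes=["A", "B", "C", "D"])

print(fraction_with_at_most(counts, 5))  # 0.75
print(fraction_with_at_most(counts, 1))  # 0.5
```

Passing the full gene list separately matters because a file of gene-to-article links on its own cannot contain genes that have no articles.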