The Gene Wiki:
Synthesizing knowledge about human
       genes with Wikipedia
             Benjamin Good
                   Feb. 26, 2013




          http://www.slideshare.net/goodb
2
“Knowledge about human genes”
3
“Knowledge about human genes”




 1) There is a lot


 2) It is scattered
4
Biological knowledge is growing, rapidly


 • More than 22 million articles indexed in PubMed

                    • Growing at about 1 million/year and rising
5
Scattered genomic knowledge is a problem

   • Scientists faced with new and unfamiliar genes on a daily basis
       Hits (GNF Robotics): IFITM3, TFE3, BEX1, ST8SIA1, TFEB, BEX2, SKP1A, ....




   • Public faced with unfamiliar genes on a daily basis
6
Knowledge synthesis




      “the pulling together of ideas or
      information to develop a common
      framework for understanding”
7
Knowledge synthesis in biology, aka biocuration


   • The production of structured data



    Unstructured                    Structured
                      Gene          Property       Value
                      Fibronectin   Biological     Angiogenesis
                                    Process
                      Fibronectin   Cellular       Extracellular
                                    Localization   matrix
                      Fibronectin   Related        Glomerulopathy
                                    Disease
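As a minimal sketch of what "structured data" means here, the three Fibronectin rows above can be held as (gene, property, value) records and filtered by field. The `Annotation` type and the `find` helper are illustrative names chosen for this example; they are not part of any Gene Wiki or Gene Ontology tooling.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    """One structured statement: gene, property, value (illustrative type)."""
    gene: str
    prop: str
    value: str


# The three rows from the slide's "Structured" table.
ANNOTATIONS = [
    Annotation("Fibronectin", "Biological Process", "Angiogenesis"),
    Annotation("Fibronectin", "Cellular Localization", "Extracellular matrix"),
    Annotation("Fibronectin", "Related Disease", "Glomerulopathy"),
]


def find(annotations, gene: Optional[str] = None, prop: Optional[str] = None):
    """Return the annotations matching every field that was given."""
    return [
        a for a in annotations
        if (gene is None or a.gene == gene) and (prop is None or a.prop == prop)
    ]


if __name__ == "__main__":
    for a in find(ANNOTATIONS, prop="Related Disease"):
        print(a.gene, "->", a.value)
```

Once the statements are records rather than sentences, a question such as "which diseases are related to this gene?" becomes a filter instead of a reading task.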
8
Gene Ontology



    “Tool for the unification of biology”[1]
  A shared, controlled vocabulary for describing gene function

  Molecular Function, Biological Process, Cellular Component

  > 10,550 Citations in Google Scholar




                            [1] Nature Genetics. 2000 May;25(1):25-9.
9
Gene Ontology Annotation Database (“GOA”)


• Records gene function using gene ontology terms

• Expert synthesis of the knowledge from thousands of articles

    Gene          Property                Value
    Fibronectin   Biological Process      Angiogenesis
    Fibronectin   Cellular Localization   Extracellular matrix
    Fibronectin   Related Disease         Glomerulopathy
10
33k articles become 31 gene annotations




             Gene Ontology Curators


            31 function annotations for
            human gene
11




Great!
12




BUT
13




GO annotation is not complete
14
Many genes are not thoroughly annotated
     [Chart: GO annotation counts (y-axis) for genes sorted by decreasing counts (x-axis);
      series: "Biological Process only" and "+ Electronic annotation (IEA)"]

                                                                        Data: NCBI, February 2013
15




1 million articles per year....
16




    Sooner or later, the
 research community will
need to be involved in the
annotation effort to scale
   up to the rate of data
        generation.
17
The Long Tail is a prolific source of content


                        Short
                        Head
            Content
           produced


                                         Long Tail



                                Contributors (sorted)




   News reporting:      Newspapers                   Blogs
            Video:     TV/Hollywood                YouTube
  Product reviews:    Consumer reports           Amazon reviews
    Food reviews:       Food critics                 Yelp
  Gene annotation:      bio-curators             ????????????
18
Wikipedia successfully harnesses the long tail


   • Within top 10 most visited websites

   • 14 million+ registered users

   [Chart: Articles, Words (millions), and Words/article, Wikipedia vs. Britannica Online]
                             http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
19
Wikipedia is reasonably accurate
20
The Gene Wiki Hypothesis



    “We can harness the
   Long Tail of scientists
   to directly participate in
     the gene annotation
           process.”
                   -Andrew Su
21
Goal of the Gene Wiki project


   • Enable the creation of a collaboratively
     written, continuously updated, high
     quality review article for every human
     gene.
Filtering, extracting, and summarizing PubMed
23
Success depends on a positive feedback loop

                  Value of service




                          1   100
                      2             200




    Number of                             Number of
   contributors                             users
Gene “stubs” seed community contributions
                                                            24




   • Gene summary
   • Protein interactions
   • Linked references
   • Protein structure
   • Symbols and identifiers
   • Gene Ontology annotations
   • Tissue expression pattern
   • Links to structured databases
25
A review article for every gene is powerful




        68 editors, 543 edits (as of July 2010)




                                          References to the literature
         Hyperlinks to related concepts
26
The Gene Wiki project – 2010 stats

     Value of service:         10,300 articles
                               1.2 million words
                               67MB text (about 1,000 PLoS Biology research articles)

     Number of contributors:   3,500 editors
                               17,000 edits

     Number of users:          55 million page views
Monthly growth of words in Gene Wiki articles, page views per month and edits per month
                          between 1 September 2009 and 1 September 2011.




                                       Good B M et al. Nucl. Acids Res. 2012;40:D1255-D1261


© The Author(s) 2011. Published by Oxford University Press.
28




Why is it working?
29
Google loves Wikipedia


 • 1.86 million
   results from
   Google

 • courses

 • products

 • databases

 • ...
30
The Gene Wiki hitches a ride on Wikipedia




     CC photo by ff137 on flickr
31
Take home messages

 • Success depends on a positive feedback loop
   (value, contributors, users)

 • Where possible, try to hitch a ride
32
But still, many genes lack structured annotation…
     [Chart: GO annotation counts (y-axis) for genes sorted by decreasing counts (x-axis);
      series: "Biological Process only" and "+ Electronic annotation (IEA)"]

                                                                        Data: NCBI, February 2013
33
Can we generate structured annotations from
the text of the gene wiki?



                               Gene          Property       Value
                           ?   Fibronectin   Biological     Angiogenesis
                                             Process
                               Fibronectin   Cellular       Extracellular
                                             Localization   matrix
                               Fibronectin   Related        Glomerulopathy
                                             Disease



    Great for people to read           Great for building software for people to use
Filtering, extracting, and summarizing PubMed



Documents




 Concepts
35
Document- and concept-centric text mining
                          Predicate

                Subject               Object
36
 Simple text mining for gene annotations

                                                   NCBI Entrez Gene: 334



                                       Gene Wiki
                                       mapping


                            Wikilink                  Candidate
                                                      assertion

                                                   GO:0006897



                                       GO exact
                                        match




Good, BMC Genomics, 2011.
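The diagram above (article mapped to an Entrez Gene ID, wikilink matched exactly to a GO term, giving a candidate assertion) can be sketched roughly as follows. This is an illustration under assumptions, not the code used in the cited paper: the regular expression, the function names, and the tiny label dictionary are all made up for the example, and a real run would load labels from the Gene Ontology itself.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# only the link target is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_wikilink_targets(wikitext):
    """Return the page titles that the wikitext links to, in order."""
    return [m.group(1).strip() for m in WIKILINK.finditer(wikitext)]


def candidate_assertions(entrez_gene_id, wikitext, go_labels):
    """Pair a gene with every GO term whose label exactly matches a wikilink.

    go_labels maps a GO term label to its GO ID. Matching is
    case-insensitive; each GO ID is reported at most once.
    """
    index = {label.lower(): go_id for label, go_id in go_labels.items()}
    seen = set()
    assertions = []
    for target in extract_wikilink_targets(wikitext):
        go_id = index.get(target.lower())
        if go_id is not None and go_id not in seen:
            seen.add(go_id)
            assertions.append((entrez_gene_id, go_id))
    return assertions


if __name__ == "__main__":
    # Illustrative label dictionary with a single entry.
    labels = {"endocytosis": "GO:0006897"}
    text = (
        "The protein is involved in [[endocytosis]] and binds "
        "[[Heparin|heparin]]. See [[Endocytosis#Pathways|pathways]]."
    )
    print(candidate_assertions(334, text, labels))
```

Exact matching on link targets keeps the method simple; each result is still only a candidate assertion, since a link in an article does not by itself show that the gene has that function.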
Finding concepts
• NCBO Annotator Web Service
    – Gene Ontology
    – Human Disease Ontology


• Annotator service selected for:
    – Speed, easy API, precision


Clement Jonquet, Nigam H Shah, Mark A Musen, (2009) The Open Biomedical
Annotator. AMIA Summit on Translational Bioinformatics. 56-60
http://bioportal.bioontology.org/annotator
Mining workflow
             Gene Wiki Articles
                 (10,271)



                 Filtering,
                 cleanup




                  Extract
                 concepts
                  (NCBO)




11,022 matched                 2,983 matched
 gene ontology                disease ontology
     terms                         terms
Results

Compared to current dbs:

                                  DO     GO
    exact match                   23%    12%
    match more specific term       2%     2%
    match more general term        5%    28%
    no match                      70%    58%

Manual evaluation on random sample: [chart; axis labels not recoverable]
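The four comparison categories on this slide (exact match, match more specific term, match more general term, no match) can be computed from an ontology's parent links. The sketch below is one possible reading, stated as an assumption: "match more specific term" is taken to mean that the existing database holds a descendant of the mined term, and "match more general term" that it holds an ancestor. The toy hierarchy and function names are invented for illustration.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def classify(candidate, existing, parents):
    """Compare one mined term with a gene's existing annotations."""
    existing = set(existing)
    if candidate in existing:
        return "exact match"
    # An existing term that descends from the candidate is more specific.
    if any(candidate in ancestors(term, parents) for term in existing):
        return "match more specific term"
    # An existing term that is an ancestor of the candidate is more general.
    if ancestors(candidate, parents) & existing:
        return "match more general term"
    return "no match"


if __name__ == "__main__":
    # Toy hierarchy (child -> parents); not taken from a real ontology file.
    toy = {
        "transport": ["process"],
        "endocytosis": ["transport"],
        "adhesion": ["process"],
    }
    print(classify("transport", {"endocytosis"}, toy))
```

When a gene's existing annotations include both a descendant and an ancestor of the mined term, this sketch reports the more specific case first; that ordering is an arbitrary choice.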
GO problems
[Chart of error categories for GO candidate annotations; axis labels not recoverable]
False match (e.g., “Olfactory receptors .. are responsible for the
transduction of odorant signals.” The system incorrectly identifies
‘transduction’ (GO:0009293), defined as the transfer of genetic
information to a bacterium from a bacteriophage or between bacterial
or yeast cells mediated by a phage vector.)
No support in sentence (e.g., “The protein is composed ... including
10 sialic acid residues, which are attached to the protein during
posttranslational modification in the Golgi apparatus.” Such
sentences may lead to incorrect annotations of ‘Golgi apparatus’ and
‘Posttranslational modification’.)
Applications

• Enrichment analysis
   • even with false positives, text-mined annotations can
     improve statistical analyses that are tolerant to noise.

• GeneWiki+
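To make the enrichment-analysis bullet concrete, here is a minimal sketch of one common form of the test, a one-sided hypergeometric tail probability. The slides do not say which statistic was actually used, so treat this as an assumed stand-in; the function and parameter names are illustrative.

```python
from math import comb  # available in Python 3.8+


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) for a hypergeometric draw.

    population: number of genes in the background set
    annotated:  background genes carrying the term of interest
    sample:     number of genes in the study list
    hits:       study-list genes carrying the term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie in [0, population]")
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(max(hits, 0), upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Background of 10 genes, 3 annotated; study list of 3 genes, all annotated.
    print(enrichment_p_value(10, 3, 3, 3))
```

Adding text-mined annotations changes `annotated` and `hits` for a term. Because the statistic depends on counts over many genes, a moderate number of false positives may shift the p-value without dominating it, which is the noise tolerance the bullet refers to.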
Gene Wiki+ for integrative queries

                                        mwsync




Good, J Biomed Semantics, 2012.
                                  http://genewikiplus.org   42
Dynamic queries across genes, diseases, SNPs




                                                           43
Good, J Biomed Semantics, 2012.
Gene Wiki+ for integrative queries
                                    mwsync

                                                 OMIM
                                               PharmGKB


                                  {{#ask:
                                    [[Category:Human_proteins]]
                                    [[is_associated_with::<q>[[Category:Breast_cancer]]</q>]]
                                    [[HasSNP::<q>[[is_associated_with::<q>[[Category:Breast_cancer]]</q>]]</q>]]
                                    …

http://genewikiplus.org                                              44
Good, J Biomed Semantics, 2012.
Gene Wiki+ for integrative queries
                                        mwsync

                                                     OMIM
                                                   PharmGKB




Good, J Biomed Semantics, 2012.
                                  http://genewikiplus.org     45
Text mining take home
• Depends a lot on the ontology
  • (same text, same algorithm,
    completely different results)
• Approach depends on corpus
  • concept-centric text has advantages
• Approach depends on purpose
  • high false positive rates are common
    but may be acceptable – e.g.
    enrichment analysis                    46
Can we skip text mining?




http://fiehnlab.ucdavis.edu/projects/Rice_metabolome/
Wikidata
Provide a database of the
 world’s knowledge that
    anyone can edit
            - Denny Vrandečić




                                48
Q414043
                        Wikidata
                    Reelin

Property        Label            Value                       Value ID
Property:P31    is a             Protein                     Q8054
Property:P31    is a             Glycoprotein                Q187126
Property:P128   regulates        Neural development          Q1345738
Property:P129   interacts with   VLDL receptor               Q1979313
Property:P129   interacts with   Amyloid precursor protein   Q423510
                                                               49
                http://www.wikidata.org/wiki/Q414043
Q414043
                                 Wikidata


                                                                                        Q8054
Property:P31
                                                                                      Q187126


Property:P128                                                                         Q1345738

                                                                                      Q1979313
Property:P129
                                                                                      Q423510
                                                                                         50
        http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
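The `wbgetentities` call in the URL above returns JSON in which each property ID maps to a list of statements. The sketch below builds such a URL and pulls the item-valued statements out of a response. The `https://www.wikidata.org` host, the added `format=json` parameter, and the helper names are assumptions for this example, and `SAMPLE` is a hand-built stand-in that mirrors the slide rather than a captured API response.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"  # assumed host; the slide uses http://wikidata.org


def entity_url(entity_id, language="en"):
    """Build a wbgetentities request URL for one entity."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "languages": language,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def item_claims(response, entity_id):
    """Map each property ID to the item IDs it points to."""
    claims = response["entities"][entity_id].get("claims", {})
    result = {}
    for prop, statements in claims.items():
        ids = []
        for statement in statements:
            snak = statement.get("mainsnak", {})
            value = snak.get("datavalue", {}).get("value")
            if (snak.get("snaktype") == "value"
                    and isinstance(value, dict) and "numeric-id" in value):
                ids.append("Q%d" % value["numeric-id"])
        if ids:
            result[prop] = ids
    return result


def _stmt(prop, numeric_id):
    """Hand-built statement in the shape item_claims expects."""
    return {"mainsnak": {"snaktype": "value", "property": prop,
                         "datavalue": {"type": "wikibase-entityid",
                                       "value": {"entity-type": "item",
                                                 "numeric-id": numeric_id}}}}


# Stand-in response mirroring the slide; not fetched from Wikidata.
SAMPLE = {"entities": {"Q414043": {"claims": {
    "P31": [_stmt("P31", 8054), _stmt("P31", 187126)],
    "P128": [_stmt("P128", 1345738)],
    "P129": [_stmt("P129", 1979313), _stmt("P129", 423510)],
}}}}

if __name__ == "__main__":
    print(entity_url("Q414043"))
    print(item_claims(SAMPLE, "Q414043"))
```

Because the statements are already structured, retrieving them needs no text mining step, which is the point of the "Can we skip text mining?" slide.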
Wikidata




http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force   51
Wikidata




http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force   52
53




 “We can harness the
Long Tail of scientists
to directly participate in
  the gene annotation
        process.”
                -Andrew Su
54
Gene Wiki acknowledgements..
                                                                                                          http://wordle.com
 Many Wikipedia editors
 WP:MCB Project




“A gene wiki for community annotation of gene function”   “The Gene Wiki: community intelligence applied to human
PLoS Biology 2008                                         gene annotation” Nucleic Acids Research 2009




                                                           “Mining the Gene Wiki for Functional Genomic Knowledge”
                                                           BMC Genomics 2011

                                                           “The Gene Wiki in 2011: community intelligence applied to human
                                                           gene annotation”
                                                           Nucleic Acids Research 2012

                                                           “Linking genes to diseases with a SNPedia-Gene Wiki mashup”
                                                           Journal of Biomedical Semantics 2012

                                                           “Building a biomedical semantic network in Wikipedia with Semantic
                                                           Wiki Links”
                                                           Database: The Journal of Biological Databases and Curation 2012
My sister Erin has a PhD in linguistics, lives in Raleigh
    and is looking for work in research or teaching..
                      Help her out!




                                    bgood@scripps.edu
                                    @bgood
                                    i9606.blogspot.com
Funding and Support                 slideshare/goodb
    NIH / NIGMS                                       55
 (Gene Wiki: GM089820)
56
 Gene Wiki content improves enrichment analysis



                                More
      p-value                significant
  (PubMed + GW)             PubMed only

                                                             Muscle
                                                           contraction



                                                More
                                             significant
                                            PubMed + GW




                              p-value (PubMed only)
Good, BMC Genomics, 2011.

Slide deck: Gene Wiki at Phenotype RCN annual meeting

Gene Wiki and Wikimedia Foundation SPARQL workshopGene Wiki and Wikimedia Foundation SPARQL workshop
Gene Wiki and Wikimedia Foundation SPARQL workshop
 
Opportunities and challenges presented by Wikidata in the context of biocuration
Opportunities and challenges presented by Wikidata in the context of biocurationOpportunities and challenges presented by Wikidata in the context of biocuration
Opportunities and challenges presented by Wikidata in the context of biocuration
 
Scripps bioinformatics seminar_day_2
Scripps bioinformatics seminar_day_2Scripps bioinformatics seminar_day_2
Scripps bioinformatics seminar_day_2
 
Wikidata workshop for ISB Biocuration 2016
Wikidata workshop for ISB Biocuration 2016Wikidata workshop for ISB Biocuration 2016
Wikidata workshop for ISB Biocuration 2016
 
2016 bd2k bgood_wikidata
2016 bd2k bgood_wikidata2016 bd2k bgood_wikidata
2016 bd2k bgood_wikidata
 
(Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery
(Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery (Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery
(Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery
 
Gene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2KGene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2K
 
Citizen sciencepanel2015 pdf
Citizen sciencepanel2015 pdfCitizen sciencepanel2015 pdf
Citizen sciencepanel2015 pdf
 
Building a massive biomedical knowledge graph with citizen science
Building a massive biomedical knowledge graph with citizen scienceBuilding a massive biomedical knowledge graph with citizen science
Building a massive biomedical knowledge graph with citizen science
 
Branch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiersBranch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiers
 
Serious games for bioinformatics education. ISMB 2014 education workshop
Serious games for bioinformatics education.  ISMB 2014 education workshopSerious games for bioinformatics education.  ISMB 2014 education workshop
Serious games for bioinformatics education. ISMB 2014 education workshop
 
The Cure: Making a game of gene selection for breast cancer survival prediction
The Cure: Making a game of gene selection for breast cancer survival predictionThe Cure: Making a game of gene selection for breast cancer survival prediction
The Cure: Making a game of gene selection for breast cancer survival prediction
 
Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...
Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...
Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...
 

Gene Wiki at Phenotype RCN annual meeting

  • 1. The Gene Wiki: Synthesizing knowledge about human genes with Wikipedia Benjamin Good Feb. 26, 2013 http://www.slideshare.net/goodb
  • 3. “Knowledge about human genes” 1) There is a lot 2) It is scattered
  • 4. Biological knowledge is growing, rapidly • More than 22 million articles indexed in PubMed • Growing at about 1 million/year and rising
  • 5. Scattered genomic knowledge is a problem • Scientists faced with new and unfamiliar genes on a daily basis (GNF Robotics hits: IFITM3, TFE3, BEX1, ST8SIA1, TFEB, BEX2, SKP1A, ....) • Public faced with unfamiliar genes on a daily basis
  • 6. Knowledge synthesis: “the pulling together of ideas or information to develop a common framework for understanding”
  • 7. Knowledge synthesis in biology, aka biocuration • The production of structured data. Unstructured → Structured (Gene | Property | Value): Fibronectin | Biological Process | Angiogenesis; Fibronectin | Cellular Localization | Extracellular matrix; Fibronectin | Related Disease | Glomerulopathy
  • 8. Gene Ontology: “Tool for the unification of biology”[1] A shared, controlled vocabulary for describing gene function. Molecular Function, Biological Process, Cellular Component. > 10,550 citations in Google Scholar. [1] Nature Genetics. 2000 May;25(1):25-9.
  • 9. Gene Ontology Annotation Database (“GOA”) • Records gene function using gene ontology terms • Expert synthesis of the knowledge from thousands of articles. (Gene | Property | Value): Fibronectin | Biological Process | Angiogenesis; Fibronectin | Cellular Localization | Extracellular matrix; Fibronectin | Related Disease | Glomerulopathy
  • 10. 33k articles become 31 gene annotations. Gene Ontology Curators. 31 function annotations for human gene
  • 13. GO annotation is not complete
  • 14. Many genes are not thoroughly annotated. Figure: GO Annotation Counts; + Electronic annotation (IEA); Biological Process only; x-axis: genes, sorted by decreasing counts. Data: NCBI, February 2013
  • 15. 1 million articles per year....
  • 16. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  • 17. The Long Tail is a prolific source of content. Figure: content produced vs. contributors (sorted), with Short Head and Long Tail. Examples (short head, long tail): News reporting: newspapers, blogs. Video: TV/Hollywood, YouTube. Product reviews: Consumer Reports, Amazon reviews. Food reviews: food critics, Yelp. Gene annotation: bio-curators, ????????????
  • 18. Wikipedia successfully harnesses the long tail • Within top 10 most visited websites • 14 million+ registered users. Figure: articles, words (millions), and words/article for Wikipedia vs. Britannica Online. http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
  • 20. The Gene Wiki Hypothesis: “We can harness the Long Tail of scientists to directly participate in the gene annotation process.” -Andrew Su
  • 21. Goal of the Gene Wiki project • Enable the creation of a collaboratively written, continuously updated, high quality review article for every human gene.
  • 22. Filtering, extracting, and summarizing PubMed
  • 23. Success depends on a positive feedback loop. Diagram labels: Value of service; Number of contributors; Number of users; 1, 100, 2, 200
  • 24. Gene “stubs” seed community contributions: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases
  • 25. A review article for every gene is powerful: 68 editors, 543 edits (as of July 2010); references to the literature; hyperlinks to related concepts
  • 26. The Gene Wiki project – 2010 stats. Value of service: 10,300 articles; 1.2 million words; 67MB text (about 1,000 PLoS Biology research articles). Number of users: 55 million page views. Number of contributors: 3,500 editors; 17,000 edits
  • 27. Monthly growth of words in Gene Wiki articles, page views per month and edits per month between 1 September 2009 and 1 September 2011. Good B M et al. Nucl. Acids Res. 2012;40:D1255-D1261 © The Author(s) 2011. Published by Oxford University Press.
  • 28. Why is it working?
  • 29. Google loves Wikipedia • 1.86 million results from Google • courses • products • databases • ...
  • 30. The Gene Wiki hitches a ride on Wikipedia. CC photo by ff137 on flickr
  • 31. Take home messages • Success depends on a positive feedback loop (value, contributors, users) • Where possible, try to hitch a ride
  • 32. But still, many genes lack structured annotation… Figure: GO Annotation Counts; + Electronic annotation (IEA); Biological Process only; x-axis: genes, sorted by decreasing counts. Data: NCBI, February 2013
  • 33. Can we generate structured annotations from the text of the gene wiki? Text: great for people to read. Structured annotations: great for building software for people to use. (Gene | Property | Value): Fibronectin | Biological Process | Angiogenesis; Fibronectin | Cellular Localization | Extracellular matrix; Fibronectin | Related Disease | Glomerulopathy
  • 34. Filtering, extracting, and summarizing PubMed Documents Concepts
  • 35. Document- and concept-centric text mining. Diagram labels: Subject, Predicate, Object
  • 36. Simple text mining for gene annotations. Diagram labels: NCBI Entrez Gene: 334; Gene Wiki mapping; Wikilink; Candidate assertion; GO:0006897; GO exact match. Good, BMC Genomics, 2011.
  • 37. Finding concepts • NCBO Annotator Web Service – Gene Ontology – Human Disease Ontology • Annotator service selected for: – Speed, easy API, precision Clement Jonquet, Nigam H Shah, Mark A Musen, (2009) The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics. 56-60 http://bioportal.bioontology.org/annotator
  • 38. Mining workflow: Gene Wiki articles (10,271) → filtering, cleanup → extract concepts (NCBO) → 11,022 matched Gene Ontology terms; 2,983 matched Disease Ontology terms
  • 39. Results: compared to current dbs, and manual evaluation on random sample. DO: exact match 23%; match more specific term 2%; match more general term 5%; no match 70%. GO: exact match 12%; match more specific term 2%; match more general term and no match: 58% and 28% (which value belongs to which label is not recoverable from the extracted chart).
  • 40. GO problems • False match (e.g., “Olfactory receptors .. are responsible for the transduction of odorant signals.” The system incorrectly identifies ‘transduction’ (GO:0009293), defined as the transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector.) • No support in sentence (e.g., “The protein is composed ... including 10 sialic acid residues, which are attached to the protein during posttranslational modification in the Golgi apparatus.” Such sentences may lead to incorrect annotations of ‘Golgi apparatus’ and ‘Posttranslational modification’.)
  • 41. Applications • Enrichment analysis • even with false positives, text-mined annotations can improve statistical analyses that are tolerant to noise. • GeneWiki+
  • 42. Gene Wiki+ for integrative queries. mwsync. Good, J Biomed Semantics, 2012. http://genewikiplus.org
  • 43. Dynamic queries across genes, diseases, SNPs. Good, J Biomed Semantics, 2012.
  • 44. Gene Wiki+ for integrative queries. mwsync, OMIM, PharmGKB. {{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: <q>[[is_associated_with:: … <q>[[Category:Breast_cancer]]</q>]]</q>]] Good, J Biomed Semantics, 2012. http://genewikiplus.org
  • 45. Gene Wiki+ for integrative queries. mwsync, OMIM, PharmGKB. Good, J Biomed Semantics, 2012. http://genewikiplus.org
  • 46. Text mining take home • Depends a lot on the ontology • (same text, same algorithm, completely different results) • Approach depends on corpus • concept-centric text has advantages • Approach depends on purpose • high false positive rates are common but may be acceptable – e.g. enrichment analysis
  • 47. Can we skip text mining? http://fiehnlab.ucdavis.edu/projects/Rice_metabolome/
  • 48. Wikidata: Provide a database of the world’s knowledge that anyone can edit - Denny Vrandečić
  • 49. Wikidata: Reelin (Q414043). Property:P31 “is a”: Protein (Q8054), Glycoprotein (Q187126). Property:P128 “regulates”: Neural development (Q1345738). Property:P129 “interacts with”: VLDL receptor (Q1979313), Amyloid precursor protein (Q423510). http://www.wikidata.org/wiki/Q414043
  • 50. Wikidata: Q414043. Property:P31: Q8054, Q187126. Property:P128: Q1345738. Property:P129: Q1979313, Q423510. http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
  • 53. “We can harness the Long Tail of scientists to directly participate in the gene annotation process.” -Andrew Su
  • 54. Gene Wiki acknowledgements.. http://wordle.com Many Wikipedia editors; WP:MCB Project. “A gene wiki for community annotation of gene function” PLoS Biology 2008. “The Gene Wiki: community intelligence applied to human gene annotation” Nucleic Acids Research 2009. “Mining the Gene Wiki for Functional Genomic Knowledge” BMC Genomics 2011. “The Gene Wiki in 2011: community intelligence applied to human gene annotation” Nucleic Acids Research 2012. “Linking genes to diseases with a SNPedia-Gene Wiki mashup” Journal of Biomedical Semantics 2012. “Building a biomedical semantic network in Wikipedia with Semantic Wiki Links” Database: The Journal of Biological Databases and Curation 2012
  • 55. My sister Erin has a PhD in linguistics, lives in Raleigh and is looking for work in research or teaching.. Help her out! bgood@scripps.edu, @bgood, i9606.blogspot.com, slideshare/goodb. Funding and Support: NIH / NIGMS (Gene Wiki: GM089820)
  • 56. Gene Wiki content improves enrichment analysis. Figure: p-value (PubMed + GW) vs. p-value (PubMed only); regions labeled “More significant PubMed only” and “More significant PubMed + GW”; highlighted term: Muscle contraction. Good, BMC Genomics, 2011.
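
The Wikidata lookup on slides 49 and 50 can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the endpoint and the `action`, `ids`, and `languages` parameters come from the slide's URL, while `format=json`, the helper names, and the hand-written `SAMPLE_RESPONSE` (including which items sit under P31) are assumptions made for the example. The response shape should be checked against the live API before relying on it.

```python
from urllib.parse import urlencode

# Endpoint as written on slide 50.
API = "http://wikidata.org/w/api.php"


def entity_url(entity_id, language="en"):
    """Build a wbgetentities request URL for one item (format=json is an added assumption)."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "languages": language,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def claim_targets(response, entity_id, property_id):
    """Return the item IDs that an entity's statements point to for one property."""
    claims = response["entities"][entity_id].get("claims", {})
    targets = []
    for statement in claims.get(property_id, []):
        value = statement["mainsnak"].get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "id" in value:
            targets.append(value["id"])
    return targets


# Hypothetical, hand-written response fragment for Reelin (Q414043),
# shaped like the JSON this sketch expects; not fetched from Wikidata.
SAMPLE_RESPONSE = {
    "entities": {
        "Q414043": {
            "claims": {
                "P31": [
                    {"mainsnak": {"datavalue": {"value": {"id": "Q8054"}}}},
                    {"mainsnak": {"datavalue": {"value": {"id": "Q187126"}}}},
                ]
            }
        }
    }
}

if __name__ == "__main__":
    print(entity_url("Q414043"))
    print(claim_targets(SAMPLE_RESPONSE, "Q414043", "P31"))
```

Keeping the URL builder separate from the parser means the parsing logic can be exercised on a saved response without any network access.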

Editor's Notes

  1. 645,647 articles that have been explicitly linked to human genes within the NCBI Gene database (gene2pubmed). Search through PubMed and Google will unearth many, many more that are clearly relevant but have not been linked yet.
  2. More is produced every day.
  3. The definition that best met my usage here was ... Oddly, it didn’t come from WordNet or even the Wiktionary; it came from the glossary of a document describing a preschool curriculum. Not sure why I chose that one, but he might have had something to do with it..
  4. Manual curation.
  5. We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well studied genes; huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
  6. Much knowledge is locked up in biomedical literature – our goal is to make it computable. If you believe that greater than 1.5% of articles have relevance to gene function, then it says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  7. In 2008, a group of genome database curators got together and wrote an article about the state of art of biocuration. In it, they expressed deep concern about the amount of data that they were already processing and the knowledge that there would only be more of it coming. One of the things they said in this article was that ‘sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation’. The Gene Wiki and related efforts are an attempt to meet that need.
  8. Now at more than 3.5 million articles
  9. Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  10. Feb. 14, 2011: 10,290 articles; > 75 megabytes text content; > 1.3 million words; 35,997 PubMed citations (about 1 for every two sentences). In past year: 34,839 edits by 3,599 editors; increase of 2.2 megabytes; 55 million page views
  11. Just looking at the citations in PubMed actually understates the situation dramatically.
  12. Much easier to start from a large community with a very high page rank than it is to start from scratch…
  13. We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well studied genes; huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
  14. Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  15. Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  16. Category 1: Yes, this would lead to a new annotation. 1A: perfect match – the candidate annotation is exactly as it would be from a curator (e.g., Titin Scleroderma). 1B: not specific enough – the candidate annotation is correct but a more specific term should be used instead (e.g., Titin Autoimmune disease). 1C: too specific – the candidate annotation is close to correct, but is too specific given the evidence at hand (e.g., Titin Pulmonary Systemic Sclerosis). Category 2: Maybe, but insufficient evidence. 2A: evaluator could not find enough supporting evidence in the literature after about 10 minutes of looking (e.g., DUSP7 cellular proliferation; there is literature indicating that DUSP7 is a phosphatase that dephosphorylates MAPK, and hence may play a role in regulating cell proliferation stimulated through MAPK. Although no direct evidence supporting this contention for Human DUSP7 was found, it seems plausible.) 2B: there is disagreement in the literature about the truth of this annotation. Category 3: No, this candidate annotation is incorrect. 3A: incorrect concept recognition (e.g., “Olfactory receptors share a 7-transmembrane domain structure with many neurotransmitter and hormone receptors and are responsible for the recognition and G protein-mediated transduction of odorant signals.” [24] The system incorrectly identifies ‘transduction’ (GO:0009293), which is defined as the transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector - a completely different concept from signal transduction as intended in the sentence.) 3B: incorrect sentence context - the sentence is a negation or otherwise does not support the predicted annotation for the given gene (e.g., "The protein is composed of ~300 amino acid residues and has ~30 carbohydrate residues attached including 10 sialic acid residues, which are attached to the protein during posttranslational modification in the Golgi apparatus." [25] Such sentences may lead to incorrect candidate annotations of 'Golgi apparatus' and 'Posttranslational modification’.) 3C: this sentence seems factually false (e.g., a hypothetical example: “Insulin injections have been shown to cure Parkinson’s disease and lead to the growth of additional toes”.)
  17. GO terms are more common (we found more than twice as many occurrences), are more prone to polysemy, and are more likely to show up in contexts that don’t indicate a direct annotation.
  18. Combines open editing of a wiki, with the robust community of editors at Wikipedia, with the structured data model of a database
  19. We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores