1. Integration of GO, Pathway Data and Interaction Data. Chris Mungall, Peter D’Eustachio
2. The GO was originally intended to integrate databases. How are we doing? “Interoperability of genomic databases is limited by this lack of progress, and it is this major obstacle that the Gene Ontology (GO) Consortium was formed to address.” (Gene Ontology: tool for the unification of biology. Nat Genet 2000.) [Diagram: SGD, FB, GOA]
3. The GO was originally intended to integrate databases. How are we doing? Not as well as we could! [Diagram: GO; SGD, FB, GOA; Pathway Commons, IMEX; Reactome, Cyc, …; BioGRID, IntAct, …]
4. Integration enhances analyses and reduces workload. Division of labor: leave specialized curation to specialized systems biology databases, but the data needs to be re-combined to prevent siloing. GO is an invaluable one-stop shop for term enrichment etc. Can we quantify how integrating with systems biology databases helps users? Yes! We can do the experiment: GO term enrichment analysis on all of MSigDB, once with Reactome annotations (also including Reactome inputs/outputs, not currently in GOA) and once without Reactome annotations.
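The experiment on this slide can be illustrated with a minimal sketch of a term enrichment test. It assumes a one-sided hypergeometric test, which is a common choice for GO enrichment but is not stated in the slides, and every gene identifier and set below is invented toy data rather than real GOA, Reactome or MSigDB content.

```python
from math import comb


def enrichment_pvalue(population, annotated, study):
    """One-sided hypergeometric p-value for over-representation.

    population: all genes considered (background)
    annotated:  genes annotated to the GO term being tested
    study:      the gene set being analysed (e.g. one signature)
    """
    N = len(population)
    K = len(annotated & population)          # annotated genes in background
    n = len(study & population)              # size of the study set
    k = len(annotated & study & population)  # annotated genes in study set
    # P(X >= k) when drawing n genes without replacement from N
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Toy data (hypothetical): 20 background genes, a 5-gene study set.
population = {f"g{i}" for i in range(20)}
study = {"g0", "g1", "g2", "g3", "g4"}

# Annotations to one GO term from a single resource...
goa_only = {"g0", "g1", "g10", "g11"}
# ...and extra annotations to the same term contributed by a pathway database.
reactome_extra = {"g2", "g3"}
combined = goa_only | reactome_extra

p_goa = enrichment_pvalue(population, goa_only, study)
p_combined = enrichment_pvalue(population, combined, study)
print(f"GOA only: {p_goa:.4f}   GOA+R: {p_combined:.4f}")
```

In this toy case the combined annotation set gives the smaller p-value because the added annotations fall inside the study set. Real data will not always behave this way, which is why the comparison has to be run empirically.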
5. Integration enhances analyses. GOA+R: many p-values improved significantly. Recapitulated biologically valid results that would have been suppressed had only a single resource been used. Example: genes down-regulated in Alzheimer's.
6. How are we currently integrating systems biology datasets? Interaction data: currently IntAct, soon IMEX; “protein binding” and “self-protein binding” only (+ with). Pathway data: currently Reactome only; loses much of what is in Reactome, e.g. inputs and outputs. Manually curated GO<->Reactome links are incomplete, not always to the most specific term, labor-intensive, and become stale over time. Other pathway databases? This can be improved!
7. Automating integration using cross-product definitions – pathway databases
[Term]
id: GO:0015871
name: choline transport
intersection_of: GO:0006810 ! transport
intersection_of: results_in_transport_of CHEBI:15354 ! choline
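The following is a minimal sketch of how a cross-product definition like the one on this slide could drive an automatic mapping. The slides do not describe the actual integration tool, so the event dictionaries, the `parse_stanza` and `infer_term` helpers, and the exact-match strategy are all hypothetical. A real implementation would presumably use an ontology reasoner so that subclasses of the genus and filler also match.

```python
from typing import Iterable, NamedTuple, Optional

STANZA = """
[Term]
id: GO:0015871
name: choline transport
intersection_of: GO:0006810 ! transport
intersection_of: results_in_transport_of CHEBI:15354 ! choline
"""


class LogicalDef(NamedTuple):
    term_id: str   # the defined GO term
    genus: str     # parent process, e.g. transport
    relation: str  # differentiating relation
    filler: str    # target of the relation, e.g. a CHEBI class


def parse_stanza(text: str) -> LogicalDef:
    """Parse one simple genus-differentia stanza (one genus, one relation)."""
    term_id = genus = relation = filler = None
    for raw in text.strip().splitlines():
        line = raw.split("!", 1)[0].strip()  # drop the trailing "! label"
        if line.startswith("id:"):
            term_id = line[len("id:"):].strip()
        elif line.startswith("intersection_of:"):
            parts = line[len("intersection_of:"):].split()
            if len(parts) == 1:
                genus = parts[0]
            elif len(parts) == 2:
                relation, filler = parts
    if None in (term_id, genus, relation, filler):
        raise ValueError("stanza is not a simple genus-differentia definition")
    return LogicalDef(term_id, genus, relation, filler)


def infer_term(event: dict, defs: Iterable[LogicalDef]) -> Optional[str]:
    """Return the GO term whose definition the event satisfies exactly."""
    for d in defs:
        if event.get("process") == d.genus and event.get(d.relation) == d.filler:
            return d.term_id
    return None


choline_def = parse_stanza(STANZA)

# Hypothetical, already-normalised pathway events: a transport reaction
# whose transported entity has been mapped to a CHEBI identifier.
choline_event = {"process": "GO:0006810", "results_in_transport_of": "CHEBI:15354"}
other_event = {"process": "GO:0006810", "results_in_transport_of": "CHEBI:17234"}

print(infer_term(choline_event, [choline_def]))  # matches the definition
print(infer_term(other_event, [choline_def]))    # no definition matches
```

The design point is that the mapping logic contains no term-specific rules. Adding another cross-product stanza is enough to let another class of pathway events be annotated.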
8. Automating integration using cross-products – pathway databases. We can also automatically map: catalysis terms [165*], transport [373], binding [133], phosphorylation and other modifications, metabolism [278], signaling, … All this relies on different cross-product files. Any pathway database that exports BioPAX-OWL can be used, e.g. HumanCyc, MouseCyc, Pathway Commons, … *Numbers for Reactome-human.
10. Automated Integration: Results. Reactome: evaluation in progress. Many manually assigned equivalencies recapitulated. Inferred equivalencies differed in some cases: sometimes better than manually assigned, sometimes required info not in the BioPAX export; ongoing discussions. BioGRID: not evaluated (all trivial); inferred annotations improve some enrichment results, e.g. Brentani angiogenesis gene sets show increased enrichment for VEGFR binding. Obvious, but useful as proof of concept.
11. Conclusions and future work. We can be more efficient: coordinate with systems biology databases to divide labor; prevent siloing through semi-automated integration; GO acts as a high-level ‘window’ on systems biology databases. Still to be done: make the integration tool production-ready; reconcile existing misalignments, particularly signaling, which is highly inconsistent between GO and Reactome; explore open questions, e.g. auto-generate terms?; finish the cross-products, which are vital, in particular PRO and CHEBI.
Editor's Notes
The Gene Ontology was created in response to the need for interoperability among genomic databases in the wake of the sequencing of the first metazoan genomes. In the paper “Gene Ontology: tool for the unification of biology”, published nearly ten years ago, Ashburner et al. state: “Progress in the way that biologists describe and conceptualize the shared biological elements has not kept pace with sequencing . . . Interoperability of genomic databases is limited by this lack of progress, and it is this major obstacle that the Gene Ontology (GO) Consortium was formed to address” [25]. The GO has since become the de facto terminological standard for functional annotation, and its success is evident in the popularity of GO-based class enrichment analyses. However, the intervening ten years have witnessed an explosion of interest in systems biology, with a concomitant increase in the number of databases providing information on interactions and pathways, including Reactome, Nature Signaling, PANTHER [26], BIND, BioGRID and HumanCyc (the EcoCyc metabolic pathway database preceded GO [27]). These databases each have their own data models and schemas, creating an interoperability problem. This has been partly mitigated by the adoption of BioPAX as a standard exchange format, which allows multiple pathway databases to be aggregated in single “one-stop shopping” warehouses, such as the Pathway Knowledge Base [28], Pathway Commons, and WikiPathways. However, the data is still only partially integrated, and a researcher who wishes to obtain a comprehensive view of a pathway must still examine multiple records, in addition to GO annotations.