Model Organism Linked Data

  1. Model Organism Linked Data (NIH Commons/MOD Interoperability supplement to SGD). Michel Dumontier, Associate Professor of Medicine, Stanford University
  2. Team: Michel Dumontier (Biomedical Informatics Research, Stanford); Maxime Déraspe (U. Laval & Biomedical Informatics Research, Stanford); Jacques Corbeil (U. Laval); Mike Cherry (Department of Genetics, Stanford); Kalpana Karra (Department of Genetics, Stanford); Gail Binkley (Department of Genetics, Stanford); Gos Micklem (Cambridge Systems Biology Centre, U. of Cambridge); Julie Sullivan (Cambridge Systems Biology Centre, U. of Cambridge)
  3. InterMine is a platform for Model Organism Data: 25+ endpoints available for different MODs (YeastMine, WormMine, FlyMine, ZebrafishMine, MouseMine, ThaleMine, HumanMine). Access via (db query) API. Core data object model (commonly used) + Mine-specific customizations -> heterogeneity in tables, fields, and terminologies poses challenges for interoperability and pan-database queries.
  4. Towards increased interoperability with Semantic Web technologies: Linked Data and Semantic Web technologies (RDF, SPARQL) are increasingly adopted in the bioinformatics data provider community (DBCLS, EBI, NCBI, NLM, and many others). MODs, like many Omics databases, often rely on other people’s content. Linked Data can offer dereferenceable links to authoritative sources. Opportunity to improve MOD data interoperability through mapping of their ontologies and vocabularies.
  5. Model Organism Linked Data (MO-LD): Effort to expose InterMine data as FAIR (Findable, Accessible, Interoperable, Reusable). Specific Aims: 1. To improve interoperability of MOD data by publishing Linked Data. 2. To enable and demonstrate federated queries between MOD data and the network of Linked Data. 3. To package our software and data for easier local and cloud-based deployment.
  6. Model Organism Linked Database (MO-LD): Includes 6 MODs (YeastMine, FlyMine, ZebrafishMine, RatMine, MouseMine, HumanMine). Linked with 38 Bio2RDF datasets (RefSeq, PantherDB, GO, NCBI gene, HGNC, ENSEMBL, OMIM, …). InterMine-RDFizer script to reproduce with any InterMine instance. Web application to visualize, explore and query the Linked Datasets.
  7. RDFization of InterMine: Query the InterMine API with the Object Model; convert the tabular results into triples (RDF); merge the resources with the same primary keys; link data with external datasets; load the RDF data into a triple store.
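The convert-and-merge steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the InterMine-RDFizer script itself: the `BASE` namespace, the `vocab/` predicate scheme, and all function names are hypothetical, and the input is assumed to be query results already fetched as a list of dictionaries.

```python
from collections import defaultdict

# Hypothetical namespace; MO-LD's actual URI scheme is not shown in the slides.
BASE = "http://example.org/mine/"


def rows_to_triples(rows, key_field, class_name):
    """Turn tabular query results (a list of dicts) into (s, p, o) triples."""
    triples = []
    for row in rows:
        subject = f"{BASE}{class_name}:{row[key_field]}"
        for field, value in row.items():
            if field == key_field or value is None:
                continue  # the key becomes the subject; empty cells yield no triple
            triples.append((subject, f"{BASE}vocab/{field}", str(value)))
    return triples


def merge_by_subject(triples):
    """Group triples by subject so rows sharing a primary key form one resource."""
    merged = defaultdict(set)  # a set drops duplicate (predicate, object) pairs
    for s, p, o in triples:
        merged[s].add((p, o))
    return merged


def to_ntriples(merged):
    """Serialize the merged resources as N-Triples with plain string literals."""
    lines = []
    for s in sorted(merged):
        for p, o in sorted(merged[s]):
            escaped = o.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)
```

The resulting N-Triples text is the kind of file that could then be bulk-loaded into a triple store; linking to external datasets is a separate step.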
  8. InterMine-LD
  9. Linking MODs with LOD - incomplete linking. External linked datasets (38) with the 6 MODs. Cross References Table* from InterMine:

     InterMine primary key   Identifier    DataSource
     00001                   Q6GZX4        Uniprot
     00002                   ASIC1         HGNC
     00003                   GO:0004396    GO
     00004                   AL732629.6    RefSeq

     * Also done with Ontology tables
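A minimal sketch of how rows of the cross-references table might be turned into outgoing links. The mapping from DataSource names to Bio2RDF namespace prefixes, the `BASE` namespace for local resources, and the choice of `rdfs:seeAlso` as the linking predicate are assumptions made for illustration; the slides do not say which namespaces or predicates MO-LD actually uses. Rows whose DataSource has no known namespace are returned separately, which is one way the linking ends up incomplete.

```python
# Assumed mapping from InterMine DataSource names to Bio2RDF namespace prefixes.
NAMESPACES = {
    "Uniprot": "uniprot",
    "HGNC": "hgnc.symbol",
    "GO": "go",
    "RefSeq": "refseq",
}

BASE = "http://example.org/mine/"  # hypothetical namespace for local resources
SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def bio2rdf_uri(identifier, data_source):
    """Build a Bio2RDF-style URI (namespace:identifier), or None if unmapped."""
    prefix = NAMESPACES.get(data_source)
    if prefix is None:
        return None
    # Identifiers such as "GO:0004396" already carry the namespace; strip it.
    if identifier.lower().startswith(prefix + ":"):
        identifier = identifier[len(prefix) + 1:]
    return f"http://bio2rdf.org/{prefix}:{identifier}"


def link_cross_references(rows):
    """rows: iterable of (primary_key, identifier, data_source) tuples.

    Returns (links, unlinked): link triples, plus the rows that had no mapping.
    """
    links, unlinked = [], []
    for primary_key, identifier, data_source in rows:
        target = bio2rdf_uri(identifier, data_source)
        if target is None:
            unlinked.append((primary_key, identifier, data_source))
        else:
            links.append((f"{BASE}resource:{primary_key}", SEE_ALSO, target))
    return links, unlinked
```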
  10. Linked Data Platform: SPARQL Query Editor; Faceted Browser (Virtuoso); RelFinder for Relation Visualization; Application Programming Interface ( - OpenAPIs specification)
  11. SPARQL Support for Programmers: Get all reactions from KEGG that are associated with genes that are extrinsic components of the cell membrane
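The query shown on this slide is not preserved in the transcript, so the following is only a sketch of how a programmer might send a query of that shape to a SPARQL endpoint using the Python standard library. The endpoint URL, the `ex:` vocabulary, and every predicate name are hypothetical placeholders rather than MO-LD's actual schema, and the GO term URI for the cellular component is left as a parameter for the reader to supply.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL


def build_query(go_term_uri, limit=100):
    """Reactions linked to genes annotated with the given GO cellular component.

    All ex: predicates are placeholders, not the real MO-LD vocabulary.
    """
    return f"""
PREFIX ex: <http://example.org/vocab/>
SELECT DISTINCT ?gene ?reaction WHERE {{
  ?gene ex:cellularComponent <{go_term_uri}> .
  ?gene ex:xref ?keggGene .
  ?reaction ex:involvesGene ?keggGene .
}} LIMIT {int(limit)}
""".strip()


def build_request(query, endpoint=ENDPOINT):
    """Prepare a SPARQL Protocol request (URL-encoded POST, JSON results)."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )


def run_query(query, endpoint=ENDPOINT):
    """Send the query and flatten each result binding into a plain dict."""
    with urllib.request.urlopen(build_request(query, endpoint)) as response:
        results = json.load(response)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]
```

Calling `run_query(build_query(...))` needs a live endpoint; `build_query` and `build_request` can be inspected offline.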
  12. Federated Query
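In SPARQL 1.1, federation is expressed with the `SERVICE` keyword, which sends part of the graph pattern to a remote endpoint and joins the results on shared variables. A small illustrative helper, assuming hypothetical endpoint URLs and patterns supplied by the caller (the query from the slide itself is not preserved here):

```python
def federate(local_pattern, remote_endpoint, remote_pattern, variables):
    """Compose a SELECT query that joins a local pattern with a remote one.

    The remote pattern is wrapped in a SPARQL 1.1 SERVICE block, so the local
    endpoint forwards it to remote_endpoint and joins on shared variables.
    """
    select = " ".join(f"?{name}" for name in variables)
    return (
        f"SELECT {select} WHERE {{\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        f"  }}\n"
        f"}}"
    )
```

For instance, a local pattern binding `?gene` to an external identifier and a remote pattern describing that identifier would share the identifier variable, which is what joins the two sides.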
  13. RelFinder - Find connections between 2 or more entities
  14. Infrastructure Deployment and Reusability: Docker (container engine) to build and deploy the MO-LD infrastructure. Microservices architecture for reusability and extensibility: Web application, API and Virtuoso images. Cloud-ready: tested on Amazon EC2. Tutorial: only 5 commands to deploy a Linked-MOD!
  15. Reflections: Not all data in MODs are available in the InterMine instance. Not all references are in the cross-references table, which limits Linked Data generation. Team interactions led to a change in the export process. The RDFizer focuses only on two table pairs of the core object model offered as a template by InterMine (CrossReference + DataSource and Ontology + OntologyTerm). Support for mine-specific tables would also improve coverage of contents and links.
  16. Future Directions: Can we improve the quality of the representation by using community vocabularies (FALDO, CiTo, SIO)? Can we offer high performance query services (Triple Pattern Fragments/HDT)? How can we persist data in other archives (wikidata / Are curation priorities in line with what users want? Can pan-species analyses tell us something about success in drug discovery?