Model Organism Linked Data

(NIH Commons/MOD Interoperability supplement to SGD)
Michel Dumontier
Associate Professor of Medicine
Stanford University

Team
Michel Dumontier (Biomedical Informatics Research, Stanford)
Maxime Déraspe (U. Laval & Biomedical Informatics Research, Stanford)
Jacques Corbeil (U. Laval)
Mike Cherry (Department of Genetics, Stanford)
Kalpana Karra (Department of Genetics, Stanford)
Gail Binkley (Department of Genetics, Stanford)
Gos Micklem (Cambridge Systems Biology Centre, U. of Cambridge)
Julie Sullivan (Cambridge Systems Biology Centre, U. of Cambridge)

InterMine is a platform for Model Organism Data
25+ endpoints available for different MODs:
YeastMine, WormMine, FlyMine, ZebrafishMine, MouseMine, ThaleMine, HumanMine
Access via (db query) API
Core data object model (commonly used) + Mine-specific customizations
-> heterogeneity in tables, fields, and terminologies used
poses challenges for interoperability and pan-database queries

Towards increased interoperability with Semantic Web technologies
Linked Data and Semantic Web technologies (RDF, SPARQL) are increasingly adopted
in the bioinformatics data provider community:
DBCLS, EBI, NCBI, NLM, and many others
MODs, like many Omics databases, often rely on other people’s content
Linked Data can offer dereferenceable links to authoritative sources
Opportunity to improve MOD data interoperability through mapping of their ontologies
and vocabularies

Model Organism Linked Data (MO-LD)
Effort to expose InterMine data as FAIR:
Findable, Accessible, Interoperable, Reusable
Specific Aims:
1. To improve interoperability of MOD data by publishing Linked Data
2. To enable and demonstrate federated queries between MOD data and the
network of Linked Data
3. To package our software and data for easier local and cloud-based
deployment

Model Organism Linked Database (MO-LD)
Includes 6 MODs:
YeastMine, FlyMine, ZebrafishMine, RatMine, MouseMine, HumanMine
Linked with 38 Bio2RDF datasets:
RefSeq, PantherDB, GO, NCBI gene, HGNC, ENSEMBL, OMIM, …
InterMine-RDFizer script to reproduce with any InterMine instance
Web application to visualize, explore and query the Linked Datasets

RDFization of InterMine
Query InterMine API with Object Model
Convert the tabular results into triples (RDF)
Merge the resources with the same primary keys
Link Data with external datasets
Load the RDF data into a triple store
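
The conversion and merge steps above can be sketched roughly as follows. This is a minimal illustration and not the InterMine-RDFizer script itself: the base URI `http://example.org/mine/`, the field names, and the sample rows are all hypothetical, and the output is plain N-Triples text rather than whatever serialization the project actually uses.

```python
from collections import defaultdict

BASE = "http://example.org/mine/"  # hypothetical base URI, not the one MO-LD uses


def rows_to_resources(rows, key_field):
    """Merge tabular rows that share a primary key into one resource each."""
    resources = defaultdict(lambda: defaultdict(set))
    for row in rows:
        key = row[key_field]
        for field, value in row.items():
            # Skip the key column itself and empty cells.
            if field != key_field and value not in (None, ""):
                resources[key][field].add(value)
    return resources


def escape_literal(value):
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (str(value).replace("\\", "\\\\").replace('"', '\\"')
            .replace("\n", "\\n").replace("\r", "\\r"))


def to_ntriples(resources, class_name):
    """Serialize merged resources as N-Triples lines in a stable order."""
    lines = []
    for key in sorted(resources):
        subject = f"<{BASE}{class_name}/{key}>"
        for field in sorted(resources[key]):
            predicate = f"<{BASE}vocabulary/{field}>"
            for value in sorted(resources[key][field]):
                lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


# Made-up rows standing in for a tabular InterMine query result.
rows = [
    {"id": "1001", "symbol": "ABC1", "organism": "S. cerevisiae"},
    {"id": "1001", "symbol": "ABC1", "synonym": "YFG1"},
    {"id": "1002", "symbol": "XYZ2", "organism": "S. cerevisiae"},
]
triples = to_ntriples(rows_to_resources(rows, "id"), "Gene")
print("\n".join(triples))
```

The sketch assumes primary keys and field names are already safe to place in a URI; real data would need percent-encoding. The resulting lines could then be bulk-loaded into a triple store.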

Linking MODs with LOD
External linked datasets (38) with the 6 MODs
- incomplete linking

Cross References Table* from InterMine

InterMine primary key   Identifier    DataSource
00001                   Q6GZX4        Uniprot
00002                   ASIC1         HGNC
00003                   GO:0004396    GO
00004                   AL732629.6    RefSeq

* Also done with Ontology tables
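
One way the cross-reference rows above could be turned into links is sketched below. The `http://bio2rdf.org/namespace:identifier` URI pattern follows the Bio2RDF convention, but the DataSource-to-namespace mapping, the `SomeLocalDB` row, and the function name are assumptions made for illustration. Data sources missing from the mapping stay unlinked, which is one reason linking can be incomplete.

```python
BIO2RDF = "http://bio2rdf.org/"

# DataSource name -> Bio2RDF namespace. This mapping is an assumption made
# for illustration; the RDFizer's real mapping may differ.
NAMESPACES = {
    "Uniprot": "uniprot",
    "HGNC": "hgnc.symbol",
    "GO": "go",
    "RefSeq": "refseq",
}

# Rows copied from the cross-references table, plus one hypothetical row
# (00005) whose data source has no known namespace.
CROSS_REFERENCES = [
    ("00001", "Q6GZX4", "Uniprot"),
    ("00002", "ASIC1", "HGNC"),
    ("00003", "GO:0004396", "GO"),
    ("00004", "AL732629.6", "RefSeq"),
    ("00005", "XYZ-1", "SomeLocalDB"),
]


def link_cross_reference(identifier, data_source):
    """Return a Bio2RDF-style URI, or None when the source is unmapped."""
    namespace = NAMESPACES.get(data_source)
    if namespace is None:
        return None
    # Drop a redundant leading prefix such as "GO:" so it is not repeated.
    prefix = data_source.upper() + ":"
    if identifier.upper().startswith(prefix):
        identifier = identifier[len(prefix):]
    return f"{BIO2RDF}{namespace}:{identifier}"


links = {pk: link_cross_reference(ident, src) for pk, ident, src in CROSS_REFERENCES}
unlinked = [pk for pk, uri in links.items() if uri is None]

for pk, uri in links.items():
    print(pk, uri or "(no link)")
```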

Linked Data Platform
SPARQL Query Editor
Faceted Browser (Virtuoso)
RelFinder for Relation Visualization
Application Programming Interface
(Swagger.io - OpenAPI Specification)
MO-LD.org

SPARQL Support for Programmers
Example query: get all reactions from KEGG that are associated with genes that are extrinsic components of the cell membrane
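
A programmer might prepare such a query along the following lines. This is a hedged sketch only: the endpoint URLs, the `ex:` vocabulary, the predicate names, and the GO term URI are placeholders rather than the actual MO-LD schema, and the `SERVICE` clause simply illustrates how SPARQL 1.1 federation could reach a separate KEGG endpoint. The request is built with the standard SPARQL protocol (`query` parameter, JSON results) but is not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoints: substitute the real SPARQL endpoint URLs.
ENDPOINT = "http://example.org/sparql"
KEGG_ENDPOINT = "http://example.org/kegg/sparql"

# The ex: predicates below are hypothetical, not the MO-LD vocabulary.
QUERY_TEMPLATE = """\
PREFIX ex: <http://example.org/vocabulary/>
SELECT DISTINCT ?gene ?reaction
WHERE {{
  ?gene ex:cellularComponent <{go_term}> .
  SERVICE <{kegg_endpoint}> {{
    ?reaction ex:gene ?gene .
  }}
}}
LIMIT {limit}
"""


def build_query(go_term, kegg_endpoint=KEGG_ENDPOINT, limit=100):
    """Fill the template after a light sanity check on the URIs."""
    for uri in (go_term, kegg_endpoint):
        if any(ch in uri for ch in '<> "'):
            raise ValueError(f"not a plain URI: {uri!r}")
    return QUERY_TEMPLATE.format(
        go_term=go_term, kegg_endpoint=kegg_endpoint, limit=int(limit)
    )


def build_request(query, endpoint=ENDPOINT):
    """Prepare (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Placeholder URI standing in for the GO "extrinsic component" term.
query = build_query("http://example.org/go/EXTRINSIC_COMPONENT_TERM", limit=10)
request = build_request(query)
print(query)
```

Passing `request` to `urllib.request.urlopen` would execute the query against a live endpoint and return JSON bindings for `?gene` and `?reaction`.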

RelFinder - Find connections between 2 or more entities

Infrastructure Deployment and Reusability
Docker (container engine) to build and deploy the MOLD infrastructure
https://hub.docker.com/u/mold
Microservices architecture for reusability and extensibility:
Web application, API and Virtuoso images
Cloud-ready: tested on Amazon EC2
Tutorial: https://github.com/mo-ld/mold-dock
Only 5 commands to deploy a Linked-MOD!

Reflections
Not all data in MODs are available in the InterMine instance
Not all references are in the cross-references table, which limits Linked Data generation
Team interactions led to changes in the export process
RDFizer focuses only on two table pairs of the core object model offered as a template by
InterMine (CrossReference + DataSource and Ontology + OntologyTerm).
Support for mine-specific tables would also improve coverage of contents and links

Future Directions
Can we improve the quality of the representation by using community
vocabularies (FALDO, CiTo, SIO)?
Can we offer high-performance query services (Triple Pattern Fragments/HDT)?
How can we persist data in other archives (Wikidata / schema.org+cse)?
Are curation priorities in line with what users want?
Can pan-species analyses tell us something about success in drug discovery?
