The biomedical research community is providing large-scale data sources to enable knowledge discovery from the data alone, or from novel scientific experiments in combination with the existing knowledge.
Semantic Web technologies, including ontologies, triple stores and combinations thereof, are increasingly being developed and used.
Both the amount and the complexity of the data are constantly increasing.
Since the data sources are publicly available, the amount of content can be measured, giving an overview of the accessible content and also of the state of the data representation in comparison to the existing content.
For a better understanding of the existing data resources, i.e. judgments on the distribution of data triples across concepts, data types and primary providers, we have performed a comprehensive analysis which delivers an overview of the accessible content for semantic Web solutions.
The analysis shows that information related to genes, proteins and chemical entities forms the center, whereas the content related to diseases and pathways forms a smaller portion.
Further data relates to dietary content and specific questions such as cancer prevention and toxicological effects of drugs.
Quantifying Biomedical Semantic Resources for Drug Discovery Platforms
1. Quantifying the content of Biomedical Semantic
Resources as a core for Drug Discovery
Platforms
Ali Hasnain and Dietrich Rebholz-Schuhmann
May 2017
2. Agenda
• Introduction
• Motivation
• Ontologies
– Biomedical Ontologies
– Drugs and Chemical Compound Ontologies
– Upper level Ontologies
• Data Repositories/Databases for Drug Discovery
– Gene, Gene Expression and Protein Databases
– Pathway databases
– Chemical and Structure Databases
– Disease Specific Databases for Prevention
– Literature databases
• Life Sciences Linked Open Data Cloud
– Linked Open Drug Data (LODD)
– Bio2RDF
– LinkedLifeData
• Related Work
• Conclusion
3. Introduction
• Biomedical data exists as ontologies, repositories, and
other open data resources, e.g., Life Science Linked
Open Data (LS-LOD), relevant in the context of Drug
Discovery and Cancer Chemoprevention.
• The analysis gives an overview of which resources
have to be considered, what amount of data requires
integration and provides the opportunity to tailor
semantic solutions to specific needs in terms of size
and performance.
5. Linked Data for Cancer Chemoprevention
• Because Biomedical Data is heterogeneous and spread
across multiple sources
[Figure: hypothesis generation with Linked Data. Sources: literature, in silico models, browsing databases. Labels: ~2000 small molecules, ~100 molecules, ~10 interesting pathways, ~5 molecules testable in the lab.]
6. Heterogeneous Data – Multiple Data sources
DrugBank
DailyMed
ChEBI, KEGG
Reactome
SIDER
BioPAX
Medicare
8. Ontologies
These ontologies can fall into three main categories:
1. The Biomedical ontologies are mainly used by biomedical
applications and define the basic biological structures
(e.g. genes, pathways etc).
2. The Drugs and Chemical Compound Ontologies are
related to the clinical drugs and their active ingredients.
3. The upper level ontologies describe general concepts that
many biomedical ontologies share.
9. Ontology spectrum by Jimeno-Yepes et al. [1]
[1]: Antonio Jimeno-Yepes, Ernesto Jiménez-Ruiz, Rafael Berlanga, and Dietrich Rebholz-Schuhmann. Use of shared lexical resources for efficient
ontological engineering. In Semantic Web Applications and Tools for Life Sciences Workshop (SWAT4LS). CEUR WS Proceedings, volume 435, pages 93–136, 2008
10. Biomedical Ontologies (selected)
• Advancing Clinico-Genomic Trials on Cancer (ACGT) Master Ontology (MO)
– data exchange in oncology, integration of clinical and molecular data
• Biological Pathway Exchange (BioPAX)
– metabolic, biochemical, transcription regulation, protein synthesis, signal transduction
pathways
• Experimental Factor Ontology (EFO)
– enhance and promote consistent annotation, automatic annotation to integrate external
data
• Gene Ontology (GO)
– for describing biological processes, molecular functions and cellular components of gene
products
• Medical Subject Headings (MeSH)
– hierarchical structure for indexing, cataloguing, and searching for biomedical/ health-related
data.
• Microarray Gene Expression Data Ontology (MGED)
– the biological sample, the treatment sample and the micro-array chip technology in the
experiment
• National Cancer Institute (NCI) Thesaurus
– integrates molecular and clinical cancer-related information to integrate, retrieve and relate
concepts
• Ontology for Biomedical Investigations (OBI)
– designs, protocols, instrumentation, materials, processes, data in biological & biomedical
investigations
11. Drugs and Chemical Compound Ontologies (selected)
• RxNorm
– standard names for clinical drugs (active drug ingredient, dosage
strength, physical form) and links
Generic and Upper Ontologies (selected)
• Basic Formal Ontology (BFO)
– formalise entities such as 3D enduring objects and comprehending
processes
• OBO Relation Ontology (RO)
– formal definitions of basic relations that cross-cut the biomedical domain
• Provenance Ontology (PROVO)
– provides classes, properties and restrictions for provenance information
12. Statistical overview of implementation details of Ontologies (selected)
Ontology | Category | Year* | Topic | Implementation | Classes | Properties | Individuals | Depth
ACGT-MO | Biomedical | 2008 | Cancer | OWL/CSV/RDF/XML | 1769 | 260 | 61 | 18
BioPAX | Biomedical | 2010 | Pathways | OWL/CSV/RDF/XML | 68 | 96 | 0 | 4
EFO | Biomedical | 2015 | Experimental Factors | OWL/CSV/RDF/XML | 18596 | 35 | 0 | 14
GO | Biomedical | 2016 | Genomics and Proteomics | OWL/CSV/RDF/XML | 4419 | 9 | 0 | 16
MeSH | Biomedical | 2009 | Health | RDF/TTL/CSV | 252375 | 38 | 0 | 15
MGED | Biomedical | 2009 | Microarray Experiment | OWL/CSV/RDF/XML | 233 | 121 | 698 | 8
NCIT | Biomedical | 2007 | Clinical care | OWL/CSV/RDF/XML | 118167 | 173 | 45715 | 16
OBI | Biomedical | 2008 | Experimental Data | OWL/CSV/RDF/XML | 2932 | 106 | 178 | 16
UMLS | Biomedical | 1993 | Biomedical/Health | RDF | 3221702 | - | - | -
RxNorm | Drugs | 1993 | Clinical Drugs | OWL/CSV/RDF/XML | 118555 | 46 | 0 | 0
BFO | Generic | 2003 | Genuine Upper Ontology | OWL/CSV/RDF/XML | 35 | 0 | 0 | 5
RO | Generic | 2005 | Relations used in all OBO ontologies | OWL/CSV/RDF/XML | - | - | - | -
PROVO | Generic | 2012 | PROV Data Model | OWL/CSV/RDF/XML | 30 | 50 | 4 | 3
*Statistics as of Aug 2016, as listed at BioPortal. Year specifies when the most recent version was
produced. “-” means information not available.
14. Public Data Repositories for Drug Discovery
• The databases are separated into the following
categories:
– Gene, Gene Expression and Protein Databases for
gene and protein annotations as well as the expression
levels and related clinical data.
– Pathway Databases denoting the protein interactions and
the overall functional outcomes.
– Chemical and Structure Databases including Biological
Activities for the information related to drugs and other
chemicals including also toxicity observations and clinical
trials.
– Disease Specific Databases for Prevention which
deliver content specific to the prevention of cancer.
– Literature Databases
15. Gene, Gene Expression and Protein Databases
• GenBank
– over 65 B nucleotide bases in more than 61 M sequences
• ArrayExpress
– 65060 experiments, 1'973'776 assays; annotated data for gene
expression from biological experiments
• Gene Expression Omnibus (GEO)
– 3'848 datasets; gene expression for specific studies
• Universal Protein Resource (UniProt)
– 63'686'057 sequences, 21'364'768'379 amino acids;
classifications, cross-references, annotation of proteins
• Protein Data Bank (PDB)
– 118280 biological structures; evidence of experimentally
validated protein structures
• Protein Database
– 30'047 protein entries, 41'327 PPIs; translated coding regions
from GenBank, TPA, SwissProt, PIR, PRF, UniProt and PDB.
16. Pathway Databases
• Kyoto Encyclopedia of Genes and Genomes (KEGG)
– 432'883 pathway maps, 153'776 hierarchies; genome
sequencing and high-throughput experimental technologies
• Reactome
– 9'386 proteins; pathway data for signalling,
transcriptional regulation, translation, apoptosis and others
• WikiPathways
– 2'475 pathways; complementing e.g. KEGG, Reactome,
Pathway Commons
• cPath: Pathway Database Software
– 31'698 pathways, 1'151'476 interactions; pathway
visualisation, analysis and modelling
17. Chemical and Structure Databases including
Biological Activities
• Chemical Compounds Database (Chembase)
– 150'000 pages, compounds, their physical and chemical properties, mass spectra
• Chemical Entities of Biological Interest (ChEBI)
– 48'296 compounds, natural and synthetic atom, molecule, ion, radical, conformer
• DrugBank
– 8,261 drugs, 4,164 targets, 243 Enzymes, 118 Transporters, drug (chemical,
pharmaceutical), drug target (sequence, structure, pathway)
• PubChem
– 89'124'401 Compounds, compound neighbouring, sub/superstructure, bioactivity
data
• Aggregated Computational Toxicology Resource (ACToR)
– more than 500 public sources; environmental chemicals searchable by name and
structure
• ClinicalTrials
– 213'868 studies; offers information for locating clinical trials for diseases and
conditions
• TOXicology Data NETwork (TOXNET)
– toxicology, hazardous chemicals, environmental health and related areas
18. Disease Specific Databases for Prevention
• Colon Chemoprevention Agents Database (CCAD)
– 1,137 agents and literature data for colon chemoprevention in humans, rats,
mice
• Dietary Supplements Labels Database
– 5'000 brands of dietary supplements to compare label ingredients in different
brands. Links to other databases such as MedlinePlus and PubMed
• REPAIRtoire Database
– DNA damage links, pathways, proteins for DNA repair, diseases related to
mutations
Literature Databases
• PubMed
– journal citations, i.e. the primary source of information for biomedical researchers
• PubMed Dietary Supplement Subset
– dietary supplement literature including vitamin, mineral, botanical/herbal
supplements
19. Statistical overview of implementation details of libraries and databases (selected)
Database | Category | Year* | Topic | Implementation | Size/Stats
PubMed | Literature | 1996 | Biomedical Literature | WebBased/CSV | 11 M journal citations
PDSS | Literature | 1999 | Citations of dietary supplement | WebBased | X
DSLD | Chemoprevention | 2013 | Ingredients of dietary supplement | WebBased | > 5000 selected brands
ClinicalTrials | Toxicity | 2000 | Clinical Trials | WebBased | 213,868 studies
TOXNET | Toxicity | 1987 | Toxicology Database | WebBased | X
ACToR | Compound | 2008 | Chemical Toxicity Data | WebBased | >500 public sources
DrugBank | Compound | 2008 | Drug Data | WebBased/LOD | 8206 drugs
ChEBI | Compound | X | Small Molecular entities | WebBased/LOD | 48,296 compounds
PubChem | Compound | 2004 | Compound Structure | WebBased/LOD | 89,124,401 compounds
ChemSpider | Chemical | 2007 | Compound Structure | WebBased | >40 million structures
KEGG | Pathway | 1995 | Genomic, Chemical, systemic | WebBased/LOD | 432883 pathway maps
Reactome | Pathway | 2003 | Pathways | WebBased | 9386 proteins
WikiPathways | Pathway | 2007 | Biological pathways | WebBased | 2475 pathways
cPath | Pathway | 2005 | Biological pathways | Desktop/WebBased | 31698 pathways
UniProt | Protein | 2002 | Protein Sequence | WebBased/LOD | 63686057 sequences
PDB | Protein | 1971 | 3D structural data of Proteins | WebBased/LOD | 30,047 protein
*Statistics as of Aug 2016. Year specifies when the most recent version was produced. “X” means
information not available.
20. Life Sciences Linked Open Data Cloud
• Linked biomedical datasets relevant in a Cancer Chemoprevention
and drug discovery scenario:
– Linked Open Drug Data (LODD)
• Set of linked datasets relevant to Drug Discovery that includes data
from several datasets including DrugBank, LinkedCT, DailyMed,
Diseasome, SIDER, STITCH, Medicare, RxNorm, ClinicalTrials.gov,
NCBI Entrez Gene and OMIM.
– Bio2RDF
• An open-source project that uses Semantic Web technologies to build
and provide the largest network of Linked Data for the Life Sciences.
Contains multiple linked biological databases including pathway
databases such as KEGG, PDB and several NCBI databases.
– LinkedLifeData
• A semantic data integration platform for the biomedical domain
containing 5 billion RDF statements from various sources including
UniProt, PubMed, EntrezGene and 20 more.
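The kind of content tally this overview relies on (how triples are distributed across concepts) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used to produce the figures in this deck: the triples, the `ex:` identifiers and the helper name `triples_per_concept` are all made up for the example.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Hypothetical toy triples (subject, predicate, object). The identifiers are
# illustrative only and are not taken from any real dataset.
triples = [
    ("ex:P1", RDF_TYPE, "ex:Protein"),
    ("ex:P1", "ex:encodedBy", "ex:G1"),
    ("ex:G1", RDF_TYPE, "ex:Gene"),
    ("ex:G1", "ex:label", "EGFR"),
    ("ex:C1", RDF_TYPE, "ex:Compound"),
    ("ex:C1", "ex:targets", "ex:P1"),
    ("ex:D1", RDF_TYPE, "ex:Disease"),
]


def triples_per_concept(triples):
    """Count triples grouped by the declared type of their subject."""
    # Map every typed subject to its concept (last declaration wins).
    type_of = {s: o for s, p, o in triples if p == RDF_TYPE}
    # Subjects without a type declaration are grouped under "untyped".
    return Counter(type_of.get(s, "untyped") for s, _, _ in triples)


counts = triples_per_concept(triples)
for concept, n in counts.most_common():
    print(concept, n)
```

On a real triple store the same grouping would normally be pushed into the query engine rather than done in memory; the in-memory version only shows the shape of the computation.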
21. The Linked Open Data Cloud
“Life sciences will drive adoption of the Semantic Web, just as high-energy physics
drove the early Web.”
- Sir Tim Berners-Lee, 2005
[Figure: Linked Open Data cloud with regions labelled Proteins, Molecules, Genes and Diseases]
25. Related Work (selected)
• Zeginis et al. [2] proposed a “meet-in-the-middle” approach to develop
the semantic model relevant for cancer chemoprevention. Relevant
data was analysed bottom-up by analysing the domain, whereas a
top-down approach was used to collect ontologies, vocabularies
and data models.
• Hasnain et al. [3] proposed the Linked Biomedical Dataspace (to access
and use biomedical resources relevant for cancer chemoprevention)
with four components:
– a) knowledge extraction,
– b) link creation,
– c) query execution and
– d) knowledge publishing.
[2]: Zeginis, D., et al.: A collaborative methodology for developing a semantic model for interlinking Cancer
Chemoprevention linked-data sources. Semantic Web (2013)
[3]: Hasnain, A., et al.: Linked Biomedical Dataspace: Lessons Learned integrating Data for Drug Discovery. In:
International Semantic Web Conference (In-Use Track), October 2014
26. Conclusion
• In this paper we introduce and classify different tiers of biomedical data
relevant to the Cancer Chemoprevention and Drug Discovery domains.
• This involves ontologies, databases and Life Science Linked Open Data in the
Healthcare, Life Sciences and Biomedical domains.
• We classify ontologies into three main classes:
– i) biomedical Ontologies (e.g. EFO, OBI, GO etc),
– ii) Drugs and Chemical Compound Ontologies (e.g. RxNorm) and
– iii) Generic and Upper Ontologies (e.g. BFO, RO, PROV).
• Similarly we categorise libraries and databases in five categories:
– (i) Gene, Gene Expression and Protein Databases,
– (ii) Pathway databases,
– (iii) Chemical and Structure Databases including Biological Activities,
– (iv) Disease Specific Databases for Prevention, and
– (v) Literature databases.
Link to next slide: Linked Data facilitates the assembly of complex queries and workflows.
To discover which links our datasets could have to other data sources, we explored what types of data are published in the Linked Open Data cloud.
What we found was a lot of messy data: looking at 8 datasets containing molecular data, their descriptions are very different. ChEBI calls molecules compounds, DrugBank calls them drugs, and DailyMed also calls them drugs but uses a different identifier.
Link – how to start linking all of these datasets such that they can be made available in a unified query interface?
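One possible first step towards a unified query interface is to map each dataset's local class name onto a shared concept while keeping the source identifier. The sketch below assumes a hand-written mapping table; `CLASS_MAP`, the `unify` helper, the shared concept name "Molecule" and all identifiers are hypothetical and only illustrate the naming mismatch described above.

```python
# Hypothetical mapping from (dataset, local class name) to a shared concept.
CLASS_MAP = {
    ("chebi", "compound"): "Molecule",
    ("drugbank", "drug"): "Molecule",
    ("dailymed", "drug"): "Molecule",
}


def unify(records):
    """Rewrite each record's local class to the shared concept.

    The source dataset and its identifier are kept so that provenance
    is not lost when records from different datasets are merged.
    """
    unified = []
    for dataset, local_class, identifier in records:
        concept = CLASS_MAP.get((dataset.lower(), local_class.lower()))
        if concept is None:
            continue  # unmapped classes are skipped rather than guessed
        unified.append({"concept": concept, "source": dataset, "id": identifier})
    return unified


# Made-up records: (dataset, local class name, dataset-specific identifier).
records = [
    ("ChEBI", "Compound", "chebi:0001"),
    ("DrugBank", "Drug", "drugbank:X0001"),
    ("DailyMed", "Drug", "dailymed:42"),
    ("Reactome", "Pathway", "reactome:7"),
]
molecules = unify(records)
```

After this step all three molecule-like records can be selected with a single condition on `concept`, although they still carry different identifiers; reconciling those is a separate linking problem.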
EGFR: Epidermal growth factor receptor
Biomedical Ontologies:
Advancing Clinico-Genomic Trials on Cancer (ACGT) Master Ontology (MO)
Biological Pathway Exchange (BioPAX)
Experimental Factor Ontology (EFO)
Gene Ontology (GO)
Medical Subject Headings (MeSH)
Microarray Gene Expression Data Ontology (MGED)
National Cancer Institute (NCI) Thesaurus
Ontology for Biomedical Investigations (OBI)
Unified Medical Language System (UMLS)
Drugs and Chemical Compound Ontologies:
RxNorm
Generic and Upper Ontologies:
Basic Formal Ontology (BFO)
OBO Relation Ontology (RO)
Provenance Ontology (PROVO)
Literature Databases:
Pubmed
PubMed Dietary Supplement Subset
Natural Sources of Chemoprevention Agents Databases:
Dietary Supplements Labels Database
Toxicity and Efficacy Databases:
ClinicalTrials
TOXicology Data NETwork (TOXNET)
Biological Activity of Compounds Databases
Aggregated Computational Toxicology Resource (ACToR)
DrugBank
Chemical Entities of Biological Interest (ChEBI)
PubChem
REPAIRtoire Database
Gene Expression Databases:
Cancer Gene Expression Database (CGED)
ArrayExpress
Gene Expression Omnibus (GEO)
Gene and DNA Databases:
GenBank
Chemical and Physical Structure Databases:
ChemSpider
Chemical Compounds Database (Chembase)
Sigma-Aldrich
ChemDB
Disease Specific Compound Databases:
Colon Chemoprevention Agents Database
Pathway Databases:
Kyoto Encyclopedia of Genes and Genomes (KEGG)
Reactome
WikiPathways
cPath: Pathway Database Software
Protein Databases:
Universal Protein Resource (UniProt)
Protein Data Bank (PDB)
Protein Database
M: when data is catalogued, we can discover new links by cross-referencing with existing datasets
-> once we identify these concepts, how do we actually query them together?
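Cross-referencing two catalogues to discover links can be sketched as a join on a shared external reference. This is only an illustration of the idea in the note above: the `discover_links` helper, the `xref` field and every identifier are assumptions, not part of any dataset named in this deck.

```python
from collections import defaultdict


def discover_links(catalogue_a, catalogue_b, key):
    """Return (id_a, id_b) pairs whose records share the same value for `key`."""
    # Index catalogue B by the shared reference; records without one are ignored.
    index = defaultdict(list)
    for rec in catalogue_b:
        if rec.get(key):
            index[rec[key]].append(rec["id"])
    links = []
    for rec in catalogue_a:
        # .get() avoids creating empty entries for references B does not have.
        for target in index.get(rec.get(key), []):
            links.append((rec["id"], target))
    return links


# Made-up catalogues; "xref" stands for any shared external identifier.
catalogue_a = [
    {"id": "a:1", "xref": "X-1"},
    {"id": "a:2", "xref": "X-2"},
    {"id": "a:3"},
]
catalogue_b = [
    {"id": "b:9", "xref": "X-1"},
    {"id": "b:8", "xref": "X-7"},
]
links = discover_links(catalogue_a, catalogue_b, "xref")
```

Exact matching on a shared reference is the simplest case; records that only agree on names or labels would need normalisation or fuzzy matching, which this sketch does not attempt.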