Quantifying Biomedical Semantic Resources for Drug Discovery Platforms

Quantifying the content of Biomedical Semantic Resources as a core for Drug Discovery Platforms
Ali Hasnain and Dietrich Rebholz-Schuhmann
May 2017

Agenda
• Introduction
• Motivation
• Ontologies
– Biomedical Ontologies
– Drugs and Chemical Compound Ontologies
– Upper level Ontologies
• Data Repositories/Databases for Drug Discovery
– Gene, Gene Expression and Protein Databases
– Pathway databases
– Chemical and Structure Databases
– Disease Specific Databases for Prevention
– Literature databases
• Life Sciences Linked Open Data Cloud
– Linked Open Drug Data (LODD)
– Bio2RDF
– LinkedLifeData
• Related Work
• Conclusion

Introduction
• Biomedical data exists as ontologies, repositories, and
other open data resources, e.g. Life Science Linked
Open Data (LS-LOD), that are relevant in the context of Drug
Discovery and Cancer Chemoprevention.
• The analysis gives an overview of which resources
have to be considered and how much data requires
integration, and provides the opportunity to tailor
semantic solutions to specific needs in terms of size
and performance.

Motivation
We live in a world of data

Linked Data for Cancer Chemoprevention
• Because Biomedical Data is heterogeneous and spread
across multiple sources
[Figure: hypothesis-generation funnel. Labels shown: Literature, In silico models, Browse databases, Linked Data, Hypothesis Generation; ~2000 small molecules, ~100 molecules, ~10 interesting pathways, ~5 molecules testable in the lab.]

Heterogeneous Data – Multiple Data sources
DrugBank
DailyMed
ChEBI, KEGG
Reactome
SIDER
BioPAX
Medicare

Biomedical Data Integration
[Figure: RDF graph in which the NCBI node nih:EGFR is linked by sameAs to the Reactome node rea:EGFR.
NCBI side: nih:EGFR, nih:EGF, "epidermal growth factor receptor", "Homo sapiens" and a nucleotide sequence (CCCCGGCGCAGCGCGGCCGCAGCA GCCTCCGCCCCCCGCACGGTGTGA GCGCCCGACGCGGCCGAGGCGG …), with edges nci:has_description, nih:sequence, nih:organism (twice) and nih:interacts.
Reactome side: rea:EGFR with three rea:keyword edges to rea:Membrane, rea:Receptor and rea:Transferase.]
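The integration step in the figure can be sketched in a few lines of plain Python. This is only an illustration, not the platform's implementation: the triples are transcribed from the figure labels, the assignment of the NCBI edges to nih:EGFR is an assumption (only the labels survive in the slide text), and the helper names `canonical_ids` and `integrate` are hypothetical. The idea is that a sameAs link lets both records be read as one merged description.

```python
from collections import defaultdict

# Triples transcribed from the figure. Which NCBI edge attaches to which node
# is an assumption; the slide text only preserves the labels.
NCBI = [
    ("nih:EGFR", "nci:has_description", "epidermal growth factor receptor"),
    ("nih:EGFR", "nih:organism", "Homo sapiens"),
    ("nih:EGFR", "nih:interacts", "nih:EGF"),
]
REACTOME = [
    ("rea:EGFR", "rea:keyword", "rea:Membrane"),
    ("rea:EGFR", "rea:keyword", "rea:Receptor"),
    ("rea:EGFR", "rea:keyword", "rea:Transferase"),
]
SAME_AS = [("nih:EGFR", "rea:EGFR")]


def canonical_ids(same_as):
    """Map each identifier to one representative per sameAs group (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Deterministic choice: the lexicographically smaller id wins.
            parent[max(ra, rb)] = min(ra, rb)
    return {x: find(x) for x in list(parent)}


def integrate(sources, same_as):
    """Merge triples from all sources, rewriting subjects and objects to canonical ids."""
    canon = canonical_ids(same_as)
    merged = defaultdict(set)
    for triples in sources:
        for s, p, o in triples:
            merged[canon.get(s, s)].add((p, canon.get(o, o)))
    return merged


view = integrate([NCBI, REACTOME], SAME_AS)
for predicate, obj in sorted(view["nih:EGFR"]):
    print(predicate, "->", obj)
```

After merging, the Reactome keywords and the NCBI annotations are both reachable from the single identifier nih:EGFR, which is the effect the sameAs edge has in the figure.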

Ontologies
These ontologies can fall into three main categories:
1. The Biomedical ontologies are mainly used by biomedical
applications and define the basic biological structures
(e.g. genes, pathways etc).
2. The Drugs and Chemical Compound Ontologies are
related to the clinical drugs and their active ingredients.
3. The upper level ontologies describe general concepts that
many biomedical ontologies share.

Ontology spectrum by Jimeno-Yepes et al. [1]
[1]: Antonio Jimeno-Yepes, Ernesto Jiménez-Ruiz, Rafael Berlanga, and Dietrich Rebholz-Schuhmann. Use of shared lexical resources for efficient ontological engineering. In Semantic Web Applications and Tools for Life Sciences Workshop (SWAT4LS). CEUR WS Proceedings, volume 435, pages 93–136, 2008

Biomedical Ontologies (selected)
• Advancing Clinico-Genomic Trials on Cancer (ACGT) Master Ontology (MO)
– data exchange in oncology, integration of clinical and molecular data
• Biological Pathway Exchange (BioPAX)
– metabolic, biochemical, transcription regulation, protein synthesis, signal transduction
pathways
• Experimental Factor Ontology (EFO)
– enhance and promote consistent annotation, automatic annotation to integrate external
data
• Gene Ontology (GO)
– for describing biological processes, molecular functions and cellular components of gene
products
• Medical Subject Headings (MeSH)
– hierarchical structure for indexing, cataloguing, and searching for biomedical/ health-related
data.
• Microarray Gene Expression Data Ontology (MGED)
– the biological sample, the treatment sample and the micro-array chip technology in the
experiment
• National Cancer Institute (NCI) Thesaurus
– integrates molecular and clinical cancer-related information to integrate, retrieve and relate
concepts
• Ontology for Biomedical Investigations (OBI)
– designs, protocols, instrumentation, materials, processes, data in biological & biomedical
investigations

Drugs and Chemical Compound Ontologies (selected)
• RxNorm
– standard names for clinical drugs (active drug ingredient, dosage
strength, physical form) and links

Generic and Upper Ontologies (selected)
• Basic Formal Ontology (BFO)
– formalise entities such as 3D enduring objects and comprehending
processes
• OBO Relation Ontology (RO)
– formal definitions of basic relations that cross-cut the biomedical domain
• Provenance Ontology (PROVO)
– provides classes, properties and restrictions for provenance information

Statistical overview of implementation details of Ontologies (selected)
Ontology | Category | Year* | Topic | Implementation | Classes | Properties | Individuals | Depth
ACGT-MO | Biomedical | 2008 | Cancer | OWL/CSV/RDF/XML | 1769 | 260 | 61 | 18
BioPAX | Biomedical | 2010 | Pathways | OWL/CSV/RDF/XML | 68 | 96 | 0 | 4
EFO | Biomedical | 2015 | Experimental Factors | OWL/CSV/RDF/XML | 18596 | 35 | 0 | 14
GO | Biomedical | 2016 | Genomics and Proteomics | OWL/CSV/RDF/XML | 4419 | 9 | 0 | 16
MeSH | Biomedical | 2009 | Health | RDF/TTL/CSV | 252375 | 38 | 0 | 15
MGED | Biomedical | 2009 | Microarray Experiment | OWL/CSV/RDF/XML | 233 | 121 | 698 | 8
NCIT | Biomedical | 2007 | Clinical care | OWL/CSV/RDF/XML | 118167 | 173 | 45715 | 16
OBI | Biomedical | 2008 | Experimental Data | OWL/CSV/RDF/XML | 2932 | 106 | 178 | 16
UMLS | Biomedical | 1993 | Biomedical/Health | RDF | 3221702 | - | - | -
RxNorm | Drugs | 1993 | Clinical Drugs | OWL/CSV/RDF/XML | 118555 | 46 | 0 | 0
BFO | Generic | 2003 | Genuine Upper Ontology | OWL/CSV/RDF/XML | 35 | 0 | 0 | 5
RO | Generic | 2005 | Relations used in all OBO ontologies | OWL/CSV/RDF/XML | - | - | - | -
PROVO | Generic | 2012 | PROV Data Model | OWL/CSV/RDF/XML | 30 | 50 | 4 | 3
*Statistics as of Aug 2016, as listed at BioPortal. Year specifies when the most recent version was produced. “-” means information not available.

Classes vs. Properties plot (Selected Ontologies)
[Figure: bar chart comparing the number of classes and the number of properties for ACGT-MO, BioPAX, EFO, GO, MeSH, MGED, NCIT, OBI, UMLS, RxNorm, BFO, RO and PROVO, using the values from the statistical overview of ontologies.]
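The comparison shown in the plot can be reproduced from the class and property counts in the statistical overview of ontologies. The following is a minimal sketch, not part of the original analysis: the ratio "classes per property" is a metric chosen here for illustration, and the function name `classes_per_property` is hypothetical. UMLS and RO are left out because the overview gives "-" for at least one of their values.

```python
# (classes, properties) per ontology, copied from the statistical overview
# (BioPortal, Aug 2016). UMLS and RO are omitted: values listed as "-".
ONTOLOGIES = {
    "ACGT-MO": (1769, 260),
    "BioPAX": (68, 96),
    "EFO": (18596, 35),
    "GO": (4419, 9),
    "MeSH": (252375, 38),
    "MGED": (233, 121),
    "NCIT": (118167, 173),
    "OBI": (2932, 106),
    "RxNorm": (118555, 46),
    "BFO": (35, 0),
    "PROVO": (30, 50),
}


def classes_per_property(classes, properties):
    """Return classes / properties, or None when no properties are defined."""
    return classes / properties if properties else None


ratios = {name: classes_per_property(c, p) for name, (c, p) in ONTOLOGIES.items()}

# Sort from class-heavy to property-heavy; ontologies without properties go last.
ranked = sorted(ratios, key=lambda n: (ratios[n] is None, -(ratios[n] or 0)))

for name in ranked:
    ratio = ratios[name]
    label = "n/a" if ratio is None else f"{ratio:.1f}"
    print(f"{name:8s} {label:>10s}")
```

A high ratio indicates a large, taxonomy-like vocabulary with few relations (MeSH, RxNorm), while a ratio below 1 indicates a small schema dominated by properties (BioPAX, PROVO).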

Public Data Repositories for Drug Discovery
• The databases are separated into the following
categories:
– Gene, Gene Expression and Protein Databases for
gene and protein annotations as well as the expression
levels and related clinical data.
– Pathway Databases denoting the protein interactions and
the overall functional outcomes.
– Chemical and Structure Databases including Biological
Activities for the information related to drugs and other
chemicals including also toxicity observations and clinical
trials.
– Disease Specific Databases for Prevention which
deliver content specific to the prevention of cancer.
– Literature Databases

Gene, Gene Expression and Protein Databases
• GenBank
– over 65 B nucleotide bases in more than 61 M sequences
• ArrayExpress
– 65'060 experiments, 1'973'776 assays; annotated data for gene
expression from biological experiments
• Gene Expression Omnibus (GEO)
– 3'848 datasets; gene expression for specific studies
• Universal Protein Resource (UniProt)
– 63'686'057 sequences, 21'364'768'379 amino acids;
classifications, cross-references, annotation of proteins
• Protein Data Bank (PDB)
– 118'280 biological structures; evidence of experimentally
validated protein structures
• Protein Database
– 30'047 protein entries, 41'327 PPIs; translated coding regions
from GenBank, TPA, SwissProt, PIR, PRF, UniProt and PDB.

Pathway Databases
• Kyoto Encyclopedia of Genes and Genomes (KEGG)
– 432'883 pathway maps, 153'776 hierarchies; genome
sequencing and high-throughput experimental technologies
• Reactome
– 9'386 proteins; pathway data for signalling,
transcriptional regulation, translation, apoptosis and others
• WikiPathways
– 2'475 pathways; complementing e.g. KEGG, Reactome,
Pathway Commons
• cPath: Pathway Database Software
– 31'698 pathways, 1'151'476 interactions; pathway
visualisation, analysis and modelling

Chemical and Structure Databases including
Biological Activities
• Chemical Compounds Database (Chembase)
– 150'000 pages; compounds, their physical and chemical properties, mass spectra
• Chemical Entities of Biological Interest (ChEBI)
– 48'296 compounds; natural and synthetic atom, molecule, ion, radical, conformer
• DrugBank
– 8,261 drugs, 4,164 targets, 243 enzymes, 118 transporters; drug (chemical,
pharmaceutical) and drug target (sequence, structure, pathway) data
• PubChem
– 89'124'401 compounds; compound neighbouring, sub/superstructure, bioactivity
data
• Aggregated Computational Toxicology Resource (ACToR)
– more than 500 public sources; environmental chemicals searchable by name and
structure
• ClinicalTrials
– 213'868 studies; offers information for locating clinical trials for diseases and
conditions
• TOXicology Data NETwork (TOXNET)
– toxicology, hazardous chemicals, environmental health and related areas

Disease Specific Databases for Prevention
• Colon Chemoprevention Agents Database (CCAD)
– 1,137 agents and literature data for colon chemoprevention in humans, rats,
mice
• Dietary Supplements Labels Database
– 5'000 brands of dietary supplements to compare label ingredients in different
brands. Links to other databases such as MedlinePlus and PubMed
• REPAIRtoire Database
– DNA damage links, pathways, proteins for DNA repair, diseases related to
mutations

Literature Databases
• PubMed
– journal citations, i.e. the primary source of information for biomedical researchers
• PubMed Dietary Supplement Subset
– dietary supplement literature including vitamin, mineral, botanical/herbal
supplements

Statistical overview of implementation details of libraries and databases (selected)
Database | Category | Year* | Topic | Implementation | Size/Stats
PubMed | Literature | 1996 | Biomedical Literature | WebBased/CSV | 11 M journal citations
PDSS | Literature | 1999 | Citations of dietary supplement | WebBased | X
DSLD | Chemoprevention | 2013 | Ingredients of dietary supplement | WebBased | > 5000 selected brands
ClinicalTrials | Toxicity | 2000 | Clinical Trials | WebBased | 213,868 studies
TOXNET | Toxicity | 1987 | Toxicology Database | WebBased | X
ACToR | Compound | 2008 | Chemical Toxicity Data | WebBased | > 500 public sources
DrugBank | Compound | 2008 | Drug Data | WebBased/LOD | 8206 drugs
ChEBI | Compound | X | Small Molecular entities | WebBased/LOD | 48,296 compounds
PubChem | Compound | 2004 | Compound Structure | WebBased/LOD | 89,124,401 compounds
ChemSpider | Chemical | 2007 | Compound Structure | WebBased | > 40 million structures
KEGG | Pathway | 1995 | Genomic, Chemical, systemic | WebBased/LOD | 432883 pathway maps
Reactome | Pathway | 2003 | Pathways | WebBased | 9386 proteins
Wikipathway | Pathway | 2007 | Biological pathways | WebBased | 2475 pathways
cPath | Pathway | 2005 | Biological pathways | Desktop/WebBased | 31698 pathways
Uniprot | Protein | 2002 | Protein Sequence | WebBased/LOD | 63686057 sequences
PDB | Protein | 1971 | 3D structural data of Proteins | WebBased/LOD | 30,047 protein
*Statistics as of Aug 2016. Year specifies when the most recent version was produced. “X” means information not available.

Life Sciences Linked Open Data Cloud
• Linked biomedical datasets relevant in a Cancer Chemoprevention
and drug discovery scenario:
– Linked Open Drug Data (LODD)
• Set of linked datasets relevant to Drug Discovery that includes data
from several datasets including Drugbank, LinkedCT, DailyMed,
Diseasome, SIDER, STITCH, Medicare, RxNorm, ClinicalTrials.gov,
NCBI Entrez Gene and OMIM.
– Bio2RDF
• An open-source project that uses Semantic Web technologies to build
and provide the largest network of Linked Data for the Life Sciences.
It contains multiple linked biological databases, including pathway
databases such as KEGG, as well as PDB and several NCBI databases.
– LinkedLifeData
• A semantic data integration platform for the biomedical domain
containing 5 billion RDF statements from various sources including
UniProt, PubMed, EntrezGene and 20 more.

The Linked Open Data Cloud
“Life sciences will drive adoption of the Semantic Web, just as high-energy physics
drove the early Web.”
- Sir Tim Berners-Lee, 2005
[Figure: the Linked Open Data cloud, with regions labelled Proteins, Molecules, Genes and Diseases.]

Meaningful Biomedical Correlation
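[Figure: datasets of the Linked Open Data cloud mapped to four concepts (:Protein, :Molecule, :Gene, :Disease) for the Proteins, Molecules, Genes and Diseases regions. Datasets shown: UniProt, PDB, Pfam, PROSITE, ProDom, UniRef, UniParc, DailyMed, DrugBank, ChEMBL, PubChem, KEGG, Gene Ontology, GeneID, Affymetrix, HomoloGene, MGI, Diseasome, SIDER.]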

Statistical overview of datasets involved in LS-LOD, Bio2RDF and LLD (selected)
Dataset | Category | Year* | Topic | Size/Coverage
Drugbank | LODD | 2010 | Drugs | 766920 triples, 4800 drugs
LinkedCT | LODD | X | Clinical Trials | 25 M triples, 106000 trials
DailyMed | LODD | 2010 | Drugs | 1604983 triples, >36K products
DBpedia | LODD | 2009 | Drugs/Diseases/Proteins | 218M triples, 2300 drugs, 2200 proteins
Diseasome | LODD | 2010 | Diseases/Genes | 91182 triples, 2600 genes
SIDER | LODD | 2010 | Diseases/Side Effects | 192515 triples, 63K effects, 1737 genes
STITCH | LODD | 2010 | Chemicals/Proteins | 7.5 M chemicals, 0.5 M proteins
ChEMBL | LODD | 2010 | Assay/Proteins/Organisms | 130 M triples
Affymetrix | Bio2RDF | 2014 | Microarrays | 8694237 triples, 6679943 entities
BioModels | Bio2RDF | 2014 | Biological/mathematical models | 2380009 triples, 188308 entities
BioPortal | Bio2RDF | 2014 | Biological/biomedical entities | 19920395 triples, 2199594 entities
KEGG | Bio2RDF | 2014 | Genes | 50197150 triples, 6533307 entities
PharmGKB | Bio2RDF | 2014 | Genotypes/Phenotypes | 278049209 triples, 25325504 entities
PubMed | Bio2RDF | 2014 | Citations | 5005343905 triples, 412593720 entities
Taxonomy | Bio2RDF | 2014 | Taxonomy | 21310356 triples, 1147211 entities
LLD | LLD | 2014 | Drugs, Chromosomes | 10192641644 statements
*Statistics as of Aug 2016 (source DataHub). Year specifies when the most recent version was produced. “X” means information not available.
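To tailor a semantic solution "in terms of size and performance", it can help to turn the raw counts into comparable indicators. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses only the Bio2RDF rows of the statistical overview, and the two indicators (triples per entity, share of all triples) are chosen here for the example.

```python
# (triples, unique entities) for the Bio2RDF rows of the statistical overview.
BIO2RDF = {
    "Affymetrix": (8694237, 6679943),
    "BioModels": (2380009, 188308),
    "BioPortal": (19920395, 2199594),
    "KEGG": (50197150, 6533307),
    "PharmGKB": (278049209, 25325504),
    "PubMed": (5005343905, 412593720),
    "Taxonomy": (21310356, 1147211),
}

total_triples = sum(triples for triples, _ in BIO2RDF.values())

# Triples per entity: a rough indicator of how richly each entity is described.
density = {name: triples / entities for name, (triples, entities) in BIO2RDF.items()}

# Share of the combined triple count: a rough indicator of storage/query load.
share = {name: triples / total_triples for name, (triples, _) in BIO2RDF.items()}

densest = max(density, key=density.get)
sparsest = min(density, key=density.get)

for name in sorted(BIO2RDF, key=lambda n: -share[n]):
    print(f"{name:11s} {density[name]:6.2f} triples/entity  {share[name]:6.1%} of triples")
```

With these numbers a single dataset (PubMed) accounts for the large majority of the triples, so a platform that does not need citation data could be sized far smaller than one that does.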

Triples vs. Unique Entities (selected LS-LOD datasets)
[Figure: bar chart of the number of triples and the number of unique entities per dataset. Values as labelled in the chart, listed in legend order.]
Dataset | # of triples | # of unique entities
affymetrix | 86942371 | 6679943
biomodels | 2380009 | 188380
bioportal | 19920395 | 2199594
chembl | 409942525 | 50061452
clinicaltrials | 98835804 | 7337123
ctd | 326720894 | 19768641
dbsnp | 8801487 | 530538
drugbank | 3672531 | 316950
genage | 73048 | 6995
gendr | 11663 | 1129
goa | 97520151 | 5950074
hgnc | 3628205 | 372136
homologene | 7189769 | 869985
interpro | 2323345 | 176579
iproclass | 3306107223 | 364255265
irefindex | 48781511 | 3110993
kegg | 50197150 | 6533307
linkedspl | 2174579 | 59776
lsr | 55914 | 5032
mesh | 7323864 | 305401

Related Work (selected)
• Zeginis et al. [2] proposed a “meet-in-the-middle” approach to develop
the semantic model relevant for cancer chemoprevention. Relevant
data was analysed bottom-up, starting from the domain, whereas a
top-down approach was used to collect ontologies, vocabularies
and data models.
• Hasnain et al. [3] proposed the Linked Biomedical Dataspace (to access
and use biomedical resources relevant for cancer chemoprevention)
with four components:
– a) knowledge extraction,
– b) link creation,
– c) query execution and
– d) knowledge publishing.
[2]: Zeginis, D., et al.: A collaborative methodology for developing a semantic model for interlinking Cancer Chemoprevention linked-data sources. Semantic Web (2013)
[3]: Hasnain, A., et al.: Linked Biomedical Dataspace: Lessons Learned integrating Data for Drug Discovery. In: International Semantic Web Conference (In-Use Track), October 2014 (2014)

Conclusion
• In this paper we introduce and classify different tiers of biomedical data
relevant to the Cancer Chemoprevention and Drug Discovery domains.
• These tiers comprise ontologies, databases and Life Science Linked Open Data in
the Healthcare, Life Sciences and Biomedical domains.
• We classify ontologies into three main classes:
– i) Biomedical Ontologies (e.g. EFO, OBI, GO),
– ii) Drugs and Chemical Compound Ontologies (e.g. RxNorm) and
– iii) Generic and Upper Ontologies (e.g. BFO, RO, PROV).
• Similarly, we categorise libraries and databases into five categories:
– (i) Gene, Gene Expression and Protein Databases,
– (ii) Pathway databases,
– (iii) Chemical and Structure Databases including Biological Activities,
– (iv) Disease Specific Databases for Prevention, and
– (v) Literature databases.
