With its focus on investigating the nature and basis for the sustained existence of living systems, modern biology has always been a fertile, if not challenging, domain for formal knowledge representation and automated reasoning. Over the past 15 years, hundreds of projects have developed or leveraged ontologies for entity recognition and relation extraction, semantic annotation, data integration, query answering, consistency checking, association mining and other forms of knowledge discovery. In this talk, I will discuss our efforts to build a rich foundational network of ontology-annotated linked data, discover significant biological associations across these data using a set of partially overlapping ontologies, and identify new avenues for drug discovery by applying measures of semantic similarity over phenotypic descriptions. As the portfolio of Semantic Web technologies continue to mature in terms of functionality, scalability and an understanding of how to maximize their value, increasing numbers of biomedical researchers will be strategically poised to pursue increasingly sophisticated KR projects aimed at improving our overall understanding of the capability and behavior of biological systems.
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Generating Biomedical Hypotheses Using Semantic Web Technologies
1. Generating (and Testing) Biomedical Hypotheses
Using Semantic Web Technologies
Michel Dumontier, Ph.D.
Associate Professor of Medicine (Biomedical Informatics)
Stanford University
1
@micheldumontier::AAAI-FSW:Nov 16, 2013
4. We use the biomedical literature to find out what we
believe to be true
4
@micheldumontier::AAAI-FSW:Nov 16, 2013
5. we need to access to the most recent biomedical data
(problems: access, format, identifiers & linking)
5
@micheldumontier::AAAI-FSW:Nov 16, 2013
6. we also need access to the most effective software
to analyze, predict and evaluate
(problems: OS, versioning, input/output formats)
6
@micheldumontier::AAAI-FSW:Nov 16, 2013
7. we build support for our hypotheses by constructing
and (partly) reusing fairly sophisticated workflows
7
@micheldumontier::AAAI-FSW:Nov 16, 2013
8. Wouldn’t it be great if we could just find the evidence
required to support or dispute a scientific hypothesis using
the most up-to-date and relevant data, tools and scientific
knowledge?
8
@micheldumontier::AAAI-FSW:Nov 16, 2013
9. madness!
1. Build a massive network of interconnected
data and software using web standards
2. Develop methods to generate and test
hypotheses across these data
1. tactical formalization
2. uncovering associations
3. evidence gathering
3. Contribute back to the global knowledge
graph
9
@micheldumontier::AAAI-FSW:Nov 16, 2013
10. The Semantic Web
is the new global web of knowledge
standards for publishing, sharing and querying
facts, expert knowledge and services
scalable approach for the discovery
of independently formulated
and distributed knowledge
10
@micheldumontier::AAAI-FSW:Nov 16, 2013
11. we’re building an incredible network of linked data
11 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch
@micheldumontier::AAAI-FSW:Nov 16, 2013
12. Bio2RDF Linked Data Map
Bio2RDF Release 2: Improved coverage, interoperability and provenance of Life Science Linked Data . ESWC 2013.
12
@micheldumontier::AAAI-FSW:Nov 16, 2013
13. At the heart of Linked Data for the Life Sciences
chemicals/drugs/formulations,
genomes/genes/proteins, domains
Interactions, complexes & pathways
animal models and phenotypes
Disease, genetic markers,
treatments
Terminologies & publications
• Free and open source
• Uses Semantic Web standards
• Release 2 (Jan 2013): 1B+ interlinked statements from 19 conventional and high
value datasets
• Includes provenance, statistics
• Partnerships with EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool
providers
13
A Callahan, J Cruz-Toledo, M Dumontier. Querying Bio2RDF Linked Open Data with a Global Schema.
2013. Journal of Biomedical Semantics. 4(Suppl 1):S1.
@micheldumontier::AAAI-FSW:Nov 16, 2013
14. Resource Description Framework
• It’s a language to represent knowledge
– Logic-based formalism -> automated reasoning
– graph-like properties -> data analysis
• Good for
– Describing in terms of type, attributes, relations
– Integrating data from different sources
– Sharing the data (W3C standard)
– Reuse what is available, develop what you’re
missing.
14
Dumontier::FDA:Nov 15,2013
19. Directions
• research
– approaches for data publication
– methods for large scale data reduction
– methods for mining associations
• service
– coverage, quality and licensing
– better user interfaces
– sustainability of service
19
@micheldumontier::AAAI-FSW:Nov 16, 2013
20. tactical formalization
a. discovery of drug and disease pathway associations
b. prediction of drug targets using phenotypes
20
@micheldumontier::AAAI-FSW:Nov 16, 2013
21. ontology as a
strategy to formally
represent and
integrate knowledge
21
@micheldumontier::AAAI-FSW:Nov 16, 2013
22. Have you heard of OWL?
22
@micheldumontier::AAAI-FSW:Nov 16, 2013
23. Identification of
disease enriched pathways
def: A biological pathway is constituted by a
set of molecular components that undertake
some biological transformation to achieve a
stated objective
glycolysis : a pathway that converts glucose to
pyruvate
23
@micheldumontier::AAAI-FSW:Nov 16, 2013
24. aberrant pathways
which pathways are associated with disease?
Which pathways can be perturbed by drugs?
disease
pathway
drug
24
@micheldumontier::AAAI-FSW:Nov 16, 2013
25. Identification of
drug and disease enriched pathways
• Approach
– Integrate 3 datasets: DrugBank, PharmGKB and
CTD
– Integrate 7 terminologies: MeSH, ATC, ChEBI,
UMLS, SNOMED, ICD, DO
– Identify significant pathways via enrichment
analysis over the fully inferred knowledge base
• Formalized as an OWL-EL ontology
– 650,000+ classes, 3.2M subClassOf axioms, 75,000
equivalentClass axioms
25
Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics.
Bioinformatics. 2012.
@micheldumontier::AAAI-FSW:Nov 16, 2013
26. Top Level Classes
(disjointness)
pathway
drug
gene
disease
Reciprocal
Existentials
Class subsumption
mercaptopurine
[pharmgkb:PA450379]
purine-6-thiol
[CHEBI:2208]
Class Equivalence
mercaptopurine
[drugbank:DB01033]
mercaptopurine
[mesh:D015122]
mercaptopurine
[ATC:L01BB02]
drug
property chains
pathway
drug
disease
gene
26
@micheldumontier::AAAI-FSW:Nov 16, 2013
27. Benefits: Enhanced Query Capability
– Use any mapped terminology to query a target resource.
– Use knowledge in target ontologies to formulate more
precise questions
• ask for drugs that are associated with diseases of the joint:
‘Chikungunya’ (do:0050012) is defined as a viral infectious disease
located in the ‘joint’ (fma:7490) and caused by a ‘Chikungunya
virus’ (taxon:37124).
– Learn relationships that are inferred by automated
reasoning.
• alcohol (ChEBI:30879) is associated with alcoholism (PA443309)
since alcoholism is directly associated with ethanol (CHEBI:16236)
• ‘parasitic infectious disease’ (do:0001398) retrieves 129 disease
associated drugs, 15 more than are directly associated.
27
@micheldumontier::AAAI-FSW:Nov 16, 2013
28. Knowledge Discovery through Data
Integration and Enrichment Analysis
• OntoFunc: Tool to discover significant associations between sets of objects
and ontology categories. enrichment of attribute among a selected set of
input items as compared to a reference set. hypergeometric or the
binomial distribution, Fisher's exact test, or a chi-square test.
• We found 22,653 disease-pathway associations, where for each pathway
we find genes that are linked to disease.
– Mood disorder (do:3324) associated with Zidovudine Pathway
(pharmgkb:PA165859361). Zidovudine is for treating HIV/AIDS. Side
effects include fatigue, headache, myalgia, malaise and anorexia
• We found 13,826 pathway-chemical associations
– Clopidogrel (chebi:37941) associated with Endothelin signaling
pathway (pharmgkb:PA164728163). Endothelins are proteins that
constrict blood vessels and raise blood pressure. Clopidogrel inhibits
platelet aggregation and prolongs bleeding time.
28
@micheldumontier::AAAI-FSW:Nov 16, 2013
29. PhenomeDrug
A computational approach to predict drug
targets, drug effects, and drug indications using
phenotypic information
Mouse model phenotypes provide information about human drug targets.
Hoehndorf R, Hiebert T, Hardy NW, Schofield PN, Gkoutos GV, Dumontier M.
Bioinformatics. 2013.
29
@micheldumontier::AAAI-FSW:Nov 16, 2013
30. animal models provide insight for on target effects
• In the majority of 100 best selling drugs ($148B in US
alone), there is a direct correlation between knockout
phenotype and drug effect
• Gastroesophageal reflux
– Proton pump inhibitors (e.g Prilosec - $5B)
– KO of alpha or beta unit of H/K ATPase are achlorhydric (normal
pH) in stomach compared to wild type (pH 3-5) when exposed
to acidic solution
• Immunological Indications
– Anti-histamines (Claritin, Allegra, Zyrtec)
– KO of histamine H1 receptor leads to decreased responsiveness
of immune system
– Predicts on target effects : drowsiness, reduced anxiety
Zambrowicz and Sands. Nat Rev Drug Disc. 2003.
30
@micheldumontier::AAAI-FSW:Nov 16, 2013
31. Identifying drug targets
from mouse knock-out phenotypes
Main idea: if a drug’s phenotypes matches the phenotypes of a
null model, this suggests that the drug is an inhibitor of the gene
phenotypes
similar
effects
non-functional
gene model
drug
ortholog
gene
31
inhibits
human gene
@micheldumontier::AAAI-FSW:Nov 16, 2013
32. Terminological Interoperability
(we must compare apples with apples)
Mouse
Phenotypes
PhenomeNet
Mammalian
Phenotype
Ontology
PhenomeDrug
Drug effects
(mappings from UMLS to DO, NBO, MP)
@micheldumontier::AAAI-FSW:Nov 16, 2013
33. Semantic Similarity
Given a drug effect profile D and a mouse model M, we
compute the semantic similarity as an information weighted
Jaccard metric.
The similarity measure used is non-symmetrical and
determines the amount of information about a drug effect
profile D that is covered by a set of mouse model
phenotypes M.
33
@micheldumontier::AAAI-FSW:Nov 16, 2013
34. null models do well in predicting
targets of drug inhibitors
• 14,682 drugs; 7,255 mouse genotypes
• Validation against known and predicted inhibitor-target pairs
– 0.739 ROC AUC for human targets (DrugBank)
– 0.736 ROC AUC for mouse targets (STITCH)
– 0.705 ROC AUC for human targets (STITCH)
• diclofenac (STITCH:000003032)
– NSAID used to treat pain, osteoarthritis and rheumatoid arthritis
– Drug effects include liver inflammation (hepatitis), swelling of liver
(hepatomegaly), redness of skin (erythema)
– 49% explained by PPARg knockout
• peroxisome proliferator activated receptor gamma (PPARg) regulates metabolism,
proliferation, inflammation and differentiation,
• Diclofenac is a known inhibitor
– 46% explained by COX-2 knockout
• Diclofenac is a known inhibitor
@micheldumontier::AAAI-FSW:Nov 16, 2013
35. Finding evidence to support or dispute a hypothesis takes
some digging around
35
@micheldumontier::AAAI-FSW:Nov 16, 2013
36. FDA Use Case:
TKI non-QT Cardiotoxicity
• Tyrosine Kinase Inhibitors (TKI)
– Imatinib, Sorafenib, Sunitinib, Dasatinib, Nilotinib, Lapatinib
– Used to treat cancer
– Linked to cardiotoxicity.
• FDA launched drug safety program to detect toxicity
– Need to integrate data and ontologies (Abernethy, CPT 2011)
– Abernethy (2013) suggest using public data in genetics,
pharmacology, toxicology, systems biology, to
predict/validate adverse events
• What evidence could we gather to give credence that
TKI’s causes non-QT cardiotoxicity?
36
@micheldumontier::AAAI-FSW:Nov 16, 2013
37. HyQue
• The goal of HyQue is retrieve and evaluate evidence
that supports/disputes a hypothesis
– hypotheses are described as a set of events
• e.g. binding, inhibition, phenotypic effect
– events are associated with types of evidence
• a query is written to retrieve data
• a weight is assigned to provide significance
• Hypotheses are written by people who seek answers
• data retrieval rules are written by people who know the
data and how it should be interpreted
1. HyQue: Evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.
2. Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC
2012). Heraklion, Crete. May 27-31, 2012.
37
@micheldumontier::AAAI-FSW:Nov 16, 2013
38. HyQue: A Semantic Web Application
Hypothesis
Evaluation
Data
Ontologies
Software
@micheldumontier::AAAI-
38
39. SPIN functions
to evaluate TKI Cardiotoxicity
• Effects [r3.sider, lit]
• Does the drug have known cardiotoxic effects?
• Models [r2.mgi]
• Are there cardiotoxic phenotypes in null mouse models of target genes?
• Assays [ebi.chembl]
• Is hERG inhibited?
• Is the drug TUNEL positive?
• Targets [r2.drugbank, lit]
• Does the drug inhibit cardiotoxic-related proteins?
• RAF1, PDFGFR, VEGFR, AMPK
39
@micheldumontier::AAAI-FSW:Nov 16, 2013
44. In Summary
• This talk was about making sense of the
structured data we already have
• RDF-based Linked Open Data acts as a substrate
for query answering and task-based formalization
in OWL
• Discovery through the generation of testable
hypotheses in the target domain.
• The next big challenge lies in capturing the
output of computational experiments just as
much as extracting/formalizing user-contributed
knowledge .
44
@micheldumontier::AAAI-FSW:Nov 16, 2013
45. Acknowledgements
Bio2RDF Release 2:
Allison Callahan, Jose Cruz-Toledo, Peter Ansell
Aberrant Pathways: Robert Hoehndorf, Georgios Gkoutos
PhenomeDrug: Tanya Hiebert, Robert Hoehndorf, Georgios
Gkoutos, Paul Schofield
HyQue: Alison Callahan, Nigam Shah
45
@micheldumontier::AAAI-FSW:Nov 16, 2013
Growth of PubMed citations from 1986 to 2010. Over the past 20 years, the total number of citations in PubMed has increased at a ∼4% growth rate. There are currently over 20-million citations in PubMed
The Bio2RDF project transforms silos of life science data into a globally distributed network of linked data for biological knowledge discovery.
Slick used car salesman
Endothelins are proteins that constrict blood vessels and raise blood pressure. They are normally kept in balance by other mechanisms, but when they are over-expressed, they contribute to high blood pressure (hypertension) and heart disease.
Can’t answer questions that require background knowledge
Associate Director for Drug Safety: Darrell Abernethy, M.D., Ph.D