Real World Applications of OWL
Michel Dumontier, Ph.D.
Associate Professor of Bioinformatics
Department of Biology, School of Computer Science, Institute of Biochemistry,
Carleton University
Ottawa Institute of Systems Biology
Ottawa-Carleton Institute of Biomedical Engineering
Professeur Associé, Université Laval
Visiting Associate Professor, Stanford University
1 Protege Short Course::Dumontier:March 2012
Knowledge Capture
Rightfield
K.Wolstencroft, S.Owen,
M.Horridge, O.Krebs,
W.Mueller, JL. Snoep,
F.Preez, C.Goble
RightField: Embedding
ontology annotation in
spreadsheets.
Bioinformatics (2011),
May 2011
3 Protege Short Course::Dumontier:March 2012
Formalization
SNOMED-CT
• SNOMED-CT (Clinical Terms)
ontology
• used in healthcare systems of
more than 15 countries, including
Australia, Canada, Denmark,
Spain, Sweden and the UK
• also used by major US providers,
e.g., Kaiser Permanente
• ontology provides common
vocabulary for recording clinical
data
• 395036 classes
4 Protege Short Course::Dumontier:March 2012
SNOMED-CT
• Pattern based knowledge capture
• need training and an information system to
implement
5 Protege Short Course::Dumontier:March 2012
SNOMED - verification
• Kaiser Permanente extending SNOMED to express,
e.g.:
– non-viral pneumonia (negation)
– infectious pneumonia is caused by a virus or a bacterium
(disjunction)
– double pneumonia occurs in two lungs (cardinalities)
• This is easy in SNOMED-OWL
– but reasoner failed to find expected subsumptions, e.g., that
bacterial pneumonia is a kind of non-viral pneumonia
• Ontology highly under-constrained: need to add
disjointness axioms (at least)
– virus and bacterium must be disjoint
- Ian Horrocks OWL2 tutorial
6 Protege Short Course::Dumontier:March 2012
SNOMED
• Adding disjointness led to surprising results
– many classes become inconsistent, e.g.,
percutanious embolization of hepatic artery using
fluoroscopy guidance
• Cause of inconsistencies identified as class
groin
– groin asserted to be subclass of both abdomen and
leg
– abdomen and leg are disjoint
– modelling of groin (and other similar “junction”
regions) identified as incorrect
- Ian Horrocks OWL2 tutorial
7 Protege Short Course::Dumontier:March 2012
Consistency Checking
Formalization of SBML annotations into
OWL ontologies
• Biomodels contains hundreds of quantitative
models
• SBML is an XML-based format for specifying
models and their parameters
• Models and their components are being
semantically annotated
• Use the ontologies to validate the assertions
Integrating systems biology models and biomedical ontologies.
Hoehndorf R, Dumontier M, Gennari JH, Wimalaratne S, de Bono B, Cook DL, Gkoutos GV.
BMC Syst Biol. 2011 Aug 11;5:124.
8 Protege Short Course::Dumontier:March 2012
Additional annotations are specified using the
Resource Description Framework (RDF)
<species metaid="_525530" id="GLCi"
Implicit subject compartment="cyto"
and xml attributes initialConcentration="0.097652231064563">
<annotation>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-
The annotation ns#" xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
element stores the xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#"
xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
RDF xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
subject <rdf:Description rdf:about="#_525530">
<bqbiol:is>
<rdf:Bag>
<rdf:li
predicate rdf:resource="urn:miriam:obo.chebi:CHEBI%3A4167"/>
<rdf:li rdf:resource="urn:miriam:kegg.compound:C00031"/>
</rdf:Bag>
</bqbiol:is>
object
</rdf:Description>
</rdf:RDF>
</annotation>
</species>
The intent is to express that the species represents a substance composed of glucose molecules
We also know from the SBML model that this substance is located in the cytosol and with a (initial)
concentration of 0.09765M
9
For each model annotation, we make a
commitment to what it represents
OWL Axiom:
M SubClassOf: represents some MaterialEntity
Conversion rule: a Model annotated with class C represents:
If C is a SubClassOf MaterialEntity then
M SubClassOf: represents some C
If C is a SubClassOf Function then
M SubClassOf: represents some (has-function some C)
If C is a SubClassOf Process then
M SubClassOf: represents some (has-function some (realized-
by only C))
10
Model verification
After reasoning, we found 27 models to be inconsistent
reasons
1. our representation - functions sometimes found in the place
of physical entities (e.g. entities that secrete insulin). better
to constrain with appropriate relations
2. SBML abused – e.g. species used as a measure of time
3. Incorrect annotations - constraints in the ontologies
themselves mean that the annotation is simply not possible
12
Finding inconsistencies with
axiomatically enhanced ontologies
ATPase activity (GO:0004002) is a Catalytic activity that has
Water and ATP as input, ADP and phosphate as output and is
a part of an ATP catabolic process.
To this, we add:
• GO: ATP + Water the only inputs (universal quantification)
• ChEBI: Water, ATP, alpha-D-glucose 6-phosphate are all
different (disjointness)
• “ATP” input to “ATPase” reaction, which is annotated with
ATPase activity. The species “ATP”, however, is mis-
annotated with Alpha-D-glucose 6-phosphate
(CHEBI:17665), not with ATP.
• Unsatisfiable -> curation error in BIOMD0000000176 and
BIOMD0000000177 models of anaerobic glycolysis in
yeast.
13
Classification:
Phosphotases
• Bioinformaticians use tools to identify functional
domains (e.g., InterProScan)
• Tools simply show the presence of domains -
they do not classify proteins
• Experts classify proteins according to domain
arrangements - the presence and number of
each domain is important
PhosphaBase: an ontology-driven database resource for protein phosphatases.
Wolstencroft KJ, Stevens R, Tabernero L, Brass A. Proteins. 2005 Feb 1;58(2):290-4.
14 Protege Short Course::Dumontier:March 2012
Defining Protein Phosphatases
• Necessary and sufficient conditions are
stipulated using EquivalentClass axioms
• A protein phosphatase is exactly a protein that
consists of exactly one transmembrane domain
and contains at least one phosphotase domain
ProteinPhosphatase
EquivalentTo:
Protein
AND hasDomain 1 transMembraneDomain
AND hasDomain min 1 PhosphataseCatalyticDomain
16 Protege Short Course::Dumontier:March 2012
More precise class expressions can
be formulated for subtypes
Inclusion of universal quantifier now restricts the domains
to only the types listed
R2A EquivalentTo:
Protein
AND hasDomain 2 ProteinTyrosinePhosphataseDomain
AND hasDomain 1 TransmembraneDomain
AND hasDomain 4 FibronectinDomains
AND hasDomain 1 ImmunoglobulinDomain
AND hasDomain 1 MAMDomain
AND hasDomain 1 Cadherin-LikeDomain
AND hasDomain only (TyrosinePhosphataseDomain OR
TransmembraneDomain OR FibronectinDomain OR
ImnunoglobulinDomain OR Clathrin-LikeDomain OR ManDomain)
17 Protege Short Course::Dumontier:March 2012
Describing chemical functional groups in OWL-DL
for the classification of chemical compounds
methyl group
hydroxyl group
Ethanol
Knowledge of functional Functional groups describe
groups is important in chemical reactivity in terms of
chemical synthesis, atoms and their connectivity, and
pharmaceutical design and exhibits characteristic chemical
lead optimization. behavior when present in a
compound.
N Villanueva-Rosales, M Dumontier. 2007. OWLED, Innsbruck, Austria.
18 Protege Short Course::Dumontier:March 2012
Describing Functional Groups in DL
R group
O
R H
HydroxylGroup:
CarbonGroup that (hasSingleBondWith some (OxygenAtom that
hasSingleBondWith some HydrogenAtom)
19 Protege Short Course::Dumontier:March 2012
Question Answering:
Classes as self-contained queries
• Query PubChem, DrugBank and dbPedia
23 Protege Short Course::Dumontier:March 2012
Querying Kidney and Urinary
Knowledge Base and Ontology
Query: What are the genes involved in
Proteins transport expressed in Proximal Tubule Epithelial Cell?
Entre gene
KUPO Ontology
Gene X GO:0054426 PT epithelial cell
go:biological_process MA:00345
rdfs:label
Gene Y
ro:part_of kupo:002444
Higgings Dataset Proximal tubule
DT epithelial cell
MA:000345 MA:00456
Gene X rdfs:label
kupo:expressed_in Distal tubule
ro:part_of kupo:004672
Gene Y
MA:00456
kupo:expressed_in
24 Protege Short Course::Dumontier:March 2012
Semantic Annotation and Query
ArrayExpress
Curation Curation
>250,000
Assays ATLAS
AE/GEO acquire >10,000 Re-annotate & summarize
experiments
Ontologically Modeling Sample Variables in Gene Expression Data
malone@ebi.ac.uk
25 Protege Short Course::Dumontier:March 2012
ontology-based data exploration
Query for Cell adhesion genes in all „organism parts‟
„View on EFO‟
Ontologically Modeling Sample Variables in Gene Expression Data
malone@ebi.ac.uk
26 Protege Short Course::Dumontier:March 2012
Ontology-based query expansion for ArrayExpress
Archive @ www.ebi.ac.uk/arrayexpress
27 Protege Short Course::Dumontier:March 2012
Semantic Assistant
services relevant for the user's current task are offered directly within a desktop
application. This approach relies on ontology-described semantic web services
to provide external natural language processing (NLP) pipelines
Leverage of OWL-DL axioms in a Contact Centre for Technical Product Support
Alex Kouznetsov, Bradley Shoebottom, René Witte, Christopher JO Baker. OWLED 2010.
29 Protege Short Course::Dumontier:March 2012
Plug-in for Open Office Client
30 Protege Short Course::Dumontier:March 2012
• HyQue helps construct and evaluate
(automatically obtain support for) hypotheses
using formalized background knowledge and
data using the Semantic Web
• HyQue makes it possible to develop a reliability
model around data based on our scientific
expectations of corroborating evidence
Callahan A, Dumontier M, Shah NH. HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed
Semantics. 2011 May 17;2 Suppl 2:S3.
Callahan A, Dumontier M. Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended
Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012. Accepted.
31 Protege Short Course::Dumontier:March 2012
Hypothesis
h1: • simple event-
e1 (Gal4p induces expression of GAL1) based
expression
h2:
e2 (Gal3p induces expression of GAL2 • conjunctive
hypothesis –
e3 AND Gal4p induces expression of GAL7) must satisfy
two
h3: expressions
e4 (Gal4p induces expression of GAL7
e5 AND Gal80p inhibits production of Gal4p • conjunctive
when GAL3 is over-expressed hypothesis
with
e6 AND Gal80p induces expression of GAL7) conditional
expression
32 Protege Short Course::Dumontier:March 2012
HYQUE ARCHITECTURE
Callahan A, Dumontier M, Shah NH. HyQue: evaluating hypotheses using Semantic Web
technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.
Callahan A, Dumontier M. Evaluating scientific hypotheses using the SPARQL Inferencing Notation.
Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012. Accepted.
33 Protege Short Course::Dumontier:March 2012
Rule-based assessment of evidence
• „induce‟ rule (maximum score: 5):
– Is event negated? GO:0010628
• If yes, subtract 2
– Is logical operator „induce‟? CHEBI:36080
• If yes, add 1; if no, subtract 1
– Is agent of type „protein‟ or „RNA‟?
• If yes, add 1; if of type „gene‟, subtract 1
– Is target of type „gene‟? SO:0000236
• If yes, add 1; if no, subtract 1
– Does agent have known „transcription factor activity‟?
• If yes, add 1 GO:0003700
– Is event located in the „nucleus‟?
• If yes, add 1; if no, subtract 1
GO:0005634
34 Protege Short Course::Dumontier:March 2012
Linked Open Results :
from hypothesis to evidence
35 Protege Short Course::Dumontier:March 2012
Literature-Based Enrichment Analysis
• Enrichment analysis on terms extracted using a target
ontology for associated articles.
Enabling enrichment analysis with the Human Disease Ontology. Paea LePendu, , Mark A. Musen, Nigam H. Shah.
Journal of Biomedical Informatics. Volume 44, Supplement 1, December 2011, Pages S31–S38
36 Protege Short Course::Dumontier:March 2012
Phenotype-based predictions
Phenotypes can be used as a substrate to
cluster similar diseases, identify potential
model systems, predict potential disease-
treating drugs or their adverse events, drug
repurposing, etc
Robert Hoehndorf, Paul N. Schofield and Georgios V. Gkoutos. PhenomeNET: a whole-
phenome approach to disease gene discovery. Nucleid Acids Research, 2011.
Linking pharmgkb to phenotype studies and animal models of disease for drug repurposing.
Hoehndorf R, Oellrich A, Rebholz-Schuhmann D, Schofield PN, Gkoutos GV. Pac Symp
Biocomput. 2012:388-99.
CK Chen, CJ Mungall, GV Gkoutos et al. MouseFinder: candidate disease genes from mouse
phenotype data. Human Mutation 2012
38 Protege Short Course::Dumontier:March 2012
Tetralogy of
Fallot
Phenotype ontologies should
contain descriptions of
morphological, behavioural,
physiological, developmental
characteristics
Human Phenotype Ontology
OMIM
39 Protege Short Course::Dumontier:March 2012
Compare Diseases based on their
Phenotypes
Comparison using Weighted Jaccard – uses information content for a
phenotype regarding genotype or disease
40 Protege Short Course::Dumontier:March 2012
Inferring equivalent phenotypes by
reasoning over OWL ontologies
human „overriding aorta [HP:0002623]‟ EquivalentTo:
„phenotype of‟ some („has part‟ some („aorta [FMA:3734]‟ and „overlaps with‟ some
„membranous part of interventricular septum [FMA:7135]‟)
mouse „overriding aorta [MP:0000273 ]‟ EquivalentTo:
„phenotype of‟ some („has part‟ some („aorta [MA:0000062]‟ and „overlaps with‟
some „membranous interventricular septum [MA:0002939]‟
Uberon super-anatomy ontology provides inter-species mappings
„aorta [FMA:3734]‟ EquivalentTo: „aorta [MA:0002939]‟
„membranous part of interventricular septum [FMA:3734]‟ EquivalentTo:
„membranous interventricular septum [MA:0000062]
Thus, „overriding aorta [HP:0002623] EquivalentTo:„overriding aorta[MP:0000273]‟
41 Protege Short Course::Dumontier:March 2012
Identifying potential mouse models
for human diseases
Quantitative ROC Analysis prediction
against curated models yields 0.89 AUC
Prediction of Tetralogy of Fallot added by
MGI
42 Protege Short Course::Dumontier:March 2012
Conclusion
• OWL has come of age and can be used in
an increasing number of scientific
investigations and applications
• OWL applications cover knowledge
capture, formalization, verification,
classification, semantic annotation, query
formulation, query answering, search,
hypothesis testing and prediction
43 Protege Short Course::Dumontier:March 2012
Notes de l'éditeur
So slide 1 is the workflow we have for getting data into the two repositories. We take publicly submitted data, often accompanying a publication, and we also have a pipeline importing data from GEO which we re-annotate. Data is ran through some scripts to check minimum QC then it is manually curated which then goes into ArrayExpress. The second part of that pipeline is that a subset of that data, selected based on array design type, is re-annotated against ontology terms. We use the gene annotations plus ontology terms to integrate and analyse on a per gene per condition level and summarise the diff expression for each gene vs condition - this is Atlas.Second slide is showing the gene expression atlas. you can explore the data using the ontology (far right), it expands based on the tree and on synonym (annotation properties). The view you see is based on the ontology tree.Third slide is ArrayExpress which is not annotated with EFO URIs but here we use the ontology to drive query expansion simply using it as a vocabulary. A search for cancer would expand on all subtypes and perform a Lucene query across the data for anything matching this text. This will change in near future as we are aiming to annotate both against the ontology.
The RDF representing the evaluation of the input hypothesis is linked to both the hypothesis AND the data used to evaluate the hypothesis
Workflow for generating background annotation sets for enrichment analysis: First, we start with a corpus of PubMed articles identified in manually curated GO annotations. These curated annotations provide gene-to-article associations. Next, we annotate the titles and abstracts of each article with ontology terms using the NCBO Annotator service. Terms associations can be expanded based on inferred hierarchical relationships. Finally, the gene-to-article associations are linked with the curated article-to-term associations to obtain a list of gene-to-term associations. The resulting term frequencies provide a background set for enrichment analysis