Web Science - ISoLA 2012

Using OWL Domain Models as
Abstract Workflow Models

Or...
Conducting in silico research in the Web
from hypothesis to publication

Mark Wilkinson
Isaac Peral Senior Researcher in Biological Informatics
Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain
Adjunct Professor of Medical Genetics, University of British Columbia
Vancouver, BC, Canada.

Context
“While it took 2,300 years after the first
report of angina for the condition to be
commonly taught in medical
curricula, modern discoveries are
being disseminated at an increasingly
rapid pace. Focusing on the last 150
years, the trend still appears to be
linear, approaching the axis around
2025.”

The Healthcare Singularity and the Age of Semantic Medicine,
Michael Gillam et al., in The Fourth Paradigm: Data-Intensive
Scientific Discovery, Tony Hey (Editor), 2009

Slide adapted with permission from Joanne Luciano, Presentation
at Health Web Science Workshop 2012, Evanston IL, USA
June 22, 2012.

“The Singularity”

The X-intercept is the point at which a discovery
is put into practice the moment it is made

(not only medical practice, but any research endeavour...)


The technology required
to achieve this
does not yet exist

You Are Here

Scientific research would have to be conducted
within a medium that
immediately interpreted and disseminated
the results...


...in a form that immediately (actively!) affected the
research of others...


...without requiring them to be aware
of these new discoveries.

To achieve this vision

We must learn how to
do research IN the Web

Not OVER the Web

I’d like to show you how close
we now are to this vision

and how we got there

We wanted to duplicate
a real, peer-reviewed, bioinformatics analysis

simply by building a model in the Web
describing what the answer
(if one existed)
would look like

...the machine had to make
every other decision
on its own

Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies
data mining to predict novel ING-protein interactions in human. BMC Genomics 9, 426 (2008).

Original Study Simplified

Using what is known about interactions in fly & yeast

predict new interactions with your
human protein of interest

Abstracted

Given a protein P in Species X

Find proteins similar to P in Species Y
Retrieve interactors in Species Y
Sequence-compare Y-interactors with Species X genome
(1)  Keep only those with homologue in X

Find proteins similar to P in Species Z
Retrieve interactors in Species Z
Sequence-compare Z-interactors with (1)

Result: putative interactors in Species X
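
The abstracted steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three lookup tables and every identifier in them (P, y1, x2, ...) are made up, and "sequence-compare" is reduced to a dictionary lookup. It is not the pipeline used in the original study.

```python
# Toy, made-up data: every identifier below is hypothetical.
SIMILAR = {  # (protein, species) -> proteins similar to it in that species
    ("P", "Y"): {"y1"},
    ("P", "Z"): {"z1"},
}
INTERACTS = {  # protein -> known interactors within the same species
    "y1": {"y2", "y3"},
    "z1": {"z2", "z4"},
}
HOMOLOG_IN_X = {  # model-organism protein -> its homologue in Species X
    "y2": "x2",
    "y3": "x3",
    "z2": "x2",
    "z4": "x9",
}


def candidates_via(protein, species):
    """Interactors of the protein's counterparts in `species`, mapped back to Species X."""
    found = set()
    for similar in SIMILAR.get((protein, species), set()):
        for partner in INTERACTS.get(similar, set()):
            homolog = HOMOLOG_IN_X.get(partner)
            if homolog is not None:  # keep only those with a homologue in X
                found.add(homolog)
    return found


def putative_interactors(protein, species_y, species_z):
    """Keep the Species X proteins supported by both comparator species."""
    step_1 = candidates_via(protein, species_y)  # set (1) in the outline
    return step_1 & candidates_via(protein, species_z)


print(sorted(putative_interactors("P", "Y", "Z")))  # ['x2']
```

Only x2 survives because it is the one Species X protein whose homologues interact with P's counterparts in both comparator species.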

Modeling the answer...

OWL

Web Ontology Language (OWL) is the
language approved by the W3C
for representing knowledge in the Web


Note that every word in
this diagram is, in reality, a
URL (because it is OWL)


The model of a Potential
Interactor is published in
The Web

It utilizes concepts from
other models published in
The Web
(ours and others’)
by referencing their URLs


The model of a Potential
Interactor is a network of
concepts distributed
within the Web

It will be affected by
changes to those concepts

We do not “own” all of
those concepts!


ProbableInteractor
is homologous to
(Potential Interactor from ModelOrganism1…)
and
(Potential Interactor from ModelOrganism2…)

Probable Interactor is defined in OWL as a subclass of Potential Interactor
that requires homologous pairs of interacting proteins to exist in both
comparator model organisms.

(Effectively, an intersection)

Publish our OWL model of a Probable Interactor

in the Web

Running a Web Science 2.0
Experiment

In a local data-file

provide the protein we are interested in

and the two species we wish to use in our comparison

taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly

The tricky bit is...

In the abstract, the
search for homology is
“generic” – ANY model
organism.

But when the machine
attempts to do the
experiment, it will have
to use several different
and specific resources
because our question
specifies two different
species

taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly
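
One way to picture the gap between the generic model and the specific resources is a lookup from each declared model organism to a species-specific service. The sketch below is an assumption-laden illustration: the registry, its example.org URLs, and the helper names are hypothetical, and the tiny line reader handles only the one-line "subject a class ." statements shown on the slide (it is not an N3 parser).

```python
# The four statements from the input file shown on the slide.
INPUT = """\
taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly
"""

# Hypothetical registry: taxon -> homology-search service for that species.
# These URLs are placeholders, not real SADI endpoints.
HOMOLOGY_SERVICES = {
    "taxon:4932": "http://example.org/services/yeast-homology",
    "taxon:7227": "http://example.org/services/fly-homology",
}


def read_roles(text):
    """Map each class name to the individual typed with it."""
    roles = {}
    for line in text.splitlines():
        statement = line.split("#", 1)[0].strip()  # drop the trailing comment
        parts = statement.split()
        if len(parts) == 4 and parts[1] == "a" and parts[3] == ".":
            subject, _, cls, _ = parts
            roles[cls] = subject
    return roles


def choose_services(roles, registry):
    """Resolve the generic 'model organism' roles to species-specific services."""
    return {
        role: registry[taxon]
        for role, taxon in roles.items()
        if role.startswith("i:ModelOrganism")
    }


roles = read_roles(INPUT)
for role, service in sorted(choose_services(roles, HOMOLOGY_SERVICES).items()):
    print(role, "->", service)
```

Swapping taxon:4932 for another species in the input would change which service is chosen without touching the model.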

This is the question we ask:
(the query language here is SPARQL)

PREFIX i: <http://sadiframework.org/ontologies/InteractingProteins.owl#>

SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {

?protein a i:ProbableInteractor .

}

The reference (URL) to our OWL model of the answer

Our system then derives (and executes) the following workflow automatically

These are different
Web services!

...selected at run-time
based on the same model

There are three very cool things about what you just saw...


The system was able to
create a workflow based on
an OWL model (ontology)


The system was able to create a
COMPUTATIONAL workflow
based on a BIOLOGICAL model


The workflow it created
(i.e. the services chosen)
differed depending on context

taxon:4932 a i:ModelOrganism1 . # yeast

taxon:7227 a i:ModelOrganism2 . # fly

We got the answer

“simply” by designing a model of the answer!

Design Pattern for
Web Services on the Semantic Web

A Web application that answers
SPARQL-DL queries

Query-answering
Enhanced by SADI

What is the phenotype of every allele of the
Antirrhinum majus DEFICIENS gene?

SELECT ?allele ?image ?desc

WHERE {
locus:DEF genetics:hasVariant ?allele .
?allele info:visualizedByImage ?image .
?image info:hasDescription ?desc
}


Note that there is no “FROM” clause!
We don’t tell it where to get the information;
the machine has to figure that out by itself...

Enter that query into
SHARE

SHARE examines available SADI Web Services
...and in a few seconds you get your answer.
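
To make the "no FROM clause" idea concrete, here is a minimal sketch of matching each triple pattern in the query to a service that can generate that predicate. It is an illustration only, not SHARE's actual algorithm: the registry contents, the example.org URLs, and the function name are all assumptions.

```python
# Hypothetical registry: predicate -> services that can attach that predicate.
# The URLs are placeholders, not real SADI services.
REGISTRY = {
    "genetics:hasVariant": ["http://example.org/sadi/allelesOfLocus"],
    "info:visualizedByImage": ["http://example.org/sadi/imagesOfAllele"],
    "info:hasDescription": ["http://example.org/sadi/descriptionOfImage"],
}

# The three triple patterns from the WHERE clause on the slide.
PATTERNS = [
    ("locus:DEF", "genetics:hasVariant", "?allele"),
    ("?allele", "info:visualizedByImage", "?image"),
    ("?image", "info:hasDescription", "?desc"),
]


def plan(patterns, registry):
    """Order patterns so each subject is known before its pattern is resolved."""
    bound = set()  # variables that an earlier step will have filled in
    remaining = list(patterns)
    steps = []
    while remaining:
        for pattern in remaining:
            subject, predicate, obj = pattern
            if not subject.startswith("?") or subject in bound:
                services = registry.get(predicate)
                if not services:
                    raise LookupError(f"no service generates {predicate}")
                steps.append((predicate, services))
                if obj.startswith("?"):
                    bound.add(obj)
                remaining.remove(pattern)
                break
        else:
            raise ValueError("no resolvable pattern left")
    return steps


for predicate, services in plan(list(reversed(PATTERNS)), REGISTRY):
    print(predicate, "->", services[0])
```

The patterns are passed in reverse on purpose: the plan still starts from locus:DEF, the only subject known up front.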

The query results are live hyperlinks
to the respective databases or images
(the answer is IN the Web!)

What pathways does UniProt protein P47989 belong to?

PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>
SELECT ?gene ?pathway
WHERE {
uniprot:P47989 pred:isEncodedBy ?gene .
?gene ont:isParticipantIn ?pathway .
}


Note again that there is no “FROM” clause…

I have not told SHARE where to look for the
answer, I am simply asking my question

Two different providers of gene information
(KEGG and NCBI) were found & accessed

Two different providers of pathway information
(KEGG & GO) were found & accessed

The results are all links to the original data
(The answer is IN the Web!)

Show me the latest Blood Urea Nitrogen and Creatinine levels
of patients who appear to be rejecting their transplants
(I showed you this query at ISoLA 2010… sorry for repeating myself!)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
?patient rdf:type patient:LikelyRejecter .
?patient l:latestBUN ?bun .
?patient l:latestCreatinine ?creat .
}


Likely Rejecter:

A patient who has creatinine levels
that are increasing over time

(Mark D Wilkinson’s definition)

Likely Rejecter:

…but there is no “likely rejecter”
column or table in our database…
only blood chemistry measurements
at various time-points

Likely Rejecter:

So the data required to answer this question
DOESN’T EXIST!

SHARE “decomposes” the
Likely Rejecter OWL class
into its constituent property restrictions

Each property restriction in the Class
is matched with a SADI Service

The matched SADI Service can
generate data that has that property

SHARE chains these SADI services
into a workflow...

...the outputs from that workflow are
Instances (OWL Individuals)
of the Likely Rejecter OWL Class
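
The decompose-match-chain idea can be sketched as a small backward-chaining routine. Everything here is hypothetical: the service names, the property names, and the single restriction stand in for whatever the real Likely Rejecter class and SADI registry contain, and this is not SHARE's implementation.

```python
# Hypothetical services: name -> (properties needed on input, property it attaches).
SERVICES = {
    "getCreatinineMeasurements": (set(), "hasCreatinineMeasurement"),
    "linearRegression": ({"hasCreatinineMeasurement"}, "hasRegressionSlope"),
}

# Hypothetical decomposition of the class into the properties it requires.
RESTRICTIONS = ["hasRegressionSlope"]


def build_workflow(needed, services):
    """Backward-chain from the needed properties to an ordered list of services."""
    workflow = []

    def require(prop, seen):
        if prop in seen:
            raise ValueError(f"circular dependency on {prop}")
        for name, (inputs, output) in services.items():
            if output == prop:
                # First make sure everything this service consumes is produced.
                for dependency in sorted(inputs):
                    require(dependency, seen | {prop})
                if name not in workflow:
                    workflow.append(name)
                return
        raise LookupError(f"no service generates {prop}")

    for prop in needed:
        require(prop, frozenset())
    return workflow


print(build_workflow(RESTRICTIONS, SERVICES))
# ['getCreatinineMeasurements', 'linearRegression']
```

The regression service is only usable once measurements exist, so the measurement service is scheduled first even though the class never mentions it directly.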

For example… SHARE utilizes SADI to discover
analytical services on the Web that do linear regression analysis;

required for the “increasing over time” part of the Class definition
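
As a worked example of the “increasing over time” test, the sketch below computes a least-squares slope over (time, creatinine) pairs and treats a positive slope as “increasing”. The measurements are made up, the units are unspecified, and the zero threshold is an assumption for illustration, not a clinical rule or the logic of any actual SADI service.

```python
def slope(points):
    """Least-squares slope of a list of (time, value) points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in points)
    denominator = sum((t - mean_t) ** 2 for t, _ in points)
    if denominator == 0:
        raise ValueError("measurements must span more than one time-point")
    return numerator / denominator


def is_likely_rejecter(creatinine_series):
    """Assumed rule: creatinine is 'increasing over time' if the slope is positive."""
    return slope(creatinine_series) > 0


# Made-up measurements as (time-point, creatinine level).
rising = [(0, 1.0), (1, 1.2), (2, 1.5)]
falling = [(0, 1.5), (1, 1.2), (2, 1.0)]

print(is_likely_rejecter(rising))   # True
print(is_likely_rejecter(falling))  # False
```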

SHARE examines the OWL Class

Gathers, from the Web, the ontologies that are
referenced by that Class

then uses those ontological properties to identify
which data-sources and analytical tools it must
access to create data matching that Class definition

The way SHARE builds the workflow varies
depending on the context of the query
(i.e. which data/ontologies it reads – Mine? Yours?)

and on what part of the query
it is trying to answer at any given moment
(which ontological concept is relevant to that clause)

Our system derives and executes the workflow automatically
using an OWL ontology that describes the biology

The analytical tools chosen for that
workflow were determined based on

context

even though the biological (ontological)
model driving their selection was the
same

i.e.

The published model is re-usable


In different contexts... by different researchers

Because the model IS the experiment

the published EXPERIMENT is re-usable!!

Simply point the same query at your own dataset...

The

scientific publication

is an

executable document!

Every component of the model

Every component of the input data

Every component of the output data

is a URL

Therefore the model, the question,
the experiment, and the results

are inherently IN the Web


The answer, and the knowledge derived from it,
is immediately available to Web search engines
and moreover, can instantly affect the outcome of
other Web Science experiments

Change the way we think of “hypotheses”

In Web Science 2.0

Model what the world would “look like”
if your hypothesis were true

Then ask “is there any data that
fits that model?”

Please join us!

SADI and SHARE are Open-Source projects

http://sadiframework.org

University of British Columbia

Luke McCarthy (Lead Dev.): Everything...
Edward Kawas: SADI Service auto-generator
Benjamin VanderValk: SHARE & SADI & myHeath Button
Ian Wood: Experimental modeling & Experimental modeling project

Soroush Samadian
Cardiovascular data modeling and queries

C-BRASS Collaborators at other sites

U of New Brunswick
Dr. Chris Baker
Alexandre Riazanov

Carleton University
Dr. Michel Dumontier
Marc-Alexandre Nolin
Leonid Chepelev
Steve Etlinger
Nichaella Kieth
Jose Cruz
