I gave this talk at the EDBT 2014 conference, which took place in Athens, Greece.
I show how data examples can be used to characterize the behavior of scientific modules. I present a new method that automatically generates the data examples, and show that such data examples help human users understand the task of a module and can assist curators in repairing broken workflows (i.e., workflows for which one or more modules are no longer supplied by their providers).
Annotating Scientific Modules Behavior Using Data Examples
1. Annotating the Behavior of
Scientific Modules Using Data
Examples: A Practical Approach
Khalid Belhajjame
Université Paris-Dauphine, LAMSADE
Khalid.Belhajjame@dauphine.fr
2. Scientific Workflows
We have recorded a dramatic
increase in the number of scientists
who utilize scientific modules as
building blocks in the composition of their
experiments.
In 2011, the EBI recorded 21
million invocations of the
scientific modules they host.
Typically, an experiment is designed
as a workflow, the steps of which
represent invocations of scientific
modules.
3. Scientific Module Annotation
Semantic annotations can be used to describe scientific modules.
Existing semantic annotations are confined to the description of
module parameters.
Annotations describing the
behavior of modules, i.e., the
task they perform, are rarely available.
Designing an ontology that captures precisely the behavior of modules is
challenging.
Proposal: To describe the behavior of scientific modules using data examples
6. Generating Data Examples
Data examples can be used as a means to
describe the behavior of scientific modules.
Enumerating all possible data examples that
can be used to describe a given module may be
expensive, and the resulting set may contain redundant
data examples that describe the same behavior.
Issue: which data examples should be used to characterize the functionality
of a given module?
Solution: We show how software testing techniques can be adapted
to the problem of generating data examples without relying on the
availability of the module specification, which often is not accessible.
7. Identifying the Classes of
Behavior of a Scientific Module
To generate data examples, we start by identifying the classes of
behavior of the module.
Consider a module m with an input parameter i. The
domain of legal values of i is divided into partitions p1, …,
pn. The partitioning is performed so as to cover all
classes of behavior of the module.
Doing so normally requires access to the module specification,
which is rarely available.
In this work, we use a different source of information, namely
the domain ontology used for annotating module parameters.
8. Identifying the Classes of
Behavior of a Scientific Module
An ontology can be viewed as a hierarchy of concepts.
We use this hierarchy to specify the classes of behavior
of scientific modules.
Consider the module getAccession,
which, given an input annotated as
biological sequence, returns the
accession used for its identification.
The domain of its input can be partitioned into the following concepts:
BiologicalSequence, NucleotideSequence, RNASequence,
DNASequence, and ProteinSequence.
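The partitioning step can be sketched as a traversal of the concept hierarchy. This is a minimal illustration, not the implementation used in the talk: the `ONTOLOGY` dictionary and the parent/child links in it are assumptions inferred from the concept names on the slide, and `partitions` is a hypothetical helper name.

```python
from collections import deque

# Hypothetical ontology fragment (assumed hierarchy): concept -> direct subconcepts.
ONTOLOGY = {
    "BiologicalSequence": ["NucleotideSequence", "ProteinSequence"],
    "NucleotideSequence": ["RNASequence", "DNASequence"],
    "RNASequence": [],
    "DNASequence": [],
    "ProteinSequence": [],
}


def partitions(concept, ontology):
    """Return the concept and all of its subconcepts, in breadth-first order."""
    found = []
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        if current not in found:
            found.append(current)
            queue.extend(ontology.get(current, []))
    return found


# Partitions of the input of a module whose input is annotated as BiologicalSequence.
input_partitions = partitions("BiologicalSequence", ONTOLOGY)
```

Under the assumed hierarchy, `input_partitions` contains the five concepts listed on the slide, one partition per concept.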
9. Generating Data Examples Covering
Input Parameter Partitions
Given the partitions of input parameters identified
using the domain ontology, and given a pool of
annotated instances, the input values necessary for
constructing data examples can be automatically
identified:
Data examples covering the partitions in question can
then be constructed by invoking the module using the
input values identified.
[...] that cover those partitions. Such data examples can be specified by soliciting from the human annotator example input values that belong to the respective partitions, and then invoking the module m to obtain the corresponding output values, necessary for constructing the data examples. The construction of such data examples can, however, be fully automated if a pool of annotated instances is available. Specifically, given pl, a pool of annotated instances, the values of i necessary for constructing data examples that cover the partitions of the input i of the module m can be obtained as follows:
$\{ \langle c, \mathit{getInstance}(c, pl) \rangle \ \text{s.t.}\ c \sqsubseteq \mathit{sem}(i) \}$
where getInstance(c, pl) is a function that returns an instance of the concept c from the annotated pool of instances pl. Note that this function returns a realization of the concept in question [25], in the sense that the instance of c chosen is not an instance of any strict subconcept of c, i.e. not an instance of any concept $c' \sqsubset c$.
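The selection of input values and the construction of data examples can be sketched as follows. This is a hedged illustration only: the ontology fragment, the pool contents, and the stand-in module `toy_module` are all hypothetical, and the sketch assumes each pool value is annotated with its most specific concept, so that an exact annotation match yields a realization.

```python
# Hypothetical ontology fragment (assumed hierarchy): concept -> direct subconcepts.
ONTOLOGY = {
    "BiologicalSequence": ["NucleotideSequence", "ProteinSequence"],
    "NucleotideSequence": ["RNASequence", "DNASequence"],
    "RNASequence": [],
    "DNASequence": [],
    "ProteinSequence": [],
}

# Hypothetical pool of annotated instances: (value, most specific concept).
POOL = [
    ("ACGTTGCA", "DNASequence"),
    ("ACGUUGCA", "RNASequence"),
    ("MKTAYIAK", "ProteinSequence"),
]


def subconcepts(concept, ontology):
    """Return the concept followed by all of its (transitive) subconcepts."""
    result = [concept]
    for child in ontology.get(concept, []):
        result.extend(subconcepts(child, ontology))
    return result


def get_instance(concept, pool):
    """Return a value annotated with exactly `concept` (a realization), or None."""
    for value, annotation in pool:
        if annotation == concept:
            return value
    return None


def input_values(sem_i, ontology, pool):
    """Pairs (c, getInstance(c, pl)) for every concept c subsumed by sem(i)."""
    return [(c, get_instance(c, pool)) for c in subconcepts(sem_i, ontology)]


def data_examples(module, sem_i, ontology, pool):
    """Invoke the module on each available instance: (concept, input, output)."""
    return [
        (c, value, module(value))
        for c, value in input_values(sem_i, ontology, pool)
        if value is not None  # partitions without an instance cannot be covered
    ]


def toy_module(sequence):
    """Hypothetical stand-in for a scientific module: returns the sequence length."""
    return len(sequence)


examples = data_examples(toy_module, "BiologicalSequence", ONTOLOGY, POOL)
```

In this toy pool, the concepts BiologicalSequence and NucleotideSequence have no realization, so no data example is produced for them; a richer pool would be needed to cover those partitions.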
10. Generating Data Examples Covering
Output Parameter Partitions
The method for constructing data examples based on
the partitioning of the domains of output parameters
can be difficult to implement.
Given a partition po of the output parameter o of a
module m, we need to find input values for m such that
the output o takes a value that
belongs to the partition po.
One source that we use for identifying (some of the) data
examples that cover the output partitions is the set of
data examples generated to cover the partitions of the
input parameters.
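Reusing the input-driven data examples to check output coverage can be sketched as below. The concept names, the accession-like strings, and the `classify` rule are hypothetical placeholders; in practice, deciding which output partition a value belongs to would depend on the annotations available for that value.

```python
def covered_output_partitions(data_examples, output_partitions, classify):
    """Output partitions hit by at least one existing data example."""
    produced = {classify(output) for _input, output in data_examples}
    return [p for p in output_partitions if p in produced]


def uncovered_output_partitions(data_examples, output_partitions, classify):
    """Output partitions for which other input values still have to be found."""
    covered = set(covered_output_partitions(data_examples, output_partitions, classify))
    return [p for p in output_partitions if p not in covered]


# Hypothetical output partitions and (input, output) data examples.
OUTPUT_PARTITIONS = ["Accession", "ProteinAccession", "NucleotideAccession"]
EXAMPLES = [("MKTAYIAK", "P12345"), ("ACGTTGCA", "X56734")]


def classify(value):
    """Hypothetical rule assigning an output value to its most specific concept."""
    return "ProteinAccession" if value.startswith("P") else "NucleotideAccession"


covered = covered_output_partitions(EXAMPLES, OUTPUT_PARTITIONS, classify)
missing = uncovered_output_partitions(EXAMPLES, OUTPUT_PARTITIONS, classify)
```

Any partition left in `missing` is one that the input-driven data examples do not reach, which is why this source covers only some of the output partitions.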
11. Evaluation
The method that we have just described is not an exact
method. Rather, it is a heuristic that provides a working
solution. Because of this:
The domain of a module may be over-partitioned, or,
conversely, it may be under-partitioned.
We therefore assessed the effectiveness of the proposed method
by generating data examples for 252 scientific modules.
Notice that the availability of a pool of annotated instances
is crucial to our method.
We constructed such a pool by harvesting existing
provenance traces of scientific workflows.
13. Coverage
We were able to construct data examples that cover all
the partitions of the input parameters.
Moreover, the data examples generated were found to
cover most of the partitions of the output parameters.
Indeed, with the exception of the partitions of the
outputs of 19 modules (e.g., get_genes_by_enzyme,
link and binfo), all the partitions of the outputs of the
remaining 233 modules were covered by the data
examples generated.
16. Understanding the Behavior of a
Module Using Data Examples
Question: Do data examples help human users understand
the behavior of scientific modules?
Evaluation exercise: given a module m, we adopted the
following two-step process:
1. In the first step, the user was asked to describe the
behavior of a module based on its name, the names of its
input and output parameters, and the structural and
semantic types of those parameters.
2. In the second step, the user was additionally given the data examples that
characterize the module and was asked to update the
description of the module’s behavior if they deemed it necessary.
18. Understanding the Behavior of a
Module Using Data Examples
An analysis of the results and the modules showed that whether
human users were able to identify the behavior of a module is correlated
with the nature of the transformation carried out by the module.
The human users correctly identified the behavior of modules
implementing data retrieval, format transformation and identifier
mappings.
On the other hand, they were less successful with modules implementing
data filtering and complex data analysis, such as text mining.
Kind of data manipulation    # of modules
Format transformation        53
Data retrieval               51
Mapping identifiers          62
Filtering                    27
Data analysis                59
Table 3: Kinds of data manipulation carried out by the scientific modules.
[...] complex data analysis, data examples may not have the same value as for other module kinds, as far as the human user is concerned. Note, however, that a large proportion of scientific modules implement format transformation, data retrieval and mapping identifiers, which are referred to in the scientific workflow literature using the term Shims [35]. For example, Table 3 classifies the modules that we analyzed in the experiment. It shows that format transformation, data retrieval and mapping identifiers modules represent between them 66% of the total number of modules that we analyzed. That said, it is worth stressing, as we will demonstrate in the next [...]
20. Comparing Scientific Modules
Using Data Examples
As well as understanding
scientific modules, users may
be interested in comparing the
behavior of two or more
modules.
Module comparison, as a
functionality, is particularly
requested by workflow
curators to repair broken
workflows.
21. Comparing Scientific Modules
Using Data Examples
Consider two modules m and m’, and consider that
the inputs and outputs of those modules are
semantically and structurally compatible.
To be able to compare the behavior of m and m’, we
generate data examples that characterize their behavior
using the method presented earlier.
However, to make the comparison of their behavior
straightforward, we generate the data examples of m
and m’ in such a way that they share the same
input values.
22. Comparing Scientific Modules
Using Data Examples
By comparing the output values of the data examples
of m and m’ that have the same input values, we
determine if the two modules have behaviors that are:
Equivalent: the data examples of the two modules have
the same output values
Overlapping: Some (but not all) of the data examples of
the two modules have the same output values.
Disjoint: None of the data examples of the two modules
have the same output values.
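The three-way classification above can be sketched as a small function. This is an illustrative sketch under the assumption that the data examples of each module are represented as a dictionary from input value to output value and that both modules were run on the same inputs; `compare_modules` and the sample values are hypothetical.

```python
def compare_modules(examples_m, examples_m_prime):
    """Classify two modules as 'equivalent', 'overlapping' or 'disjoint'.

    Each argument maps an input value to the output value the module
    produced for it. Only inputs present in both mappings are compared.
    """
    shared_inputs = examples_m.keys() & examples_m_prime.keys()
    if not shared_inputs:
        raise ValueError("the modules have no data examples with the same input")
    matches = sum(
        1 for value in shared_inputs
        if examples_m[value] == examples_m_prime[value]
    )
    if matches == len(shared_inputs):
        return "equivalent"
    if matches == 0:
        return "disjoint"
    return "overlapping"


# Hypothetical data examples for two modules run on the same inputs.
m_examples = {"ACGTTGCA": "X56734", "MKTAYIAK": "P12345"}
m_prime_examples = {"ACGTTGCA": "X56734", "MKTAYIAK": "Q99999"}
relation = compare_modules(m_examples, m_prime_examples)
```

Here the two modules agree on one input and differ on the other, so `relation` is `"overlapping"`. Note that the verdict is relative to the data examples used, not a proof about the modules on all possible inputs.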
23. Evaluation
To assess the effectiveness of the above method for
comparing modules’ behavior, we used it to assist in
the curation of broken workflows.
We were able to identify 72 modules that are in the
composition of scientific workflows (in the
myExperiment repository), that are no longer provided
by their suppliers, and for which we were able to
construct data examples.
We compared those modules with the 252 modules
that we characterized using data examples.
26. Conclusions
We showed that it is possible to characterize scientific
modules using data examples without relying on module
specifications.
We also presented two functionalities that utilize the
generated data examples.
Understanding the module behavior by human users
Module comparison
Research questions for future work:
How can we make data examples more concise (less redundant)?
How can we compose modules based only on data examples?