Annotating the Behavior of
Scientific Modules Using Data
Examples: A Practical Approach
Khalid Belhajjame
Université Paris-Dauphine, LAMSADE
Khalid.Belhajjame@dauphine.fr
Scientific Workflows
We have recorded a dramatic increase in the number of scientists
who use scientific modules as building blocks in the composition
of their experiments.
In 2011, the EBI recorded 21 million invocations of the
scientific modules it hosts.
Typically, an experiment is designed as a workflow, the steps of which
represent invocations of scientific modules.
Scientific Module Annotation
Semantic annotations can be used to describe scientific modules.
Existing semantic annotations are confined to the description of
module parameters.
Annotations describing the behavior of modules, that is, the task
they perform, are rarely available.
Designing an ontology that captures precisely the behavior of modules is
challenging.
Proposal: To describe the behavior of scientific modules using data examples
Outline
[Figure: overview of the approach]
Annotation: Annotate Module Parameters, Generate Data Examples
Use: Explore and Understand Modules, Compare Modules
Actors: Curator, Experiment Designer
Other labels: Scientific Module Registry; APIHUT, Radiant, Meteor-s; Galaxy, Taverna, Vistrails
Generating Data Examples
Data examples can be used as a means to
describe the behavior of scientific modules.
Enumerating all possible data examples that
can be used to describe a given module may be
expensive, and the resulting set may contain redundant
data examples that describe the same behavior.
Issue: which data examples should be used to characterize the functionality
of a given module?
Solution: We show how software testing techniques can be adapted
to the problem of generating data examples without relying on the
availability of the module specification, which often is not accessible.
Identifying the Classes of
Behavior of a Scientific Module
To generate data examples, we start by identifying the classes of
behavior of the module.
Consider a module m with an input parameter i. The
domain of legal values of i is divided into partitions p1, …,
pn. The partitioning is performed so as to cover all
classes of behavior of the module.
To do so, we need access to the module specification, which is
rarely available.
In this work, we use a different source of information, namely
the domain ontology used for annotating module parameters.
Identifying the Classes of
Behavior of a Scientific Module
An ontology can be viewed as a hierarchy of concepts.
We use this hierarchy to specify the classes of behavior
of scientific modules
Consider the module getAccession,
which given an input annotated as
biological sequence returns the
accession used for its identification.
The input domain of this module can be partitioned into the following:
BiologicalSequence, NucleotideSequence, RNASequence,
DNASequence, and ProteinSequence.
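A minimal sketch of this partitioning, assuming the ontology is available as a simple child-to-parent mapping. The hierarchy below is a hypothetical fragment built from the concepts named above (the exact parent of each concept is an assumption), and `partitions` is an illustrative helper, not part of any existing tool.

```python
# Hypothetical ontology fragment: each concept maps to its direct parent.
PARENT = {
    "NucleotideSequence": "BiologicalSequence",
    "ProteinSequence": "BiologicalSequence",
    "RNASequence": "NucleotideSequence",
    "DNASequence": "NucleotideSequence",
}


def is_subconcept(concept, ancestor, parent=PARENT):
    """Return True if `concept` equals `ancestor` or lies below it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = parent.get(concept)
    return False


def partitions(annotation, parent=PARENT):
    """Partitions of a parameter: its annotation concept plus all subconcepts."""
    concepts = set(parent) | set(parent.values())
    return sorted(c for c in concepts if is_subconcept(c, annotation, parent))


print(partitions("BiologicalSequence"))
print(partitions("NucleotideSequence"))
```

Annotating the input with a more specific concept, such as NucleotideSequence, yields fewer partitions, since only that concept and its subconcepts are considered.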
Generating Data Examples Covering
Input Parameter Partitions
Given the partitions of input parameters identified
using the domain ontology, and given a pool of
annotated instances, the input values necessary for
constructing data examples can be automatically
identified:
Data examples covering the partitions in question can
then be constructed by invoking the module using the
input values identified.
... that cover those partitions. Such data examples can be specified by
soliciting from the human annotator examples of input values that belong
to the respective partitions, and then invoking the module m
to obtain the corresponding output values, necessary for constructing
the data examples. The construction of such data examples
can, however, be fully automated if a pool of annotated instances is
available. Specifically, given pl, a pool of annotated instances, the
values of i necessary for constructing data examples that cover the
partitions of the input i of the module m can be obtained as follows:
$\{ \langle c, \mathit{getInstance}(c, pl) \rangle \ \text{s.t.}\ c \sqsubseteq \mathit{sem}(i) \}$
where getInstance(c, pl) is a function that returns an instance
of the concept c from the annotated pool of instances pl. Note that
this function returns a realization of the concept in question [25],
in the sense that the instance of c chosen is not an instance of any
strict subconcept of c, i.e. not an instance of any concept $c' \sqsubset c$.
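The selection of input values described above can be sketched as follows. This is only an illustration under stated assumptions: the ontology fragment, the pool contents and `toy_module` are hypothetical, and the pool is assumed to store each value together with its most specific concept, so that a realization of c is simply a value annotated exactly with c.

```python
# Hypothetical ontology fragment: each concept maps to its direct parent.
PARENT = {
    "NucleotideSequence": "BiologicalSequence",
    "ProteinSequence": "BiologicalSequence",
    "RNASequence": "NucleotideSequence",
    "DNASequence": "NucleotideSequence",
}

# Hypothetical pool pl: each value is stored with its most specific concept.
POOL = [
    ("ACGTTGCA", "DNASequence"),
    ("ACGUUGCA", "RNASequence"),
    ("MKTAYIAK", "ProteinSequence"),
]


def subsumed_by(concept, ancestor):
    """True if `concept` is `ancestor` or one of its (transitive) subconcepts."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def get_instance(concept, pool):
    """Return a realization of `concept`: a value whose most specific
    annotation is exactly `concept`, or None if the pool has none."""
    for value, most_specific in pool:
        if most_specific == concept:
            return value
    return None


def input_values(sem_i, pool):
    """Pairs (c, getInstance(c, pl)) for every concept c subsumed by sem(i)."""
    concepts = sorted(set(PARENT) | set(PARENT.values()))
    return [(c, get_instance(c, pool)) for c in concepts if subsumed_by(c, sem_i)]


def data_examples(module, sem_i, pool):
    """Invoke the module on each available input value to build data examples."""
    return [
        {"partition": c, "input": v, "output": module(v)}
        for c, v in input_values(sem_i, pool)
        if v is not None  # partitions without a realization in the pool are skipped
    ]


def toy_module(sequence):
    """Stand-in for a real module; it only reports the sequence length."""
    return len(sequence)


for example in data_examples(toy_module, "BiologicalSequence", POOL):
    print(example)
```

In this toy pool, BiologicalSequence and NucleotideSequence have no realization, so no data example is produced for those two partitions.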
Generating Data Examples Covering
Output Parameter Partitions
The method for constructing data examples based on
the partitioning of the domains of output parameters
can be difficult to implement.
Given a partition po of the output parameter o of a
module m, we need to find input values for which m
produces a value of o that belongs to the partition po.
One source that we use for identifying (some of) the data
examples that cover the output partitions is the set of
data examples generated to cover the partitions of the
input parameters.
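A rough sketch of this reuse, assuming each data example is an (input, output) pair and that some function can assign an output value to its concept. The classifier `toy_concept_of`, the concept names and the sample values are hypothetical and serve only to make the example runnable.

```python
def output_coverage(examples, output_partitions, concept_of):
    """Split the output partitions into those covered by the data examples
    and those still uncovered.

    examples          -- iterable of (input_value, output_value) pairs
    output_partitions -- concepts partitioning the output domain
    concept_of        -- function mapping an output value to its concept
    """
    seen = {concept_of(output) for _, output in examples}
    covered = [p for p in output_partitions if p in seen]
    uncovered = [p for p in output_partitions if p not in seen]
    return covered, uncovered


# Hypothetical classifier and concept names, for illustration only.
def toy_concept_of(accession):
    return "ProteinAccession" if accession.startswith("P") else "NucleotideAccession"


examples = [("MKTAYIAK", "P12345"), ("MSTNPKPQ", "P67890")]
partitions = ["ProteinAccession", "NucleotideAccession"]
covered, uncovered = output_coverage(examples, partitions, toy_concept_of)
print(covered)
print(uncovered)
```

Any partition left in `uncovered` would need input values found by other means, which is the part the slides describe as difficult.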
Evaluation
The method that we have just described is not an exact
method. Rather, it is a heuristic that provides a working
solution. Because of this:
The domain of a module may be over-partitioned, or
conversely, it may be under-partitioned.
We therefore assessed the effectiveness of the proposed method
for generating data examples on 252 scientific modules.
Note that the availability of a pool of annotated instances
is crucial to our method.
We constructed such a pool by harvesting existing
provenance traces of scientific workflows.
Evaluation: Metrics
Coverage
Completeness
Conciseness
Coverage
We were able to construct data examples that cover all
the partitions of the input parameters.
Moreover, the data examples generated were found to
cover most of the partitions of the output parameters.
The exceptions are the outputs of 19 modules, e.g.,
get_genes_by_enzyme, link and binfo. All the partitions
of the outputs of the remaining 233 modules were covered
by the data examples generated.
Completeness
Conciseness
Understanding the Behavior of a
Module Using Data Examples
Question: Do data examples allow human users to understand
the behavior of scientific modules?
Evaluation exercise: given a module m, we adopted the
following two-step process:
1. The user was asked to describe the
behavior of the module based on its name, the names of its
input and output parameters, and the structural and
semantic types of those parameters.
2. The user was additionally given the data examples that
characterize the module and was asked to update their
description of the module's behavior if they deemed it necessary.
Understanding the Behavior of a
Module Using Data Examples
An analysis of the results and the modules showed that whether
human users could identify the behavior of a module is correlated
with the nature of the transformation carried out by the module.
The users correctly identified the behavior of modules
implementing data retrieval, format transformation and identifier
mapping.
On the other hand, they were less successful with modules implementing
data filtering and complex data analysis, such as text mining.
Kind of data manipulation    # of modules
Format transformation        53
Data retrieval               51
Mapping identifiers          62
Filtering                    27
Data analysis                59
Table 3: Kinds of data manipulation carried out by the scientific modules.
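The counts in Table 3 can be checked with a few lines of arithmetic. The snippet below only recomputes figures already given in the table; the variable names are illustrative.

```python
# Module counts per kind of data manipulation, copied from Table 3.
counts = {
    "Format transformation": 53,
    "Data retrieval": 51,
    "Mapping identifiers": 62,
    "Filtering": 27,
    "Data analysis": 59,
}

total = sum(counts.values())

# Kinds whose behavior users identified correctly in the experiment.
well_understood = ["Format transformation", "Data retrieval", "Mapping identifiers"]
share = sum(counts[k] for k in well_understood) / total

print(total)               # 252
print(round(100 * share))  # 66
```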
... complex data analysis, data examples may not have the same value
as for other module kinds, as far as the human user is concerned.
Note, however, that a large proportion of scientific modules implement
format transformation, data retrieval and mapping identifiers,
which are referred to in the scientific workflow literature using the
term Shims [35]. For example, Table 3 classifies the modules that
we analyzed in the experiment. It shows that format transformation,
data retrieval and mapping identifiers modules represent between
them 66% of the total number of modules that we analyzed.
That said, it is worth stressing, as we will demonstrate in the next ...
Comparing Scientific Modules
Using Data Examples
As well as understanding scientific modules, users may
be interested in comparing the behavior of two or more
modules.
Module comparison is particularly requested by workflow
curators, who use it to repair broken workflows.
Comparing Scientific Modules
Using Data Examples
Consider two modules m and m’, and consider that
the inputs and outputs of those modules are
semantically and structurally compatible.
To be able to compare the behavior of m and m’, we
generate data examples that characterize their behavior
using the method presented earlier.
However, to make the comparison of their behavior
straightforward, we generate the data examples of m
and m’ so that they share the same input values.
Comparing Scientific Modules
Using Data Examples
By comparing the output values of the data examples
of m and m’ that have the same input values, we
determine if the two modules have behaviors that are:
Equivalent: the data examples of the two modules have
the same output values
Overlapping: Some (but not all) of the data examples of
the two modules have the same output values.
Disjoint: None of the data examples of the two modules
have the same output values.
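The three cases above can be sketched as a small function. This is an illustration under assumptions: modules are modelled as plain Python callables, the shared input values are supplied by the caller, and `compare` and the toy modules are hypothetical names.

```python
def compare(module_a, module_b, shared_inputs):
    """Classify two modules by comparing their outputs on the same inputs."""
    if not shared_inputs:
        raise ValueError("at least one shared input value is required")
    matches = sum(module_a(x) == module_b(x) for x in shared_inputs)
    if matches == len(shared_inputs):
        return "equivalent"
    if matches == 0:
        return "disjoint"
    return "overlapping"


# Toy stand-ins for real modules.
def to_upper(seq):
    return seq.upper()


def to_lower(seq):
    return seq.lower()


def identity(seq):
    return seq


inputs = ["acgt", "ACGT"]
print(compare(to_upper, str.upper, inputs))  # equivalent
print(compare(to_upper, identity, inputs))   # overlapping
print(compare(to_upper, to_lower, inputs))   # disjoint
```

Note that the verdict only reflects the inputs tried: two modules reported as equivalent agree on the data examples, not necessarily on every possible input.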
Evaluation
To assess the effectiveness of the above method for
comparing modules’ behavior, we used it to assist in
the curation of broken workflows.
We identified 72 modules that are used in scientific
workflows in the myExperiment repository, that are no
longer provided by their suppliers, and for which we
were able to construct data examples.
We compared those modules with the 252 modules
that we characterized using data examples.
[Chart: results of comparing the 72 modules; the values 16, 23 and 33 appear in the chart]
Conclusions
We showed that it is possible to characterize scientific
modules using data examples without relying on module
specifications.
We also presented two functionalities that utilize the
generated data examples.
Understanding the module behavior by human users
Module comparison
Research questions for future work:
How can we make data examples more concise (less redundant)?
How can we compose modules based only on data examples?

Contenu connexe

Tendances

Myanmar Alphabet Recognition System Based on Artificial Neural Network
Myanmar Alphabet Recognition System Based on Artificial Neural NetworkMyanmar Alphabet Recognition System Based on Artificial Neural Network
Myanmar Alphabet Recognition System Based on Artificial Neural Networkijtsrd
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Geant4_Web_Application_Update_and_Pion_Cross_Section_Simulation
Geant4_Web_Application_Update_and_Pion_Cross_Section_SimulationGeant4_Web_Application_Update_and_Pion_Cross_Section_Simulation
Geant4_Web_Application_Update_and_Pion_Cross_Section_SimulationRasheed Auguste
 
Analysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data MiningAnalysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data Miningijdmtaiir
 
Protein structure prediction by means
Protein structure prediction by meansProtein structure prediction by means
Protein structure prediction by meansijaia
 
Gene Selection for Sample Classification in Microarray: Clustering Based Method
Gene Selection for Sample Classification in Microarray: Clustering Based MethodGene Selection for Sample Classification in Microarray: Clustering Based Method
Gene Selection for Sample Classification in Microarray: Clustering Based MethodIOSR Journals
 
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryA Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryIJERA Editor
 
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATION
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATIONA NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATION
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATIONcscpconf
 
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...Rudradityo Saha
 
Iaetsd an enhanced feature selection for
Iaetsd an enhanced feature selection forIaetsd an enhanced feature selection for
Iaetsd an enhanced feature selection forIaetsd Iaetsd
 
Clustering and Classification of Cancer Data Using Soft Computing Technique
Clustering and Classification of Cancer Data Using Soft Computing Technique Clustering and Classification of Cancer Data Using Soft Computing Technique
Clustering and Classification of Cancer Data Using Soft Computing Technique IOSR Journals
 
Iaetsd an efficient and large data base using subset selection algorithm
Iaetsd an efficient and large data base using subset selection algorithmIaetsd an efficient and large data base using subset selection algorithm
Iaetsd an efficient and large data base using subset selection algorithmIaetsd Iaetsd
 
Delineation of techniques to implement on the enhanced proposed model using d...
Delineation of techniques to implement on the enhanced proposed model using d...Delineation of techniques to implement on the enhanced proposed model using d...
Delineation of techniques to implement on the enhanced proposed model using d...ijdms
 
Adaptive web page content identification
Adaptive web page content identificationAdaptive web page content identification
Adaptive web page content identificationJhih-Ming Chen
 
Cheminformatics: An overview
Cheminformatics: An overviewCheminformatics: An overview
Cheminformatics: An overviewsubhasis banerjee
 
Session ii g2 overview metabolic network modeling mcc
Session ii g2 overview metabolic network modeling mccSession ii g2 overview metabolic network modeling mcc
Session ii g2 overview metabolic network modeling mccUSD Bioinformatics
 
IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...
IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...
IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...IRJET Journal
 
Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005Paolo Missier
 

Tendances (20)

Deliverable_5.1.2
Deliverable_5.1.2Deliverable_5.1.2
Deliverable_5.1.2
 
Myanmar Alphabet Recognition System Based on Artificial Neural Network
Myanmar Alphabet Recognition System Based on Artificial Neural NetworkMyanmar Alphabet Recognition System Based on Artificial Neural Network
Myanmar Alphabet Recognition System Based on Artificial Neural Network
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Geant4_Web_Application_Update_and_Pion_Cross_Section_Simulation
Geant4_Web_Application_Update_and_Pion_Cross_Section_SimulationGeant4_Web_Application_Update_and_Pion_Cross_Section_Simulation
Geant4_Web_Application_Update_and_Pion_Cross_Section_Simulation
 
Analysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data MiningAnalysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data Mining
 
NCRAST Talk on Clustering
NCRAST Talk on ClusteringNCRAST Talk on Clustering
NCRAST Talk on Clustering
 
Protein structure prediction by means
Protein structure prediction by meansProtein structure prediction by means
Protein structure prediction by means
 
Gene Selection for Sample Classification in Microarray: Clustering Based Method
Gene Selection for Sample Classification in Microarray: Clustering Based MethodGene Selection for Sample Classification in Microarray: Clustering Based Method
Gene Selection for Sample Classification in Microarray: Clustering Based Method
 
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryA Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
 
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATION
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATIONA NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATION
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATION
 
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
 
Iaetsd an enhanced feature selection for
Iaetsd an enhanced feature selection forIaetsd an enhanced feature selection for
Iaetsd an enhanced feature selection for
 
Clustering and Classification of Cancer Data Using Soft Computing Technique
Clustering and Classification of Cancer Data Using Soft Computing Technique Clustering and Classification of Cancer Data Using Soft Computing Technique
Clustering and Classification of Cancer Data Using Soft Computing Technique
 
Iaetsd an efficient and large data base using subset selection algorithm
Iaetsd an efficient and large data base using subset selection algorithmIaetsd an efficient and large data base using subset selection algorithm
Iaetsd an efficient and large data base using subset selection algorithm
 
Delineation of techniques to implement on the enhanced proposed model using d...
Delineation of techniques to implement on the enhanced proposed model using d...Delineation of techniques to implement on the enhanced proposed model using d...
Delineation of techniques to implement on the enhanced proposed model using d...
 
Adaptive web page content identification
Adaptive web page content identificationAdaptive web page content identification
Adaptive web page content identification
 
Cheminformatics: An overview
Cheminformatics: An overviewCheminformatics: An overview
Cheminformatics: An overview
 
Session ii g2 overview metabolic network modeling mcc
Session ii g2 overview metabolic network modeling mccSession ii g2 overview metabolic network modeling mcc
Session ii g2 overview metabolic network modeling mcc
 
IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...
IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...
IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...
 
Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005
 

En vedette

Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Khalid Belhajjame
 
Research Object Model in Sepublica
Research Object Model in SepublicaResearch Object Model in Sepublica
Research Object Model in SepublicaKhalid Belhajjame
 
Case studyworkshoponprovenance
Case studyworkshoponprovenanceCase studyworkshoponprovenance
Case studyworkshoponprovenanceKhalid Belhajjame
 
Detecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow ResultsDetecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow ResultsKhalid Belhajjame
 
Предиктивная аналитика и Big Data: методы, инструменты, решения
Предиктивная аналитика и Big Data: методы, инструменты, решенияПредиктивная аналитика и Big Data: методы, инструменты, решения
Предиктивная аналитика и Big Data: методы, инструменты, решенияDell_Russia
 

En vedette (9)

Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014
 
Research Object Model in Sepublica
Research Object Model in SepublicaResearch Object Model in Sepublica
Research Object Model in Sepublica
 
Case studyworkshoponprovenance
Case studyworkshoponprovenanceCase studyworkshoponprovenance
Case studyworkshoponprovenance
 
Why Workflows Break
Why Workflows BreakWhy Workflows Break
Why Workflows Break
 
D-prov use-case
D-prov use-caseD-prov use-case
D-prov use-case
 
Detecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow ResultsDetecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow Results
 
Ikc 2015
Ikc 2015Ikc 2015
Ikc 2015
 
Reproducibility 1
Reproducibility 1Reproducibility 1
Reproducibility 1
 
Предиктивная аналитика и Big Data: методы, инструменты, решения
Предиктивная аналитика и Big Data: методы, инструменты, решенияПредиктивная аналитика и Big Data: методы, инструменты, решения
Предиктивная аналитика и Big Data: методы, инструменты, решения
 

Similaire à Annotating Scientific Modules Behavior Using Data Examples

Adam Margolin & Nicole DeFlaux Science Online London 2011-09-01
Adam Margolin & Nicole DeFlaux Science Online London 2011-09-01Adam Margolin & Nicole DeFlaux Science Online London 2011-09-01
Adam Margolin & Nicole DeFlaux Science Online London 2011-09-01Sage Base
 
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...IAEME Publication
 
Data mining techniques a survey paper
Data mining techniques a survey paperData mining techniques a survey paper
Data mining techniques a survey papereSAT Publishing House
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...IRJET Journal
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
Data mining techniques
Data mining techniquesData mining techniques
Data mining techniqueseSAT Journals
 
Visualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine LearningVisualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine LearningIRJET Journal
 
Object oriented methodologies
Object oriented methodologiesObject oriented methodologies
Object oriented methodologiesnaina-rani
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionSiddharth Shrivastava
 
Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsIRJET Journal
 
IRJET - A Survey on Machine Learning Algorithms, Techniques and Applications
IRJET - A Survey on Machine Learning Algorithms, Techniques and ApplicationsIRJET - A Survey on Machine Learning Algorithms, Techniques and Applications
IRJET - A Survey on Machine Learning Algorithms, Techniques and ApplicationsIRJET Journal
 
Recommendation system using unsupervised machine learning algorithm & assoc
Recommendation system using unsupervised machine learning algorithm & assocRecommendation system using unsupervised machine learning algorithm & assoc
Recommendation system using unsupervised machine learning algorithm & associjerd
 
Integration of queuing network and idef3 for business process analysis
Integration of queuing network and idef3 for business process analysisIntegration of queuing network and idef3 for business process analysis
Integration of queuing network and idef3 for business process analysisPatricia Tavares Boralli
 
Task Adaptive Neural Network Search with Meta-Contrastive Learning
Task Adaptive Neural Network Search with Meta-Contrastive LearningTask Adaptive Neural Network Search with Meta-Contrastive Learning
Task Adaptive Neural Network Search with Meta-Contrastive LearningMLAI2
 
Data mining Algorithm’s Variant Analysis
Data mining Algorithm’s Variant AnalysisData mining Algorithm’s Variant Analysis
Data mining Algorithm’s Variant AnalysisIOSR Journals
 
Data mining Algorithm’s Variant Analysis
Data mining Algorithm’s Variant AnalysisData mining Algorithm’s Variant Analysis
Data mining Algorithm’s Variant AnalysisIOSR Journals
 
Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelDr. Abdul Ahad Abro
 
Data mining techniques application for prediction in OLAP cube
Data mining techniques application for prediction in OLAP cubeData mining techniques application for prediction in OLAP cube
Data mining techniques application for prediction in OLAP cubeIJECEIAES
 

Similaire à Annotating Scientific Modules Behavior Using Data Examples (20)

T0 numtq0n tk=
T0 numtq0n tk=T0 numtq0n tk=
T0 numtq0n tk=
 
Adam Margolin & Nicole DeFlaux Science Online London 2011-09-01
Adam Margolin & Nicole DeFlaux Science Online London 2011-09-01Adam Margolin & Nicole DeFlaux Science Online London 2011-09-01
Adam Margolin & Nicole DeFlaux Science Online London 2011-09-01
 
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
 
Data mining techniques a survey paper
Data mining techniques a survey paperData mining techniques a survey paper
Data mining techniques a survey paper
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
Data mining techniques
Data mining techniquesData mining techniques
Data mining techniques
 
Visualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine LearningVisualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine Learning
 
Object oriented methodologies
Object oriented methodologiesObject oriented methodologies
Object oriented methodologies
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear Regression
 
Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather Conditions
 
IRJET - A Survey on Machine Learning Algorithms, Techniques and Applications
IRJET - A Survey on Machine Learning Algorithms, Techniques and ApplicationsIRJET - A Survey on Machine Learning Algorithms, Techniques and Applications
IRJET - A Survey on Machine Learning Algorithms, Techniques and Applications
 
Recommendation system using unsupervised machine learning algorithm & assoc
Recommendation system using unsupervised machine learning algorithm & assocRecommendation system using unsupervised machine learning algorithm & assoc
Recommendation system using unsupervised machine learning algorithm & assoc
 
Integration of queuing network and idef3 for business process analysis
Integration of queuing network and idef3 for business process analysisIntegration of queuing network and idef3 for business process analysis
Integration of queuing network and idef3 for business process analysis
 
Task Adaptive Neural Network Search with Meta-Contrastive Learning
Task Adaptive Neural Network Search with Meta-Contrastive LearningTask Adaptive Neural Network Search with Meta-Contrastive Learning
Task Adaptive Neural Network Search with Meta-Contrastive Learning
 
Data mining Algorithm’s Variant Analysis
Data mining Algorithm’s Variant AnalysisData mining Algorithm’s Variant Analysis
Data mining Algorithm’s Variant Analysis
 
Data mining Algorithm’s Variant Analysis
Data mining Algorithm’s Variant AnalysisData mining Algorithm’s Variant Analysis
Data mining Algorithm’s Variant Analysis
 
E017153342
E017153342E017153342
E017153342
 
Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms Excel
 
Data mining techniques application for prediction in OLAP cube
Data mining techniques application for prediction in OLAP cubeData mining techniques application for prediction in OLAP cube
Data mining techniques application for prediction in OLAP cube
 

Annotating Scientific Modules Behavior Using Data Examples

  • 1. Annotating the Behavior of Scientific Modules Using Data Examples: A Practical Approach Khalid Belhajjame Université Paris-Dauphine, LAMSADE Khalid.Belhajjame@dauphine.fr
  • 2. Scientific Workflows: We have recorded a dramatic increase in the number of scientists who use scientific modules as building blocks in the composition of their experiments. In 2011, the EBI recorded 21 million invocations of the scientific modules it hosts. Typically, an experiment is designed as a workflow, the steps of which represent invocations of scientific modules.
  • 3. Scientific Module Annotation: Semantic annotations can be used to describe scientific modules. Existing semantic annotations are confined to the description of module parameters. Annotations describing the behavior of a module, that is, the task it performs, are rarely available. Designing an ontology that precisely captures the behavior of modules is challenging. Proposal: describe the behavior of scientific modules using data examples.
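  • 5. Outline: [Diagram. Annotation side: a Curator annotates module parameters and generates data examples. Use side: an Experiment Designer explores and understands modules and compares modules. A Scientific Module Registry sits between the two. Systems named in the diagram: APIHUT, Radiant, Meteor-s, Galaxy, Taverna, Vistrails. The four activities are numbered 1 to 4.]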
  • 6. Generating Data Examples: Data examples can be used as a means to describe the behavior of scientific modules. Enumerating all possible data examples for a given module may be expensive, and the resulting set may contain redundant data examples that describe the same behavior. Issue: which data examples should be used to characterize the functionality of a given module? Solution: we show how software testing techniques can be adapted to the problem of generating data examples without relying on the availability of the module specification, which often is not accessible.
  • 7. Identifying the Classes of Behavior of a Scientific Module: To generate data examples, we start by identifying the classes of behavior of the module. Consider a module m with an input parameter i. The domain of legal values of i is divided into partitions p1, …, pn. The partitioning is performed so as to cover all classes of behavior of the module. Doing so normally requires access to the module specification, which is rarely available. In this work, we use a different source of information, namely the domain ontology used for annotating module parameters.
  • 8. Identifying the Classes of Behavior of a Scientific Module: An ontology can be viewed as a hierarchy of concepts. We use this hierarchy to specify the classes of behavior of scientific modules. Consider the module getAccession, which, given an input annotated as a biological sequence, returns the accession used for its identification. The input domain of this module can be partitioned into the following: BiologicalSequence, NucleotideSequence, RNASequence, DNASequence, and ProteinSequence.
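The partitioning on slide 8 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the slide names the five concepts but does not show how they are arranged, so the parent/child layout in `PARENT` is an assumption, as are the function names.

```python
# Hypothetical toy ontology: child concept -> parent concept.
# The five concept names come from the slide; the hierarchy shape is assumed.
PARENT = {
    "NucleotideSequence": "BiologicalSequence",
    "ProteinSequence": "BiologicalSequence",
    "RNASequence": "NucleotideSequence",
    "DNASequence": "NucleotideSequence",
}


def is_subsumed_by(concept, ancestor):
    """True if `concept` equals `ancestor` or is one of its descendants."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def partitions(annotation):
    """Return the annotation concept and every concept it subsumes, sorted."""
    concepts = set(PARENT) | set(PARENT.values())
    return sorted(c for c in concepts if is_subsumed_by(c, annotation))


if __name__ == "__main__":
    # An input annotated as BiologicalSequence yields one partition per concept.
    print(partitions("BiologicalSequence"))
```

With this layout, an input annotated with a narrower concept such as NucleotideSequence would yield fewer partitions, since only that concept and its descendants are considered.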
  • 9. Generating Data Examples Covering Input Parameter Partitions: Given the partitions of input parameters identified using the domain ontology, and given a pool of annotated instances, the input values necessary for constructing data examples can be automatically identified. Data examples covering the partitions in question can then be constructed by invoking the module using the input values identified.
    [Excerpt from the paper shown on the slide:] [...] that cover those partitions. Such data examples can be specified by soliciting from the human annotator example input values that belong to the respective partitions, and then invoking the module m to obtain the corresponding output values, necessary for constructing the data examples. The construction of such data examples can, however, be fully automated if a pool of annotated instances is available. Specifically, given pl, a pool of annotated instances, the values of i necessary for constructing data examples that cover the partitions of the input i of the module m can be obtained as follows:
    $\{\langle c, \mathit{getInstance}(c, pl)\rangle \ \text{s.t.}\ c \sqsubseteq \mathit{sem}(i)\}$
    where getInstance(c, pl) is a function that returns an instance of the concept c from the annotated pool of instances pl. Note that this function returns a realization of the concept in question [25], in the sense that the instance of c chosen is not an instance of any strict subconcept of c, i.e. not an instance of any concept $c' \sqsubset c$.
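The set expression on slide 9 can be read as the following sketch. It rests on several assumptions that are not in the slides: the pool is modeled as a list of (value, concept) pairs in which each value is labeled with its most specific concept, so that matching the label exactly gives a realization; the hierarchy in `PARENT` is hypothetical; and Python's built-in `len` stands in for a real module.

```python
# Hypothetical toy ontology (child -> parent); the hierarchy shape is assumed.
PARENT = {
    "NucleotideSequence": "BiologicalSequence",
    "ProteinSequence": "BiologicalSequence",
    "RNASequence": "NucleotideSequence",
    "DNASequence": "NucleotideSequence",
}


def is_subsumed_by(concept, ancestor):
    """True if `concept` equals `ancestor` or is one of its descendants."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def get_instance(concept, pool):
    """Return a realization of `concept` from the pool, or None.

    Assumption: every pool value is labeled with its MOST SPECIFIC concept,
    so an exact label match is never an instance of a strict subconcept.
    """
    for value, label in pool:
        if label == concept:
            return value
    return None


def input_values(annotation, pool):
    """Map each concept c subsumed by the annotation to get_instance(c, pool)."""
    concepts = sorted(set(PARENT) | set(PARENT.values()))
    result = {}
    for c in concepts:
        if is_subsumed_by(c, annotation):
            value = get_instance(c, pool)
            if value is not None:  # partitions without a realization stay uncovered
                result[c] = value
    return result


def data_examples(module, annotation, pool):
    """Invoke the module on each selected input to obtain (concept, input, output)."""
    return [(c, v, module(v)) for c, v in input_values(annotation, pool).items()]


if __name__ == "__main__":
    # Made-up pool of annotated instances, for illustration only.
    pool = [("ACGT", "DNASequence"), ("ACGU", "RNASequence"), ("MKV", "ProteinSequence")]
    for example in data_examples(len, "BiologicalSequence", pool):
        print(example)
```

In this toy pool no value is labeled exactly BiologicalSequence or NucleotideSequence, so those two partitions receive no data example; this is one way the approach depends on the contents of the pool.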
  • 10. Generating Data Examples Covering Output Parameter Partitions: Constructing data examples based on the partitioning of the domains of output parameters can be difficult to implement. Given a partition po of the output parameter o of a module m, we need to find input values that, when fed to m, make the output o take a value that belongs to po. One source that we use for identifying (some of) the data examples that cover the output partitions is the set of data examples generated to cover the partitions of the input parameters.
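Reusing the input-driven data examples to cover output partitions amounts to a simple set difference, sketched below. The record layout (a dict with an `output_concept` key) and the concept names in the usage lines are assumptions made for illustration; the slides do not say how the concept of an output value is determined.

```python
def uncovered_output_partitions(output_partitions, examples):
    """Return the output partitions that no existing data example reaches.

    Assumption: each example records the most specific concept of its
    output value under the (hypothetical) key "output_concept".
    """
    covered = {example["output_concept"] for example in examples}
    return sorted(set(output_partitions) - covered)


if __name__ == "__main__":
    # Made-up data examples that were generated to cover INPUT partitions.
    examples = [
        {"input": "ACGT", "output": "X00001", "output_concept": "NucleotideAccession"},
        {"input": "MKV", "output": "P00001", "output_concept": "ProteinAccession"},
    ]
    wanted = ["Accession", "NucleotideAccession", "ProteinAccession"]
    # Whatever is printed here would still need dedicated input values.
    print(uncovered_output_partitions(wanted, examples))
```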
  • 11. Evaluation: The method that we have just described is not an exact method. Rather, it is a heuristic that provides a working solution. Because of this, the domain of a module may be over-partitioned or, inversely, under-partitioned. We therefore assessed the effectiveness of the proposed method by generating data examples for 252 scientific modules. Notice that the availability of a pool of annotated instances is crucial to our method. We constructed such a pool by harvesting existing provenance traces of scientific workflows.
  • 13. Coverage: We were able to construct data examples that cover all the partitions of the input parameters. Moreover, the data examples generated were found to cover most of the partitions of the output parameters. Indeed, with the exception of the partitions of the outputs of 19 modules (e.g., get_genes_by_enzyme, link and binfo), all the partitions of the outputs of the remaining 233 modules were covered by the data examples generated.
  • 15. Outline (diagram repeated from slide 5)
  • 16. Understanding the Behavior of a Module Using Data Examples: Question: do data examples allow human users to understand the behavior of scientific modules? Evaluation exercise: given a module m, we adopted the following two-step process:
    1. The user was asked to describe the behavior of the module based on its name, the names of its input and output parameters, and the structural and semantic types of those parameters.
    2. The user was additionally given the data examples that characterize the module and was asked to update the description of the module's behavior if they deemed it necessary.
  • 17. Understanding the Behavior of a Module Using Data Examples
  • 18. Understanding the Behavior of a Module Using Data Examples: An analysis of the results showed that whether human users could identify the behavior of a module is correlated with the nature of the transformation the module carries out. Users correctly identified the behavior of modules implementing data retrieval, format transformation and identifier mapping. They were less successful with modules implementing data filtering and complex data analysis, such as text mining.
    Table 3: Kinds of data manipulation carried out by the scientific modules.
    Kind of data manipulation    # of modules
    Format transformation        53
    Data retrieval               51
    Mapping identifiers          62
    Filtering                    27
    Data analysis                59
    [Excerpt from the paper shown on the slide:] [...] complex data analysis, data examples may not have the same value as for other module kinds, as far as the human user is concerned. Note, however, that a large proportion of scientific modules implement format transformation, data retrieval and identifier mapping, which are referred to in the scientific workflow literature using the term Shims [35]. For example, Table 3 classifies the modules that we analyzed in the experiment. It shows that format transformation, data retrieval and identifier mapping modules represent between them 66% of the total number of modules that we analyzed. That said, it is worth stressing, as we will demonstrate in the next [...]
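As a quick check, the 66% figure follows from the module counts listed in Table 3, whose total matches the 252 modules used in the evaluation:

```latex
\[
\frac{53 + 51 + 62}{53 + 51 + 62 + 27 + 59} = \frac{166}{252} \approx 0.659 \approx 66\%
\]
```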
  • 19. Outline (diagram repeated from slide 5)
  • 20. Comparing Scientific Modules Using Data Examples: As well as understanding scientific modules, users may be interested in comparing the behavior of two or more modules. Module comparison is a functionality particularly requested by workflow curators who need to repair broken workflows.
  • 21. Comparing Scientific Modules Using Data Examples: Consider two modules m and m’ whose inputs and outputs are semantically and structurally compatible. To compare the behavior of m and m’, we generate data examples that characterize their behavior using the method presented earlier. To make the comparison straightforward, we generate the data examples of m and m’ so that they share the same input values.
  • 22. Comparing Scientific Modules Using Data Examples: By comparing the output values of the data examples of m and m’ that have the same input values, we determine whether the two modules have behaviors that are:
    Equivalent: the data examples of the two modules have the same output values.
    Overlapping: some (but not all) of the data examples of the two modules have the same output values.
    Disjoint: none of the data examples of the two modules have the same output values.
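The three-way classification on slide 22 can be sketched as follows. This is an illustrative reading, not the author's code: data examples are modeled as a dict from input value to output value, output equality is plain `==`, and the two toy modules are made up.

```python
def compare_behavior(examples_m, examples_m_prime):
    """Classify two modules as 'equivalent', 'overlapping' or 'disjoint'.

    Each argument maps input values to the output the module produced.
    Only inputs present in both sets of data examples are compared.
    """
    shared_inputs = examples_m.keys() & examples_m_prime.keys()
    if not shared_inputs:
        raise ValueError("the two modules have no data examples with the same input")
    matches = sum(1 for x in shared_inputs if examples_m[x] == examples_m_prime[x])
    if matches == len(shared_inputs):
        return "equivalent"
    if matches == 0:
        return "disjoint"
    return "overlapping"


if __name__ == "__main__":
    inputs = ["acgt", "ACGT", "mkv"]
    # Two hypothetical modules invoked on the same input values.
    m = {x: x.upper() for x in inputs}
    m_prime = {x: x for x in inputs}
    print(compare_behavior(m, m_prime))
```

Because the verdict is based only on the data examples compared, "equivalent" here means equivalent on those inputs, which is in line with the heuristic nature of the method acknowledged on the evaluation slide.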
  • 23. Evaluation: To assess the effectiveness of the above method for comparing the behavior of modules, we used it to assist in the curation of broken workflows. We identified 72 modules that are used in scientific workflows in the myExperiment repository, that are no longer provided by their suppliers, and for which we were able to construct data examples. We compared those modules with the 252 modules that we characterized using data examples.
  • 25. Outline (diagram repeated from slide 5)
  • 26. Conclusions: We showed that it is possible to characterize scientific modules using data examples without relying on module specifications. We also presented two functionalities that use the generated data examples:
    Understanding of module behavior by human users
    Module comparison
    Research questions for future work:
    How can we make data examples more concise (less redundant)?
    How can we compose modules based only on data examples?
  • 27. Annotating the Behavior of Scientific Modules Using Data Examples: A Practical Approach Khalid Belhajjame Université Paris-Dauphine, LAMSADE Khalid.Belhajjame@dauphine.fr