eScience 2014, Guarujá (Brasil). Abstract: Scientific workflows provide the means to define, execute and reproduce computational experiments. However, reusing existing workflows still poses challenges for workflow designers. Workflows are often too large and too specific to reuse in their entirety, so reuse is more likely to happen for fragments of workflows. These fragments may be identified manually by users as sub-workflows, or detected automatically. In this paper we present the FragFlow approach, which detects workflow fragments automatically by analyzing existing workflow corpora with graph mining algorithms. FragFlow detects the most common workflow fragments, links them to the original workflows and visualizes them. We evaluate our approach by comparing FragFlow results against user-defined sub-workflows from three different corpora of the LONI Pipeline system. Based on this evaluation, we discuss how automated workflow fragment detection could facilitate workflow reuse.
1. Date: 24/10/2014
FragFlow: Automatic Fragment Detection in Scientific Workflows
Daniel Garijo *, Oscar Corcho *, Yolanda Gil Ŧ, Boris A. Gutman ⱡ, Ivo D. Dinov ⱡ, Paul Thompson ⱡ and Arthur W. Toga ⱡ
* Universidad Politécnica de Madrid,
Ŧ USC Information Sciences Institute,
ⱡ USC Laboratory of Neuroimaging
2. Overview
•Detecting common groups of tasks in a corpus of scientific workflows
•Application of exact and inexact graph matching techniques
•Filtering and linking results to the input corpus
•Benefits: Discoverability, understandability, reuse, design, modularization, visualization
[Figure labels: Lab book, Digital Log, Laboratory Protocol (recipe), Workflow, Experiment]
IEEE eScience 2014. Guarujá, Brasil
3. Background
•Workflows are software artifacts that capture computational experiments
•Addition to paper publication
•Provenance of results
•Reuse
•Existing repositories of workflows (Galaxy, myExperiment, the LONI Pipeline, CrowdLabs, etc.)
•Sharing workflows
•Exploring existing workflows
•PROBLEMS to address:
•Workflows have many detailed steps and may be difficult to understand
•The general method may not be apparent
•How are different workflows related?
•What steps do they have in common?
4. Workflow Fragments and Groupings
•Workflow Fragment: set of connected steps that are part of a workflow
•Common Workflow Fragment: fragment that occurs more than once in a corpus of workflows
•Grouping: workflow fragment manually annotated by a user
•Sub-Grouping: grouping included as part of another grouping
[Figure: three example workflows (Workflow 1, Workflow 2, Workflow 3) composed of steps labeled A to H, with their common workflow fragments marked]
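The definition of a workflow fragment as a set of connected steps can be made concrete with a small check. This is a minimal sketch, not FragFlow's code: the function name `is_fragment`, the edge-list representation and the example workflow are assumptions made for illustration, and connectivity is taken to mean weak connectivity (edge direction ignored).

```python
def is_fragment(workflow_edges, steps):
    """Return True if `steps` is a non-empty set of workflow steps that are
    connected to each other through the workflow's edges (direction ignored)."""
    steps = set(steps)
    all_steps = {step for edge in workflow_edges for step in edge}
    if not steps or not steps <= all_steps:
        return False
    # Undirected adjacency restricted to the candidate steps.
    neighbours = {step: set() for step in steps}
    for src, dst in workflow_edges:
        if src in steps and dst in steps:
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    # Depth-first traversal from an arbitrary candidate step.
    start = next(iter(steps))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == steps


# Hypothetical workflow: A feeds B and F, B feeds C.
example_workflow = [("A", "B"), ("B", "C"), ("A", "F")]
connected = is_fragment(example_workflow, {"A", "B", "C"})
disconnected = is_fragment(example_workflow, {"C", "F"})
```

Here `{"A", "B", "C"}` qualifies as a fragment, while `{"C", "F"}` does not, because C and F are only linked through steps outside the set.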
5. Our Goals
Our goal is to automatically detect useful workflow fragments to be reused by scientists. In this work, given a workflow corpus…
•Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?
•Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful?
•Goal 3: How are workflows and groupings reused?
6. The LONI Pipeline
•Workflow system for neuroimaging analysis
•Active community of users creating workflows
•Enables users to define groupings in workflows
•Has a corpus of published workflows
•Has a library of (uniquely identified) components with well-defined functionality: http://pipeline.loni.usc.edu/explore/library-navigator/
7. Workflow Mining in FragFlow
[Figure: the FragFlow workflow mining process applied to a workflow corpus, shown as four numbered steps]
8. Corpus Preparation
Workflows converted to Labeled Directed Acyclic Graphs (LDAG)
•The label of a node in the graph corresponds to the type of the step in the workflow
•Edges capture the dependencies between different steps
•Duplicated workflows are removed
•Single-step workflows are removed
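The preparation rules above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not FragFlow's implementation: the function name `prepare_corpus`, the input layout and the step type names are hypothetical, and duplicates are detected with a cheap signature (sorted step types plus sorted labeled edges) rather than a full graph isomorphism test, so it can occasionally merge two workflows that differ only in wiring.

```python
def prepare_corpus(workflows):
    """Filter a corpus of labeled DAGs.

    workflows: dict mapping a workflow name to a pair (labels, edges), where
    labels maps step id -> step type and edges is a list of (src id, dst id).
    """
    prepared = {}
    seen_signatures = set()
    for name, (labels, edges) in workflows.items():
        if len(labels) < 2:
            continue  # single-step workflows are removed
        # Signature built from step types only, so step ids do not matter.
        signature = (
            tuple(sorted(labels.values())),
            tuple(sorted((labels[src], labels[dst]) for src, dst in edges)),
        )
        if signature in seen_signatures:
            continue  # duplicated workflows are removed
        seen_signatures.add(signature)
        prepared[name] = (labels, edges)
    return prepared


# Hypothetical corpus: wf2 duplicates wf1 with different ids, wf3 has one step.
toy_corpus = {
    "wf1": ({1: "Align", 2: "Smooth"}, [(1, 2)]),
    "wf2": ({10: "Align", 20: "Smooth"}, [(10, 20)]),
    "wf3": ({1: "Align"}, []),
    "wf4": ({1: "Align", 2: "Segment"}, [(1, 2)]),
}
prepared_corpus = prepare_corpus(toy_corpus)
```

With this toy input only `wf1` and `wf4` survive preparation.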
9. Graph Mining
We use popular graph mining techniques:
•Inexact FGM: uses heuristics to calculate the similarity between two graphs. The solution might not be complete
•SUBDUE
•2 heuristics: Minimum Description Length (MDL) and Size
•Frequency based
•Exact FGM: delivers all the fragments that can be found in the dataset
•gSpan
•Depth first search strategy
•Support based
•FSG
•Breadth first search strategy
•Support based
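The distinction between frequency-based and support-based mining can be illustrated on the smallest possible fragments. The sketch below assumes that frequency counts every occurrence of a fragment in the corpus, while support counts the workflows that contain it at least once; that reading, the function name `frequency_and_support` and the toy data are assumptions, and the real miners (SUBDUE, gSpan, FSG) operate on whole subgraphs rather than single edges.

```python
def frequency_and_support(corpus, fragment):
    """Count a two-step fragment (one labeled edge) in a corpus.

    corpus: dict mapping a workflow name to a list of (src type, dst type) edges.
    Returns (frequency, support): total occurrences across the corpus, and the
    number of workflows in which the fragment occurs at least once.
    """
    frequency = sum(edges.count(fragment) for edges in corpus.values())
    support = sum(1 for edges in corpus.values() if fragment in edges)
    return frequency, support


# Hypothetical corpus: wf1 contains the edge A -> B twice.
edge_corpus = {
    "wf1": [("A", "B"), ("A", "B"), ("B", "C")],
    "wf2": [("A", "B")],
    "wf3": [("B", "C")],
}
ab_counts = frequency_and_support(edge_corpus, ("A", "B"))
```

For the edge A to B this gives a frequency of 3 but a support of 2, which is why the same threshold can select different fragments depending on the algorithm.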
10. Filtering Relevant Fragments
The number of resulting fragments can be very large. We distinguish:
•Multistep fragments:
•More than one step
•Filtered Multistep fragments:
•Multistep fragments
•Contain all smaller fragments with the same number of occurrences
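One way to read the two filters is sketched below: single-step fragments are dropped, and so is any fragment contained in a larger fragment that has the same number of occurrences, since the larger one already accounts for it. The function name `filter_fragments` and the data layout are hypothetical, and containment is approximated by a subset test on labeled edges instead of a true subgraph test.

```python
def filter_fragments(fragments):
    """Keep the filtered multistep fragments.

    fragments: dict mapping a fragment id to (edges, occurrences), where edges
    is a frozenset of (src type, dst type) pairs; a single-step fragment has
    an empty edge set.
    """
    # Multistep fragments: at least one edge, hence more than one step.
    multistep = {
        fid: (edges, occ) for fid, (edges, occ) in fragments.items() if edges
    }
    kept = {}
    for fid, (edges, occ) in multistep.items():
        # Redundant if a strictly larger fragment has the same occurrences.
        redundant = any(
            edges < other_edges and occ == other_occ
            for other_id, (other_edges, other_occ) in multistep.items()
            if other_id != fid
        )
        if not redundant:
            kept[fid] = (edges, occ)
    return kept


# Hypothetical mining output.
mined = {
    "f1": (frozenset(), 10),                        # single step
    "f2": (frozenset({("A", "B")}), 5),             # inside f3, same count
    "f3": (frozenset({("A", "B"), ("B", "C")}), 5),
    "f4": (frozenset({("B", "C")}), 8),             # inside f3, different count
}
filtered = filter_fragments(mined)
```

Only `f3` and `f4` remain: `f2` never occurs outside `f3`, whereas `f4` occurs more often than `f3` and therefore carries information of its own.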
11. Linking to the Corpus: Wf-fd
12. Linking to the Corpora: Example
[Figure: example of a detected fragment linked to the corpus]
13. Evaluation
Three workflow corpora:
User Corpus 1 (WC1)
•Designed mostly by a single user
•General medical imaging
•790 workflows (475 after data preparation)
User Corpus 2 (WC2)
•Created by a user, with collaborations of others
•Well documented workflows, meant for reuse
•113 workflows (96 after data preparation)
Multi User Corpus 3 (WC3)
•Workflows submitted by 62 users during January 2014
•Several executions of the same workflows
•5859 workflows (357 after data preparation)
14. Evaluation: Metrics
Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?
Goal 2: Do users find useful the fragments that were NOT similar to their defined groupings?
22. Preliminary Evaluation: User based evaluation
•Manual evaluation: each user is given 16-18 common workflow fragments detected by FragFlow
•66% and 100% accuracy respectively
•Some of the reasons to not use fragments depended on the user preferences
•Currently evaluating additional users
User | Use as proposed | Use with minor changes | Use with major changes | Use (total)
User 1 (WC1) | 11% | 16.6% | 38% | 66.6%
User 2 (WC2) | 44% | 6% | 50% | 100%
23. Evaluation: Grouping analysis
•Workflows with groupings are more common in single user corpora (WC1 and WC2)
•Groupings are reused
•1463 groupings versus 209 unique groupings in WC1
•302 groupings versus 108 unique groupings in WC2
•456 groupings versus 175 unique groupings in WC3
•Grouping size ranges from 0 to 60 steps
•Facilitate copy paste by users (large grouping size)
•Reducing unnecessary inputs (groupings with no steps)
Corpus | Total group. | Unique multistep group. | Wf with group. | Avg. group. per wf | Max no. of steps in group. | Min no. of steps in group.
WC1 | 1463 | 209 | 327 | 4 | 56 | 1
WC2 | 302 | 108 | 42 | 7 | 39 | 0
WC3 | 456 | 175 | 89 | 5 | 60 | 1
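The degree of grouping reuse follows directly from the totals reported on this slide: dividing the total number of groupings by the number of unique groupings gives the average number of times each grouping appears. The short computation below only restates that arithmetic; the variable names are illustrative.

```python
# (total groupings, unique groupings) per corpus, as reported on the slide.
grouping_counts = {
    "WC1": (1463, 209),
    "WC2": (302, 108),
    "WC3": (456, 175),
}

# Average number of appearances per unique grouping.
reuse_ratio = {
    corpus: total / unique for corpus, (total, unique) in grouping_counts.items()
}

for corpus, ratio in reuse_ratio.items():
    print(f"{corpus}: {ratio:.1f} appearances per unique grouping")
```

The ratio is 7.0 for WC1 and roughly 2.8 and 2.6 for WC2 and WC3, so reuse is strongest in the largest single-user corpus.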
24. Findings
With respect to our goals…
•Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?
•(frequency 10%, single user, inexact FGM) 30% to 75% of the fragments found by FragFlow correspond directly to user-defined groupings
•(multi user) The best results are 50% to 56%, with inexact FGM and minimum frequency. If we consider an overlap of 80% of the steps, the precision is 62% to 66%
•Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful?
•For one user 66% of the proposed fragments were useful, for another 100% were useful
•Further evaluation is needed
•Goal 3: How are workflows and groupings reused?
•Workflows with groupings have at least 4 groupings on average
•Groupings are reused (the total number of groupings is up to 7 times the number of unique groupings in a corpus)
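The overlap-based precision mentioned under Goal 1 can be illustrated as follows. The exact metric is not spelled out on these slides, so everything here is an assumption: overlap is taken as the number of shared steps divided by the size of the larger of the two sets, a fragment counts as a match if some grouping reaches the threshold, and precision is the share of matched fragments. The function names and the example sets are hypothetical.

```python
def overlap(fragment_steps, grouping_steps):
    """Shared steps divided by the size of the larger set (assumed definition)."""
    fragment_steps, grouping_steps = set(fragment_steps), set(grouping_steps)
    larger = max(len(fragment_steps), len(grouping_steps))
    return len(fragment_steps & grouping_steps) / larger if larger else 0.0


def precision(fragments, groupings, min_overlap=1.0):
    """Share of fragments matching some grouping with at least `min_overlap`."""
    if not fragments:
        return 0.0
    matched = sum(
        1
        for fragment in fragments
        if any(overlap(fragment, grouping) >= min_overlap for grouping in groupings)
    )
    return matched / len(fragments)


# Hypothetical detected fragments and user-defined groupings (sets of step types).
detected = [{"A", "B", "C", "D", "E"}, {"A", "B"}, {"X", "Y"}]
user_groupings = [{"A", "B", "C", "D"}, {"A", "B"}]

exact_precision = precision(detected, user_groupings)         # exact matches only
relaxed_precision = precision(detected, user_groupings, 0.8)  # 80% overlap
```

In this toy case one fragment out of three matches exactly, and relaxing the threshold to 80% admits a second one, which mirrors how the reported precision rises when partial overlap is accepted.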
25. Limitations
•Graph mining relies on subgraph isomorphism testing, which is an NP-complete problem
•Big fragments can take time to be recognized
•Errors derived from memory heap issues
•Detection of groupings may depend on user preferences on size and frequency
26. Conclusions and Future Work
•FragFlow: Approach to find the most common fragments in a corpus of workflows
•Several integrated graph mining techniques
•FragFlow can be used with different settings
•Minimum or maximum frequency and support.
•Size
•Type of the graph mining algorithm to be applied
•Evaluation of the results using corpora belonging to the LONI Pipeline system.
•New algorithms are being integrated!
•Sigma (inexact FGM), Gaston (exact FGM)
•Future work
•Test FragFlow with other workflow systems, domains, and perform further user evaluations.
•Evaluate how workflow quality improves when users are offered automatically mined workflow fragments
Evaluation and resources available here: http://purl.org/net/escience2014
27. Who are we?
•Daniel Garijo, Oscar Corcho: Ontology Engineering Group, UPM
•Yolanda Gil: Information Sciences Institute, USC
•Boris A. Gutman, Ivo D. Dinov, Paul Thompson, Arthur W. Toga: USC Laboratory of Neuro Imaging
28. Want to collaborate? Contact me at dgarijo@fi.upm.es
Questions?