1. Scientific Workflow Management System
Janus
Provenance
Towards systematic information exchange
and reuse in e-laboratories
AGU Fall meeting, Dec. 2009
Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
with additional material by Sean Bechhofer and Matthew Gamble,
e-Labs design group, University of Manchester
AGU Fall meeting, San Francisco, Dec. 2009 - P.Missier
4. Momentum on sharing and collaboration
Special issue of Nature on Data Sharing (Sept. 2009)
• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case
http://www.nature.com/news/specials/datasharing/index.html
• Debate is much further along in Earth Sciences
– ESIP - data preservation / stewardship, 2009
– Long established in some communities - Atmospheric sciences, 1998 [1]
• Science Commons recommendations for Open Science (July 2008)
[1] Strebel DE, Landis DR, Huemmrich KF, Newcomer JA, Meeson BW: The FIFE Data Publication Experiment. Journal of the Atmospheric Sciences 1998, 55:1277-1283
10. Collaboration in workflow-based science
[Figure: a workflow specification plus an input dataset drive a workflow execution, which yields outcome (data) and outcome (provenance). Both outcomes are packaged into a Research Object, which Paul can browse, query, unbundle, and reuse. Caption: data-mediated implicit collaboration.]
11. Collaboration in workflow-based science
What is needed for Paul to make sense of third party data?
[Figure: data-mediated implicit collaboration. Outcome (data) and outcome (provenance) are packaged into a Research Object, which Paul can browse, query, unbundle, and reuse.]
20. Paul's Pack
[Figure: Paul's Pack represented as a Research Object. Items shown: QTL, Workflow 16, Workflow 13, Results, Logs, Common pathways, Slides, Paper.]
• Relations shown between items: produces, feeds into, included in, published in
• Layers shown: Aggregation, Domain Relations, Representation, Metadata
21. ORE: representing generic aggregations
Resource Map Data structure
(descriptor)
http://www.openarchives.org/ore/1.0/primer.html section 4
A. Pepe, M. Mayernik, C.L. Borgman, and H. Van de Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
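The Resource Map idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `example.org` URIs and the `resource_map` helper are hypothetical, while the `ore:ResourceMap`, `ore:Aggregation`, `ore:describes`, and `ore:aggregates` terms come from the ORE vocabulary.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map(rem_uri, aggregation_uri, aggregated_uris):
    """Return (subject, predicate, object) triples for one Resource Map
    that describes one Aggregation and lists its aggregated resources."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
        (rem_uri, ORE + "describes", aggregation_uri),
    ]
    for uri in aggregated_uris:
        triples.append((aggregation_uri, ORE + "aggregates", uri))
    return triples


# Hypothetical URIs standing in for the items of a pack.
items = [
    "http://example.org/workflows/16",
    "http://example.org/workflows/13",
    "http://example.org/papers/1",
]
pack = resource_map(
    "http://example.org/packs/paul/rem",
    "http://example.org/packs/paul",
    items,
)
```

The descriptor (the Resource Map) and the thing described (the Aggregation) get separate URIs, so the same aggregation could in principle be described by more than one map.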
25. Content: Workflow provenance
A detailed trace of workflow execution
- tasks performed, data transformations
- inputs used, outputs produced
[Figure: example workflow graph. Node labels: lister, get pathways by genes1, merge pathways, gene_id, concat gene pathway ids, output, pathway_genes.]
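A minimal sketch of what such a trace might record, assuming a toy in-memory recorder. The `Trace` and `Invocation` classes, the two stand-in processors, and the gene-to-pathway table are all hypothetical; this is not how Taverna stores provenance.

```python
from dataclasses import dataclass, field


@dataclass
class Invocation:
    processor: str   # task performed
    inputs: dict     # inputs used
    outputs: dict    # outputs produced


@dataclass
class Trace:
    events: list = field(default_factory=list)

    def run(self, name, func, **inputs):
        """Execute one task and record what it consumed and produced."""
        result = func(**inputs)
        self.events.append(Invocation(name, dict(inputs), {"out": result}))
        return result


# Made-up lookup table standing in for a real pathway service.
PATHWAYS = {"g1": ["p1", "p2"], "g2": ["p2"]}


def get_pathways_by_genes(gene_id):
    return PATHWAYS.get(gene_id, [])


def merge_pathways(lists):
    return sorted({p for ps in lists for p in ps})


trace = Trace()
per_gene = [
    trace.run("get_pathways_by_genes", get_pathways_by_genes, gene_id=g)
    for g in ["g1", "g2"]
]
merged = trace.run("merge_pathways", merge_pathways, lists=per_gene)
```

After the run, `trace.events` holds one record per task invocation, which is the raw material for the queries users expect to ask.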
26. Why provenance matters, if done right
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design
The W3C Incubator on Provenance has been collecting numerous use cases:
http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
27. What users expect to learn
• Causal relations:
- which pathways come from which genes?
- which processes contributed to producing an image?
- which process(es) caused data to be incorrect?
- which data caused a process to fail?
• Process and data analytics:
– analyze variations in output vs an input parameter sweep (multiple process runs)
– how often has my favourite service been executed? on what inputs?
– who produced this data?
– how often does this pathway turn up when the input genes range over a certain set S?
[Figure: example workflow graph. Node labels: lister, get pathways by genes1, merge pathways, gene_id, concat gene pathway ids, output, pathway_genes.]
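A causal question such as "which pathways come from which genes?" amounts to a traversal of recorded dependencies. The sketch below assumes the trace has already been reduced to a simple item-to-parents mapping; the `derived_from` data and both helper functions are hypothetical.

```python
# Each data item maps to the items it was directly computed from (made-up data).
derived_from = {
    "pathways(g1)": ["g1"],
    "pathways(g2)": ["g2"],
    "merged": ["pathways(g1)", "pathways(g2)"],
}


def lineage(item, derived_from):
    """Return every item that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def source_inputs(item, derived_from):
    """Keep only ancestors with no recorded derivation, i.e. workflow inputs."""
    return {x for x in lineage(item, derived_from) if x not in derived_from}
```

Here `source_inputs("merged", derived_from)` answers "which genes did this merged pathway list come from?" for the toy data.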
28. Open Provenance Model
• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.1 out soon
Goal: standardize causal dependencies to enable provenance metadata exchange
[Figure: the two basic OPM edges. Artifact A wasGeneratedBy (R) process P; process P used (R) artifact A.]
[Figure: example OPM graph over artifacts A1 to A4 and processes P1 to P3, with edges labelled wgb(R1), wgb(R2), used(R3), used(R4), wgb(R5), wgb(R6).]
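The two edge types can be written down as plain records. This is a sketch only: the graph below is a made-up example, not the topology of the slide's figure, and `candidate_sources` is a hypothetical helper. It returns candidates because a process that used an artifact did not necessarily use it to produce every one of its outputs.

```python
from typing import NamedTuple


class Used(NamedTuple):
    process: str
    artifact: str
    role: str


class WasGeneratedBy(NamedTuple):
    artifact: str
    process: str
    role: str


# Hypothetical graph with two processes and four artifacts.
used = [
    Used("P1", "A1", "R1"),
    Used("P1", "A2", "R2"),
    Used("P2", "A3", "R3"),
]
wgb = [
    WasGeneratedBy("A3", "P1", "R4"),
    WasGeneratedBy("A4", "P2", "R5"),
]


def candidate_sources(artifact, used, wgb):
    """Artifacts consumed by the process(es) that generated `artifact`."""
    producers = {e.process for e in wgb if e.artifact == artifact}
    return {e.artifact for e in used if e.process in producers}
```

Because nothing here refers to a workflow engine, the same records could describe provenance captured by any system.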
29. Additional requirements on OPM
• Artifact values require a uniform, common identifier scheme
– Linked Data in OPM?
• OPM accounts for structural causal relationships
– additional domain-specific knowledge required
– attaching semantic annotations to OPM graph nodes
• OPM graphs can grow very large
– reduce size by exporting only query results
• Taverna approach
– multiple levels of abstraction
• through OPM accounts (“points of view”)
31. Query results as OPM graphs
[Figure: executing workflow W (run W) produces the provenance trace prov(W). A query Q over the trace yields Q(prov(W)), which is exported as the OPM graph OPM(Q(prov(W))).]
- Approach implemented in the Taverna 2.1 workflow system
- Internal provenance DB with ad hoc query language
Just released!
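The pipeline prov(W), then Q(prov(W)), then OPM(Q(prov(W))) can be illustrated with an in-memory SQLite table. Everything below is an assumption made for illustration: the `dependency` schema, the `query` and `to_opm` helpers, and the data are invented and do not reflect Taverna's actual provenance database or query language.

```python
import sqlite3

# prov(W): a toy provenance store for one workflow run.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dependency (kind TEXT, effect TEXT, cause TEXT, role TEXT)"
)
conn.executemany(
    "INSERT INTO dependency VALUES (?, ?, ?, ?)",
    [
        ("used", "P1", "A1", "in"),
        ("wasGeneratedBy", "A2", "P1", "out"),
        ("used", "P2", "A2", "in"),
        ("wasGeneratedBy", "A3", "P2", "out"),
    ],
)


def query(conn, process):
    """Q: select only the dependencies that touch one process."""
    cur = conn.execute(
        "SELECT kind, effect, cause, role FROM dependency "
        "WHERE effect = ? OR cause = ? ORDER BY kind, effect",
        (process, process),
    )
    return cur.fetchall()


def to_opm(rows):
    """Export: turn query rows into a small OPM-style graph."""
    graph = {"artifacts": set(), "processes": set(), "edges": []}
    for kind, effect, cause, role in rows:
        if kind == "used":
            graph["processes"].add(effect)
            graph["artifacts"].add(cause)
        else:  # wasGeneratedBy
            graph["artifacts"].add(effect)
            graph["processes"].add(cause)
        graph["edges"].append((kind, effect, cause, role))
    return graph


opm_p2 = to_opm(query(conn, "P2"))
```

Only the two edges around P2 are exported, which is the point of the approach: the exchanged graph is as small as the question that was asked.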
35. Full-fledged data-mediated collaborations
[Figure: experiment A runs workflow A on input A and packages result datasets A and result provenance A into a Research Object. Result A becomes input B: experiment B runs workflow B on input B and packages result datasets B and result provenance B into a second Research Object.]
38. Full-fledged data-mediated collaborations
[Figure: workflow A + input A and workflow B + input B, linked by result A → input B, yield a single Research Object holding result datasets A, result datasets B, and the composed result provenance A+B.]
Provenance composition accounts for implicit collaboration
Aligned with focus of upcoming Provenance Challenge 4: "connect my provenance to yours" into a whole OPM provenance graph.
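Provenance composition can be sketched as a union of two dependency graphs that share identifiers. The traces, the `compose` helper, and the `lineage` helper below are hypothetical, and the sketch assumes that experiment B refers to A's result by the same identifier that A used.

```python
# Each trace maps a data item to the items it was directly derived from.
prov_a = {"resultA": ["inputA"]}
prov_b = {"resultB": ["resultA"]}  # B's input is A's result, same identifier


def compose(*traces):
    """Union several traces; shared identifiers stitch the graphs together."""
    merged = {}
    for trace in traces:
        for item, parents in trace.items():
            known = merged.setdefault(item, [])
            for parent in parents:
                if parent not in known:
                    known.append(parent)
    return merged


def lineage(item, derived_from):
    """Return every item that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


prov_ab = compose(prov_a, prov_b)
```

On B's trace alone, the lineage of `resultB` stops at `resultA`. On the composed trace it reaches back to `inputA`, which makes the implicit collaboration between the two experiments visible.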
39. Contacts
The myGrid Consortium (Manchester, Southampton)
http://mygrid.org.uk
http://www.myexperiment.org
Me: pmissier@acm.org
Janus Provenance