Provenance with a Purpose
Khalid Belhajjame
PSL, Université Paris-Dauphine, LAMSADE
kbelhajj@gmail.com
© K. Belhajjame 1
December 9th, 2022
We start with a short tale ... about provenance
Characters:
• Alice, a scientist who uses workflows for her computational experiments and
analyses
• Bob, a believer in the greatness of provenance, who wants to spread the word
Workflows are great, but they can be difficult to
make work, and even when they do, it takes me a
long time to make sense of the results
You should use provenance.
It will help you with a lot of stuff.
Really! Like what?
Plenty of things.
Debugging your workflows, understanding the results, experiment reporting, analysing
and optimizing the workflow, verifying the results/findings of others, reusing the
(intermediate) results… you name it
Sounds like I have found my happiness, I will
definitely try it
A few months later
Hello Alice, how did it go?
Hi Bob, to be honest, not great
The provenance recorded is too fine-grained; it
takes me ages to get my head around it, and even
when I do, it does not always have what I really
need
Needless to say, I have even more trouble
making sense of the provenance of the executions
of my colleagues' workflows
Oh, and the execution of my workflows is getting
slower, and I cannot afford to store all the collected
provenance… I just remove it after a few workflow
executions
Moral of the story …
• By and large, provenance in current systems is collected without really considering the
requirements of the applications that will be using it
• As a result, we end up collecting all sorts of things just to find later that:
• Interpretability. Collected provenance is difficult to understand.
• Relevance. Most of the collected provenance is not relevant for the task at hand.
• Completeness. It does not contain all the information needed for the task at hand.
• These conclusions are not limited to workflows
[Figure labels: Workflow System, Capture Provenance, Provenance Log. Question posed: “What can I do with collected provenance?”]
Here, I am arguing for (and, by the way, coining a new
term for) “Provenance with a purpose”
Debugging Workflows
• Scenario
• The workflow developer defines breakpoints. A breakpoint is associated with a step (an activity or
subworkflow) in the workflow.
• During the execution of the workflow, execution is paused before and after the
activities associated with breakpoints
• Requirements provenance-wise
• Recording and displaying to the workflow developer the data bound to the input and output of
the steps associated with the breakpoint.
• May involve recording the state of objects that are outside the scope of the inputs and outputs of
the activity that is subject to breakpointing, e.g., a file or a database that is updated by the
activity in question
• One can imagine a situation where the developer alters the input data of a given step that is
associated with a breakpoint
• Input provided by the workflow developer
• Breakpoints
• Optionally, values to use for the inputs of given activities
Relevance
Completeness
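The debugging scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system described in the talk: the names `BreakpointRecord`, `DebugRun`, and the toy `steps` are hypothetical, each step is assumed to be a function from a dict to a dict, and the recording of external state (files, databases) is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set, Tuple

Step = Tuple[str, Callable[[Dict[str, Any]], Dict[str, Any]]]


@dataclass
class BreakpointRecord:
    """Provenance kept for one step that carries a breakpoint (hypothetical)."""
    step: str
    inputs: Dict[str, Any]
    outputs: Dict[str, Any]


@dataclass
class DebugRun:
    """Runs steps in order and records provenance only at breakpoints."""
    breakpoints: Set[str]
    # Optional developer-supplied input values, keyed by step name.
    overrides: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    log: List[BreakpointRecord] = field(default_factory=list)

    def run(self, steps: List[Step], data: Dict[str, Any]) -> Dict[str, Any]:
        for name, fn in steps:
            at_breakpoint = name in self.breakpoints
            if at_breakpoint:
                # "Pause before": apply any values the developer provided.
                data = {**data, **self.overrides.get(name, {})}
                before = dict(data)
            result = fn(data)
            if at_breakpoint:
                # "Pause after": keep the data bound to inputs and outputs.
                self.log.append(BreakpointRecord(name, before, dict(result)))
            data = result
        return data


# Toy workflow, assumed for illustration only.
steps: List[Step] = [
    ("load", lambda d: {"values": [3, 1, 2]}),
    ("sort", lambda d: {"values": sorted(d["values"])}),
    ("head", lambda d: {"first": d["values"][0]}),
]

run = DebugRun(breakpoints={"sort"})
final = run.run(steps, {})
```

Only the `sort` step is recorded in `run.log`; the other steps leave no provenance behind, which is the point of capturing with a purpose.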
Experiment Reporting
• Scenario
• Summarization:
• Identify the subset of the workflow (activities that are of interest)
• Retain only the information relating to a subset of the inputs of the workflow and/or its
outputs
• Abstraction: specify domain annotations to use
• Inputs provided by the user
• Template for reporting.
• For example, sections that need to be filled, and the corresponding steps (or
subworkflows) in the overall workflow
• Sources of annotations: these can be external resources, e.g., Bio.Tools, but in certain cases
annotations can be extracted from the data values themselves
• Requirements provenance-wise
• Recording only the execution information that is necessary to feed the report
Relevance
Completeness
Interpretability
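As a rough illustration of the reporting scenario above, the sketch below filters a trace down to the steps named in a report template and relabels them with domain annotations. Everything here is assumed for the example: the trace records, the template sections, the annotation labels, and the function name `build_report` are hypothetical and do not come from any real registry or tool.

```python
from typing import Dict, List

# Hypothetical execution trace: one record per activity execution.
trace: List[Dict[str, str]] = [
    {"activity": "fetch_reads", "output": "reads.fastq"},
    {"activity": "tmp_convert", "output": "reads.tmp"},
    {"activity": "align", "output": "aligned.bam"},
    {"activity": "call_variants", "output": "variants.vcf"},
]

# Hypothetical report template: section title -> steps that feed it.
template: Dict[str, List[str]] = {
    "Data acquisition": ["fetch_reads"],
    "Analysis": ["align", "call_variants"],
}

# Hypothetical domain annotations (abstraction step).
annotations: Dict[str, str] = {
    "fetch_reads": "Data retrieval",
    "align": "Sequence alignment",
    "call_variants": "Variant calling",
}


def build_report(
    trace: List[Dict[str, str]],
    template: Dict[str, List[str]],
    annotations: Dict[str, str],
) -> Dict[str, List[str]]:
    """Keep only the records the template needs, labelled with annotations."""
    report: Dict[str, List[str]] = {}
    for section, wanted in template.items():
        report[section] = [
            # Fall back to the raw activity name when no annotation exists.
            f"{annotations.get(r['activity'], r['activity'])}: {r['output']}"
            for r in trace
            if r["activity"] in wanted
        ]
    return report


report = build_report(trace, template, annotations)
```

The `tmp_convert` record never reaches the report, so in a purpose-driven setting it would not need to be recorded at all.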
Policy Verification
• Scenario
• A number of policies on the data
• For example, before feeding sensitive data values to a remote analysis, they should be
anonymized or stripped of identifiers
• The way the data is used needs to comply with the rights of the owners or the policies
defined on the data
• Requirements provenance-wise
• Some policies can be verified by directly analyzing the prospective provenance (workflow
specifications)
• Others can only be checked during the execution of the workflow through analysis of the
retrospective provenance of the workflow
• Note that in this case, the execution of a workflow can be halted if it is found to breach a policy
• Input provided by the user of the workflow
• Policies associated with the datasets that are fed to the workflow, as well as those associated
with the datasets underlying the execution of the activities of the workflow
Relevance
Completeness
Interpretability
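The two kinds of check mentioned above can be sketched as follows, assuming a workflow specification given as a simple step-to-successors mapping. The step names, the function names `unprotected_remote_steps` and `guard_remote_input`, and the `PolicyBreach` exception are hypothetical; a real system would work on a richer specification and richer policies.

```python
from typing import Any, Dict, List, Set

# Hypothetical workflow specification (prospective provenance):
# each step maps to the steps that consume its output.
workflow: Dict[str, List[str]] = {
    "read_patients": ["anonymize", "local_stats"],
    "anonymize": ["remote_analysis"],
    "local_stats": [],
    "remote_analysis": [],
}


def unprotected_remote_steps(
    spec: Dict[str, List[str]],
    sensitive: Set[str],
    anonymizers: Set[str],
    remote: Set[str],
) -> Set[str]:
    """Static check: remote steps reachable from sensitive sources
    along a path that does not pass through an anonymizing step."""
    violations: Set[str] = set()
    seen: Set[str] = set()
    stack = list(sensitive)
    while stack:
        step = stack.pop()
        if step in seen:
            continue
        seen.add(step)
        if step in anonymizers:
            continue  # data leaving an anonymizer is treated as safe
        if step in remote:
            violations.add(step)
        stack.extend(spec.get(step, []))
    return violations


class PolicyBreach(Exception):
    """Raised to halt execution when a data value breaches a policy."""


def guard_remote_input(
    step: str, record: Dict[str, Any], remote: Set[str], identifiers: Set[str]
) -> None:
    """Runtime check (retrospective provenance): inspect the actual value
    about to be sent to a remote step and halt if it carries identifiers."""
    leaked = identifiers.intersection(record)
    if step in remote and leaked:
        raise PolicyBreach(f"{step} would receive identifiers: {sorted(leaked)}")


static_violations = unprotected_remote_steps(
    workflow,
    sensitive={"read_patients"},
    anonymizers={"anonymize"},
    remote={"remote_analysis"},
)
```

The static check needs only the specification, while the guard needs the data values seen at run time, which mirrors the split between prospective and retrospective provenance.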
Architecture
• Information sources: Workflow Engine, Workflow Exec Traces, Operating System, Data management system, The Web
• Provenance Layer: Provenance Augmentation, Abstraction/Annotation
• Applications Layer: Wf Debugger, Exp Reporting, Policy Checker/Enforcer, Reproducibility checker
• Users: Wf Designer, Wf user
How does it work?
• Choose your task
• Provide necessary inputs, if any
• Capture (only the) necessary provenance
• Assist the user in the task at hand
Actors: User, System
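One way to picture the "capture only the necessary provenance" step is a per-task capture plan that decides which fields of each execution event are kept. This is a minimal sketch under assumptions: the task names, the field names, and the `CAPTURE_PLANS` table are invented for illustration and are not taken from the talk.

```python
from typing import Any, Dict, List, Set

# Hypothetical capture plans: task -> event fields that task needs.
CAPTURE_PLANS: Dict[str, Set[str]] = {
    "debugging": {"inputs", "outputs"},
    "reporting": {"outputs"},
    "policy_checking": {"inputs", "data_policy"},
}


def capture(task: str, events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep, for each event, the step name plus only the fields the task needs."""
    needed = CAPTURE_PLANS[task]
    return [
        {"step": e["step"], **{k: e[k] for k in needed if k in e}}
        for e in events
    ]


# Hypothetical raw events as an engine might emit them.
events = [
    {
        "step": "align",
        "inputs": ["reads.fastq"],
        "outputs": ["aligned.bam"],
        "duration_s": 12.5,
        "host": "node-7",
    }
]

reporting_log = capture("reporting", events)
debugging_log = capture("debugging", events)
```

Here the user's choice of task selects the plan, and the system discards fields such as `duration_s` and `host` that no plan asks for.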
Of course this is far from being perfect…
This is not entirely new
• Alban Gaignard, Hala Skaf-Molli, Khalid Belhajjame: Findable and reusable workflow data products: A genomic
workflow case study. Semantic Web 11(5): 751-763 (2020)
• Renan Souza, Marta Mattoso: Provenance of Dynamic Adaptations in User-Steered Dataflows. IPAW 2018: 16-29
• Timothy M. McPhillips et al. YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering
Workflow Information from Scripts. CoRR abs/1502.02403 (2015)
• Pinar Alper, Khalid Belhajjame, Carole A. Goble: Static analysis of Taverna workflows to predict provenance
patterns. Future Gener. Comput. Syst. 75: 310-329 (2017)
• Daniel Deutch, Amir Gilad, Yuval Moskovitch: Efficient provenance tracking for datalog using top-k queries.
VLDB J. 27(2): 245-269 (2018)
What is new then?
A single framework that caters for, and can be adapted to, different
provenance usage scenarios
