Provenance Central:
More Mileage from Provenance Metadata
Bertram Ludäscher
UC Davis, USA
ludaesch@ucdavis.edu
Paolo Missier
Newcastle University, UK
paolo.missier@ncl.ac.uk
Members of the DataONE Provenance Working Group
CAMP-4-DATA workshop @IPres 2013
Sept. 6, 2013
Lisbon, Portugal
Friday, 6 September 13
Outline
• A foundation for Provenance management: the PROV data model
– From the W3C; a Recommendation as of spring 2013
– generic, extensible model
• The role of provenance in the DataONE project
– Provenance enables search and discovery, reuse, reproducibility
– PBase: Provenance warehousing
– Integration with the DataONE architecture
– Provenance mining: the social life of research data
PROV: scope and structure
[Figure: the PROV family of documents, showing those on the Recommendation track, plus PROV-Dictionary. Source: http://www.w3.org/TR/prov-overview/]
PROV Core Elements (graph depiction)
An entity is a physical, digital, conceptual, or other kind of thing with some fixed
aspects; entities may be real or imaginary.
An activity is something that occurs over a period of time and acts upon or with entities; it
may include consuming, processing, transforming, ..., using, or generating entities.
An agent is something that bears some form of responsibility for an activity taking place,
for the existence of an entity, or for another agent's activity.
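The three core elements and the relations between them can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the PROV serialization or any library's API: the `ProvGraph` class and `build_example` helper are hypothetical names, and the way the drafting/commenting example is wired together is an assumption based on the node and edge labels in the slide's figure. The subject and object kinds of each relation follow the PROV data model.

```python
# Core PROV relations and the (subject kind, object kind) each one connects.
SIGNATURES = {
    "used": ("activity", "entity"),
    "wasGeneratedBy": ("entity", "activity"),
    "wasAssociatedWith": ("activity", "agent"),
    "actedOnBehalfOf": ("agent", "agent"),
}


class ProvGraph:
    """Hypothetical in-memory holder for PROV nodes and relations."""

    def __init__(self):
        self.kinds = {}  # node id -> "entity" | "activity" | "agent"
        self.edges = []  # (subject, relation, object) triples

    def add(self, kind, *ids):
        for node_id in ids:
            self.kinds[node_id] = kind

    def relate(self, subject, relation, obj):
        # Reject relations whose end points have the wrong kind.
        expected = SIGNATURES[relation]
        actual = (self.kinds[subject], self.kinds[obj])
        if actual != expected:
            raise TypeError(f"{relation} expects {expected}, got {actual}")
        self.edges.append((subject, relation, obj))


def build_example():
    """Assumed wiring of the drafting/commenting example."""
    g = ProvGraph()
    g.add("entity", "paper1", "paper2", "draft_v1", "draft_comments")
    g.add("activity", "drafting", "commenting")
    g.add("agent", "Alice", "Bob")
    g.relate("drafting", "used", "paper1")
    g.relate("drafting", "used", "paper2")
    g.relate("draft_v1", "wasGeneratedBy", "drafting")
    g.relate("commenting", "used", "draft_v1")
    g.relate("draft_comments", "wasGeneratedBy", "commenting")
    g.relate("drafting", "wasAssociatedWith", "Alice")
    g.relate("commenting", "wasAssociatedWith", "Bob")
    g.relate("Bob", "actedOnBehalfOf", "Alice")
    return g
```

Checking kinds in `relate` is a design choice for the sketch: it makes a statement such as "an agent used an entity" fail early instead of silently entering the graph.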
[Figure: example PROV graph, laid out on a time axis from remote past to recent past.
Activities: drafting, commenting. Entities: paper1, paper2, draft v1, draft comments. Agents: Alice, Bob.
Relations: used, wasGeneratedBy, wasAssociatedWith, actedOnBehalfOf.
Attributes shown: distribution=internal, status=draft, version=0.1, ex:role=main_editor, type=person, ex:role=sr_editor, prov:role=editor, time=...]
Summary of the PROV Core model
– PROV-DC mapping available
– Recent Tutorial @EDBT’13 (June, 2013) [1]
• Model, Constraints, Applications
[1] Missier, Paolo, Khalid Belhajjame, and James Cheney. “The W3C PROV Family of Specifications for
Modelling Provenance Metadata.” In Procs. EDBT’13 (Tutorial). Genova, Italy: ACM, 2013.
PROV-DM relations at a glance
Context: ProvWG@DataONE
• DataONE: Data Observation Network for Earth
– Five-year NSF DataNet data preservation project (current phase)
– Provides a large-scale, federated data infrastructure to the Earth sciences community
• Provenance Working Group
– Active until July 2014 (current phase; an extension is being considered)
– One or two interns per year since 2010
– One dedicated researcher (postdoc) since 2012
– 12 core members, plus guest members on rotation
• Specific focus on the provenance of workflow-based e-science data
DataONE collaboration scenario - 2012
Alice’s Workflow: generates benchmark climate data for model comparison
Input is retrieved from DataONE to generate an output file
The workflow, provenance, and other metadata are uploaded to DataONE
A data package is created and indexed
Searching
Bob: Search based on keywords in the metadata
➡ including provenance terms
Bob discovers Alice’s workflow. He may be able to execute it again
PBase and DataONE
[Figure: PBase within the DataONE architecture. Labeled components: System Metadata, Science Data, Science Metadata, Search API, Index, Identifiers / Text fields, Extract-Align-Augment Metadata, Provenance Curation, Graph Structure, ProvExplorer, Internal Metadata, Index, Repository, PBase / D-PROV, Querying]
– Provenance traces in PBase linked to DataONE packages
– Provenance traces indexed for searching
DataONE Provenance components I: D-PROV
D-PROV extends PROV: it connects trace metadata to workflow structure
[Figure: D-PROV example graph linking an execution trace to the workflow structure.
Trace nodes: wfInv, T1Inv, T2Inv, d. Workflow nodes: wf, T1, T2, op1, ip1.
Relations: onOutPort, onInPort, wasAssociatedWith, wasStartedBy, isTaskOf, hasInputPort, hasOutputPort, dataLink]
Missier, Paolo, Saumen Dey, Khalid Belhajjame, Victor Cuevas, and Bertram Ludaescher. “D-PROV: Extending the
PROV Provenance Model with Workflow Structure.” In Procs. TAPP’13. Lombard, IL, 2013.
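To illustrate what connecting a trace to the workflow structure buys, here is a minimal sketch in Python. The triples reuse the node and relation names from the slide's figure, but the exact wiring between them is an assumption, and the helper functions (`objects`, `subjects`, `producing_tasks`, `consuming_tasks`) are hypothetical names rather than part of D-PROV.

```python
# Assumed wiring of the figure: workflow structure first, then the trace.
TRIPLES = [
    # Workflow structure
    ("T1", "isTaskOf", "wf"),
    ("T2", "isTaskOf", "wf"),
    ("T1", "hasOutputPort", "op1"),
    ("T2", "hasInputPort", "ip1"),
    ("op1", "dataLink", "ip1"),
    # Execution trace
    ("wfInv", "wasAssociatedWith", "wf"),
    ("T1Inv", "wasAssociatedWith", "T1"),
    ("T2Inv", "wasAssociatedWith", "T2"),
    ("T1Inv", "wasStartedBy", "wfInv"),
    ("T2Inv", "wasStartedBy", "wfInv"),
    # Links between trace data and workflow ports
    ("d", "onOutPort", "op1"),
    ("d", "onInPort", "ip1"),
]


def objects(subject, relation, triples=TRIPLES):
    """All o such that (subject, relation, o) is in the graph."""
    return [o for s, r, o in triples if s == subject and r == relation]


def subjects(relation, obj, triples=TRIPLES):
    """All s such that (s, relation, obj) is in the graph."""
    return [s for s, r, o in triples if r == relation and o == obj]


def producing_tasks(data_item):
    """Tasks that own an output port on which the data item appeared."""
    return [task
            for port in objects(data_item, "onOutPort")
            for task in subjects("hasOutputPort", port)]


def consuming_tasks(data_item):
    """Tasks that own an input port on which the data item appeared."""
    return [task
            for port in objects(data_item, "onInPort")
            for task in subjects("hasInputPort", port)]
```

With plain PROV alone, the data item `d` would only be tied to invocations; the port relations are what allow a question about `d` to be answered in terms of the tasks `T1` and `T2` of the workflow definition.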
DataONE Provenance components II: PBase
[Figure: PBase architecture]
• Mappings into D-PROV: R ➞ DProv, T ➞ DProv, V ➞ DProv, eSc ➞ DProv, Tr ➞ DProv, K ➞ DProv
• Components: Neo4J loader, graph storage, query layer, indexing, analytical services
• Annotations on the components:
– In-house components
– Neo4J graph DBMS [AllegroGraph] [Graph-*]
– Cypher (Neo, declarative) [Gremlin (procedural)]; can we do better? Scaling graph queries
– Can we do better than the built-in Neo indexing?
– To be developed
Baseline provenance queries in PBase
Ancestors and descendants (lineage): [2,3]
– Which datasets were involved in the production of data at node “e33”?
– Reachability: was task “e11_miny” involved in producing data at node “e38”?
Execution analysis: [3]
– Which tasks did not execute to completion for execution X of a given workflow?
– Find all inputs [outputs] of a given workflow across all its executions
– Given a data item, find all workflows / tasks that have used it as input
– Suppose we discover that service S has a bug: which data products were impacted by it?
– How many times was task T activated across a pool of workflow executions?
Provenance differencing: [4]
– Why do the results from two executions of the same workflow differ?
Attribution: [5]
– Who was responsible for this {data {usage, production}, service invocation}?
[2] Dey, Saumen, Víctor Cuevas-Vicenttín, Sven Köhler, Eric Gribkoff, Michael Wang, and Bertram Ludäscher. "On
implementing provenance-aware regular path queries with relational query engines." In Proceedings of the Joint
EDBT/ICDT 2013 Workshops, pp. 214-223. ACM, 2013.
[3] Dey, Saumen, Sven Köhler, Shawn Bowers, and Bertram Ludäscher. "Datalog as a lingua franca for
provenance querying and reasoning." In Workshop on the theory and practice of provenance (TaPP). 2012.
[4] Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for
Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience, 2013
[5] Missier, Paolo, Bertram Ludäscher, Saumen Dey, Michael Wang, Tim McPhillips, Shawn Bowers, Michael Agun,
and Ilkay Altintas. "Golden Trail: Retrieving the Data History that Matters from a Comprehensive Provenance
Repository." International Journal of Digital Curation 7, no. 1 (2012): 139-150.
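The lineage and reachability queries above reduce to graph traversal. The following is a minimal Python sketch, not the PBase implementation (which the slides place on a graph DBMS): the toy trace in `DEPENDS_ON` is hypothetical, reusing only the node names "e33", "e11_miny" and "e38" from the example questions, and it collapses used / wasGeneratedBy edges into a single "depends on" relation.

```python
from collections import deque

# Hypothetical toy trace: each node maps to the nodes it directly depends on.
DEPENDS_ON = {
    "e38": ["e11_miny"],
    "e11_miny": ["e33"],
    "e33": ["e5", "e7"],
    "e5": [],
    "e7": [],
    "e40": ["e7"],
}


def ancestors(node, depends_on=DEPENDS_ON):
    """Everything the node transitively depends on (breadth-first)."""
    seen = set()
    queue = deque(depends_on.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(depends_on.get(current, []))
    return seen


def descendants(node, depends_on=DEPENDS_ON):
    """Everything that transitively depends on the node."""
    return {n for n in depends_on if node in ancestors(n, depends_on)}


def was_involved(source, target, depends_on=DEPENDS_ON):
    """Reachability: did `source` contribute to producing `target`?"""
    return source in ancestors(target, depends_on)
```

In this sketch, `ancestors("e33")` answers the first lineage question and `was_involved("e11_miny", "e38")` the reachability one. The impact question for a buggy service is the same traversal in the opposite direction, so `descendants` covers it under the same simplifying assumption.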
Application - The social life of research data
• We know all about searching in the publications space
– who else is working on problems similar to mine?
– which results are available?
• In the data and process space:
1. Search and discovery
• Who else has used the {datasets, services, workflows,...} I am using?
– how do others rate them?
• Who used my {datasets, services, workflows,...}? How were they used?
2. Reuse, reproduction, validation
• Can I reproduce these results?
– using the same exact method
– using a variation of the method
• How do I apply this method to my data?
• ...
Social provenance for community building
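A question such as "who else has used the datasets, services or workflows I am using?" can be answered once usage facts are pulled out of the provenance traces. The sketch below assumes those facts are already available as (agent, resource) pairs; the `USAGE` records, the resource identifiers and the `who_else_used` function are all hypothetical and serve only to show the shape of the query.

```python
from collections import defaultdict

# Hypothetical usage facts, as might be derived from attribution and
# usage relations in a collection of provenance traces.
USAGE = [
    ("alice", "dataset:climate-benchmark"),
    ("alice", "workflow:model-comparison"),
    ("bob", "dataset:climate-benchmark"),
    ("carol", "workflow:model-comparison"),
    ("carol", "dataset:soil-moisture"),
]


def who_else_used(agent, usage=USAGE):
    """For each resource the agent used, list the other agents who used it."""
    users_of = defaultdict(set)
    for user, resource in usage:
        users_of[resource].add(user)
    mine = {resource for user, resource in usage if user == agent}
    return {resource: sorted(users_of[resource] - {agent})
            for resource in sorted(mine)}
```

The reverse question ("who used my datasets?") has the same form, with ownership rather than usage selecting the resources of interest.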
From Pull (client queries) to Push (notifications)
• Uncovering latent connections amongst services / data / people:
– Ranking, clustering, association rules
– Requires new similarity metrics
• A recommender system for scientists
– Analytics layer activated when new traces are added
• Challenges:
– How large a corpus of provenance graphs is needed?
– How global should the community be?
• Little new to discover in a small community
– Requires graphs with rich attribution / association relations
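As one possible reading of "ranking" with a similarity metric, the sketch below scores resources for a scientist by the Jaccard similarity between their usage set and that of other scientists. Jaccard is a stand-in chosen for illustration; the slides call for new similarity metrics and do not name one. The `recommend` function and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(agent, used):
    """Rank resources the agent has not used yet.

    `used` maps each agent to the set of resources found in their traces.
    A resource's score is the summed similarity of the agents who used it.
    """
    mine = used[agent]
    scores = {}
    for other, theirs in used.items():
        if other == agent:
            continue
        similarity = jaccard(mine, theirs)
        if similarity == 0:
            continue  # no overlap, so no evidence of shared interests
        for resource in theirs - mine:
            scores[resource] = scores.get(resource, 0.0) + similarity
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical usage sets extracted from a small corpus of traces.
SAMPLE = {
    "alice": {"d1", "d2", "w1"},
    "bob": {"d1", "d2", "w2"},
    "carol": {"d1", "d3"},
    "dave": {"d9"},
}
```

Re-running `recommend` for affected scientists whenever a new trace is loaded would be one way to move from pull to push. The sample also hints at the corpus-size challenge: an agent with no overlap (here `dave`) contributes nothing and receives nothing.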
[Figure: PBase architecture layers: graph storage, query layer, indexing, analytical services]
Credits
Current members of the DataONE Provenance Working Group:
In the USA:
Bertram Ludaescher, UC Davis (co-lead)
Victor Cuevas Vicenttin, UC Davis (DataONE postdoc researcher)
Saumen Dey, UC Davis (researcher)
Parisa Kianmajd, UC Davis (intern)
Juliana Freire, NYU-Poly
David Koop, NYU-Poly
Fernando Chirigati, NYU-Poly
Shawn Bowers, Gonzaga University
Ilkay Altintas, SDSC/UCSD
Karthik Ram, UC Berkeley
Yolanda Gil, USC - ISI
Yaxing Wei, ORNL
Dave Vieglais, DataONE Technical Lead
In the UK:
Paolo Missier, Newcastle University
James Cheney, University of Edinburgh
Khalid Belhajjame, University of Manchester
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Camp 4-data workshop presentation

  • 1. Provenance Central: More Mileage from Provenance Metadata Bertram Ludäscher UC Davis, USA ludaesch@ucdavis.edu Paolo Missier Newcastle University, UK paolo.missier@ncl.ac.uk Members of the DataONE Provenance Working Group CAMP-4-DATA workshop @IPres 2013 Sept, 6, 2013 Lisbon, Portugal Friday, 6 September 13
  • 2. Outline
      • A foundation for provenance management: the PROV data model
          – From the W3C; a Recommendation as of spring 2013
          – Generic, extensible model
      • The role of provenance in the DataONE project
          – Provenance enables search and discovery, reuse, reproducibility
          – PBase: provenance warehousing
          – Integration with the DataONE architecture
          – Provenance mining: the social life of research data
  • 3. PROV: scope and structure (figure; source: http://www.w3.org/TR/prov-overview/). Figure labels: Recommendation track; plus: Prov-dictionary
  • 5. PROV Core Elements (graph depiction)
      • Entity: an entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.
      • Activity: an activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities.
      • Agent: an agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
      • Figure labels:
          – Activities: drafting, commenting
          – Entities: paper1, paper2, draft v1, draft comments
          – Agents: Alice, Bob
          – Relations: used, wasGeneratedBy, wasAssociatedWith, actedOnBehalfOf
          – Time axis: Remote past to Recent past
          – Attributes: distribution=internal, status=draft, version=0.1, ex:role=main_editor, type=person, ex:role=sr_editor, prov:role=editor, time=...
  • 6. Summary of the PROV Core model
      – PROV-DC mapping available
      – Recent tutorial @EDBT’13 (June, 2013) [1]
          • Model, Constraints, Applications
      [1] Missier, Paolo, Khalid Belhajjame, and James Cheney. “The W3C PROV Family of Specifications for Modelling Provenance Metadata.” In Procs. EDBT’13 (Tutorial). Genova, Italy: ACM, 2013.
  • 7. PROV-DM relations at a glance
  • 8. Context: ProvWG@DataONE
      • DataONE: Data Observation Network for Earth
          – 5-year NSF DataNet data preservation project (current phase)
          – Provides a large-scale, federated data infrastructure to the Earth Sciences community
      • Provenance Working Group
          – Active until July 2014 (current phase, looking at extending)
          – One or two interns per year since 2010
          – One dedicated researcher (postdoc) since 2012
          – 12 core members, additional guest members on a rotation
      • Specific focus on the provenance of workflow-based e-science data
  • 9. DataONE collaboration scenario - 2012
      • Alice’s workflow generates benchmark climate data for model comparison
      • Input is retrieved from DataONE to generate an output file
  • 10. DataONE collaboration scenario - 2012
      • The workflow, provenance, and other metadata are uploaded to DataONE
      • A data package is created and indexed
  • 11. Searching
      • Bob searches on keywords in the metadata
          ➡ including provenance terms
      • Bob discovers Alice’s workflow; he may be able to execute it again
  • 12. PBase and DataONE
      • Figure labels: System Metadata; Extract-Align-Augment; Metadata; Science Data; Search API; Science Metadata; Provenance; Curation; Index; Identifiers / Text fields; Graph Structure; ProvExplorer; Internal Metadata Index; Repository; PBase / D-PROV; Querying
      – Provenance traces in PBase linked to DataONE packages
      – Provenance traces indexed for searching
  • 13. DataONE Provenance components I: D-PROV
      • D-PROV extends PROV
      • Connects trace metadata to workflow structure
      Missier, Paolo, Saumen Dey, Khalid Belhajjame, Victor Cuevas, and Bertram Ludaescher. “D-PROV: Extending the PROV Provenance Model with Workflow Structure.” In Procs. TAPP’13. Lombard, IL, 2013.
  • 14. DataONE Provenance components I: D-PROV
      • Figure labels:
          – Nodes: T1Inv, T2Inv, d, T1, T2, op1, ip1, wf, wfInv
          – Relations: onOutPort, onInPort, wasAssociatedWith, isTaskOf, hasInputPort, hasOutputPort, wasStartedBy, dataLink
      • D-PROV extends PROV
      • Connects trace metadata to workflow structure
      Missier, Paolo, Saumen Dey, Khalid Belhajjame, Victor Cuevas, and Bertram Ludaescher. “D-PROV: Extending the PROV Provenance Model with Workflow Structure.” In Procs. TAPP’13. Lombard, IL, 2013.
  • 15. DataONE Provenance components II: PBase
      • Figure labels:
          – R ➞ DProv, T ➞ DProv, V ➞ DProv, eSc ➞ DProv, Tr ➞ DProv, K ➞ DProv
          – Neo4J loader
          – Graph storage
          – Query layer
          – Indexing
          – Analytical services
  • 16. DataONE Provenance components II: PBase
      • Figure labels:
          – R ➞ DProv, T ➞ DProv, V ➞ DProv, eSc ➞ DProv, Tr ➞ DProv, K ➞ DProv
          – In-house components: Neo4J loader
          – Graph storage: Neo4J graph DBMS [AllegroGraph] [Graph-*]
          – Query layer
          – Indexing: can we do better than the built-in Neo indexing?
          – Analytical services
  • 17. DataONE Provenance components II: PBase
      • Figure labels:
          – R ➞ DProv, T ➞ DProv, V ➞ DProv, eSc ➞ DProv, Tr ➞ DProv, K ➞ DProv
          – In-house components: Neo4J loader
          – Graph storage: Neo4J graph DBMS [AllegroGraph] [Graph-*]
          – Query layer: Cypher (Neo, declarative) [Gremlin (procedural)]; can we do better? Scaling graph queries
          – Indexing: can we do better than the built-in Neo indexing?
          – Analytical services: to be developed
  • 18. Baseline provenance queries in PBase
      • Ancestors and descendants (lineage): [2,3]
          – Which datasets were involved in the production of data at node “e33”?
          – Reachability: was task “e11_miny” involved in producing data at node “e38”?
      • Execution analysis: [3]
          – Which tasks did not execute to completion for execution X of a given workflow?
          – Find all inputs [outputs] of a given workflow across all its executions
          – Given a data item, find all workflows / tasks that have used it as input
          – Suppose we discover that service S has a bug: which data products were impacted by it?
          – How many times was task T activated across a pool of workflow executions?
      • Provenance differencing: [4]
          – Why do the results from two executions of the same workflow differ?
      • Attribution: [5]
          – Who was responsible for this {data {usage, production}, service invocation}?
      [2] Dey, Saumen, Víctor Cuevas-Vicenttín, Sven Köhler, Eric Gribkoff, Michael Wang, and Bertram Ludäscher. "On implementing provenance-aware regular path queries with relational query engines." In Proceedings of the Joint EDBT/ICDT 2013 Workshops, pp. 214-223. ACM, 2013.
      [3] Dey, Saumen, Sven Köhler, Shawn Bowers, and Bertram Ludäscher. "Datalog as a lingua franca for provenance querying and reasoning." In Workshop on the Theory and Practice of Provenance (TaPP). 2012.
      [4] Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience, 2013.
      [5] Missier, Paolo, Bertram Ludäscher, Saumen Dey, Michael Wang, Tim McPhillips, Shawn Bowers, Michael Agun, and Ilkay Altintas. "Golden Trail: Retrieving the Data History that Matters from a Comprehensive Provenance Repository." International Journal of Digital Curation 7, no. 1 (2012): 139-150.
  • 19. Application - The social life of research data
      • We know all about searching in the publications space
          – Who else is working on problems similar to mine?
          – Which results are available?
      • In the data and process space:
          1. Search and discovery
              • Who else has used the {datasets, services, workflows,...} I am using?
                  – How do others rate them?
              • Who used my {datasets, services, workflows,...}? How were they used?
          2. Reuse, reproduction, validation
              • Can I reproduce these results?
                  – Using the same exact method
                  – Using a variation of the method
              • How do I apply this method to my data?
              • ...
      • Social provenance for community building
  • 20. From Pull (client queries) to Push (notifications)
      • Uncovering latent connections amongst services / data / people:
          – Ranking, clustering, association rules
          – Requires new similarity metrics
      • A recommender system for scientists
          – Analytics layer activated when new traces are added
      • Challenges:
          – How large a corpus of provenance graphs is needed?
          – How global should the community be?
              • Little new to discover in a small community
          – Requires graphs with rich attribution / association relations
      • Figure labels: Graph storage; Query layer; indexing; Analytical services
  • 21. Credits
      Current members of the DataONE Provenance Working Group:
      • In the USA:
          – Bertram Ludaescher, UC Davis (co-lead)
          – Victor Cuevas Vicenttin, UC Davis (DataONE postdoc researcher)
          – Saumen Dey, UC Davis (researcher)
          – Parisa Kianmajd, UC Davis (intern)
          – Juliana Freire, NYU-Poly
          – David Koop, NYU-Poly
          – Fernando Chirigati, NYU-Poly
          – Shawn Bowers, Gonzaga University
          – Ilkay Altintas, SDSC/UCSD
          – Karthik Ram, UC Berkeley
          – Yolanda Gil, USC - ISI
          – Yaxing Wei, ORNL
          – Dave Vieglais, DataONE Technical Lead
      • In the UK:
          – Paolo Missier, Newcastle University
          – James Cheney, University of Edinburgh
          – Khalid Belhajjame, University of Manchester
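
Sketch for slide 5 (PROV core elements). The three node types and four relations named on that slide can be written down as typed triples and checked against the PROV-DM relation signatures. This is only an illustrative sketch in plain Python, not the PROV serialization or any library's API. The edge endpoints below are one reading of the flattened figure and are an assumption, as are the names `Relation`, `SIGNATURES`, `NODES`, `RELATIONS`, and `ill_typed`.

```python
from collections import namedtuple

# One provenance statement: kind(subject, object).
Relation = namedtuple("Relation", ["kind", "subject", "object"])

# Core relation signatures as (subject type, object type).
SIGNATURES = {
    "used": ("activity", "entity"),
    "wasGeneratedBy": ("entity", "activity"),
    "wasAssociatedWith": ("activity", "agent"),
    "actedOnBehalfOf": ("agent", "agent"),
}

# Node types taken from the labels on slide 5.
NODES = {
    "paper1": "entity",
    "paper2": "entity",
    "draft v1": "entity",
    "draft comments": "entity",
    "drafting": "activity",
    "commenting": "activity",
    "Alice": "agent",
    "Bob": "agent",
}

# ASSUMPTION: endpoints are inferred from the figure labels, not stated in the text.
RELATIONS = [
    Relation("used", "drafting", "paper1"),
    Relation("used", "drafting", "paper2"),
    Relation("wasGeneratedBy", "draft v1", "drafting"),
    Relation("used", "commenting", "draft v1"),
    Relation("wasGeneratedBy", "draft comments", "commenting"),
    Relation("wasAssociatedWith", "drafting", "Alice"),
    Relation("wasAssociatedWith", "commenting", "Bob"),
    Relation("actedOnBehalfOf", "Bob", "Alice"),
]


def ill_typed(nodes, relations):
    """Return the relations whose endpoint types do not match the signature."""
    bad = []
    for rel in relations:
        expected = SIGNATURES.get(rel.kind)
        actual = (nodes.get(rel.subject), nodes.get(rel.object))
        if expected != actual:
            bad.append(rel)
    return bad


print(ill_typed(NODES, RELATIONS))  # prints []
```

Keeping the signatures in a table makes the point of the slide concrete: each relation connects specific kinds of nodes, so a statement such as "an agent used an entity" is rejected by the check.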
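
Sketch for slide 18 (lineage and reachability queries). Both baseline lineage questions reduce to graph reachability over a trace whose edges point from a node to the nodes it depends on. The sketch below uses a breadth-first traversal in plain Python; it is not how PBase answers these queries (the slides name Neo4J and Cypher for that). The node names "e33", "e38", and "e11_miny" come from the slide, but the trace itself, the other node names, and the function names are hypothetical.

```python
from collections import deque


def ancestors(edges, start):
    """Return every node reachable from `start` by following dependency edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in edges.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def was_involved(edges, task, data):
    """Reachability: did `task` contribute, directly or not, to `data`?"""
    return task in ancestors(edges, data)


# HYPOTHETICAL trace: each key depends on the nodes in its list
# (data wasGeneratedBy task, task used data).
TRACE = {
    "e38": ["t2"],
    "t2": ["e33"],
    "e33": ["e11_miny"],
    "e11_miny": ["e1", "e2"],
}

print(sorted(ancestors(TRACE, "e33")))          # prints ['e1', 'e11_miny', 'e2']
print(was_involved(TRACE, "e11_miny", "e38"))   # prints True
```

Descendant queries ("which data products were impacted by a buggy service?") follow from the same traversal run over the reversed edges.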
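
Sketch for slide 19 (search and discovery). The question "who else has used the datasets, services, or workflows I am using?" can be answered from attribution and usage records alone. The sketch below assumes those records have already been flattened into (agent, resource) pairs; that input shape, the sample data, and the name `co_users` are assumptions for illustration, not part of PBase or DataONE.

```python
from collections import defaultdict


def co_users(usage, me):
    """Map every other agent to the resources they share with `me`.

    `usage` is a list of (agent, resource) pairs.
    """
    mine = {resource for agent, resource in usage if agent == me}
    shared = defaultdict(set)
    for agent, resource in usage:
        if agent != me and resource in mine:
            shared[agent].add(resource)
    return dict(shared)


# HYPOTHETICAL usage records extracted from provenance traces.
USAGE = [
    ("alice", "dataset:climate"),
    ("alice", "workflow:benchmark"),
    ("bob", "workflow:benchmark"),
    ("carol", "dataset:soil"),
]

print(co_users(USAGE, "alice"))  # prints {'bob': {'workflow:benchmark'}}
```

The size of each shared set is one possible starting point for the similarity metrics that slide 20 says are still needed.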