Paul Groth
Elsevier Labs
@pgroth | pgroth.com
Provenance for
Data Munging Environments
Information Sciences Institute – August 13, 2015
Outline
• What’s data munging and why is it
important?
• The role of provenance
• The reality….
• Desktop data munging & provenance
• Database data munging & provenance
• Declarative data munging (?)
60% of time is spent on data
preparation
What to do?
[Diagram: Data Sources covering Compound, Disease, Pathway, Target, Tissue ✔]
7
Open PHACTS Explorer
8
Open PHACTS Explorer
?
Tension:
Integrated & Summarized Data
Transparency & Trust
Solution:
Tracking and exposing
provenance*
* “a record that describes the people, institutions,
entities, and activities involved in producing,
influencing, or delivering a piece of data”
The PROV Data Model
(W3C Recommendation)
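To make the PROV vocabulary concrete, here is a minimal sketch (Python with rdflib; the URIs, file names, and agent are purely illustrative, not from the talk) recording that a cleaned file was derived from a raw file by a munging run:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")  # hypothetical namespace

g = Graph()
g.bind("prov", PROV)

raw, clean, munge, alice = EX["raw.csv"], EX["clean.csv"], EX["munge-run-1"], EX["alice"]

# Entities, the activity that transformed them, and the responsible agent
g.add((raw, RDF.type, PROV.Entity))
g.add((clean, RDF.type, PROV.Entity))
g.add((munge, RDF.type, PROV.Activity))
g.add((alice, RDF.type, PROV.Agent))

g.add((munge, PROV.used, raw))
g.add((clean, PROV.wasGeneratedBy, munge))
g.add((clean, PROV.wasDerivedFrom, raw))
g.add((munge, PROV.wasAssociatedWith, alice))
g.add((munge, PROV.startedAtTime, Literal("2015-08-13T10:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```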
explorer.openphacts.org
What if you’re not a large
organization?
karma.isi.edu
wings-workflows.org
The reality…
Adventures in word2vec (1)
Adventures in word2vec (2)
The model:
Adventures in word2vec (3)
The model:
Adventures in word2vec (3)
Look, provenance information
http://ivory.idyll.org/blog/replication-i.html
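The replication trouble above is partly a bookkeeping problem. A minimal sketch of the kind of provenance record one could drop next to a trained model, assuming a hypothetical train_word2vec step and made-up file names and parameters:

```python
import hashlib, json, subprocess, sys
from datetime import datetime, timezone

def sha256(path):
    """Hash an input file so the exact training corpus can be identified later."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_provenance(corpus_path, model_path, params):
    record = {
        "generated": model_path,
        "used": {"path": corpus_path, "sha256": sha256(corpus_path)},
        "parameters": params,
        "python": sys.version,
        # Assumes the script lives in a git repo; empty string otherwise.
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        "endedAtTime": datetime.now(timezone.utc).isoformat(),
    }
    with open(model_path + ".prov.json", "w") as f:
        json.dump(record, f, indent=2)

# Hypothetical usage after training:
# train_word2vec("corpus.txt", "vectors.bin", size=300, window=5)
# write_provenance("corpus.txt", "vectors.bin", {"size": 300, "window": 5})
```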
DESKTOP DATA MUNGING &
PROVENANCE
References
Manolis Stamatogiannakis, Paul Groth, Herbert Bos.
Looking Inside the Black-Box: Capturing Data
Provenance Using Dynamic Instrumentation.
5th International Provenance and Annotation Workshop
(IPAW'14)
Manolis Stamatogiannakis, Paul Groth, Herbert Bos.
Decoupling Provenance Capture and Analysis from
Execution.
7th USENIX Workshop on the Theory and Practice of
Provenance (TaPP'15)
23
Capturing Provenance
Disclosed Provenance
+ Accuracy
+ High-level semantics
– Intrusive
– Manual Effort
Observed Provenance
– False positives
– Semantic Gap
+ Non-intrusive
+ Minimal manual effort
CPL (Macko ‘12)
Trio (Widom ‘09)
Wings (Gil ‘11)
Taverna (Oinn ‘06)
VisTrails (Fraire ‘06)
ES3 (Frew ‘08)
Trec (Vahdat ‘98)
PASSv2 (Holland ‘08)
DTrace Tool (Gessiou ‘12)
24
Challenge
• Can we capture provenance
– with low false positive ratio?
– without manual/obtrusive integration effort?
• We have to rely on observed provenance.
25
State of the art
Application
• Observed provenance systems treat programs as black-boxes.
• Can’t tell if an input file was actually used.
• Can’t quantify the influence of input on output.
26
Taint Tracking
Geology Computer Science
27
Our solution: DataTracker
• Captures high-fidelity provenance using Taint
Tracking.
• Key building blocks:
– libdft (Kemerlis ‘12) ➞ Reusable taint-tracking
framework.
– Intel Pin (Luk ‘05) ➞ Dynamic instrumentation
framework.
• Does not require modification of applications.
• Does not require knowledge of application
semantics.
28
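As intuition only (DataTracker itself works at the instruction level via Pin and libdft, not like this), a toy Python sketch of byte-level taint propagation: every byte read from an input carries a label, and labels follow the data into the output:

```python
# Toy byte-level taint propagation: each value carries the set of
# (input_file, offset) labels it was derived from.

class Tainted:
    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)

    def __add__(self, other):
        # Any operation combining two tainted values unions their labels.
        return Tainted(self.value + other.value, self.labels | other.labels)

def tainted_read(name, data):
    """Wrap every byte of an 'input file' with a label recording its origin."""
    return [Tainted(bytes([b]), {(name, i)}) for i, b in enumerate(data)]

def tainted_write(cells):
    """'Write' the output and report which input bytes influenced it."""
    out = b"".join(c.value for c in cells)
    provenance = sorted(set().union(*(c.labels for c in cells)))
    return out, provenance

inp = tainted_read("in.txt", b"abc")
out, prov = tainted_write([inp[0] + inp[2]])   # output derived from bytes 0 and 2 only
print(out)    # b'ac'
print(prov)   # [('in.txt', 0), ('in.txt', 2)]
```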
DataTracker Architecture
29
Evaluation: tackling the n×m problem
30
• DataTracker is able to track the actual use of the input data.
• Read data ≠ Use data.
• Eliminates false positives (→) present in other observed provenance capture methods.
Evaluation: vim
31
DataTracker attributes individual bytes of the output to the input.
Demo video: http://bit.ly/dtracker-demo
Can we do good enough?
• Can taint tracking
a. become an “always-on” feature?
b. be turned on for all running processes?
• What if we want to also run other analysis
code?
• Can we pre-determine the right analysis
code?
32
Re-execution
Common tactic in provenance:
• DB: Reenactment queries (Glavic ‘14)
• DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13),
DistTape (Zhao ‘12)
• Workflows: Pegasus (Groth ‘09)
• PL: Slicing (Perera ‘12)
• OS: pTrace (Guo ‘11)
• Desktop: Excel (Asuncion ‘11)
33
Record and Replay
34
Methodology
[Diagram: methodology stages: Execution Capture, Selection, Instrumentation, Provenance analysis]
35
Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ‘14).
• Based on the QEMU virtualization platform.
36
Prototype Implementation (2/3)
• PANDA logs self-contained execution traces.
– An initial RAM snapshot.
– Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
– Virtual device state is not logged → can’t “go-live”.
[Diagram: inputs, interrupts, and DMA into CPU/RAM are recorded as a PANDA execution trace: an initial RAM snapshot plus a non-determinism log]
37
Prototype Implementation (3/3)
• Analysis plugins
– Read-only access to the VM state.
– Invoked per instruction, memory access, context switch, etc.
– Can be combined to implement complex functionality.
– OSI Linux, PROV-Tracer, ProcStrMatch.
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
[Diagram: PANDA execution trace analyzed by plugins (Plugin A, B, C) over CPU/RAM state; results stored in a triple store]
38
[Diagram: PROV data model: Entity, Activity, Agent related by used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith, wasAttributedTo, actedOnBehalfOf, startedAtTime/endedAtTime (xsd:dateTime)]
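Once provenance sits in a triple store as PROV/RDF, lineage questions become SPARQL queries. A small sketch with rdflib and hypothetical file URIs, asking which files an output was ultimately derived from:

```python
from rdflib import Graph, Namespace

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")  # hypothetical URIs

g = Graph()
g.add((EX["report.pdf"], PROV.wasDerivedFrom, EX["clean.csv"]))
g.add((EX["clean.csv"], PROV.wasDerivedFrom, EX["raw.csv"]))

# Transitive derivation: everything report.pdf depends on, directly or not.
q = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?source WHERE {
    <http://example.org/report.pdf> prov:wasDerivedFrom+ ?source .
}
"""
for row in g.query(q):
    print(row.source)
# http://example.org/clean.csv
# http://example.org/raw.csv
```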
OS Introspection
• What processes are currently executing?
• Which libraries are used?
• What files are used?
• Possible approaches:
– Execute code inside the guest-OS.
– Reproduce guest-OS semantics purely from
the hardware state (RAM/registers).
39
The PROV-Tracer Plugin
• Registers for process creation/destruction
events.
• Decodes executed system calls.
• Keeps track of what files are used as
input/output by each process.
• Emits provenance in an intermediate
format when a process terminates.
40
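A rough sketch of the bookkeeping such a plugin performs, written as plain Python over a made-up event stream rather than as a PANDA plugin:

```python
from collections import defaultdict

# Hypothetical, simplified event stream: (event, pid, argument)
events = [
    ("exec",   101, "/usr/bin/sort"),
    ("open_r", 101, "names.txt"),
    ("open_w", 101, "sorted.txt"),
    ("exit",   101, None),
]

processes = defaultdict(lambda: {"exe": None, "used": set(), "generated": set()})

def emit_prov(pid, info):
    """Flush accumulated file usage for a terminated process as PROV-style statements."""
    act = f"process:{pid} ({info['exe']})"
    for f in sorted(info["used"]):
        print(f"{act} prov:used <{f}>")
    for f in sorted(info["generated"]):
        print(f"<{f}> prov:wasGeneratedBy {act}")

for event, pid, arg in events:
    if event == "exec":
        processes[pid]["exe"] = arg
    elif event == "open_r":
        processes[pid]["used"].add(arg)
    elif event == "open_w":
        processes[pid]["generated"].add(arg)
    elif event == "exit":
        emit_prov(pid, processes.pop(pid))
```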
More Analysis Plugins
• ProcStrMatch plugin.
– Which processes contained string S in their
memory?
• Other possible types of analysis:
– Taint tracking
– Dynamic slicing
41
Overhead (again) (1/2)
• QEMU incurs a 5x slowdown.
• PANDA recording imposes an additional
1.1x – 1.2x slowdown.
Virtualization is the dominant overhead
factor.
42
Overhead (again) (2/2)
• QEMU is a suboptimal virtualization
option.
• ReVirt – User Mode Linux (Dunlap ‘02)
– Slowdown: 1.08x rec. + 1.58x virt.
• ReTrace – VMWare (Xu ‘07)
– Slowdown: 1.05x-2.6x rec. + ??? virt.
Virtualization slowdown is considered
acceptable.
Recording overhead is fairly low. 43
Storage Requirements
• Storage requirements vary with the
workload.
• For PANDA (Dolan-Gavitt ‘14):
– 17-915 instructions per byte.
• In practice: O(10MB/min) uncompressed.
• Different approaches to reduce/manage
storage requirements.
– Compression, HD rotation, VM snapshots.
• 24/7 recording seems within the limits of
today’s technology. 44
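As a rough back-of-envelope check of that last point (an illustrative calculation, not from the slides): at about 10 MB per minute uncompressed, a full day of continuous recording is roughly 10 MB × 60 × 24 ≈ 14 GB before compression, which is comfortably within commodity disk capacity.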
Highlights
• Taint tracking analysis is a powerful method
for capturing provenance.
– Eliminates many false positives.
– Tackles the “n×m problem”.
• Decoupling provenance analysis from
execution is possible by the use of VM record
& replay.
• Execution traces can be used for post-hoc
provenance analysis.
45
DB DATA MUNGING
References
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth
TripleProv: Efficient Processing of Lineage Queries
over a Native RDF Store
World Wide Web Conference 2014
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth
Executing Provenance-Enabled Queries over Web
Data
World Wide Web Conference 2015
47
RDF is great for munging data
➢ Ability to arbitrarily add new
information (schemaless)
➢ Syntaxes make it easy to concatenate
new data
➢ Information has a well defined
structure
➢ Identifiers are distributed but
controlled
48
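A small illustration of why this helps munging (rdflib again; the two snippets and predicates are invented): independently produced RDF can simply be parsed into one graph, and new properties need no schema migration:

```python
from rdflib import Graph, Namespace, Literal

source_a = """
@prefix ex: <http://example.org/> .
ex:aspirin ex:label "Aspirin" .
"""
source_b = """
@prefix ex: <http://example.org/> .
ex:aspirin ex:targets ex:COX1 .
ex:COX1 ex:label "Prostaglandin G/H synthase 1" .
"""

g = Graph()
g.parse(data=source_a, format="turtle")   # "concatenating" sources is just parsing
g.parse(data=source_b, format="turtle")   # them into the same graph

# Adding a brand-new property needs no schema change.
EX = Namespace("http://example.org/")
g.add((EX.aspirin, EX.molecularWeight, Literal(180.16)))

print(len(g))  # 4 triples merged from two sources plus one new statement
```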
What’s the provenance of my query
result?
Qr
Graph-based Query
select ?lat ?long ?g1 ?g2 ?g3 ?g4
where {
graph ?g1 {?a [] "Eiffel Tower" . }
graph ?g2 {?a inCountry FR . }
graph ?g3 {?a lat ?lat . }
graph ?g4 {?a long ?long . }
}
lat long l1 l2 l4 l4,
lat long l1 l2 l4 l5,
lat long l1 l2 l5 l4,
lat long l1 l2 l5 l5,
lat long l1 l3 l4 l4,
lat long l1 l3 l4 l5,
lat long l1 l3 l5 l4,
lat long l1 l3 l5 l5,
lat long l2 l2 l4 l4,
lat long l2 l2 l4 l5,
lat long l2 l2 l5 l4,
lat long l2 l2 l5 l5,
lat long l2 l3 l4 l4,
lat long l2 l3 l4 l5,
lat long l2 l3 l5 l4,
lat long l2 l3 l5 l5,
lat long l3 l2 l4 l4,
lat long l3 l2 l4 l5,
lat long l3 l2 l5 l4,
lat long l3 l2 l5 l5,
lat long l3 l3 l4 l4,
lat long l3 l3 l4 l5,
lat long l3 l3 l5 l4,
lat long l3 l3 l5 l5,
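The blow-up above is just the cross product of the graphs that can satisfy each triple pattern: reading off the bindings, ?g1 ranges over three graphs and ?g2, ?g3, ?g4 over two each, so a single (lat, long) answer comes back with 3 × 2 × 2 × 2 = 24 distinct provenance combinations.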
Provenance Polynomials
➢ Ability to characterize the ways each source contributed
➢ Pinpoint the exact source of each result
➢ Trace back the list of sources and the way they were combined
to deliver a result
Polynomials Operators
➢ Union (⊕)
○ constraint or projection satisfied with multiple sources
l1 ⊕ l2 ⊕ l3
○ multiple entities satisfy a set of constraints or
projections
➢ Join (⊗)
○ sources joined to handle a constraint or a projection
○ OS and OO joins between few sets of constraints
(l1 ⊕ l2) ⊗ (l3 ⊕ l4)
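A minimal sketch of how the two operators compose, representing a polynomial as a set of alternative source combinations (illustrative only, not the TripleProv data structure):

```python
def source(label):
    """A single source: one alternative consisting of one lineage label."""
    return {frozenset([label])}

def union(p, q):
    """⊕ : either polynomial can supply the constraint."""
    return p | q

def join(p, q):
    """⊗ : a result needs one alternative from each side, combined."""
    return {a | b for a in p for b in q}

l1, l2, l3, l4 = (source(x) for x in ["l1", "l2", "l3", "l4"])

# (l1 ⊕ l2) ⊗ (l3 ⊕ l4)
poly = join(union(l1, l2), union(l3, l4))
print(sorted(sorted(alt) for alt in poly))
# [['l1', 'l3'], ['l1', 'l4'], ['l2', 'l3'], ['l2', 'l4']]
```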
Example Polynomial
select ?lat ?long where {
?a [] ``Eiffel Tower''.
?a inCountry FR .
?a lat ?lat .
?a long ?long .
}
(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ ( l6 ⊕ l7) ⊗ (l8 ⊕
System Architecture
Experiments
How expensive is it to trace
provenance?
What is the overhead on query
execution time?
Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): Crawled from the linked
open data cloud
○ Web Data Commons (WDC): RDFa, Microdata
extracted from common crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~110 million triples each; ~25GB each
Workloads
➢ 8 Queries defined for BTC
○ T. Neumann and G. Weikum. Scalable join processing on very large rdf
graphs. In Proceedings of the 2009 ACM SIGMOD International
Conference on Management of data, pages 627–640. ACM, 2009.
➢ Two additional queries with UNION and OPTIONAL
clauses
➢ 7 new queries for WDC
http://exascale.info/tripleprov
Results
Overhead of tracking provenance compared to
vanilla version of the system for BTC dataset
source-level co-located
source-level annotated
triple-level co-located
triple-level annotated
TripleProv: Query Execution
Pipeline
input: provenance-enabled query
➢ execute the provenance query
➢ optionally pre-materialize or co-locate data
➢ optionally rewrite the workload queries
➢ execute the workload queries
output: the workload query results, restricted to those which were derived
from data specified by the provenance query 59
Experiments
What is the most efficient query
execution strategy for provenance-
enabled queries?
60
Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): Crawled from the linked open data
cloud
○ Web Data Commons (WDC): RDFa, Microdata extracted from
common crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~40 million triples each; ~10GB each
➢ Added provenance-specific triples (184 for WDC and 360 for BTC), such
that the provenance queries do not modify the result sets of the workload
queries
61
Results for BTC
➢ Full Materialization: 44x faster
than the vanilla version of the
system
➢ Partial Materialization: 35x faster
➢ Pre-Filtering: 23x faster
➢ Adaptive Partial Materialization
executes a provenance query and
materializes data 475 times faster
than Full Materialization
➢ Query Rewriting and Post-Filtering
strategies perform
significantly slower
62
Data Analysis
➢ How many context values refer
to how many triples? How
selective are they?
➢ 6,819,826 unique context values
in the BTC dataset.
➢ The majority of the context
values are highly selective.
63
➢ average selectivity
○ 5.8 triples per context value
○ 2.3 molecules per context value
DECLARATIVE DATA
MUNGING (?)
64
References
Sara Magliacane, Philip Stutz, Paul Groth, Abraham
Bernstein
foxPSL: A Fast, Optimized and eXtended PSL
implementation
International Journal of Approximate Reasoning (2015)
65
Why logic?
- Concise & natural way to represent relations
- Declarative representation:
- Can reuse, extend, combine rules
- Experts can write rules
- First order logic:
- Can exploit symmetries to avoid duplicated
computation (e.g. lifted inference)
Let the reasoner munge the
data.
See the work of Sebastian Riedel and others on
pushing more NLP problems into the
reasoner.
http://cl.naist.jp/~kevinduh/z/acltutorialslides/matrix_acl2015tutorial.pdf
Statistical Relational Learning
● Several flavors:
o Markov Logic Networks,
o Bayesian Logic Programs
o Probabilistic Soft Logic (PSL) [Broecheler, Getoor,
UAI 2010]
● PSL has been successfully applied to:
o Entity resolution, Link prediction
o Ontology alignment, Knowledge graph
identification
o Computer vision, trust propagation, …
Probabilistic Soft Logic (PSL)
● Probabilistic logic with soft truth values ∈ [0,1]
friends(anna, bob)= 0.8
votes(anna, demo) = 0.99
● Weighted rules:
[weight = 0.7] friends(A,B) && votes(A,P) => votes(B,P)
● Inference as constrained convex minimization:
votes(bob, demo) = 0.8
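To illustrate what the soft truth values buy (a toy sketch of Łukasiewicz-style rule semantics, not the foxPSL implementation): the rule body is a soft conjunction, and a weighted rule penalizes how far the head falls below the body:

```python
def soft_and(*truths):
    """Łukasiewicz conjunction: max(0, sum - (n - 1))."""
    return max(0.0, sum(truths) - (len(truths) - 1))

def distance_to_satisfaction(body, head):
    """A rule body => head is satisfied when head >= body."""
    return max(0.0, body - head)

friends_anna_bob = 0.8
votes_anna_demo = 0.99
weight = 0.7

body = soft_and(friends_anna_bob, votes_anna_demo)      # 0.79
for votes_bob_demo in (0.0, 0.5, 0.8):
    penalty = weight * distance_to_satisfaction(body, votes_bob_demo)
    print(f"votes(bob, demo)={votes_bob_demo:.1f}  weighted penalty={penalty:.3f}")
# Inference picks truth values that minimize the total weighted penalty,
# which is why votes(bob, demo) ends up close to the body truth (≈0.8).
```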
FoxPSL: Fast Optimized eXtended PSL
[Diagram: foxPSL features: classes, existential quantification (∃), partially grounded rules, optimizations, a DSL (the foxPSL language)]
Experiments: comparison with ACO
SLURM cluster: 4 nodes, each with 2x10 cores and 128GB RAM
ACO = implementation of consensus optimization on
GraphLab used for grounded PSL
Conclusions
• Data munging is a central task
• Provenance is a requirement
• Now:
• Provenance by stealth (ack Carole Goble)
• Separate provenance analysis from
instrumentation.
• Future:
• The computer should do the work
Future Research
• Explore optimizations of taint tracking for
capturing provenance.
• Provenance analysis of real-world traces
(e.g. from rrshare.org).
• Tracking provenance across environments
• Traces/logs as central provenance
primitive
• Declarative data munging
73

Contenu connexe

Tendances

Hierarchical Temporal Memory for Real-time Anomaly Detection
Hierarchical Temporal Memory for Real-time Anomaly DetectionHierarchical Temporal Memory for Real-time Anomaly Detection
Hierarchical Temporal Memory for Real-time Anomaly DetectionIhor Bobak
 
Information and network security 11 cryptography and cryptanalysis
Information and network security 11 cryptography and cryptanalysisInformation and network security 11 cryptography and cryptanalysis
Information and network security 11 cryptography and cryptanalysisVaibhav Khanna
 
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo Perini
Deep Stream Dynamic Graph Analytics with Grapharis -  Massimo PeriniDeep Stream Dynamic Graph Analytics with Grapharis -  Massimo Perini
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo PeriniFlink Forward
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapKostas Tzoumas
 
A novel approach to prevent cache based side-channel attack in the cloud (1)
A novel approach to prevent cache based side-channel attack in the cloud (1)A novel approach to prevent cache based side-channel attack in the cloud (1)
A novel approach to prevent cache based side-channel attack in the cloud (1)mrigakshi goel
 
Virtual Flink Forward 2020: Cogynt: Flink without code - Samantha Chan, Aslam...
Virtual Flink Forward 2020: Cogynt: Flink without code - Samantha Chan, Aslam...Virtual Flink Forward 2020: Cogynt: Flink without code - Samantha Chan, Aslam...
Virtual Flink Forward 2020: Cogynt: Flink without code - Samantha Chan, Aslam...Flink Forward
 

Tendances (11)

Hierarchical Temporal Memory for Real-time Anomaly Detection
Hierarchical Temporal Memory for Real-time Anomaly DetectionHierarchical Temporal Memory for Real-time Anomaly Detection
Hierarchical Temporal Memory for Real-time Anomaly Detection
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
 
Deadlock
DeadlockDeadlock
Deadlock
 
OSCh8
OSCh8OSCh8
OSCh8
 
Information and network security 11 cryptography and cryptanalysis
Information and network security 11 cryptography and cryptanalysisInformation and network security 11 cryptography and cryptanalysis
Information and network security 11 cryptography and cryptanalysis
 
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo Perini
Deep Stream Dynamic Graph Analytics with Grapharis -  Massimo PeriniDeep Stream Dynamic Graph Analytics with Grapharis -  Massimo Perini
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo Perini
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
A novel approach to prevent cache based side-channel attack in the cloud (1)
A novel approach to prevent cache based side-channel attack in the cloud (1)A novel approach to prevent cache based side-channel attack in the cloud (1)
A novel approach to prevent cache based side-channel attack in the cloud (1)
 
Virtual Flink Forward 2020: Cogynt: Flink without code - Samantha Chan, Aslam...
Virtual Flink Forward 2020: Cogynt: Flink without code - Samantha Chan, Aslam...Virtual Flink Forward 2020: Cogynt: Flink without code - Samantha Chan, Aslam...
Virtual Flink Forward 2020: Cogynt: Flink without code - Samantha Chan, Aslam...
 
OS_Ch8
OS_Ch8OS_Ch8
OS_Ch8
 
CNS_poster12
CNS_poster12CNS_poster12
CNS_poster12
 

En vedette

Data Integration vs Transparency: Tackling the tension
Data Integration vs Transparency: Tackling the tensionData Integration vs Transparency: Tackling the tension
Data Integration vs Transparency: Tackling the tensionPaul Groth
 
Transparency in the Data Supply Chain
Transparency in the Data Supply ChainTransparency in the Data Supply Chain
Transparency in the Data Supply ChainPaul Groth
 
Altmetrics: painting a broader picture of impact
Altmetrics: painting a broader picture of impactAltmetrics: painting a broader picture of impact
Altmetrics: painting a broader picture of impactPaul Groth
 
"Don't Publish, Release" - Revisited
"Don't Publish, Release" - Revisited "Don't Publish, Release" - Revisited
"Don't Publish, Release" - Revisited Paul Groth
 
Telling your research story with (alt)metrics
Telling your research story with (alt)metricsTelling your research story with (alt)metrics
Telling your research story with (alt)metricsPaul Groth
 
Information architecture at Elsevier
Information architecture at ElsevierInformation architecture at Elsevier
Information architecture at ElsevierPaul Groth
 
Data for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchersData for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchersPaul Groth
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkPaul Groth
 
Sources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsSources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsPaul Groth
 
Structured Data & the Future of Educational Material
Structured Data & the Future of Educational MaterialStructured Data & the Future of Educational Material
Structured Data & the Future of Educational MaterialPaul Groth
 
Knowledge Graphs at Elsevier
Knowledge Graphs at ElsevierKnowledge Graphs at Elsevier
Knowledge Graphs at ElsevierPaul Groth
 
Open PHACTS API Walkthrough
Open PHACTS API WalkthroughOpen PHACTS API Walkthrough
Open PHACTS API WalkthroughPaul Groth
 
Search Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer CentreSearch Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer Centrejatin batra
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in sparkPeng Cheng
 
Knowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPediaKnowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPediaPaul Groth
 
Building Highly Flexible, High Performance Query Engines
Building Highly Flexible, High Performance Query EnginesBuilding Highly Flexible, High Performance Query Engines
Building Highly Flexible, High Performance Query EnginesMapR Technologies
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data LakeCaserta
 
Faculdade max planck_- virtude moral e ética
Faculdade max planck_- virtude moral e éticaFaculdade max planck_- virtude moral e ética
Faculdade max planck_- virtude moral e éticaDenise Juanilla Camargo
 

En vedette (20)

Data Integration vs Transparency: Tackling the tension
Data Integration vs Transparency: Tackling the tensionData Integration vs Transparency: Tackling the tension
Data Integration vs Transparency: Tackling the tension
 
Transparency in the Data Supply Chain
Transparency in the Data Supply ChainTransparency in the Data Supply Chain
Transparency in the Data Supply Chain
 
Altmetrics: painting a broader picture of impact
Altmetrics: painting a broader picture of impactAltmetrics: painting a broader picture of impact
Altmetrics: painting a broader picture of impact
 
"Don't Publish, Release" - Revisited
"Don't Publish, Release" - Revisited "Don't Publish, Release" - Revisited
"Don't Publish, Release" - Revisited
 
Telling your research story with (alt)metrics
Telling your research story with (alt)metricsTelling your research story with (alt)metrics
Telling your research story with (alt)metrics
 
Information architecture at Elsevier
Information architecture at ElsevierInformation architecture at Elsevier
Information architecture at Elsevier
 
Data for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchersData for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchers
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic Framework
 
Sources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsSources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization Systems
 
Structured Data & the Future of Educational Material
Structured Data & the Future of Educational MaterialStructured Data & the Future of Educational Material
Structured Data & the Future of Educational Material
 
Knowledge Graphs at Elsevier
Knowledge Graphs at ElsevierKnowledge Graphs at Elsevier
Knowledge Graphs at Elsevier
 
Open PHACTS API Walkthrough
Open PHACTS API WalkthroughOpen PHACTS API Walkthrough
Open PHACTS API Walkthrough
 
Search Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer CentreSearch Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer Centre
 
search engines
search enginessearch engines
search engines
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
 
Knowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPediaKnowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPedia
 
Building Highly Flexible, High Performance Query Engines
Building Highly Flexible, High Performance Query EnginesBuilding Highly Flexible, High Performance Query Engines
Building Highly Flexible, High Performance Query Engines
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Faculdade max planck_- virtude moral e ética
Faculdade max planck_- virtude moral e éticaFaculdade max planck_- virtude moral e ética
Faculdade max planck_- virtude moral e ética
 

Similaire à Provenance for Data Munging Environments

Finding Needles in Haystacks (The Size of Countries)
Finding Needles in Haystacks (The Size of Countries)Finding Needles in Haystacks (The Size of Countries)
Finding Needles in Haystacks (The Size of Countries)packetloop
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...Alluxio, Inc.
 
Performance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTracePerformance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTraceGraeme Jenkinson
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaDataStax Academy
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Guglielmo Iozzia
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...TigerGraph
 
Progressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationProgressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationPaul Groth
 
Developing applications with rules, workflow and event processing (it@cork 2010)
Developing applications with rules, workflow and event processing (it@cork 2010)Developing applications with rules, workflow and event processing (it@cork 2010)
Developing applications with rules, workflow and event processing (it@cork 2010)Geoffrey De Smet
 
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxData
 
Workshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityWorkshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityRaffael Marty
 
breed_python_tx_redacted
breed_python_tx_redactedbreed_python_tx_redacted
breed_python_tx_redactedRyan Breed
 
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxData
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAlbert Bifet
 

Similaire à Provenance for Data Munging Environments (20)

Finding Needles in Haystacks (The Size of Countries)
Finding Needles in Haystacks (The Size of Countries)Finding Needles in Haystacks (The Size of Countries)
Finding Needles in Haystacks (The Size of Countries)
 
Py tables
Py tablesPy tables
Py tables
 
PyTables
PyTablesPyTables
PyTables
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...
 
Performance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTracePerformance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTrace
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
 
Progressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationProgressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computation
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
 
Developing applications with rules, workflow and event processing (it@cork 2010)
Developing applications with rules, workflow and event processing (it@cork 2010)Developing applications with rules, workflow and event processing (it@cork 2010)
Developing applications with rules, workflow and event processing (it@cork 2010)
 
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
 
Workshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityWorkshop: Big Data Visualization for Security
Workshop: Big Data Visualization for Security
 
breed_python_tx_redacted
breed_python_tx_redactedbreed_python_tx_redacted
breed_python_tx_redacted
 
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
 
PyTables
PyTablesPyTables
PyTables
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 

Plus de Paul Groth

Data Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIData Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIPaul Groth
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningPaul Groth
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Paul Groth
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-cziPaul Groth
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph MaintenancePaul Groth
 
Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph FuturesPaul Groth
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph MaintenancePaul Groth
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper ProvenancePaul Groth
 
Thinking About the Making of Data
Thinking About the Making of DataThinking About the Making of Data
Thinking About the Making of DataPaul Groth
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text Paul Groth
 
From Data Search to Data Showcasing
From Data Search to Data ShowcasingFrom Data Search to Data Showcasing
From Data Search to Data ShowcasingPaul Groth
 
Elsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphElsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphPaul Groth
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for SciencePaul Groth
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?Paul Groth
 
Diversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domainsDiversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domainsPaul Groth
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsPaul Groth
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsPaul Groth
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chainPaul Groth
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicinePaul Groth
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataPaul Groth
 

Plus de Paul Groth (20)

Data Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIData Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AI
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learning
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph Maintenance
 
Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph Futures
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph Maintenance
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
 
Thinking About the Making of Data
Thinking About the Making of DataThinking About the Making of Data
Thinking About the Making of Data
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
 
From Data Search to Data Showcasing
From Data Search to Data ShowcasingFrom Data Search to Data Showcasing
From Data Search to Data Showcasing
 
Elsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphElsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge Graph
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?
 
Diversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domainsDiversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domains
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chain
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicine
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
 

Dernier

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Dernier (20)

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Provenance for Data Munging Environments

  • 1. Paul Groth Elsevier Labs @pgroth | pgroth.com Provenance for Data Munging Environments Information Sciences Institute – August 13, 2015
  • 2. Outline • What’s data munging and why it’s important? • The role of provenance • The reality…. • Desktop data munging & provenance • Database data munging & provenance • Declarative data munging (?)
  • 3.
  • 4. 60 % of time is spent on data preparation
  • 10. Solution: Tracking and exposing provenance* * a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data” The PROV Data Model (W3C Recommendation)
  • 12.
  • 13. What if you’re not a large organization?
  • 19. The model: Adventures in word2vec (3)
  • 20. The model: Adventures in word2vec (3) Look provenance informatio
  • 22. DESKTOP DATA MUNGING & PROVENANCE
  • 23. References Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Looking Inside the Black-Box: Capturing Data Provenance Using Dynamic Instrumentation. 5th International Provenance and Annotation Workshop (IPAW'14) Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Decoupling Provenance Capture and Analysis from Execution. 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15) 23
  • 24. Capturing Provenance Disclosed Provenance + Accuracy + High-level semantics – Intrusive – Manual Effort Observed Provenance – False positives – Semantic Gap + Non-intrusive + Minimal manual effort CPL (Macko ‘12) Trio (Widom ‘09) Wings (Gil ‘11) Taverna (Oinn ‘06) VisTrails (Fraire ‘06) ES3 (Frew ‘08) Trec (Vahdat ‘98) PASSv2 (Holland ‘08) DTrace Tool (Gessiou ‘12) 24
  • 25. Challenge • Can we capture provenance – with low false positive ratio? – without manual/obtrusive integration effort? • We have to rely on observed provenance. 25
  • 26. State of the art Application • Observed provenance systems treat programs as black- boxes. • Can’t tell if an input file was actually used. • Can’t quantify the influence of input to output. 26
  • 28. Our solution: DataTracker • Captures high-fidelity provenance using Taint Tracking. • Key building blocks: – libdft (Kemerlis ‘12) ➞ Reusable taint-tracking framework. – Intel Pin (Luk ‘05) ➞ Dynamic instrumentation framework. • Does not require modification of applications. • Does not require knowledge of application semantics. 28
  • 30. Evaluation: tackling the n×m problem 30 • DataTracker is able to track the actual use of the input data. • Read data ≠ Use data. • Eliminates false positives (---->) present in other observed provenance capture methods.
  • 31. Evaluation: vim 31 DataTracker attributes individual bytes of the output to the input. Demo video: http://bit.ly/dtracker-demo
  • 32. Can we do good enough? • Can taint tracking a. become an “always-on” feature? b. be turned on for all running processes? • What if we want to also run other analysis code? • Can we pre-determine the right analysis code? 32
  • 33. Re-execution Common tactic in provenance: • DB: Reenactment queries (Glavic ‘14) • DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13), DistTape (Zhao ‘12) • Workflows: Pegasus (Groth ‘09) • PL: Slicing (Perera ‘12) • OS: pTrace (Guo ‘11) • Desktop: Excel (Asuncion ‘11) 33
  • 36. Prototype Implementation • PANDA: an open- source Platform for Architecture-Neutral Dynamic Analysis. (Dolan-Gavitt ‘14) • Based on the QEMU virtualization platform. 36
  • 37. • PANDA logs self-contained execution traces. – An initial RAM snapshot. – Non-deterministic inputs. • Logging happens at virtual CPU I/O ports. – Virtual device state is not logged  can’t “go-live”. Prototype Implementation (2/3) PANDA CPU RAM Input Interrupt DMA Initial RAM Snapshot Non- determinism log RAM PANDA Execution Trace 37
  • 38. Prototype Implementation (3/3) • Analysis plugins – Read-only access to the VM state. – Invoked per instr., memory access, context switch, etc. – Can be combined to implement complex functionality. – OSI Linux, PROV-Tracer, ProcStrMatch. • Debian Linux guest. • Provenance stored PROV/RDF triples, queried with SPARQL. PANDA Executio n Trace PANDA Triple Store Plugin APlugin C Plugin B CPU RAM 38 used endedAtTime wasAssociatedWith actedOnBehalfOf wasGeneratedBy wasAttributedTo wasDerivedFrom wasInformedBy Activity Entity Agent xsd:dateTime startedAtTime xsd:dateTime
  • 39. OS Introspection • What processes are currently executing? • Which libraries are used? • What files are used? • Possible approaches: – Execute code inside the guest-OS. – Reproduce guest-OS semantics purely from the hardware state (RAM/registers). 39
  • 40. The PROV-Tracer Plugin • Registers for process creation/destruction events. • Decodes executed system calls. • Keeps track of what files are used as input/output by each process. • Emits provenance in an intermediate format when a process terminates. 40
  • 41. More Analysis Plugins • ProcStrMatch plugin. – Which processes contained string S in their memory? • Other possible types of analysis: – Taint tracking – Dynamic slicing 41
  • 42. Overhead (again) (1/2) • QEMU incurs a 5x slowdown. • PANDA recording imposes an additional 1.1x – 1.2x slowdown. Virtualization is the dominant overhead factor. 42
  • 43. Overhead (again) (2/2) • QEMU is a suboptimal virtualization option. • ReVirt – User Mode Linux (Dunlap ‘02) – Slowdown: 1.08x rec. + 1.58x virt. • ReTrace – VMWare (Xu ‘07) – Slowdown: 1.05x-2.6x rec. + ??? virt. Virtualization slowdown is considered acceptable. Recording overhead is fairly low. 43
  • 44. Storage Requirements • Storage requirements vary with the workload. • For PANDA (Dolan-Gavitt ‘14): – 17-915 instructions per byte. • In practice: O(10MB/min) uncompressed. • Different approaches to reduce/manage storage requirements. – Compression, HD rotation, VM snapshots. • 24/7 recording seems within limits of todays’ technology. 44
  • 45. Highlights • Taint tracking analysis is a powerful method for capturing provenance. – Eliminates many false positives. – Tackles the “n×m problem”. • Decoupling provenance analysis from execution is possible by the use of VM record & replay. • Execution traces can be used for post-hoc provenance analysis. 45
  • 47. References Marcin Wylot, Philip Cudré-Mauroux, Paul Groth TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store World Wide Web Conference 2014 Marcin Wylot, Philip Cudré-Mauroux, Paul Groth Executing Provenance-Enabled Queries over Web Data World Wide Web Conference 2015 47
  • 48. RDF is great for munging data ➢ Ability to arbitrarily add new information (schemaless) ➢ Syntaxes are easy to concatenate new data ➢ Information has a well defined structure ➢ Identifiers are distributed but controlled 48
  • 49. What’s the provenance of my query result? Qr
  • 50. Graph-based Query select ?lat ?long ?g1 ?g2 ?g3 ?g4 where { graph ?g1 {?a [] "Eiffel Tower" . } graph ?g2 {?a inCountry FR . } graph ?g3 {?a lat ?lat . } graph ?g4 {?a long ?long . } } lat long l1 l2 l4 l4, lat long l1 l2 l4 l5, lat long l1 l2 l5 l4, lat long l1 l2 l5 l5, lat long l1 l3 l4 l4, lat long l1 l3 l4 l5, lat long l1 l3 l5 l4, lat long l1 l3 l5 l5, lat long l2 l2 l4 l4, lat long l2 l2 l4 l5, lat long l2 l2 l5 l4, lat long l2 l2 l5 l5, lat long l2 l3 l4 l4, lat long l2 l3 l4 l5, lat long l2 l3 l5 l4, lat long l2 l3 l5 l5, lat long l3 l2 l4 l4, lat long l3 l2 l4 l5, lat long l3 l2 l5 l4, lat long l3 l2 l5 l5, lat long l3 l3 l4 l4, lat long l3 l3 l4 l5, lat long l3 l3 l5 l4, lat long l3 l3 l5 l5,
  • 51. Provenance Polynomials ➢ Ability to characterize the ways each source contributed ➢ Pinpoint the exact source of each result ➢ Trace back the list of sources and the way they were combined to deliver a result
  • 52. Polynomial Operators ➢ Union (⊕) ○ a constraint or projection satisfied with multiple sources: l1 ⊕ l2 ⊕ l3 ○ multiple entities satisfy a set of constraints or projections ➢ Join (⊗) ○ sources joined to handle a constraint or a projection ○ object-subject (OS) and object-object (OO) joins between sets of constraints: (l1 ⊕ l2) ⊗ (l3 ⊕ l4)
  • 53. Example Polynomial select ?lat ?long where { ?a [] "Eiffel Tower" . ?a inCountry FR . ?a lat ?lat . ?a long ?long . } (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
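A toy sketch of how such polynomials could be represented programmatically (an illustration of the notation only, not TripleProv's internal data structures):

class Source:
    def __init__(self, name): self.name = name
    def __repr__(self): return self.name

class Union:
    def __init__(self, *terms): self.terms = terms
    def __repr__(self): return "(" + " ⊕ ".join(map(repr, self.terms)) + ")"

class Join:
    def __init__(self, *terms): self.terms = terms
    def __repr__(self): return " ⊗ ".join(map(repr, self.terms))

l = {i: Source(f"l{i}") for i in range(1, 10)}
# Star query over ?a: each constraint/projection is a union of the sources
# that satisfied it; all constraints are joined on ?a.
poly = Join(Union(l[1], l[2], l[3]), Union(l[4], l[5]),
            Union(l[6], l[7]), Union(l[8], l[9]))
print(poly)   # (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)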
  • 55. Experiments How expensive is it to trace provenance? What is the overhead on query execution time?
  • 56. Datasets ➢ Two collections of RDF data gathered from the Web ○ Billion Triple Challenge (BTC): crawled from the linked open data cloud ○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl ➢ Typical collections gathered from multiple sources ➢ Sampled subsets of ~110 million triples each; ~25GB each
  • 57. Workloads ➢ 8 queries defined for BTC ○ T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009. ➢ Two additional queries with UNION and OPTIONAL clauses ➢ 7 new queries for WDC http://exascale.info/tripleprov
  • 58. Results Overhead of tracking provenance compared to the vanilla version of the system for the BTC dataset. [Chart legend: source-level co-located, source-level annotated, triple-level co-located, triple-level annotated]
  • 59. TripleProv: Query Execution Pipeline input: provenance-enabled query ➢ execute the provenance query ➢ optionally pre-materialize or co-locate data ➢ optionally rewrite the workload queries ➢ execute the workload queries output: the workload query results, restricted to those which were derived from data specified by the provenance query 59
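A rough sketch of this pipeline in code form (the store API, function names, and flags are hypothetical, not TripleProv's actual interface):

def run_provenance_enabled_query(store, provenance_query, workload_queries,
                                 materialize=False, rewrite=False):
    # 1. Execute the provenance query to find the admissible context values.
    contexts = store.execute(provenance_query)

    # 2. Optionally pre-materialize / co-locate the data for those contexts.
    scope = store.materialize(contexts) if materialize else store

    # 3. Optionally rewrite the workload queries using the provenance results.
    if rewrite:
        workload_queries = [q.restrict_to(contexts) for q in workload_queries]

    # 4. Execute the workload queries, keeping only results derived from
    #    data matching the provenance query.
    return [scope.execute(q, allowed_contexts=contexts) for q in workload_queries]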
  • 60. Experiments What is the most efficient query execution strategy for provenance-enabled queries? 60
  • 61. Datasets ➢ Two collections of RDF data gathered from the Web ○ Billion Triple Challenge (BTC): crawled from the linked open data cloud ○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl ➢ Typical collections gathered from multiple sources ➢ Sampled subsets of ~40 million triples each; ~10GB each ➢ Added provenance-specific triples (184 for WDC and 360 for BTC) so that the provenance queries do not modify the result sets of the workload queries 61
  • 62. Results for BTC ➢ Full Materialization: 44x faster than the vanilla version of the system ➢ Partial Materialization: 35x faster ➢ Pre-Filtering: 23x faster ➢ Adaptive Partial Materialization executes the provenance query and materializes data 475 times faster than Full Materialization ➢ Query Rewriting and Post-Filtering strategies perform significantly slower 62
  • 63. Data Analysis ➢ How many context values refer to how many triples? How selective are they? ➢ 6,819,826 unique context values in the BTC dataset. ➢ The majority of the context values are highly selective. ➢ average selectivity ○ 5.8 triples per context value ○ 2.3 molecules per context value 63
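A quick consistency check of these figures, assuming they refer to the ~40 million triple BTC sample used in these experiments:

contexts = 6_819_826
avg_triples_per_context = 5.8
print(contexts * avg_triples_per_context / 1e6)  # ≈ 39.6 million triples, consistent with the ~40M sample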
  • 65. References Sara Magliacane, Philip Stutz, Paul Groth, Abraham Bernstein foxPSL: A Fast, Optimized and eXtended PSL implementation International Journal of Approximate Reasoning (2015) 65
  • 66. Why logic? - Concise & natural way to represent relations - Declarative representation: - Can reuse, extend, combine rules - Experts can write rules - First order logic: - Can exploit symmetries to avoid duplicated computation (e.g. lifted inference)
  • 67. Let the reasoner munge the data. See, e.g., Sebastien Riedel's work on pushing more NLP problems into the reasoner. http://cl.naist.jp/~kevinduh/z/acltutorialslides/matrix_acl2015tutorial.pdf
  • 68. Statistical Relational Learning ● Several flavors: o Markov Logic Networks o Bayesian Logic Programs o Probabilistic Soft Logic (PSL) [Broecheler, Getoor, UAI 2010] ● PSL has been successfully applied to: o Entity resolution, link prediction o Ontology alignment, knowledge graph identification o Computer vision, trust propagation, …
  • 69. Probabilistic Soft Logic (PSL) ● Probabilistic logic with soft truth values ∈ [0,1] friends(anna, bob)= 0.8 votes(anna, demo) = 0.99 ● Weighted rules: [weight = 0.7] friends(A,B) && votes(A,P) => votes(B,P) ● Inference as constrained convex minimization: votes(bob, demo) = 0.8
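A simplified sketch of how a single weighted PSL rule is scored under Lukasiewicz logic (illustrative only, not the actual PSL implementation; the weight and truth values are the ones from the slide):

def lukasiewicz_and(a, b):
    return max(0.0, a + b - 1.0)

def distance_to_satisfaction(body_truth, head_truth):
    # A rule body => head is fully satisfied when head >= body.
    return max(0.0, body_truth - head_truth)

w = 0.7                       # rule weight
friends_anna_bob = 0.8
votes_anna_demo  = 0.99
votes_bob_demo   = 0.8        # the value inference assigns in the example

body = lukasiewicz_and(friends_anna_bob, votes_anna_demo)       # 0.79
penalty = w * distance_to_satisfaction(body, votes_bob_demo)    # 0.0
print(body, penalty)  # the rule is satisfied at votes(bob, demo) ≈ 0.8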
  • 70. FoxPSL: Fast, Optimized, eXtended PSL. [Feature diagram: classes, existential quantification (∃), partially grounded rules, optimizations, DSL: the foxPSL language]
  • 71. Experiments: comparison with ACO. SLURM cluster: 4 nodes, each with 2×10 cores and 128 GB RAM. ACO = implementation of consensus optimization on GraphLab, used for grounded PSL.
  • 72. Conclusions • Data munging is a central task • Provenance is a requirement • Now: • Provenance by stealth (ack Carole Goble) • Separate provenance analysis from instrumentation. • Future: • The computer should do the work
  • 73. Future Research • Explore optimizations of taint tracking for capturing provenance. • Provenance analysis of real-world traces (e.g. from rrshare.org). • Tracking provenance across environments • Traces/logs as central provenance primitive • Declarative data munging 73

Editor's notes

  1. NASA, A.40 Computational Modeling Algorithms and Cyberinfrastructure, tech. report, NASA, 19 Dec. 2011
  2. You use provenance enabled tools…
  3. Disclosed provenance methods require knowledge of application semantics and modification of the application. OTOH observed provenance methods usually have a high false positive ratio.
  4. Let’s look at a physical-world provenance problem. Geologists want to know the provenance of streams flowing out of the foothills of a mountain. To do so they pour dye on the suspected source of the stream. We can apply a similar method, called taint tracking, to find the provenance of data streams. Taint tracking allows us to examine the flow of data in what was previously a black box.
  5. We built a tool based on taint tracking to capture provenance. Our tool is called DataTracker and has two key building blocks.
  6. We evaluated DataTracker with some sample programs to show that it can tackle the nxm problem and eliminate false positives present in other observed provenance capture methods. The nxm problem is a major drawback of other observed provenance methods. In summary, it means that in the presence of n inputs and m outputs, the provenance graph will include nxm derivation edges.
  7. Decouple analysis from execution. Has been proposed for security analysis on mobile phones. (Paranoid Android, Portokalidis ‘10)
  8. Execution Capture: happens in real time Instrumentation: applied on the captured trace to generate provenance information Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries) Selection: a subset of the execution trace is selected – we start again with more intensive instrumentation
  9. We implemented our methodology using PANDA.
  10. PANDA is based on QEMU. Input includes both executed instructions and data. A RAM snapshot + the ND (non-determinism) log are enough to accurately replay the whole execution. The ND log consists of inputs to CPU/RAM; other device status is not logged, so we can replay but we cannot “go live” (i.e. resume execution)
  11. Note: Technically, plugins can modify VM state. However this will eventually crash the execution as the trace will be out of sync with the replay state. Plugins are implemented as dynamic libraries. We focus on the highlighted plugins in this presentation.
  12. Typical information that can be retrieved through VM introspection. In general, executing code inside the guest OS is complex. Moreover, in the case of PANDA we don’t have access to the state of devices. This makes injection and execution of new code even more complex and also more limited.
  13. QEMU is a good choice for prototyping, but overall a suboptimal virtualization option. Xu et al. do not give any numbers for virtualization slowdown. They (rightfully) consider it acceptable for most cases. 1.05x is for CPU-bound processing. 2.6x is for I/O-bound processing.
  14. A few dozen GB per day.
  15. Nowadays, as we integrate a myriad of datasets from the Web, we need a solution to: trace which pieces of data were combined, and how, to deliver the result (previous work); tailor the query execution process with information on data provenance, to filter the pieces of data used in processing a query (this work) ------------------ we have to deal with issues like ascertaining trust, establishing transparency, and estimating the cost of a query answer
  16. before moving to our way of dealing with it …. I’d like to have a look….. whether it couldn’t be done with some existing systems….. let’s try to use named graphs to store the source for each triple…. - we can load quads, the 4th element is taken as the named graph - we can even query it to retrieve some kind of provenance information…. in the picture, g1, g2, g3, g4 - named graphs, which we use to store the source of the data as a result we have a huge list of permuted elements, l - lineage, the source of the triples used to produce a particular entity - standard query results, enriched with named graphs - a simple list of concatenated sources - permutations of values bound to variables referring to data used to answer the query - no formal, compact representation of provenance - no detailed full-fledged provenance polynomials, and how would it be with TripleProv?? ……. voila….
  17. the question is: How to represent provenance information? it must fulfill three main conditions characterize ways each source contributed to the result pinpoint the exact sources to each result we need a capacity….. to trace back the list of sources and the way they were combined to deliver a result
  18. in our polynomials, we use two logical operators Union a constraint or projection is satisfied with multiple sources (same triple from multiple sources) multiple entities satisfy a set of constraints or projections (the answer is composed of multiple records) Join sources joined to handle a set of constraints or projections, joins based on subject… OS and OO joins between sets of constraints Let me now give you some examples…..
  19. As a first example we take a simple star query the polynomial shows that - the first constraint was satisfied with lineage l1, l2 or l3, => Union of multiple sources, the constraint was satisfied with triples from multiple sources - the second was satisfied with l4 or l5. - the first projection was processed with elements having a lineage of l6 or l7, - the second one was processed with elements from l8 or l9. All the triples involved were joined on variable ?a, which is expressed in the polynomial…..by the join operators
  20. TripleProv is built on top of a native RDF store named Diplodocus. It has a modular architecture containing 6 main subcomponents: the query executor, responsible for parsing the incoming query, rewriting the query plans, collecting and finally returning the results along with the provenance polynomials; the lexicographic tree, in charge of encoding URIs and literals into compact system identifiers and of translating them back; the type index, which clusters all keys based on their RDF types; RDF molecules, the main storage structure, which stores RDF data as very compact subgraphs along with the source for each piece of data in the molecule; and the molecule index, where for each key we store a list of molecules in which the key can be found.
  21. the main question in the database world is: how fast is it? we translate it to…... how expensive is it to trace provenance….. what is the overhead of tracking provenance
  22. Two subsets…. sampled from collections of RDF data gathered from the Web Billion Triple Challenge Web Data Commons Typical collections gathered from multiple sources tracking provenance for them seems to precisely address the problem we focus on: what is the provenance of a query answer in a dataset integrated from many sources
  23. as a workload for BTC we used - 8 queries from the work of Thomas Neumann, SIGMOD 2009 - two extra queries with UNION and OPTIONAL clauses for WDC we prepared 7 new queries; they represent different kinds of typical query patterns, including star queries with up to 5 joins, object-object joins, object-subject joins, and triangular joins all of them are available on the project web page,
  24. now we can have a quick look at the performance on the picture you can see the overhead over the vanilla version of the system (w/o provenance) for the BTC dataset horizontal axis: queries vertical axis: overhead you can see results for 4 variants of the system, those are permutations of granularity levels and storage models -------------------------------------------------------------------------------------------- Overall, the performance penalty created by tracking provenance ranges from a few percent to almost 350%. we observe a significant difference between the two storage models implemented - retrieving data from co-located structures takes about 10%-20% more time than from simply annotated graph nodes, caused by the additional look-ups and loops that have to be considered when reading from extra physical data containers We also notice a difference between the two granularity levels: the more detailed triple level requires more time
  25. such a simple post-execution join would of course result in poor performance, in our methods the query execution process can vary depending on the exact strategy typically we start by executing the blue provenance query and optionally pre-materializing or co-locating data; the green workload queries are then optionally rewritten….. by taking into account the results of the provenance query and finally they get executed The process returns as output the workload query results, restricted to those which follow the specification expressed in the provenance query
  26. the main question in the database world is: how fast is it? in our case we will try to answer the question, what is the most efficient query execution strategy for provenance-enabled queries?
  27. for our experiments, we used…. Two subsets sampled from collections of RDF data gathered from the Web Billion Triple Challenge Web Data Commons those are… typical collections gathered from multiple sources executing provenance-enabled queries for them seems to precisely address the problem we focus on. our goal is to fairly compare our provenance-aware query execution strategies and the vanilla version of the system, that's why... for the datasets we added some triples so that the provenance queries do not change the results of workload queries
  28. overall… Full Materialization: 44x faster than the vanilla version of the system Partial Materialization: 35x faster Pre-Filtering: 23x faster The advantage of the Partial Materialization strategy over the Full Materialization strategy… is that for the Partial Materialization, the time to execute a provenance query and materialize data is 475 times lower. it’s basically faster to prepare data for executing workload queries Query Rewriting and Post-Filtering strategies perform significantly slower
  29. to better understand the influence of provenance queries on performance, and to find the reason for such a performance gain over the pure triplestore, we analysed the BTC dataset and the provenance distribution the figure shows how many context values refer to how many triples we found that there are only a handful of context values that are widespread (left-hand side of the figure) and that the vast majority of the context values are highly selective (right-hand side of the figure) we leveraged those properties during query execution: our strategies prune molecules early based on their context values