Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e., its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
Provenance for Data Munging Environments
1. Paul Groth
Elsevier Labs
@pgroth | pgroth.com
Provenance for
Data Munging Environments
Information Sciences Institute – August 13, 2015
2. Outline
• What is data munging and why is it important?
• The role of provenance
• The reality….
• Desktop data munging & provenance
• Database data munging & provenance
• Declarative data munging (?)
10. Solution: Tracking and exposing provenance*
* “a record that describes the people, institutions,
entities, and activities involved in producing,
influencing, or delivering a piece of data”
The PROV Data Model
(W3C Recommendation)
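The PROV style of record can be sketched with plain triples; a minimal illustration in which the relation names follow the PROV data model, while the file names and the `derivation_chain` helper are hypothetical:

```python
# Minimal sketch of PROV-style statements as (subject, relation, object)
# triples. Relation names come from the W3C PROV data model; the example
# resources (clean.csv, raw.csv, ...) are hypothetical.

triples = [
    ("clean.csv", "wasGeneratedBy", "munge-run-1"),
    ("munge-run-1", "used", "raw.csv"),
    ("munge-run-1", "wasAssociatedWith", "alice"),
    ("clean.csv", "wasDerivedFrom", "raw.csv"),
    ("raw.csv", "wasDerivedFrom", "web-dump.json"),
]

def derivation_chain(entity, triples):
    """Walk wasDerivedFrom edges to list an entity's ancestry."""
    chain = []
    current = entity
    while True:
        parents = [o for s, r, o in triples
                   if s == current and r == "wasDerivedFrom"]
        if not parents:
            break
        current = parents[0]
        chain.append(current)
    return chain

print(derivation_chain("clean.csv", triples))  # ['raw.csv', 'web-dump.json']
```

Given such a record, answering "where did this cleaned dataset come from?" becomes a graph traversal rather than guesswork.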
23. References
Manolis Stamatogiannakis, Paul Groth, Herbert Bos.
Looking Inside the Black-Box: Capturing Data
Provenance Using Dynamic Instrumentation.
5th International Provenance and Annotation Workshop
(IPAW'14)
Manolis Stamatogiannakis, Paul Groth, Herbert Bos.
Decoupling Provenance Capture and Analysis from
Execution.
7th USENIX Workshop on the Theory and Practice of
Provenance (TaPP'15)
25. Challenge
• Can we capture provenance
– with a low false-positive ratio?
– without manual/obtrusive integration effort?
• We have to rely on observed provenance.
26. State of the art
• Observed provenance systems treat programs as black boxes.
• Can’t tell if an input file was actually used.
• Can’t quantify the influence of input to output.
30. Evaluation: tackling the n×m problem
• DataTracker is able to track the actual use of the input data.
• Read data ≠ Use data.
• Eliminates false positives present in other observed provenance capture methods.
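The read-vs-use distinction can be illustrated with a toy taint sketch. This is hypothetical code, not DataTracker's actual instruction-level implementation; file names and helper functions are invented for illustration:

```python
# Toy sketch of taint tracking: each value carries the set of input files
# whose bytes actually influenced it. Merely reading a file does not taint
# the output -- only using its data does. DataTracker does this at the
# machine-instruction level via dynamic instrumentation.

def read_file(name, contents, taints):
    taints[name] = {name}          # bytes from this file carry its label
    return contents

def combine(a, b, ta, tb):
    """Using two values propagates the union of their taints."""
    return a + b, ta | tb

taints = {}
config = read_file("config.ini", "opts", taints)   # read but never used
left   = read_file("a.csv", "1,2", taints)
right  = read_file("b.csv", "3,4", taints)

out, out_taint = combine(left, right, taints["a.csv"], taints["b.csv"])

# A black-box observer would report all three files as provenance;
# taint tracking reports only the files that actually flowed to the output.
print(sorted(out_taint))  # ['a.csv', 'b.csv'] -- config.ini is excluded
```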
32. Can we do well enough?
• Can taint tracking
a. become an “always-on” feature?
b. be turned on for all running processes?
• What if we want to also run other analysis
code?
• Can we pre-determine the right analysis
code?
36. Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ‘14).
• Based on the QEMU virtualization platform.
37. • PANDA logs self-contained execution traces.
– An initial RAM snapshot.
– Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
– Virtual device state is not logged, so we can’t “go live”.
Prototype Implementation (2/3)
[Figure: a PANDA execution trace consists of an initial RAM snapshot plus a non-determinism log of CPU inputs (I/O, interrupts, DMA)]
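The record & replay idea can be sketched in a few lines: log the non-deterministic inputs once, then feed the log back to reproduce the execution deterministically. A sketch of the concept only, not PANDA's mechanism; the toy "execution" and its random inputs are hypothetical:

```python
import random

# Sketch of record & replay: the only non-determinism in this toy execution
# is a stream of random numbers, standing in for interrupts, I/O and DMA
# inputs. Recording logs them; replay feeds the log back in.

def execution(next_input):
    total = 0
    for _ in range(5):
        total += next_input()
    return total

def record():
    log = []
    def src():
        v = random.randint(0, 99)   # non-deterministic input...
        log.append(v)               # ...captured in the ND log
        return v
    return execution(src), log

def replay(log):
    it = iter(log)
    return execution(lambda: next(it))  # deterministic re-execution

result, nd_log = record()
assert replay(nd_log) == result        # replay reproduces the run exactly
```

The payoff is the decoupling the slides describe: any number of (expensive) analyses can later be run over the same recorded trace.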
38. Prototype Implementation (3/3)
• Analysis plugins
– Read-only access to the VM state.
– Invoked per instr., memory access, context switch, etc.
– Can be combined to implement complex functionality.
– OSI Linux, PROV-Tracer, ProcStrMatch.
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
[Figure: analysis plugins A, B, C read CPU/RAM state from the PANDA execution trace and write provenance to a triple store]
[Figure: the PROV core model — Entity, Activity, and Agent nodes linked by used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAttributedTo, wasAssociatedWith, and actedOnBehalfOf; activities carry startedAtTime/endedAtTime as xsd:dateTime]
39. OS Introspection
• What processes are currently executing?
• Which libraries are used?
• What files are used?
• Possible approaches:
– Execute code inside the guest-OS.
– Reproduce guest-OS semantics purely from
the hardware state (RAM/registers).
40. The PROV-Tracer Plugin
• Registers for process creation/destruction
events.
• Decodes executed system calls.
• Keeps track of what files are used as
input/output by each process.
• Emits provenance in an intermediate
format when a process terminates.
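The plugin's bookkeeping can be sketched as a small state machine over process events. The event names and `ProvTracker` class here are hypothetical; the real plugin decodes guest system calls:

```python
from collections import defaultdict

# Sketch of PROV-Tracer-style bookkeeping: watch process events, remember
# which files each process reads/writes, and emit provenance records when
# the process terminates. Event names are invented for illustration.

class ProvTracker:
    def __init__(self):
        self.inputs = defaultdict(set)
        self.outputs = defaultdict(set)
        self.records = []

    def on_read(self, pid, path):
        self.inputs[pid].add(path)

    def on_write(self, pid, path):
        self.outputs[pid].add(path)

    def on_exit(self, pid):
        # Emit PROV-style statements for the finished process.
        for f in sorted(self.inputs[pid]):
            self.records.append((f"proc:{pid}", "used", f))
        for f in sorted(self.outputs[pid]):
            self.records.append((f, "wasGeneratedBy", f"proc:{pid}"))

t = ProvTracker()
t.on_read(42, "in.csv")
t.on_write(42, "out.csv")
t.on_exit(42)
print(t.records)
# [('proc:42', 'used', 'in.csv'), ('out.csv', 'wasGeneratedBy', 'proc:42')]
```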
41. More Analysis Plugins
• ProcStrMatch plugin.
– Which processes contained string S in their
memory?
• Other possible types of analysis:
– Taint tracking
– Dynamic slicing
42. Overhead (again) (1/2)
• QEMU incurs a 5x slowdown.
• PANDA recording imposes an additional
1.1x – 1.2x slowdown.
Virtualization is the dominant overhead factor.
43. Overhead (again) (2/2)
• QEMU is a suboptimal virtualization
option.
• ReVirt – User Mode Linux (Dunlap ‘02)
– Slowdown: 1.08x rec. + 1.58x virt.
• ReTrace – VMWare (Xu ‘07)
– Slowdown: 1.05x-2.6x rec. + ??? virt.
Virtualization slowdown is considered acceptable.
Recording overhead is fairly low.
44. Storage Requirements
• Storage requirements vary with the
workload.
• For PANDA (Dolan-Gavitt ‘14):
– 17-915 instructions per byte.
• In practice: O(10MB/min) uncompressed.
• Different approaches to reduce/manage
storage requirements.
– Compression, HD rotation, VM snapshots.
• 24/7 recording seems within the limits of today’s technology.
45. Highlights
• Taint tracking analysis is a powerful method
for capturing provenance.
– Eliminates many false positives.
– Tackles the “n×m problem”.
• Decoupling provenance analysis from
execution is possible by the use of VM record
& replay.
• Execution traces can be used for post-hoc
provenance analysis.
47. References
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth
TripleProv: Efficient Processing of Lineage Queries
over a Native RDF Store
World Wide Web Conference 2014
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth
Executing Provenance-Enabled Queries over Web
Data
World Wide Web Conference 2015
48. RDF is great for munging data
➢ Ability to arbitrarily add new
information (schemaless)
➢ Syntaxes make it easy to concatenate
new data
➢ Information has a well defined
structure
➢ Identifiers are distributed but
controlled
50. Graph-based Query
select ?lat ?long ?g1 ?g2 ?g3 ?g4
where {
graph ?g1 {?a [] "Eiffel Tower" . }
graph ?g2 {?a inCountry FR . }
graph ?g3 {?a lat ?lat . }
graph ?g4 {?a long ?long . }
}
lat long l1 l2 l4 l4,
lat long l1 l2 l4 l5,
…
lat long l3 l3 l5 l5
(24 rows in total: every permutation of ?g1 ∈ {l1, l2, l3}, ?g2 ∈ {l2, l3}, ?g3 ∈ {l4, l5}, ?g4 ∈ {l4, l5})
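The blow-up is just a cross-product of the per-pattern lineage sets, which can be reproduced directly (lineage labels taken from the slide's example):

```python
from itertools import product

# The named-graph query returns one row per combination of graphs that can
# satisfy the four patterns -- a cross-product blow-up rather than a
# compact provenance description.

g1 = ["l1", "l2", "l3"]   # graphs matching the "Eiffel Tower" pattern
g2 = ["l2", "l3"]         # graphs matching inCountry FR
g3 = ["l4", "l5"]         # graphs providing lat
g4 = ["l4", "l5"]         # graphs providing long

rows = list(product(g1, g2, g3, g4))
print(len(rows))  # 24 permuted rows for a single (lat, long) answer
```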
51. Provenance Polynomials
➢ Ability to characterize ways each source contributed
➢ Pinpoint the exact source to each result
➢ Trace back the list of sources and the way they were combined
to deliver a result
52. Polynomial Operators
➢ Union (⊕)
○ constraint or projection satisfied with multiple sources
l1 ⊕ l2 ⊕ l3
○ multiple entities satisfy a set of constraints or
projections
➢ Join (⊗)
○ sources joined to handle a constraint or a projection
○ object–subject (OS) and object–object (OO) joins between sets of constraints
(l1 ⊕ l2) ⊗ (l3 ⊕ l4)
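A minimal sketch of the two operators, representing a polynomial as a set of alternatives, each a set of jointly-used sources (a simplification of TripleProv's actual implementation; `source`, `union`, and `join` are illustrative helpers):

```python
# Sketch of provenance polynomials in disjunctive form: a polynomial is a
# set of alternatives, each alternative a frozenset of jointly-used sources.
# Simplified relative to TripleProv; lineage names are from the slides.

def source(l):
    return {frozenset([l])}

def union(p, q):          # "⊕": either polynomial explains the result
    return p | q

def join(p, q):           # "⊗": sources from both are combined
    return {a | b for a in p for b in q}

# (l1 ⊕ l2) ⊗ (l3 ⊕ l4) from the slide:
poly = join(union(source("l1"), source("l2")),
            union(source("l3"), source("l4")))

print(sorted(sorted(s) for s in poly))
# [['l1', 'l3'], ['l1', 'l4'], ['l2', 'l3'], ['l2', 'l4']]
```

Unlike the named-graph permutation dump above, the polynomial keeps the structure: which sources are alternatives and which were combined.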
56. Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): Crawled from the linked
open data cloud
○ Web Data Commons (WDC): RDFa, Microdata
extracted from common crawl
➢ Typical collections gathered from multiple sources
➢ sampled subsets of ~110 million triples each; ~25GB each
57. Workloads
➢ 8 Queries defined for BTC
○ T. Neumann and G. Weikum. Scalable join processing on very large rdf
graphs. In Proceedings of the 2009 ACM SIGMOD International
Conference on Management of data, pages 627–640. ACM, 2009.
➢ Two additional queries with UNION and OPTIONAL
clauses
➢ 7 new queries for WDC
http://exascale.info/tripleprov
58. Results
Overhead of tracking provenance compared to
vanilla version of the system for BTC dataset
Four variants shown: source-level co-located, source-level annotated, triple-level co-located, triple-level annotated.
59. TripleProv: Query Execution Pipeline
input: provenance-enabled query
➢ execute the provenance query
➢ optionally pre-materialize or co-locate data
➢ optionally rewrite the workload queries
➢ execute the workload queries
output: the workload query results, restricted to those which were derived
from data specified by the provenance query
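A toy end-to-end version of that pipeline, with hypothetical data and a deliberately simplified "strategy" (both queries are stand-ins for real SPARQL):

```python
# Toy version of the provenance-enabled pipeline: (1) the provenance query
# selects acceptable sources, (2) the workload query is executed only over
# quads from those sources. Data, predicates, and sources are hypothetical.

quads = [  # (subject, predicate, object, source)
    ("eiffel", "lat", "48.858", "dbpedia"),
    ("eiffel", "lat", "48.9",   "randomblog"),
    ("eiffel", "long", "2.294", "dbpedia"),
]

def provenance_query(quads):
    # e.g. "only data from dbpedia" -- stands in for a real SPARQL query
    return {src for _, _, _, src in quads if src == "dbpedia"}

def workload_query(quads, allowed):
    return [(s, p, o) for s, p, o, src in quads if src in allowed]

allowed = provenance_query(quads)         # step 1: execute provenance query
results = workload_query(quads, allowed)  # step 2: restricted workload query
print(results)
# [('eiffel', 'lat', '48.858'), ('eiffel', 'long', '2.294')]
```

The strategies on the following slides differ in *where* this restriction is applied: pre-materialized, pre-filtered, rewritten into the query, or post-filtered.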
60. Experiments
What is the most efficient query execution strategy for provenance-enabled queries?
61. Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): Crawled from the linked open data
cloud
○ Web Data Commons (WDC): RDFa, Microdata extracted from
common crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~40 million triples each; ~10GB each
➢ Added provenance-specific triples (184 for WDC and 360 for BTC) so that
the provenance queries do not modify the result sets of the workload
queries
62. Results for BTC
➢ Full Materialization: 44x faster
than the vanilla version of the
system
➢ Partial Materialization: 35x faster
➢ Pre-Filtering: 23x faster
➢ Adaptive Partial Materialization
executes a provenance query and
materializes data 475 times faster
than Full Materialization
➢ Query Rewriting and Post-Filtering
strategies perform significantly slower
63. Data Analysis
➢ How many context values refer
to how many triples? How
selective are they?
➢ 6,819,826 unique context values
in the BTC dataset.
➢ The majority of the context
values are highly selective.
➢ average selectivity
○ 5.8 triples per context value
○ 2.3 molecules per context value
65. References
Sara Magliacane, Philip Stutz, Paul Groth, Abraham
Bernstein
foxPSL: A Fast, Optimized and eXtended PSL
implementation
International Journal of Approximate Reasoning (2015)
66. Why logic?
- Concise & natural way to represent relations
- Declarative representation:
- Can reuse, extend, combine rules
- Experts can write rules
- First order logic:
- Can exploit symmetries to avoid duplicated
computation (e.g. lifted inference)
67. Let the reasoner munge the
data.
See the work of Sebastian Riedel and others toward
pushing more NLP problems into the reasoner.
http://cl.naist.jp/~kevinduh/z/acltutorialslides/matrix_acl2015tutorial.pdf
68. Statistical Relational Learning
● Several flavors:
o Markov Logic Networks,
o Bayesian Logic Programs
o Probabilistic Soft Logic (PSL) [Broecheler, Getoor,
UAI 2010]
● PSL has been successfully applied to:
o Entity resolution, Link prediction
o Ontology alignment, Knowledge graph
identification
o Computer vision, trust propagation, …
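PSL relaxes boolean logic to [0, 1] truth values using the Łukasiewicz t-norms, and a rule's "distance to satisfaction" becomes a hinge function the solver minimizes. A sketch of that standard semantics (the example predicates are hypothetical):

```python
# Lukasiewicz relaxation used by PSL: truth values live in [0, 1].
def l_and(a, b):  return max(0.0, a + b - 1.0)   # soft conjunction
def l_or(a, b):   return min(1.0, a + b)         # soft disjunction
def l_not(a):     return 1.0 - a                 # soft negation

def distance_to_satisfaction(body, head):
    # A rule body -> head is satisfied when truth(head) >= truth(body);
    # otherwise the violation grows linearly with the gap.
    return max(0.0, body - head)

# e.g. similarName(a,b) AND similarAddress(a,b) -> samePerson(a,b)
body = l_and(0.9, 0.8)   # soft truth of the rule body
print(round(distance_to_satisfaction(body, 0.4), 6))  # 0.3, penalized by the solver
```

Because these distances are convex, inference in PSL reduces to efficient continuous optimization, which is what makes implementations like foxPSL fast.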
70. FoxPSL: Fast Optimized eXtended PSL
[Figure: the FoxPSL DSL extends PSL with classes, partially grounded rules, and optimizations]
71. Experiments: comparison with ACO
SLURM cluster: 4 nodes, each with 2x10 cores and 128GB RAM
ACO = implementation of consensus optimization on
GraphLab used for grounded PSL
72. Conclusions
• Data munging is a central task
• Provenance is a requirement
• Now:
• Provenance by stealth (ack Carole Goble)
• Separate provenance analysis from
instrumentation.
• Future:
• The computer should do the work
73. Future Research
• Explore optimizations of taint tracking for
capturing provenance.
• Provenance analysis of real-world traces
(e.g. from rrshare.org).
• Tracking provenance across environments
• Traces/logs as central provenance
primitive
• Declarative data munging
Disclosed provenance methods require knowledge of application semantics and modification of the application.
On the other hand, observed provenance methods usually have a high false-positive ratio.
Let’s look at a physical-world provenance problem.
Geologists want to know the provenance of streams flowing out of the foothills of a mountain. To do so they pour dye on the suspected source of the stream.
We can apply a similar method, called taint tracking, to find the provenance of data streams.
Taint tracking allows us to examine the flow of data in what was previously a black box.
We built a tool based on taint tracking to capture provenance. Our tool is called DataTracker and has two key building blocks.
We evaluated DataTracker with some sample programs to show that it can tackle the n×m problem and eliminate false positives present in other observed provenance capture methods.
The n×m problem is a major drawback of other observed provenance methods. In summary, it means that with n inputs and m outputs, the provenance graph will include n×m derivation edges.
Decouple analysis from execution.
Has been proposed for security analysis on mobile phones. (Paranoid Android, Portokalidis ‘10)
Execution Capture: happens realtime
Instrumentation: applied on the captured trace to generate provenance information
Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries)
Selection: a subset of the execution trace is selected – we start again with more intensive instrumentation
We implemented our methodology using PANDA.
PANDA is based on QEMU.
Input includes both executed instructions and data.
RAM snapshot + ND log are enough to accurately replay the whole execution.
The ND log consists of inputs to CPU/RAM; other device state is not logged, so we can replay but we cannot “go live” (i.e., resume execution).
Note: Technically, plugins can modify VM state. However this will eventually crash the execution as the trace will be out of sync with the replay state.
Plugins are implemented as dynamic libraries.
We focus on the highlighted plugins in this presentation.
Typical information that can be retrieved through VM introspections.
In general, executing code inside the guest OS is complex.
Moreover, in the case of PANDA we don’t have access to the state of devices. This makes injection and execution of new code even more complex and also more limited.
QEMU is a good choice for prototyping, but overall suboptimal as a virtualization option.
Xu et al. do not give any numbers for virtualization slowdown. They (rightfully) consider it acceptable for most cases.
1.05x is for CPU bound processing. 2.6x is for I/O bound processing.
A few dozens of GBs per day.
nowadays…. as we integrate a myriad of datasets from the Web
we need a solution:
trace which pieces of data were combined, and how, to deliver the result (previous work)
tailor query execution process with information on data provenance, to filter pieces of data used in processing a query (this work)
------------------
we have to deal with issues like: ascertaining trust, establishing transparency, and estimating costs of a query answer
before moving to our way of dealing with it, I’d like to have a look at whether it could be done with some existing systems
let’s try using named graphs to store the source for each triple
- we can load quads; the 4th element is taken as the named graph
- we can even query it to retrieve some kind of provenance information
on the picture,
g1, g2, g3, g4 - named graphs we use to store the source of data
as a result we have a huge list of permuted elements,
l - lineage, the source of the triples used to produce a particular entity
- standard query results, enriched with named graphs
- simple list of concatenated sources
- permutations of values bound to variables referring to data used to answer the query
- no formal, compact representation of provenance
- no detailed full-fledged provenance polynomials,
and how would it be with TripleProv? Voilà…
the question is: How to represent provenance information?
it must fulfill three main conditions
characterize ways each source contributed to the result
pinpoint the exact sources to each result
we need the capacity to trace back the list of sources and the way they were combined to deliver a result
in our polynomials, we use two logical operators
Union
constraint or projection is satisfied with multiple sources (same triple from multiple sources)
multiple entities satisfy a set of constraints or projections (the answer is composed of multiple records)
Join
sources joined to handle a set of constraints or projections; joins based on subject…
OS and OO joins between few sets of constraints
Let me now give you some examples…..
As a first example we take a simple star query
the polynomial shows that
- the first constraint was satisfied with lineage l1, l2 or l3, => Union of multiple sources, the constraint was satisfied with triples from multiple sources
- the second was satisfied with l4 or l5.
- the first projection was processed with elements having a lineage of l6 or l7,
- the second one was processed with elements from l8 or l9.
All the triples involved were joined on variable ?a, which is expressed in the polynomial…..by the join operators
TripleProv is built on top of a native RDF store named Diplodocus,
it has a modular architecture
containing 6 main subcomponents
query executor responsible for parsing the incoming query, rewriting the query plans, collecting and finally returning the results along with the provenance polynomials
lexicographic tree in charge of encoding URIs and literals into compact system identifiers and of translating them back;
type index clusters all keys based on their RDF types;
RDF molecules the main storing structure, it stores RDF data as very compact subgraphs, along with the source for each piece of data
in molecule index for each key we store a list of molecules where the key can be found.
the main question in the database world is: how fast is it?
we transfer it to…...
how expensive it is to trace provenance…..
what is the overhead of tracking provenance
Two subsets…. sampled from collections of RDF data gathered from the Web
Billion Triple Challenge
Web Data Commons
Typical collections gathered from multiple sources
tracking provenance for them seems to precisely address the problem we focus on:
what is the provenance of a query answer in a dataset integrated from many sources
as a workload for BTC we used
- 8 Queries from the work of Thomas Neumann, SIGMOD 2009
- two extra queries with UNION and OPTIONAL clauses
for WDC we prepared 7 various queries
they represent different kinds of typical query patterns including
star-queries up to 5 joins,
object-object joins,
object-subject joins,
and triangular joins
all of them are available on the project web page,
now we can have a quick look at the performance
on the picture you can see the overhead over the vanilla version of the system (w/o provenance) for BTC dataset
horizontal axis: queries
vertical axis: overhead
you can see results for 4 variants of the system; those are permutations of granularity levels and storage models
--------------------------------------------------------------------------------------------
Overall, the performance penalty created by tracking provenance ranges from a few percent to almost 350%.
we observe a significant difference between the two storage models implemented
-retrieving data from co-located structures takes about 10%-20% more time than from simply annotated graph nodes
caused by the additional look-ups and loops that have to be considered when reading from extra physical data containers
We also notice difference between the two granularity levels.
more detailed triple-level requires more time
such a simple post-execution join would of course result in poor performance,
in our methods the query execution process can vary depending on the exact strategy
typically we start by executing the blue provenance query and optionally pre-materializing or co-locating data;
the green workload queries are then optionally rewritten…..
by taking into account results of the provenance query
and finally they get executed
The process returns as an output the workload query results, restricted to those which are following the specification expressed in the provenance query
the main question in the database world is: how fast is it?
in our case we will try to answer the question,
what is the most efficient query execution strategy for provenance-enabled queries?
for our experiments, we used….
Two subsets sampled from collections of RDF data gathered from the Web
Billion Triple Challenge
Web Data Commons
those are… typical collections gathered from multiple sources
executing provenance-enabled queries for them seems to precisely address the problem we focus on:
our goal is to fairly compare our provenance aware query execution strategies and the vanilla version of the system, that's why...
for the datasets we added some triples so that the provenance queries do not change the results of workload queries
overall…
Full Materialization: 44x faster than the vanilla version of the system
Partial Materialization: 35x faster
Pre-Filtering: 23x faster
The advantage of the Partial Materialization strategy over the Full Materialization strategy…
is that for the Partial Materialization, the time to execute a provenance query and materialize data is 475 times lower.
it’s basically faster to prepare data for executing workload queries
Query Rewriting and Post-Filtering strategies perform significantly slower
to better understand the influence of provenance queries on performance,
To find the reason for such a performance gain over the pure triplestore,
we analysed the BTC dataset and provenance distribution
the figure shows how many context values refer to how many triples
we found that
there are only a handful of context values that are widespread (left-hand side of the figure)
and that the vast majority of the context values are highly selective (right-hand side of the figure)
we leveraged those properties during the query execution,
our strategies prune molecules early based on their context values