3. SPADEv2 – Provenance Collection
https://github.com/ashish-gehani/SPADE/wiki
• Strace Reporter
– Programs are run under strace; the produced log is parsed to extract provenance.
• LLVMTrace
– Instrumentation is added at function boundaries at compile time (see the sketch below).
• DataTracker
– Dynamic taint analysis: bytes are associated with metadata, which is propagated as the program executes.
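To make the compile-time mechanism concrete, here is a minimal C sketch of function-boundary instrumentation. It uses the compiler's generic -finstrument-functions hooks rather than LLVMTrace's actual LLVM pass, and the emitted records are purely illustrative.

/* Sketch of compile-time function-boundary instrumentation (illustration only,
 * not the LLVMTrace implementation).
 * Hypothetical build line: cc -finstrument-functions demo.c hooks.c -o demo
 */
#include <stdio.h>

/* The hooks themselves must not be instrumented, or they would recurse. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site)
{
    /* A real reporter would emit a provenance vertex/edge here. */
    fprintf(stderr, "enter fn=%p from=%p\n", fn, call_site);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site)
{
    fprintf(stderr, "exit  fn=%p from=%p\n", fn, call_site);
}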
7. Incomplete Picture
• Faster, but how much?
• What is the performance “price” for fewer
false positives?
• Is a compile-time solution worth the effort?
8. How can one get more insight?
Run a benchmark!
9. Which one?
• LMBench, UnixBench, Postmark, BLAST,
SPECint…
• [Traeger 08]: “Most popular benchmarks
are flawed.”
• No matter what you choose, there will be blind spots.
10. Start simple: UnixBench
• Well understood sub-benchmarks.
• Emphasizes the performance of system calls.
• System calls are commonly used for the
extraction of provenance.
• Gives more insight into which collection backend would suit specific applications.
• Provides a performance baseline against which the specific implementations can be improved.
13. Performance vs. Integration Effort
• Capturing provenance from completely
unmodified programs may degrade
performance.
• Modification of either the source
(LLVMTrace) or the platform (LPM, Hi-Fi)
should be considered for a production
deployment.
14. Performance vs. Provenance Granularity
• We couldn’t verify this intuition in the case of the strace reporter compared to LLVMTrace.
– The strace reporter implementation is not optimal.
• Tracking fine-grained provenance may
interfere with existing optimizations.
– E.g., buffered I/O does not benefit DataTracker.
15. Performance vs. False Positives/Analysis Scope
• “Brute-forcing” a low false-positive ratio with the
“track everything” approach of DataTracker is
prohibitively expensive.
• Limiting the analysis scope gives a performance
boost.
• If we exploit known semantics, we can have the
best of both worlds.
– Pre-existing semantic knowledge: LLVMTrace
– Dynamically acquired knowledge: ProTracer [Ma
2016]
19. Takeaway: Taint Analysis
• Prohibitively expensive for computation-intensive programs.
• Likely to remain so, even after optimizations.
• Reserved for provenance analysis of unknown/legacy software.
• Offline approach (Stamatogiannakis, TAPP’15).
20. Generalizing the Results
• Only one implementation was tested for each method.
• Repeating the tests with alternative implementations would increase confidence in the insights gained.
• More confidence when choosing a specific collection method.
[Diagram: different methods vs. different implementations]
21. Implementation Details Matter
• Our results are influenced by the specifics
of the implementation.
• Anecdote: the initial implementation of LLVMTrace was actually slower than the strace reporter.
22. Provenance Quality
• Qualitative features of the
provenance are also very
important.
• How many vertices/edges are
contained in the generated
provenance graph?
• Precision/Recall based on
provenance ground truth.
[Diagram: performance benchmarks vs. qualitative benchmarks]
23. Where to go next?
• UnixBench is a basic benchmark.
• SPEC: Comprehensive in terms of
performance evaluation.
– Hard to get the provenance ground truth needed to assess the quality of captured provenance.
• Better directions:
– Coreutils based micro-benchmarks.
– Macro-benchmarks (e.g. Postmark,
compilation benchmarks).
24. Conclusion
• Automatic provenance capture is an
important part of the ecosystem
• Trade-offs in different capture modes
• Benchmarking – to inform
• Common platforms are essential
Traeger focused on using benchmarks to measure filesystem/storage performance. His observation remains largely valid when benchmarks are used to measure other types of performance.
SPEC includes several sub-benchmarks which may be atypical for provenance analysis. E.g. discrete event simulator or quantum computer simulator.
If we want to also measure precision/recall, it is hard to get the ground truth for the provenance generated by these benchmarks.
1. execl-xput: How fast the current process image can be replaced with a new one, as a result of an execve system call.
2. fcopy-256, fcopy-1024, fcopy-4096: Speed of a file-to-file copy using different buffer sizes.
3. pipe-xput, pipe-cs: Speed of communication over pipes. In the first test, the reads and writes on the pipe happen in a single process. In the second test a second process is spawned, so the communication also includes a context switch between the two.
4. spawn-xput: A simple fork-wait loop to measure how much time is needed to create and then destroy a process.
5. shell-1, shell-8: Execution speed for the processing of a data file. The processing is implemented with common Unix utilities, wrapped in a shell script. The two tests differ in the number of concurrently executing scripts.
6. syscall: System call overhead. The test uses getpid to measure this; that system call is chosen because it requires minimal in-kernel processing, so its main overhead comes from the switch between kernel and user mode (see the sketch below).
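To illustrate what the syscall test measures, below is a minimal sketch (not UnixBench's own code) that times a tight getpid loop; syscall(SYS_getpid) is used so that libc caching cannot elide the actual system call.

/* Minimal sketch of a system-call overhead measurement in the spirit of the
 * UnixBench syscall test (illustration only, not UnixBench code). */
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const long iterations = 10 * 1000 * 1000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        syscall(SYS_getpid);            /* minimal in-kernel work per call */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%.1f ns per getpid system call\n", elapsed / iterations * 1e9);
    return 0;
}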
Degradation depends on the method used.
LPM [Bates 15] already supports the SPADE DSL reporter.
Reason for being slower: Lack of buffering. A new connection was opened each time we needed to output a piece of provenance.
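A minimal sketch of the difference (hypothetical code; a plain file stream stands in for the actual connection to the SPADE server):

/* Hypothetical illustration of the two output strategies; not SPADE code. */
#include <stdio.h>

/* Initial (slow) approach: open a fresh "connection" for every record. */
void report_per_record(const char *record)
{
    FILE *out = fopen("/tmp/provenance.out", "a");   /* hypothetical sink */
    if (out) {
        fprintf(out, "%s\n", record);
        fclose(out);
    }
}

/* Buffered approach: keep one stream open; stdio flushes in larger chunks. */
static FILE *sink;

void report_buffered(const char *record)
{
    if (!sink)
        sink = fopen("/tmp/provenance.out", "a");
    if (sink)
        fprintf(sink, "%s\n", record);
}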
SPADEv2 provides easy interfacing with other provenance systems via the DSL reporter.
Linux Provenance Modules [Bates 15] already support it.
This makes it a good platform for measuring qualitative features (such as the number of edges/vertices) and for running queries that verify whether the ground truth was captured.
Execl: execve speed
Fcopy-*: file copy with different buffer sizes
Pipe-*: pipe communication
Spawn: fork-wait loop (process creation/destruction speed)
Shell-*: Unix utilities wrapped in a script. Similar to what coreutils testing would yield.
Syscall: system call overhead (uses getpid as the most “lightweight” system call)