LDV: Light-weight Database Virtualization
1. LDV: Light-weight Database Virtualization
Quan Pham², Tanu Malik¹, Boris Glavic³ and Ian Foster¹,²
Computation Institute¹ and Department of Computer Science²,³
University of Chicago¹,², Argonne National Laboratory¹
Illinois Institute of Technology³
Thursday, April 23, 15
2. Share and Reproduce
Alice wants to share her models and
simulation output with Bob, and Bob wants to
re-execute Alice’s application to validate her
inputs and outputs.
3. Significance
[Image: Nature's "Reporting Checklist for Life Sciences Articles", a checklist covering sample-size justification, inclusion/exclusion criteria, randomization, blinding, and the statistical details required in figure legends (exact n, replicates, tests used, one- vs. two-sided, error-bar definitions), used to ensure good reporting standards and improve the reproducibility of published results]
Metrics aims to improve the reproducibility of scientific research.
NY Times, Dec, 2014
4. Alice’s Options
1. A tar and gzip
2. Submit to a repository
3. Build website with code, parameters, and data
4. Create a virtual machine
5. Bob’s Frustration
1-3. I do not find the lib.so required for
building the model.
4. How do I?
Lack of easy and efficient methods for sharing
and reproducibility
[Chart: amount of pain Bob suffers vs. amount of pain Alice suffers, per option]
10. Application Virtualization for DB Applications
• Applications that interact with a relational database
• Examples:
• Text-mining applications that download data,
preprocess and insert into a personal DB
• Analysis scripts using parts of a hosted database
[Diagram: application virtualization (AV) on Alice's computer copies the file-system slice the application touches, e.g., chdir("/usr") and open("lib/libc.so.6"), into a package (Pkg); the DB server sits outside what AV captures]
12. Why doesn’t it work?
• Application virtualization methods are oblivious to the semantics of data in a database system
• The database state at the time of sharing the application may not be the same as at the start of the application
share with Bob (Figure 1). Alice would preferably like to share
this application in the form of a package P with Bob, who may
want to re-execute the application in its entirety, validate just
the analysis task, or provide his own data inputs to examine
the analysis result.
If Alice wants Bob to re-execute and build upon her
database application, then Bob must have access to an en-
vironment that consists of application binaries and data, any
extension modules that the code depends upon (e.g., dynam-
ically linked libraries), a database server and a database on
which the application can be re-executed. Ideally, it would
be useful if Alice’s environment can be virtualized and thus
automatically set up for Bob.
Fig. 1: Alice's experiment with processes P1 (Insert) and P2 (Query) uses tuples t1-t4 and files f1, f2 of a database shared with other experiments (P3, P4).
14. Key Ideas
• DB application = Application (OS) part + DB part
• Use data provenance to capture interactions from/to the
application side to the database side
• Limited formal mechanisms so far to combine the two kinds
of provenance models
• Create a virtualized package that can be re-
executed
• Either include the server and data, or replay interactions
(for licensed databases)
• No virtualization mechanism for database replay
15. Related Work
• Application virtualization
• Linux Containers, CDE [USENIX '11]
• Packaging with annotations
• Docker
• Packaging with provenance
• PTU1[TaPP’13], ReproZip[TaPP’13], Research Objects
• Unified provenance models
• based on program instrumentation [TaPP’12]
1 Q. Pham, T. Malik, and I. Foster. Using provenance for repeatability. In Theory and Practice of Provenance (TaPP), 2013.
16. How does LDV work?
[Diagram: on Alice's machine, `ldv-audit db-app` monitors the application, the OS, and the DB server, and builds a package (Pkg) containing an execution trace, a file-system slice, and a DB slice]
• Monitoring system calls
• Monitoring SQL
• Server-included packages
• Server-excluded packages
• Execution traces
• Relevant DB and filesystem slices
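A minimal sketch of the system-call side of auditing, assuming a wrapper in the style of `ldv-audit` built on `strace`; the `audit` helper and the regex are illustrative, not LDV's actual implementation:

```python
# Illustrative sketch: run an application under strace, parse the
# open/openat syscalls it makes, and collect the set of files that
# would go into the package's file-system slice. Hypothetical code,
# shown only to illustrate the idea of syscall-level auditing.
import re
import subprocess
import tempfile

# Matches the path argument of open("...")/openat(AT_FDCWD, "...") lines.
OPEN_RE = re.compile(r'open(?:at)?\(.*?"([^"]+)"')

def audit(cmd):
    """Run `cmd` under strace and return the set of files it opened."""
    with tempfile.NamedTemporaryFile(mode="r", suffix=".trace") as log:
        subprocess.run(
            ["strace", "-f", "-e", "trace=open,openat", "-o", log.name] + cmd,
            check=True)
        return {m.group(1) for m in map(OPEN_RE.search, log) if m}
```

The collected set is exactly the "file system slice" of the diagram; copying those files into the package root yields something a replay-side wrapper can redirect into.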
17. How does LDV work?
[Diagram: on Bob's machine, `ldv-exec db-app` redirects the application's file and DB accesses into the package's file-system slice and DB slice, guided by the execution trace]
• Redirecting file access
• Redirecting DB access
• Server-included packages
• Server-excluded packages
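The replay-side redirection can be pictured as path rewriting into the package root; a toy sketch, where the `redirect` helper and the package path are made up for illustration:

```python
import os

# Toy sketch of replay-side file redirection: absolute paths the
# application asks for are rewritten into the package's file-system
# slice, roughly what an ldv-exec-style wrapper does when it
# intercepts open() calls. The package root below is hypothetical.
PKG_ROOT = "/home/bob/app-pkg/root"

def redirect(path, pkg_root=PKG_ROOT):
    """Map an absolute path from Alice's machine into the package."""
    return os.path.join(pkg_root, path.lstrip("/"))
```

So an `open("/usr/lib/libc.so.6")` on Bob's machine is served from the copy inside the package rather than from Bob's own file system.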
18. Example
Alice:~$ ldv-audit app.sh
Application package created as app-pkg
Alice:~$ ls
app-pkg app.sh src data
Alice:~$ echo "Hi Bob, Please find the pkg --Alice" |
mutt -s "Sharing DB Application" -a ./app-pkg
-- bob-vldb2015@gmail.com

Bob:~$ ls .
app-pkg
Bob:~$ cd app-pkg
Bob:~$ ls
app.sh src data
Bob:~$ ldv-exec app.sh
Running app-pkg....

Ubuntu 14.04 (Kernel 3.13) + Postgres 9.1
CentOS 6.2 (Kernel 2.6.32) + MySQL
19. LDV Issues
• Monitoring system calls
• Monitoring SQL
• Execution traces
• Relevant DB slices
• Redirecting file access
• Server-included packages
• Server-excluded packages
• Redirecting DB access
21. An Execution Trace
[Fig. 2: An execution trace with processes and database operations: process P1 reads files A and B and executes Insert1 and Insert2, creating tuples t1, t2, t3; process P2 executes a Query returning tuples t4 and t5 and writes file C; edges carry temporal annotations such as [1, 6], [5, 5], [9, 9]]
Legend: a file, a process, a tuple, a query, temporal annotations
Uses provenance entities and activities to model the execution of a DB application
22. Data Dependencies from Provenance
Systems
Fine-Grained DB Provenance (Fig. 3: PLin trace and data dependencies; query Q1 maps input tuples t1, t2, t3 to output tuple t4)
File Operations (Fig. 4: PBB trace and data dependencies; process P1 reads files A and B and writes files C and D)
A DB execution trace has more edges than those
determined by individual provenance systems
A combined execution trace models the execution of a DB
application including its processes, file operations, and DB
accesses based on an OS and a DB provenance model.
Definition 6 (Combined Execution Trace). Let PDB and POS
be DB and OS provenance models. Every execution trace for
PDB+OS is a combined execution trace for PDB and POS.
Example 3. A combined execution trace for the PLin and
PBB models is shown in Figure 2. This trace models the
execution of two processes P1 and P2. Process P1 reads two
files A and B, and executes two insert statements (at time 5
and 8 respectively). These insert statements create three tuple
versions t1, t2, and t3. Process P2 executes a query which
returns tuples t4 and t5. These tuples depend on tuples t1 and
t3. Finally, process P2 writes file C.
VI. DATA DEPENDENCIES
The above definitions describe interactions of activities and
entities in an execution trace of a provenance model, but do not
model data dependencies, i.e., dependencies between entities.
In our model, a dependency is an edge between two entities e
and e′ where a change to the input node (e′) may result in a
change to the output node (e). Given a provenance model, de-
pendency information may or may not be explicitly available;
it depends upon the granularity at which information about
entities and activities is tracked and stored. For instance, the
blackbox provenance model PBB operates at the granularity
of processes and files and may not compute exact dependency
information. Consider a process P that reads from files A and B.
Fig. 5: Annotated relation sales and query result

  sales                      result
  {t1}  id: 1  price: 5      {t2, t3}  ttl: 25
  {t2}  id: 2  price: 11
  {t3}  id: 3  price: 14
compute provenance polynomials (and thus also Lineage) on
demand for an input query. In the following we will use
Lin(Q, t) to denote the Lineage of a tuple t in the result of
query Q.
Example 4. Consider the sales table shown in Figure 5.
The Lineage of each tuple in the sales table is a singleton
set containing the tuple's identifier. The result of the query
SELECT sum(price) AS ttl FROM sales WHERE price > 10
is a single row with ttl = 11 + 14 = 25. The Lineage contains
all tuples (t2 and t3) that were used to compute this result.
We define data dependencies in the PLin model based on
Lineage. We connect each tuple t in the result of a query Q to
all input tuples of the query that are in t's Lineage. Similarly,
we connect a modified tuple t in the result of an update to the
corresponding tuple t′ in the input of the update.
Definition 7 (PLin Data Dependencies). Let G be a PLin
trace. Let Lin(s, t) denote the Lineage of tuple t in the result
of DB operation s, and let t and t′ denote entities (tuples).
The dependencies D(G) ⊆ D × D of G are defined as:
Using PTU1
Using Perm2
1 Q. Pham, T. Malik, and I. Foster. Using provenance for repeatability. In Theory and Practice of Provenance (TaPP), 2013.
2 B. Glavic et al. Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting. In ICDE, 2009.
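Example 4 can be mimicked in a few lines; in this toy evaluator the tuple ids, the `query` helper, and the dict encoding are all illustrative (not Perm's API), but it records Lin(Q, t) for the single aggregate result exactly as the definition prescribes:

```python
# Toy Lineage computation for the Fig. 5 example: the aggregate's
# single output row depends on every input tuple that passed the
# WHERE predicate. Tuple ids t1..t3 follow the slide's notation.
sales = {"t1": {"id": 1, "price": 5},
         "t2": {"id": 2, "price": 11},
         "t3": {"id": 3, "price": 14}}

def query(rel):
    """SELECT sum(price) AS ttl FROM rel WHERE price > 10,
    returning the result together with its Lineage set."""
    used = {tid for tid, row in rel.items() if row["price"] > 10}
    ttl = sum(rel[tid]["price"] for tid in used)
    return {"ttl": ttl, "lineage": used}
```

The `lineage` set is what Definition 7 turns into dependency edges: each output tuple is connected to every input tuple in its Lineage.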
23. Key Question
Can we use temporal annotations and known direct data
dependencies to infer a sound and complete set of
dependencies that helps us determine the smallest-size
repeatability package?
24. Axioms for
Dependency Inference
• no direct data dependencies implies there
is no data flow
• state of node at point in time depends on
past interactions only
• flow of data should not violate temporal
causality
25. Inferring Dependencies
• To determine whether information has flowed from A to C
• find an increasing sequence of times, one per edge, so
that each time lies in that edge's interval
(a) A →[2,3] P1 →[6,7] B →[1,5] P2 →[6,6] C: no dependency between C and A (no such sequence exists)
(b) A →[1,1] P1 →[4,7] B →[2,5] P2 →[1,6] C: C depends on A at time 4 (sequence 1, 4, 5, 6)
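The increasing-sequence test above can be decided greedily; a small sketch, assuming integer timestamps (`depends` is an illustrative name, not LDV's API):

```python
def depends(intervals):
    """Decide whether data can have flowed along a path whose edges
    carry temporal annotations [(lo, hi), ...]: there must exist an
    increasing sequence of integer times, one per edge, with each
    time inside its edge's interval. Greedy strategy: always pick
    the earliest feasible time for the next edge, which leaves the
    most room for the edges that follow.
    """
    t = float("-inf")
    for lo, hi in intervals:
        t = max(t + 1, lo)  # earliest time later than the previous edge's
        if t > hi:          # interval already over: no valid sequence
            return False
    return True

# (a) [2,3],[6,7],[1,5],[6,6]: no dependency (third edge ends before 7)
# (b) [1,1],[4,7],[2,5],[1,6]: dependency, via the times 1, 4, 5, 6
```

Picking the earliest feasible time is safe because delaying any edge can only make later edges harder to satisfy, so the greedy choice is exact here.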
26. Experiments
• 3 Metrics
• Performance
• Usability
• Generality
[Fig. 7: Execution time (seconds, log scale) of each step (Test Prepare, Inserts, First Select, Other Selects, Updates) in an execution of TPC-H]
[Fig. 8: Re-execution time (seconds, log scale) of each step in an execution of TPC-H for queries Q1 and Q2, comparing PostgreSQL, the open-source DB server scenario, and the proprietary DB server scenario]
TABLE III: Content of PTU and LDV packages: PTU packages contain the data directory of the full database, whereas Open-Source DBS LDV packages contain the data directory of an empty database (created by the initdb command).

  Package           Software binaries  Server binaries  Data directory  Database provenance
  PTU               ✓                  ✓                ✓ (full)        ✗
  Open-Source DBS   ✓                  ✓                ✓ (empty)       ✓
  Proprietary DBS   ✓                  ✗                ✗               ✓

[Chart: total package size (MB): Application Virtualization + Database vs. LDV + open-source DB vs. LDV + proprietary DB]
27. TPC-H Queries
• Most TPC-H queries touch large fractions
of tables
• Modified by varying parameters and
selectivity
TABLE II: The 18 TPC-H benchmark queries used in our experiments

Q1-1 to Q1-5 (PARAM = 10, 20, 50, 100, 250; selectivity 1%, 2%, 5%, 10%, 25%):
  SELECT l_quantity, l_partkey, l_extendedprice, l_shipdate, l_receiptdate
  FROM lineitem WHERE l_suppkey BETWEEN 1 AND PARAM;

Q2-1 to Q2-4 (PARAM = 0000000, 000000, 00000, 0000; selectivity 66%, 6.6%, 0.66%, 0.06%):
  SELECT o_comment, l_comment FROM lineitem l, orders o, customer c
  WHERE l.l_orderkey = o.o_orderkey AND o.o_custkey = c.c_custkey
    AND c.c_name LIKE '%PARAM%';

Q3-1 to Q3-4 (PARAM = 0000000, 000000, 00000, 0000; selectivity 66%, 6.6%, 0.66%, 0.06%):
  SELECT count(*) FROM lineitem l, orders o, customer c
  WHERE l.l_orderkey = o.o_orderkey AND o.o_custkey = c.c_custkey
    AND c.c_name LIKE '%PARAM%';

Q4-1 to Q4-5 (PARAM = 10, 20, 50, 100, 250; selectivity 1%, 2%, 5%, 10%, 25%):
  SELECT o_orderkey, AVG(l_quantity) AS avgQ FROM lineitem l, orders o
  WHERE l.l_orderkey = o.o_orderkey AND l_suppkey BETWEEN 1 AND PARAM
  GROUP BY o_orderkey;
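Each query family is one template expanded over its PARAM values; a sketch of that expansion (the template string and `variants` helper are illustrative of how the variants were parameterized, not the authors' harness):

```python
# Illustrative expansion of the Q1 template into its five concrete
# variants (PARAM = 10, 20, 50, 100, 250). Pure string templating;
# no DB connection is assumed.
Q1_TEMPLATE = ("SELECT l_quantity, l_partkey, l_extendedprice, "
               "l_shipdate, l_receiptdate "
               "FROM lineitem WHERE l_suppkey BETWEEN 1 AND {param};")

def variants(template, params):
    """Expand a query template once per PARAM value."""
    return [template.format(param=p) for p in params]
```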
[Charts: (a) audit and (b) replay execution time (seconds, log scale) per query for PostgreSQL + PTU, server-included packages, and server-excluded packages]
28. Size Comparison
[Fig. 9: Package size (MB, log scale) per query (Q1-1 through Q4-5) for PTU packages, server-included packages, and server-excluded packages; LDV packages are significantly smaller than PTU packages when queries have low selectivity]
and the LDV packages. The VMI is 8.2 GB: 80 times larger
than the average LDV package (100MB). To evaluate runtime
performance, we instantiate this VMI using the same number
of cores and memory as in our machine to execute our queries.
• LDV packages are significantly smaller than PTU
packages when queries have low selectivity
• The VMI is 8.2 GB: 80 times larger than the
average LDV package (100MB).
29. Audit and Replay
[Fig. 7: Execution time (seconds, log scale) of each step (Initialization, Inserts, First Select, Other Selects, Updates) in an application with query Q1-1]
[Fig. 8: Execution time (seconds, log scale) for each query (Q1-1 through Q4-5) during audit (left) and replay (right), comparing PostgreSQL + PTU, server-included packages, server-excluded packages, and a VM]
tuples needed to re-execute the application, which for our
queries is at most ~25% of all tuples. Server-excluded LDV
packages are often yet smaller, because they contain only the
query results, which for many of our experiment queries are
smaller than the tuples required for re-execution. However,
recall that server-excluded packages have less flexibility than
do server-included packages.
LDV amortizes audit cost significantly at replay time
30. Summary
• LDV permits sharing and repeating DB
applications
• LDV combines OS and DB provenance to
determine file and DB slices
• LDV creates light-weight virtualized
packages based on combined provenance
• Results show LDV is efficient, usable, and
general
• LDV at http://github.com/lordpretzel/ldv.git
32. Inferring Dependencies
[Fig. 6: Example traces with different temporal annotations]
(a) A →[2,3] P1 →[6,7] B →[1,5] P2 →[6,6] C: no dependency between C and A
(b) A →[1,1] P1 →[4,7] B →[2,5] P2 →[1,6] C: C depends on A at time 4
(c) A →[1,1] P1 →[4,7] B →[2,5] P2 →[1,6] C: no dependency between C and A