We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms
1. P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro
University of Toronto, Illinois Institute of Technology,
Università della Basilicata, Arizona State University
Sep 7th 2016
2. P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. VLDB 2016 - Sep 7th
Overview
‣ Motivations and Goals
‣ Main Ideas
‣ Optimizations
‣ Experimental Results
Motivation
• Ensuring data quality is a crucial task in data management
• Many automatic and semi-automatic data-cleaning algorithms have been proposed
constraint-based
Beskales et al. VLDB10
Bohannon et al. SIGMOD05
Chu et al. ICDE13
Cong et al. VLDB07
Geerts et al. VLDB14
…
statistics-based
Berti-Equille et al. ICDE1
Dasu et al. VLDB12
Prokoshyna et al. VLDB1
Yakout et al. SIGMOD13
…
“What is the right tool for my data-cleaning task?”
Challenges
• No openly available tools or datasets for benchmarking data-cleaning algorithms
• Approaches are usually evaluated using either
  • manually generated errors: very expensive!
  • automatically introduced errors in clean data: algorithms are highly sensitive to the characteristics of the errors!
• Need for scalable and robust evaluation
Contribution
• Bart: Benchmarking Algorithms for data Repairing and Translation
  • open-source error-generation system with a high level of control over the errors
• Input: a clean database wrt a set of data-quality rules, and a set of configuration parameters
• Output: a dirty database (using a set of cell changes) and an estimate of how hard it will be to restore the original values
Overview
‣ Motivations and Goals
‣ Main Ideas
  ‣ Detectability
  ‣ Repairability
  ‣ Violation-Generation Queries
‣ Optimizations
‣ Experimental Results
A Motivating Example
Player
     Name      Season   Team       Stadium           Goals
t1   Giovinco  2013-14  Juventus   Juventus Stadium  3
t2   Giovinco  2014-15  Toronto    BMO Field         23
t3   Pirlo     2014-15  Juventus   Juventus Stadium  5
t4   Pirlo     2015-16  N.Y. City  Yankee St.        0
t5   Vidal     2014-15  Juventus   Juventus Stadium  5
t6   Vidal     2015-16  Bayern     Allianz Arena     3
Quality Rules
Functional dependencies:
Name, Season → Team
Team → Stadium
Represented as Denial Constraints, a very expressive language to capture most data-quality rules used for data repairing: FDs, CFDs, Cleaning EGDs, Editing Rules, Fixing Rules, Ordering Constraints
dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )
dc2: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )
Violation
An instance I violates ¬(φ(x)) if there is an assignment m s.t. I ⊨ φ(m(x))
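The violation definition above can be made concrete with a small sketch. It encodes the Player instance as Python tuples and each denial constraint as a predicate over a pair of tuples; the encoding and the helper name `violations` are assumptions made for illustration and are not taken from the Bart system. The dirty instance is built from a hypothetical change to t5.Stadium.

```python
from itertools import permutations

# Player(Name, Season, Team, Stadium, Goals), keyed by tuple id as on the slide.
PLAYER = {
    "t1": ("Giovinco", "2013-14", "Juventus", "Juventus Stadium", 3),
    "t2": ("Giovinco", "2014-15", "Toronto", "BMO Field", 23),
    "t3": ("Pirlo", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t4": ("Pirlo", "2015-16", "N.Y. City", "Yankee St.", 0),
    "t5": ("Vidal", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t6": ("Vidal", "2015-16", "Bayern", "Allianz Arena", 3),
}

def dc1(a, b):
    # Body of dc1: same name and season, different team.
    return a[0] == b[0] and a[1] == b[1] and a[2] != b[2]

def dc2(a, b):
    # Body of dc2: same team, different stadium.
    return a[2] == b[2] and a[3] != b[3]

def violations(instance, body):
    """Ordered pairs of tuple ids that satisfy the body of a denial constraint."""
    return [(i, j) for i, j in permutations(instance, 2)
            if body(instance[i], instance[j])]

# The original instance is clean with respect to both constraints.
clean_ok = not violations(PLAYER, dc1) and not violations(PLAYER, dc2)

# Hypothetical change: overwrite the Stadium cell of t5.
dirty = dict(PLAYER)
dirty["t5"] = PLAYER["t5"][:3] + ("Camp Nou",) + PLAYER["t5"][4:]
dc2_pairs = violations(dirty, dc2)
```

Enumerating all pairs is quadratic and only suitable for toy instances; it is meant to mirror the definition, not to suggest how violations are found at scale.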
A Motivating Example
Cell Changes
ch1: t5.Stadium := “Camp Nou”
✔ ch1 is a detectable change: dc2 is violated, since t1, t3 and t5 have the same team but different stadiums
We call {t1, t3, t5} the context equivalence class
✔ easy to correct: the original value “Juventus Stadium” appears in t1 and t3
Repairability: the probability of restoring t5.Stadium to its original value by picking, uniformly at random, a Stadium value from its context equivalence class
Rep = 2 / 3 ≈ 0.67
A Motivating Example
Cell Changes
ch2: t1.Season := “2014-15”
✔ ch2 is a detectable change: dc1 is violated, since t1 and t2 have the same name and season but different teams (as well as different stadiums and goals)
✘ hard to correct: the original value “2013-14” disappears from the instance
Repairability: 0 / 2 = 0
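The two repairability values above (2/3 for ch1 and 0 for ch2) follow from the same counting argument, which can be sketched as below. The function name `repairability` and the list-based encoding of a context equivalence class are assumptions for illustration; the paper should be consulted for the exact definition used by Bart.

```python
def repairability(context_values, original):
    """Probability of restoring `original` by picking uniformly at random
    among the values found in the context equivalence class after the change."""
    if not context_values:
        return 0.0
    hits = sum(1 for value in context_values if value == original)
    return hits / len(context_values)

# ch1: Stadium values of {t1, t3, t5} after t5.Stadium := "Camp Nou".
rep_ch1 = repairability(
    ["Juventus Stadium", "Juventus Stadium", "Camp Nou"], "Juventus Stadium")

# ch2: Season values of {t1, t2} after t1.Season := "2014-15".
rep_ch2 = repairability(["2014-15", "2014-15"], "2013-14")
```

Here `rep_ch1` evaluates to 2/3 and `rep_ch2` to 0, matching the slides.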
A Motivating Example
Cell Changes
ch3: t5.Name := “Pirlo” ✘ is an undetectable change
ch2: t1.Season := “2014-15” ✔
ch4: t3.Name := “Pirlo” ✔
We need to keep track of the context of each change
Violation-Generation Queries
• Each comparison of a dc suggests a different strategy for finding cells to modify to generate detectable errors
• Starting from a dc we generate a set of vio-gen queries
     Name      Season   Team
t1   Giovinco  2013-14  Juventus
t2   Giovinco  2013-14  Juventus
t3   Pirlo     2013-14  N.Y. City
dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )
vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t = t’
Result of the query: t1, t2
We’ll have a detectable change by making t1.Team and t2.Team different
t1.Team := “Juve” ✔
vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n ≠ n’, s=s’, t ≠ t’
Result of the query: t2, t3
We’ll have a detectable change by making t2.Name and t3.Name equal
t3.Name := “Giovinco” ✔
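One way to read the two queries above is that each is obtained from dc1 by negating exactly one comparison and keeping the others. The sketch below follows that reading on the three-tuple instance from the slide. The list encoding of a dc and the names `vio_gen_queries` and `run` are assumptions for illustration, not Bart's actual representation.

```python
from itertools import combinations

NEGATE = {"=": "≠", "≠": "="}

# dc1 as a list of (attribute, operator) comparisons between two Player tuples.
DC1 = [("Name", "="), ("Season", "="), ("Team", "≠")]

# Instance from the slide.
PLAYER = {
    "t1": {"Name": "Giovinco", "Season": "2013-14", "Team": "Juventus"},
    "t2": {"Name": "Giovinco", "Season": "2013-14", "Team": "Juventus"},
    "t3": {"Name": "Pirlo", "Season": "2013-14", "Team": "N.Y. City"},
}

def vio_gen_queries(dc):
    """Yield (attribute, body): the body negates that attribute's comparison only."""
    for k, (attr, op) in enumerate(dc):
        body = list(dc)
        body[k] = (attr, NEGATE[op])
        yield attr, body

def run(body, instance):
    """Unordered pairs of tuple ids satisfying every comparison in the body."""
    def holds(a, b):
        return all((a[attr] == b[attr]) == (op == "=") for attr, op in body)
    return [(i, j) for i, j in combinations(sorted(instance), 2)
            if holds(instance[i], instance[j])]

# Map each negated attribute to the pairs returned by its vio-gen query.
results = {attr: run(body, PLAYER) for attr, body in vio_gen_queries(DC1)}
```

The query that negates the Team comparison returns the pair (t1, t2), and the one that negates the Name comparison returns (t1, t3) and (t2, t3); the slide shows t2, t3 as one such result. Changing the negated attribute in a returned pair so that the original comparison holds again produces a violation of dc1.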
Error-Generation Task
EG-Task E = {S, Σ, I, CONF}
• S: relational schema
• Σ: a set of denial constraints over S
• I: an instance over schema S clean wrt Σ
• CONF: configuration parameters
  • % of detectable errors, % of random errors
• Theorem 1: Generating the requested number of detectable errors is NP-Complete (data complexity)
Optimizations
• Greedy PTIME algorithm
  • two cell changes cannot share a context
  • sound but not complete
  • in practice for low error ratios (~10-20%) the probability of success is very high
• Main cost factor
  • executing vio-gen queries on DBMS
  • optimizations for symmetric constraints and cross-products
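The rule that two cell changes cannot share a context can be sketched as a single greedy pass over candidate changes. This is a simplified illustration under stated assumptions: a context is modeled as a plain set of cells, candidates are assumed to come from vio-gen query results, and `greedy_select` is a hypothetical name. It is not the algorithm as implemented in Bart.

```python
def greedy_select(candidates, requested):
    """Pick up to `requested` changes whose contexts are pairwise disjoint.

    `candidates` is an iterable of (cell, context) pairs, where `context` is a
    set of cells that includes the changed cell. The result may contain fewer
    changes than requested: the strategy is sound but not complete.
    """
    used = set()
    chosen = []
    for cell, context in candidates:
        if len(chosen) == requested:
            break
        if used.isdisjoint(context):  # reject any overlap with earlier contexts
            chosen.append(cell)
            used |= context
    return chosen

stadium_ctx = {("t1", "Stadium"), ("t3", "Stadium"), ("t5", "Stadium")}
season_ctx = {("t1", "Season"), ("t2", "Season")}

c1 = (("t5", "Stadium"), stadium_ctx)
c2 = (("t3", "Stadium"), stadium_ctx)  # shares its context with c1
c3 = (("t1", "Season"), season_ctx)

picked = greedy_select([c1, c2, c3], requested=2)
short = greedy_select([c1, c2], requested=2)
```

In this toy run `picked` contains the t5.Stadium and t1.Season changes, while `short` contains only one change even though two were requested, which illustrates the loss of completeness.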
Symmetric Constraints
• Computing joins may be expensive!
• We identify a class of DCs (that includes FDs and most CFDs) where group-by can be used to reduce the size of join inputs
• Idea: find and execute isomorphic subqueries to avoid redundant work
Player(n, s, t, st), Player(n’, s’, t’, st’), n=n’, s=s’, t ≠ t’
1. Formula Graph: two Player nodes, (n, s, t, st) and (n’, s’, t’, st’), with edges n = n’, s = s’, t ≠ t’
2. Reduced Formula with adornments
Player(n=, s=, t≠, st)
3. Group-By Query
SELECT name, season, team FROM player
WHERE (name, season) IN
  (SELECT name, season FROM player
   GROUP BY name, season
   HAVING count(DISTINCT team) > 1)
ORDER BY name, season
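To illustrate why the group-by form can stand in for the self-join on a symmetric constraint, the sketch below runs both on a tiny in-memory SQLite table and compares the tuples they return. The table contents are made up for illustration, and the row-value IN of the slide is written as a join against the grouped subquery so that it also runs on older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (name TEXT, season TEXT, team TEXT)")
conn.executemany("INSERT INTO player VALUES (?, ?, ?)", [
    ("Giovinco", "2014-15", "Juventus"),
    ("Giovinco", "2014-15", "Toronto"),   # conflicts with the row above
    ("Pirlo", "2014-15", "Juventus"),
    ("Vidal", "2014-15", "Juventus"),
])

# Direct translation of the formula: a self-join on name and season.
SELF_JOIN = """
SELECT DISTINCT p1.name, p1.season, p1.team
FROM player p1 JOIN player p2
  ON p1.name = p2.name AND p1.season = p2.season AND p1.team <> p2.team
ORDER BY p1.name, p1.season, p1.team
"""

# Group-by form: only groups with more than one distinct team are joined back.
GROUP_BY = """
SELECT DISTINCT p.name, p.season, p.team
FROM player p JOIN (
    SELECT name, season FROM player
    GROUP BY name, season
    HAVING count(DISTINCT team) > 1
) g ON p.name = g.name AND p.season = g.season
ORDER BY p.name, p.season, p.team
"""

join_rows = conn.execute(SELF_JOIN).fetchall()
groupby_rows = conn.execute(GROUP_BY).fetchall()
conn.close()
```

Both queries return the two Giovinco rows. The group-by variant scans the table once to find the offending groups, so the join input is limited to those groups rather than the full table on both sides.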
Cross Products
A Common Pattern
dc4: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )
Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t ≠ t’, st ≠ st’
The result of the vio-gen query will be all possible pairs of players with different team and different stadium: quadratic cost
However: we are typically only interested in a small set of cells
Solution: we materialize a random sample of the tuples in Player in main memory and compute the cross product to identify cells to change and their contexts
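The sampling idea can be sketched as follows: instead of asking the DBMS for every pair, draw a random sample of tuples and compute the pairs in memory. The function name `sample_pairs`, the fixed seed, and the dictionary encoding are assumptions for illustration; how Bart chooses the sample size is not stated on the slide.

```python
import random
from itertools import combinations

PLAYER = {
    "t1": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t2": {"Team": "Toronto", "Stadium": "BMO Field"},
    "t3": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t4": {"Team": "N.Y. City", "Stadium": "Yankee St."},
    "t5": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t6": {"Team": "Bayern", "Stadium": "Allianz Arena"},
}

def sample_pairs(tuples, sample_size, seed=0):
    """Pairs, drawn from a random sample, with different team and different stadium."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    ids = sorted(tuples)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    return [(i, j) for i, j in combinations(sorted(sample), 2)
            if tuples[i]["Team"] != tuples[j]["Team"]
            and tuples[i]["Stadium"] != tuples[j]["Stadium"]]

all_pairs = sample_pairs(PLAYER, sample_size=6)   # full cross product
some_pairs = sample_pairs(PLAYER, sample_size=3)  # at most 3 pairs
```

With the full instance the query yields 12 of the 15 possible pairs, which shows how quickly the result grows; a sample of k tuples bounds the work at k(k-1)/2 pairs regardless of table size.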
Evaluation of the Tools
Tools
- Llunatic: Geerts et al. VLDB14
- Holistic: Chu et al. ICDE13
- Greedy: Bohannon et al. SIGMOD05, Cong et al. VLDB07
- Sampling: Beskales et al. VLDB10
Tasks
- Constraint-based with 5% errors and different repairability levels: High (~0.8), Med (~0.5), and Low (~0.25)
Scalability Results
Lessons Learned
• Automated tools are essential for robust and broad empirical evaluations
• Data-repairing is not yet mature: no definitive automatic data-repairing algorithm yet
• Repairability matters
• We need to document our dirty data
• Algorithms are sensitive to error characteristics!
• Generating errors is hard