P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro
University of Toronto, Illinois Institute of Technology,
Università della Basilicata, Arizona State University
Sep 7th 2016
P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro - VLDB 2016 - Sep 7th
Overview
‣ Motivations and Goals
‣ Main Ideas
‣ Optimizations
‣ Experimental Results
Motivation
• Ensuring data quality is a crucial task in data management
• Many automatic and semi-automatic data-cleaning algorithms have been proposed
constraint-based
Beskales et al. VLDB10
Bohannon et al. SIGMOD05
Chu et al. ICDE13
Cong et al. VLDB07
Geerts et al. VLDB14
…
statistics-based
Berti-Equille et al. ICDE1
Dasu et al. VLDB12
Prokoshyna et al. VLDB1
Yakout et al. SIGMOD13
…
“What is the right tool for my
data-cleaning task?”
Challenges
• No openly available tools or datasets for benchmarking data-cleaning algorithms
• Approaches are usually evaluated using either
  • manually generated errors: very expensive!
  • automatically introduced errors in clean data: algorithms are highly sensitive to the characteristics of the errors!
• Need for scalable and robust evaluation
Contribution
• BART: Benchmarking Algorithms for data Repairing and Translation
  • open-source error-generation system with a high level of control over the errors
• Input: a database that is clean with respect to a set of data-quality rules, and a set of configuration parameters
• Output: a dirty database (obtained through a set of cell changes) and an estimate of how hard it will be to restore the original values
Overview
‣ Motivations and Goals
‣ Main Ideas
  ‣ Detectability
  ‣ Repairability
  ‣ Violation-Generation Queries
‣ Optimizations
‣ Experimental Results
A Motivating Example
Player
      Name      Season   Team       Stadium           Goals
t1    Giovinco  2013-14  Juventus   Juventus Stadium  3
t2    Giovinco  2014-15  Toronto    BMO Field         23
t3    Pirlo     2014-15  Juventus   Juventus Stadium  5
t4    Pirlo     2015-16  N.Y. City  Yankee St.        0
t5    Vidal     2014-15  Juventus   Juventus Stadium  5
t6    Vidal     2015-16  Bayern     Allianz Arena     3
Quality Rules
Functional dependencies:
Name, Season → Team
Team → Stadium
Represented as Denial Constraints, a very expressive language that captures most data-quality rules used for data repairing: FDs, CFDs, Cleaning EGDs, Editing Rules, Fixing Rules, Ordering Constraints
dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )
dc2: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )
Violation
An instance I violates ¬(φ(x)) if there is an assignment m s.t. I ⊨ φ(m(x))
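The violation definition can be illustrated with a minimal Python sketch. It assumes the Player instance and the two denial constraints from the slide; the names `PLAYER`, `DIRTY`, `dc1`, `dc2` and `violations` are hypothetical helpers introduced here, not part of BART. An assignment m is modelled as an ordered pair of tuple ids.

```python
from itertools import permutations

# Player instance from the slide: (Name, Season, Team, Stadium, Goals)
PLAYER = {
    "t1": ("Giovinco", "2013-14", "Juventus", "Juventus Stadium", 3),
    "t2": ("Giovinco", "2014-15", "Toronto", "BMO Field", 23),
    "t3": ("Pirlo", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t4": ("Pirlo", "2015-16", "N.Y. City", "Yankee St.", 0),
    "t5": ("Vidal", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t6": ("Vidal", "2015-16", "Bayern", "Allianz Arena", 3),
}


def dc1(a, b):
    """Body of dc1: same name and season, different team."""
    return a[0] == b[0] and a[1] == b[1] and a[2] != b[2]


def dc2(a, b):
    """Body of dc2: same team, different stadium."""
    return a[2] == b[2] and a[3] != b[3]


def violations(instance, body):
    """All ordered pairs of tuple ids (assignments) that satisfy the body."""
    return [(i, j) for i, j in permutations(instance, 2)
            if body(instance[i], instance[j])]


# Hypothetical dirty copy: t5.Stadium changed to "Camp Nou"
DIRTY = {**PLAYER, "t5": ("Vidal", "2014-15", "Juventus", "Camp Nou", 5)}

print(violations(PLAYER, dc1), violations(PLAYER, dc2))
print(violations(DIRTY, dc2))
```

The original instance yields no satisfying assignment for either constraint, so it is clean; the dirty copy violates dc2 through the pairs that combine t5 with t1 or t3.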
Cell Changes
ch1: t5.Stadium := “Camp Nou”
✔ ch1 is a detectable change: dc2 is violated, since t1, t3 and t5 have the same team but different stadiums
  we call {t1, t3, t5} the context equivalence class of the change
✔ easy to correct: the original value “Juventus Stadium” appears in t1 and t3
Repairability: the probability of restoring t5.Stadium to its original value by picking a Stadium value uniformly at random from its context equivalence class
Rep = 2 / 3 ≈ 0.67
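The repairability figure for ch1 can be reproduced with a short Python sketch. It follows the definition on the slide (uniform random pick from the context equivalence class); the function name `repairability` and the list `ch1_context` are assumptions made for this illustration.

```python
from fractions import Fraction


def repairability(original_value, context_values):
    """Probability of drawing the original value uniformly at random
    from the values of the context equivalence class."""
    hits = sum(1 for value in context_values if value == original_value)
    return Fraction(hits, len(context_values))


# ch1: t5.Stadium := "Camp Nou"; Stadium values of t1, t3 and t5 after the change
ch1_context = ["Juventus Stadium", "Juventus Stadium", "Camp Nou"]
rep_ch1 = repairability("Juventus Stadium", ch1_context)
print(rep_ch1, round(float(rep_ch1), 2))
```

Using `Fraction` keeps the exact ratio 2/3 visible before rounding.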
Cell Changes
ch2: t1.Season := “2014-15”
✔ ch2 is a detectable change: dc1 is violated, since t1 and t2 have the same name and season but different teams, stadiums and goals
✘ hard to correct: the original value “2013-14” disappears from the instance
Repairability: 0 / 2 = 0
Cell Changes
ch3: t5.Name := “Pirlo”
✘ ch3 is an undetectable change: t5 and t3 then agree on name, season and team, so dc1 is not violated
ch2: t1.Season := “2014-15” ✔
ch4: t3.Name := “Pirlo” ✔
✘
We need to keep track of the context of each change
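Whether a single change is detectable can be checked mechanically. The sketch below is a simplification under stated assumptions: it treats a change as detectable when the changed tuple takes part in at least one violation of dc1 or dc2, which is coarser than tracking individual cells and contexts. All names (`CLEAN`, `apply_changes`, `is_detectable`) are hypothetical.

```python
from itertools import permutations

ATTRS = ("Name", "Season", "Team", "Stadium", "Goals")
CLEAN = {
    "t1": ("Giovinco", "2013-14", "Juventus", "Juventus Stadium", 3),
    "t2": ("Giovinco", "2014-15", "Toronto", "BMO Field", 23),
    "t3": ("Pirlo", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t4": ("Pirlo", "2015-16", "N.Y. City", "Yankee St.", 0),
    "t5": ("Vidal", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t6": ("Vidal", "2015-16", "Bayern", "Allianz Arena", 3),
}


def dc1(a, b):
    """Same name and season, different team."""
    return a[0] == b[0] and a[1] == b[1] and a[2] != b[2]


def dc2(a, b):
    """Same team, different stadium."""
    return a[2] == b[2] and a[3] != b[3]


def apply_changes(instance, changes):
    """Copy the instance and apply each (tuple id, attribute, value) change."""
    dirty = {tid: list(row) for tid, row in instance.items()}
    for tid, attr, value in changes:
        dirty[tid][ATTRS.index(attr)] = value
    return dirty


def is_detectable(instance, tid):
    """True if tuple `tid` takes part in a violation of dc1 or dc2."""
    return any(body(instance[a], instance[b])
               for a, b in permutations(instance, 2) if tid in (a, b)
               for body in (dc1, dc2))


ch2 = ("t1", "Season", "2014-15")
ch3 = ("t5", "Name", "Pirlo")
print(is_detectable(apply_changes(CLEAN, [ch2]), "t1"))  # ch2 alone
print(is_detectable(apply_changes(CLEAN, [ch3]), "t5"))  # ch3 alone
```

Applied in isolation, ch2 triggers dc1 against t2, while ch3 triggers nothing because t5 ends up agreeing with t3 on name, season and team.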
Violation-Generation Queries
• Each comparison of a dc suggests a different strategy for finding cells to modify to generate detectable errors
• Starting from a dc we generate a set of vio-gen queries

      Name      Season   Team
t1    Giovinco  2013-14  Juventus
t2    Giovinco  2013-14  Juventus
t3    Pirlo     2013-14  N.Y. City

dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )

vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t = t’
Result of the query: t1, t2
We’ll have a detectable change by making t1.Team and t2.Team different
t1.Team := “Juve” ✔

vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n ≠ n’, s=s’, t ≠ t’
Result of the query: t2, t3
We’ll have a detectable change by making t2.Name and t3.Name equal
t3.Name := “Giovinco” ✔
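The derivation of vio-gen queries from a dc can be sketched as follows. This is an illustrative Python sketch, assuming that one query is produced per comparison by negating that comparison and keeping the others, as the two examples on the slide suggest; the encoding of a dc as a list of `(attribute, operator)` pairs and the function names are assumptions.

```python
from itertools import combinations

# Toy instance from the slide
ROWS = {
    "t1": {"Name": "Giovinco", "Season": "2013-14", "Team": "Juventus"},
    "t2": {"Name": "Giovinco", "Season": "2013-14", "Team": "Juventus"},
    "t3": {"Name": "Pirlo", "Season": "2013-14", "Team": "N.Y. City"},
}

# dc1 as comparisons between the two Player atoms: n = n', s = s', t != t'
DC1 = [("Name", "="), ("Season", "="), ("Team", "!=")]


def vio_gen_queries(dc):
    """One query per comparison: negate that comparison, keep the others."""
    flip = {"=": "!=", "!=": "="}
    return [[(attr, flip[op]) if k == i else (attr, op)
             for k, (attr, op) in enumerate(dc)]
            for i in range(len(dc))]


def evaluate(query, rows):
    """Unordered pairs of tuple ids that satisfy every comparison of the query."""
    def holds(x, y):
        return all((x[attr] == y[attr]) == (op == "=") for attr, op in query)
    return [(a, b) for a, b in combinations(rows, 2) if holds(rows[a], rows[b])]


for query in vio_gen_queries(DC1):
    print(query, evaluate(query, ROWS))
```

On this toy instance the query with t = t’ returns the pair (t1, t2), and the query with n ≠ n’ returns (t2, t3) together with (t1, t3); the slide highlights the first of these two pairs.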
Error-Generation Task
EG-Task E = {S, Σ, I, CONF}
• S: relational schema
• Σ: a set of denial constraints over S
• I: an instance over schema S, clean wrt Σ
• CONF: configuration parameters
  • % of detectable errors, % of random errors
• Theorem 1: Generating the requested number of detectable errors is NP-complete (data complexity)
Optimizations
• Greedy PTIME algorithm
  • two cell changes cannot share a context
  • sound but not complete
  • in practice, for low error ratios (~10-20%), the probability of success is very high
• Main cost factor
  • executing vio-gen queries on the DBMS
  • optimizations for symmetric constraints and cross-products
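The rule that two cell changes cannot share a context can be illustrated with a small greedy selection loop. This is a hedged sketch, not BART's actual algorithm: the candidate format (a cell plus the set of cells forming its context), the function `greedy_select` and the example contexts are all hypothetical.

```python
def greedy_select(candidates, requested):
    """Pick up to `requested` changes whose contexts are pairwise disjoint.

    candidates: iterable of (cell, context), where a cell is a
    (tuple id, attribute) pair and context is a set of cells on which
    the detectability of the change depends.
    """
    used = set()   # cells already changed or reserved as context
    chosen = []
    for cell, context in candidates:
        if len(chosen) == requested:
            break
        footprint = {cell} | set(context)
        if footprint & used:
            continue  # would share a context with an accepted change
        chosen.append(cell)
        used |= footprint
    return chosen


# Hypothetical candidates, e.g. produced by vio-gen queries
candidates = [
    (("t5", "Stadium"), {("t1", "Stadium"), ("t3", "Stadium")}),
    (("t3", "Stadium"), {("t1", "Stadium"), ("t5", "Stadium")}),  # overlaps the first
    (("t1", "Season"), {("t2", "Season")}),
]
print(greedy_select(candidates, 2))
print(greedy_select(candidates, 3))  # only two compatible changes exist
```

The second call shows why such a strategy is sound but not complete: every returned change keeps its context intact, yet the loop may return fewer changes than requested.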
Symmetric Constraints
• Computing joins may be expensive!
• We identify a class of DCs (that includes FDs and most CFDs) where group-by can be used to reduce the size of join inputs
• Idea: find and execute isomorphic subqueries to avoid redundant work
Player(n, s, t, st), Player(n’, s’, t’, st’), n=n’, s=s’, t ≠ t’
1. Formula Graph
   two Player nodes connected by the edges n = n’, s = s’, t ≠ t’
2. Reduced Formula with adornments
   Player(n=, s=, t ≠, st)
3. Group-By Query
   SELECT name, season, team FROM player
   WHERE (name, season) IN
     (SELECT name, season FROM player
      GROUP BY name, season
      HAVING count(DISTINCT team) > 1)
   ORDER BY name, season
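The group-by query can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the four-column `player` table and its rows are made up for the illustration (the first row plays the role of a dirty tuple), the row-value `IN` syntax requires a reasonably recent SQLite, and `team` is added to the `ORDER BY` only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE player (name TEXT, season TEXT, team TEXT, stadium TEXT)")
conn.executemany("INSERT INTO player VALUES (?, ?, ?, ?)", [
    ("Giovinco", "2014-15", "Juventus", "Juventus Stadium"),  # dirty season
    ("Giovinco", "2014-15", "Toronto", "BMO Field"),
    ("Pirlo", "2014-15", "Juventus", "Juventus Stadium"),
    ("Pirlo", "2015-16", "N.Y. City", "Yankee St."),
])

# Group on the attributes compared with "=", then keep groups
# that contain more than one distinct value for the "!=" attribute.
GROUP_BY_QUERY = """
SELECT name, season, team FROM player
WHERE (name, season) IN
  (SELECT name, season FROM player
   GROUP BY name, season
   HAVING count(DISTINCT team) > 1)
ORDER BY name, season, team
"""

violating = conn.execute(GROUP_BY_QUERY).fetchall()
print(violating)
conn.close()
```

Only one scan with grouping is needed to find the tuples that agree on name and season but differ on team, instead of a self-join over all pairs.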
Cross Products
A Common Pattern
dc4: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )
vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t ≠ t’, st ≠ st’
The result of the vio-gen query will be all possible pairs of players with different team and different stadium → quadratic cost
However: we are typically only interested in a small set of cells
Solution: we materialize a random sample of the tuples in Player in main memory and compute the cross product to identify cells to change and their contexts
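The sampling idea can be sketched in Python as follows. The sketch assumes the sample is drawn uniformly without replacement and that the cross product is computed over unordered pairs; the function `sampled_candidates`, its parameters and the fixed seed are hypothetical choices for the illustration.

```python
import random
from itertools import combinations

# Team and Stadium columns of the Player instance from the slides
ROWS = {
    "t1": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t2": {"Team": "Toronto", "Stadium": "BMO Field"},
    "t3": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t4": {"Team": "N.Y. City", "Stadium": "Yankee St."},
    "t5": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t6": {"Team": "Bayern", "Stadium": "Allianz Arena"},
}


def sampled_candidates(rows, sample_size, seed=0):
    """Evaluate t != t', st != st' on a random in-memory sample
    instead of on the full cross product of the table."""
    rng = random.Random(seed)
    ids = sorted(rows)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    return [(a, b) for a, b in combinations(sorted(sample), 2)
            if rows[a]["Team"] != rows[b]["Team"]
            and rows[a]["Stadium"] != rows[b]["Stadium"]]


print(len(sampled_candidates(ROWS, 6)))  # whole table: every qualifying pair
print(sampled_candidates(ROWS, 3))       # small sample: at most 3 pairs
```

The number of pairs examined grows with the square of the sample size rather than the table size, which is enough when only a few cells need to be changed.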
Evaluation of the Tools
Tools
- Llunatic: Geerts et al. VLDB14
- Holistic: Chu et al. ICDE13
- Greedy: Bohannon et al. SIGMOD05, Cong et al. VLDB07
- Sampling: Beskales et al. VLDB10
Tasks
- Constraint-based, with 5% errors and different repairability levels: High (~0.8), Med (~0.5), and Low (~0.25)
Scalability Results
Lessons Learned
• Automated tools are essential for robust and broad empirical evaluations
• Data repairing is not yet mature: there is no definitive automatic data-repairing algorithm yet
• Repairability matters
• We need to document our dirty data
• Algorithms are sensitive to error characteristics!
• Generating errors is hard
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

  • 1. P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro University of Toronto, Illinois Institute of Technology, Università della Basilicata, Arizona State University Sep 7th 2016
  • 2. P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. VLDB 2016 - Sep 7th Overview 2 ‣ Motivations and Goals ‣ Main Ideas ‣ Optimizations ‣ Experimental Results
  • 3. P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. VLDB 2016 - Sep 7th Motivation • Data quality is a crucial task in data management • Many automatic and semi-automatic data- cleaning algorithm have been proposed 3 constraint-based Beskales et al. VLDB10 Bohannon et al. SIGMOD05 Chu et al. ICDE13 Cong et al. VLDB07 Geerts et al. VLDB14 … statistics-based Berti-Equille et al. ICDE1 Dasu et al. VLDB12 Prokoshyna et al. VLDB1 Yakout et al. SIGMOD13 …
  • 4. P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. VLDB 2016 - Sep 7th Motivation • Data quality is a crucial task in data management • Many automatic and semi-automatic data- cleaning algorithm have been proposed 4 constraint-based Beskales et al. VLDB10 Bohannon et al. SIGMOD05 Chu et al. ICDE13 Cong et al. VLDB07 Geerts et al. VLDB14 … statistics-based Berti-Equille et al. ICDE1 Dasu et al. VLDB12 Prokoshyna et al. VLDB1 Yakout et al. SIGMOD13 … “What is the right tool for my data-cleaning task?”
  • 5. Challenges
      • No openly available tools or datasets for benchmarking data-cleaning algorithms
      • Approaches are usually evaluated using either
          ‣ manually generated errors: very expensive!
          ‣ errors automatically introduced into clean data: algorithms are highly sensitive to the characteristics of the errors!
      • Need for scalable and robust evaluation
  • 6. Contribution
      • BART: Benchmarking Algorithms for data Repairing and Translation
      • Open-source error-generation system with a high level of control over the errors
      • Input: a database that is clean with respect to a set of data-quality rules, plus a set of configuration parameters
      • Output: a dirty database (obtained through a set of cell changes) and an estimate of how hard it will be to restore the original values
  • 7. Overview
      ‣ Motivations and Goals
      ‣ Main Ideas
          ‣ Detectability
          ‣ Repairability
          ‣ Violation-Generation Queries
      ‣ Optimizations
      ‣ Experimental Results
  • 8. A Motivating Example
      Player
              Name      Season   Team       Stadium           Goals
          t1  Giovinco  2013-14  Juventus   Juventus Stadium  3
          t2  Giovinco  2014-15  Toronto    BMO Field         23
          t3  Pirlo     2014-15  Juventus   Juventus Stadium  5
          t4  Pirlo     2015-16  N.Y. City  Yankee St.        0
          t5  Vidal     2014-15  Juventus   Juventus Stadium  5
          t6  Vidal     2015-16  Bayern     Allianz Arena     3
      Functional dependencies
          1. Name, Season → Team
          2. Team → Stadium
      Quality rules are represented as denial constraints, a very expressive language that captures most data-quality rules used for data repairing: FDs, CFDs, cleaning EGDs, editing rules, fixing rules, ordering constraints
          dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n = n’, s = s’, t ≠ t’ )
          dc2: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t = t’, st ≠ st’ )
      Violation: an instance I violates ¬(φ(x)) if there is an assignment m such that I ⊨ φ(m(x))
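The violation definition on slide 8 can be made concrete with a small check. The following is a minimal sketch, not BART's implementation: the helper names `dc1`, `dc2` and `violations` are hypothetical, and the code simply tests every pair of Player tuples against the body of each denial constraint.

```python
from itertools import permutations

# Player instance from slide 8: (Name, Season, Team, Stadium, Goals)
PLAYER = {
    "t1": ("Giovinco", "2013-14", "Juventus", "Juventus Stadium", 3),
    "t2": ("Giovinco", "2014-15", "Toronto", "BMO Field", 23),
    "t3": ("Pirlo", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t4": ("Pirlo", "2015-16", "N.Y. City", "Yankee St.", 0),
    "t5": ("Vidal", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t6": ("Vidal", "2015-16", "Bayern", "Allianz Arena", 3),
}


def dc1(a, b):
    """Body of dc1: n = n', s = s', t != t'."""
    return a[0] == b[0] and a[1] == b[1] and a[2] != b[2]


def dc2(a, b):
    """Body of dc2: t = t', st != st'."""
    return a[2] == b[2] and a[3] != b[3]


def violations(instance, body):
    """Return the sorted, unordered pairs of tuple ids that satisfy the body.

    A denial constraint forbids its body, so any returned pair is a violation.
    """
    found = set()
    for i, j in permutations(instance, 2):
        if body(instance[i], instance[j]):
            found.add(tuple(sorted((i, j))))
    return sorted(found)


# The instance on the slide is clean with respect to both constraints.
clean_dc1 = violations(PLAYER, dc1)
clean_dc2 = violations(PLAYER, dc2)
```

Both calls return an empty list on the clean instance; changing a Stadium value of one Juventus tuple makes `violations(..., dc2)` report the pairs that now disagree.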
  • 9. A Motivating Example (continued)
      Same Player table and constraints (dc1, dc2) as on slide 8
      Cell change ch1: t5.Stadium := “Camp Nou”
      ✔ ch1 is a detectable change: dc2 is violated, since t1, t3 and t5 have the same team but different stadiums; we call {t1, t3, t5} the context equivalence class
      ✔ Easy to correct: the original value “Juventus Stadium” appears in t1 and t3
      Repairability: the probability of restoring t5.Stadium to its original value by picking, uniformly at random, a Stadium value from its context equivalence class
          Rep = 2 / 3 ≈ 0.67
  • 10. A Motivating Example (continued)
      Same Player table and constraints (dc1, dc2) as on slide 8
      Cell change ch2: t1.Season := “2014-15”
      ✔ ch2 is a detectable change: dc1 is violated, since t1 and t2 have the same name and season but different teams (as well as different stadiums and goals)
      ✘ Hard to correct: the original value “2013-14” disappears from the instance
      Repairability: 0 / 2 = 0
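The two repairability values on slides 9 and 10 follow directly from the definition: count how often the original value still occurs in the context equivalence class and divide by the class size. The sketch below is illustrative only; the function name `repairability` and the list-based representation of a class are assumptions, not part of the system.

```python
from fractions import Fraction


def repairability(class_values, original):
    """Probability of drawing `original` uniformly at random from the class."""
    if not class_values:
        return Fraction(0)
    return Fraction(class_values.count(original), len(class_values))


# ch1: Stadium values of {t1, t3, t5} after t5.Stadium := "Camp Nou"
ch1_values = ["Juventus Stadium", "Juventus Stadium", "Camp Nou"]
rep_ch1 = repairability(ch1_values, "Juventus Stadium")

# ch2: Season values of {t1, t2} after t1.Season := "2014-15";
# the original value "2013-14" no longer occurs anywhere in the class.
ch2_values = ["2014-15", "2014-15"]
rep_ch2 = repairability(ch2_values, "2013-14")
```

Using `Fraction` keeps the result exact (2/3 and 0) rather than a rounded decimal.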
  • 11. A Motivating Example (continued)
      Same Player table and constraints (dc1, dc2) as on slide 8
      Cell changes
          ch3: t5.Name := “Pirlo” ✘ is an undetectable change
          ch2: t1.Season := “2014-15” ✔
          ch4: t3.Name := “Pirlo” ✔ (a ✘ is then shown next to the changed value “2014-15”)
      We need to keep track of the context of each change
  • 12. Violation-Generation Queries
      • Each comparison of a dc suggests a different strategy for finding cells to modify in order to generate detectable errors
      • Starting from a dc, we generate a set of vio-gen queries
      Example instance
              Name      Season   Team
          t1  Giovinco  2013-14  Juventus
          t2  Giovinco  2013-14  Juventus
          t3  Pirlo     2013-14  N.Y. City
      dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n = n’, s = s’, t ≠ t’ )
      Vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n = n’, s = s’, t = t’
          ‣ Result of the query: t1, t2
          ‣ We obtain a detectable change by making t1.Team and t2.Team different: t1.Team := “Juve” ✔
      Vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n ≠ n’, s = s’, t ≠ t’
          ‣ Result of the query: t2, t3
          ‣ We obtain a detectable change by making t2.Name and t3.Name equal: t3.Name := “Giovinco” ✔
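The idea on slide 12 (one vio-gen query per comparison of the dc) can be sketched as follows. This is a simplified illustration under stated assumptions: a dc is encoded as a list of (attribute, operator) comparisons between two Player tuples, only `=` and `!=` are supported, and the names `vio_gen_queries` and `run` are hypothetical.

```python
import operator
from itertools import combinations

NEGATE = {"=": "!=", "!=": "="}
OPS = {"=": operator.eq, "!=": operator.ne}

# dc1 from the slide: n = n', s = s', t != t'
DC1 = [("Name", "="), ("Season", "="), ("Team", "!=")]

# Example instance from slide 12
PLAYER = {
    "t1": {"Name": "Giovinco", "Season": "2013-14", "Team": "Juventus"},
    "t2": {"Name": "Giovinco", "Season": "2013-14", "Team": "Juventus"},
    "t3": {"Name": "Pirlo", "Season": "2013-14", "Team": "N.Y. City"},
}


def vio_gen_queries(dc):
    """Build one query per comparison: negate that comparison, keep the rest."""
    queries = []
    for k, (attr, op) in enumerate(dc):
        body = list(dc)
        body[k] = (attr, NEGATE[op])
        # `target` is the attribute to modify so that the dc becomes violated
        queries.append({"target": attr, "body": body})
    return queries


def run(query, instance):
    """Return the pairs of tuple ids that satisfy every comparison in the body.

    Unordered pairs suffice here because = and != are symmetric.
    """
    return [
        (i, j)
        for i, j in combinations(sorted(instance), 2)
        if all(OPS[op](instance[i][a], instance[j][a]) for a, op in query["body"])
    ]


queries = vio_gen_queries(DC1)
same_team_pairs = run(queries[2], PLAYER)   # n = n', s = s', t = t'
diff_name_pairs = run(queries[0], PLAYER)   # n != n', s = s', t != t'
```

On this instance the first query shown on the slide returns the pair (t1, t2), and the second returns pairs that include (t2, t3), matching the two results on the slide.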
  • 13. Error-Generation Task
      EG-Task E = {S, Σ, I, CONF}
      • S: a relational schema
      • Σ: a set of denial constraints over S
      • I: an instance over schema S that is clean with respect to Σ
      • CONF: configuration parameters
          ‣ % of detectable errors, % of random errors
      • Theorem 1: generating the requested number of detectable errors is NP-complete (data complexity)
  • 14. Overview
      ‣ Motivations and Goals
      ‣ Main Ideas
      ‣ Optimizations
      ‣ Experimental Results
  • 15. Optimizations
      • Greedy PTIME algorithm
          ‣ two cell changes cannot share a context
          ‣ sound but not complete
          ‣ in practice, for low error ratios (~10-20%), the probability of success is very high
      • Main cost factor
          ‣ executing vio-gen queries on the DBMS
          ‣ optimizations for symmetric constraints and cross products
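The greedy rule on slide 15 (two cell changes cannot share a context) can be illustrated with a short selection loop. This is a rough sketch under assumptions, not the algorithm from the paper: a cell is a hypothetical (tuple id, attribute) pair, each candidate change comes with a set of context cells, and the contexts listed below are made up for illustration.

```python
def greedy_select(candidates, wanted):
    """Pick up to `wanted` changes whose cells and contexts never overlap.

    candidates: list of (cell, context) pairs, where context is a set of cells.
    The result may contain fewer than `wanted` changes: the strategy is
    sound (accepted changes do not interfere) but not complete.
    """
    used = set()
    chosen = []
    for cell, context in candidates:
        footprint = {cell} | set(context)
        if footprint & used:
            continue  # shares a cell with an accepted change: skip it
        chosen.append(cell)
        used |= footprint
        if len(chosen) == wanted:
            break
    return chosen


# Hypothetical candidates loosely based on the motivating example
candidates = [
    (("t5", "Stadium"), {("t1", "Team"), ("t1", "Stadium"),
                         ("t3", "Team"), ("t3", "Stadium"), ("t5", "Team")}),
    (("t1", "Season"), {("t1", "Name"), ("t1", "Team"),
                        ("t2", "Name"), ("t2", "Season"), ("t2", "Team")}),
    (("t4", "Stadium"), {("t4", "Team"), ("t6", "Team"), ("t6", "Stadium")}),
]
selected = greedy_select(candidates, wanted=2)
```

Here the second candidate is skipped because its context overlaps the first one on the cell (t1, Team), so the loop moves on to the third candidate.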
  • 16. Symmetric Constraints
      • Computing joins may be expensive!
      • We identify a class of DCs (which includes FDs and most CFDs) where group-by can be used to reduce the size of join inputs
      • Idea: find and execute isomorphic subqueries to avoid redundant work
      Example: Player(n, s, t, st), Player(n’, s’, t’, st’), n = n’, s = s’, t ≠ t’
          1. Formula graph: two Player nodes connected by the edges n = n’, s = s’ and t ≠ t’
          2. Reduced formula with adornments: Player(n=, s=, t≠, st)
          3. Group-by query:
             SELECT name, season, team
             FROM player
             WHERE (name, season) IN
                   (SELECT name, season
                    FROM player
                    GROUP BY name, season
                    HAVING count(DISTINCT team) > 1)
             ORDER BY name, season
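The group-by query from slide 16 can be tried out on a tiny in-memory database. This is only a sketch for experimentation: it uses SQLite through Python's `sqlite3` module (the slides do not say which DBMS is used), it assumes a SQLite version that supports row values in `IN` (3.15 or later), and it adds `team` to the `ORDER BY` clause so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE player (name TEXT, season TEXT, team TEXT, stadium TEXT)"
)
# Part of the Player table after the change t1.Season := '2014-15'
conn.executemany(
    "INSERT INTO player VALUES (?, ?, ?, ?)",
    [
        ("Giovinco", "2014-15", "Juventus", "Juventus Stadium"),
        ("Giovinco", "2014-15", "Toronto", "BMO Field"),
        ("Pirlo", "2014-15", "Juventus", "Juventus Stadium"),
        ("Pirlo", "2015-16", "N.Y. City", "Yankee St."),
    ],
)

# Groups of (name, season) with more than one distinct team violate
# Name, Season -> Team; no self-join of player with itself is needed.
GROUP_BY_QUERY = """
SELECT name, season, team
FROM player
WHERE (name, season) IN (
    SELECT name, season
    FROM player
    GROUP BY name, season
    HAVING count(DISTINCT team) > 1)
ORDER BY name, season, team
"""
violating_rows = conn.execute(GROUP_BY_QUERY).fetchall()
conn.close()
```

Only the two Giovinco rows for season 2014-15 are returned, since that is the only group with more than one distinct team.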
  • 17. Cross Products
      A common pattern
          dc4: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t = t’, st ≠ st’ )
          vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t ≠ t’, st ≠ st’
      The result of the vio-gen query is all possible pairs of players with different teams and different stadiums → quadratic cost
      However, we are typically interested in only a small set of cells
      Solution: we materialize a random sample of the tuples in Player in main memory and compute the cross product to identify the cells to change and their contexts
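The sampling idea on slide 17 can be sketched in a few lines: instead of pairing every tuple with every other tuple, draw a random sample and pair only the sampled tuples. The function name `sample_pairs`, the `sample_size` and `seed` parameters, and the dictionary layout are assumptions made for this illustration.

```python
import random
from itertools import combinations

PLAYER = {
    "t1": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t2": {"Team": "Toronto", "Stadium": "BMO Field"},
    "t3": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t4": {"Team": "N.Y. City", "Stadium": "Yankee St."},
    "t5": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t6": {"Team": "Bayern", "Stadium": "Allianz Arena"},
}


def sample_pairs(tuples, sample_size, seed=0):
    """Evaluate the vio-gen query t != t', st != st' on a random sample only.

    The cross product is computed over at most `sample_size` tuples, so the
    cost depends on the sample size rather than on the size of the table.
    """
    rng = random.Random(seed)
    ids = sorted(tuples)
    sample = sorted(rng.sample(ids, min(sample_size, len(ids))))
    return [
        (i, j)
        for i, j in combinations(sample, 2)
        if tuples[i]["Team"] != tuples[j]["Team"]
        and tuples[i]["Stadium"] != tuples[j]["Stadium"]
    ]


all_pairs = sample_pairs(PLAYER, sample_size=6)    # whole table: quadratic
few_pairs = sample_pairs(PLAYER, sample_size=3)    # small sample: at most 3 pairs
```

With the full table the query returns 12 of the 15 possible pairs (only the three Juventus-Juventus pairs are excluded), while a sample of three tuples can yield at most three pairs.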
  • 18. Overview
      ‣ Motivations and Goals
      ‣ Main Ideas
      ‣ Optimizations
      ‣ Experimental Results
  • 19. Evaluation of the Tools
      Tools
          ‣ Llunatic: Geerts et al. VLDB14
          ‣ Holistic: Chu et al. ICDE13
          ‣ Greedy: Bohannon et al. SIGMOD05, Cong et al. VLDB07
          ‣ Sampling: Beskales et al. VLDB10
      Tasks
          ‣ Constraint-based, with 5% errors and different repairability levels: High (~0.8), Med (~0.5), and Low (~0.25)
  • 20. Scalability Results
  • 21. Lessons Learned
      • Automated tools are essential for robust and broad empirical evaluations
      • Data repairing is not yet mature: there is no definitive automatic data-repairing algorithm yet
      • Repairability matters
      • We need to document our dirty data
      • Algorithms are sensitive to error characteristics!
      • Generating errors is hard