1. Data Cleaning
Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 2
April 7, 2015.
Pradeeban Kathiravelu (IST-ULisboa) Data Cleaning 1 / 22
2. Introduction
Introduction
Removal of inconsistencies and errors from original data sets.
Extraction, Transformation, and Loading (ETL) and data cleaning tools.
Modeled as graphs of data transformations.
Data integration problem.
Derive structured and clean textual records.
To be able to perform meaningful queries.
3. Introduction
Motivation
Explanation for the reasoning behind the cleaning results.
Interactive facilities to tune a data cleaning program.
A language, an execution model, and algorithms.
To express data cleaning specifications declaratively.
To perform the cleaning efficiently.
Data cleaning graph with data quality constraints.
Support for user involvement in data cleaning.
4. Introduction
Challenges in Existing Technology
Lack of separation [...].
Lack of data lineage and user interaction facilities.
Lack of logical matching operation.
User-provided criteria.
Non-exhaustive.
Lack of documentation of the matching algorithms.
Lack of user consultation.
5. Contributions
AJAX Data Cleaning Framework and Strategy
Separation of Framework:
Logical Level.
Graph of transformations specified in declarative language.
Expressible with SQL99.
Explicit user interaction and stepwise refinement
Using a data lineage mechanism.
Physical Level.
Specific optimization algorithms chosen to implement the transformations.
Notation:
To specify the properties of the approximate matching function.
To select an optimized implementation.
6. AJAX
Logical Level
Data Flow Graph: Main constituent of a data cleaning program.
Input and output flows of operators are logically modeled as database relations.
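The idea of a data flow graph whose edges carry relations can be sketched as follows. This is only an illustration under stated assumptions, not the AJAX implementation: the relation names (`Raw`, `Keyed`, `Titles`, `Distinct`), the `run` helper, and the sample records are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: output relation -> (input relations, transformation).
# Every relation is modeled as a plain list of tuples or values.
graph = {
    "Keyed": (["Raw"], lambda raw: [(i, r) for i, r in enumerate(raw, start=1)]),
    "Titles": (["Keyed"], lambda keyed: [(k, r.split(".")[0].strip().lower()) for k, r in keyed]),
    "Distinct": (["Titles"], lambda titles: sorted({t for _, t in titles})),
}

def run(graph, sources):
    """Evaluate every operator once its input relations are available."""
    relations = dict(sources)
    dependencies = {out: set(inputs) for out, (inputs, _) in graph.items()}
    for name in TopologicalSorter(dependencies).static_order():
        if name in relations:
            continue  # source relation supplied by the caller
        inputs, transform = graph[name]
        relations[name] = transform(*(relations[i] for i in inputs))
    return relations

# Made-up input records, used only for this sketch.
sources = {"Raw": ["Data Cleaning Basics. 2015", " data cleaning basics . 2014"]}
relations = run(graph, sources)
```

Because each operator only reads and writes named relations, an intermediate relation such as `Titles` stays available for inspection after the run.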
7. AJAX
Framework for the bibliographic references
8. Data Cleaning Strategy
1. Add a key to every input record
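A minimal sketch of this first step, assuming the input is a list of raw text records. The `add_key` helper and the sample strings are hypothetical and serve only to show how a surrogate key lets later steps refer back to the original record.

```python
from typing import Iterable

def add_key(records: Iterable[str], start: int = 1) -> list[tuple[int, str]]:
    """Pair each raw input record with a unique surrogate key."""
    return [(key, record) for key, record in enumerate(records, start=start)]

# Made-up dirty records describing the same publication in two ways.
raw_records = [
    "Silva A., Costa R. Cleaning bibliographic data. 2015",
    "A. Silva and R. Costa, Cleaning Bibliographic Data (2015)",
]
keyed = add_key(raw_records)
```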
9. Data Cleaning Strategy
2. Extract from each input record ..
10. Data Cleaning Strategy
3. Extract from each input record ..
11. Data Cleaning Strategy
4. Duplicate Elimination
13. Data Cleaning Strategy
Exception Handling
External functions written in a third-generation language (3GL) such as Java.
Exceptions autogenerated by the external functions.
Mark tuples that cannot be automatically handled by an operator.
Data lineage mechanism enables user inspection of exceptions.
Corrected data re-integrated into the data flow graph.
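The exception-handling flow above can be sketched roughly as follows. This is an assumption-laden illustration in Python rather than Java: `OperatorResult`, `apply_operator`, `extract_year`, and the sample rows are hypothetical names, and a plain `ValueError` stands in for whatever exception an external function would raise.

```python
from dataclasses import dataclass, field

@dataclass
class OperatorResult:
    output: list = field(default_factory=list)      # tuples handled automatically
    exceptions: list = field(default_factory=list)  # (input tuple, reason) kept for the user

def apply_operator(rows, fn):
    """Run fn on every tuple; mark failures instead of halting execution."""
    result = OperatorResult()
    for row in rows:
        try:
            result.output.append(fn(row))
        except ValueError as err:
            result.exceptions.append((row, str(err)))
    return result

def extract_year(row):
    """Hypothetical external function: pull a single 4-digit year from the text."""
    key, text = row
    years = [tok for tok in text.replace(".", " ").split() if tok.isdigit() and len(tok) == 4]
    if len(years) != 1:
        raise ValueError(f"expected exactly one 4-digit year, found {len(years)}")
    return (key, int(years[0]))

rows = [(1, "First paper. 2000"), (2, "Second paper"), (3, "Third paper 2011")]
first_pass = apply_operator(rows, extract_year)

# The user inspects first_pass.exceptions and supplies corrected tuples by key.
corrections = {2: (2, "Second paper 2001")}
repaired = [corrections[row[0]] for row, _ in first_pass.exceptions if row[0] in corrections]

# Corrected tuples re-enter the same operator and rejoin the output flow.
second_pass = apply_operator(repaired, extract_year)
clean = sorted(first_pass.output + second_pass.output)
```

Keeping the original input tuple next to each exception is what makes the lineage step possible: the user sees exactly which record failed and why.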
14. Specification Language
Logical Operators
Arbitrary clustering operations.
More general than the SQL group-by.
Merging operator with user defined aggregation functions.
Not expressible in SQL99.
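To illustrate why clustering can go beyond a group-by, here is a sketch that clusters records by the transitive closure of matched pairs and then merges each cluster with a user-defined aggregation. Transitive closure is only one possible clustering choice, picked here as an assumption; the function names and the sample titles are hypothetical.

```python
from collections import defaultdict

def cluster_by_transitive_closure(keys, matched_pairs):
    """Group keys so that any chain of matches ends up in one cluster (union-find)."""
    parent = {k: k for k in keys}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Use the smaller key as the cluster identifier.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = defaultdict(list)
    for k in keys:
        clusters[find(k)].append(k)
    return dict(clusters)

def merge(clusters, values, aggregate):
    """Collapse each cluster into one value using a user-defined aggregation."""
    return {cid: aggregate([values[k] for k in members]) for cid, members in clusters.items()}

# Made-up titles; pairs (1, 2) and (2, 3) are assumed to come from a matching step.
titles = {
    1: "Cleaning bibliographic data with lineage",
    2: "Cleaning bibliographic data w/ lineage",
    3: "Cleaning bib. data",
    4: "Query optimization basics",
}
matched_pairs = [(1, 2), (2, 3)]

clusters = cluster_by_transitive_closure(list(titles), matched_pairs)
merged = merge(clusters, titles, aggregate=lambda vs: max(vs, key=len))
```

Records 1 and 3 land in the same cluster although they were never matched directly and share no attribute value, which is the kind of grouping a plain group-by on a column does not give.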
15. Specification Language
Implementation of Matching
Optimization Problem.
Pre-select the elements of the Cartesian product.
Allows false matches.
No false dismissals.
Cheap to compute.
Approximate method to compare a limited number of records.
With good expected probability.
Distance-filtering optimization.
Approximate Methods.
Multi-pass neighborhood method (MPN).
Choose a key.
Compare the results within a fixed window.
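Both ideas can be sketched in a few lines. As an assumption, the filter below uses the string-length difference, a well-known lower bound on edit distance, so it is cheap and never dismisses a true match; the slides do not say which filter AJAX uses. The MPN sketch sorts on each key and compares only neighbors inside a fixed window. All function names, key functions, and sample names are hypothetical.

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def filtered_match(names, max_dist):
    """Exhaustive matching with a cheap pre-selection of candidate pairs."""
    matches = set()
    for a, b in combinations(sorted(names), 2):
        # |len(a) - len(b)| <= edit_distance(a, b), so skipping here loses no match.
        if abs(len(a) - len(b)) > max_dist:
            continue
        if edit_distance(a, b) <= max_dist:
            matches.add((a, b))
    return matches

def multi_pass_neighborhood(names, key_functions, window, max_dist):
    """One pass per key: sort, then compare each record with its window neighbors."""
    matches = set()
    for key_fn in key_functions:
        ordered = sorted(names, key=key_fn)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1 : i + window]:
                if edit_distance(a, b) <= max_dist:
                    matches.add(tuple(sorted((a, b))))
    return matches

names = ["galhardas", "galhards", "florescu", "floresc", "shasha", "simon"]
exhaustive = filtered_match(names, max_dist=1)
approximate = multi_pass_neighborhood(
    names, key_functions=[lambda s: s, lambda s: s[::-1]], window=2, max_dist=1
)
```

The window method can only return a subset of the exhaustive result: pairs that never become neighbors under any chosen key are missed, which is the price of comparing fewer records.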
16. User Involvement in Data Cleaning
Manual Data Repair (MDR) in Data Cleaning Graph (DCG)
17. User Involvement in Data Cleaning
Case Study
Goal:
Clean the Pub table and produce a table containing only the publications authored by at least one team member.
Duplicate entries for each publication organized in clusters.
Process:
Extract the author names.
Independently of the publication they are associated with.
Match author names against the names stored in the Team table.
Try to find synonyms.
Build the list of co-authors for each author.
Remove those publications that are not authored by any team member.
Detect and cluster approximate duplicate publication records.
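The first steps of this process can be sketched as below. The contents of `pub` and `team`, the `synonyms` table, and the helper names are all made up for illustration, and exact lookup after normalization stands in for the approximate matching a real program would use. Duplicate detection is left out of the sketch.

```python
from collections import defaultdict

def normalize(name: str) -> str:
    """Lower-case a name, drop dots, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())

# Hypothetical Team and Pub tables.
team = ["Ana Silva", "Rui Costa"]
pub = [
    (1, "A. Silva; J. Doe", "Cleaning bibliographic data"),
    (2, "M. Roe; J. Doe", "Query optimization basics"),
    (3, "Rui Costa", "Lineage in data flows"),
]
# Hypothetical synonym table mapping a name variant to a team member.
synonyms = {"a silva": "ana silva"}

# Extract the author names of each publication.
authors_of = {
    pub_id: [normalize(a) for a in authors.split(";")] for pub_id, authors, _ in pub
}
team_names = {normalize(n) for n in team}

def team_member(author: str):
    """Return the canonical team name for an author, or None if there is no match."""
    canonical = synonyms.get(author, author)
    return canonical if canonical in team_names else None

# Build the list of co-authors for each author.
co_authors = defaultdict(set)
for names in authors_of.values():
    for name in names:
        co_authors[name].update(n for n in names if n != name)

# Keep only publications with at least one team member among the authors.
team_pubs = [row for row in pub if any(team_member(a) for a in authors_of[row[0]])]
```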
18. User Involvement in Data Cleaning
Quality Constraints and Manual Data Repairs
19. Evaluation
Experiments
Executed with AJAX framework.
Multi-pass neighborhood method (MPN) vs. neighborhood join (NJ).
Experimental Results:
Addressing user feedback may significantly improve a data cleaning process.
MPN is faster but less accurate than NJ.
20. Evaluation
Related Work
High level languages for data transformations.
SQL99, WHIRL’s SQL, SchemaSQL.
Lack of support for clustering and merging, and less optimized.
Immediate halt of execution upon exception in SQL.
Highly optimized matching operation in AJAX.
As it is made a first-class operator.
Data Integration and Cleaning Frameworks.
Less scale-up.
Algorithms to support matching, clustering, and merging operations.
21. Conclusions
Conclusions
AJAX Framework
Design and implementation of a data flow graph.
Quality heuristics for best accuracy.
Effectively and efficiently generate clean data.
Design of performance heuristics.
Execution speed of transformations.
User involvement is crucial in data cleaning.
Thank you!
22. Conclusions
References
Galhardas, H., Lopes, A., & Santos, E. (2011). Support for user
involvement in data cleaning. In Data Warehousing and Knowledge
Discovery (pp. 136-151). Springer Berlin Heidelberg.
Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000, May).
AJAX: an extensible data cleaning tool. In ACM Sigmod Record (Vol.
29, No. 2, p. 590). ACM.
Galhardas, H., Florescu, D., Shasha, D., Simon, E., & Saita, C.
(2001). Declarative data cleaning: Language, model, and algorithms.
“Precisionrecall” by Walber - Own work. Licensed under CC BY-SA
4.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Precisionrecall.svg#/media/File:Precisionrecall.svg