Data Cleaning

Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 2
April 7, 2015.

Introduction
Removal of inconsistencies and errors from original data sets.
Extract, Transform, Load (ETL) and data cleaning tools.
Modeled as graphs of data transformations.
Data integration problem.
Derive structured and clean textual records.
To be able to perform meaningful queries.

Motivation
Explanation for the reasoning behind the cleaning results.
Interactive facilities to tune a data cleaning program.
A language, an execution model, and algorithms.
To express data cleaning specifications declaratively.
To perform the cleaning efficiently.
Data cleaning graph with data quality constraints.
Support for user involvement in data cleaning.

Challenges in Existing Technology
Lack of separation [...].
Lack of data lineage and user interaction facilities.
Lack of logical matching operation.
User-provided criteria.
Non-exhaustive.
Lack of documentation of the matching algorithms.
Lack of user consultation.

Contributions
AJAX Data Cleaning Framework and Strategy
Separation of Framework:
Logical Level.
Graph of transformations specified in declarative language.
Expressible with SQL99.
Explicit user interaction and stepwise refinement.
Using a data lineage mechanism.
Physical Level.
Specific optimization algorithms chosen to implement the transformations.
Notation:
To specify the properties of the approximate matching function.
To select an optimized implementation.

AJAX
Logical Level
Data Flow Graph: Main constituent of a data cleaning program.
Input and output flows of operators are logically modeled as database relations.
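
The idea of a data flow graph whose edges are relations can be sketched in a few lines of Python. This is only an illustration, not AJAX's actual language or engine: the `DataFlowGraph` class, the `add_key` and `normalize_title` operators, and the sample rows are all hypothetical, and a relation is assumed to be a plain list of dictionaries.

```python
from typing import Callable, Dict, List

Relation = List[dict]  # assumption: a relation is a list of tuples (dicts)


class DataFlowGraph:
    """Hypothetical container: operators are nodes, named relations are edges."""

    def __init__(self) -> None:
        self.steps: List[tuple] = []

    def add(self, output: str, inputs: List[str],
            op: Callable[..., Relation]) -> None:
        # Register an operator that reads the input relations and
        # produces one named output relation.
        self.steps.append((output, inputs, op))

    def run(self, sources: Dict[str, Relation]) -> Dict[str, Relation]:
        relations = dict(sources)
        # Steps are assumed to have been added in topological order.
        for output, inputs, op in self.steps:
            relations[output] = op(*(relations[name] for name in inputs))
        return relations


def add_key(rows: Relation) -> Relation:
    # Attach a surrogate key to every input record.
    return [{"key": i, **row} for i, row in enumerate(rows, start=1)]


def normalize_title(rows: Relation) -> Relation:
    # Lower-case the title and collapse runs of whitespace.
    return [{**row, "title": " ".join(row["title"].lower().split())}
            for row in rows]


graph = DataFlowGraph()
graph.add("keyed", ["dirty"], add_key)
graph.add("normalized", ["keyed"], normalize_title)

result = graph.run({"dirty": [{"title": "  Data   CLEANING "},
                              {"title": "AJAX"}]})
print(result["normalized"])
```

Keeping every intermediate relation by name, as `run` does here, is what makes it possible to inspect the output of any single transformation afterwards.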

Framework for the bibliographic references

Data Cleaning Strategy
1. Add a key to every input record

2. Extract from each input record ..

3. Extract from each input record ..

4. Duplicate Elimination

5. Aggregation

Exception Handling
External functions written in a third-generation language (3GL) such as Java.
Exceptions autogenerated by the external functions.
Mark tuples that cannot be automatically handled by an operator.
Data lineage mechanism enables user inspection of exceptions.
Corrected data re-integrated into the data flow graph.
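
A minimal sketch of this exception-handling loop follows, assuming relations are lists of dictionaries. The helper `apply_with_exceptions`, the external function `parse_year`, and the sample rows (including the user's correction) are made up for illustration; the slides do not show how AJAX stores exceptions.

```python
def apply_with_exceptions(rows, fn):
    """Apply fn to every tuple; tuples on which fn raises are set aside
    (with the reason) instead of halting the whole operator."""
    output, exceptions = [], []
    for row in rows:
        try:
            output.append(fn(row))
        except Exception as exc:
            # Keep the source tuple so a user can trace and repair it.
            exceptions.append({"source": row, "reason": str(exc)})
    return output, exceptions


def parse_year(row):
    # Hypothetical external function: int() raises ValueError on bad input.
    return {**row, "year": int(row["year"])}


rows = [{"key": 1, "year": "2001"}, {"key": 2, "year": "n/a"}]
clean, pending = apply_with_exceptions(rows, parse_year)

# A user inspects `pending` and supplies a correction (made up here),
# and the repaired tuples are fed back through the same operator.
repaired = [{**p["source"], "year": "2000"} for p in pending]
more_clean, still_pending = apply_with_exceptions(repaired, parse_year)
clean.extend(more_clean)
print(clean, still_pending)
```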

Specification Language
Logical Operators
Arbitrary clustering operations.
More general than the SQL group-by.
Merging operator with user-defined aggregation functions.
Not expressible in SQL99.
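
To make the two operators concrete, here is a small sketch. It assumes one particular clustering strategy (the transitive closure of matched pairs, computed with union-find) and one particular user-defined aggregation (keep the longest string); both are choices made for this example, not the only ones the operators allow, and the function names and sample records are hypothetical.

```python
from collections import defaultdict


def cluster_by_transitive_closure(keys, matched_pairs):
    """Group keys so that any two keys linked by a chain of matches
    end up in the same cluster (union-find)."""
    parent = {k: k for k in keys}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for a, b in matched_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    clusters = defaultdict(list)
    for k in keys:
        clusters[find(k)].append(k)
    return sorted(clusters.values())


def merge(cluster, records, aggregate):
    # Collapse one cluster into a single value with a user-supplied function.
    return aggregate([records[k] for k in cluster])


def longest(values):
    # Example user-defined aggregation: prefer the most complete spelling.
    return max(values, key=len)


records = {1: "J. Smith", 2: "John Smith", 3: "Jon Smith", 4: "A. Lopes"}
pairs = [(1, 2), (2, 3)]  # output of a matching step (made up)

clusters = cluster_by_transitive_closure(list(records), pairs)
merged = [merge(c, records, longest) for c in clusters]
print(clusters, merged)
```

A SQL group-by partitions rows by equality on a key, whereas here the grouping follows from pairwise matches and the aggregate is an arbitrary function, which is the extra generality the slide points to.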

Implementation of Matching
Optimization Problem.
Pre-select the elements of the Cartesian product.
Allows false matches.
No false dismissals.
Cheap to compute.
Approximate method to compare a limited number of records.
With good expected probability.
Distance-filtering optimization.
Approximate Methods.
Multi-pass neighborhood method (MPN).
Choose a key.
Compare the results within a fixed window.
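
Both ideas can be sketched briefly. The first function below uses a length filter as the cheap pre-selection: two strings whose lengths differ by more than the threshold cannot be within that edit distance, so the filter never dismisses a true match but may let false candidates through. The second is a simplified multi-pass sorted-neighborhood scan. The function names, sorting keys, window size, and sample names are assumptions for this example and are not taken from AJAX.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def filtered_match(names, max_dist):
    """Exact result over all pairs, skipping pairs the length filter rules out."""
    matches = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(len(a) - len(b)) > max_dist:
                continue  # cheap test: cannot be within max_dist
            if edit_distance(a, b) <= max_dist:
                matches.append((a, b))
    return matches


def multi_pass_neighborhood(names, key_fns, window, max_dist):
    """Approximate result: one pass per sorting key, comparing each record
    only with the next (window - 1) records in sorted order."""
    matches = set()
    for key_fn in key_fns:
        ordered = sorted(names, key=key_fn)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:i + window]:
                if edit_distance(a, b) <= max_dist:
                    matches.add(tuple(sorted((a, b))))
    return matches


names = ["smith john", "smyth john", "john smith", "lopes ana"]
exact = filtered_match(names, 1)
approx = multi_pass_neighborhood(
    names, [lambda s: s, lambda s: s[::-1]], window=2, max_dist=1)
print(exact, approx)
```

The windowed scan compares far fewer pairs than the filtered Cartesian product, but a true match is missed whenever no sorting key places the two records within the same window.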

User Involvement in Data Cleaning
Manual Data Repair (MDR) in Data Cleaning Graph (DCG)

Case Study
Goal:
Clean the Pub table and produce a table containing only the publications authored by at least one team member.
Duplicate entries for each publication organized in clusters.
Process:
Extract the author names.
Independently of the publication they are associated with.
Match author names against the names stored in the Team table.
Try to find synonyms.
Build the list of co-authors for each author.
Remove those publications that are not authored by any team member.
Detect and cluster approximate duplicate publication records.
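
The first steps of this process can be sketched as below. Everything here is an assumption made for illustration: the column names of `pub`, the semicolon-separated author field, the sample names, and in particular the synonym rule in `name_key` (same surname and same first initial), which stands in for whatever matching criteria the case study actually used. Duplicate detection and clustering are left out.

```python
def name_key(name):
    """Hypothetical synonym rule: same surname and same first initial."""
    parts = name.replace(".", " ").lower().split()
    return (parts[-1], parts[0][0])


def split_authors(row):
    return [a.strip() for a in row["authors"].split(";")]


def extract_authors(pub):
    # Author names, independently of the publication they appear in.
    return sorted({a for row in pub for a in split_authors(row)})


def match_to_team(authors, team):
    # Map each author spelling to the team member it is a synonym of.
    team_keys = {name_key(t): t for t in team}
    return {a: team_keys[name_key(a)]
            for a in authors if name_key(a) in team_keys}


def coauthors(pub):
    result = {}
    for row in pub:
        names = split_authors(row)
        for a in names:
            result.setdefault(a, set()).update(n for n in names if n != a)
    return result


def team_publications(pub, synonyms):
    # Keep only publications with at least one author matched to the team.
    return [row for row in pub
            if any(a in synonyms for a in split_authors(row))]


pub = [  # made-up rows
    {"id": 1, "authors": "A. Silva; B. Costa"},
    {"id": 2, "authors": "C. Dias"},
    {"id": 3, "authors": "Ana Silva"},
]
team = ["Ana Silva"]

synonyms = match_to_team(extract_authors(pub), team)
kept = team_publications(pub, synonyms)
print(synonyms, coauthors(pub), [row["id"] for row in kept])
```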

Quality Constraints and Manual Data Repairs

Evaluation
Experiments
Executed with AJAX framework.
Multi-pass neighborhood method (MPN) vs. Neighborhood join (NJ).
Experimental Results:
Addressing the user feedback may significantly improve a data cleaning process.
MPN is faster, but less accurate, than NJ.
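
The slides do not state how accuracy was measured. Assuming it is expressed as precision and recall over matched record pairs, a comparison of two matching methods could be scored as in the sketch below. The `precision_recall` helper and the sample pairs are made up and are not experimental results.

```python
def precision_recall(found, truth):
    """Precision and recall of a set of matched pairs against a reference set."""
    found, truth = set(found), set(truth)
    true_positives = len(found & truth)
    # Convention chosen here: an empty set scores 1.0 rather than dividing by zero.
    precision = true_positives / len(found) if found else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


truth = {(1, 2), (3, 4), (5, 6), (7, 8)}  # reference duplicate pairs (made up)
found = {(1, 2), (3, 4), (9, 10)}         # pairs reported by some method (made up)

print(precision_recall(found, truth))
```

Under this reading, a windowed method that skips comparisons would typically lose recall (missed pairs) rather than precision.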

Related Work
High level languages for data transformations.
SQL99, WHIRL’s SQL, SchemaSQL.
Lack of support for clustering and merging, and less optimized.
Immediate halt of execution upon exception in SQL.
Highly optimized matching operation in AJAX.
Because matching is treated as a first-class operator.
Data Integration and Cleaning Frameworks.
Less scalable.
Algorithms to support matching, clustering, and merging operations.

Conclusions
AJAX Framework
Design and Implementation of a data flow graph.
Quality heuristics for best accuracy.
Effectively and efficiently generate clean data.
Design of performance heuristics.
Execution speed of transformations.
User involvement is crucial in data cleaning.
Thank you!

