Data Cleaning
Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 2
April 7, 2015.
Pradeeban Kathiravelu (IST-ULisboa) Data Cleaning 1 / 22
Introduction
Removal of inconsistencies and errors from original data sets.
Extraction, Transformation, and Loading (ETL) and data cleaning tools.
Modeled as graphs of data transformations.
Data integration problem.
Derive structured and clean textual records.
To be able to perform meaningful queries.
Introduction
Motivation
Explanation for the reasoning behind the cleaning results.
Interactive facilities to tune a data cleaning program.
A language, an execution model, and algorithms.
To express data cleaning specifications declaratively.
To perform the cleaning efficiently.
Data cleaning graph with data quality constraints.
Support for user involvement in data cleaning.
Introduction
Challenges in Existing Technology
Lack of separation [...].
Lack of data lineage and user interaction facilities.
Lack of logical matching operation.
User-provided criteria.
Non-exhaustive.
Lack of documentation of the matching algorithms.
Lack of user consultation.
Contributions
AJAX Data Cleaning Framework and Strategy
Separation of Framework:
Logical Level.
Graph of transformations specified in declarative language.
Expressible with SQL99.
Explicit user interaction and stepwise refinement.
Using a data lineage mechanism.
Physical Level.
Specific optimization algorithms chosen to implement the
transformations.
Notation:
To specify the properties of the approximate matching function.
To select an optimized implementation.
AJAX
Logical Level
Data Flow Graph: Main constituent of a data cleaning program.
Input and output flows of operators are logically modeled as database
relations.
AJAX
Framework for the bibliographic references
Data Cleaning Strategy
1. Add a key to every input record
Data Cleaning Strategy
2. Extract from each input record ..
Data Cleaning Strategy
3. Extract from each input record ..
Data Cleaning Strategy
4. Duplicate Elimination
Data Cleaning Strategy
5. Aggregation
Data Cleaning Strategy
Exception Handling
External functions written in a third-generation language (3GL) such as Java.
Exceptions autogenerated by the external functions.
Mark tuples that cannot be automatically handled by an operator.
Data lineage mechanism enables user inspection of exceptions.
Corrected data re-integrated into the data flow graph.
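
The exception-handling idea above can be sketched in Python. This is a minimal illustration, not AJAX's own specification language: the operator `apply_with_exceptions`, the external function `extract_year`, and the sample records are hypothetical names introduced here.

```python
import re

def extract_year(record):
    """Hypothetical external function: pull a 4-digit year out of a citation string."""
    match = re.search(r"\b(19|20)\d{2}\b", record["text"])
    if match is None:
        raise ValueError("no year found")
    return {**record, "year": int(match.group(0))}

def apply_with_exceptions(records, func):
    """Apply func to every record; mark failing tuples instead of halting."""
    output, exceptions = [], []
    for record in records:
        try:
            output.append(func(record))
        except Exception as err:
            # Keep the key so the user can trace the tuple back to its input (lineage).
            exceptions.append({"key": record["key"], "record": record, "reason": str(err)})
    return output, exceptions

# Illustrative input records (made up for this sketch).
records = [
    {"key": 1, "text": "An example citation, Some Journal, 2000"},
    {"key": 2, "text": "Another example citation, in press"},
]
clean, pending = apply_with_exceptions(records, extract_year)

# A user inspects `pending`, repairs the tuple by hand, and feeds it back in.
repaired = [{**p["record"], "text": p["record"]["text"] + " 2001"} for p in pending]
reintegrated, still_pending = apply_with_exceptions(repaired, extract_year)
clean.extend(reintegrated)
```

The operator never stops on a bad tuple: record 2 lands in `pending` with its key and the reason, and rejoins `clean` once it has been corrected.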
Specification Language
Logical Operators
Arbitrary clustering operations.
More general than the SQL group-by.
Merging operator with user defined aggregation functions.
Not expressible in SQL99.
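
A rough Python sketch of the two operators above. It assumes, for illustration only, that clustering is done by transitive closure over already-matched pairs, which is just one possible clustering criterion. A SQL group-by can only group rows with equal attribute values, whereas here membership depends on chains of matches. The function names and sample names are hypothetical.

```python
from collections import defaultdict

def cluster_by_transitive_closure(keys, matched_pairs):
    """Two keys share a cluster if a chain of matches links them (union-find)."""
    parent = {k: k for k in keys}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for k in keys:
        clusters[find(k)].append(k)
    return sorted(clusters.values())

def merge(cluster, records, aggregate):
    """Collapse one cluster into a single value with a user-defined aggregation."""
    return aggregate([records[k] for k in cluster])

def longest(values):
    """Example user-defined aggregation: keep the most complete spelling."""
    return max(values, key=len)

# Illustrative records (made up for this sketch).
records = {
    1: "A. Silva",
    2: "Ana Maria Silva",
    3: "Silva, A.",
    4: "B. Costa",
}
matched_pairs = [(1, 2), (2, 3)]  # output of a matching step, assumed given

clusters = cluster_by_transitive_closure(list(records), matched_pairs)
merged = [merge(c, records, longest) for c in clusters]
```

Records 1 and 3 are never matched directly, yet they end up in the same cluster through record 2.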
Specification Language
Implementation of Matching
Optimization Problem.
Pre-select the elements of the Cartesian product.
Allows false matches.
No false dismissals.
Cheap to compute.
Approximate method to compare a limited number of records.
With good expected probability.
Distance-filtering optimization.
Approximate Methods.
Multi-pass neighborhood method (MPN).
Choose a key.
Compare the results within a fixed window.
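
A minimal Python sketch of the multi-pass neighborhood idea: sort the records by a key, compare each record only with its neighbors inside a fixed window, and repeat with a different key. The function names, the two sorting keys, the edit-distance threshold, and the sample strings are assumptions made for this illustration.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def multi_pass_neighborhood(records, key_funcs, window, is_match):
    """Return matching id pairs found by sliding a window over several sort orders."""
    matches = set()
    for key_func in key_funcs:  # one pass per sorting key
        ordered = sorted(records, key=lambda r: key_func(r[1]))
        for i, (id_a, rec_a) in enumerate(ordered):
            # Compare only with the next (window - 1) records in this order.
            for id_b, rec_b in ordered[i + 1 : i + window]:
                if is_match(rec_a, rec_b):
                    matches.add((min(id_a, id_b), max(id_a, id_b)))
    return matches

# Illustrative records (made up for this sketch).
records = [
    (1, "data cleaning"),
    (2, "data cleanin"),
    (3, "xata cleaning"),
    (4, "query optimization"),
    (5, "data mining"),
    (6, "data integration"),
]
forward = lambda s: s          # pass 1: sort on the string itself
backward = lambda s: s[::-1]   # pass 2: sort on the reversed string
close = lambda a, b: edit_distance(a, b) <= 1

one_pass = multi_pass_neighborhood(records, [forward], window=2, is_match=close)
two_pass = multi_pass_neighborhood(records, [forward, backward], window=2, is_match=close)
```

With a single pass, record 3 sorts far away from record 1 because its first character is wrong, so that pair is never compared. The second pass, sorted on the reversed string, brings them next to each other.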
User Involvement in Data Cleaning
Manual Data Repair (MDR) in Data Cleaning Graph
(DCG)
User Involvement in Data Cleaning
Case Study
Goal:
Clean the Pub table and produce a table containing only the
publications authored by at least one team member.
Duplicate entries for each publication organized in clusters.
Process:
Extract the author names.
Independently of the publications they are associated with.
Match author names against the names stored in the Team table.
Try to find synonyms.
Build the list of co-authors for each author.
Remove those publications that are not authored by any team
member.
Detect and cluster approximate duplicate publication records.
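
The first steps of this process can be sketched in Python. Everything in the sketch is an assumption made for illustration: the layout of the `pub` and `team` tables, the sample rows, and the `name_signature` rule used to treat two spellings as synonyms. The final duplicate-detection step is left out.

```python
from collections import defaultdict

def name_signature(name):
    """Hypothetical synonym rule: (first initial, last name), lower-cased."""
    parts = name.replace(".", " ").split()
    return (parts[0][0].lower(), parts[-1].lower())

# Illustrative tables (made up for this sketch).
pub = [
    {"id": 1, "title": "Paper A", "authors": ["A. Silva", "B. Costa"]},
    {"id": 2, "title": "Paper B", "authors": ["J. Smith"]},
    {"id": 3, "title": "Paper C", "authors": ["Ana Silva", "C. Dias"]},
]
team = ["Ana Silva"]

# 1. Extract author names independently of their publications.
author_names = sorted({a for p in pub for a in p["authors"]})

# 2. Match author names against Team; equal signatures count as synonyms.
team_signatures = {name_signature(t) for t in team}
team_authors = {a for a in author_names if name_signature(a) in team_signatures}

# 3. Build the list of co-authors for each author.
coauthors = defaultdict(set)
for p in pub:
    for a in p["authors"]:
        coauthors[a].update(x for x in p["authors"] if x != a)

# 4. Keep only publications with at least one team member among the authors.
team_pubs = [p for p in pub if any(a in team_authors for a in p["authors"])]
```

A signature rule this loose would also equate unrelated people who share an initial and a last name, which is the kind of case where a quality constraint and a manual repair by the user become useful.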
User Involvement in Data Cleaning
Quality Constraints and Manual Data Repairs
Evaluation
Experiments
Executed with AJAX framework.
Multi-pass neighborhood method
(MPN) vs. Neighborhood join (NJ).
Experimental Results:
Incorporating user feedback may
significantly improve a data cleaning
process.
MPN faster, but less accurate than
NJ.
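
For contrast with the windowed method, here is a Python sketch of an exhaustive pairwise join pruned by a cheap distance filter. It assumes NJ denotes this kind of join; the filter shown (string length difference) is one simple example of a test that allows false matches but no false dismissals. Function names and sample strings are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Classic Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def neighborhood_join(records, max_dist):
    """Compare every pair, skipping those a cheap filter proves too far apart."""
    matches, expensive_calls = [], 0
    for (id_a, a), (id_b, b) in combinations(records, 2):
        # Strings whose lengths differ by more than max_dist cannot be within
        # max_dist edits, so skipping them never loses a true match.
        if abs(len(a) - len(b)) > max_dist:
            continue
        expensive_calls += 1
        if edit_distance(a, b) <= max_dist:
            matches.append((id_a, id_b))
    return matches, expensive_calls

# Illustrative records (made up for this sketch).
records = [
    (1, "data cleaning"),
    (2, "data cleanin"),
    (3, "xata cleaning"),
    (4, "query optimization"),
    (5, "data mining"),
    (6, "data integration"),
]
matches, expensive_calls = neighborhood_join(records, max_dist=1)
total_pairs = len(list(combinations(records, 2)))
```

On this input the filter leaves 4 of the 15 pairs for the expensive edit-distance computation. Every pair is still considered, so no match within the threshold is missed, at the price of a cost that grows with the square of the input size.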
Evaluation
Related Work
High level languages for data transformations.
SQL99, WHIRL’s SQL, SchemaSQL.
Lack of support for clustering and merging, and less optimized.
Immediate halt of execution upon exception in SQL.
Highly optimized matching operation in AJAX.
As it is made a first-class operator.
Data Integration and Cleaning Frameworks.
Less scale-up.
Algorithms to support matching, clustering, and merging operations.
Conclusions
AJAX Framework
Design and Implementation of a data flow graph.
Quality heuristics for best accuracy.
Effectively and efficiently generate clean data.
Design of performance heuristics.
Execution speed of transformations.
User involvement is crucial in data cleaning.
Thank you!
Conclusions
References
Galhardas, H., Lopes, A., & Santos, E. (2011). Support for user
involvement in data cleaning. In Data Warehousing and Knowledge
Discovery (pp. 136-151). Springer Berlin Heidelberg.
Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000, May).
AJAX: an extensible data cleaning tool. In ACM Sigmod Record (Vol.
29, No. 2, p. 590). ACM.
Galhardas, H., Florescu, D., Shasha, D., Simon, E., & Saita, C.
(2001). Declarative data cleaning: Language, model, and algorithms.
“Precisionrecall” by Walber - Own work. Licensed under CC BY-SA
4.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Precisionrecall.svg#/media/File:Precisionrecall.svg