Data Cleaning
Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 2
April 7, 2015.
Pradeeban Kathiravelu (IST-ULisboa) Data Cleaning 1 / 22
Introduction
Removal of inconsistencies and errors from original data sets.
Extraction, Transformation, Loading (ETL) and data cleaning tools.
Modeled as graphs of data transformations.
Data integration problem.
Derive structured and clean textual records.
To be able to perform meaningful queries.
Motivation
Explanation for the reasoning behind the cleaning results.
Interactive facilities to tune a data cleaning program.
A language, an execution model, and algorithms.
To express data cleaning specifications declaratively.
To perform the cleaning efficiently.
Data cleaning graph with data quality constraints.
Support for user involvement in data cleaning.
Challenges in Existing Technology
Lack of separation [...].
Lack of data lineage and user interaction facilities.
Lack of logical matching operation.
User-provided criteria.
Non-exhaustive.
Lack of documentation of the matching algorithms.
Lack of user consultation.
Contributions
AJAX Data Cleaning Framework and Strategy
Separation of Framework:
Logical Level.
Graph of transformations specified in declarative language.
Expressible with SQL99.
Explicit user interaction and stepwise refinement
Using a data lineage mechanism.
Physical Level.
Specific optimization algorithms chosen to implement the transformations.
Notation:
To specify the properties of the approximate matching function.
To select an optimized implementation.
AJAX
Logical Level
Data Flow Graph: Main constituent of a data cleaning program.
Input and output flows of operators are logically modeled as database relations.
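The idea of a data flow graph over relations can be sketched in a few lines of Python. This is only an illustration, not the AJAX implementation: the `Operator` class, `run_graph`, and the two example operators are hypothetical names, and relations are modeled as plain lists of dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Assumption: a relation is a list of tuples, each tuple a dict.
Relation = List[dict]


@dataclass
class Operator:
    """A node of the graph: reads named input relations, writes one output."""
    name: str                    # name of the output relation
    inputs: Sequence[str]        # names of the input relations
    fn: Callable[..., Relation]  # the transformation itself


def run_graph(sources: Dict[str, Relation],
              operators: Sequence[Operator]) -> Dict[str, Relation]:
    """Evaluate operators in the given order (assumed to be topological)."""
    relations = dict(sources)
    for op in operators:
        relations[op.name] = op.fn(*(relations[i] for i in op.inputs))
    return relations


def add_key(rel: Relation) -> Relation:
    # Attach a sequential key to every input record.
    return [{"key": i, **row} for i, row in enumerate(rel, start=1)]


def normalize_title(rel: Relation) -> Relation:
    # Hypothetical cleaning step: trim and lowercase the title.
    return [{**row, "title": row["title"].strip().lower()} for row in rel]


graph = [
    Operator("keyed", ["raw"], add_key),
    Operator("normalized", ["keyed"], normalize_title),
]
result = run_graph({"raw": [{"title": " Data Cleaning "}, {"title": "AJAX"}]},
                   graph)
print(result["normalized"])
```

Every intermediate relation stays available in `result`, which is what makes later inspection of a single step possible.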
Framework for the bibliographic references
Data Cleaning Strategy
1. Add a key to every input record
2. Extract from each input record ..
3. Extract from each input record ..
4. Duplicate Elimination
5. Aggregation
Exception Handling
External functions written in a third-generation language (3GL) such as Java.
Exceptions autogenerated by the external functions.
Mark tuples that cannot be automatically handled by an operator.
Data lineage mechanism enables user inspection of exceptions.
Corrected data re-integrated into the data flow graph.
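A minimal sketch of this exception flow, assuming relations are lists of dictionaries. The names `apply_with_exceptions` and `parse_year` are hypothetical, and the "user correction" is simulated in code rather than done interactively.

```python
from typing import Callable, List, Tuple

Relation = List[dict]


def apply_with_exceptions(rel: Relation,
                          fn: Callable[[dict], dict]) -> Tuple[Relation, Relation]:
    """Apply an external function tuple by tuple.

    Tuples on which the function raises are not dropped and do not stop
    the run: they are marked and collected together with the input that
    caused the failure, so a user can inspect them later.
    """
    output, exceptions = [], []
    for row in rel:
        try:
            output.append(fn(row))
        except Exception as err:
            exceptions.append({"input": row, "error": str(err)})
    return output, exceptions


def parse_year(row: dict) -> dict:
    # Hypothetical external function: raises ValueError on malformed years.
    return {**row, "year": int(row["year"])}


rows = [{"key": 1, "year": "2001"}, {"key": 2, "year": "20O1"}]
clean, failed = apply_with_exceptions(rows, parse_year)

# Simulated manual repair: the user fixes the marked tuple.
corrected = [{**f["input"], "year": "2001"} for f in failed]

# The corrected tuples re-enter the same operator.
recleaned, still_failed = apply_with_exceptions(corrected, parse_year)
clean.extend(recleaned)
print(clean, still_failed)
```

Keeping the offending input next to the error message is a simple stand-in for data lineage: it tells the user which record to repair.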
Specification Language
Logical Operators
Arbitrary clustering operations.
More general than the SQL group-by.
Merging operator with user defined aggregation functions.
Not expressible in SQL99.
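To illustrate why clustering is more general than a group-by, the sketch below groups records by the transitive closure of a set of matching pairs instead of by equality on a column, then merges each cluster with a user-supplied aggregation function. The choice of transitive closure and all function names are assumptions made for the example; the slides do not prescribe a particular clustering algorithm.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def cluster_by_transitive_closure(keys: Iterable[int],
                                  pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group keys so that any two keys linked by a chain of pairs share a cluster."""
    keys = list(keys)
    parent = {k: k for k in keys}

    def find(k: int) -> int:
        # Union-find lookup with path halving.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    clusters = defaultdict(list)
    for k in keys:
        clusters[find(k)].append(k)
    return sorted(clusters.values())


def merge(cluster: List[int], records: Dict[int, str],
          aggregate: Callable[[List[str]], str]) -> str:
    """Collapse one cluster into a single value using a user-defined aggregation."""
    return aggregate([records[k] for k in cluster])


# Illustrative data: records 1 and 3 are never matched directly,
# yet they end up together through record 2.
records = {1: "J. Smith", 2: "John Smith", 3: "Jon Smith", 4: "A. Lopes"}
pairs = [(1, 2), (2, 3)]

clusters = cluster_by_transitive_closure(records, pairs)
merged = [merge(c, records, lambda names: max(names, key=len)) for c in clusters]
print(clusters, merged)
```

A group-by could only put records 1, 2, and 3 together if they shared an identical column value; here membership comes from the pairwise matching result.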
Implementation of Matching
Optimization Problem.
Pre-select the elements of the Cartesian product.
Allows false matches.
No false dismissals.
Cheap to compute.
Approximate method to compare a limited number of records.
With good expected probability.
Distance-filtering optimization.
Approximate Methods.
Multi-pass neighborhood method (MPN).
Choose a key.
Compare the results within a fixed window.
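The multi-pass neighborhood idea (choose a key, sort, compare only within a fixed window, repeat with another key) can be sketched as follows. This is an illustration under stated assumptions, not the AJAX implementation: the function name, the two sorting keys, and the token-based `same_tokens` matcher are all hypothetical.

```python
from typing import Callable, Dict, Sequence, Set, Tuple


def multi_pass_neighborhood(records: Dict[int, str],
                            key_fns: Sequence[Callable[[str], str]],
                            match: Callable[[str, str], bool],
                            window: int = 2) -> Set[Tuple[int, int]]:
    """Return matching id pairs found by sliding a window over sorted records.

    One pass per key function: records are sorted on that key and each
    record is compared only with the next (window - 1) records.
    """
    pairs = set()
    for key_fn in key_fns:
        ordered = sorted(records, key=lambda rid: key_fn(records[rid]))
        for i, left in enumerate(ordered):
            for right in ordered[i + 1 : i + window]:
                if match(records[left], records[right]):
                    pairs.add((min(left, right), max(left, right)))
    return pairs


def same_tokens(a: str, b: str) -> bool:
    # Hypothetical matcher: same words, in any order.
    return sorted(a.split()) == sorted(b.split())


records = {1: "john smith", 2: "jon smith", 3: "smith john", 4: "helena galhardas"}

by_text = lambda s: s                            # pass 1: raw string
by_tokens = lambda s: " ".join(sorted(s.split()))  # pass 2: sorted tokens

single_pass = multi_pass_neighborhood(records, [by_text], same_tokens)
two_passes = multi_pass_neighborhood(records, [by_text, by_tokens], same_tokens)
print(single_pass, two_passes)
```

With the raw-string key, records 1 and 3 are not neighbors in the sorted order, so the single pass misses them; the second key brings them next to each other. The method compares far fewer pairs than the full Cartesian product, at the price of possibly missing matches that never fall inside a window.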
User Involvement in Data Cleaning
Manual Data Repair (MDR) in Data Cleaning Graph (DCG)
Case Study
Goal:
Clean the Pub table and produce a table containing only the publications authored by at least one team member.
Duplicate entries for each publication organized in clusters.
Process:
Extract the author names.
Independently of the publication they are associated with.
Match author names against the names stored in the Team table.
Try to find synonyms.
Build the list of co-authors for each author.
Remove those publications that are not authored by any team member.
Detect and cluster approximate duplicate publication records.
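The first steps of this process can be sketched on toy data. Everything below is an assumption made for illustration: the column layout of `pub`, the `;` separator, the invented author names, and the hand-written `synonyms` table. The final step (approximate duplicate detection) is not covered here.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    # Lowercase, drop dots, collapse whitespace.
    return " ".join(name.lower().replace(".", " ").split())


# Toy stand-ins for the Pub and Team tables (invented data).
pub = [
    {"id": 1, "authors": "A. Silva; Rui Costa"},
    {"id": 2, "authors": "Maria Dias; J. Pinto"},
    {"id": 3, "authors": "Rui Costa; Maria Dias"},
]
team = ["Ana Silva"]

# Hypothetical synonym table mapping name variants to team names.
synonyms = {"a silva": "ana silva"}


def authors_of(row: dict) -> list:
    """Extract normalized author names and resolve known synonyms."""
    names = [normalize(n) for n in row["authors"].split(";")]
    return [synonyms.get(n, n) for n in names]


# Build the list of co-authors for each author.
coauthors = defaultdict(set)
for row in pub:
    names = authors_of(row)
    for n in names:
        coauthors[n].update(m for m in names if m != n)

# Keep only publications with at least one team member among the authors.
team_names = {normalize(n) for n in team}
kept = [row for row in pub if team_names & set(authors_of(row))]
print([row["id"] for row in kept])
```

In this toy run, publication 1 survives only because the synonym table resolves "A. Silva" to a team member, which is the kind of decision a user may need to confirm or correct.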
Quality Constraints and Manual Data Repairs
Evaluation
Experiments
Executed with AJAX framework.
Multi-pass neighborhood method (MPN) vs. Neighborhood join (NJ).
Experimental Results:
Addressing the user feedback may significantly improve a data cleaning process.
MPN is faster, but less accurate, than NJ.
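One common way to quantify the accuracy of a matching method is precision and recall over the set of matched pairs. The slides do not state which metric was used, so the sketch below is an assumption, and the pair sets are invented for illustration.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]


def precision_recall(found: Set[Pair], truth: Set[Pair]) -> Tuple[float, float]:
    """Precision: share of reported pairs that are correct.
    Recall: share of true pairs that were reported."""
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Invented example: four true duplicate pairs, three reported pairs.
truth = {(1, 2), (1, 3), (4, 5), (8, 9)}
found = {(1, 2), (4, 5), (6, 7)}

precision, recall = precision_recall(found, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A windowed method that skips comparisons tends to lose recall, since true pairs that never meet inside a window are not reported.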
Related Work
High level languages for data transformations.
SQL99, WHIRL’s SQL, SchemaSQL.
Lack of support for clustering and merging, and less optimized.
Immediate halt of execution upon exception in SQL.
Highly optimized matching operation in AJAX.
As it is made a first-class operator.
Data Integration and Cleaning Frameworks.
Less scale-up.
Algorithms to support matching, clustering, and merging operations.
Conclusions
AJAX Framework
Design and Implementation of a data flow graph.
Quality heuristics for best accuracy.
Effectively and efficiently generate clean data.
Design of performance heuristics.
Execution speed of transformations.
User involvement is crucial in data cleaning.
Thank you!
References
Galhardas, H., Lopes, A., & Santos, E. (2011). Support for user involvement in data cleaning. In Data Warehousing and Knowledge Discovery (pp. 136-151). Springer Berlin Heidelberg.
Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000, May). AJAX: an extensible data cleaning tool. In ACM Sigmod Record (Vol. 29, No. 2, p. 590). ACM.
Galhardas, H., Florescu, D., Shasha, D., Simon, E., & Saita, C. (2001). Declarative data cleaning: Language, model, and algorithms.
"Precisionrecall" by Walber - Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Precisionrecall.svg#/media/File:Precisionrecall.svg

 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 

Dernier (20)

5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 

Data Cleaning Framework Separates Logical and Physical Levels

  • 1. Data Cleaning. Pradeeban Kathiravelu, INESC-ID Lisboa, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal. Data Quality – Presentation 2, April 7, 2015. Pradeeban Kathiravelu (IST-ULisboa) Data Cleaning 1 / 22
  • 2. Introduction. Removal of inconsistencies and errors from original data sets. Extraction, Transformation, Loading (ETL) and data cleaning tools. Modeled as graphs of data transformations. Data integration problem. Derive structured and clean textual records. To be able to perform meaningful queries.
  • 3. Introduction: Motivation. Explanation for the reasoning behind the cleaning results. Interactive facilities to tune a data cleaning program. A language, an execution model, and algorithms. To express data cleaning specifications declaratively. To perform the cleaning efficiently. Data cleaning graph with data quality constraints. Support for user involvement in data cleaning.
  • 4. Introduction: Challenges in Existing Technology. Lack of separation [...]. Lack of data lineage and user interaction facilities. Lack of logical matching operation. User-provided criteria. Non-exhaustive. Lack of documentation of the matching algorithms. Lack of user consultation.
  • 5. Contributions: AJAX Data Cleaning Framework and Strategy. Separation of Framework: Logical Level. Graph of transformations specified in a declarative language. Expressible with SQL99. Explicit user interaction and stepwise refinement, using a data lineage mechanism. Physical Level. Specific optimization algorithms chosen to implement the transformations. Notation: To specify the properties of the approximate matching function. To select an optimized implementation.
  • 6. AJAX: Logical Level. Data Flow Graph: Main constituent of a data cleaning program. Input and output flows of operators logically modeled as database relations.
  • 7. AJAX: Framework for the bibliographic references.
  • 8. Data Cleaning Strategy: 1. Add a key to every input record.
  • 9. Data Cleaning Strategy: 2. Extract from each input record ..
  • 10. Data Cleaning Strategy: 3. Extract from each input record ..
  • 11. Data Cleaning Strategy: 4. Duplicate Elimination.
  • 12. Data Cleaning Strategy: 5. Aggregation.
  • 13. Data Cleaning Strategy: Exception Handling. External functions written in a 3GL such as Java. Exceptions autogenerated by the external functions. Mark tuples that cannot be automatically handled by an operator. Data lineage mechanism enables user inspection of exceptions. Corrected data re-integrated into the data flow graph.
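The exception-handling idea on slide 13 can be sketched as follows. This is only an illustration in Python, not AJAX's actual mechanism: the names `apply_operator`, `parse_year`, and `reintegrate`, the sample rows, and the year range are all assumptions made for the example. Each row carries a key (as in strategy step 1), so a tuple that the external function rejects can be traced back, shown to the user, and fed through the operator again once corrected.

```python
def parse_year(text):
    """Hypothetical external function: parse a publication year."""
    year = int(text.strip())  # raises ValueError on non-numeric input
    if not 1900 <= year <= 2100:  # assumed plausibility range
        raise ValueError(f"year out of range: {year}")
    return year


def apply_operator(rows, func):
    """Apply func to each (key, value) row.

    Rows that raise are not dropped: they are marked as exceptions,
    keyed so the user can inspect them later.
    """
    output, exceptions = [], []
    for key, value in rows:
        try:
            output.append((key, func(value)))
        except ValueError as err:
            exceptions.append((key, value, str(err)))
    return output, exceptions


def reintegrate(output, corrections, func):
    """Re-run func on user-corrected values and merge them back by key."""
    fixed, still_bad = apply_operator(sorted(corrections.items()), func)
    return sorted(output + fixed), still_bad


rows = [(1, "1999"), (2, "199x"), (3, " 2001 ")]
clean, pending = apply_operator(rows, parse_year)
# The user inspects `pending` and supplies a corrected value for key 2.
clean, pending_after = reintegrate(clean, {2: "1998"}, parse_year)
```

The design choice worth noting is that a failing tuple does not halt the whole run; it is set aside with enough lineage (its key and original value) to be repaired manually.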
  • 14. Specification Language: Logical Operators. Arbitrary clustering operations. More general than the SQL group-by. Merging operator with user defined aggregation functions. Not expressible in SQL99.
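To illustrate why clustering is more general than a group-by, here is a minimal Python sketch. It assumes one particular clustering policy (transitive closure over matched pairs, via union-find) and one user-defined aggregation (keep the longest string); both are choices made for the example, and the function names are hypothetical rather than AJAX operators. A group-by can only group rows that share an equal key value, whereas here membership is derived from pairwise match results.

```python
def cluster_by_transitive_closure(n, pairs):
    """Group record indices 0..n-1 into clusters, given matched index pairs."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def merge(records, clusters, aggregate):
    """Collapse each cluster into one record using a user-supplied function."""
    return [aggregate([records[i] for i in cluster]) for cluster in clusters]


records = ["J. Smith", "John Smith", "Jon Smith", "Mary Jones"]
matched_pairs = [(0, 1), (1, 2)]  # e.g. produced by a matching step
clusters = cluster_by_transitive_closure(len(records), matched_pairs)
merged = merge(records, clusters, aggregate=lambda group: max(group, key=len))
```

Records 0 and 2 are never matched directly, yet they end up in the same cluster through record 1, which is something an equality-based group-by cannot express.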
  • 15. Specification Language: Implementation of Matching. Optimization Problem. Pre-select the elements of the Cartesian product. Allows false matches. No false dismissals. Cheap to compute. Approximate method to compare a limited number of records. With good expected probability. Distance-filtering optimization. Approximate Methods. Multi-pass neighborhood method (MPN). Choose a key. Compare the results within a fixed window.
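The multi-pass neighborhood idea (choose a key, sort, compare only within a fixed window, repeat with other keys) can be sketched in Python as below. This is a simplified illustration, not the implementation evaluated in the slides: the similarity measure (`difflib.SequenceMatcher`), the threshold of 0.85, the window size, and the two sort keys are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def multi_pass_neighborhood(records, key_funcs, window=3, threshold=0.85):
    """Return the set of index pairs (i, j), i < j, judged to match.

    One pass per key function: sort by the key, then compare each record
    only with the next `window - 1` records in that sorted order.
    """
    matches = set()
    for key in key_funcs:
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            for j in order[pos + 1 : pos + window]:
                if similarity(records[i], records[j]) >= threshold:
                    matches.add((min(i, j), max(i, j)))
    return matches


records = ["John Smith", "Jon Smith", "Mary Jones", "Smith John", "Mary Jonas"]
key_funcs = [
    lambda r: r.lower(),              # pass 1: sort on the whole string
    lambda r: r.lower().split()[-1],  # pass 2: sort on the last token
]
pairs = multi_pass_neighborhood(records, key_funcs, window=3)
```

Because only neighbors within the window are compared, the number of comparisons grows roughly linearly with the number of records per pass instead of quadratically, at the cost of missing matches whose keys sort far apart in every pass.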
  • 16. User Involvement in Data Cleaning: Manual Data Repair (MDR) in Data Cleaning Graph (DCG).
  • 17. User Involvement in Data Cleaning: Case Study. Goal: Clean the Pub table and produce a table containing only the publications authored by at least one team member. Duplicate entries for each publication organized in clusters. Process: Extract the author names, independently of the publication they are associated to. Match author names against the names stored in the Team table. Try to find synonyms. Build the list of co-authors for each author. Remove those publications that are not authored by any team member. Detect and cluster approximate duplicate publication records.
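A rough Python sketch of the first steps of this case study (extract author names, match them against the team, keep only publications with a team author) is given below. The table layout, the comma-separated `authors` field, the sample rows, and the hand-written synonym table are all hypothetical; the slides do not specify them, and the real process uses approximate matching rather than the exact lookup assumed here.

```python
def normalize(name):
    """Lower-case, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def extract_authors(pub):
    """Split the (assumed) comma-separated author field into names."""
    return [a.strip() for a in pub["authors"].split(",") if a.strip()]


def team_publications(pubs, team, synonyms=None):
    """Keep publications with at least one author found in the team list."""
    synonyms = synonyms or {}
    team_names = {normalize(t) for t in team}
    kept = []
    for pub in pubs:
        names = [normalize(a) for a in extract_authors(pub)]
        # Map known variants to their canonical form before the lookup.
        canonical = [synonyms.get(n, n) for n in names]
        if any(n in team_names for n in canonical):
            kept.append(pub)
    return kept


pubs = [
    {"title": "Paper A", "authors": "A. Silva, B. Costa"},
    {"title": "Paper B", "authors": "C. Outsider"},
]
team = ["Ana Silva"]
synonyms = {"a silva": "ana silva"}  # hypothetical variant -> canonical name
kept = team_publications(pubs, team, synonyms)
```

Without the synonym entry, "A. Silva" would not be recognized as the team member "Ana Silva", which is the kind of case where user feedback on the matching step becomes useful.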
  • 18. User Involvement in Data Cleaning: Quality Constraints and Manual Data Repairs.
  • 19. Evaluation: Experiments. Executed with the AJAX framework. Multi-pass neighborhood method (MPN) vs. Neighborhood join (NJ). Experimental Results: Addressing the user feedback may significantly improve a data cleaning process. MPN is faster, but less accurate than NJ.
  • 20. Evaluation: Related Work. High level languages for data transformations. SQL99, WHIRL’s SQL, SchemaSQL. Lack of support for clustering and merging, and less optimized. Immediate halt of execution upon exception in SQL. Highly optimized matching operation in AJAX, as it is made a first-class operator. Data Integration and Cleaning Frameworks. Less scale-up. Algorithms to support matching, clustering, and merging operations.
  • 21. Conclusions. AJAX Framework: Design and implementation of a data flow graph. Quality heuristics for best accuracy. Effectively and efficiently generate clean data. Design of performance heuristics. Execution speed of transformations. User involvement is crucial in data cleaning. Thank you!
  • 22. Conclusions: References. Galhardas, H., Lopes, A., & Santos, E. (2011). Support for user involvement in data cleaning. In Data Warehousing and Knowledge Discovery (pp. 136-151). Springer Berlin Heidelberg. Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000, May). AJAX: an extensible data cleaning tool. In ACM Sigmod Record (Vol. 29, No. 2, p. 590). ACM. Galhardas, H., Florescu, D., Shasha, D., Simon, E., & Saita, C. (2001). Declarative data cleaning: Language, model, and algorithms. “Precisionrecall” by Walber - Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Precisionrecall.svg#/media/File:Precisionrecall.svg