SlideShare une entreprise Scribd logo
1  sur  13
Declarative Data Cleaning :
Language, Model, and
Algorithms
(Galhardas et. Al., Proc. VLDB, 2001)
Outline
• Problem
• Motivating example
• AJAX framework
• Logical layer
• Physical layer
• Matching
• Neighborhood join algorithm
• Multi-pass neighborhood algorithm
• Evaluation
• Related work
• Conclusion
• Discussion
Problem
• Data cleaning is a difficult problem.
• Current solutions (ETL and reengineering tools) :
• Not sophisticated enough to design data flow graphs
efficiently and effectively.
• Non-interactive.
• Hinder stepwise refinement process crucial to data
cleaning.
Motivating Example
AJAX framework
• Logical layer :
• Declarative language to express data cleaning using
logical operators (extension of SQL).
• Physical layer :
• Specify algorithm.
• Optimization.
• Exceptions as a mechanism to solicit user
interaction.
Logical layer
• 5 Operations :
• Mapping
• View
• Matching (important)
• Clustering
• Merging
• Duplicate elimination is handled by a sequence of
match, cluster, and merge.
Physical layer
• Implementations written in 3GL and registered
within the AJAX library.
• Matching algorithms :
• Naïve.
• Neighborhood Join optimization (NJ).
• Multi-pass Neighborhood optimization (MPN).
NJ optimization
• Apply distance filters on naïve algorithm.
• Devise function over input tuples so that cheaper
to compute similarity than actual similarity.
• E.g, use prefixes of strings
• Actual similarity only computed after passing filter.
• Damerau-Levenshtein for similarity
• Transitive closure.
NJ optimization
MPN optimization
• NJ does not allow false dismissals.
• MPN relaxes this requirement.
• Algorithm :
• Outer join on relations.
• Select key for each record.
• Sort all keys.
• Compare records that are close; within fixed window.
• Multiple passes allowed.
Evaluation
• MPN faster but less accurate than NJ.
• NJ algorithm is able to achieve a recall of 1 much
faster than the MPN method for more unstructured
domains :
• E.g., event name vs author name
Related work
• AJAX has more operations than related languages
:
• SQL doesn’t have merging and clustering operations
or exception support.
• WHIRL doesn’t have merging and clustering.
• AJAX and Potter’s Wheel both interactive.
• Potter’s Wheel automatic discrepancy detection
algorithm can be integrated into AJAX.
Conclusion
• AJAX framework :
• Logical and physical separation.
• Declarative language to specify transformations.
• Exceptions as a way to solicit interactions.

Contenu connexe

Tendances

C++ in object oriented programming
C++ in object oriented programmingC++ in object oriented programming
C++ in object oriented programmingSaket Khopkar
 
Whole Platform LWC11 Submission
Whole Platform LWC11 SubmissionWhole Platform LWC11 Submission
Whole Platform LWC11 SubmissionRiccardo Solmi
 
F# type providers
F# type providersF# type providers
F# type providersAntya Dev
 
Pipes & Filters Architectural Pattern
Pipes & Filters Architectural PatternPipes & Filters Architectural Pattern
Pipes & Filters Architectural PatternFredrik Kivi
 
Sql Server Machine Learning Services - Sql Saturday Prague 2018 #SqlSatPrague
Sql Server Machine Learning Services - Sql Saturday Prague 2018 #SqlSatPragueSql Server Machine Learning Services - Sql Saturday Prague 2018 #SqlSatPrague
Sql Server Machine Learning Services - Sql Saturday Prague 2018 #SqlSatPragueLuis Beltran
 
The Rise of Functional Programming
The Rise of Functional ProgrammingThe Rise of Functional Programming
The Rise of Functional ProgrammingTjerk Wolterink
 
Introduction to React by Ebowe Blessing
Introduction to React by Ebowe BlessingIntroduction to React by Ebowe Blessing
Introduction to React by Ebowe BlessingBlessing Ebowe
 
State management in react applications (Statecharts)
State management in react applications (Statecharts)State management in react applications (Statecharts)
State management in react applications (Statecharts)Tomáš Drenčák
 
Actors: Not Just for Movies Anymore
Actors: Not Just for Movies AnymoreActors: Not Just for Movies Anymore
Actors: Not Just for Movies AnymoreVictorOps
 
State Management in Angular/React
State Management in Angular/ReactState Management in Angular/React
State Management in Angular/ReactDEV Cafe
 
Putting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech AnalysticsPutting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech AnalysticsGareth Rogers
 

Tendances (13)

C++ in object oriented programming
C++ in object oriented programmingC++ in object oriented programming
C++ in object oriented programming
 
Cetpa dotnet taining
Cetpa dotnet tainingCetpa dotnet taining
Cetpa dotnet taining
 
Whole Platform LWC11 Submission
Whole Platform LWC11 SubmissionWhole Platform LWC11 Submission
Whole Platform LWC11 Submission
 
F# type providers
F# type providersF# type providers
F# type providers
 
Pipes & Filters Architectural Pattern
Pipes & Filters Architectural PatternPipes & Filters Architectural Pattern
Pipes & Filters Architectural Pattern
 
Sql Server Machine Learning Services - Sql Saturday Prague 2018 #SqlSatPrague
Sql Server Machine Learning Services - Sql Saturday Prague 2018 #SqlSatPragueSql Server Machine Learning Services - Sql Saturday Prague 2018 #SqlSatPrague
Sql Server Machine Learning Services - Sql Saturday Prague 2018 #SqlSatPrague
 
The Rise of Functional Programming
The Rise of Functional ProgrammingThe Rise of Functional Programming
The Rise of Functional Programming
 
Introduction to React by Ebowe Blessing
Introduction to React by Ebowe BlessingIntroduction to React by Ebowe Blessing
Introduction to React by Ebowe Blessing
 
State management in react applications (Statecharts)
State management in react applications (Statecharts)State management in react applications (Statecharts)
State management in react applications (Statecharts)
 
Actors: Not Just for Movies Anymore
Actors: Not Just for Movies AnymoreActors: Not Just for Movies Anymore
Actors: Not Just for Movies Anymore
 
State Management in Angular/React
State Management in Angular/ReactState Management in Angular/React
State Management in Angular/React
 
Putting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech AnalysticsPutting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech Analystics
 
Intro to ember.js
Intro to ember.jsIntro to ember.js
Intro to ember.js
 

Similaire à Ajax

PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptxPPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptxneju3
 
Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL DatabasesEmanuel Calvo
 
High-level languages for Big Data Analytics (Presentation)
High-level languages for Big Data Analytics (Presentation)High-level languages for Big Data Analytics (Presentation)
High-level languages for Big Data Analytics (Presentation)Jose Luis Lopez Pino
 
Distributed Model Validation with Epsilon
Distributed Model Validation with EpsilonDistributed Model Validation with Epsilon
Distributed Model Validation with EpsilonSina Madani
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Spark Summit
 
Lecture 5.pptx
Lecture 5.pptxLecture 5.pptx
Lecture 5.pptxShafii8
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
Основы функционального JS
Основы функционального JSОсновы функционального JS
Основы функционального JSАнна Луць
 
AP computer barron book ppt AP CS A.pptx
AP computer barron book ppt AP CS A.pptxAP computer barron book ppt AP CS A.pptx
AP computer barron book ppt AP CS A.pptxKoutheeshSellamuthu
 
Agent Based Modeling and Simulation - Overview and Tools
Agent Based Modeling and Simulation - Overview and ToolsAgent Based Modeling and Simulation - Overview and Tools
Agent Based Modeling and Simulation - Overview and ToolsStathis Grigoropoulos
 
SFDC Introduction to Apex
SFDC Introduction to ApexSFDC Introduction to Apex
SFDC Introduction to ApexSujit Kumar
 
Query Decomposition and data localization
Query Decomposition and data localization Query Decomposition and data localization
Query Decomposition and data localization Hafiz faiz
 
JSR 335 / java 8 - update reference
JSR 335 / java 8 - update referenceJSR 335 / java 8 - update reference
JSR 335 / java 8 - update referencesandeepji_choudhary
 
Parallel Computing in .NET
Parallel Computing in .NETParallel Computing in .NET
Parallel Computing in .NETmeghantaylor
 
Programming language paradigms
Programming language paradigmsProgramming language paradigms
Programming language paradigmsAshok Raj
 
Getting ready to java 8
Getting ready to java 8Getting ready to java 8
Getting ready to java 8Strannik_2013
 

Similaire à Ajax (20)

PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptxPPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
 
Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL Databases
 
Java9to19Final.pptx
Java9to19Final.pptxJava9to19Final.pptx
Java9to19Final.pptx
 
High-level languages for Big Data Analytics (Presentation)
High-level languages for Big Data Analytics (Presentation)High-level languages for Big Data Analytics (Presentation)
High-level languages for Big Data Analytics (Presentation)
 
Distributed Model Validation with Epsilon
Distributed Model Validation with EpsilonDistributed Model Validation with Epsilon
Distributed Model Validation with Epsilon
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
Lecture 5.pptx
Lecture 5.pptxLecture 5.pptx
Lecture 5.pptx
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Основы функционального JS
Основы функционального JSОсновы функционального JS
Основы функционального JS
 
AP computer barron book ppt AP CS A.pptx
AP computer barron book ppt AP CS A.pptxAP computer barron book ppt AP CS A.pptx
AP computer barron book ppt AP CS A.pptx
 
Agent Based Modeling and Simulation - Overview and Tools
Agent Based Modeling and Simulation - Overview and ToolsAgent Based Modeling and Simulation - Overview and Tools
Agent Based Modeling and Simulation - Overview and Tools
 
CPP19 - Revision
CPP19 - RevisionCPP19 - Revision
CPP19 - Revision
 
SFDC Introduction to Apex
SFDC Introduction to ApexSFDC Introduction to Apex
SFDC Introduction to Apex
 
Query Decomposition and data localization
Query Decomposition and data localization Query Decomposition and data localization
Query Decomposition and data localization
 
JSR 335 / java 8 - update reference
JSR 335 / java 8 - update referenceJSR 335 / java 8 - update reference
JSR 335 / java 8 - update reference
 
Parallel Computing in .NET
Parallel Computing in .NETParallel Computing in .NET
Parallel Computing in .NET
 
Programming language paradigms
Programming language paradigmsProgramming language paradigms
Programming language paradigms
 
CDN algos
CDN algosCDN algos
CDN algos
 
Getting ready to java 8
Getting ready to java 8Getting ready to java 8
Getting ready to java 8
 

Plus de dhruvgairola

A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...
A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...
A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...dhruvgairola
 
Differences bet. versions of UML diagrams.
Differences bet. versions of UML diagrams.Differences bet. versions of UML diagrams.
Differences bet. versions of UML diagrams.dhruvgairola
 
A Theory of the Learnable; PAC Learning
A Theory of the Learnable; PAC LearningA Theory of the Learnable; PAC Learning
A Theory of the Learnable; PAC Learningdhruvgairola
 
Discussion : Info sharing across private DBs
Discussion : Info sharing across private DBsDiscussion : Info sharing across private DBs
Discussion : Info sharing across private DBsdhruvgairola
 

Plus de dhruvgairola (8)

A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...
A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...
A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...
 
Differences bet. versions of UML diagrams.
Differences bet. versions of UML diagrams.Differences bet. versions of UML diagrams.
Differences bet. versions of UML diagrams.
 
Beginning jQuery
Beginning jQueryBeginning jQuery
Beginning jQuery
 
Beginning CSS.
Beginning CSS.Beginning CSS.
Beginning CSS.
 
A Theory of the Learnable; PAC Learning
A Theory of the Learnable; PAC LearningA Theory of the Learnable; PAC Learning
A Theory of the Learnable; PAC Learning
 
Discussion : Info sharing across private DBs
Discussion : Info sharing across private DBsDiscussion : Info sharing across private DBs
Discussion : Info sharing across private DBs
 
PRIMES is in P
PRIMES is in PPRIMES is in P
PRIMES is in P
 
Potters wheel
Potters wheelPotters wheel
Potters wheel
 

Dernier

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Dernier (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Ajax

  • 1. Declarative Data Cleaning : Language, Model, and Algorithms (Galhardas et. Al., Proc. VLDB, 2001)
  • 2. Outline • Problem • Motivating example • AJAX framework • Logical layer • Physical layer • Matching • Neighborhood join algorithm • Multi-pass neighborhood algorithm • Evaluation • Related work • Conclusion • Discussion
  • 3. Problem • Data cleaning is a difficult problem. • Current solutions (ETL and reengineering tools) : • Not sophisticated enough to design data flow graphs efficiently and effectively. • Non-interactive. • Hinder stepwise refinement process crucial to data cleaning.
  • 5. AJAX framework • Logical layer : • Declarative language to express data cleaning using logical operators (extension of SQL). • Physical layer : • Specify algorithm. • Optimization. • Exceptions as a mechanism to solicit user interaction.
  • 6. Logical layer • 5 Operations : • Mapping • View • Matching (important) • Clustering • Merging • Duplicate elimination is handled by a sequence of match, cluster, and merge.
  • 7. Physical layer • Implementations written in 3GL and registered within the AJAX library. • Matching algorithms : • Naïve. • Neighborhood Join optimization (NJ). • Multi-pass Neighborhood optimization (MPN).
  • 8. NJ optimization • Apply distance filters on naïve algorithm. • Devise function over input tuples so that cheaper to compute similarity than actual similarity. • E.g, use prefixes of strings • Actual similarity only computed after passing filter. • Damerau-Levenshtein for similarity • Transitive closure.
  • 10. MPN optimization • NJ does not allow false dismissals. • MPN relaxes this requirement. • Algorithm : • Outer join on relations. • Select key for each record. • Sort all keys. • Compare records that are close; within fixed window. • Multiple passes allowed.
  • 11. Evaluation • MPN faster but less accurate than NJ. • NJ algorithm is able to achieve a recall of 1 much faster than the MPN method for more unstructured domains : • E.g., event name vs author name
  • 12. Related work • AJAX has more operations than related languages : • SQL doesn’t have merging and clustering operations or exception support. • WHIRL doesn’t have merging and clustering. • AJAX and Potter’s Wheel both interactive. • Potter’s Wheel automatic discrepancy detection algorithm can be integrated into AJAX.
  • 13. Conclusion • AJAX framework : • Logical and physical separation. • Declarative language to specify transformations. • Exceptions as a way to solicit interactions.