International Journal of Research in Computer Science
 eISSN 2249-8265 Volume 2 Issue 2 (2012) pp. 29-32
 © White Globe Publications
 www.ijorcs.org


            DUPLICATE DETECTION OF RECORDS IN
                QUERIES USING CLUSTERING
                                  M. Anitha, A. Srinivas, T. P. Shekhar, D. Sagar
                          Sree Chaitanya College of Engineering (SCCE), Karimnagar, India

   Abstract: The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in a data warehouse. Many times, the same logical real-world entity has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research efforts have focused on the issue of duplicate elimination in data warehouses. This entails matching inexact duplicate records, which are records that refer to the same real-world entity while not being syntactically equivalent. This paper focuses on efficient detection and elimination of duplicate data. The main objective of this research work is to detect exact and inexact duplicates by using duplicate detection and elimination rules, and thereby to improve the efficiency of the data cleaning process.

Keywords: Data Cleaning, Duplicate Data, Data Warehouse, Data Mining

                 I. INTRODUCTION

   A data warehouse contains large amounts of data that are mined to support the decision-making process. Data miners do not simply analyze data; they first have to bring the data into a format and state that allows this analysis. It has been estimated that the actual mining of data makes up only 10% of the time required for the complete knowledge discovery process [3]. According to Jiawei, the preceding, time-consuming preprocessing step is of essential importance for data mining. It is more than a tedious necessity: the techniques used in the preprocessing step can deeply influence the results of the following step, the actual application of a data mining algorithm [6]. Hans-Peter stated that the link between data preprocessing and data mining will gain steadily more interest over the coming years, and listed preprocessing among the major future trends and issues in data mining [7].

   In a data warehouse, data is integrated or collected from multiple sources. While integrating data from multiple sources, the amount of data increases and data is duplicated. A data warehouse may hold terabytes of data for the mining process. Preprocessing is the initial and often crucial step of the data mining process. To increase the accuracy of the mining result, data preprocessing has to be performed, because data quality problems often account for 80% of the mining effort. Data cleaning is therefore very important in a data warehouse before the mining process; otherwise the result of the data mining process will not be accurate, owing to data duplication and poor data quality. Many methods already exist for duplicate data detection and elimination, but the data cleaning process is slow and the time taken to clean large amounts of data is high. There is thus a need to reduce the time and increase the speed of the data cleaning process, as well as to improve the quality of the data.

   There are two issues to be considered for duplicate detection: accuracy and speed. The measure of accuracy in duplicate detection depends on the number of false negatives (duplicates that were not classified as such) and false positives (non-duplicates that were classified as duplicates) [12].

   In this research work, a duplicate detection and elimination rule is developed to handle any duplicate data in a data warehouse. During duplicate elimination it is very important to identify which duplicate to retain and which duplicate to remove. The main objectives of this research work are to reduce the number of false positives, to speed up the data cleaning process, to reduce its complexity, and to improve the quality of the data. A high-quality, scalable duplicate elimination algorithm is used and evaluated on real datasets from an operational data warehouse to achieve these objectives.

          II. RECORD MATCHING OVER QUERY RESULTS

A. First Problem Definition

   Our focus is to find the matching status among the records and to retain the non-duplicate records. The goal is then to cluster the matched records using fuzzy ontological document clustering.


B. Element Identification

   Supervised learning methods use only some of the fields in a record for identification. This is the reason why query results obtained using supervised learning can still contain duplicate records. Unsupervised Duplicate Elimination (UDE) does not suffer from these user-reference problems. A preprocessing step called exact matching is used for matching relevant records. It requires the data format of the records to be the same, so the exact matching method is applicable only to records from the same data source. Element identification thus merges the records that are exactly the same in the relevant matching fields.
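   As an illustration of this exact-matching step, the following minimal Python sketch (not from the paper; the field names such as name and phone are hypothetical) merges records that agree exactly on the chosen matching fields, keeping one representative per group.

   # Minimal sketch of the exact-matching preprocessing step (illustrative only).
   from collections import OrderedDict

   def exact_match_merge(records, match_fields):
       """Merge records that are identical on the selected matching fields,
       keeping the first record seen as the representative of each group."""
       merged = OrderedDict()
       for rec in records:
           key = tuple(str(rec.get(f, "")).strip().lower() for f in match_fields)
           merged.setdefault(key, rec)   # later exact duplicates are dropped
       return list(merged.values())

   if __name__ == "__main__":
       rows = [
           {"name": "John Smith", "phone": "123-4567", "city": "Karimnagar"},
           {"name": "john smith", "phone": "123-4567", "city": "Karimnagar"},
           {"name": "J. Smith",   "phone": "123-9999", "city": "Warangal"},
       ]
       print(exact_match_merge(rows, ["name", "phone"]))   # two records remain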
C. Ontology Matching

   The term ontology is derived from the Greek words 'onto', which means being, and 'logia', which means written or spoken discourse. In short, an ontology is a specification of a conceptualization.

   An ontology basically refers to a set of concepts, such as things, events and relations, that are specified in some way in order to create an agreed-upon vocabulary for exchanging information. Ontologies can be represented in textual or graphical formats; graphical formats are usually preferred for easy understandability. Ontologies with a large knowledge base [5] can be represented in different forms such as hierarchical trees, expandable hierarchical trees, hyperbolic trees, etc. In the expandable hierarchical tree format, the user has the freedom to expand only the node of interest and leave the rest in a collapsed state [2]. If necessary, the entire tree can be expanded to get the complete knowledge base. This type of format can be used only when there are a large number of hierarchical relationships. Ontology matching is used for finding the matching status of record pairs by matching the record attributes.
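   As a toy illustration of attribute-level ontology matching (not taken from the paper; the concept names and synonym sets below are hypothetical), two attribute labels can be considered matching when they resolve to the same ontology concept:

   # Toy ontology mapping surface forms of an attribute to a shared concept.
   ONTOLOGY = {
       "city": {"city", "town", "municipality"},
       "telephone": {"telephone", "phone", "tel", "mobile"},
       "designation": {"designation", "title", "role"},
   }

   def concept_of(value):
       """Return the ontology concept a value belongs to, or None."""
       v = value.strip().lower()
       for concept, synonyms in ONTOLOGY.items():
           if v in synonyms:
               return concept
       return None

   def attributes_match(value_a, value_b):
       """Two attribute labels match if they resolve to the same concept."""
       ca, cb = concept_of(value_a), concept_of(value_b)
       return ca is not None and ca == cb

   print(attributes_match("Phone", "telephone"))   # True
   print(attributes_match("Town", "designation"))  # False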
               III. SYSTEM METHODOLOGY

A. Unsupervised Duplicate Elimination

   UDE employs a similarity function to find field similarity, and a similarity vector is used to represent a pair of records.

Input: potential duplicate vector set P, non-duplicate vector set N
Output: duplicate vector set D
C1: a classification algorithm with adjustable parameters W that identifies duplicate vector pairs from P
C2: a supervised classifier (SVM)

Algorithm:
1.  D = φ
2.  Set the parameters W of C1 according to N
3.  Use C1 to get a set of duplicate vector pairs d1 and f from P and N
4.  P = P - d1
5.  while |d1| ≠ 0
6.     N' = N - f
7.     D = D + d1 + f
8.     Train C2 using D and N'
9.     Classify P using C2 and get a set of newly identified duplicate vector pairs d2
10.    P = P - d2
11.    D = D + d2
12.    Adjust the parameters W of C1 according to N' and D
13.    Use C1 to get a new set of duplicate vector pairs d1 and f from P and N
14.    N = N'
15. Return D

                        Figure 1: UDE Algorithm
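   To make the control flow of Figure 1 concrete, here is a minimal Python sketch of the iteration. The classifier objects c1 (adjustable-parameter classifier) and c2 (an SVM-style supervised classifier), together with their set_parameters/detect/fit/classify methods, are assumed interfaces for illustration, not the authors' implementation.

   # Sketch of the UDE iteration in Figure 1 (illustrative only).
   # P: set of potential duplicate vector pairs, N: set of non-duplicate pairs.
   def ude(P, N, c1, c2):
       D = set()                                  # 1. D = φ
       c1.set_parameters(N)                       # 2. set W of C1 according to N
       d1, f = c1.detect(P), c1.detect(N)         # 3. duplicates found in P and N
       P = P - d1                                 # 4.
       while d1:                                  # 5. loop while C1 still finds pairs
           N_prime = N - f                        # 6.
           D = D | d1 | f                         # 7.
           c2.fit(duplicates=D, nonduplicates=N_prime)   # 8. train C2
           d2 = c2.classify(P)                    # 9. newly identified duplicates
           P = P - d2                             # 10.
           D = D | d2                             # 11.
           c1.adjust_parameters(N_prime, D)       # 12.
           d1, f = c1.detect(P), c1.detect(N)     # 13. new duplicate pairs
           N = N_prime                            # 14.
       return D                                   # 15.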
B. Certainty Factor

   In the existing method of duplicate data elimination [10], the certainty factor (CF) is calculated by classifying attributes by their distinct values, missing values, and the type and size of the attribute. These attributes are identified manually, based on the type of the data and the importance of the data in the data warehouse. For example, if the name, telephone and fax fields are used for matching, then a high certainty factor value is assigned. In this research work, the best attributes are identified in the early stages of data cleaning. The attributes are selected based on specific criteria and the quality of the data, and the attribute threshold value is calculated based on the measurement type and size of the data. These selected attributes are well suited for the data cleaning process. The certainty factor is assigned based on the attribute types, as shown in the following table.

              Table 1: Classification of attribute types

S. No | Key Attribute | Distinct values | Missing values | Size of data | Types of data
  1   |       √       |                 |                |      √       |      √
  2   |       √       |                 |                |      √       |
  3   |       √       |                 |                |              |      √
  4   |               |        √        |       √        |      √       |      √
  5   |               |        √        |       √        |      √       |
  6   |               |        √        |       √        |              |      √
  7   |               |        √        |                |      √       |      √
  8   |               |        √        |       √        |              |
  9   |               |        √        |                |              |      √
 10   |               |        √        |                |      √       |
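   To illustrate the kind of attribute profiling that feeds Table 1, a hedged sketch using pandas is shown below: it computes distinct-value and missing-value ratios plus the type and average size of each column, which is the information the certainty factor assignment draws on. The column names and the 0.9/0.1 cut-offs are assumptions for demonstration, not values from the paper.

   # Sketch of attribute profiling for certainty-factor assignment (illustrative).
   import pandas as pd

   def profile_attributes(df):
       """Summarize each column by distinct ratio, missing ratio, type and size."""
       rows = []
       for col in df.columns:
           series = df[col]
           rows.append({
               "attribute": col,
               "distinct_ratio": series.nunique(dropna=True) / max(len(series), 1),
               "missing_ratio": series.isna().mean(),
               "data_type": str(series.dtype),
               "avg_size": series.astype(str).str.len().mean(),
           })
       profile = pd.DataFrame(rows)
       # Assumed cut-offs: many distinct values and few missing values make a
       # column a good candidate for duplicate matching.
       profile["candidate"] = (profile["distinct_ratio"] > 0.9) & (profile["missing_ratio"] < 0.1)
       return profile

   if __name__ == "__main__":
       data = pd.DataFrame({
           "name": ["John", "Jon", "Mary", None],
           "phone": ["123", "123", "456", "789"],
       })
       print(profile_attributes(data))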
Rule 1: certainty factor 0.95 (No. 1 and No. 4)
  • Matching key field with high type and high size
  • And matching field with high distinct value, low missing value, high value data type and matching field with high range value

Rule 2: certainty factor 0.9 (No. 2 and No. 4)
  • Matching key field with high range value
  • And matching field with high distinct value, low missing value, and matching field with high range value

Rule 3: certainty factor 0.9 (No. 3 and No. 4)
  • Matching key field with high type
  • And matching field with high distinct value, low missing value, high value data type and matching field with high range value

Rule 4: certainty factor 0.85 (No. 1 and No. 5)
  • Matching key field with high type and high size
  • And matching field with high distinct value, low missing value and high range value

Rule 5: certainty factor 0.85 (No. 1 and No. 5)
  • Matching key field with high size
  • And matching field with high distinct value, low missing value and high range value

Rule 6: certainty factor 0.85 (No. 2 and No. 5)
  • Matching key field with high type
  • And matching field with high distinct value, low missing value and high range value

Rule 7: certainty factor 0.85 (No. 1 and No. 6)
  • Matching key field with high size and high type
  • And matching field with high distinct value, low missing value and high value data type

Rule 8: certainty factor 0.8 (No. 3 and No. 7)
  • Matching key field with high type
  • And matching field with high distinct value, high value data type and high range value

Rule 9: certainty factor 0.75 (No. 2 and No. 8)
  • Matching key field with high size
  • And matching field with high distinct value and low missing value

Rule 10: certainty factor 0.75 (No. 3 and No. 8)
  • Matching key field with high type
  • And matching field with high distinct value and low missing value

Rule 11: certainty factor 0.7 (No. 1 and No. 9)
  • Matching key field with high type and high size
  • And matching field with high distinct value and high value data type

Rule 12: certainty factor 0.7 (No. 2 and No. 9)
  • Matching key field with high size
  • And matching field with high distinct value and high value data type

Rule 13: certainty factor 0.7 (No. 3 and No. 9)
  • Matching key field with high type
  • And matching field with high distinct value and high value data type

Rule 14: certainty factor 0.7 (No. 1 and No. 10)
  • Matching key field with high type and high size
  • And matching field with high distinct value and high range value

Rule 15: certainty factor 0.7 (No. 2 and No. 10)
  • Matching key field with high size
  • And matching field with high distinct value and high range value

Rule 16: certainty factor 0.7 (No. 3 and No. 10)
  • Matching key field with high type
  • And matching field with high distinct value and high range value

S No. |              Rules                  | Certainty Factor (CF) | Threshold value (TH)
  1   | {TS}, {D, M, DT, DS}                |         0.95          |        0.75
  2   | {T, S}, {D, M, DT, DS}              |         0.9           |        0.80
  3   | {TS, T, S}, {D, M, DT}, {D, M, DS}  |         0.85          |        0.85
  4   | {TS, T, S}, {D, DT, DS}             |         0.8           |        0.9
  5   | {TS, T, S}, {D, M}                  |         0.75          |        0.95
  6   | {TS, T, S}, {D, DT}, {D, DS}        |         0.7           |        0.95

TS – Type and size of key attribute
T – Type of key attribute
S – Size of key attribute
D – Distinct value of attributes
M – Missing value of attributes
DT – Data type of attributes
DS – Data size of attributes

   Duplicate records are identified in each cluster to determine exact and inexact duplicate records. The duplicate records can be categorized as match, may-be-match and no-match. Match and may-be-match duplicate records are used in the duplicate data elimination rule, which evaluates the quality of each duplicate record in order to eliminate poor-quality duplicate records.
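   The following short Python sketch (an illustration, not the authors' code) shows one way the rule summary above could be encoded and used: each rule pairs the attribute classes involved with a certainty factor and a threshold, and a candidate pair is categorized as match, may-be-match or no-match by comparing its computed certainty against the rule's threshold. The may-be-match margin is an assumed value.

   # Illustrative encoding of the certainty-factor rule summary table above.
   # Each entry: (key-attribute classes, matching-field classes, CF, TH).
   RULES = [
       ({"TS"},           {"D", "M", "DT", "DS"}, 0.95, 0.75),
       ({"T", "S"},       {"D", "M", "DT", "DS"}, 0.90, 0.80),
       ({"TS", "T", "S"}, {"D", "M", "DT"},       0.85, 0.85),   # also {D, M, DS}
       ({"TS", "T", "S"}, {"D", "DT", "DS"},      0.80, 0.90),
       ({"TS", "T", "S"}, {"D", "M"},             0.75, 0.95),
       ({"TS", "T", "S"}, {"D", "DT"},            0.70, 0.95),   # also {D, DS}
   ]

   def categorize(pair_certainty, threshold, margin=0.05):
       """Categorize a candidate duplicate pair from its computed certainty.
       The margin for the may-be-match band is an assumed value."""
       if pair_certainty >= threshold:
           return "match"
       if pair_certainty >= threshold - margin:
           return "may-be-match"
       return "no-match"

   # Example: a candidate pair governed by rule 1 whose computed certainty is
   # 0.80 clears that rule's threshold of 0.75 and is kept as a match.
   _, _, cf, th = RULES[0]
   print(cf, th, categorize(0.80, th))   # 0.95 0.75 match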
• Calculate the similarity of the documents matched in the main concepts (Xmc) and the similarity of the documents matched in the detailed descriptions (Xdd).
• Evaluate Xmc and Xdd using the rules to derive the corresponding memberships.
• Compare the memberships and select the minimum membership from these two sets to represent the membership of the corresponding concept (high similarity, medium similarity, or low similarity) for each rule.
• Collect the memberships which represent the same concept in one set.
• Derive the maximum membership for each set, and compute the final inference result (see the sketch below).
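   A minimal Python sketch of this min-max fuzzy inference follows; the rule list, membership values and concept labels are hypothetical and only illustrate the procedure described above.

   # Illustrative min-max fuzzy inference over rule memberships (hypothetical values).
   rules = [
       # (concept concluded, membership from Xmc, membership from Xdd)
       ("high similarity",   0.8, 0.6),
       ("high similarity",   0.5, 0.9),
       ("medium similarity", 0.4, 0.7),
       ("low similarity",    0.2, 0.1),
   ]

   # For each rule, take the minimum of its two memberships.
   rule_strengths = [(concept, min(m_mc, m_dd)) for concept, m_mc, m_dd in rules]

   # Collect memberships that conclude the same concept.
   by_concept = {}
   for concept, strength in rule_strengths:
       by_concept.setdefault(concept, []).append(strength)

   # Take the maximum per concept and report the final inference result.
   aggregated = {concept: max(vals) for concept, vals in by_concept.items()}
   result = max(aggregated, key=aggregated.get)
   print(aggregated)   # {'high similarity': 0.6, 'medium similarity': 0.4, 'low similarity': 0.1}
   print(result)       # high similarity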
C. Evaluation Metric

   The overall performance can be found using precision and recall, where

   Precision = (number of correctly identified duplicate pairs) / (number of all identified duplicate pairs)

   Recall = (number of correctly identified duplicate pairs) / (number of true duplicate pairs)

   The classification quality is evaluated using the F-measure, which is the harmonic mean of precision and recall:

   F-measure = 2 × (precision × recall) / (precision + recall)
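   These metrics translate directly into code; the small Python helper below (an illustrative addition, not part of the paper) computes them from the counts of correctly identified, identified, and true duplicate pairs.

   # Precision, recall and F-measure for duplicate detection results.
   def evaluate(correctly_identified, all_identified, true_duplicates):
       """All three arguments are counts of duplicate pairs."""
       precision = correctly_identified / all_identified if all_identified else 0.0
       recall = correctly_identified / true_duplicates if true_duplicates else 0.0
       f_measure = (2 * precision * recall / (precision + recall)
                    if (precision + recall) else 0.0)
       return precision, recall, f_measure

   # Example: 90 of 100 reported pairs are correct and 120 true duplicate pairs exist.
   p, r, f = evaluate(90, 100, 120)
   print(round(p, 2), round(r, 2), round(f, 2))   # 0.9 0.75 0.82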
                  IV. CONCLUSION

   Deduplication and data linkage are important tasks in the preprocessing step of many data mining projects. It is important to improve data quality before data is loaded into the data warehouse. Locating approximate duplicates in a large data warehouse is an important part of data management and plays a critical role in the data cleaning process. In this research work, a framework is designed to clean duplicate data in order to improve data quality and to support any subject-oriented data.

   In this research work, an efficient duplicate detection and duplicate elimination approach is developed that obtains good duplicate detection and elimination results by reducing false positives. The performance of this research work shows that time is saved significantly and duplicate results are improved compared with the existing approach.

   The framework is mainly developed to increase the speed of the duplicate data detection and elimination process and to increase the quality of the data by identifying true duplicates while remaining strict enough to keep out false positives. The accuracy and efficiency of the duplicate elimination strategies are improved by introducing the concept of a certainty factor for a rule.

   Data cleansing is a complex and challenging problem. This rule-based strategy helps to manage the complexity, but does not remove it. The approach can be applied to any subject-oriented data warehouse in any domain.

                   V. REFERENCES

[1] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning", Proc. ACM SIGMOD, pp. 313-324, 2003.
[2] Kuhanandha Mahalingam and Michael N. Huhns, "Representing and Using Ontologies", USC-CIT Technical Report 98-01.
[3] Weifeng Su, Jiying Wang, and Frederick H. Lochovsky, "Record Matching over Query Results from Multiple Web Databases", IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 4, 2010.
[4] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating Fuzzy Duplicates in Data Warehouses", Proc. VLDB, pp. 586-597, 2002.
[5] P. Tetlow, J. Pan, D. Oberle, E. Wallace, M. Uschold, and E. Kendall, "Ontology Driven Architectures and Potential Uses of the Semantic Web in Software Engineering", W3C, Semantic Web Best Practices and Deployment Working Group, Draft, 2006.
[6] Ji-Rong Wen, Fred Lochovsky, and Wei-Ying Ma, "Instance-based Schema Matching for Web Databases by Domain-specific Query Probing", Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004.
[7] Amy J. C. Trappey, Charles V. Trappey, Fu-Chiang Hsu, and David W. Hsiao, "A Fuzzy Ontological Knowledge Document Clustering Methodology", IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 3, June 2009.

Contenu connexe

Tendances

A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
Privacy Preserving Clustering on Distorted data
Privacy Preserving Clustering on Distorted dataPrivacy Preserving Clustering on Distorted data
Privacy Preserving Clustering on Distorted dataIOSR Journals
 
Enhancing the labelling technique of
Enhancing the labelling technique ofEnhancing the labelling technique of
Enhancing the labelling technique ofIJDKP
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceIJDKP
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...IJDKP
 
Information Retrieval 02
Information Retrieval 02Information Retrieval 02
Information Retrieval 02Jeet Das
 
DATA MINING.doc
DATA MINING.docDATA MINING.doc
DATA MINING.docbutest
 
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...IJDKP
 
Enhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaEnhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaIJDKP
 
GCUBE INDEXING
GCUBE INDEXINGGCUBE INDEXING
GCUBE INDEXINGIJDKP
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003Ajay Ohri
 
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITYSOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITYIJDKP
 
LOG MESSAGE ANOMALY DETECTION WITH OVERSAMPLING
LOG MESSAGE ANOMALY DETECTION WITH OVERSAMPLINGLOG MESSAGE ANOMALY DETECTION WITH OVERSAMPLING
LOG MESSAGE ANOMALY DETECTION WITH OVERSAMPLINGijaia
 

Tendances (17)

A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
Privacy Preserving Clustering on Distorted data
Privacy Preserving Clustering on Distorted dataPrivacy Preserving Clustering on Distorted data
Privacy Preserving Clustering on Distorted data
 
Enhancing the labelling technique of
Enhancing the labelling technique ofEnhancing the labelling technique of
Enhancing the labelling technique of
 
Data Mining
Data MiningData Mining
Data Mining
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduce
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
 
Information Retrieval 02
Information Retrieval 02Information Retrieval 02
Information Retrieval 02
 
DATA MINING.doc
DATA MINING.docDATA MINING.doc
DATA MINING.doc
 
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
 
Enhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaEnhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging area
 
winbis1005
winbis1005winbis1005
winbis1005
 
GCUBE INDEXING
GCUBE INDEXINGGCUBE INDEXING
GCUBE INDEXING
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003
 
61_Empirical
61_Empirical61_Empirical
61_Empirical
 
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITYSOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
 
LOG MESSAGE ANOMALY DETECTION WITH OVERSAMPLING
LOG MESSAGE ANOMALY DETECTION WITH OVERSAMPLINGLOG MESSAGE ANOMALY DETECTION WITH OVERSAMPLING
LOG MESSAGE ANOMALY DETECTION WITH OVERSAMPLING
 

Similaire à Detecting Duplicate Records Using Clustering

Indexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record DeduplicationIndexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record Deduplicationidescitation
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R NotesLakshmiSarvani6
 
A PROCESS OF LINK MINING
A PROCESS OF LINK MININGA PROCESS OF LINK MINING
A PROCESS OF LINK MININGcsandit
 
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...onlmcq
 
TTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueTTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueMehmet Beyaz
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESSIJDKP
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESSIJDKP
 
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...IRJET Journal
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...IJDKP
 
PowerPoint Template
PowerPoint TemplatePowerPoint Template
PowerPoint Templatebutest
 
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATIONPRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATIONcscpconf
 
Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...ijsc
 
EDRG12_Re.doc
EDRG12_Re.docEDRG12_Re.doc
EDRG12_Re.docbutest
 
EDRG12_Re.doc
EDRG12_Re.docEDRG12_Re.doc
EDRG12_Re.docbutest
 
Enhance The Technique For Searching Dimension Incomplete Databases
Enhance The Technique For Searching Dimension Incomplete DatabasesEnhance The Technique For Searching Dimension Incomplete Databases
Enhance The Technique For Searching Dimension Incomplete Databasespaperpublications3
 

Similaire à Detecting Duplicate Records Using Clustering (20)

Indexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record DeduplicationIndexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record Deduplication
 
Data mining
Data miningData mining
Data mining
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
A PROCESS OF LINK MINING
A PROCESS OF LINK MININGA PROCESS OF LINK MINING
A PROCESS OF LINK MINING
 
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
 
TTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueTTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining Technique
 
130509
130509130509
130509
 
130509
130509130509
130509
 
G045033841
G045033841G045033841
G045033841
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESS
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESS
 
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
 
PowerPoint Template
PowerPoint TemplatePowerPoint Template
PowerPoint Template
 
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATIONPRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
 
Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...
 
1 5
1 51 5
1 5
 
EDRG12_Re.doc
EDRG12_Re.docEDRG12_Re.doc
EDRG12_Re.doc
 
EDRG12_Re.doc
EDRG12_Re.docEDRG12_Re.doc
EDRG12_Re.doc
 
Enhance The Technique For Searching Dimension Incomplete Databases
Enhance The Technique For Searching Dimension Incomplete DatabasesEnhance The Technique For Searching Dimension Incomplete Databases
Enhance The Technique For Searching Dimension Incomplete Databases
 

Plus de IJORCS

Help the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on IntersectionsHelp the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on IntersectionsIJORCS
 
Call for Papers - IJORCS, Volume 4 Issue 4
Call for Papers - IJORCS, Volume 4 Issue 4Call for Papers - IJORCS, Volume 4 Issue 4
Call for Papers - IJORCS, Volume 4 Issue 4IJORCS
 
Real-Time Multiple License Plate Recognition System
Real-Time Multiple License Plate Recognition SystemReal-Time Multiple License Plate Recognition System
Real-Time Multiple License Plate Recognition SystemIJORCS
 
FPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A RetrospectiveFPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A RetrospectiveIJORCS
 
Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...IJORCS
 
Algebraic Fault Attack on the SHA-256 Compression Function
Algebraic Fault Attack on the SHA-256 Compression FunctionAlgebraic Fault Attack on the SHA-256 Compression Function
Algebraic Fault Attack on the SHA-256 Compression FunctionIJORCS
 
Enhancement of DES Algorithm with Multi State Logic
Enhancement of DES Algorithm with Multi State LogicEnhancement of DES Algorithm with Multi State Logic
Enhancement of DES Algorithm with Multi State LogicIJORCS
 
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...IJORCS
 
CFP. IJORCS, Volume 4 - Issue2
CFP. IJORCS, Volume 4 - Issue2CFP. IJORCS, Volume 4 - Issue2
CFP. IJORCS, Volume 4 - Issue2IJORCS
 
Call for Papers - IJORCS - Vol 4, Issue 1
Call for Papers - IJORCS - Vol 4, Issue 1Call for Papers - IJORCS - Vol 4, Issue 1
Call for Papers - IJORCS - Vol 4, Issue 1IJORCS
 
Voice Recognition System using Template Matching
Voice Recognition System using Template MatchingVoice Recognition System using Template Matching
Voice Recognition System using Template MatchingIJORCS
 
Channel Aware Mac Protocol for Maximizing Throughput and Fairness
Channel Aware Mac Protocol for Maximizing Throughput and FairnessChannel Aware Mac Protocol for Maximizing Throughput and Fairness
Channel Aware Mac Protocol for Maximizing Throughput and FairnessIJORCS
 
A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...IJORCS
 
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...IJORCS
 
A Study of Routing Techniques in Intermittently Connected MANETs
A Study of Routing Techniques in Intermittently Connected MANETsA Study of Routing Techniques in Intermittently Connected MANETs
A Study of Routing Techniques in Intermittently Connected MANETsIJORCS
 
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...IJORCS
 
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed SystemAn Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed SystemIJORCS
 
The Design of Cognitive Social Simulation Framework using Statistical Methodo...
The Design of Cognitive Social Simulation Framework using Statistical Methodo...The Design of Cognitive Social Simulation Framework using Statistical Methodo...
The Design of Cognitive Social Simulation Framework using Statistical Methodo...IJORCS
 
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...IJORCS
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmIJORCS
 

Plus de IJORCS (20)

Help the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on IntersectionsHelp the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on Intersections
 
Call for Papers - IJORCS, Volume 4 Issue 4
Call for Papers - IJORCS, Volume 4 Issue 4Call for Papers - IJORCS, Volume 4 Issue 4
Call for Papers - IJORCS, Volume 4 Issue 4
 
Real-Time Multiple License Plate Recognition System
Real-Time Multiple License Plate Recognition SystemReal-Time Multiple License Plate Recognition System
Real-Time Multiple License Plate Recognition System
 
FPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A RetrospectiveFPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
 
Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...
 
Algebraic Fault Attack on the SHA-256 Compression Function
Algebraic Fault Attack on the SHA-256 Compression FunctionAlgebraic Fault Attack on the SHA-256 Compression Function
Algebraic Fault Attack on the SHA-256 Compression Function
 
Enhancement of DES Algorithm with Multi State Logic
Enhancement of DES Algorithm with Multi State LogicEnhancement of DES Algorithm with Multi State Logic
Enhancement of DES Algorithm with Multi State Logic
 
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
 
CFP. IJORCS, Volume 4 - Issue2
CFP. IJORCS, Volume 4 - Issue2CFP. IJORCS, Volume 4 - Issue2
CFP. IJORCS, Volume 4 - Issue2
 
Call for Papers - IJORCS - Vol 4, Issue 1
Call for Papers - IJORCS - Vol 4, Issue 1Call for Papers - IJORCS - Vol 4, Issue 1
Call for Papers - IJORCS - Vol 4, Issue 1
 
Voice Recognition System using Template Matching
Voice Recognition System using Template MatchingVoice Recognition System using Template Matching
Voice Recognition System using Template Matching
 
Channel Aware Mac Protocol for Maximizing Throughput and Fairness
Channel Aware Mac Protocol for Maximizing Throughput and FairnessChannel Aware Mac Protocol for Maximizing Throughput and Fairness
Channel Aware Mac Protocol for Maximizing Throughput and Fairness
 
A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...
 
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
 
A Study of Routing Techniques in Intermittently Connected MANETs
A Study of Routing Techniques in Intermittently Connected MANETsA Study of Routing Techniques in Intermittently Connected MANETs
A Study of Routing Techniques in Intermittently Connected MANETs
 
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
 
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed SystemAn Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
 
The Design of Cognitive Social Simulation Framework using Statistical Methodo...
The Design of Cognitive Social Simulation Framework using Statistical Methodo...The Design of Cognitive Social Simulation Framework using Statistical Methodo...
The Design of Cognitive Social Simulation Framework using Statistical Methodo...
 
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
 

Dernier

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 

Detecting Duplicate Records Using Clustering

  • 1. International Journal of Research in Computer Science eISSN 2249-8265 Volume 2 Issue 2 (2012) pp. 29-32 © White Globe Publications www.ijorcs.org DUPLICATE DETECTION OF RECORDS IN QUERIES USING CLUSTERING M.Anitha1, A.Srinivas2, T.P.Shekhar3, D.Sagar4 * Sree Chaitanya College Of Engineering (SCCE), Karimnagar, India Abstract: The problem of detecting and eliminating multiple sources, the amount of the data increases and duplicated data is one of the major problems in the as well as data is duplicated. Data warehouse may broad area of data cleaning and data quality in data have terabyte of data for the mining process. The warehouse. Many times, the same logical real world preprocessing of data is the initial and often crucial entity may have multiple representations in the data step of the data mining process. To increase the warehouse. Duplicate elimination is hard because it is accuracy of the mining result one has to perform data caused by several types of errors like typographical preprocessing because 80% of mining efforts often errors, and different representations of the same spend their time on data quality. So, data cleaning is logical value. Also, it is important to detect and clean very much important in data warehouse before the equivalence errors because an equivalence error may mining process. The result of the data mining process result in several duplicate tuples. Recent research will not be accurate because of the data duplication efforts have focused on the issue of duplicate and poor quality of data. There are many existing elimination in data warehouses. This entails trying to methods available for duplicate data detection and match inexact duplicate records, which are records elimination. But the speed of the data cleaning process that refer to the same real-world entity while not being is very slow and the time taken for the cleaning syntactically equivalent. This paper mainly focuses on process is high with large amount of data. So, there is a efficient detection and elimination of duplicate data. need to reduce time and increase speed of the data The main objective of this research work is to detect cleaning process as well as need to improve the quality exact and inexact duplicates by using duplicate of the data. detection and elimination rules. This approach is used to improve the efficiency of the data. There are two issues to be considered for duplicate detection: Accuracy and Speed. The measure of Keywords: Data Cleaning, Duplicate Data, Data accuracy in duplicate detection depends on the number Warehouse, Data Mining of false negatives (duplicates you did not classify as such) and false positives (non-duplicates which were I. INTRODUCTION classified as duplicates) [12]. Data warehouse contains large amounts of data for In this research work, a duplicate detection and data mining to analyze the data for decision making elimination rule is developed to handle any duplicate process. Data miners do not simply analyze data, they data in a data warehouse. Duplicate elimination is very have to bring the data in a format and state that allows important to identify which duplicate to retain and for this analysis. It has been estimated that the actual duplicate is to be removed. The main objective of this mining of data only makes up 10% of the time research work is to reduce the number of false required for the complete knowledge discovery positives, to speed up the data cleaning process reduce process [3]. 
According to Jiawei, the time-consuming preprocessing step that precedes mining is essential for data mining. It is more than a tedious necessity: the techniques used in the preprocessing step can deeply influence the results of the following step, the actual application of a data mining algorithm [6]. Hans-Peter notes that the impact of data preprocessing on data mining will gain steadily more interest over the coming years, and that preprocessing is one of the four future trends and major issues in data mining [7].

During duplicate elimination it is important to identify which duplicate record to retain and which to remove. The main objective of this research work is to reduce the number of false positives, to speed up the data cleaning process, to reduce its complexity, and to improve the quality of the data. A high-quality, scalable duplicate elimination algorithm is used and evaluated on real datasets from an operational data warehouse to achieve this objective.

II. RECORD MATCHING OVER QUERY RESULTS

A. First Problem Definition

Our focus is to find the matching status among the records and to retain the non-duplicate records. The goal is then to cluster the matched records using fuzzy ontological document clustering.
B. Element Identification

Supervised learning methods use only some of the fields in a record for identification; this is why query results obtained using supervised learning can still contain duplicate records. Unsupervised Duplicate Elimination (UDE) does not suffer from these user-reference problems. A preprocessing step called exact matching is used for matching relevant records. It requires the data format of the records to be the same, so the exact matching method is applicable only to records from the same data source. Element identification thus merges the records that are exactly the same in the relevant matching fields.

C. Ontology matching

The term ontology is derived from the Greek words 'onto', meaning being, and 'logia', meaning written or spoken disclosure. In short, it refers to a specification of a conceptualization. An ontology is the set of concepts, such as things, events and relations, that are specified in some way in order to create an agreed-upon vocabulary for exchanging information. Ontologies can be represented in textual or graphical formats; graphical formats are usually preferred for easy understandability. Ontologies with a large knowledge base [5] can be represented in different forms such as hierarchical trees, expandable hierarchical trees, hyperbolic trees, etc. In the expandable hierarchical tree format, the user has the freedom to expand only the node of interest and leave the rest in a collapsed state [2]; if necessary, the entire tree can be expanded to get the complete knowledge base. This format can be used only when there are a large number of hierarchical relationships. Ontology matching is used for finding the matching status of record pairs by matching the record attributes.

III. SYSTEM METHODOLOGY

A. Unsupervised Duplicate Elimination

UDE employs a similarity function to find field similarity, and a similarity vector is used to represent a pair of records.

Input: potential duplicate vector set P, non-duplicate vector set N
Output: duplicate vector set D
C1: a classification algorithm with adjustable parameters W that identifies duplicate vector pairs from P
C2: a supervised classifier, SVM

Algorithm:
1. D = φ
2. Set the parameters W of C1 according to N
3. Use C1 to get a set of duplicate vector pairs d1 and f from P and N
4. P = P - d1
5. while |d1| ≠ 0
6.   N' = N - f
7.   D = D + d1 + f
8.   Train C2 using D and N'
9.   Classify P using C2 and get a set of newly identified duplicate vector pairs d2
10.  P = P - d2
11.  D = D + d2
12.  Adjust the parameters W of C1 according to N' and D
13.  Use C1 to get a new set of duplicate vector pairs d1 and f from P and N
14.  N = N'
15. Return D

Figure 1: UDE Algorithm
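The loop in Figure 1 can be sketched in Python. This is a minimal illustration, not the authors' implementation: `ThresholdC1` is a hypothetical stand-in for the adjustable classifier C1, scikit-learn's `SVC` stands in for the supervised classifier C2, record pairs are assumed to already be converted into numeric similarity vectors, and the sketch also repeats step 4 inside the loop so that it terminates.

```python
# Minimal sketch of the UDE loop in Figure 1 (illustrative only).
from sklearn.svm import SVC


class ThresholdC1:
    """Toy stand-in for C1: flags a similarity vector as a duplicate pair
    when its mean similarity exceeds a cut-off learned from N."""
    def fit(self, N, D=None):
        self.cut = max(sum(v) / len(v) for v in N) if N else 0.5

    def split(self, P, N):
        d1 = [v for v in P if sum(v) / len(v) > self.cut]   # likely duplicates in P
        f = [v for v in N if sum(v) / len(v) > self.cut]    # false hits inside N
        return d1, f


def ude(P, N, c1=None):
    c1 = c1 or ThresholdC1()
    D = []                                       # 1. D = empty
    c1.fit(N)                                    # 2. set W of C1 from N
    d1, f = c1.split(P, N)                       # 3. duplicates d1, false hits f
    P = [v for v in P if v not in d1]            # 4. P = P - d1
    while d1:                                    # 5. while |d1| != 0
        N = [v for v in N if v not in f]         # 6. N' = N - f
        D += d1 + f                              # 7. D = D + d1 + f
        c2 = SVC()                               # C2: supervised classifier (SVM)
        c2.fit(D + N, [1] * len(D) + [0] * len(N))           # 8. train C2 on D, N'
        d2 = [v for v in P if c2.predict([v])[0] == 1]       # 9. newly found pairs
        P = [v for v in P if v not in d2]        # 10. P = P - d2
        D += d2                                  # 11. D = D + d2
        c1.fit(N, D)                             # 12. adjust W of C1 from N' and D
        d1, f = c1.split(P, N)                   # 13. new d1 and f (14. N = N')
        P = [v for v in P if v not in d1]        # repeat of step 4, for termination
    return D                                     # 15. return D
```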
B. Certainty factor

In the existing method of duplicate data elimination [10], the certainty factor (CF) is calculated by classifying attributes by their distinct values, missing values, and the type and size of the attribute. These attributes are identified manually based on the type of the data and on which data is most important in the data warehouse; for example, if the name, telephone and fax fields are used for matching, a high certainty factor is assigned. In this research work, the best attributes are identified in the early stages of data cleaning. The attributes are selected based on specific criteria and the quality of the data, and the attribute threshold value is calculated based on the measurement type and size of the data. These selected attributes are well suited for the data cleaning process. The certainty factor is assigned based on the attribute types, as shown in Table 1.

Table 1: Classification of attribute types (columns: S. No, Key Attribute, Size of data, Distinct values, Missing values, Types of data; rows 1-10 mark the attribute combinations referred to as No. 1-No. 10 in the rules below, with rows 1-3 covering key-attribute type and size and rows 4-10 covering combinations of distinct value, missing value, data type and data size)
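The attribute statistics behind Table 1 (distinct values, missing values, size) can be profiled directly from the data. The sketch below is only an illustration of that idea; the field names and the 0.5/0.1 cut-offs are assumptions, not values from the paper.

```python
# Illustrative per-attribute profiling for selecting matching attributes.
# Field names and cut-offs are assumptions, not taken from the paper.
from typing import Any, Dict, List


def profile_attribute(rows: List[Dict[str, Any]], field: str) -> Dict[str, float]:
    values = [r.get(field) for r in rows]
    non_missing = [v for v in values if v not in (None, "")]
    return {
        "distinct_ratio": len(set(non_missing)) / max(len(rows), 1),
        "missing_ratio": 1 - len(non_missing) / max(len(rows), 1),
        "avg_size": sum(len(str(v)) for v in non_missing) / max(len(non_missing), 1),
    }


# Keep fields with many distinct values and few missing values as candidates.
rows = [
    {"name": "A. Kumar", "phone": "98400", "fax": ""},
    {"name": "A Kumar",  "phone": "98400", "fax": ""},
    {"name": "B. Rao",   "phone": "91230", "fax": ""},
]
candidates = [f for f in ("name", "phone", "fax")
              if profile_attribute(rows, f)["distinct_ratio"] > 0.5
              and profile_attribute(rows, f)["missing_ratio"] < 0.1]
print(candidates)   # 'fax' is dropped because every value is missing
```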
Rule 1: certainty factor 0.95 (No. 1 and No. 4)
• Matching key field with high type and high size
• And matching field with high distinct value, low missing value, high value data type and matching field with high range value

Rule 2: certainty factor 0.9 (No. 2 and No. 4)
• Matching key field with high size
• And matching field with high distinct value, low missing value, and matching field with high range value

Rule 3: certainty factor 0.9 (No. 3 and No. 4)
• Matching key field with high type
• And matching field with high distinct value, low missing value, high value data type and matching field with high range value

Rule 4: certainty factor 0.85 (No. 1 and No. 5)
• Matching key field with high type and high size
• And matching field with high distinct value, low missing value and high range value

Rule 5: certainty factor 0.85 (No. 2 and No. 5)
• Matching key field with high size
• And matching field with high distinct value, low missing value and high range value

Rule 6: certainty factor 0.85 (No. 3 and No. 5)
• Matching key field with high type
• And matching field with high distinct value, low missing value and high range value

Rule 7: certainty factor 0.85 (No. 1 and No. 6)
• Matching key field with high type and high size
• And matching field with high distinct value, low missing value and high value data type

Rule 8: certainty factor 0.8 (No. 3 and No. 7)
• Matching key field with high type
• And matching field with high distinct value, high value data type and high range value

Rule 9: certainty factor 0.75 (No. 2 and No. 8)
• Matching key field with high size
• And matching field with high distinct value and low missing value

Rule 10: certainty factor 0.75 (No. 3 and No. 8)
• Matching key field with high type
• And matching field with high distinct value and low missing value

Rule 11: certainty factor 0.7 (No. 1 and No. 9)
• Matching key field with high type and high size
• And matching field with high distinct value and high value data type

Rule 12: certainty factor 0.7 (No. 2 and No. 9)
• Matching key field with high size
• And matching field with high distinct value and high value data type

Rule 13: certainty factor 0.7 (No. 3 and No. 9)
• Matching key field with high type
• And matching field with high distinct value and high value data type

Rule 14: certainty factor 0.7 (No. 1 and No. 10)
• Matching key field with high type and high size
• And matching field with high distinct value and high range value

Rule 15: certainty factor 0.7 (No. 2 and No. 10)
• Matching key field with high size
• And matching field with high distinct value and high range value

Rule 16: certainty factor 0.7 (No. 3 and No. 10)
• Matching key field with high type
• And matching field with high distinct value and high range value

Table 2: Certainty factors and threshold values for the rules

S. No | Rules | Certainty Factor (CF) | Threshold value (TH)
1 | {TS}, {D, M, DT, DS} | 0.95 | 0.75
2 | {T, S}, {D, M, DT, DS} | 0.9 | 0.80
3 | {TS, T, S}, {D, M, DT}, {D, M, DS} | 0.85 | 0.85
4 | {TS, T, S}, {D, DT, DS} | 0.8 | 0.90
5 | {TS, T, S}, {D, M} | 0.75 | 0.95
6 | {TS, T, S}, {D, DT}, {D, DS} | 0.7 | 0.95

TS – Type and Size of key attribute
T – Type of key attribute
S – Size of key attribute
D – Distinct value of attributes
M – Missing value of attributes
DT – Data type of attributes
DS – Data size of attributes
Duplicate records are identified in each cluster in order to detect exact and inexact duplicates, and each candidate pair is categorized as match, may-be-match or no-match. Match and may-be-match records are passed to the duplicate data elimination rule, which assesses the quality of each duplicate record so that poor-quality duplicates can be eliminated.
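Read together, each rule pairs a key-attribute condition (type and/or size, rows 1-3 of Table 1) with a non-key condition (a subset of distinct value, missing value, data type and data size, rows 4-10) and assigns a certainty factor; comparing the CF with the rule's threshold then yields match, may-be-match or no-match. The sketch below hard-codes only the six CF/TH bands from Table 2 and assumes a particular reading of the comparison (CF at or above TH means match); it is an illustration, not the authors' code.

```python
# Illustrative mapping from matched attribute properties to a certainty factor.
# Bands follow Table 2: ({key condition}, {field condition}) -> (CF, TH).
BANDS = [
    ({"TS"},           {"D", "M", "DT", "DS"}, 0.95, 0.75),
    ({"T", "S"},       {"D", "M", "DT", "DS"}, 0.90, 0.80),
    ({"TS", "T", "S"}, {"D", "M", "DT"},       0.85, 0.85),
    ({"TS", "T", "S"}, {"D", "M", "DS"},       0.85, 0.85),
    ({"TS", "T", "S"}, {"D", "DT", "DS"},      0.80, 0.90),
    ({"TS", "T", "S"}, {"D", "M"},             0.75, 0.95),
    ({"TS", "T", "S"}, {"D", "DT"},            0.70, 0.95),
    ({"TS", "T", "S"}, {"D", "DS"},            0.70, 0.95),
]


def certainty(key_match: str, field_matches: set) -> tuple:
    """Return (CF, TH) for the first band whose conditions are satisfied."""
    for keys, fields, cf, th in BANDS:
        if key_match in keys and fields <= field_matches:
            return cf, th
    return 0.0, 1.0   # no rule fired


def categorize(cf: float, th: float) -> str:
    # Assumed interpretation: CF >= TH is a match, a non-zero CF below TH
    # is only a possible match, and no fired rule means no-match.
    if cf == 0.0:
        return "no-match"
    return "match" if cf >= th else "may-be-match"


cf, th = certainty("TS", {"D", "M", "DT", "DS"})   # corresponds to Rule 1
print(cf, th, categorize(cf, th))                  # 0.95 0.75 match
```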
The fuzzy ontological document clustering of the matched records proceeds through the following inference steps:

• Calculate the similarity of the documents matched in main concepts (Xmc) and the similarity of the documents matched in detailed descriptions (Xdd).
• Evaluate Xmc and Xdd using the rules to derive the corresponding memberships.
• Compare the memberships and select the minimum membership from these two sets to represent the membership of the corresponding concept (high similarity, medium similarity or low similarity) for each rule.
• Collect the memberships which represent the same concept in one set.
• Derive the maximum membership for each set and compute the final inference result.

C. Evaluation Metric

The overall performance is measured using precision and recall, where

Precision = (number of correctly identified duplicate pairs) / (number of all identified duplicate pairs)

Recall = (number of correctly identified duplicate pairs) / (number of true duplicate pairs)

The classification quality is evaluated using the F-measure, the harmonic mean of precision and recall:

F-measure = 2 × (Precision × Recall) / (Precision + Recall)
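For instance, the three metrics follow directly from the counts of identified and true duplicate pairs; the numbers below are made up purely to show the arithmetic.

```python
# Precision, recall and F-measure from duplicate-detection counts.
def evaluate(correctly_identified: int, all_identified: int, true_duplicates: int):
    precision = correctly_identified / all_identified
    recall = correctly_identified / true_duplicates
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 90 of 100 reported pairs are real, out of 120 true pairs.
p, r, f = evaluate(90, 100, 120)
print(round(p, 2), round(r, 2), round(f, 2))   # 0.9 0.75 0.82
```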
IV. CONCLUSION

Deduplication and data linkage are important tasks in the preprocessing step of many data mining projects, and it is important to improve data quality before data is loaded into the data warehouse. Locating approximate duplicates in a large data warehouse is an important part of data management and plays a critical role in the data cleaning process. In this research work, a framework is designed to clean duplicate data in order to improve data quality, and it can support any subject-oriented data. An efficient duplicate detection and elimination approach is developed that achieves good detection and elimination results by reducing false positives. The performance results show that significant time is saved and that the duplicate results improve on the existing approach. The framework is mainly developed to increase the speed of the duplicate detection and elimination process and to increase the quality of the data by identifying true duplicates while being strict enough to keep out false positives. The accuracy and efficiency of the duplicate elimination strategies are improved by introducing the concept of a certainty factor for a rule. Data cleansing is a complex and challenging problem; the rule-based strategy helps to manage this complexity but does not remove it. The approach can be applied to any subject-oriented data warehouse in any domain.

V. REFERENCES

[1] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning", Proc. ACM SIGMOD, pp. 313-324, 2003.
[2] K. Mahalingam and M. N. Huhns, "Representing and Using Ontologies", USC-CIT Technical Report 98-01.
[3] W. Su, J. Wang, and F. H. Lochovsky, "Record Matching over Query Results from Multiple Web Databases", IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 4, 2010.
[4] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating Fuzzy Duplicates in Data Warehouses", Proc. VLDB, pp. 586-597, 2002.
[5] P. Tetlow, J. Pan, D. Oberle, E. Wallace, M. Uschold, and E. Kendall, "Ontology Driven Architectures and Potential Uses of the Semantic Web in Software Engineering", W3C Semantic Web Best Practices and Deployment Working Group, Draft, 2006.
[6] J.-R. Wen, F. Lochovsky, and W.-Y. Ma, "Instance-based Schema Matching for Web Databases by Domain-specific Query Probing", Proc. 30th VLDB Conference, Toronto, Canada, 2004.
[7] A. J. C. Trappey, C. V. Trappey, F.-C. Hsu, and D. W. Hsiao, "A Fuzzy Ontological Knowledge Document Clustering Methodology", IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, vol. 39, no. 3, June 2009.