Data Warehousing
Lecture-20
Data Duplication Elimination & BSN Method
Virtual University of Pakistan
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan1010@yahoo.com

Why is data duplicated?
A data warehouse is created from heterogeneous sources, whose databases may use a different schema or representation for the same entity.
Data coming from outside the organization that owns the DWH can be of even lower quality, e.g. different representations of the same entity, or transcription and typographical errors.

Problems due to data duplication
Data duplication can result in costly errors, such as:
• False frequency distributions.
• Incorrect aggregates due to double counting.
• Difficulty for credit card companies in catching fabricated identities.

Data Duplication: Non-Unique PK
• Duplicate Identification Numbers
• Multiple Customer Numbers
• Multiple Employee Numbers

Unable to determine customer relationships (CRM)
Name               Phone Number   Cust. No.
M. Ismail Siddiqi  021.666.1244   780701
M. Ismail Siddiqi  021.666.1244   780203
M. Ismail Siddiqi  021.666.1244   780009

Unable to analyze employee benefits trends
Bonus Date  Name           Department  Emp. No.
Jan. 2000   Khan Muhammad  213 (MKT)   5353536
Dec. 2001   Khan Muhammad  567 (SLS)   4577833
Mar. 2002   Khan Muhammad  349 (HR)    3457642

Data Duplication: Householding
• Group together all records that belong to the same household.
Why bother?
………  S. Ahad      440, Munir Road, Lahore
………  ………….…       ………………………………
………  Shiekh Ahad  No. 440, Munir Rd, Lhr
………  Shiekh Ahed  House # 440, Munir Road, Lahore
………  ………….…       ………………………………

Data Duplication: Individualization
• Identify multiple records in each household which represent the same individual.
Address field is standardized.
By coincidence?
………  M. Ahad   440, Munir Road, Lahore
………  ………….…    ………………………………
………  Maj Ahad  440, Munir Road, Lahore

Formal definition & Nomenclature
• Problem statement:
  "Given two databases, identify the potentially matched records efficiently and effectively."
• Many names, such as:
  • Record linkage
  • Merge/purge
  • Entity reconciliation
  • List washing and data cleansing
• The current market and tools are heavily centered on customer lists.

Need & Tool Support
• The logical solution to dirty data is to clean it in some way.
• Doing it manually is very slow and prone to errors.
• Tools are required to do it cost-effectively and achieve reasonable quality.
• Tools exist: some for specific fields, others for a specific cleaning phase.
• Because they are application-specific, they work very well, but they need support from other tools to cover a broad spectrum of cleaning problems.

Overview of the Basic Concept
• In its simplest form, each record has an identifying attribute (or combination of attributes).
• Records can come from a single source, or from multiple sources sharing the same PK or other common unique attributes.
• Sorting is performed on the identifying attributes, and neighboring records are checked.
• What if there are no common attributes, or the data is dirty?
  • The degree of similarity is measured numerically; different attributes may contribute differently.
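One way to read "measured numerically, different attributes may contribute differently" is as a weighted average of per-attribute similarities. The sketch below is only an illustration under that assumption: the weights, the field names and the use of Python's difflib.SequenceMatcher are choices made here, not something the lecture prescribes. The two sample records are taken from the householding slide.

```python
from difflib import SequenceMatcher

# Hypothetical weights (illustrative values only): the name counts more than the address.
WEIGHTS = {"name": 0.6, "address": 0.4}

def field_similarity(a: str, b: str) -> float:
    """Similarity of two strings in [0, 1], ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def record_similarity(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the per-attribute similarities."""
    total = sum(weights.values())
    score = sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())
    return score / total

a = {"name": "Shiekh Ahad", "address": "No. 440, Munir Rd, Lhr"}
b = {"name": "Shiekh Ahed", "address": "House # 440, Munir Road, Lahore"}
print(round(record_similarity(a, b), 2))
```

A pair would then be treated as a candidate match when the score exceeds a threshold; choosing that threshold is a tuning decision that depends on the data.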

Basic Sorted Neighborhood (BSN) Method
• Concatenate the data into one sequential list of N records.
• Step 1: Create Keys
  • Compute a key for each record in the list by extracting relevant fields or portions of fields.
  • The effectiveness of this method depends heavily on a properly chosen key.
• Step 2: Sort Data
  • Sort the records in the data list using the key of step 1.
• Step 3: Merge
  • Move a fixed-size window through the sequential list of records, limiting the comparisons for matching records to those records in the window.
  • If the size of the window is w records, then every new record entering the window is compared with the previous w-1 records.
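The three steps can be sketched in a few lines. This is a minimal in-memory illustration, not the lecture's implementation: the function name sorted_neighborhood, the make_key and is_match callbacks, and the toy key (last token of the name) are all assumptions introduced here. The sample records reuse names and addresses from earlier slides.

```python
def sorted_neighborhood(records, make_key, is_match, w):
    """Return index pairs (i, j), i < j, of records judged to match."""
    # Step 1: create a key for every record (keep the original index alongside).
    keyed = [(make_key(r), i) for i, r in enumerate(records)]
    # Step 2: sort by key so that likely duplicates become neighbors.
    keyed.sort()
    order = [i for _, i in keyed]
    # Step 3: merge. Each record is compared only with the w-1 records before it.
    matches = []
    for pos, i in enumerate(order):
        for j in order[max(0, pos - (w - 1)):pos]:
            if is_match(records[i], records[j]):
                matches.append((min(i, j), max(i, j)))
    return matches

records = [
    {"name": "M. Ahad", "address": "440, Munir Road, Lahore"},
    {"name": "Saiam Noor", "address": "Flat 5, Afshan Colony, Saidpur Road, Lahore"},
    {"name": "Maj Ahad", "address": "440, Munir Road, Lahore"},
]
pairs = sorted_neighborhood(
    records,
    make_key=lambda r: r["name"].split()[-1].upper(),    # toy key: last name token
    is_match=lambda a, b: a["address"] == b["address"],  # toy rule: same address
    w=2,
)
print(pairs)
```

With w=2 each record is compared only with its immediate predecessor in sorted order, so the two Ahad records are compared with each other while the unrelated record is never paired with the first one.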

BSN Method: Sliding Window
[Figure: the sorted list of records, showing the current window of w records and the next window of w records.]

BSN Method: Selection of Keys
• Effectiveness is highly dependent on the key selected to sort the records, e.g. middle name vs. family name.
• A key is a sequence of a subset of attributes, or of sub-strings within the attributes, chosen from the record.
• The keys are used for sorting the entire dataset with the intention that matched candidates will appear close to each other.
First     Middle  Address           NID       Key
Muhammed  Ahmad   440 Munir Road    34535322  AHM440MUN345
Muhammad  Ahmad   440 Munir Road    34535322  AHM440MUN345
Muhammed  Ahmed   440 Munir Road    34535322  AHM440MUN345
Muhammad  Ahmar   440 Munawar Road  34535334  AHM440MUN345
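The keys in the table appear to be built from the first three letters of the middle name, the house number, the first three letters of the street name and the first three digits of the NID. That recipe is inferred from the example rows rather than stated on the slide, so the make_key function below is a hypothetical sketch of it.

```python
def make_key(middle: str, address: str, nid: str) -> str:
    """Build a sort key such as AHM440MUN345 (recipe inferred from the table)."""
    tokens = address.split()
    # First all-digit token is taken as the house number, first all-letter token as the street.
    house_no = next((t for t in tokens if t.isdigit()), "")
    street = next((t for t in tokens if t.isalpha()), "")
    return middle[:3].upper() + house_no + street[:3].upper() + nid[:3]

rows = [
    ("Ahmad", "440 Munir Road", "34535322"),
    ("Ahmed", "440 Munir Road", "34535322"),
    ("Ahmar", "440 Munawar Road", "34535334"),
]
for middle, address, nid in rows:
    print(make_key(middle, address, nid))
```

Because only prefixes are used, the spelling variants Ahmad and Ahmed collapse to the same key, but so does the different person Ahmar on Munawar Road. The key only brings candidates together; deciding whether they really match is left to the merge step.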

BSN Method: Problem with keys
• Since the data is dirty, the keys WILL also be dirty, and matching records will not come together.
• Data becomes dirty due to data entry errors or the use of abbreviations. Some real examples are as follows:
  Technology, Tech., Techno., Tchnlgy
• The solution is to use external standard source files to validate the data and resolve any data conflicts.
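A small sketch of what validating against a standard source could look like. The STANDARD_FORMS dictionary is a hypothetical stand-in for an external standard file, and the standardize function is introduced here for illustration only. In practice the mapping would be loaded from a maintained reference source rather than hard-coded.

```python
import re

# Hypothetical stand-in for an external standard source file:
# every known variant (lowercased, punctuation removed) maps to one standard form.
STANDARD_FORMS = {
    "technology": "Technology",
    "tech": "Technology",
    "techno": "Technology",
    "tchnlgy": "Technology",
}

def standardize(value: str) -> str:
    """Return the standard form of a value, or the value unchanged if unknown."""
    cleaned = re.sub(r"[^a-z0-9]", "", value.lower())
    return STANDARD_FORMS.get(cleaned, value)

for variant in ["Technology", "Tech.", "Techno.", "Tchnlgy"]:
    print(variant, "->", standardize(variant))
```

Standardizing before key creation means all four variants contribute the same characters to the sort key, so the corresponding records end up next to each other.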

BSN Method: Problem with keys (e.g.)
If the contents of fields are not properly ordered, similar records will NOT fall in the same window.
Records as entered:
No  Name             Address                                      Gender
1   N. Jaffri, Syed  No. 420, Street 15, Chaklala 4, Rawalpindi   M
2   S. Noman         420, Scheme 4, Rwp                           M
3   Saiam Noor       Flat 5, Afshan Colony, Saidpur Road, Lahore  F
Example: Records 1 and 2 are similar but will occur far apart.
The solution is to TOKENize the fields, i.e. break them down further, and use the tokens in the different fields for sorting to fix the error.
Records after tokenizing:
No  Name           Address                                  Gender
1   Syed N Jaffri  420 15 4 Chaklala No Rawalpindi Street   M
2   Syed Noman     420 4 Rwp Scheme                         M
3   Saiam Noor     5 Afshan Colony Flat Lahore Road Saidpur F
Example: Using either the name or the address field, records 1 and 2 will fall close together.
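The tokenized addresses in the example look as if punctuation was dropped, numeric tokens were kept in their original order, and the remaining words were sorted alphabetically. That rule is inferred from the example rows, not stated in the lecture, so tokenize_address below is a hypothetical sketch of it.

```python
import re

def tokenize_address(address: str) -> list:
    """Split an address into tokens: numbers first (original order), then words A-Z."""
    tokens = re.findall(r"[A-Za-z0-9]+", address)   # drops commas, dots, '#'
    numbers = [t for t in tokens if t.isdigit()]
    words = sorted((t for t in tokens if not t.isdigit()), key=str.lower)
    return numbers + words

for raw in [
    "No. 420, Street 15, Chaklala 4, Rawalpindi",
    "420, Scheme 4, Rwp",
    "Flat 5, Afshan Colony, Saidpur Road, Lahore",
]:
    print(" ".join(tokenize_address(raw)))
```

After tokenizing, the addresses of records 1 and 2 both begin with 420, so a sort on the tokenized address places them near each other even though the raw strings begin with "No." and "420".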

BSN Method: Matching Candidates
Merging of records is a complex inferential process.
Example-1: Two persons whose names are spelled nearly, but not identically, the same have the exact same address. We infer they are the same person, e.g. Noma Abdullah and Noman Abdullah.
Example-2: Two persons have the same National ID number, but their names and addresses are completely different. Either they are the same person, who changed his name and moved, or the records represent different persons and the NID is incorrect for one of them.
Use of further information such as age, gender etc. can alter the decision.
Example-3: Given Noma-F and Noman-M, we could perhaps infer that Noma and Noman are siblings, i.e. sister and brother. Given Noma-30 and Noman-5, they could be mother and son.
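The inference in Example-1, and the way extra attributes overturn it in Example-3, can be written as a small rule. This is a sketch under assumptions: the 0.8 similarity threshold, the field names, the use of difflib.SequenceMatcher and the sample address are illustrative choices, not part of the lecture.

```python
from difflib import SequenceMatcher

def names_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if two names are spelled nearly the same (threshold is an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def same_person(r1: dict, r2: dict) -> bool:
    """Rule: similar name and identical address, unless further information conflicts."""
    if r1["address"] != r2["address"]:
        return False
    if not names_similar(r1["name"], r2["name"]):
        return False
    # Further information, when present on both records, can alter the decision.
    for field in ("gender", "age"):
        if field in r1 and field in r2 and r1[field] != r2[field]:
            return False
    return True

addr = "440, Munir Road, Lahore"  # sample address, chosen only for illustration
print(same_person({"name": "Noma Abdullah", "address": addr},
                  {"name": "Noman Abdullah", "address": addr}))
print(same_person({"name": "Noma Abdullah", "address": addr, "gender": "F"},
                  {"name": "Noman Abdullah", "address": addr, "gender": "M"}))
```

The first call has only name and address to go on and reports a match; the second adds conflicting genders and the same pair is rejected.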

Complexity Analysis of BSN Method
• Time Complexity: O(n log n), as long as the window size w stays small relative to log n
  • O(n) for key creation
  • O(n log n) for sorting
  • O(wn) for matching, where 2 ≤ w ≤ n
  • Constants vary a lot
• At least three passes are required on the dataset.
• The complexity of the matching rules and the window size are detrimental to running time.
• For large datasets, disk I/O is detrimental to performance.
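To make the O(wn) matching term concrete, the sketch below counts record comparisons for the windowed merge and for the naive approach of comparing every pair. The function names are introduced here for illustration; the counting follows directly from "every new record is compared with the previous w-1 records".

```python
def window_comparisons(n: int, w: int) -> int:
    """Comparisons made by the merge step: record i meets at most w-1 predecessors."""
    return sum(min(i, w - 1) for i in range(n))

def all_pairs_comparisons(n: int) -> int:
    """Comparisons needed when every record is compared with every other record."""
    return n * (n - 1) // 2

n, w = 1_000_000, 10
print(window_comparisons(n, w))    # roughly 9 million
print(all_pairs_comparisons(n))    # roughly 500 billion
```

With w equal to n the window covers the whole list and the two counts coincide, which is why the bound is stated for 2 ≤ w ≤ n.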

BSN Method: Equational Theory
To specify the inferences we need an equational theory.
• Logic is NOT based on string equivalence.
• Logic is based on domain equivalence.
• Requires a declarative rule language.