Data Warehousing
Lecture-19
ETL Detail: Data Cleansing
Virtual University of Pakistan
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan@yahoo.com
Background
• Other names: Also called data scrubbing or data cleaning.
• More than data arranging: A DWH is NOT just about arranging data; the data must also be clean for the overall health of the organization. We drink clean water!
• Big problem, big effect: An enormous problem, as most data is dirty. GIGO (garbage in, garbage out).
• Dirty is relative: "Dirty" means the data does not conform to the proper domain definition, and that definition varies from domain to domain.
• Paradox: A domain expert must be involved because detailed domain knowledge is required, which makes cleansing semi-automatic; yet it has to be automatic because the data sets are large.
• Data duplication: The original problem was removing duplicates within one system; it is compounded by duplicates coming from many systems.
Lighter Side of Dirty Data
• Year of birth 1995, current year 2005
• Born in 1986, hired in 1985
• Who would take it seriously? Computers would, while summarizing, aggregating, populating, etc.
• Small discrepancies become irrelevant for large averages, but what about sums, medians, maximum, minimum, etc.?
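The point about averages versus other aggregates can be checked with a small experiment. The sketch below is only an illustration: the ages, the single mistyped value of 300 and the `summarize` helper are made-up assumptions, not data from the lecture.

```python
# Hypothetical data: 1,000 employee ages, all 30, and a copy with one dirty value.
clean = [30] * 1000
dirty = clean[:-1] + [300]  # one age mistyped as 300

def summarize(values):
    """Return a few common aggregates for a list of numbers."""
    return {
        "mean": sum(values) / len(values),
        "sum": sum(values),
        "max": max(values),
        "min": min(values),
    }

clean_stats = summarize(clean)
dirty_stats = summarize(dirty)

# The mean moves by less than 1%, but the maximum is ten times too large.
for name in ("mean", "sum", "max", "min"):
    print(name, clean_stats[name], dirty_stats[name])
```

In this made-up case the mean shifts from 30 to 30.27, while the maximum jumps from 30 to 300, so a report built on the maximum would be badly wrong even though the average looks fine.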
Serious Side of Dirty Data
• Decision making at the government level on investment, e.g. in schools and then teachers, based on the birth rate. Wrong data results in over- or under-investment.
• Direct mail marketing: letters sent to wrong addresses are returned, or multiple letters go to the same address, causing loss of money, a bad reputation, and wrong identification of the marketing region.
3 Classes of Anomalies…
• Syntactically dirty data
  ◦ Lexical errors
  ◦ Irregularities
• Semantically dirty data
  ◦ Integrity constraint violation
  ◦ Business rule contradiction
  ◦ Duplication
• Coverage anomalies
  ◦ Missing attributes
  ◦ Missing records
3 Classes of Anomalies…
• Syntactically dirty data
  ◦ Lexical errors: discrepancies between the structure of the data items and the specified format of the stored values, e.g. a tuple with an unexpected number of columns (a mixed-up number of attributes).
  ◦ Irregularities: non-uniform use of units and values, e.g. an annual salary given without saying whether it is in US$ or PK Rs.
• Semantically dirty data
  ◦ Integrity constraint violation
  ◦ Contradiction, e.g. DoB > hiring date.
  ◦ Duplication
• Coverage, or lack of it
  ◦ Missing attribute
    ▪ Result of omissions while collecting the data.
    ▪ A constraint violation if there are null values for attributes where a NOT NULL constraint exists.
    ▪ The case is more complicated where no such constraint exists: one has to decide whether the value exists in the real world and has to be deduced here or not.
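One check per anomaly class can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical four-column schema (id, name, date of birth, hiring date); `EXPECTED_COLUMNS`, `find_anomalies` and the label strings are invented for the example.

```python
from datetime import date

EXPECTED_COLUMNS = 4  # assumed schema: id, name, date_of_birth, hire_date

def find_anomalies(row):
    """Return a list of anomaly labels for one raw record (a list of fields)."""
    # Lexical error: the tuple does not have the expected number of attributes.
    if len(row) != EXPECTED_COLUMNS:
        return ["lexical: unexpected number of columns"]

    problems = []
    _id, _name, dob, hired = row

    # Coverage anomaly: at least one attribute value is missing.
    if any(value is None or value == "" for value in row):
        problems.append("coverage: missing attribute")

    # Semantic contradiction: born after being hired.
    if dob and hired and dob > hired:
        problems.append("semantic: date of birth after hiring date")

    return problems

rows = [
    [1, "Ali", date(1986, 1, 1), date(1985, 6, 1)],  # born after hiring
    [2, "Sara", None, date(2000, 1, 1)],             # missing date of birth
    [3, "Omar"],                                     # too few columns
    [4, "Zara", date(1970, 1, 1), date(1995, 1, 1)], # clean
]
for row in rows:
    print(row[0], find_anomalies(row))
```

Irregularities such as an unstated currency cannot be detected from the value alone, which is one reason domain knowledge is still needed.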
Why Coverage Anomalies?
• Equipment malfunction (bar code reader, keyboard, etc.).
• Data inconsistent with other recorded data and thus deleted.
• Data not entered due to misunderstanding or illegibility.
• Data not considered important at the time of entry (e.g. Y2K).
Handling missing data
• Dropping records.
• "Manually" filling missing values.
• Using a global constant as filler.
• Using the attribute mean (or median) as filler.
• Using the most probable value as filler.
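The automatic strategies listed for missing data can be sketched as follows. This is an illustration under stated assumptions: missing values are represented as `None`, the function names are hypothetical, and "most probable value" is approximated by the most frequent observed value, which is only one simple way to estimate it.

```python
from statistics import mean, median, mode

def drop_missing(values):
    """Strategy 1: drop the records whose value is missing."""
    return [v for v in values if v is not None]

def fill_missing(values, strategy="mean", constant=0):
    """Fill None entries using a global constant, the mean, the median,
    or the most frequent value (a stand-in for 'most probable')."""
    present = [v for v in values if v is not None]
    if strategy == "constant":
        filler = constant
    elif strategy == "mean":
        filler = mean(present)
    elif strategy == "median":
        filler = median(present)
    elif strategy == "most_frequent":
        filler = mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [filler if v is None else v for v in values]

salaries = [10, 20, None, 20, 50]  # made-up values, one missing
print(drop_missing(salaries))
for strategy in ("constant", "mean", "median", "most_frequent"):
    print(strategy, fill_missing(salaries, strategy))
```

Each choice biases later aggregates differently: dropping shrinks the sample, a constant distorts the mean, and mean or median filling reduces the apparent variance.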
Key Based Classification of Problems
• Primary key problems
• Non-primary key problems
Primary key problems
• Same PK but different data.
• Same entity with different keys.
• PK in one system but not in the other.
• Same PK but in different formats.
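Three of the primary key problems (same PK in different formats, same PK with different data, PK present in only one system) can be detected mechanically once keys are brought to a common format. The sketch below assumes each source is a dictionary from key to record; `normalize_key`, `compare_sources` and the sample keys are hypothetical.

```python
import re

def normalize_key(raw):
    """Remove separators and case differences, so 'EMP-001' and 'emp001' match."""
    return re.sub(r"[^0-9A-Za-z]", "", str(raw)).upper()

def compare_sources(source_a, source_b):
    """Compare two key -> record mappings after normalizing the key format."""
    a = {normalize_key(k): v for k, v in source_a.items()}
    b = {normalize_key(k): v for k, v in source_b.items()}
    return {
        # Same PK but different data.
        "conflicting": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
        # PK in one system but not in the other.
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
    }

hr_system = {"EMP-001": "Ali Khan", "EMP-002": "Sara"}
payroll_system = {"emp001": "Ali Khan", "emp002": "Sarah", "emp003": "Omar"}
print(compare_sources(hr_system, payroll_system))
```

The remaining problem, the same entity stored under different keys, cannot be found by comparing keys at all; it needs matching on non-key attributes such as name and address.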
Non-primary key problems
• Different encodings in different sources.
• Multiple ways to represent the same information.
• Sources might contain invalid data.
• Two fields with different data but the same name.
• Required fields left blank.
• Data erroneous or incomplete.
• Data contains null values.
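Different encodings of the same information are usually reconciled with a lookup table that maps every source code to one canonical value. The sketch below uses a gender field as an example; the specific codes (including 1 for male and 0 for female) and the "INVALID" marker are assumptions made for illustration, not encodings given in the lecture.

```python
# Assumed source encodings, all mapped to the canonical codes "M" and "F".
GENDER_CODES = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "0": "F",
}

def standardize_gender(raw):
    """Map a source value to a canonical code.

    Returns None for blank or null input and "INVALID" for unknown codes,
    so that missing and invalid data can be handled separately later.
    """
    if raw is None:
        return None
    key = str(raw).strip().lower()
    if key == "":
        return None
    return GENDER_CODES.get(key, "INVALID")

for value in [" Male ", "F", 1, 0, "", None, "x"]:
    print(repr(value), "->", standardize_gender(value))
```

Keeping null and invalid results distinct matters because the two cases call for different handling: one is a coverage problem, the other a wrong value.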
Automatic Data Cleansing…
1. Statistical
2. Pattern-based
3. Clustering
4. Association rules
Automatic Data Cleansing…
1. Statistical methods
   • Identify outlier fields and records using values such as the mean, standard deviation and range, based on Chebyshev's theorem.
2. Pattern-based
   • Identify outlier fields and records that do not conform to existing patterns in the data.
   • A pattern is defined by a group of records that have similar characteristics ("behavior") for p% of the fields in the data set, where p is a user-defined value (usually above 90).
   • Techniques such as partitioning, classification, and clustering can be used to identify patterns that apply to most records.
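Chebyshev's theorem states that, for any distribution, at least 1 - 1/k^2 of the values lie within k standard deviations of the mean (for k > 1). A statistical check can therefore flag values outside that band as candidate outliers. The sketch below is a minimal illustration; the function name, the default k = 3 and the sample data are assumptions.

```python
from statistics import mean, pstdev

def chebyshev_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean.

    By Chebyshev's theorem at most 1/k**2 of the values can fall outside
    this band, whatever the distribution, so they are candidate outliers.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) > k * sigma]

# Made-up field values: nineteen plausible entries and one suspicious one.
monthly_units = [50] * 19 + [500]
print(chebyshev_outliers(monthly_units))
```

A flagged value is only a candidate: it may be a genuine extreme, so the result is typically reviewed rather than deleted automatically.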
Automatic Data Cleansing
3. Clustering
   • Identify outlier records using clustering based on Euclidean (or other) distance.
   • Clustering the entire record space can reveal outliers that are not identified by field-level inspection.
   • The main drawback of this method is computational time.
4. Association rules
   • Association rules with high confidence and support define a different kind of pattern.
   • Records that do not follow these rules are considered outliers.
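The clustering idea can be sketched with a very simple scheme: records within a Euclidean distance `eps` of each other are chained into the same cluster, and clusters smaller than `min_size` are reported as outliers. This single-linkage grouping is just one possible choice of clustering method, picked here for brevity; the function name, the parameters and the sample records are assumptions.

```python
from math import dist  # Euclidean distance, available since Python 3.8

def cluster_outliers(records, eps, min_size=2):
    """Group records whose distance is at most eps (single linkage) and
    return the indices of records that end up in clusters smaller than min_size."""
    unvisited = set(range(len(records)))
    outliers = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # Every pairwise comparison costs time: this is the O(n^2) drawback.
            near = [j for j in unvisited if dist(records[current], records[j]) <= eps]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        if len(cluster) < min_size:
            outliers.extend(cluster)
    return sorted(outliers)

# Made-up (age, years_of_experience) records. In the last record each field is
# within the range seen elsewhere, but the combination is isolated.
records = [(25, 3), (26, 4), (27, 5), (50, 28), (51, 29), (52, 30), (26, 29)]
print(cluster_outliers(records, eps=3))
```

The last record would pass a field-by-field range check, since ages around 26 and experience around 29 both occur in the data, yet it is far from every cluster in the combined record space.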
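For association rules, the check amounts to measuring a rule's support and confidence and, if both are high, reporting the records that satisfy the rule's left-hand side but not its right-hand side. The sketch below evaluates one given candidate rule rather than mining rules from the data; the function name, the thresholds and the sample records are assumptions.

```python
def rule_violations(records, antecedent, consequent,
                    min_support=0.3, min_confidence=0.9):
    """Return records that match `antecedent` but break `consequent`,
    provided the rule antecedent -> consequent is strong enough."""
    def matches(record, condition):
        return all(record.get(field) == value for field, value in condition.items())

    covered = [r for r in records if matches(r, antecedent)]
    both = [r for r in covered if matches(r, consequent)]

    # support: share of all records satisfying both sides of the rule.
    support = len(both) / len(records) if records else 0.0
    # confidence: share of antecedent matches that also satisfy the consequent.
    confidence = len(both) / len(covered) if covered else 0.0

    if support < min_support or confidence < min_confidence:
        return []  # rule too weak to justify flagging anything
    return [r for r in covered if not matches(r, consequent)]

# Made-up address records.
records = (
    [{"city": "Lahore", "country": "PK"}] * 10
    + [{"city": "Lahore", "country": "UK"}]
    + [{"city": "London", "country": "UK"}] * 4
)
print(rule_violations(records, {"city": "Lahore"}, {"country": "PK"}))
```

Here the rule city = Lahore implies country = PK has support 10/15 and confidence 10/11, so the single Lahore record with country UK is reported as an outlier.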

Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 

Lecture 19

  • 1. Data Warehousing, Lecture-19. ETL Detail: Data Cleansing. Virtual University of Pakistan. Ahsan Abdullah, Assoc. Prof. & Head, Center for Agro-Informatics Research (www.nu.edu.pk/cairindex.asp), National University of Computers & Emerging Sciences, Islamabad. Email: ahsan@yahoo.com
  • 2. ETL Detail: Data Cleansing
  • 3. Background
      - Other names: also called data scrubbing or data cleaning.
      - More than data arranging: a DWH is NOT just about arranging data; the data should be clean for the overall health of the organization. We drink clean water!
      - Big problem, big effect: an enormous problem, as most data is dirty. GIGO (garbage in, garbage out).
      - Dirty is relative: dirty means the data does not conform to the proper domain definition, and that definition varies from domain to domain.
      - Paradox: cleansing must involve a domain expert, as detailed domain knowledge is required, so it becomes semi-automatic; yet it has to be automatic because of the large data sets.
      - Data duplication: the original problem was removing duplicates within one system, now compounded by duplicates coming from many systems.
  • 4. Lighter Side of Dirty Data
      - Year of birth 1995, current year 2005.
      - Born in 1986, hired in 1985.
      - Who would take it seriously? Computers would, while summarizing, aggregating, populating, etc.
      - Small discrepancies become irrelevant for large averages, but what about sums, medians, maximum, minimum, etc.?
  • 5. Serious Side of Dirty Data
      - Government-level investment decisions on schools, and then teachers, are based on the birth rate. Wrong data results in over- or under-investment.
      - Direct mail marketing: letters sent to wrong addresses are returned, or multiple letters go to the same address, causing loss of money, a bad reputation, and wrong identification of the marketing region.
  • 6. 3 Classes of Anomalies…
      - Syntactically dirty data
          - Lexical errors
          - Irregularities
      - Semantically dirty data
          - Integrity constraint violation
          - Business rule contradiction
          - Duplication
      - Coverage anomalies
          - Missing attributes
          - Missing records
  • 7. 3 Classes of Anomalies…
      - Syntactically dirty data
          - Lexical errors: discrepancies between the structure of the data items and the specified format of stored values, e.g. a tuple with an unexpected number of columns (mixed-up number of attributes).
          - Irregularities: non-uniform use of units and values, e.g. giving an annual salary without stating whether it is in US$ or PK Rs.
      - Semantically dirty data
          - Integrity constraint violation
          - Contradiction, e.g. DoB > hiring date.
          - Duplication
  • 8. 3 Classes of Anomalies…
      - Coverage, or lack of it
          - Missing attribute: the result of omissions while collecting the data.
          - It is a constraint violation if an attribute holds null values where a NOT NULL constraint exists.
          - The case is more complicated where no such constraint exists: we have to decide whether the value exists in the real world and has to be deduced here or not.
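The semantic and coverage anomalies listed above can be checked mechanically once the rules are written down. The following is a minimal sketch, not part of the lecture: the record layout, the field names (`dob`, `hired`, `salary`) and the choice of which attributes count as NOT NULL are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical employee records; field names are illustrative only.
records = [
    {"id": 1, "dob": date(1980, 5, 1), "hired": date(2004, 3, 15), "salary": 52000},
    {"id": 2, "dob": date(1986, 1, 9), "hired": date(1985, 7, 1), "salary": 48000},
    {"id": 3, "dob": date(1975, 11, 30), "hired": date(1999, 9, 1), "salary": None},
]

# Assumed set of attributes that carry a NOT NULL constraint.
REQUIRED = ("dob", "hired", "salary")


def find_anomalies(rec):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Coverage anomaly: a required attribute is missing or null.
    for field in REQUIRED:
        if rec.get(field) is None:
            problems.append(f"missing {field}")
    # Semantic anomaly (contradiction): born after being hired.
    dob, hired = rec.get("dob"), rec.get("hired")
    if dob is not None and hired is not None and dob > hired:
        problems.append("dob after hiring date")
    return problems


report = {r["id"]: find_anomalies(r) for r in records}
for rec_id, problems in report.items():
    print(rec_id, problems or "clean")
```

Record 2 reproduces the "born in 1986, hired in 1985" case from the earlier slide; deciding what to do with a flagged record still needs the domain expert.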
  • 9. Why Coverage Anomalies?
      - Equipment malfunction (bar code reader, keyboard, etc.).
      - Data inconsistent with other recorded data and thus deleted.
      - Data not entered due to misunderstanding or illegibility.
      - Data not considered important at the time of entry (e.g. Y2K).
  • 10. Handling Missing Data
      - Dropping records.
      - "Manually" filling missing values.
      - Using a global constant as filler.
      - Using the attribute mean (or median) as filler.
      - Using the most probable value as filler.
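The automatic options in this list can be compared side by side on a small column. This is a hedged sketch using only the Python standard library; the `ages` data and the `-1` sentinel are made up, and the mode is used here as a simple stand-in for "most probable value" (a real system might instead predict it from other attributes).

```python
from statistics import mean, median, mode

# Hypothetical attribute column with two missing values.
ages = [34, None, 29, 41, None, 29]
known = [a for a in ages if a is not None]


def fill(values, filler):
    """Replace every missing value with the given filler."""
    return [filler if v is None else v for v in values]


dropped = known                           # option 1: drop incomplete records
with_constant = fill(ages, -1)            # option 3: global constant (assumed sentinel)
with_mean = fill(ages, mean(known))       # option 4a: attribute mean
with_median = fill(ages, median(known))   # option 4b: attribute median
with_mode = fill(ages, mode(known))       # option 5: most frequent value as a proxy

print(with_mean)
print(with_median)
print(with_mode)
```

Each filler distorts later aggregates differently: the mean preserves the column average, while a global constant such as `-1` must be excluded from any sums or averages computed downstream.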
  • 11. Key Based Classification of Problems
      - Primary key problems
      - Non-primary key problems
  • 12. Primary Key Problems
      - Same PK but different data.
      - Same entity with different keys.
      - PK in one system but not in the other.
      - Same PK but in different formats.
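Three of these primary key problems can be detected by normalizing the keys and comparing the two sources. The sketch below is illustrative only: the two source dictionaries and the normalization rule (keys differ only by case, separators and leading zeros) are assumptions, and the fourth problem, the same entity under different keys, needs record matching that is not shown here.

```python
def normalize_key(raw):
    """Bring a key into one format (assumed rule: ignore case, separators, leading zeros)."""
    cleaned = "".join(ch for ch in str(raw) if ch.isalnum()).upper()
    return cleaned.lstrip("0")


# Hypothetical customer tables from two source systems, keyed by primary key.
system_a = {"00123": "Ali Khan", "C-77": "Sara Malik", "900": "Omar Farooq"}
system_b = {"123": "Ali Khan", "c77": "S. Malik"}

a = {normalize_key(k): v for k, v in system_a.items()}
b = {normalize_key(k): v for k, v in system_b.items()}

# PK in one system but not in the other.
only_in_a = sorted(set(a) - set(b))
# Same PK but different data.
conflicts = sorted(k for k in set(a) & set(b) if a[k] != b[k])

print("only in A:", only_in_a)
print("conflicting data:", conflicts)
```

Without the normalization step, "00123" and "123" would be reported as two different customers, which is the "same PK but in different formats" problem.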
  • 13. Non-Primary Key Problems…
      - Different encoding in different sources.
      - Multiple ways to represent the same information.
      - Sources might contain invalid data.
      - Two fields with different data but the same name.
  • 14. Non-Primary Key Problems
      - Required fields left blank.
      - Data erroneous or incomplete.
      - Data contains null values.
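Different encodings, invalid values and blank fields are commonly handled with a lookup table that maps every known source encoding onto one target code. A minimal sketch, assuming a gender attribute and made-up source encodings (the `GENDER_MAP` table and the `"INVALID"` marker are not from the lecture):

```python
# Assumed mapping from the encodings seen in different sources to one standard code.
GENDER_MAP = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "0": "F",
}


def standardize_gender(raw):
    """Return 'M' or 'F', None for blank/null input, or 'INVALID' for unknown codes."""
    if raw is None:
        return None
    key = str(raw).strip().lower()
    if key == "":
        return None
    return GENDER_MAP.get(key, "INVALID")


source_values = [" Male ", 0, "F", "", None, "x"]
print([standardize_gender(v) for v in source_values])
```

Keeping blank and invalid values distinguishable lets the ETL process route them differently, for example filling the former and rejecting the latter.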
  • 15. Automatic Data Cleansing…
      1. Statistical
      2. Pattern-based
      3. Clustering
      4. Association rules
  • 16. Automatic Data Cleansing…
      1. Statistical methods
          - Identify outlier fields and records using the values of mean, standard deviation, range, etc., based on Chebyshev's theorem.
      2. Pattern-based
          - Identify outlier fields and records that do not conform to existing patterns in the data.
          - A pattern is defined by a group of records that have similar characteristics ("behavior") for p% of the fields in the data set, where p is a user-defined value (usually above 90).
          - Techniques such as partitioning, classification, and clustering can be used to identify patterns that apply to most records.
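Chebyshev's theorem states that, for any distribution, at least 1 - 1/k² of the values lie within k standard deviations of the mean, so values further out are candidates for inspection. The following is a minimal sketch of that statistical check; the function name, the salary figures and the choice k = 2 are assumptions for illustration.

```python
from statistics import mean, pstdev


def chebyshev_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean.

    By Chebyshev's theorem at most 1/k**2 of any data set can fall out there.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) > k * sigma]


# Hypothetical salaries (in thousands) with one suspicious entry.
salaries = [50, 52, 48, 51, 49, 50, 53, 47, 50, 500]
print(chebyshev_outliers(salaries, k=2))
```

On very small samples a large k can never fire, because a single value cannot sit more than sqrt(n - 1) population standard deviations from the mean; that is why the example uses k = 2 for ten values.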
  • 17. Automatic Data Cleansing
      3. Clustering
          - Identify outlier records using clustering based on Euclidean (or other) distance.
          - Clustering the entire record space can reveal outliers that are not identified by field-level inspection.
          - The main drawback of this method is computational time.
      4. Association rules
          - Association rules with high confidence and support define a different kind of pattern.
          - Records that do not follow these rules are considered outliers.
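The association-rule idea can be sketched for a single rule: compute its support and confidence, and if both are high enough, treat the records that match the antecedent but not the consequent as outliers. Everything named here (the `rule_stats` helper, the city and province fields, the sample rows) is hypothetical; mining the rules themselves, for example with Apriori, is not shown.

```python
def matches(record, condition):
    """True if the record has every attribute value listed in the condition."""
    return all(record.get(k) == v for k, v in condition.items())


def rule_stats(records, antecedent, consequent):
    """Return (support, confidence, violators) for the rule antecedent -> consequent."""
    with_a = [r for r in records if matches(r, antecedent)]
    with_both = [r for r in with_a if matches(r, consequent)]
    violators = [r for r in with_a if not matches(r, consequent)]
    support = len(with_both) / len(records)
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence, violators


# Hypothetical address records.
rows = [
    {"id": 1, "city": "Lahore", "province": "Punjab"},
    {"id": 2, "city": "Lahore", "province": "Punjab"},
    {"id": 3, "city": "Lahore", "province": "Punjab"},
    {"id": 4, "city": "Lahore", "province": "Punjab"},
    {"id": 5, "city": "Lahore", "province": "Sindh"},
    {"id": 6, "city": "Karachi", "province": "Sindh"},
]

support, confidence, violators = rule_stats(
    rows, {"city": "Lahore"}, {"province": "Punjab"}
)
print(support, confidence, [r["id"] for r in violators])
```

Record 5 violates a rule that holds for 80% of the Lahore records, so it would be flagged for review rather than corrected automatically.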
