DWH-Ahsan Abdullah
1
Data Warehousing
Lecture-22
DQM: Quantifying Data Quality
Virtual University of Pakistan
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan101@yahoo.com
2
Background
Companies want to measure the quality of their data, which requires usable metrics.
They have to deal with both subjective perceptions and objective measurements.
Subjective data quality assessments reflect the needs and experiences of stakeholders.
Objective assessments can be task-independent or task-dependent.
Task-independent metrics reflect states of the data without contextual knowledge of the application.
Task-dependent metrics include the organization's business rules, regulations, etc.
We will discuss objective assessment and validation techniques (task-dependent and task-independent) and, if time permits, briefly cover subjective assessment too.
3
More on Characteristics of Data Quality
Data Quality Dimension: Definition
Believability: The extent to which data is regarded as true and credible.
Appropriate Amount of Data: The extent to which the volume of data is appropriate for the task at hand.
Timeliness: A measure of how current or up to date the data is.
Accessibility: The extent to which data is available, or easily and quickly retrievable.
Objectivity: The extent to which data is unbiased, unprejudiced, and impartial.
Interpretability: The extent to which data is in appropriate languages, symbols, and units, and the definitions are clear.
Uniqueness: The state of being only one of its kind or being without an equal or parallel.
4
Data Quality Assessment Techniques
• Ratios
• Min-Max
5
Data Quality Assessment Techniques
• Simple Ratios
  • Free-of-Error
  • Completeness
    • Schema
    • Column
    • Population
  • Consistency
    Ratio of violations to total number of consistency checks.
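The simple ratios on this slide can be sketched in a few lines of Python. This is only an illustration: the records, the column names and the age rule are made up, and the sketch assumes the common convention of subtracting the ratio of problems from 1, so that a higher score means better quality.

```python
def simple_ratio(undesirable, total):
    """Return 1 - undesirable/total, so 1.0 means no problems were found."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - undesirable / total


# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "city": "Lahore", "age": 34},
    {"id": 2, "city": None, "age": 51},
    {"id": 3, "city": "Karachi", "age": -7},   # violates the assumed age rule
    {"id": 4, "city": "Islamabad", "age": 28},
]

# Column completeness: share of non-missing values in the "city" column.
missing_city = sum(1 for r in records if r["city"] is None)
column_completeness = simple_ratio(missing_city, len(records))

# Consistency: one check per record against an assumed rule 0 <= age <= 120.
violations = sum(1 for r in records if not 0 <= r["age"] <= 120)
consistency = simple_ratio(violations, len(records))

print(column_completeness, consistency)   # 0.75 0.75
```

A free-of-error score would follow the same pattern, with the number of data units in error as the numerator.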
6
Data Quality Assessment Techniques
• Min-Max
  • Used for multiple values, based on aggregation of normalized individual values
  • Min is conservative, while max is liberal
  • Believability
    • Comparison with a standard or experience
    • Min {0.8, 0.7, 0.6} = 0.6
    • Weighted average
  • Appropriate Amount of Data
    Min {Dp/Dn, Dn/Dp}
Dp: Data units provided
Dn: Data units needed
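A minimal sketch of the min-based measures above, assuming the individual ratings are already normalized to the range 0 to 1. The function names, the weights and the sample figures are illustrative only and are not part of the lecture.

```python
def believability(ratings):
    """Conservative aggregate: the lowest normalized rating decides the score."""
    return min(ratings)


def weighted_average(ratings, weights):
    """Alternative aggregate when the relative importance of each rating is known."""
    return sum(r * w for r, w in zip(ratings, weights)) / sum(weights)


def appropriate_amount(provided, needed):
    """Min {Dp/Dn, Dn/Dp}: penalizes both too little and too much data."""
    return min(provided / needed, needed / provided)


print(believability([0.8, 0.7, 0.6]))                       # 0.6
print(weighted_average([0.8, 0.7, 0.6], [0.5, 0.3, 0.2]))   # about 0.73
print(appropriate_amount(50, 100))                          # 0.5, too little data
print(appropriate_amount(150, 100))                         # about 0.67, too much data
```

Taking the minimum of the two ratios keeps the score at or below 1 whether the data provided falls short of or exceeds what is needed.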
7
Data Quality Assessment Techniques
• Min-Max
  • Timeliness
    Max {0, 1 - C/V}, where C = A + Dt - It
  • Accessibility
    Max {0, 1 - Trd/Tru}
C: Currency
V: Volatility
A: Age
Dt: Delivery time
It: Input time (received in system)
Trd: Time from the user's request to delivery
Tru: Time from the user's request until the data is no longer useful
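The two max-based formulas can be worked through as follows. This is a sketch under stated assumptions: all times are expressed in the same unit (days here), and the sample numbers are invented for illustration.

```python
def currency(age, delivery_time, input_time):
    """C = A + Dt - It."""
    return age + delivery_time - input_time


def timeliness(currency_value, volatility):
    """Max {0, 1 - C/V}: falls to 0 once the data is older than it stays valid."""
    return max(0.0, 1.0 - currency_value / volatility)


def accessibility(t_request_to_delivery, t_request_to_useless):
    """Max {0, 1 - Trd/Tru}: falls to 0 if delivery takes longer than the data stays useful."""
    return max(0.0, 1.0 - t_request_to_delivery / t_request_to_useless)


# Hypothetical case: data was 1 day old when received on day 6, delivered on day 10,
# and stays valid for 20 days.
c = currency(age=1, delivery_time=10, input_time=6)
print(c)                        # 5
print(timeliness(c, 20))        # 0.75
print(timeliness(30, 20))       # 0.0, data is already stale
print(accessibility(2, 8))      # 0.75
```

The max with 0 keeps both scores in the range 0 to 1 even when the data arrives too late to be of any use.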
8
Data Quality Validation Techniques
• Referential Integrity (RI).
• Attribute domain.
• Using Data Quality Rules.
• Data Histogramming.
9
Referential Integrity Validation
Example: How many outstanding payments in the DWH are without a corresponding customer_ID in the customer table?
RI is checked every week or month, and the number of orphan records should go down with time.
This kind of periodic RI checking is peculiar to the DWH, not to operational systems.
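The orphan-record question above can be sketched with an in-memory SQLite database. The table layout, the column names other than customer_ID, and the sample rows are assumptions made for the illustration; a real DWH would run an equivalent query against its own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE payment (
        payment_id  INTEGER PRIMARY KEY,
        customer_id INTEGER,          -- deliberately no FOREIGN KEY constraint
        amount      REAL,
        status      TEXT
    );
    INSERT INTO customer VALUES (1, 'North'), (2, 'South');
    INSERT INTO payment VALUES
        (10, 1, 500.0, 'outstanding'),
        (11, 3, 250.0, 'outstanding'),   -- customer 3 does not exist
        (12, 9, 100.0, 'outstanding'),   -- customer 9 does not exist
        (13, 2,  75.0, 'paid');
""")

# Outstanding payments whose customer_id has no match in the customer table.
orphan_count = conn.execute("""
    SELECT COUNT(*)
    FROM payment p
    LEFT JOIN customer c ON c.customer_id = p.customer_id
    WHERE p.status = 'outstanding' AND c.customer_id IS NULL
""").fetchone()[0]

print(orphan_count)   # 2
```

Running such a count on a weekly or monthly schedule and recording the result gives the trend the slide asks for.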
10
Business Case for RI
Not very interesting to know the number of outstanding payments from a business point of view.
Interesting to know the actual amount outstanding, on a per-year basis, per-region basis…
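To make the business view concrete, the orphan check can report amounts instead of a bare count. The sketch below assumes, purely for illustration, that each payment row carries its own region and due year, and that every row in the payment table is outstanding; none of these names or figures come from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE payment (
        payment_id  INTEGER PRIMARY KEY,
        customer_id INTEGER,
        region      TEXT,
        due_year    INTEGER,
        amount      REAL
    );
    INSERT INTO customer VALUES (1), (2);
    INSERT INTO payment VALUES
        (10, 1, 'North', 2003, 500.0),
        (11, 3, 'North', 2003, 250.0),   -- orphan
        (12, 9, 'North', 2003, 100.0),   -- orphan
        (13, 7, 'South', 2004, 400.0);   -- orphan
""")

# Amount outstanding on orphan payments, broken down by year and region.
rows = conn.execute("""
    SELECT p.due_year, p.region, COUNT(*), SUM(p.amount)
    FROM payment p
    LEFT JOIN customer c ON c.customer_id = p.customer_id
    WHERE c.customer_id IS NULL
    GROUP BY p.due_year, p.region
    ORDER BY p.due_year, p.region
""").fetchall()

for year, region, n, amount in rows:
    print(year, region, n, amount)
```

The count per group is kept alongside the amount so that a single large orphan payment can be told apart from many small ones.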
11
Performance Case for RI
Cost of enforcing RI is very high for large volume DWH implementations, therefore:
• Should RI constraints be turned OFF in a data warehouse? or
• Should those records be “discarded” that violate one or more RI constraints?
12
3 steps of Attribute Domain Validation
Step-1: Capture and quantify the occurrences of each domain value within each coded attribute of the database.
Step-2: Compare actual content of attributes against set of valid values.
Step-3: Investigate exceptions to determine cause and impact of the data quality defects.
Note: Step 3 (above) applies to all defect types.
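Steps 1 and 2 lend themselves to a short sketch; step 3 remains a manual investigation. The coded attribute (a gender code), its set of valid values and the sample column below are all hypothetical.

```python
from collections import Counter

# Assumed set of valid values for a hypothetical coded attribute.
VALID_GENDER_CODES = {"M", "F"}

# Hypothetical column contents extracted from the database.
gender_column = ["M", "F", "F", "m", "M", "X", "", "F", "M", "1"]

# Step 1: capture and quantify the occurrences of each domain value.
value_counts = Counter(gender_column)

# Step 2: compare actual content against the set of valid values.
exceptions = {
    value: count
    for value, count in value_counts.items()
    if value not in VALID_GENDER_CODES
}

# Input to step 3: how large is the problem, and which values cause it?
defect_rate = sum(exceptions.values()) / len(gender_column)

print(value_counts)
print(exceptions)
print(defect_rate)   # 0.4
```

The exception list shows that some defects are probably formatting issues (a lowercase "m") while others need tracing back to the source ("X", "1", empty string).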
13
Attribute Domain Validation: What next?
What to do next?
• Trace back to source cause(s).
• Quantify business impact of the defects.
• Assess cost (and time frame) to fix and proceed accordingly.
14
Data Quality Rules
15
Statistical Validation using Histogram
[Figure: histogram over the range 1901 to 2000, with outliers and a spike of centenarians (age >= 100 yrs) marked.]
NOTE: For a certain environment, the above distribution may be perfectly normal.
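A histogram check of this kind can be sketched as below. The birth-year counts are invented, and the rule used to flag a spike (a bar more than three times the median bar height) is an assumption chosen for the illustration, not a rule from the lecture.

```python
from collections import Counter
from statistics import median

# Hypothetical birth-year data: 1901 is heavily over-represented.
counts_by_year = {1901: 40, 1960: 5, 1961: 6, 1962: 4, 1963: 5, 1964: 6}
birth_years = [year for year, n in counts_by_year.items() for _ in range(n)]

# Build the histogram: number of records per birth year.
histogram = Counter(birth_years)

# Assumed rule: a bar is a spike if it is more than SPIKE_FACTOR times the median bar.
SPIKE_FACTOR = 3
typical = median(histogram.values())
spikes = sorted(y for y, n in histogram.items() if n > SPIKE_FACTOR * typical)

# Plain-text rendering of the histogram.
for year in sorted(histogram):
    print(year, "#" * histogram[year])

print("spikes:", spikes)   # spikes: [1901]
```

A flagged year is only a candidate for investigation; as the note above says, in some environments such a distribution is perfectly normal.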

Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdf
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 

Lecture 22

  • 1. Data Warehousing, Lecture-22. DQM: Quantifying Data Quality. Virtual University of Pakistan. Ahsan Abdullah, Assoc. Prof. & Head, Center for Agro-Informatics Research (www.nu.edu.pk/cairindex.asp), National University of Computers & Emerging Sciences, Islamabad. Email: ahsan101@yahoo.com
  • 2. Background. Companies want to measure the quality of their data, which requires usable metrics. Both subjective perceptions and objective measurements have to be dealt with. Subjective data quality assessments reflect the needs and experiences of stakeholders. Objective assessments can be task-independent or task-dependent. Task-independent metrics reflect states of the data without contextual knowledge of the application. Task-dependent metrics include the organization's business rules, regulations, etc. We will discuss objective assessment and validation techniques (task-dependent and task-independent); if time permits, we will briefly cover subjective assessment too.
  • 3. More on Characteristics of Data Quality (data quality dimension: definition)
      Believability: The extent to which data is regarded as true and credible.
      Appropriate Amount of Data: The extent to which the volume of data is appropriate for the task at hand.
      Timeliness: A measure of how current or up to date the data is.
      Accessibility: The extent to which data is available, or easily and quickly retrievable.
      Objectivity: The extent to which data is unbiased, unprejudiced, and impartial.
      Interpretability: The extent to which data is in appropriate languages, symbols, and units, and the definitions are clear.
      Uniqueness: The state of being only one of its kind or being without an equal or parallel.
  • 4. Data Quality Assessment Techniques: Ratios; Min-Max.
  • 5. Data Quality Assessment Techniques: Simple Ratios
      Free-of-Error
      Completeness: Schema, Column, Population
      Consistency: Ratio of violations to total number of consistency checks.
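A minimal sketch of the simple-ratio idea, assuming the common convention of reporting one minus the ratio of undesirable outcomes to total outcomes, so that 1.0 means "no defects". The slide itself only states the violation ratio for consistency; the subtraction from 1, the function names, and the treatment of `None` and empty strings as missing values are assumptions made for illustration.

```python
def simple_ratio(undesirable: int, total: int) -> float:
    """Return 1 - undesirable/total, so that higher values mean better quality."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= undesirable <= total:
        raise ValueError("undesirable must lie between 0 and total")
    return 1.0 - undesirable / total


def column_completeness(values) -> float:
    """Column completeness: share of values in one column that are not missing.

    Treating None and "" as missing is an assumption of this sketch.
    """
    values = list(values)
    missing = sum(1 for v in values if v is None or v == "")
    return simple_ratio(missing, len(values))


def consistency(violations: int, checks: int) -> float:
    """Consistency: one minus the ratio of violations to consistency checks."""
    return simple_ratio(violations, checks)
```

For example, `consistency(5, 100)` scores a table in which 5 of 100 consistency checks failed, and `column_completeness(["a", None, "b", ""])` scores a column with two missing entries out of four.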
  • 6. Data Quality Assessment Techniques: Min-Max
      Used for multiple values, based on aggregation of normalized individual values.
      Min is conservative, while max is liberal.
      Believability: comparison with a standard or experience, e.g. Min {0.8, 0.7, 0.6} = 0.6; a weighted average is an alternative.
      Appropriate Amount of Data: Min {Dp/Dn, Dn/Dp}
      Key: Dp = data units provided; Dn = data units needed.
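The min-max aggregation above can be sketched as follows. The function names are hypothetical, the scores are assumed to be already normalized to the range 0 to 1, and the weighted average divides by the sum of the weights, which is one reasonable reading of "weighted average" rather than something the slide specifies.

```python
def conservative(scores):
    """Min aggregation: overall quality is only as good as the worst score."""
    return min(scores)


def liberal(scores):
    """Max aggregation: overall quality is as good as the best score."""
    return max(scores)


def weighted_average(scores, weights):
    """Weighted average of normalized scores (weights need not sum to 1)."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and equal in length")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def appropriate_amount(provided: float, needed: float) -> float:
    """Min {Dp/Dn, Dn/Dp}: penalizes both too little and too much data."""
    if provided <= 0 or needed <= 0:
        raise ValueError("provided and needed must be positive")
    return min(provided / needed, needed / provided)
```

With the believability ratings from the slide, `conservative([0.8, 0.7, 0.6])` gives 0.6. Providing either half or double the data needed yields the same appropriate-amount score of 0.5.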
  • 7. Data Quality Assessment Techniques: Min-Max
      Timeliness: Max {0, 1 - C/V}, where C = A + Dt - It
      Accessibility: Max {0, 1 - Trd/Tru}
      Key: C = currency; V = volatility; A = age; Dt = delivery time; It = input time (when the data was received in the system); Trd = time from the user's request to delivery; Tru = time from the user's request until the data is no longer useful.
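A small sketch of the timeliness and accessibility formulas, assuming all quantities are expressed in the same time unit (for example days) and that delivery time and input time are measured from a common reference point. The function and parameter names are illustrative only.

```python
def currency(age: float, delivery_time: float, input_time: float) -> float:
    """C = A + Dt - It."""
    return age + delivery_time - input_time


def timeliness(currency_value: float, volatility: float) -> float:
    """Max {0, 1 - C/V}; volatility is how long the data stays valid."""
    if volatility <= 0:
        raise ValueError("volatility must be positive")
    return max(0.0, 1.0 - currency_value / volatility)


def accessibility(t_rd: float, t_ru: float) -> float:
    """Max {0, 1 - Trd/Tru}."""
    if t_ru <= 0:
        raise ValueError("t_ru must be positive")
    return max(0.0, 1.0 - t_rd / t_ru)


# Data that was 1 day old on input, entered on day 8 and delivered on day 10.
c = currency(age=1, delivery_time=10, input_time=8)   # 1 + 10 - 8 = 3 days
score = timeliness(c, volatility=12)                  # 1 - 3/12 = 0.75
```

The `max` with 0 keeps the score from going negative once the data is older than its volatility, or once delivery takes longer than the period for which the data remains useful.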
  • 8. Data Quality Validation Techniques: Referential Integrity (RI); Attribute domain; Using Data Quality Rules; Data Histogramming.
  • 9. Referential Integrity Validation. Example: How many outstanding payments in the DWH are without a corresponding customer_ID in the customer table? RI is checked every week or month, and the number of orphan records should go down over time. This kind of RI check is peculiar to the DWH, not to operational systems.
  • 10. Business Case for RI. From a business point of view, it is not very interesting to know the number of outstanding payments. It is interesting to know the actual amount outstanding, on a per-year basis, a per-region basis, and so on.
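The orphan-record check and its business view can be sketched with an in-memory SQLite database. The table names `payments` and `customers`, the columns `pay_year` and `amount`, and the demo rows are all hypothetical; only `customer_id` echoes the slide. The query counts payments with no matching customer and also sums their amount per year, since the amount is what the business case asks for.

```python
import sqlite3


def orphan_payment_summary(conn):
    """Per year: number and total amount of payments with no matching customer."""
    query = """
        SELECT p.pay_year, COUNT(*) AS orphan_count, SUM(p.amount) AS orphan_amount
        FROM payments AS p
        LEFT JOIN customers AS c ON c.customer_id = p.customer_id
        WHERE c.customer_id IS NULL          -- no parent row: an orphan record
        GROUP BY p.pay_year
        ORDER BY p.pay_year
    """
    return conn.execute(query).fetchall()


def build_demo():
    """Create a tiny demo warehouse; all rows are made up for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE payments (
            payment_id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            pay_year INTEGER,
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'A'), (2, 'B');
        INSERT INTO payments VALUES
            (10, 1, 2023, 100.0),
            (11, 3, 2023, 250.0),
            (12, 4, 2024, 75.0),
            (13, 2, 2024, 40.0),
            (14, 3, 2024, 25.0);
    """)
    return conn


summary = orphan_payment_summary(build_demo())
```

Running the same query weekly or monthly and tracking `orphan_count` over time is one way to see whether the number of orphan records is going down.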
  • 11. Performance Case for RI. The cost of enforcing RI is very high for large-volume DWH implementations, therefore:
      Should RI constraints be turned OFF in a data warehouse? or
      Should those records be "discarded" that violate one or more RI constraints?
  • 12. 3 Steps of Attribute Domain Validation
      Step-1: Capture and quantify the occurrences of each domain value within each coded attribute of the database.
      Step-2: Compare the actual content of attributes against the set of valid values.
      Step-3: Investigate exceptions to determine the cause and impact of the data quality defects.
      Note: Step 3 (above) applies to all defect types.
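Steps 1 and 2 above lend themselves to a short sketch; step 3 is an investigation by people and is only supported, not performed, by the code. The function name, the sample gender codes, and the valid set are assumptions for illustration.

```python
from collections import Counter


def domain_profile(values, valid_values):
    """Profile one coded attribute against its set of valid domain values.

    Returns (counts, exceptions): counts has the occurrences of every value
    found (step 1); exceptions keeps only values outside the valid set (step 2).
    """
    counts = Counter(values)
    exceptions = {v: n for v, n in counts.items() if v not in valid_values}
    return counts, exceptions


# Hypothetical gender column with a stray code, a missing value and a case error.
column = ["M", "F", "M", "X", None, "F", "m"]
counts, exceptions = domain_profile(column, valid_values={"M", "F"})
# exceptions is the work list for step 3: trace each value back to its cause.
```

Keeping the occurrence counts with each exception helps with prioritization, since a frequent invalid code usually points to a systematic source problem rather than a one-off typing error.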
  • 13. Attribute Domain Validation: What to do next?
      Trace back to source cause(s).
      Quantify business impact of the defects.
      Assess cost (and time frame) to fix and proceed accordingly.
  • 14. Data Quality Rules
  • 15. Statistical Validation using Histogram. [Histogram of birth years from 1901 to 2000, showing a spike of centenarians (age >= 100 yrs) and marked outliers.] NOTE: For a certain environment, the above distribution may be perfectly normal.
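One possible way to surface such a spike automatically is to count records per birth year and flag years whose count is far above the typical count. This is a rough sketch under stated assumptions: the median as the "typical" count and the factor of 5 are arbitrary choices, not something the lecture prescribes, and the sample data is invented.

```python
from collections import Counter
from statistics import median


def birth_year_spikes(birth_years, factor=5.0):
    """Return birth years whose record count exceeds factor * median count."""
    counts = Counter(birth_years)        # the histogram: year -> number of records
    typical = median(counts.values())
    return sorted(year for year, n in counts.items() if n > factor * typical)


# Invented sample: 4 records for each year 1950..1959, plus 50 records for 1901,
# as might happen when 1901 is used as a default for an unknown date of birth.
sample = [year for year in range(1950, 1960) for _ in range(4)] + [1901] * 50
spikes = birth_year_spikes(sample)
```

A flagged year is a candidate for investigation, not proof of a defect; as the note on the slide says, in some environments such a distribution may be perfectly normal.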
