DATA QUALITY
THE HOLY GRAIL OF A DATA FLUENT ORGANIZATION
Balvinder Khurana
A little bit about me
Balvinder Khurana, Data Architect
Balvinder has 15 years of experience in building large-scale custom software and
big data platform solutions for complicated client problems. She has extensive
experience in the analysis, design, architecture, and development of web-based
enterprise systems and analytical systems using Agile practices like Scrum and XP.
Balvinder currently works as a Data Architect and Global Data Community Lead for
Thoughtworks.
We often hear organisations complaining about the same things
“We are not able to do the RCA of failures with the available data”
“We do not know if we can monetize our data”
“Our assortment team doesn’t trust the data our platform is providing and they are still using their old
Excel-based mechanism to do assortment planning”
“Often our POS systems go down and we lose an entire chunk of data”
“We can’t use the data we have to build a credit scoring algorithm since our existing data has many
income groups missing”
Garbage in,
Garbage Out!
Your analysis is as good
as the underlying data.
Systemic Data Quality Issues
Addressing data quality issues late in the process
A lot of time is spent addressing data quality issues in downstream systems or
data platforms rather than in source systems.
Missing context
As we move downstream, context gets lost, and fixing some data quality issues
creates further data quality issues.
Non-uniform definitions
The remedy for a data quality issue is often not agreed upon across teams and the
organisation, which leads to a lack of trust in the underlying data.
Point solutions
Data quality is viewed through the lens of the viewer, which leads to myopic,
tactical solutions that don't address the root cause.
Lack of strategy
Data quality is addressed tactically, not as an integrated process or framework
across an organisation's entire ecosystem of products and platforms.
Under-estimating the impact
Data quality issues not only affect downstream systems such as BI and predictive
dashboards; they are also a major reason teams lose trust in the data platform, and
hence become an impediment to change management.
DEFINING DATA QUALITY
Data quality refers to the ability of a given set of
data to fulfill an intended purpose.
It is the degree to which a set of inherent characteristics fulfill the
requirements of a system, determine the fitness for use of the data and
ensure its conformance to the requirements.
Multidimensionality in Data Quality
Dimensions of data quality: Uniqueness, Integrity, Consistency, Trustworthiness,
Standardisation, Usability, Availability, Reliability, Relevance, Class Balance
Multidimensionality and its tenets in Data Quality
[Figure: the data quality dimensions (Uniqueness, Reliability, Consistency,
Standardisation, Trustworthiness, Usability, Availability, Integrity, Relevance,
Class Balance), with Availability, Usability, Reliability, Relevance and
Standardisation annotated with their tenets]
Tenets of Data Quality
Availability: Accessibility, Timeliness, Authorization
Usability: Documentation, Credibility, Metadata, Statistical Bias, Readability
Reliability: Accuracy, Integrity, Consistency, Completeness, Auditability
Relevance: Fitness, Value, Freshness
Standardisation: Definability, Referenceability, Reproducibility, Interoperable
Big Data ecosystems bring in additional complexities
Volume: How do we have a comprehensive data quality control for PBs of data?
Variety: How do we cater to multiple types of data - structured, semi-structured and unstructured?
Velocity: How do we have a data quality measure in time to cater for high velocity?
Veracity: How do we handle inherent impreciseness and uncertainty?
Modern Data Platforms - A Conceptual view
Example - Pricing for a Retailer
How do we validate the success of our solution?
How do we validate and measure the correctness of the prices you recommend?
How do we validate our analytics accuracy?
How do we provide more transparency into data quality at every transformation stage in the data pipeline for the development teams?
How do we establish trust with data and insights that I am provisioning to my business teams?
How do we enable teams to discover and use the data that is being collected in various systems?
How do I ensure legal and regulatory compliance?
Who is responsible for ensuring data quality within various systems?
Data Quality Framework
Baseline data quality / sensible defaults
Intermediate data quality
Fit for purpose data quality, shown for each of: Reports/alerts, ML algorithms, Ad-hoc analysis, Preventive and corrective action, Downstream systems
Building blocks: Rules authoring, Rules execution engine, KPIs and dashboards, Metrics definition, Critical path
Data Discovery: interface to enable quick discovery and navigation of the right dataset
Metadata Service: metadata ingestion/update by APIs such as Business Glossary, Technical Metadata, Lineage, etc.
Repository & Indexing Service: metadata repositories, e.g. schemas/relations/lineage/indexing services
Ownership of DQ: owners and SLA/SLO/SLI to ensure data quality for each layer, including the business process
Domain Data Products and Data Quality
Sources: Legacy Data warehouse, Modern Pricing System, POS / Online Sales, Surveys / Web Crawlers
Data products and their quality rules:
Article: fixed values of Category
Article Price: Article Price cannot be negative; unique price point per article per channel
Sales: total amount can be negative (returns)
Competitor Prices: multiple price points per article per channel
Article Price & Sales: there should be no outliers in price
Consumers: Dynamic Pricing algorithm, Reports
Data product characteristics: Discoverable, Addressable, Self-describing, Trustworthy, Interoperable, Secure
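The rules on this slide (no negative article price, a unique price point per article per channel, no outliers in price) can be expressed as small, testable checks. The following is a minimal sketch only: the record layout (`article`, `channel`, `price` keys), the function names, and the choice of a modified z-score with a 3.5 threshold for outlier detection are assumptions made for illustration, not something the deck prescribes.

```python
from collections import Counter
from statistics import median


def check_article_prices(rows):
    """Return (rule_name, row) pairs for Article Price records that break a rule.

    Each row is a dict with hypothetical keys: article, channel, price.
    """
    violations = []
    # Rule from the slide: article price cannot be negative.
    for row in rows:
        if row["price"] < 0:
            violations.append(("negative_price", row))
    # Rule from the slide: unique price point per article per channel.
    counts = Counter((r["article"], r["channel"]) for r in rows)
    for row in rows:
        if counts[(row["article"], row["channel"])] > 1:
            violations.append(("duplicate_price_point", row))
    return violations


def price_outliers(prices, threshold=3.5):
    """Flag prices whose modified z-score exceeds the threshold.

    The modified z-score (based on the median absolute deviation) is an
    assumed technique; the slide only says there should be no outliers.
    """
    med = median(prices)
    mad = median(abs(p - med) for p in prices)
    if mad == 0:
        return []  # no spread to measure against
    return [p for p in prices if 0.6745 * abs(p - med) / mad > threshold]
```

Note that the Sales rule allows negative totals (returns), so a negativity check like the one above belongs to the Article Price product only, not to every monetary field.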
Data Quality across the pipeline
PRE-ETL VALIDATIONS: Format, Consistency, Completeness, Domain, Timeliness
POST-ETL & PRE-SIMULATION VALIDATIONS: Meta data, Data Transformation, Data Completeness, Business specific, Scope, Joins, Data copy
SIMULATION VALIDATIONS: Model Validation, Implementation, Computation
AGGREGATION VALIDATION: Hierarchy, Data Scope, Summarized values
UI VALIDATIONS: Representation, Format, Intuitive
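As an illustration of the pre-ETL stage, the sketch below applies four of the listed validations (completeness, format, domain, timeliness) to a single incoming sales record. The field names, the allowed channel values, and the 24-hour freshness limit are all hypothetical; a real pipeline would take them from its own schema and SLAs.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = ("article", "channel", "price", "sold_at")  # hypothetical schema
ALLOWED_CHANNELS = {"store", "online"}                        # hypothetical domain
MAX_AGE = timedelta(hours=24)                                 # hypothetical freshness limit


def pre_etl_issues(record, now):
    """Return a list of human-readable issues found in one raw record."""
    issues = []

    # Completeness: every required field must be present and non-empty.
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        issues.append("completeness: missing " + ", ".join(missing))
        return issues  # the remaining checks need all fields

    # Format: price must parse as a number.
    try:
        float(record["price"])
    except (TypeError, ValueError):
        issues.append("format: price is not numeric")

    # Format: sold_at must be an ISO 8601 timestamp with a UTC offset.
    try:
        sold_at = datetime.fromisoformat(record["sold_at"])
        if sold_at.tzinfo is None:
            raise ValueError("missing UTC offset")
    except (TypeError, ValueError):
        issues.append("format: sold_at is not an ISO 8601 timestamp with offset")
        sold_at = None

    # Domain: channel must be one of the known values.
    if record["channel"] not in ALLOWED_CHANNELS:
        issues.append("domain: unknown channel")

    # Timeliness: the record must not be older than the freshness limit.
    if sold_at is not None and now - sold_at > MAX_AGE:
        issues.append("timeliness: record older than 24 hours")

    return issues
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the timeliness check deterministic and easy to test.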
[Flowchart of the data quality assessment loop. Node labels: Goals of data collecting; Determining quality dimensions; Determining indicators/KPIs; Formulating evaluation baseline; Data collection; Data cleaning; Data quality assessment; Generating data quality report; Satisfy goals? (Yes); Data analysis and data mining; Output results; Output data; New goals; Quick pilot*. *Improve data quality]
Operationalising Data Quality
Identify, Quantify, Prioritize, Mitigate
Thank You!
Reach out to us:
@Balvinder


Editor's Notes

  1. 1 min
  2. 1 min
  3. Volume: A comprehensive data quality assessment is not possible. Data quality measures are approximate, defined in terms of probability and confidence intervals. Have a clear metric and metric definition for data quality.
     Variety: Data is also being collected from external sources: 1) data sets from the internet and mobile internet; 2) data from the Internet of Things; 3) data collected by various industries; 4) scientific experimental and observational data.
     Velocity: Data quality measures need to be relevant as well as feasible: sampling, data quality checks on the fly, structural validations instead of semantic ones.
     Veracity: How do we establish the trustworthiness of the data source? Otherwise, such data might skew the data quality report.
  4. The client is a huge retailer that has reached out to you to help price its entire assortment of articles based on a number of data points it collects: What is the demand for a product? What is the competitor price for the same product? Does the product have any seasonal value? SLO/SLA/governance teams. The business is losing trust in data. How do I ascertain my data quality? How much should we invest in data quality assurance? Untrustworthy results or inaccurate insights from analytics were due to a lack of quality in the data fed into systems such as AI and machine learning.
  5. Data quality framework: a hierarchical data quality framework from the perspective of data users. This framework consists of big data quality dimensions, quality characteristics, and quality indexes. ROI of data quality. Define, Measure, Analyze, Design/Improve, and Verify/Control.
  6. Data users and data providers are often different organizations with very different goals and operational procedures. Thus, it is no surprise that their notions of data quality are very different. In many cases, the data providers have no clue about the business use cases of data users (data providers might not even care about it, unless they are getting paid for the data). This disconnect between data source and data use is one of the prime reasons behind the data quality issues.
  7. Plan: The planning (or design) phase consists of defining the scope and business need, identifying stakeholders, clarifying business rules for data, and identifying business processes. The outcome of the planning phase should clearly communicate the objectives of the DQ work to relevant senior management and other stakeholders.
     Assess: This phase measures the existing data against business policies, data standards, and business practices. Profiling is a key component of this phase, and a lot has been written about profiling and assessment.
     Analyze: Typically, both quantitative and qualitative analytical techniques are used for a gap analysis between where data quality should be, based on what was defined in the planning phase, and where it actually is.
     Pilot: Organizations vary in how they handle the Pilot and Deploy phases, but we recommend a Pilot phase that focuses on the specific actions needed to improve data quality. The Pilot phase might also identify business processes that need to be adjusted to improve data quality on a sustained basis.
     Deploy: Based on the outcomes of the Pilot phase, the Deploy phase should focus on both business and technical solutions to improve data quality. Many organizations tend to focus on technical solutions only and ignore business solutions; in our opinion, that is a major mistake.
     Maintain: Processes and control mechanisms should be put in place to maintain the data quality efforts on an ongoing basis. Data governance plays an important role in making sure that data quality is maintained as part of a sustained program.
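The Volume note above says that at big-data scale a comprehensive assessment is not feasible and that quality measures become approximate, expressed with probabilities and confidence intervals. One way to read that is sketched below: estimate the completeness of a field from a random sample and report a 95% Wilson score interval around the estimate. The function names, the record layout, and the choice of the Wilson interval are assumptions for illustration; the notes do not name a specific method.

```python
import math
import random


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def estimate_completeness(records, field, sample_size, seed=0):
    """Estimate the share of records where `field` is populated, from a sample.

    `records` is a list of dicts (hypothetical layout); the seed makes the
    sample reproducible so repeated runs of the check are comparable.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    complete = sum(1 for r in sample if r.get(field) is not None)
    rate = complete / len(sample)
    return rate, wilson_interval(complete, len(sample))
```

The width of the interval shrinks as the sample grows, which gives a concrete way to trade assessment cost against confidence when deciding how much data to inspect.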