Most organisations suspect that they have poor data quality, but don’t know how to measure it or what to do about it. Teams of data scientists, analysts, and ETL developers either blindly accept “garbage in, garbage out”, or worse still, “cleanse” data to fit their own limited perspectives. DataOps is a systematic approach to measuring data quality and planning mitigations for bad data.
3. Data Teams
• Are focused on urgent, unplanned work
• Traditionally operate the systems they develop, because they don’t perceive that hand-off is possible
• Scant theory; what little writing exists is technology-focused
4. The DataOps Manifesto
Whether referred to as data science, data engineering, data management, big data, business intelligence, or the like, through our work we have come to value in analytics:
https://www.dataopsmanifesto.org/
10. Do you have Bad Data?
In the absence of information, rumour becomes widely believed. Rumour is biased toward emotion, which in workplaces tends to be negative.
11. What problems does poor data quality cause?
• Data / ETL pipelines crash, resulting in unavailable, stale, or incorrect data
• More than 80% of data scientists’ time is spent collecting data
• Incorrect data is used for decisions, or is published
• Doubts about data hurt morale and discourage evidence-based decision making
12. What is Data Quality?
Data quality is good when people who inspect data see what they expect.
Data quality is bad when people are surprised by the data they see.
15. Document data characteristics and train people to know them
If you only learn one thing today: in the absence of training and documentation, most people will be surprised by the data even when nothing is wrong.
16. What do we want?
Evidence Based Decision Making
When do we want it?
After Peer Review
17. Data Testing
• Accuracy, consistency, and completeness tests
• Run on records and on relationships
• Relationship consistency tests
18. Test Objectives
• Accuracy - is it true?
• Consistency - does it obey the rules?
• Completeness - what is missing?
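To make the three objectives concrete, here is a minimal Python sketch of all three applied to a single record; the employee-style field names and the rules are illustrative assumptions, not from the deck.

```python
# Minimal sketch of the three test objectives on one (hypothetical) record.
record = {"id": 42, "hire_date": "2019-03-01", "end_date": None, "salary": 55000}

# Completeness - what is missing?
missing = [k for k in ("id", "hire_date", "salary") if record.get(k) is None]

# Consistency - does it obey the rules? (e.g. end_date must not precede hire_date)
consistent = (record["end_date"] is None
              or record["end_date"] >= record["hire_date"])

# Accuracy - is it true? This needs an external source of truth, e.g.
# comparing against a system of record or phoning the person; not shown here.
print(f"missing fields: {missing}, consistent: {consistent}")
```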
19. Data Test Scopes
• Within a record (SQL row, NoSQL document, etc.)
• Within a set (SQL table, etc.)
• Within an application (HRIS, ERP, etc.)*
• Across the organisation*
* these scopes are combinatorial
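To illustrate the first two scopes: a within-record test is a predicate over one row, while a within-set test needs the whole table. Here is a minimal sketch using Python’s built-in sqlite3; the staff schema and its rules are made-up examples.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staff (id INTEGER, email TEXT, hired DATE, ended DATE);
    INSERT INTO staff VALUES (1, 'a@x.org', '2020-01-01', NULL),
                             (1, 'b@x.org', '2021-06-01', '2020-01-01');
""")

# Within a record: each row must obey its own rules (ended must follow hired).
bad_rows = con.execute(
    "SELECT COUNT(*) FROM staff WHERE ended IS NOT NULL AND ended < hired"
).fetchone()[0]

# Within a set: rules that only make sense across rows (ids must be unique).
dup_ids = con.execute(
    "SELECT COUNT(*) FROM (SELECT id FROM staff GROUP BY id HAVING COUNT(*) > 1)"
).fetchone()[0]

print(f"within-record violations: {bad_rows}, duplicate ids: {dup_ids}")
```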
21. Monitor Data as if it were Infrastructure

                 When                            Where       Who
Code             Event-driven (commit / PR)      Test        Developers fix errors
Infrastructure   Constantly, at tight intervals  Production  Automated repair, failover to Ops
Data             Constantly                      Production  Automated repair, failover to data steward
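A minimal sketch of the last row of this table: data checks run constantly in production, try an automated repair first, and fail over to a human data steward. The function names (run_checks, auto_repair, notify_steward) and the interval are placeholder assumptions.

```python
import time

def run_checks():                 # placeholder: returns names of failed checks
    return []

def auto_repair(check) -> bool:   # placeholder: True if the repair succeeded
    return False

def notify_steward(check):        # placeholder: page/email the data steward
    print(f"escalating {check} to the data steward")

CHECK_INTERVAL_SECONDS = 300      # "constantly", at a tight interval

while True:                       # monitors run for the life of the system
    for failed in run_checks():
        if not auto_repair(failed):   # automated repair first...
            notify_steward(failed)    # ...failover to the data steward
    time.sleep(CHECK_INTERVAL_SECONDS)
```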
23. Pipelines
• Monitor each step in the pipeline
• If steps are idempotent, kill and retry (once) any step whose measures are anomalous, as sketched below
• Raise an incident if the retry is also anomalous
• Insert data quality gates between steps during test design and in response to incidents
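The kill-and-retry rule might look like the following Python sketch; run_step, measures_are_anomalous, and raise_incident are assumed placeholder names, not part of the talk.

```python
def run_step(step):                   # placeholder: execute one pipeline step
    return {"duration_s": 1.0, "rows_in": 100, "rows_out": 100}

def measures_are_anomalous(step, m):  # placeholder anomaly test
    return m["rows_out"] == 0

def raise_incident(step, m):          # placeholder incident hook
    print(f"incident: {step} still anomalous after retry: {m}")

def run_pipeline(steps):
    """Run idempotent steps in order, retrying an anomalous step once."""
    for step in steps:
        m = run_step(step)                    # collect the step's measures
        if measures_are_anomalous(step, m):
            m = run_step(step)                # retry once (safe: idempotent)
            if measures_are_anomalous(step, m):
                raise_incident(step, m)       # retry also anomalous
                return

run_pipeline(["extract", "transform", "load"])
```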
25. Pipeline Measures
For each step in a data pipeline, record:
• Duration
• Cost (BUFFER_GETS, PAGE_READS, CPU Seconds)
• Records in
• Records out
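One way to capture these measures, sketched in Python: a wrapper that times each step and counts records in and out. Engine-specific cost counters such as BUFFER_GETS or PAGE_READS would be read from the database session and are omitted here.

```python
import time
from functools import wraps

def measured(step_fn):
    """Record duration and record counts for one pipeline step."""
    @wraps(step_fn)
    def wrapper(records):
        start = time.monotonic()
        out = step_fn(records)
        print(f"{step_fn.__name__}: {time.monotonic() - start:.3f}s, "
              f"records in={len(records)}, out={len(out)}")
        return out
    return wrapper

@measured
def drop_incomplete(records):
    # Example step: keep only records with a non-null id.
    return [r for r in records if r.get("id") is not None]

drop_incomplete([{"id": 1}, {"id": None}])
```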
26. Quality Measures
• Accuracy and completeness checks report the number of errors and the error % for every scope and time period
• Consistency checks report the number of errors and the error % for each rule and time period
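A sketch of the consistency measure in Python: error count and error percentage per rule over one time period of records. The rule set and the records are illustrative assumptions.

```python
from collections import Counter

# Hypothetical rules: each maps a rule name to a predicate over a record.
RULES = {
    "salary_positive": lambda r: r["salary"] > 0,
    "end_after_start": lambda r: r["end"] is None or r["end"] >= r["start"],
}

def consistency_measures(records):
    """Errors and error % for each rule over one time period of records."""
    errors = Counter()
    for r in records:
        for name, rule in RULES.items():
            if not rule(r):
                errors[name] += 1
    n = len(records)
    return {name: (errors[name], 100.0 * errors[name] / n) for name in RULES}

period = [{"salary": 100, "start": "2020-01-01", "end": None},
          {"salary": -5, "start": "2020-01-01", "end": "2019-01-01"}]
print(consistency_measures(period))   # each rule: (1 error, 50.0 %)
```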
27. How to Test

              Real-World Accuracy                        Cache Accuracy               Complete            Consistent
Record        Talk to people (call-centre verification)  Compare to system of record  Permissible values  Rules within the record
Set           n/a                                        Compare to system of record  Reconciliation      Rules within the set
Application   n/a                                        n/a                          n/a                 Rules between types
Organisation  n/a                                        n/a                          n/a                 Rules between applications
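For instance, the set-level “compare to system of record” / “reconciliation” cells might be implemented as below, using Python’s sqlite3; the source and cache tables are hypothetical.

```python
import sqlite3

# Hypothetical set-level reconciliation of a cache against its system of
# record: compare row counts and a simple sum-based checksum per day.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source (day TEXT, amount INTEGER);
    CREATE TABLE cache  (day TEXT, amount INTEGER);
    INSERT INTO source VALUES ('2024-05-01', 10), ('2024-05-01', 20);
    INSERT INTO cache  VALUES ('2024-05-01', 10);
""")

mismatches = con.execute("""
    SELECT s.day, s.n, s.total, c.n, c.total
    FROM (SELECT day, COUNT(*) n, SUM(amount) total FROM source GROUP BY day) s
    LEFT JOIN (SELECT day, COUNT(*) n, SUM(amount) total
               FROM cache GROUP BY day) c ON c.day = s.day
    WHERE c.n IS NOT s.n OR c.total IS NOT s.total
""").fetchall()

print(mismatches)  # days where the cache disagrees with the system of record
```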
28. When to Test

              Real-World Accuracy  Cache Accuracy  Complete    Consistent
Record        Infrequent           Regular         Every read  Every read
Set           n/a                  Regular         Regular     Regular
Application   n/a                  n/a             Regular     Regular
Organisation  n/a                  n/a             n/a         Regular
29. The journey of a thousand applications starts with a single test.