Most organisations suspect that they have poor data quality, but don’t know how to measure it or what to do about it. Teams of data scientists, analysts, and ETL developers either blindly accept “garbage in, garbage out”, or worse still, “cleanse” data to fit their own limited perspectives. DataOps is a systematic approach to measuring data quality and planning mitigations for bad data.
3. Data Teams
• Are focused on urgent, unplanned work
• Traditionally operate the systems they develop, because they don’t perceive that hand-off is possible
• Scant theory; what little writing exists is technology-focused
4. The DataOps Manifesto
Whether referred to as data science, data engineering, data management, big data, business intelligence, or the like, through our work we have come to value in analytics:
https://www.dataopsmanifesto.org/
10. Do you have Bad Data?
In the absence of information, rumour becomes widely believed. Rumour is biased toward emotion, which in workplaces tends to be negative.
11. What problems does poor data quality cause?
• Data / ETL pipelines crash, resulting in unavailable, stale, or incorrect data
• More than 80% of data scientists’ time is spent collecting data
• Incorrect data is used for decisions, or is published
• Doubts about data hurt morale and discourage evidence-based decision making
12. What is Data Quality?
Data quality is good when people who inspect data see what they expect.
Data quality is bad when people are surprised by the data they see.
15. Document data characteristics and train people to know them
If you only learn one thing today: in the absence of training and documentation, most people will be surprised by the data even when nothing is wrong.
16. What do we want?
Evidence Based Decision Making
When do we want it?
After Peer Review
17. Data Testing
• Accuracy, consistency, and completeness tests
• Run on records and on relationships
• Relationship consistency tests
18. Test Objectives
• Accuracy - is it true?
• Consistency - does it obey the rules?
• Completeness - what is missing?
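To make the three objectives concrete, here is a minimal Python sketch of all three applied to a single record; the employee-style field names and the rules are illustrative assumptions, not from the deck.

```python
# Minimal sketch of the three test objectives on one (hypothetical) record.
record = {"id": 42, "hire_date": "2019-03-01", "end_date": None, "salary": 55000}

# Completeness - what is missing?
missing = [k for k in ("id", "hire_date", "salary") if record.get(k) is None]

# Consistency - does it obey the rules? (e.g. end_date must not precede hire_date)
consistent = (record["end_date"] is None
              or record["end_date"] >= record["hire_date"])

# Accuracy - is it true? This needs an external source of truth, e.g.
# comparing against a system of record or phoning the person; not shown here.
print(f"missing fields: {missing}, consistent: {consistent}")
```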
19. Data Test Scopes
• Within a record (SQL row, NoSQL document, etc.)
• Within a set (SQL table, etc.)
• Within an application (HRIS, ERP, etc.)*
• Across the organisation*
* these scopes are combinatorial
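To illustrate the first two scopes: a within-record test is a predicate over one row, while a within-set test needs the whole table. Here is a minimal sketch using Python’s built-in sqlite3; the staff schema and its rules are made-up examples.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staff (id INTEGER, email TEXT, hired DATE, ended DATE);
    INSERT INTO staff VALUES (1, 'a@x.org', '2020-01-01', NULL),
                             (1, 'b@x.org', '2021-06-01', '2020-01-01');
""")

# Within a record: each row must obey its own rules (ended must follow hired).
bad_rows = con.execute(
    "SELECT COUNT(*) FROM staff WHERE ended IS NOT NULL AND ended < hired"
).fetchone()[0]

# Within a set: rules that only make sense across rows (ids must be unique).
dup_ids = con.execute(
    "SELECT COUNT(*) FROM (SELECT id FROM staff GROUP BY id HAVING COUNT(*) > 1)"
).fetchone()[0]

print(f"within-record violations: {bad_rows}, duplicate ids: {dup_ids}")
```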
21. Monitor Data as if it were Infrastructure

                 When                            Where       Who
Code             Event-driven (commit / PR)      Test        Developers fix errors
Infrastructure   Constantly, at tight intervals  Production  Automated repair, failover to Ops
Data             Constantly                      Production  Automated repair, failover to data steward
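A minimal sketch of the last row of this table: data checks run constantly in production, try an automated repair first, and fail over to a human data steward. The function names (run_checks, auto_repair, notify_steward) and the interval are placeholder assumptions.

```python
import time

def run_checks():                 # placeholder: returns names of failed checks
    return []

def auto_repair(check) -> bool:   # placeholder: True if the repair succeeded
    return False

def notify_steward(check):        # placeholder: page/email the data steward
    print(f"escalating {check} to the data steward")

CHECK_INTERVAL_SECONDS = 300      # "constantly", at a tight interval

while True:                       # monitors run for the life of the system
    for failed in run_checks():
        if not auto_repair(failed):   # automated repair first...
            notify_steward(failed)    # ...failover to the data steward
    time.sleep(CHECK_INTERVAL_SECONDS)
```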
23. Pipelines
• Monitor each step in the pipeline
• If steps are idempotent, kill and retry (once) any step whose measures are anomalous, as sketched below
• Raise an incident if the retry is also anomalous
• Insert data quality gates between steps during test design and in response to incidents
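The kill-and-retry rule might look like the following Python sketch; run_step, measures_are_anomalous, and raise_incident are assumed placeholder names, not part of the talk.

```python
def run_step(step):                   # placeholder: execute one pipeline step
    return {"duration_s": 1.0, "rows_in": 100, "rows_out": 100}

def measures_are_anomalous(step, m):  # placeholder anomaly test
    return m["rows_out"] == 0

def raise_incident(step, m):          # placeholder incident hook
    print(f"incident: {step} still anomalous after retry: {m}")

def run_pipeline(steps):
    """Run idempotent steps in order, retrying an anomalous step once."""
    for step in steps:
        m = run_step(step)                    # collect the step's measures
        if measures_are_anomalous(step, m):
            m = run_step(step)                # retry once (safe: idempotent)
            if measures_are_anomalous(step, m):
                raise_incident(step, m)       # retry also anomalous
                return

run_pipeline(["extract", "transform", "load"])
```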
25. Pipeline Measures
For each step in a data pipeline, record:
• Duration
• Cost (BUFFER_GETS, PAGE_READS, CPU Seconds)
• Records in
• Records out
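One way to capture these measures, sketched in Python: a wrapper that times each step and counts records in and out. Engine-specific cost counters such as BUFFER_GETS or PAGE_READS would be read from the database session and are omitted here.

```python
import time
from functools import wraps

def measured(step_fn):
    """Record duration and record counts for one pipeline step."""
    @wraps(step_fn)
    def wrapper(records):
        start = time.monotonic()
        out = step_fn(records)
        print(f"{step_fn.__name__}: {time.monotonic() - start:.3f}s, "
              f"records in={len(records)}, out={len(out)}")
        return out
    return wrapper

@measured
def drop_incomplete(records):
    # Example step: keep only records with a non-null id.
    return [r for r in records if r.get("id") is not None]

drop_incomplete([{"id": 1}, {"id": None}])
```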
26. Quality Measures
• Accuracy and completeness checks report the number of errors and the error % for every scope and time period
• Consistency checks report the number of errors and the error % for each rule and time period
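A sketch of the consistency measure in Python: error count and error percentage per rule over one time period of records. The rule set and the records are illustrative assumptions.

```python
from collections import Counter

# Hypothetical rules: each maps a rule name to a predicate over a record.
RULES = {
    "salary_positive": lambda r: r["salary"] > 0,
    "end_after_start": lambda r: r["end"] is None or r["end"] >= r["start"],
}

def consistency_measures(records):
    """Errors and error % for each rule over one time period of records."""
    errors = Counter()
    for r in records:
        for name, rule in RULES.items():
            if not rule(r):
                errors[name] += 1
    n = len(records)
    return {name: (errors[name], 100.0 * errors[name] / n) for name in RULES}

period = [{"salary": 100, "start": "2020-01-01", "end": None},
          {"salary": -5, "start": "2020-01-01", "end": "2019-01-01"}]
print(consistency_measures(period))   # each rule: (1 error, 50.0 %)
```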
27. How to Test

              Real-World Accuracy                        Cache Accuracy               Complete            Consistent
Record        Talk to people (call-centre verification)  Compare to system of record  Permissible values  Rules within the record
Set           n/a                                        Compare to system of record  Reconciliation      Rules within the set
Application   n/a                                        n/a                          n/a                 Rules between types
Organisation  n/a                                        n/a                          n/a                 Rules between applications
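For instance, the set-level “compare to system of record” / “reconciliation” cells might be implemented as below, using Python’s sqlite3; the source and cache tables are hypothetical.

```python
import sqlite3

# Hypothetical set-level reconciliation of a cache against its system of
# record: compare row counts and a simple sum-based checksum per day.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source (day TEXT, amount INTEGER);
    CREATE TABLE cache  (day TEXT, amount INTEGER);
    INSERT INTO source VALUES ('2024-05-01', 10), ('2024-05-01', 20);
    INSERT INTO cache  VALUES ('2024-05-01', 10);
""")

mismatches = con.execute("""
    SELECT s.day, s.n, s.total, c.n, c.total
    FROM (SELECT day, COUNT(*) n, SUM(amount) total FROM source GROUP BY day) s
    LEFT JOIN (SELECT day, COUNT(*) n, SUM(amount) total
               FROM cache GROUP BY day) c ON c.day = s.day
    WHERE c.n IS NOT s.n OR c.total IS NOT s.total
""").fetchall()

print(mismatches)  # days where the cache disagrees with the system of record
```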
28. When to Test

              Real-World Accuracy  Cache Accuracy  Complete    Consistent
Record        Infrequent           Regular         Every read  Every read
Set           n/a                  Regular         Regular     Regular
Application   n/a                  n/a             Regular     Regular
Organisation  n/a                  n/a             n/a         Regular
29. The journey of a thousand applications starts with a single test.