Data Quality Process Design For Analytics And Reporting
1. Jim Atwater
Principal Consultant
Management Analytics Practice
McKeel
12/9/2008 All Rights Reserved RESEARCH LLC
2. Ad-hoc reporting processes are largely relational,
aspiring to a more dimensional model
More than one data source is involved, including
relational databases, spreadsheets, SharePoint lists
and flat files
Source data is staged to a relational database and
reported using standard tools like Excel and
PowerPoint
Data Quality is a concern, guided by a need for “One
version of the truth”
4. No “T” in the “ETL” means:
no auditing or error handling
no added value until the report is generated
analysts end up doing the same work every reporting cycle
QA and formatting step actually adds errors
Column names
User-defined aggregations
Macros and calculated fields
Violates the prime directive
“One version of the truth” -> “None versions are the truth”
Automation only automates the errors
6. Screen Definitions
A screen is a specific type of automated test case
Screens can validate (among other things)
physical or logical structure
atomic-level data values
logical aggregation values
Screens enforce data quality
Identify errors and score their severity
Feed the Error Event Fact Table
Screen Order
Allows screens to be run in parallel for better performance
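To make the screen idea concrete, here is a minimal sketch of a screen as an automated test case that scores errors and emits error events. The Screen and ErrorEvent classes, the 1-to-5 severity scale, and the sample column names are all hypothetical and do not come from the slides; the order field only illustrates that screens sharing an order value do not depend on each other and could be run in parallel.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Screen:
    name: str
    order: int                     # screens with the same order are independent
    severity: int                  # assumed scale: 1 (minor) to 5 (critical)
    passes: Callable[[Row], bool]  # returns True when the row is clean


@dataclass(frozen=True)
class ErrorEvent:
    screen_name: str
    row_index: int
    severity: int


def run_screen(screen: Screen, rows: List[Row]) -> List[ErrorEvent]:
    """Apply one screen to every staged row and collect its failures."""
    return [
        ErrorEvent(screen.name, index, screen.severity)
        for index, row in enumerate(rows)
        if not screen.passes(row)
    ]


# Two atomic-level value screens (hypothetical rules).
screens = [
    Screen("region_not_null", 1, 5, lambda r: r.get("region") is not None),
    Screen(
        "amount_non_negative",
        1,
        3,
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]

# Hypothetical staged rows from a source-system feed.
staged_rows = [
    {"region": "West", "amount": 120.0},
    {"region": None, "amount": 75.5},
    {"region": "East", "amount": -10.0},
]

# Run screens in screen order; the results would feed the error event fact table.
error_events = [
    event
    for screen in sorted(screens, key=lambda s: s.order)
    for event in run_screen(screen, staged_rows)
]

for event in error_events:
    print(event)
```

Keeping each screen as data (a name, an order, a severity and a check) rather than as hand-written report logic is one way to let a single codebase process every screen.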
7. Staged data from source systems
Data quality begins and ends with the source systems
Composed of feeds from source systems
Driven by source system “data map” analysis
Run screens
Run in groups to optimize performance
Consumed by master screen-processing codebase for supportability
Data quality metrics
Screen definition
Data quality score
Exception action
Screen type and category
Screen metadata (elapsed time, row count, byte count)
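The flow on this slide (run screens in groups, then record metrics and screen metadata) might be sketched as follows. This is an illustration under stated assumptions, not the deck's implementation: the screen names, the dictionary layout, and the definition of the data quality score as the fraction of rows that pass are all hypothetical. Byte count is left out for brevity.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def not_null(column):
    """Build a check that passes when the given column has a value."""
    def check(row):
        return row.get(column) is not None
    return check


# Hypothetical screen definitions; "order" decides which group a screen runs in.
SCREENS = [
    {"name": "customer_id_not_null", "order": 1, "check": not_null("customer_id")},
    {"name": "order_date_not_null", "order": 1, "check": not_null("order_date")},
    {"name": "amount_not_null", "order": 2, "check": not_null("amount")},
]


def run_screen(screen, rows):
    """Run one screen and return its metrics plus screen metadata."""
    started = time.perf_counter()
    failed = sum(1 for row in rows if not screen["check"](row))
    return {
        "screen": screen["name"],
        "row_count": len(rows),
        "failed_rows": failed,
        # Assumed scoring rule: share of rows that pass the screen.
        "quality_score": 1.0 - failed / len(rows) if rows else 1.0,
        "elapsed_seconds": time.perf_counter() - started,
    }


def run_in_groups(screens, rows):
    """Run screens group by group; screens within a group run in parallel."""
    metrics = []
    ordered = sorted(screens, key=lambda s: s["order"])
    with ThreadPoolExecutor() as pool:
        for _, group in groupby(ordered, key=lambda s: s["order"]):
            # pool.map yields results in input order, one group at a time.
            metrics.extend(pool.map(lambda s: run_screen(s, rows), list(group)))
    return metrics


# Hypothetical staged feed rows.
rows = [
    {"customer_id": 1, "order_date": "2008-12-01", "amount": 10.0},
    {"customer_id": None, "order_date": "2008-12-02", "amount": 20.0},
    {"customer_id": 3, "order_date": "2008-12-03", "amount": None},
    {"customer_id": 4, "order_date": "2008-12-04", "amount": 5.0},
]

metrics = run_in_groups(SCREENS, rows)
for metric in metrics:
    print(metric["screen"], metric["quality_score"], metric["failed_rows"])
```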
8. Error events
Star-schema data structure
Source-system feed data rows are at the fact level
Main facts are the screen definition key and the severity score
Primary consumer is the Information Steward
Source of the Data Quality detail report
Audit dimensions
One dimension table for each report data source
One row for each class of error each time the report is updated
Integrated with the report data deliverables
Consumed by users as well as the Information Quality Lead
Acts as a physical guarantee of “one version of the truth”
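One possible shape for the error event star schema and the audit dimension, sketched with an in-memory SQLite database. The table names, column names, and sample rows are hypothetical; the slides only state that the main facts are the screen definition key and the severity score, and that the audit dimension carries one row per class of error per report update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE screen_dim (
    screen_key      INTEGER PRIMARY KEY,
    screen_name     TEXT NOT NULL,
    screen_category TEXT NOT NULL
);
CREATE TABLE error_event_fact (
    error_event_key INTEGER PRIMARY KEY,
    screen_key      INTEGER NOT NULL REFERENCES screen_dim(screen_key),
    batch_date      TEXT NOT NULL,     -- one batch per report update
    source_row_id   INTEGER NOT NULL,  -- the failing feed row
    severity_score  INTEGER NOT NULL
);
CREATE TABLE audit_dim (
    audit_key    INTEGER PRIMARY KEY,
    batch_date   TEXT NOT NULL,
    screen_name  TEXT NOT NULL,
    error_count  INTEGER NOT NULL,
    max_severity INTEGER NOT NULL
);
""")

# Hypothetical screens and the error events they raised in one reporting cycle.
conn.executemany(
    "INSERT INTO screen_dim VALUES (?, ?, ?)",
    [(1, "region_not_null", "atomic value"),
     (2, "total_matches_detail", "aggregation")],
)
conn.executemany(
    "INSERT INTO error_event_fact "
    "(screen_key, batch_date, source_row_id, severity_score) VALUES (?, ?, ?, ?)",
    [(1, "2008-12-09", 17, 5),
     (1, "2008-12-09", 42, 5),
     (2, "2008-12-09", 0, 3)],
)

# Audit dimension: one row per class of error for each report update.
conn.execute("""
INSERT INTO audit_dim (batch_date, screen_name, error_count, max_severity)
SELECT f.batch_date, d.screen_name, COUNT(*), MAX(f.severity_score)
FROM error_event_fact AS f
JOIN screen_dim AS d ON d.screen_key = f.screen_key
GROUP BY f.batch_date, d.screen_name
""")

# A data quality detail report of the kind an Information Steward might read.
audit_rows = conn.execute(
    "SELECT batch_date, screen_name, error_count, max_severity "
    "FROM audit_dim ORDER BY screen_name"
).fetchall()

for row in audit_rows:
    print(row)
```

Because the audit rows are derived from the same fact table that drives the detail report, the summary delivered with the report and the detail seen by the Information Steward cannot drift apart.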
9. Source System Owners
Owns data provided by the source systems
Works with Information Steward to define strategy and reports
Works with Information Quality Lead to define and enforce data
quality metrics
Information Steward
Owns relationships with report consumers to define reports and
select data sources
Owns relationships with source system owners to define feeds
Information Quality Lead
Owns business rules for data quality metrics
Owns relationship with source system owners to define and
enforce data quality metrics
10. Key dependencies
Diversity of reports
Quantity and complexity of data sources
Responsiveness of source systems
Milestones
Identify and automate reports
Stack-rank reports by impact of decisions made
Map data from sources to reports
Define data quality screens
Establish infrastructure
Acquire hardware and staff resources
Create QA infrastructure
Data quality process
Iterative process between Users and Source Systems
11. McKeel Research LLC
Affiliated with Allyis, Incorporated
Contact:
James W. Atwater, 3rd
Principal Consultant, Management Analytics Practice
Office: (425) 996-0427
Cell: (425) 766-0832
Email/IM: jamesatwater@hotmail.com