Data Quality Process Design For Analytics And Reporting

Jim Atwater
Principal Consultant
Management Analytics Practice

McKeel Research LLC
12/9/2008 All Rights Reserved
- Ad-hoc reporting processes are largely relational, aspiring to a more dimensional model
- More than one data source is involved, including relational databases, spreadsheets, SharePoint lists, and flat files
- Source data is staged to a relational database and reported using standard tools like Excel and PowerPoint
- Data quality is a concern, guided by a need for “one version of the truth”
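The staging step described above can be pictured with a minimal Python sketch that loads one feed type (a CSV) into a relational staging table. The table and column names here are illustrative, not taken from the deck:

```python
import csv
import io
import sqlite3

def stage_csv(conn: sqlite3.Connection, table: str, feed: io.StringIO) -> int:
    """Stage a CSV feed into a relational staging table, keeping the
    column names from the feed header. Returns the row count staged."""
    rows = list(csv.DictReader(feed))
    if not rows:
        return 0
    cols = list(rows[0].keys())
    col_list = ", ".join(f'"{c}"' for c in cols)
    placeholders = ", ".join("?" for _ in cols)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_list})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})',
        [[r[c] for c in cols] for r in rows],
    )
    return len(rows)

# Hypothetical feed: two rows of regional sales amounts.
conn = sqlite3.connect(":memory:")
feed = io.StringIO("region,amount\nWest,100\nEast,250\n")
staged = stage_csv(conn, "stg_sales", feed)
```

Spreadsheet, SharePoint, and flat-file feeds would get similar loaders, all landing in the same staging database so one set of quality checks can run against them.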
No “T” in the “ETL” means:

- No auditing or error handling
- No added value until the report is generated
- Analysts end up doing the same work every reporting cycle
- The QA and formatting step actually adds errors:
  - Column names
  - User-defined aggregations
  - Macros and calculated fields
- It violates the prime directive:
  - “One version of the truth” becomes “none of the versions is the truth”
  - Automation only automates the errors
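The missing transform step does not have to be elaborate; even a column-layout check that quarantines and records bad rows, instead of letting them flow straight into the report, addresses the auditing gap above. A hypothetical sketch (the layout and names are illustrative):

```python
# Expected layout for this feed -- an assumption for illustration.
EXPECTED_COLUMNS = ["region", "amount"]

def transform(rows: list[dict], audit_log: list[str]) -> list[dict]:
    """A minimal 'T' step: validate each row against the expected
    column layout, quarantining failures into an audit log rather
    than passing them silently to the report."""
    clean = []
    for i, row in enumerate(rows):
        missing = [c for c in EXPECTED_COLUMNS if c not in row]
        if missing:
            audit_log.append(f"row {i}: missing columns {missing}")
            continue  # quarantined, not reported
        clean.append({c: row[c] for c in EXPECTED_COLUMNS})
    return clean

audit: list[str] = []
out = transform([{"region": "West", "amount": 100}, {"amount": 5}], audit)
```

Because the check runs every cycle, the audit log captures the same errors the analysts were re-finding by hand each time.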


Screen Definitions

- A screen is a specific type of automated test case
- Screens can validate, among other things:
  - Physical or logical structure
  - Atomic-level data values
  - Logical aggregation values
- Screens enforce data quality:
  - Identify errors and score their severity
  - Feed the Error Event Fact Table

Screen Order

- Allows screens to be run in parallel for better performance
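One way to picture a screen as an automated test case that scores failures and feeds the Error Event Fact Table. This is a sketch under assumed names (`Screen`, `ErrorEvent`, `run_screen` are not from the deck):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Screen:
    key: int                        # screen definition key
    severity: int                   # severity score assigned to failures
    check: Callable[[dict], bool]   # returns True when the row passes

@dataclass
class ErrorEvent:
    """One row destined for the Error Event Fact Table."""
    screen_key: int
    severity: int
    row_id: int

def run_screen(screen: Screen, rows: list[dict]) -> list[ErrorEvent]:
    # Emit one error event per failing row, carrying the screen's
    # definition key and severity score as the facts.
    return [
        ErrorEvent(screen.key, screen.severity, i)
        for i, row in enumerate(rows)
        if not screen.check(row)
    ]

# Atomic-level value screen: amounts must be non-negative.
non_negative = Screen(key=1, severity=3, check=lambda r: r["amount"] >= 0)
events = run_screen(non_negative, [{"amount": 10}, {"amount": -5}])
```

Structural and aggregation screens fit the same shape, with `check` operating on a whole feed rather than a single row; because each screen is self-contained, independent screens can run in parallel as the slide notes.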



Staged data from source systems

- Data quality begins and ends with the source systems
- Composed of feeds from source systems
- Driven by source-system “data map” analysis

Run screens

- Run in groups to optimize performance
- Consumed by a master screen-processing codebase for supportability

Data quality metrics

- Screen definition
- Data quality score
- Exception action
- Screen type and category
- Screen metadata (elapsed time, row count, byte count)
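The metrics above can be captured by wrapping each screen run. A minimal sketch, assuming a pass-rate definition of the quality score (the deck does not specify how the score is computed):

```python
import time
from typing import Callable

def run_with_metrics(screen_name: str, rows: list[dict],
                     check: Callable[[dict], bool]) -> dict:
    """Run one screen over staged rows and record the metadata the
    process tracks: elapsed time, row count, and a quality score
    (here assumed to be the fraction of rows that pass)."""
    start = time.perf_counter()
    failures = sum(1 for row in rows if not check(row))
    return {
        "screen": screen_name,
        "elapsed_seconds": time.perf_counter() - start,
        "row_count": len(rows),
        "quality_score": (len(rows) - failures) / len(rows) if rows else 1.0,
    }

metrics = run_with_metrics(
    "amount_non_negative",
    [{"amount": 10}, {"amount": -5}, {"amount": 3}],
    lambda r: r["amount"] >= 0,
)
```

Keeping this wrapper in one master screen-processing codebase means every screen reports the same metadata the same way, which is what makes the process supportable.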

Error events

- Star-schema data structure
- Source-system feed data rows are at the fact level
- Main facts are the screen definition key and the severity score
- Primary consumer is the Information Steward
- Source of the Data Quality detail report

Audit dimensions

- One dimension table for each report data source
- One row for each class of error, each time the report is updated
- Integrated with the report data deliverables
- Consumed by users as well as the Information Quality Lead
- Acts as a physical guarantee of “one version of the truth”
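Building an audit-dimension load from error events, as described above, might look like the following sketch. The function and field names (`refresh_id`, `error_class`) are hypothetical:

```python
from collections import Counter

def build_audit_dimension(refresh_id: int,
                          error_events: list[dict]) -> list[dict]:
    """Summarize error events into audit-dimension rows: one row per
    class of error for this report refresh, so report consumers can
    see exactly which quality issues shipped with the data."""
    by_class = Counter(e["error_class"] for e in error_events)
    return [
        {"refresh_id": refresh_id, "error_class": cls, "event_count": n}
        for cls, n in sorted(by_class.items())
    ]

audit_rows = build_audit_dimension(
    42,
    [
        {"error_class": "missing_value"},
        {"error_class": "missing_value"},
        {"error_class": "out_of_range"},
    ],
)
```

Delivering these rows alongside the report data is what turns “one version of the truth” from a slogan into something a user can verify.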

Source System Owners

- Own the data provided by the source systems
- Work with the Information Steward to define strategy and reports
- Work with the Information Quality Lead to define and enforce data quality metrics

Information Steward

- Owns relationships with report consumers to define reports and select data sources
- Owns relationships with source system owners to define feeds

Information Quality Lead

- Owns the business rules for data quality metrics
- Owns the relationship with source system owners to define and enforce data quality metrics

Key dependencies

- Diversity of reports
- Quantity and complexity of data sources
- Responsiveness of source systems

Milestones

- Identify and automate reports
  - Stack-rank reports by the impact of the decisions they support
  - Map data from sources to reports
  - Define data quality screens
- Establish infrastructure
  - Acquire hardware and staff resources
  - Create QA infrastructure
- Data quality process
  - Iterate between users and source systems

McKeel Research LLC

- Affiliated with Allyis, Incorporated
- Contact:
  - James W. Atwater III
  - Principal Consultant, Management Analytics Practice
  - Office: (425) 996-0427
  - Cell: (425) 766-0832
  - Email/IM: jamesatwater@hotmail.com





