This document discusses strategies for creating an effective data validation and testing process. It provides examples of common data issues found during testing such as missing data, wrong translations, and duplicate records. Solutions discussed include identifying important test points, reviewing data mappings, developing automated and manual testing approaches, and assessing how much data needs validation. The presentation also includes a case study of a company that improved its process by centralizing documentation, improving communication, and automating more of its testing.
6. Big Impacts of Big Data
• More than 1 million customer transactions handled every hour, imported into databases containing more than 2.5 petabytes of data: the equivalent of 167 times the information in all the books in the US Library of Congress
• Facebook handles 40 billion photos from its user base
• Google processes 1 terabyte per hour
• Twitter processes 85 million tweets per day
• eBay processes 80 terabytes per day
• and others
7. DWH, BI, Big Data Marketplaces
Data Warehouse Marketplace
“The worldwide data warehouse management software market is forecast to generate nearly $17 billion in revenue by 2020” - Forrester
Top vendors: Oracle, Teradata, IBM, Microsoft, SAP, Micro Focus and Amazon
Business Intelligence Marketplace
“The business intelligence (BI) and analytics software market is forecast to grow to $22.8 billion by the end of 2020” - Gartner
Top vendors: SAP, IBM, SAS, Microsoft, Oracle, Tableau, Qlik, MicroStrategy, Information Builders
Big Data Marketplace
“By the end of 2020, companies will spend more than $72 billion on Big Data hardware, software, & professional services” - IDC
Top vendors: Oracle, IBM, Microsoft, Amazon, Micro Focus, Hortonworks, Cloudera, Teradata, SAP, MongoDB, MapR, DataStax, Snowflake
9. Impacts of Bad Data
“On average, poor data quality costs organizations $14.2 million
annually.”
“Dirty data costs the average business 15% to 25% of revenue.”
“Cleaning up data will lead to average cost savings of 33%, while
boosting revenue by an average of 31%.”
11. What is Data Validation?
Data Validation Testing
The process of verifying that your data moves completely and accurately through your systems according to the business requirements.
[Diagram: source systems (Legacy DB, CRM/ERP DB, Finance DB) feed the ETL process (Extract, Transform, Load) into the target Data Warehouse.]
12. Data Validation Test Types
• Data Completeness
Verifying that all data has been loaded from the sources to the target Data Warehouse, and validating that the correct data displays in BI reports.
• Data Transformation
Ensuring that all data has been transformed correctly during the extract-transform-load (ETL) process.
• BI Report Testing
Verifying that BI reports are formatted correctly, that calculated fields are correct, and that report data matches the underlying data.
• BI Performance Testing
Ensuring that BI reports can be generated in a reasonable amount of time.
• Data Quality
Ensuring that the ETL process correctly rejects, substitutes default values for, corrects, or ignores and reports invalid data.
13. Finding Bad Data
• Missing Data: data that does not make it into the target database.
Possible causes: invalid or incorrect lookup table in the transformation logic; bad data from the source database (needs cleansing); invalid joins.
• Truncation of Data: data lost because a data field is truncated.
Possible causes: invalid field lengths on the target database; transformation logic not accounting for field lengths from the source.
• Data Type Mismatch: data types not set up correctly on the target database.
Possible cause: source data field not configured correctly.
• Null Translation: null source values not being transformed to the correct target values.
Possible cause: development team did not include the null translation in the transformation logic.
• Wrong Translation: the opposite of the Null Translation error; a field that should be null is populated with a non-null value, or a field is populated with the wrong value.
Possible cause: development team incorrectly translated the source field for certain values.
• Misplaced Data: source data fields not being transformed to the correct target data field.
Possible cause: development team inadvertently mapped the source data field to the wrong target data field.
• Extra Records: records that should not be in the ETL output are included.
Possible cause: development team did not include a filter in their code.
• Not Enough Records: records that should be in the ETL output are missing.
Possible cause: development team included a filter in their code that should not have been there.
14. Finding Bad Data (cont.)
• Transformation Logic Errors/Holes: testing can uncover “holes” in the transformation logic or reveal that the logic is unclear.
Possible cause: development team did not take special cases into account; for example, international cities containing language-specific characters might need to be handled in the ETL code.
• Simple/Small Errors: capitalization, spacing, and other small errors.
Possible cause: development team did not add a space after a comma when populating the target field.
• Sequence Generator: ensuring that report sequence numbers are in the correct order is critical when processing follow-up reports or responding to an audit.
Possible cause: development team did not configure the sequence generator correctly, resulting in records with duplicate sequence numbers.
• Undocumented Requirements: requirements that are “understood” but not actually documented anywhere.
Possible cause: several members of the development team did not share the same understanding of the undocumented requirements.
• Duplicate Records: two or more records that contain the same data.
Possible cause: development team did not add the appropriate code to filter out duplicate records.
• Numeric Field Precision: numbers not formatted to the correct decimal place or not rounded per specifications.
Possible cause: development team rounded the numbers to the wrong decimal place.
• Rejected Rows: data rows that get rejected due to data issues.
Possible cause: development team did not account for data conditions that could break the ETL for a particular row.
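Several of the issues above, duplicate records in particular, can be detected with a simple aggregate query before they reach reports. A minimal sketch using Python's built-in sqlite3 module and a hypothetical customers table (all names and data are illustrative):

```python
import sqlite3

# In-memory database standing in for a target table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace"), (2, "Grace"), (3, "Edsger")],  # row 2 is duplicated
)

# Group on the full record; any group with COUNT(*) > 1 is a duplicate record.
dupes = conn.execute(
    """SELECT customer_id, name, COUNT(*) AS n
       FROM customers
       GROUP BY customer_id, name
       HAVING COUNT(*) > 1"""
).fetchall()

print(dupes)  # [(2, 'Grace', 2)]
```

The same GROUP BY ... HAVING pattern also flags records that violate an expected-uniqueness rule on a sequence number.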
15. Challenges
• How much data needs to be validated/tested?
• How do I ensure I am testing the proper data
permutations?
• What are the critical data endpoints that need
to be tested?
• How do I verify that the data from my various
source systems is propagating through the
architecture?
• How do I validate data in the cloud
environments?
• Is bad data making it into the architecture?
• How much of the data testing can be automated?
16. Solutions
[Chart: the cost of fixing bad data rises with each phase it survives: Data Mapping, Development, Unit Testing, QA Test Cycle, UAT Testing, End User.]
Finding Bad Data
• Identify testing points
• Review data mappings
• Data Testing Strategies
• comparisons (source vs. target)
• row counts
• minus queries
• automation tools
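The row-count strategy listed above can be sketched in a few lines. The table names and data here are hypothetical, with Python's built-in sqlite3 module standing in for real source and target databases:

```python
import sqlite3

# Two in-memory databases standing in for source and target systems.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER)")
tgt.execute("CREATE TABLE dw_orders (id INTEGER)")
src.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(100)])
tgt.executemany("INSERT INTO dw_orders VALUES (?)", [(i,) for i in range(97)])  # 3 rows lost

src_count = src.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
tgt_count = tgt.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]

# A mismatch signals missing or extra records worth investigating.
print(src_count, tgt_count, src_count == tgt_count)  # 100 97 False
```

Row counts are cheap but coarse: equal counts do not prove the rows match, which is why minus queries or an automation tool are used for deeper comparison.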
17. Solutions
Data Testing Permutations
• Analyze the data mappings
• Develop a test Data Set
o Review Transformation Logic
▪ Case Statements
▪ Field Merges/ Field Splitting
▪ Translations (Lookups)
▪ Derived
• Replication of production data
• Homegrown or Freeware
• Enterprise solutions
o IBM InfoSphere Optim, GenRocket, SAP, Computer Associates
Test Data Generation
18. Solutions
How much data to validate?
• Requirements
• Regulatory authorities may require 100% of your data be tested.
• In other cases, 90% or 80% may be the goal.
• Time, resource and scope driven
• Release timeline
• Available resources
• Scope of authoring and executing tests
• Risk Assessment
• Business Acceptance Criteria – End users define their primary data use cases.
• Critical Path – Validate the data that flows through the high-priority data
endpoints within your system.
Test authoring days = (total test authoring time) / (number of resources × hours per day authoring per resource)
Test execution days = (total test execution time) / (number of resources × hours per day executing per resource)
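Plugging illustrative numbers into the formulas above (all figures hypothetical):

```python
# Hypothetical inputs: 400 hours of total authoring work,
# 5 testers, each spending 4 hours per day authoring tests.
authoring_total_hours = 400
resources = 5
authoring_hours_per_day = 4

authoring_days = authoring_total_hours / (resources * authoring_hours_per_day)
print(authoring_days)  # 20.0

# Same shape for execution: 150 total hours, 6 hours per tester per day.
execution_total_hours = 150
execution_hours_per_day = 6

execution_days = execution_total_hours / (resources * execution_hours_per_day)
print(execution_days)  # 5.0
```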
19. Solutions
Automation vs Manual
• Recurrence
• Avoid complicated single use test cases
• Focus on repeatable testing paths
• Ensure modularization of test data sets
• Test Data Sets
• Confirm that the automation tool’s assigned hardware resources and
performance can handle the load of the data set under test
• Include the time needed to prepare environments in your testing estimates
• Database Performance
• Set expectations on database hardware & responsiveness.
• SQL query response time will factor into overall test run times
20. Solutions
How do I test data in my cloud environment?
• On-Prem vs Cloud
o Follow the same testing methodologies, but with considerations for cloud
connections and scalability
o If an automated solution is being pursued, confirm that the tools involved
allow for connectivity to your cloud environment
• Hybrid-Cloud Mapping
o Interface documentation
o Define entry & exit points if applicable
• Digital Transformation
o Clearly defined conversion
requirements and mappings
• Environment Scalability
o Define limitations on testing environment resources
22. Data Validation Assessment
What are the goals of a
Data Validation assessment?
• Receive an expert evaluation of your
current data validation process
• Provide recommendations on how to
improve your process
• Proposal for successful implementation
of your goals
23. Data Validation Assessment
Components of the Assessment
• Business analysis
• Data architecture analysis
• ETL testing process evaluation
• DataOps & DevOps evaluation
• Resource evaluation (optional)
• Metrics evaluation
• Risk assessment
24. Data Validation Assessment
Interview with Key Players
• Business/Data Analysts create requirements
• QA Testers develop and execute test plans and
test cases
• Architects set up environments
• Developers create ETL code, perform unit tests
• DBAs test for performance and stress
• Business Users perform functional User
Acceptance Tests
25. Data Validation Assessment
Process Review
• Review Requirements & Mapping documentation
• Testing Process Design
• Analysis of tools and DevOps/DataOps
• Reporting metrics evaluations
26. Data Validation Assessment
Deliverables
• Detailed analysis report with recommendations
for improvement
• Presentation to your team on our findings
• Proposal for successful implementation of your
goals
28. Data Testing - Developer & Tester
[Diagram: Source Data flows through ETL into a Big Data lake, Data Warehouse, and Data Mart, and on to BI & Analytics. The ETL Developer codes data movement based on the Mapping Requirements; the Data Tester tests data movement at Testing Points #1 through #4; the BI Analyst extracts data for reports, and the tester tests the BI reports.]
29. Source-to-Target Map
Data Requirements = Mapping Document
The source-to-target map is the critical element required to efficiently plan the target data stores. It also defines the Extract, Transform, Load (ETL) process.
Intention:
✓ capture business rules
✓ map the data flow
✓ document data movement requirements
The mapping document specifies:
▪ Source input definition
▪ Target/output details
▪ Business & data transformation rules
▪ Absolute data quality requirements
▪ Optional data quality requirements
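Keeping the mapping document in a machine-readable form makes it possible to drive tests directly from it. A minimal sketch, with all table, field, and rule names hypothetical:

```python
# One row of a source-to-target mapping, as a plain dictionary.
mapping = {
    "source_table": "legacy_customer",
    "source_field": "cust_nm",
    "target_table": "dim_customer",
    "target_field": "customer_name",
    "rule": "UPPER(TRIM(cust_nm))",  # business/data transformation rule
}

def source_query(m):
    """Build the source-side test query implied by one mapping row."""
    return f"SELECT {m['rule']} FROM {m['source_table']}"

def target_query(m):
    """Build the target-side test query for the same row."""
    return f"SELECT {m['target_field']} FROM {m['target_table']}"

print(source_query(mapping))  # SELECT UPPER(TRIM(cust_nm)) FROM legacy_customer
print(target_query(mapping))  # SELECT customer_name FROM dim_customer
```

Generating the paired queries from the mapping keeps tests and requirements in sync: when the mapping row changes, the tests change with it.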
30. Data Testing Strategies
Testing Methods
Minus Queries – Create a SQL source query and a SQL target query. Using SQL, subtract the source query results from the target query results, and the target query results from the source query results; any rows returned indicate a mismatch.
Visual Compare – View the source data and the target data and compare them manually.
Record Counts – Create SQL source and target queries that return record counts and compare the values.
Automation – Use an automation tool to compare SQL source and target query results.
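The minus-query method can be sketched with SQLite's EXCEPT operator (Oracle spells it MINUS); the tables and rows here are hypothetical. Subtracting in both directions matters, because each direction only reveals rows missing from one side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, city TEXT)")
conn.execute("CREATE TABLE tgt (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(1, "Oslo"), (2, "Kyiv"), (3, "Lima")])
conn.executemany("INSERT INTO tgt VALUES (?, ?)",
                 [(1, "Oslo"), (2, "KYIV"), (3, "Lima")])  # bad translation on row 2

# Source minus target: rows that were lost or transformed incorrectly.
missing = conn.execute(
    "SELECT id, city FROM src EXCEPT SELECT id, city FROM tgt").fetchall()
# Target minus source: rows that appeared unexpectedly.
extra = conn.execute(
    "SELECT id, city FROM tgt EXCEPT SELECT id, city FROM src").fetchall()

print(missing)  # [(2, 'Kyiv')]
print(extra)    # [(2, 'KYIV')]
```

Empty results in both directions mean the two queries agree row for row; any non-empty result pinpoints the mismatched records.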
31. Data Maturity Model - Test Execution
On which level should your process be?
• Level 1 – Sampling: sampling a percentage of the data by visually comparing data sets. Not repeatable.
• Level 2 – Excel, Ad Hoc Reporting: using Excel or another homegrown method; ad hoc reporting.
• Level 3 – Minus Queries: using a SQL editor and minus queries to test data; more detailed reporting.
• Level 4 – Data Test Automation: repeatable test automation, an agreed-upon process, centralized reporting.
• Level 5 – Data Quality Optimizing: full automation, tracking of ROI, predictive data issues, auditable results. Business value is fully understood and supported by management.
33. Case Study: Overview
A company in the financial industry had a development and QA team assigned to their ETL process, but there were still issues:
• Incorrect data fields were still populating their Business Intelligence (BI) reports
• Development cycles were frequently delayed
• Management was losing confidence in the BI reporting data
34. Case Study: Assessment
Senior RTTS resources were brought in to assess the process:
• Interview key players
• Review process documentation and tools
Problem areas identified:
• Minimal requirements
• Ticketing system was not being used for traceability
• Testing process of low-level maturity
o Table row counts
o Sampling
o Excel comparisons
Resource needs:
35. Case Study: Recommendations for Improvement
• Centralized mapping documentation
o Linking requirements to work-item tickets to test cases
• To improve communication between team members, a new Data Analyst role was recommended
• Narrowed the focus of the stand-up meetings
• Implemented automated solutions to expand coverage for larger data sets