Today’s data warehouses are complex and contain heterogeneous data from many different sources. Testing these warehouses is equally complex, requiring exceptional human and technical resources. So how do you achieve the desired testing success? Geoff Horne believes that it is through test planning that includes technical artifacts such as data models, business rules, data mapping documents, and data warehouse loading design logic. Geoff shares planning checklists, a test plan outline, concepts for data profiling, and methods for data verification. He demonstrates how to effectively create a test strategy to discover empty fields, missing records, truncated data, duplicate records, and incorrectly applied business rules—all of which can dramatically impact the usefulness of the data warehouse. Learn common pitfalls, which can cost your business hundreds of thousands of dollars or more, when test planning shortcuts are taken. If you work in an environment that often performs data warehouse testing without proper planning and technical skills, this session is for you.
Data Warehouse Testing: It’s All about the Planning
1. W8
Concurrent Class
10/2/2013 1:45:00 PM
"Data Warehouse Testing: It’s All about the Planning"
Presented by:
Geoff Horne
NZTester Magazine
Brought to you by:
340 Corporate Way, Suite 300, Orange Park, FL 32073
888-268-8770 ∙ 904-278-0524 ∙ sqeinfo@sqe.com ∙ www.sqe.com
2. Geoff Horne
NZTester Magazine
Geoff Horne has an extensive background in test program/project directorship and
management, architecture, and general consulting. In New Zealand, Geoff established and ran
ISQA, a testing consultancy that enjoys a local and international clientele in Australia, the
US, and the United Kingdom. He has held senior test management roles across a number of
diverse industry sectors, and is editor and publisher of the recently launched NZTester
magazine. Geoff has authored a variety of white papers on software testing and is a regular
speaker at the STAR conferences.
6. 9/20/2013
Plan QA for DWH Lifecycle
Primary goals for verification
– Data completeness
– Data transformations
– Data quality
– Performance and scalability
– Integration testing
– User-acceptance testing
– Regression testing
Planning the DWH QA Strategy
Carefully review:
– Requirements documentation
– Data models for source and target schemas
– Source to target mappings
– ETL / stored proc design & logic
– CA deployment tasks / steps
– Required QA tools
7.
Challenges for DWH Testers (1)
1. Often inadequate ETL design documents
2. Source table field values unexpectedly null
3. Excessive ETL errors discovered after entry to QA
4. Source data does not meet table mapping specs (e.g., dirty data)
5. Source to target mappings:
1. Often not reviewed by all stakeholders
2. Not consistently maintained through dev lifecycle
3. Therefore, in error
Challenges for DWH Testers (2)
6. Data models not maintained
7. Target data does not meet mapping specifications
8. Duplicate field values when defined to be
DISTINCT
9. ETL SQL / transformation errors that lead to missing rows and
invalid field values
10. Constraint violations in source data
11. Table keys are incorrect for important RDB
linkages
8.
Challenges for DWH Testers (3)
12. Huge source data volumes and wide variety of data types
13. Source data quality that must be profiled before
loading to DWH
14. Redundancy, duplicate source data.
15. Many source data records to be rejected
16. ETL logs w/ messages to be acted upon.
17. Source field values may be missing where they
should always be present.
Challenges for DWH Testers (4)
19. Source data history & business rules may not
be available.
20. SMEs and business rules may not be available
21. Data ETLs must often pass through multiple
phases
22. Transaction-level traceability will be difficult to
attain in a data warehouse.
23. The data warehouse will be a strategic
enterprise resource and heavily relied upon
9.
Plan for QA Tools
Identify QA skills (1)
• Understanding fundamental DWH and DB concepts
• High skill w/ SQL and stored procedures
• Understanding of data used by the business
• Developing strategies, test plans, and test cases specific to DWH and the business
• Creating effective ETL test cases / scenarios based on
loading technology and business requirements
• Understanding of data models, data mapping
documents, ETL design and ETL coding; ability to
provide feedback to designers and developers
10.
Identify QA skills (2)
• Experience with Oracle, SQL Server, Sybase, DB2
technology
• Informatica session troubleshooting
• Deploying DB code to databases
• Unix scripting, Autosys, Anthill, etc.
• SQL editors
• Data profiling
• Use of Excel & MS Access for data analysis
Basic ETL Verifications (1)
• Verify data mappings, source to target
• Verify that all table fields were loaded from source
to staging
• Verify that keys were properly generated using
sequence generator
• Verify that not-null fields were populated
• Verify no data truncation in each field
• Verify data types and formats are as specified in
design phase
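Several of the verifications above reduce to plain SQL. The sketch below runs two of them (not-null population and truncation detection) with Python's sqlite3; the table, column names, and 2-character width limit are hypothetical, stand-ins for whatever the mapping document specifies.

```python
# Minimal sketch of basic ETL field verifications (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE stg_customers (id INTEGER, email TEXT, country TEXT)")
cur.executemany(
    "INSERT INTO stg_customers VALUES (?, ?, ?)",
    [(1, "a@x.com", "NZ"), (2, None, "AU"), (3, "b@y.com", "NZL")],
)

# Verify that not-null fields were populated.
null_emails = cur.execute(
    "SELECT COUNT(*) FROM stg_customers WHERE email IS NULL"
).fetchone()[0]

# Verify no data truncation: flag values over the declared width
# (an assumed limit of 2 characters for country codes).
over_width = cur.execute(
    "SELECT COUNT(*) FROM stg_customers WHERE LENGTH(country) > 2"
).fetchone()[0]

print(null_emails, over_width)  # counts of offending rows
```

In practice the same queries run via TOAD or a SQL editor against the staging schema; scripting them makes the checks repeatable across loads.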
11.
Basic ETL Verifications (2)
• Verify no duplicate records in target tables.
• Verify transformations against the data low-level
designs (LLDs)
• Verify that numeric fields are populated with
correct precision
• Verify that every ETL session completed with only
planned exceptions
• Verify all cleansing, transformation, error and
exception handling
• Verify PL/SQL calculations and data mappings
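The duplicate-record check is a one-query GROUP BY; a minimal sketch with a hypothetical product dimension whose business key should be distinct:

```python
# Duplicate-record detection in a target table (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dim_product (product_key INTEGER, sku TEXT)")
cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, "A-100"), (2, "A-200"), (3, "A-100")])

# Business key sku is defined to be DISTINCT; any group with
# COUNT(*) > 1 is a defect to report against the ETL.
dupes = cur.execute("""
    SELECT sku, COUNT(*) AS n
    FROM dim_product
    GROUP BY sku
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('A-100', 2)]
```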
Examples of DWH Defects (1)
1. Inadequate ETL and stored procedure design documents
2. Field values are null when specified as “Not Null”.
3. Field constraints and SQL not coded correctly for
Informatica ETL
4. Excessive ETL errors discovered after entry to QA
5. Source data does not meet table mapping specifications
(ex., dirty data)
6. Source to target mappings: 1) often not reviewed, 2) not
consistently maintained through dev lifecycle, and
3) therefore in error
12.
Examples of DWH Defects (2)
7. Data models are not adequately maintained during
development lifecycle
8. Target data does not meet mapping specifications
9. Duplicate field values when defined to be DISTINCT
10. ETL SQL / transformation errors leading to missing rows
and invalid field values
11. Constraint violations in source
12. Target data is incorrectly stored in nonstandard formats
13. Table keys are incorrect for important relationship linkages
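Defects 11 and 13 (constraint violations, incorrect table keys) are often caught with an orphan-key query: LEFT JOIN the fact table to its dimension and keep rows with no match. A sketch with hypothetical fact/dimension tables:

```python
# Orphan foreign-key detection between fact and dimension (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY);
    INSERT INTO dim_customer VALUES (1), (2);
    CREATE TABLE fact_sales (sale_id INTEGER, customer_key INTEGER);
    INSERT INTO fact_sales VALUES (10, 1), (11, 2), (12, 99);
""")

# Fact rows whose customer_key has no matching dimension row
# would break every report joining through this linkage.
orphans = cur.execute("""
    SELECT f.sale_id, f.customer_key
    FROM fact_sales f
    LEFT JOIN dim_customer d ON d.customer_key = f.customer_key
    WHERE d.customer_key IS NULL
""").fetchall()
print(orphans)  # [(12, 99)]
```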
Verifying Data Loads
From RTTS
13.
Planning for DWH QA (1)
Data integration planning (Data model, LLD’s)
1. Gain understanding of data to be reported by the
application (e.g., via profiling) and the tables upon
which each user report will be based
2. Review, understand data model – gain understanding
of keys, flows from source to target
3. Review, understand data LLD’s and mappings: add,
update sequences for all sources of each target table
Planning for DWH QA (2)
ETL planning and testing (source inputs & ETL design)
1. Participate in ETL design reviews
2. Gain in-depth knowledge of ETL sessions, the order
of execution, constraints, transformations
3. Participate in development ETL test case reviews
4. After ETL’s are run, use checklists for QA
assessments of rejects, session failures, errors
14.
Planning for DWH QA (3)
Assess ETL logs: session, workflow, errors
1. Review ETL workflow outputs, source to target
counts
2. Verify source to target mapping docs with loaded
tables using TOAD and other tools
3. After ETL runs or manual data loads, assess data in
every table with focus on key fields (dirty data,
incorrect formats, duplicates, etc.). Use TOAD, Excel
tools. (SQL queries, filtering, etc.)
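The source-to-target count check in step 1 can be scripted so that a mismatch also names the missing keys, which then get reconciled against the ETL reject log. A sketch with hypothetical source and target tables:

```python
# Source-to-target row-count reconciliation (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE src_orders (order_id INTEGER);
    INSERT INTO src_orders VALUES (1), (2), (3), (4);
    CREATE TABLE tgt_orders (order_id INTEGER);
    INSERT INTO tgt_orders VALUES (1), (2), (3);
""")

src_count = cur.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
tgt_count = cur.execute("SELECT COUNT(*) FROM tgt_orders").fetchone()[0]

# EXCEPT pinpoints which source keys never arrived in the target;
# each one should match a planned reject or be raised as a defect.
missing = cur.execute("""
    SELECT order_id FROM src_orders
    EXCEPT
    SELECT order_id FROM tgt_orders
""").fetchall()
print(src_count, tgt_count, missing)
```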
Planning for DWH QA (4)
GUI and report validations
1. Compare report data with target data.
2. Verify that reporting meets user expectations
Analytics test team data validation
1. Test data as it is integrated into application
2. Provide tools and tests for data validation
15.
DQ tools / techniques for QA team
TOAD / SQL Navigator
• Data profiling for value range & boundary analysis
• Null field analysis
• Row counting
• Data type analysis
• Referential integrity analysis
• Distinct value analysis by field
• Duplicate data analysis (fields and rows)
• Cardinality analysis
• Stored procedure & package verification
Excel
• Data filtering for profile analysis
• Data value sampling
• Data type analysis
MS Access
• Table and data analysis across schemas
QTP
• Automated testing of templates and application screens
RTTS QuerySurge
Analytics Tools
• J – statistics, visualization, data manipulation
• Perl – data manipulation, scripting
• R – statistics
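The profiling analyses listed for TOAD / SQL Navigator (null fields, distinct values, boundaries) can also be scripted for repeatability across loads. A minimal sketch, with a hypothetical staging table, that profiles one column per call:

```python
# Per-column data profile: null count, distinct count, min/max boundaries.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE stg_sales (amount REAL, region TEXT)")
cur.executemany("INSERT INTO stg_sales VALUES (?, ?)",
                [(10.0, "N"), (250.0, "S"), (None, "N"), (9999.0, "X")])

def profile(table, column):
    # COUNT(*) counts all rows; COUNT(col) skips NULLs, so the
    # difference is the null count. MIN/MAX expose boundary outliers.
    row = cur.execute(f"""
        SELECT COUNT(*) - COUNT({column}) AS nulls,
               COUNT(DISTINCT {column})   AS distinct_vals,
               MIN({column}), MAX({column})
        FROM {table}
    """).fetchone()
    return dict(zip(("nulls", "distinct", "min", "max"), row))

print(profile("stg_sales", "amount"))
```

Outliers surfaced by MIN/MAX (like the 9999.0 here) are exactly the boundary cases that value-range profiling is meant to flag before the data reaches the warehouse.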
Bottom Line Recommendations
• Involve test team in entire DWH SDLC
• Profile source and target data
• Remember: DWH QA is much more than
source and target record counts
• Develop testers’ SQL and DWH skills
• Assure availability of source to target mapping
document
• Plan for regression and automated testing
16.
Planning Dev/Unit Tests
Unit testing checklist
• Some programmers are not well trained as testers. They may like to program, deploy the code, and move on to the next development task without a thorough unit test. A checklist will aid database programmers to systematically test their code before formal QA testing.
• Check the mapping of fields that support data staging and data marts.
• Check for duplication of values generated using sequence generators.
• Check the correctness of surrogate keys that uniquely identify rows of data.
• Check for data-type constraints of the fields present in staging and core levels.
• Check the data loading status and error messages after ETLs (extracts, transformations, loads).
• Look for string columns that are incorrectly left- or right-trimmed.
• Make sure all tables and specified fields were loaded from source to staging.
• Verify that not-null fields were populated.
• Verify that no data truncation occurred in each field.
• Make sure data types and formats are as specified during database design.
• Make sure there are no duplicate records in target tables.
• Make sure data transformations are correctly based on business rules.
• Verify that numeric fields are populated with the correct precision.
• Make sure every ETL session completed with only planned exceptions.
• Verify all data cleansing, transformation, and error and exception handling.
• Verify stored procedure calculations and data mappings.
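The trimming item on this checklist is easy to automate: compare each string value against its trimmed form. A sketch with a hypothetical staging table:

```python
# Detect string values with leading/trailing whitespace, a common
# symptom of missing LTRIM/RTRIM in the ETL transformation (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE stg_names (name TEXT)")
cur.executemany("INSERT INTO stg_names VALUES (?)",
                [("Alice",), (" Bob",), ("Carol ",)])

# Any value that differs from its trimmed form failed the trim rule.
untrimmed = cur.execute(
    "SELECT name FROM stg_names WHERE name <> TRIM(name)"
).fetchall()
print(untrimmed)  # [(' Bob',), ('Carol ',)]
```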
Planning for Performance Tests
• As the volume of data in the warehouse grows, ETL execution times can be expected to increase, and query performance often degrades. These changes can be mitigated by a solid technical architecture and efficient ETL design. The aim of performance testing is to point out potential weaknesses in the ETL design, such as reading a file multiple times or creating unnecessary intermediate files. A performance and scalability testing checklist helps discover performance issues.
• Load the database with peak expected production volumes to help ensure that the volume of data can be loaded by the ETL process within the agreed-on window.
• Compare ETL loading times to loads performed with a smaller amount of data to anticipate scalability issues.
• Compare the ETL processing times component by component to pinpoint any areas of weakness.
• Monitor the timing of the reject process and consider how large volumes of rejected data will be handled.
• Perform simple and multiple-join queries to validate query performance on large database volumes.
• Work with business users to develop sample queries and acceptable performance criteria for each query.
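The small-load/large-load comparison above can be sketched as a simple timing harness; the row counts and table are hypothetical, and in practice the "load" would be the real ETL run rather than direct inserts:

```python
# Timing harness comparing a small load against a 10x load to
# anticipate scalability issues (hypothetical table and volumes).
import sqlite3
import time

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tgt (id INTEGER, payload TEXT)")

def timed_load(n):
    # Load n rows and return elapsed wall-clock seconds.
    rows = ((i, "x" * 50) for i in range(n))
    start = time.perf_counter()
    cur.executemany("INSERT INTO tgt VALUES (?, ?)", rows)
    conn.commit()
    return time.perf_counter() - start

small = timed_load(10_000)
large = timed_load(100_000)
# If 10x the data takes far more than ~10x the time, the load does not
# scale linearly and the ETL design warrants investigation.
print(f"small={small:.3f}s large={large:.3f}s")
```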
17.
Recommendations for Data Verifications
Detailed Recommendations for Data Development and QA
1. Need analysis of a) source data quality and b) data field profiles before input to Informatica and other data-build services.
2. QA should participate in all data model and data mapping reviews.
3. Need complete review of ETL error logs and resolution of errors by ETL teams before DB turn-over to QA.
4. Early use of QC during ETL and stored procedure testing to target vulnerable process areas.
5. Substantially improved documentation of PL/SQL stored procedures.
6. QA needs dev or separate environment for early data testing. QA should be able to modify data in order to perform negative tests. (QA currently does only positive tests because the application and database tests work in parallel in the same environment.)
7. Need substantially enhanced verification of target tables after each ETL load before data turn-over to QA.
8. Need mandatory maintenance of data models and source to target mapping / transformation rules documents from elaboration until transition.
9. Investments in more Informatica and off-the-shelf data quality analysis tools for pre- and post-ETL.
10. Investments in automated DB regression test tools and training to support frequent data loads.
Plan QA for All DWH Dev. Phases