Many believe that regression testing an application with minimal data is sufficient. However, the data testing methodology becomes far more complex with big data applications. Testing can now be done within the data fabrication process as well as in the data delivery process. Today, comprehensive testing is often mandated by regulatory agencies—and more importantly by customers. Finding issues before deployment and saving your company’s reputation—and in some cases preventing litigation—is critical. Jason Rauen presents an overview of the architecture, processes, techniques, and lessons learned by an original big data company. Detecting defects up-front is vital. Learn how to test thousands, millions, and in some cases billions—yes, billions—of records directly, rendering sampling procedures obsolete. See how you can save your organization time and money—and have better data test coverage than ever before.
Jason Rauen
LexisNexis
Jason Rauen is a senior quality test analyst at Georgia-based LexisNexis Risk
Solutions. With more than fifteen years of experience, Jason has led the data testing
team in big data from its inception. He has presented big data scripting techniques at
the HPCC Systems national Data Summit. He has worked at companies
including Microsoft, AT&T, and LexisNexis, and has taught at Intel, Boeing,
Executrain, and the Department of the Navy. Jason has transitioned through various
aspects of technology including technical sales, customer support, training, quality
control/quality assurance, and into management.
2/4/2014
Overview
• Architecture and why you need to know
– HPCC Systems/Hadoop
– Know Your Data/Environment
• Why Test Big Data and How it’s Different
– Issues
– Benefits
• Strategies and Concepts
– What to look for
– Sample Gathering (AUB)
– Stats
– Profiling
Architecture and why you need to know
Data Warehouse Architecture
[Diagram: Source Files; EXTRACT, TRANSFORM, LOAD; Staging (Data Cleansing); DATA WAREHOUSE]
Why Test Big Data and How it’s Different
Why Test Big Data:
• Traditional methods not adequate – Traditional sampling
needs improvement and is scenario based, not enough
samples, human error, etc.
• Size of the data is huge, from different
sources, and inconsistent
• Tied into current environment
• Government regulatory compliance
• Auditing requirements
• Company wide initiatives
• The business makes crucial decisions
based off of it
Why Test Big Data and How it’s Different
Want to keep your customers?
Why Test Big Data and How it’s Different
• When?
o Testing ‐ SDLC
o Routine Testing
o Frequency ‐ Yearly/Monthly/Weekly/Daily/Hourly/On
Demand
• What? Types of Testing
New Project – Source to Target (Transform)
Standard ‐ Production Validation
Emergency releases
• How?
o Using what you have available
o Freebies – Profiling tools, etc…
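A source-to-target check for a new project can be sketched in a few lines of Python. The record layout (`id`, `name`), the `expected_transform` rule, and the function names are hypothetical and chosen only for illustration; a real build would apply its own mapping rules and would run on the cluster rather than in memory.

```python
def expected_transform(source_row):
    """Hypothetical transform rule: trim and upper-case the name field."""
    return {"id": source_row["id"], "name": source_row["name"].strip().upper()}


def source_to_target_mismatches(source_rows, target_rows):
    """Return the ids whose target record differs from the expected transform."""
    target_by_id = {row["id"]: row for row in target_rows}
    mismatches = []
    for row in source_rows:
        expected = expected_transform(row)
        # A missing target record also counts as a mismatch.
        if target_by_id.get(row["id"]) != expected:
            mismatches.append(row["id"])
    return mismatches


# Illustrative data only: record 2 was loaded without the transform applied.
source = [{"id": 1, "name": " smith "}, {"id": 2, "name": "jones"}]
target = [{"id": 1, "name": "SMITH"}, {"id": 2, "name": "jones"}]
print(source_to_target_mismatches(source, target))  # prints [2]
```

Because every source record is checked against its target, this style of test does not depend on picking a representative sample.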
Why Test Big Data and How it’s Different
Issues:
• Lack of control
Timing of builds
Samples and location of samples
• 3rd Party Apps
Lack of licenses, Costs, Training, and existing
knowledge
• Extra hardware
• Upgrades
Why Test Big Data and How it’s Different
Benefits:
• Cost savings
• Better Coverage
No Samples
Increased Sampling
Focused Samples
• Faster (Time is $)
• Quicker to diagnose issues
• Better Data Integrity
• Collaboration with other groups
Strategies and Concepts
• What to look for……
Brand New, Incomplete, or Missing Builds (Data Cops)
Data progression: Today/Yesterday, FatherKey/GrandfatherKey
Count of Deltas in release/deploy
Keys updated
Missing keys/New keys
Field Validations – mandatory fields blank, consistency, etc…
Key Layout issues
Corruption: unprintable or invalid characters
Duplicate records of new and existing records
Data Fabrication Engine to Data delivery Engine deploys/sync
Queries with new data
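Several of the checks listed above (blank mandatory fields, unprintable characters, duplicate records) can be run over every record rather than a sample. The following is a minimal sketch, not the presenter's tooling: the field names in `MANDATORY_FIELDS` are hypothetical, and treating anything outside `string.printable` (ASCII) as corruption is an assumption that would be too strict for data that legitimately contains non-ASCII text.

```python
import string
from collections import Counter

PRINTABLE = set(string.printable)
MANDATORY_FIELDS = ("key", "name")  # hypothetical layout, for illustration only


def validate_records(records):
    """Count blank mandatory fields, unprintable characters, and duplicate rows."""
    issues = Counter()
    seen = set()
    for record in records:
        for field in MANDATORY_FIELDS:
            value = record.get(field)
            if value is None or not str(value).strip():
                issues["blank_" + field] += 1
        # Assumption: any character outside ASCII printable is corruption.
        if any(ch not in PRINTABLE for v in record.values() for ch in str(v)):
            issues["unprintable"] += 1
        # A record is a duplicate if an identical one was already seen.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            issues["duplicate"] += 1
        seen.add(fingerprint)
    return dict(issues)


# Illustrative records: one blank name, one embedded NUL, one exact duplicate.
records = [
    {"key": "K1", "name": "Alpha"},
    {"key": "K2", "name": ""},
    {"key": "K3", "name": "Bad\x00Name"},
    {"key": "K1", "name": "Alpha"},
]
print(validate_records(records))
```

Returning counts rather than raising on the first failure keeps the check usable as a per-build report.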
Strategies and Concepts
JOIN
• Sample gathering
• New Key for testing
• Deployment Validation
‐ Data Fabrication
• Deployment Validation
‐ Data Delivery
And get a free cookie…
Strategies and Concepts
AUB for JOIN
A = Left key (New)
B = Right key (Old)
Types of JOINS
Inner Join Left Outer Join Right Outer Join
Full Outer Join Minus or Left Only
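The join types above map onto set operations over the keys of the two builds. The sketch below reads "AUB" as A ∪ B, which is an interpretation of the slide rather than something it states; the function name and the sample keys are hypothetical.

```python
def compare_keys(new_keys, old_keys):
    """Classify keys from the new build (A) against the old build (B)."""
    a, b = set(new_keys), set(old_keys)
    return {
        "inner": a & b,        # keys present in both builds
        "left_only": a - b,    # new keys: in A but not in B (minus)
        "right_only": b - a,   # missing keys: in B but not in A
        "full_outer": a | b,   # every key from either build (A union B)
    }


# Illustrative keys only: k4 is new today, k3 has gone missing.
today = ["k1", "k2", "k4"]
yesterday = ["k1", "k2", "k3"]
result = compare_keys(today, yesterday)
print({name: len(keys) for name, keys in result.items()})
```

The sizes of `left_only` and `right_only` give the count of deltas between builds, and the keys themselves make focused samples for follow-up queries.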
Strategies and Concepts
Sort
SQL: SELECT * FROM Products ORDER BY productcode;
Pig: Products = ORDER Products BY productcode; DUMP Products;
ECL: SORT(Products, productcode);

Full outer join
SQL: SELECT * FROM Products FULL OUTER JOIN OtherProducts ON Products.col1 = OtherProducts.col1;
Pig: Products = JOIN Products BY col1 FULL OUTER, OtherProducts BY col1; DUMP Products;
ECL: JOIN(Products, OtherProducts, LEFT.col1 = RIGHT.col1, FULL OUTER);
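For readers who know none of the three languages, the same sort and full outer join can be written in plain Python. This is only an in-memory sketch for small inputs: the sample rows and the `label` field are made up, and SQL's NULL-key semantics are not modeled.

```python
def full_outer_join(left_rows, right_rows, key):
    """Pair rows on `key`; unmatched rows get None on the other side."""
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[key], []).append(row)

    joined, matched = [], set()
    for left in left_rows:
        matches = right_by_key.get(left[key])
        if matches:
            matched.add(left[key])
            joined.extend((left, right) for right in matches)
        else:
            joined.append((left, None))  # left-only row
    for value, rows in right_by_key.items():
        if value not in matched:
            joined.extend((None, right) for right in rows)  # right-only rows
    return joined


# Illustrative rows only.
products = [{"col1": 2, "productcode": "B"}, {"col1": 1, "productcode": "A"}]
other_products = [{"col1": 2, "label": "two"}, {"col1": 3, "label": "three"}]

by_code = sorted(products, key=lambda row: row["productcode"])  # the sort
pairs = full_outer_join(products, other_products, "col1")       # the join
```

Here `pairs` holds one matched pair for `col1 = 2`, one left-only row for `col1 = 1`, and one right-only row for `col1 = 3`.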
Summary
Why Test Big Data and How it’s
Different
Architecture and why you need to know
Strategies and Concepts