1. Building a Fault-Tolerant ETL Pipeline for Claims CAFé
Internship Presentation - Summer 2018
Kim Hammar
Data Analytics Engineer
khamq@allstate.com or kimham@kth.se
August 30, 2018
3. Extract → Transform → Load
Failure modes at each stage:
Extract: corrupt data that cannot be parsed, inconsistent schema, wrong file paths.
Transform: unexpected null values, duplicates, code bugs, missing values, inconsistent data types.
Load: target database unavailable, permission errors, scalability problems.
4. Outline
1 Claims CAFé Background
2 Ensuring Data Quality
3 DEMO
4 Conclusion
5 Questions
5. What is Claims CAFé?
Claims CAFé: Claims (C)entral (A)nalytical (F)il(e)
claimId  policy    participant   persVeh       ...
B005     auto      John Doe      URK389 Audi   ...
B007     home      A. Svensson   PTO291 Ford   ...
B003     life      C. Strömbäck  RNU999 Volvo  ...
B004     property  G. Åsbrink    WEM650 Benz   ...
B002     health    L. Löfven     KQO209 Tesla  ...
Figure: example rows of the denormalized Claims table.
A one-stop shop for claims analytics.
Optimized for analytical use cases by storing raw denormalized data in Hadoop:
Increases scalability
Makes data processing easier: no more slow and complex SQL joins (a query sketch follows below)
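To make the "no joins" point concrete, here is a minimal PySpark sketch that reads the denormalized table directly; the table name claims_cafe and the column names come from the illustrative rows above, not from CAFé's actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# One flat read instead of joining claim, policy, participant, and vehicle
# tables: every denormalized row already carries all of its context.
claims = spark.table("claims_cafe")  # hypothetical Hive table name

audi_claims = (claims
               .filter(claims.persVeh.contains("Audi"))
               .select("claimId", "policy", "participant"))
audi_claims.show()
```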
6. What Actually Goes Into Building a Model
Data Science & Analytics: building models, reports & visualizations.
Data Engineering: ETL processes, data transformations & data cleaning.
7. Now:
Figure: the current state. Each analytics use case runs its own complex and slow join to pull data, then separately transforms & cleans the data and creates derived fields:
Predict claim automation: deep FNN model
Customer reports & dashboards
Link analysis for fraud detection
Classify image vehicle loss: deep CNN model (repairable, total loss, . . . , no loss)
8. Now vs. Future:
Figure: the current state from the previous slide, contrasted with the future state: a single "join, transform & clean, create derived fields" step feeds Claims CAFé, which in turn serves all of the downstream use cases (deep FNN model, link analysis, image classification) directly.
9. What Goes Into Making CAFé
Making Coffee:
Coffee plants → harvest
Processing: pulping the cherries, filtering out bad ones, sorting & drying
Prepare for consumption: roasting, grinding, brewing
Consume: drink, sell

Making Claims CAFé:
Raw data sources → ETL: X → f(X) → Y
Processing: fill in null values, create derived fields, data cleaning
Prepare for consumption: flatten, create different views, save to Hive
Consume: build models, generate business reports (an ETL sketch follows below)
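A minimal PySpark sketch of these processing and prepare-for-consumption steps, assuming hypothetical source/target table names and column names (not CAFé's actual schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
raw = spark.table("claims_mdm_raw")  # hypothetical source table

processed = (
    raw.fillna({"claimStatus": "UNKNOWN"})          # fill in null values
       .withColumn("claimAgeDays",                  # create a derived field
                   F.datediff(F.current_date(), F.col("lossDate")))
       .dropDuplicates(["claimId"])                 # data cleaning
)

# Prepare for consumption: flatten a nested struct into top-level
# columns, then save the result as a Hive table.
flat = processed.select("*", "policyDetails.*").drop("policyDetails")
flat.write.mode("overwrite").saveAsTable("claims_cafe")
```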
10. Claims CAFé Data Architecture
Data sources: ADW, NextGen, . . . , Allstate Canada.
Claims MDM: synchronizes and reconciles data from the various sources into a single place.
Claims CAFé: serves the needs of data scientists; contains data optimized for analytics:
Derived fields
Data cleaning
Data restructuring
Consumers: machine learning, analytics, and reports (e.g. "Number of fraudulent claims during the third quarter: X").
11. What Is a Claim From a Data Perspective?
Claim: Claim Details, Policy Details, Participants Details, Vehicle Details, . . .
Figure: Some of the data that a single Claim in Claims CAFé contains.
12. Data Quality Assurance
Data Quality is key to the success of Claims CAFé
Examples of data quality issues:
Null values in the wrong place
Duplicates
Missing values
Inconsistent data types
...
Figure: data quality issues vs. data size.
"Why does my Model not work?" (John Doe, Data Scientist)
13. Data Quality Mechanisms Motivation
Failures are the norm, not an exception.
Optimistic error estimate for a single pull:
P("CAFé error") = P("Data quality issue" ∪ "Hadoop failure" ∪ "Network failure" ∪ "ClaimsMDM failure" ∪ "CAFé code bug") ≈ 1/1000 = 0.001
CAFé stretch goal: near-real-time updates. Say we pull data every 15 minutes =⇒ 96 pulls per day =⇒ 672 pulls per week.
Expected number of failures per week: 0.001 × 672 = 0.672 =⇒ a failure will happen on average every other week (a worked version of this estimate follows below).
We want built-in mechanisms in the CAFé pipeline to detect and deal with errors before they affect end-users: Tests!
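A worked version of the estimate (a sketch; the per-source probabilities are the slide's assumptions, not measured values). The union bound justifies treating the combined error probability as at most the sum of the individual ones:

```latex
% Union bound over the failure sources E_i:
P(\text{CAF\'e error}) = P\Big(\bigcup_i E_i\Big) \le \sum_i P(E_i) \approx 10^{-3}
% With n = 96 \cdot 7 = 672 pulls per week, the expected number of
% failing pulls per week is n \cdot p = 672 \cdot 10^{-3} \approx 0.672,
% i.e. a failure roughly every other week.
```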
14. How to detect data quality issues? Know your data
Figure: a set of claims annotated with baseline questions about the data:
How many claims were recorded in Rhode Island 2016-2017?
How many null claims exist in Idaho?
What is the average number of claim participants?
What is the average number of covered claimants in a claim?
What is the maximum number of vehicles related to a claim?
(A query sketch for such baseline statistics follows below.)
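These baseline questions translate directly into aggregate queries. A PySpark sketch, with assumed column names (state, claimDate, numParticipants, numVehicles):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
claims = spark.table("claims_cafe")  # hypothetical table name

# How many claims were recorded in Rhode Island 2016-2017?
ri_count = claims.filter(
    (F.col("state") == "RI") &
    F.col("claimDate").between("2016-01-01", "2017-12-31")).count()

# How many null claims exist in Idaho?
idaho_nulls = claims.filter(
    (F.col("state") == "ID") & F.col("claimId").isNull()).count()

# Average participants per claim and maximum vehicles related to a claim.
claims.agg(F.avg("numParticipants"), F.max("numVehicles")).show()
```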
15. Anomaly Detection for Detecting Potential Data Issues (1/3)
Spikes in the number of null values indicate a data issue.
Figure: null counts per data pull, 06/18/18 to 06/21/18 (y-axis 0 to 16,000); a sudden spike in null counts is an indication of a data pull issue. (A detection sketch follows below.)
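One simple way to flag such spikes automatically: compare today's null count against the mean and standard deviation of recent pulls. A sketch with made-up numbers and an assumed three-sigma threshold:

```python
from statistics import mean, stdev

def is_spike(history, todays_count, num_stdevs=3):
    """Flag today's null count if it is far above recent history."""
    mu, sigma = mean(history), stdev(history)
    return todays_count > mu + num_stdevs * sigma

history = [950, 1020, 980, 1100, 990, 1050, 970]  # made-up daily null counts
print(is_spike(history, 16000))  # True: likely a data pull issue
```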
16. Anomaly Detection for Detecting Potential Data Issues (2/3)
Figure: number of claims in Rhode Island over time, 04.2017 to 03.2018 (monthly counts on a 6,000 to 14,000 scale), ending in a sudden drop.
Why did this drop happen?
17. Anomaly Detection for Detecting Potential Data Issues (3/3)
Figure: comparing a single claim's statistics with the average across the entire dataset for CoverageClaimants, PersVeh, and ClaimsParty.
39 parties in this claim vs. an average of 3: is this valid data or a bug? (An outlier-check sketch follows below.)
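A sketch of such a per-claim outlier check: compare each statistic of a single claim against the dataset-wide average and flag large deviations. The field names and the 10x threshold are illustrative:

```python
def flag_outlier_fields(claim, averages, ratio=10.0):
    """Return the fields whose value exceeds the dataset average by ratio x."""
    return [field for field, avg in averages.items()
            if claim.get(field, 0) > ratio * avg]

averages = {"coverageClaimants": 2, "persVeh": 3, "claimsParty": 3}
claim = {"coverageClaimants": 1, "persVeh": 1, "claimsParty": 39}
print(flag_outlier_fields(claim, averages))  # ['claimsParty']: investigate
```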
18. Regression Tests Background (1/2)
What is regression testing?
Regression tests verify modifications of a program or its data.
If a modification fails the tests, the program can be rolled back.
I just refactored the CAFé pipeline, how do I know that I didn't break anything?
I just pulled new data into CAFé, how do I know that I did not introduce data quality issues?
Why do we want regression tests?
They increase confidence in making code and data changes.
We can detect bugs before they bother end-users.
We can avoid unnecessary work: if the tests fail, we can abort early and save time. (A test sketch follows below.)
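A hedged sketch of what such data regression tests could look like with pytest-style test functions and PySpark; the staging and production table names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def test_no_duplicate_claim_ids():
    staged = spark.table("claims_cafe_staging")  # hypothetical staging table
    assert staged.count() == staged.select("claimId").distinct().count()

def test_schema_unchanged():
    expected = set(spark.table("claims_cafe").columns)  # last good version
    actual = set(spark.table("claims_cafe_staging").columns)
    assert actual == expected
```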
19. Regression Tests Background (2/2)
Naive ETL Pipeline For Updating Claims CAFé:
Figure: Claims MDM → Load & Transform → Claims CAFé, with no checks along the way.
21. Regression Tests Pipeline for CAFé
CAFé regression tests run against the data sources before each build:
Verify data paths of sources: abort if the data paths do not exist.
Verify schema of sources: abort if the schema is invalid.
Verify data quality: count null values and columns, and compare with the last snapshot.
Verdict (x and y are configurable thresholds):
¬∃ duplicates ∧ null increase < x ∧ count increase < y =⇒ build
∃ duplicates ∧ null increase < x ∧ count increase < y =⇒ build & email maintainers
∃ duplicates ∧ null increase > x ∧ count increase > y =⇒ abort & email maintainers
On completion: email maintainers, save the test report, save the snapshot. (A sketch of the verdict logic follows below.)
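The verdict rules above map directly onto a small function. A sketch in plain Python; the TestResult fields and thresholds are assumptions, and cases the slide leaves uncovered default to abort:

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    has_duplicates: bool
    null_increase: float   # increase vs. the last snapshot
    count_increase: float  # increase vs. the last snapshot

def verdict(r: TestResult, x: float, y: float) -> str:
    if not r.has_duplicates and r.null_increase < x and r.count_increase < y:
        return "build"
    if r.has_duplicates and r.null_increase < x and r.count_increase < y:
        return "build & email maintainers"
    return "abort & email maintainers"  # covers the slide's abort rule

print(verdict(TestResult(False, 0.01, 0.02), x=0.05, y=0.05))  # "build"
```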
25. Conclusion
Fault-Tolerant ETL
What: an ETL pipeline that can detect and deal with data quality errors and bugs.
Why: avoid errors in production; make full use of the data; more productive analytics.
How: know your data; use tests; integrate them with CI jobs.
Lessons Learned
Setting verdicts is hard.
Without CI, the tests will not be run.
Security must be part of the design.
Release management is important.
There are trade-offs when choosing an architecture.