SlideShare a Scribd company logo
1 of 27
Download to read offline
Building a Fault-Tolerant ETL Pipeline for Claims CAFé
Internship Presentation - Summer 2018
Kim Hammar
Data Analytics Engineer
khamq@allstate.com or kimham@kth.se
August 30, 2018
Kim Hammar (DAE) Claims CAFé August 30, 2018 1 / 24
Extract
Transform
Load
Kim Hammar (DAE) Claims CAFé August 30, 2018 2 / 24
Extract
Transform
Load
Corrupt data that cannot be parsed
Inconsistent Schema
Wrong File paths
Unexpected null values
Duplicates
Code bugs
Missing values
Inconsistent data types
Target database unavailable
Permission error
Scalability problems
Kim Hammar (DAE) Claims CAFé August 30, 2018 2 / 24
Outline
1 Claims CAFé Background
2 Ensuring Data Quality
3 DEMO
4 Conclusion
5 Questions
Kim Hammar (DAE) Claims CAFé August 30, 2018 3 / 24
What is Claims CAFé?
Claims CAFé: Claims (C)entral (A)nalytical (F)il(e)
claimId policy participant persVeh ...
B005 auto John Doe URK389 Audi ...
B007 home A.Svensson PTO291 Ford ...
B003 life C.Strömbäck RNU999 Volvo ...
B004 property G.Åsbrink WEM650 Benz ...
B002 health L.Löfven KQO209 Tesla ...



Claims
A one stop shop for claims analytics.
Optimized for analytical use-cases by saving raw denormalized data in
hadoop:
Increase scalability
Makes data processing easier: No more slow and complex SQL-Joins
Kim Hammar (DAE) Claims CAFé August 30, 2018 4 / 24
What Actually Goes Into Building a Model
Data Science & Analytics
Data Engineering
Building models,
reports &
Visualizations
ETL processes,
data transformations &
data cleaning
Kim Hammar (DAE) Claims CAFé August 30, 2018 5 / 24
Now:
complex and slow
join to pull data
complex and slow
join to pull data
complex and slow
join to pull data
complex and slow
join to pull data
transform & clean data
create derived fields
transform & clean data
create derived fields
transform & clean data
create derived fields
transform & clean data
create derived fields
Predict claim automation:
Deep FNN model
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Customer report
& dashboards
Link Analysis
Fraud Detection
Customer1
Customer2
Claim1
Claim2
Claimant
Participant
3
10
10
10
4 9 4
5
10
11
1
Classify Image Vehicle Loss
Deep CNN Model






repairable = 0
total loss = 1
...
no loss = 0






Kim Hammar (DAE) Claims CAFé August 30, 2018 6 / 24
Now:
complex and slow
join to pull data
complex and slow
join to pull data
complex and slow
join to pull data
complex and slow
join to pull data
transform & clean data
create derived fields
transform & clean data
create derived fields
transform & clean data
create derived fields
transform & clean data
create derived fields
Predict claim automation:
Deep FNN model
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Customer report
& dashboards
Link Analysis
Fraud Detection
Customer1
Customer2
Claim1
Claim2
Claimant
Participant
3
10
10
10
4 9 4
5
10
11
1
Classify Image Vehicle Loss
Deep CNN Model






repairable = 0
total loss = 1
...
no loss = 0






Future:
complex and slow
join to pull data
transform & clean data
create derived fields
Claims
CAFé
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Customer1
Customer2
Claim1
Claim2
Claimant
Participant
3
10
10
10
4 9 4
5
10
11






repairable = 0
total loss = 1
...
no loss = 0






Kim Hammar (DAE) Claims CAFé August 30, 2018 6 / 24
What Goes Into Making CAFé
Coffee Plants Harvest
Processing:
pulping the cherries,
filtering out bad ones,
sorting&drying
Prepare for consumption:
roasting,
grinding,
brewing
Consume:
drink,
sell
Making Coffee
Raw Data Sources ETL
X f () Y
Processing:
fill in null values,
create derived fields,
data cleaning
Prepare for consumption:
flatten,
create different views,
save to Hive
x
f(x)
Sekante
P
f(x0)
x0
Q
x0 + ε
f(x0 + ε) − f(x0)
ε
ε
f(x0 + ε)
Consume:
build models,
generate business reports
Making Claims CAFé
Kim Hammar (DAE) Claims CAFé August 30, 2018 7 / 24
Claims CAFé Data Architecture
Data Sources ADW NextGen . . . Allstate
Canada
Claims
MDM
Claims MDM
Synchronizes and reconciles data
from various sources into a single place
Claims
CAFé
Claims CAFé
Serves the need of data scientists.
Contains data optimized for analytics
Derived fields
Data cleaning
Data restructuring
Machine Learning
x
f(x)
Sekante
P
f(x0)
x0
Q
x0 + ε
f(x0 + ε) − f(x0)
ε
ε
f(x0 + ε)
Analytics
[−4, −2) [−2, 0) [0, 2) [2, 4)
0
500
1,000
1,500
2,000
2,500
Reports
Number of
fraudulent claims
during the
third quarter: X
Kim Hammar (DAE) Claims CAFé August 30, 2018 8 / 24
What Is a Claim From A Data Pespective?
Claim
Claim Details
...
Policy Details
...
Participants Details
...
Vehicle Details
...
...
...
Figure: Some of the data that a single Claim in Claims CAFé contains.
Kim Hammar (DAE) Claims CAFé August 30, 2018 9 / 24
Data Quality Assurance
Data Quality is key to the success of Claims CAFé
Examples of data quality issues:
Null values in the wrong place
Duplicates
Missing values
Inconsistent data types
...
Data Quality Issues
Data Size
Why does my Model not work?
John Doe, Data Scientist
Kim Hammar (DAE) Claims CAFé August 30, 2018 10 / 24
Data Quality Mechanisms Motivation
Failures are the norm, they are not an exception
Optimistic error estimate: P("CAFé error") =
P("Data quality issue")∪P("Hadoop failure")∪P("Network failure")∪
P("ClaimsMDM failure") ∪ P("CAFé code bug") ≈ 1/1000 = 0.001
CAFé stretch goal: near-real-time updates, say we pull data every 15
minutes =⇒ 96 pulls per day =⇒ 672 pulls per week
failure probability per week: 0.001 × 672 = 0.672 =⇒
a failure will happen on average every other week
We want built-in mechanisms in the CAFé pipeline to detect and
deal with errors before they affect end-users: Tests!
Kim Hammar (DAE) Claims CAFé August 30, 2018 11 / 24
How to detect data quality issues? Know your data
. . . Claim 1
...
Claim 2
...
Claim 3
...
Claim 4
...
Claim 5
...
How many claims
were recorded
in Rhode Island
2016-2017?
How many
null claims
exists in
Idaho?
What is the average
number of claim par-
ticipants?
What is the average
number of covered
claimants in a claim?
What is the maximum
number of vehicles re-
lated to a claim?
Kim Hammar (DAE) Claims CAFé August 30, 2018 12 / 24
Anomaly Detection for Detecting Potential Data Issues
(1/3)
Spikes in the number of null values indicate a data issue.
06/18/18 06/18/18 06/19/18 06/19/18 06/20/18 06/20/18 06/21/18 06/21/18
Date
0
2000
4000
6000
8000
10000
12000
14000
16000
NullCounts
Indication of a data pull issue
Kim Hammar (DAE) Claims CAFé August 30, 2018 13 / 24
Anomaly Detection for Detecting Potential Data Issues
(2/3)
04.2017
05.2017
06.2017
07.2017
08.2017
09.2017
10.2017
11.2017
12.2017
01.2018
02.2018
03.2018
6,000
8,000
10,000
12,000
14,000
Date
Count
Number of claims in Rhode Island over time
Why did this drop happen?
Kim Hammar (DAE) Claims CAFé August 30, 2018 14 / 24
Anomaly Detection for Detecting Potential Data Issues
(3/3)
CoverageClaimants PersVeh ClaimsParty
0
20
40
2
1
3
1
4
39
Count
Comparing a single claim statistics with the average
Average cross entire dataset Actual value for this claim
39 Parties in this claim vs average on 3
is this valid data or a bug?
Kim Hammar (DAE) Claims CAFé August 30, 2018 15 / 24
Regression Tests Background (1/2)
What is regression testing?
Regression tests verify modifications of a program or data.
If the modification fail the tests, the program can regress back.
I just refactored the CAFé pipeline, how do I know that I didn’t
break anything?
I just pulled new data into CAFé, how do I know I did not
introduce data quality issues?
Why do we want regression tests?
It increases the confidence in making code and data changes
We can detect bugs before they bother end-users
We can avoid unnecessary work: If the tests fail we can abort early
and save time.
Kim Hammar (DAE) Claims CAFé August 30, 2018 16 / 24
Regression Tests Background (2/2)
Naive ETL Pipeline For Updating Claims CAFé:
Load & Transform
Claims
MDM
Claims
CAFé
Kim Hammar (DAE) Claims CAFé August 30, 2018 17 / 24
Regression Tests Background (2/2)
Naive ETL Pipeline For Updating Claims CAFé:
Load & Transform
Claims
MDM
Claims
CAFé
A More Robust ETL Pipeline:
Claims
MDM
Regression
Tests
Claims
CAFé
Tests failed
Email
maintainers
Missing columns:...
Null values:...
Anomalies:...
...
Generate
report
Abort ETL
Load &
Transform
Tests passed
Kim Hammar (DAE) Claims CAFé August 30, 2018 17 / 24
Regression Tests Pipeline for CAFé
Data Sources
Abort
Abort
Abort
Data paths does not exist
Invalid schema
Data paths exist
CAFé Regression Tests
Verify data paths of sources
Verify schema of sources
Verify data quality
Count null values and
compare with snapshot
Count columns and
compare with snapshot
Verdict
¬∃ duplicates ∧ null increase < x ∧ counts increase < y =⇒ build
∃ duplicates ∧ null increase < x ∧ counts increase < y =⇒ build & email
∃ duplicates ∧ null increase > x ∧ counts increase > y =⇒ abort & email
Valid schema
Email
maintainersSave test reportSave snapshot
Kim Hammar (DAE) Claims CAFé August 30, 2018 18 / 24
Regression Tests Reporting
Test Report: 26/7-2018
Passed
79.0%
Failed
21.0%
Test Results
valid_data_paths: passed, time: 7s
valid_source_schema: passed, time: 19s
no_duplicates: failed, time: 26s
claim_counts: passed, time: 41s
claim_null_counts: passed, time: 38s
...
claimCount
claimNullCount
financialsCount
financialsNullCount
summaryCount
summaryNullCountidCount
idNullCount
claimOccurrenceCount
claimOccurrenceNullCount
policyCount
policyNullCount
persVehCount
persVehNullCount
claimsPartyCount
claimsPartyNullCount
locationCount
locationNullCount
Metric
0
500
Counts
Snapshot counts
07/23/18
08/06/18
08/20/18
09/03/18
09/17/18
10/01/18
10/15/18
10/29/18
Date
0
20
40
60
FailureCount
Build failures over time
07/23/18
08/06/18
08/20/18
09/03/18
09/17/18
10/01/18
10/15/18
10/29/18
Date
0
20
40
60
80
100
Count
Claims counts over time
claimNullCount
claimCount
Kim Hammar (DAE) Claims CAFé August 30, 2018 19 / 24
Data Quality Mechanisms: Putting It All Together
Cluster
Daily
Data Pull
Build → Unit tests → Regression Tests
Build → Unit tests → Regression Tests
Abort
Merge Master Branch
Update CAFé
Push Commit
Trigger Jenkins Job
Trigger Jenkins Job
Kim Hammar (DAE) Claims CAFé August 30, 2018 20 / 24
DEMO
Kim Hammar (DAE) Claims CAFé August 30, 2018 21 / 24
Conclusion
Fault-Tolerant ETL
What
An ETL pipeline that can detect
and deal with data quality errors and bugs
Why
Avoid errors in production.
Make full use of data.
More productive analytics.
How
Know your data.
Use tests.
Integrate with CI jobs.
Lessons Learned
Setting verdicts is hard.
Without CI the tests will not be run.
Security must be in the design.
Release management is important.
Trade-offs when choosing architecture.
Kim Hammar (DAE) Claims CAFé August 30, 2018 22 / 24
Thank You!
Questions?
Kim Hammar (DAE) Claims CAFé August 30, 2018 23 / 24
Links
Claims CAFé Schema Repository1
Claims CAFé Schema Confluence Page2
Claims CAFé Data Pipeline and Tests Repository3
Claims CAFé Tests Pipeline Confluence Page4
Claims CAFé Continuous Integration Confluence Page5
Claims CAFé API Documentation Confluence Page6
1
https://github.allstate.com/d3-cafe/ClaimsCAFeSchema
2
http://conflu.allstate.com/display/DOMF2/Schema
3
https://github.allstate.com/d3-cafe/CAFe
4
http://conflu.allstate.com/display/DOMF2/Tests
5
http://conflu.allstate.com/display/DOMF2/Continuous+Integration
6
http://conflu.allstate.com/display/DOMF2/API+Documentation
Kim Hammar (DAE) Claims CAFé August 30, 2018 24 / 24

More Related Content

Similar to Kim Hammar - Allstate Internship Presentation Data Engineering Analytics - Claims CAFe

MATH 533 RANK Achievement Education--math533rank.com
MATH 533 RANK Achievement Education--math533rank.comMATH 533 RANK Achievement Education--math533rank.com
MATH 533 RANK Achievement Education--math533rank.comkopiko162
 
MATH 533 RANK Redefined Education--math533rank.com
MATH 533 RANK Redefined Education--math533rank.comMATH 533 RANK Redefined Education--math533rank.com
MATH 533 RANK Redefined Education--math533rank.comkopiko180
 
MATH 533 RANK Lessons in Excellence-- math533rank.com
MATH 533 RANK Lessons in Excellence-- math533rank.comMATH 533 RANK Lessons in Excellence-- math533rank.com
MATH 533 RANK Lessons in Excellence-- math533rank.comRoelofMerwe118
 
MATH 533 Exceptional Education - snaptutorial.com
MATH 533   Exceptional Education - snaptutorial.comMATH 533   Exceptional Education - snaptutorial.com
MATH 533 Exceptional Education - snaptutorial.comDavisMurphyB11
 
Digital Mines ver 2.0 | 7 lessons on automation i learnt leading digital in...
Digital Mines ver 2.0  |  7 lessons on automation i learnt leading digital in...Digital Mines ver 2.0  |  7 lessons on automation i learnt leading digital in...
Digital Mines ver 2.0 | 7 lessons on automation i learnt leading digital in...Coert Du Plessis (杜康)
 
Using Grid data analytics to protect revenue, reduce network losses and impro...
Using Grid data analytics to protect revenue, reduce network losses and impro...Using Grid data analytics to protect revenue, reduce network losses and impro...
Using Grid data analytics to protect revenue, reduce network losses and impro...Schneider Electric
 
MATH 533 Education Specialist / snaptutorial.com
MATH 533 Education Specialist / snaptutorial.comMATH 533 Education Specialist / snaptutorial.com
MATH 533 Education Specialist / snaptutorial.comMcdonaldRyan97
 
Microsoft MCSE 70-469 it braindumps
Microsoft MCSE 70-469 it braindumpsMicrosoft MCSE 70-469 it braindumps
Microsoft MCSE 70-469 it braindumpslilylucy
 
Math 533 Believe Possibilities / snaptutorial.com
Math 533  Believe Possibilities / snaptutorial.comMath 533  Believe Possibilities / snaptutorial.com
Math 533 Believe Possibilities / snaptutorial.comDavis29a
 
564 complete class
 564 complete class 564 complete class
564 complete classhwcampusinfo
 
The Case for Graphs in Supply Chains
The Case for Graphs in Supply ChainsThe Case for Graphs in Supply Chains
The Case for Graphs in Supply ChainsNeo4j
 
Decision making tool (ahp)
Decision making tool (ahp)Decision making tool (ahp)
Decision making tool (ahp)Amit Jain
 
Credit Card Fraud Detection Using Machine Learning & Data Science
Credit Card Fraud Detection Using Machine Learning & Data ScienceCredit Card Fraud Detection Using Machine Learning & Data Science
Credit Card Fraud Detection Using Machine Learning & Data ScienceIRJET Journal
 
Credit Card Fraud Detection Using Machine Learning & Data Science
Credit Card Fraud Detection Using Machine Learning & Data ScienceCredit Card Fraud Detection Using Machine Learning & Data Science
Credit Card Fraud Detection Using Machine Learning & Data ScienceIRJET Journal
 
Data Con LA 2022 - AutoDC + AutoML = your AI development superpower
Data Con LA 2022 - AutoDC + AutoML = your AI development superpowerData Con LA 2022 - AutoDC + AutoML = your AI development superpower
Data Con LA 2022 - AutoDC + AutoML = your AI development superpowerData Con LA
 
The Analytics Data Store: Information Supply Framework
The Analytics Data Store: Information Supply FrameworkThe Analytics Data Store: Information Supply Framework
The Analytics Data Store: Information Supply FrameworkMartyn Richard Jones
 
Importing Data - The Complete Course in All File Types and Data Tricks to Get...
Importing Data - The Complete Course in All File Types and Data Tricks to Get...Importing Data - The Complete Course in All File Types and Data Tricks to Get...
Importing Data - The Complete Course in All File Types and Data Tricks to Get...Jim Kaplan CIA CFE
 

Similar to Kim Hammar - Allstate Internship Presentation Data Engineering Analytics - Claims CAFe (20)

MATH 533 RANK Achievement Education--math533rank.com
MATH 533 RANK Achievement Education--math533rank.comMATH 533 RANK Achievement Education--math533rank.com
MATH 533 RANK Achievement Education--math533rank.com
 
MATH 533 RANK Redefined Education--math533rank.com
MATH 533 RANK Redefined Education--math533rank.comMATH 533 RANK Redefined Education--math533rank.com
MATH 533 RANK Redefined Education--math533rank.com
 
MATH 533 RANK Lessons in Excellence-- math533rank.com
MATH 533 RANK Lessons in Excellence-- math533rank.comMATH 533 RANK Lessons in Excellence-- math533rank.com
MATH 533 RANK Lessons in Excellence-- math533rank.com
 
Aif360 2018-21 dec-eno_kyam
Aif360 2018-21 dec-eno_kyamAif360 2018-21 dec-eno_kyam
Aif360 2018-21 dec-eno_kyam
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 
MATH 533 Exceptional Education - snaptutorial.com
MATH 533   Exceptional Education - snaptutorial.comMATH 533   Exceptional Education - snaptutorial.com
MATH 533 Exceptional Education - snaptutorial.com
 
Digital Mines ver 2.0 | 7 lessons on automation i learnt leading digital in...
Digital Mines ver 2.0  |  7 lessons on automation i learnt leading digital in...Digital Mines ver 2.0  |  7 lessons on automation i learnt leading digital in...
Digital Mines ver 2.0 | 7 lessons on automation i learnt leading digital in...
 
Using Grid data analytics to protect revenue, reduce network losses and impro...
Using Grid data analytics to protect revenue, reduce network losses and impro...Using Grid data analytics to protect revenue, reduce network losses and impro...
Using Grid data analytics to protect revenue, reduce network losses and impro...
 
MATH 533 Education Specialist / snaptutorial.com
MATH 533 Education Specialist / snaptutorial.comMATH 533 Education Specialist / snaptutorial.com
MATH 533 Education Specialist / snaptutorial.com
 
Microsoft MCSE 70-469 it braindumps
Microsoft MCSE 70-469 it braindumpsMicrosoft MCSE 70-469 it braindumps
Microsoft MCSE 70-469 it braindumps
 
Math 533 Believe Possibilities / snaptutorial.com
Math 533  Believe Possibilities / snaptutorial.comMath 533  Believe Possibilities / snaptutorial.com
Math 533 Believe Possibilities / snaptutorial.com
 
564 complete class
 564 complete class 564 complete class
564 complete class
 
The Case for Graphs in Supply Chains
The Case for Graphs in Supply ChainsThe Case for Graphs in Supply Chains
The Case for Graphs in Supply Chains
 
Decision making tool (ahp)
Decision making tool (ahp)Decision making tool (ahp)
Decision making tool (ahp)
 
Data Mining.ppt
Data Mining.pptData Mining.ppt
Data Mining.ppt
 
Credit Card Fraud Detection Using Machine Learning & Data Science
Credit Card Fraud Detection Using Machine Learning & Data ScienceCredit Card Fraud Detection Using Machine Learning & Data Science
Credit Card Fraud Detection Using Machine Learning & Data Science
 
Credit Card Fraud Detection Using Machine Learning & Data Science
Credit Card Fraud Detection Using Machine Learning & Data ScienceCredit Card Fraud Detection Using Machine Learning & Data Science
Credit Card Fraud Detection Using Machine Learning & Data Science
 
Data Con LA 2022 - AutoDC + AutoML = your AI development superpower
Data Con LA 2022 - AutoDC + AutoML = your AI development superpowerData Con LA 2022 - AutoDC + AutoML = your AI development superpower
Data Con LA 2022 - AutoDC + AutoML = your AI development superpower
 
The Analytics Data Store: Information Supply Framework
The Analytics Data Store: Information Supply FrameworkThe Analytics Data Store: Information Supply Framework
The Analytics Data Store: Information Supply Framework
 
Importing Data - The Complete Course in All File Types and Data Tricks to Get...
Importing Data - The Complete Course in All File Types and Data Tricks to Get...Importing Data - The Complete Course in All File Types and Data Tricks to Get...
Importing Data - The Complete Course in All File Types and Data Tricks to Get...
 

More from Kim Hammar

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesKim Hammar
 
Självlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHSjälvlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHKim Hammar
 
Learning Automated Intrusion Response
Learning Automated Intrusion ResponseLearning Automated Intrusion Response
Learning Automated Intrusion ResponseKim Hammar
 
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlIntrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlKim Hammar
 
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Kim Hammar
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Kim Hammar
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Kim Hammar
 
Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionKim Hammar
 
Digital Twins for Security Automation
Digital Twins for Security AutomationDigital Twins for Security Automation
Digital Twins for Security AutomationKim Hammar
 
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Kim Hammar
 
Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Kim Hammar
 
Intrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingIntrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingKim Hammar
 
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...Kim Hammar
 
Self-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseSelf-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseKim Hammar
 
Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Kim Hammar
 
Learning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingLearning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingKim Hammar
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingKim Hammar
 
Intrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayIntrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayKim Hammar
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Kim Hammar
 
Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Kim Hammar
 

More from Kim Hammar (20)

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jectures
 
Självlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHSjälvlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTH
 
Learning Automated Intrusion Response
Learning Automated Intrusion ResponseLearning Automated Intrusion Response
Learning Automated Intrusion Response
 
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlIntrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
 
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
 
Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via Decomposition
 
Digital Twins for Security Automation
Digital Twins for Security AutomationDigital Twins for Security Automation
Digital Twins for Security Automation
 
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
 
Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.
 
Intrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingIntrusion Response through Optimal Stopping
Intrusion Response through Optimal Stopping
 
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
 
Self-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseSelf-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber Defense
 
Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.
 
Learning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingLearning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal Stopping
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal Stopping
 
Intrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayIntrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-Play
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
 
Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.
 

Recently uploaded

Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 

Recently uploaded (20)

Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 

Kim Hammar - Allstate Internship Presentation Data Engineering Analytics - Claims CAFe

  • 1. Building a Fault-Tolerant ETL Pipeline for Claims CAFé Internship Presentation - Summer 2018 Kim Hammar Data Analytics Engineer khamq@allstate.com or kimham@kth.se August 30, 2018 Kim Hammar (DAE) Claims CAFé August 30, 2018 1 / 24
  • 2. Extract Transform Load Kim Hammar (DAE) Claims CAFé August 30, 2018 2 / 24
  • 3. Extract Transform Load Corrupt data that cannot be parsed Inconsistent Schema Wrong File paths Unexpected null values Duplicates Code bugs Missing values Inconsistent data types Target database unavailable Permission error Scalability problems Kim Hammar (DAE) Claims CAFé August 30, 2018 2 / 24
  • 4. Outline 1 Claims CAFé Background 2 Ensuring Data Quality 3 DEMO 4 Conclusion 5 Questions Kim Hammar (DAE) Claims CAFé August 30, 2018 3 / 24
  • 5. What is Claims CAFé? Claims CAFé: Claims (C)entral (A)nalytical (F)il(e) claimId policy participant persVeh ... B005 auto John Doe URK389 Audi ... B007 home A.Svensson PTO291 Ford ... B003 life C.Strömbäck RNU999 Volvo ... B004 property G.Åsbrink WEM650 Benz ... B002 health L.Löfven KQO209 Tesla ...    Claims A one stop shop for claims analytics. Optimized for analytical use-cases by saving raw denormalized data in hadoop: Increase scalability Makes data processing easier: No more slow and complex SQL-Joins Kim Hammar (DAE) Claims CAFé August 30, 2018 4 / 24
  • 6. What Actually Goes Into Building a Model Data Science & Analytics Data Engineering Building models, reports & Visualizations ETL processes, data transformations & data cleaning Kim Hammar (DAE) Claims CAFé August 30, 2018 5 / 24
  • 7. Now: complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields Predict claim automation: Deep FNN model b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Customer report & dashboards Link Analysis Fraud Detection Customer1 Customer2 Claim1 Claim2 Claimant Participant 3 10 10 10 4 9 4 5 10 11 1 Classify Image Vehicle Loss Deep CNN Model       repairable = 0 total loss = 1 ... no loss = 0       Kim Hammar (DAE) Claims CAFé August 30, 2018 6 / 24
  • 8. Now: complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields Predict claim automation: Deep FNN model b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Customer report & dashboards Link Analysis Fraud Detection Customer1 Customer2 Claim1 Claim2 Claimant Participant 3 10 10 10 4 9 4 5 10 11 1 Classify Image Vehicle Loss Deep CNN Model       repairable = 0 total loss = 1 ... no loss = 0       Future: complex and slow join to pull data transform & clean data create derived fields Claims CAFé b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Customer1 Customer2 Claim1 Claim2 Claimant Participant 3 10 10 10 4 9 4 5 10 11       repairable = 0 total loss = 1 ... no loss = 0       Kim Hammar (DAE) Claims CAFé August 30, 2018 6 / 24
  • 9. What Goes Into Making CAFé Coffee Plants Harvest Processing: pulping the cherries, filtering out bad ones, sorting&drying Prepare for consumption: roasting, grinding, brewing Consume: drink, sell Making Coffee Raw Data Sources ETL X f () Y Processing: fill in null values, create derived fields, data cleaning Prepare for consumption: flatten, create different views, save to Hive x f(x) Sekante P f(x0) x0 Q x0 + ε f(x0 + ε) − f(x0) ε ε f(x0 + ε) Consume: build models, generate business reports Making Claims CAFé Kim Hammar (DAE) Claims CAFé August 30, 2018 7 / 24
  • 10. Claims CAFé Data Architecture Data Sources ADW NextGen . . . Allstate Canada Claims MDM Claims MDM Synchronizes and reconciles data from various sources into a single place Claims CAFé Claims CAFé Serves the need of data scientists. Contains data optimized for analytics Derived fields Data cleaning Data restructuring Machine Learning x f(x) Sekante P f(x0) x0 Q x0 + ε f(x0 + ε) − f(x0) ε ε f(x0 + ε) Analytics [−4, −2) [−2, 0) [0, 2) [2, 4) 0 500 1,000 1,500 2,000 2,500 Reports Number of fraudulent claims during the third quarter: X Kim Hammar (DAE) Claims CAFé August 30, 2018 8 / 24
  • 11. What Is a Claim From A Data Pespective? Claim Claim Details ... Policy Details ... Participants Details ... Vehicle Details ... ... ... Figure: Some of the data that a single Claim in Claims CAFé contains. Kim Hammar (DAE) Claims CAFé August 30, 2018 9 / 24
  • 12. Data Quality Assurance Data Quality is key to the success of Claims CAFé Examples of data quality issues: Null values in the wrong place Duplicates Missing values Inconsistent data types ... Data Quality Issues Data Size Why does my Model not work? John Doe, Data Scientist Kim Hammar (DAE) Claims CAFé August 30, 2018 10 / 24
  • 13. Data Quality Mechanisms Motivation Failures are the norm, they are not an exception Optimistic error estimate: P("CAFé error") = P("Data quality issue")∪P("Hadoop failure")∪P("Network failure")∪ P("ClaimsMDM failure") ∪ P("CAFé code bug") ≈ 1/1000 = 0.001 CAFé stretch goal: near-real-time updates, say we pull data every 15 minutes =⇒ 96 pulls per day =⇒ 672 pulls per week failure probability per week: 0.001 × 672 = 0.672 =⇒ a failure will happen on average every other week We want built-in mechanisms in the CAFé pipeline to detect and deal with errors before they affect end-users: Tests! Kim Hammar (DAE) Claims CAFé August 30, 2018 11 / 24
  • 14. How to detect data quality issues? Know your data . . . Claim 1 ... Claim 2 ... Claim 3 ... Claim 4 ... Claim 5 ... How many claims were recorded in Rhode Island 2016-2017? How many null claims exists in Idaho? What is the average number of claim par- ticipants? What is the average number of covered claimants in a claim? What is the maximum number of vehicles re- lated to a claim? Kim Hammar (DAE) Claims CAFé August 30, 2018 12 / 24
  • 15. Anomaly Detection for Detecting Potential Data Issues (1/3) Spikes in the number of null values indicate a data issue. 06/18/18 06/18/18 06/19/18 06/19/18 06/20/18 06/20/18 06/21/18 06/21/18 Date 0 2000 4000 6000 8000 10000 12000 14000 16000 NullCounts Indication of a data pull issue Kim Hammar (DAE) Claims CAFé August 30, 2018 13 / 24
  • 16. Anomaly Detection for Detecting Potential Data Issues (2/3) 04.2017 05.2017 06.2017 07.2017 08.2017 09.2017 10.2017 11.2017 12.2017 01.2018 02.2018 03.2018 6,000 8,000 10,000 12,000 14,000 Date Count Number of claims in Rhode Island over time Why did this drop happen? Kim Hammar (DAE) Claims CAFé August 30, 2018 14 / 24
  • 17. Anomaly Detection for Detecting Potential Data Issues (3/3) CoverageClaimants PersVeh ClaimsParty 0 20 40 2 1 3 1 4 39 Count Comparing a single claim statistics with the average Average cross entire dataset Actual value for this claim 39 Parties in this claim vs average on 3 is this valid data or a bug? Kim Hammar (DAE) Claims CAFé August 30, 2018 15 / 24
  • 18. Regression Tests Background (1/2) What is regression testing? Regression tests verify modifications of a program or data. If the modification fail the tests, the program can regress back. I just refactored the CAFé pipeline, how do I know that I didn’t break anything? I just pulled new data into CAFé, how do I know I did not introduce data quality issues? Why do we want regression tests? It increases the confidence in making code and data changes We can detect bugs before they bother end-users We can avoid unnecessary work: If the tests fail we can abort early and save time. Kim Hammar (DAE) Claims CAFé August 30, 2018 16 / 24
  • 19. Regression Tests Background (2/2) Naive ETL Pipeline For Updating Claims CAFé: Load & Transform Claims MDM Claims CAFé Kim Hammar (DAE) Claims CAFé August 30, 2018 17 / 24
  • 20. Regression Tests Background (2/2) Naive ETL Pipeline For Updating Claims CAFé: Load & Transform Claims MDM Claims CAFé A More Robust ETL Pipeline: Claims MDM Regression Tests Claims CAFé Tests failed Email maintainers Missing columns:... Null values:... Anomalies:... ... Generate report Abort ETL Load & Transform Tests passed Kim Hammar (DAE) Claims CAFé August 30, 2018 17 / 24
  • 21. Regression Tests Pipeline for CAFé Data Sources Abort Abort Abort Data paths does not exist Invalid schema Data paths exist CAFé Regression Tests Verify data paths of sources Verify schema of sources Verify data quality Count null values and compare with snapshot Count columns and compare with snapshot Verdict ¬∃ duplicates ∧ null increase < x ∧ counts increase < y =⇒ build ∃ duplicates ∧ null increase < x ∧ counts increase < y =⇒ build & email ∃ duplicates ∧ null increase > x ∧ counts increase > y =⇒ abort & email Valid schema Email maintainersSave test reportSave snapshot Kim Hammar (DAE) Claims CAFé August 30, 2018 18 / 24
  • 22. Regression Tests Reporting Test Report: 26/7-2018 Passed 79.0% Failed 21.0% Test Results valid_data_paths: passed, time: 7s valid_source_schema: passed, time: 19s no_duplicates: failed, time: 26s claim_counts: passed, time: 41s claim_null_counts: passed, time: 38s ... claimCount claimNullCount financialsCount financialsNullCount summaryCount summaryNullCountidCount idNullCount claimOccurrenceCount claimOccurrenceNullCount policyCount policyNullCount persVehCount persVehNullCount claimsPartyCount claimsPartyNullCount locationCount locationNullCount Metric 0 500 Counts Snapshot counts 07/23/18 08/06/18 08/20/18 09/03/18 09/17/18 10/01/18 10/15/18 10/29/18 Date 0 20 40 60 FailureCount Build failures over time 07/23/18 08/06/18 08/20/18 09/03/18 09/17/18 10/01/18 10/15/18 10/29/18 Date 0 20 40 60 80 100 Count Claims counts over time claimNullCount claimCount Kim Hammar (DAE) Claims CAFé August 30, 2018 19 / 24
  • 23. Data Quality Mechanisms: Putting It All Together Cluster Daily Data Pull Build → Unit tests → Regression Tests Build → Unit tests → Regression Tests Abort Merge Master Branch Update CAFé Push Commit Trigger Jenkins Job Trigger Jenkins Job Kim Hammar (DAE) Claims CAFé August 30, 2018 20 / 24
  • 24. DEMO Kim Hammar (DAE) Claims CAFé August 30, 2018 21 / 24
  • 25. Conclusion Fault-Tolerant ETL What An ETL pipeline that can detect and deal with data quality errors and bugs Why Avoid errors in production. Make full use of data. More productive analytics. How Know your data. Use tests. Integrate with CI jobs. Lessons Learned Setting verdicts is hard. Without CI the tests will not be run. Security must be in the design. Release management is important. Trade-offs when choosing architecture. Kim Hammar (DAE) Claims CAFé August 30, 2018 22 / 24
  • 26. Thank You! Questions? Kim Hammar (DAE) Claims CAFé August 30, 2018 23 / 24
  • 27. Links Claims CAFé Schema Repository1 Claims CAFé Schema Confluence Page2 Claims CAFé Data Pipeline and Tests Repository3 Claims CAFé Tests Pipeline Confluence Page4 Claims CAFé Continuous Integration Confluence Page5 Claims CAFé API Documentation Confluence Page6 1 https://github.allstate.com/d3-cafe/ClaimsCAFeSchema 2 http://conflu.allstate.com/display/DOMF2/Schema 3 https://github.allstate.com/d3-cafe/CAFe 4 http://conflu.allstate.com/display/DOMF2/Tests 5 http://conflu.allstate.com/display/DOMF2/Continuous+Integration 6 http://conflu.allstate.com/display/DOMF2/API+Documentation Kim Hammar (DAE) Claims CAFé August 30, 2018 24 / 24