>Eastern Bank Data Engineering
How Eastern Bank Uses Big Data to
Better Serve & Protect its Customers
Brian Griffith
Principal Data Engineer
Agenda
• Introduction
• Eastern Bank & the banking industry
– Data architecture and our big data journey
– Challenges
– Use Case:
• Debit card anomaly detection
@bwgriffith
• Database developer and engineer for 15 years
• Working in the “big data” space for about 5
years
– Blizzard Entertainment – Irvine, CA
– Localytics – Boston, MA
• Now @ Eastern Bank, helping engineer their next
generation data platform
b.griffith@easternbank.com
Eastern Bank
• 197-year-old mutual bank (largest of its kind in the
country)
– Leader in corporate social responsibility
– 8th most charitable business in Massachusetts
• ~1 million customers
• 4 Organizations:
– Banking: Eastern Bank
– Insurance: Eastern Insurance Group
– Wealth: Eastern Wealth Management
– R & D and Product Dev: Eastern Labs
Banking is Evolving
• Customer activity moving more into the mobile
space
• Diverse services continuously emerging
• Customers value personalized service
– Relevant value added services
– Personal relationships
Positioned for the Best of Both Worlds
• Like larger banks, leverage data in a manner
that allows us to offer improved features and
convenience
• Like smaller banks, leverage data in a manner
that allows us to offer more customized
services and relationships
Past Data Architecture Issues
• Customer data lives in transaction “silos”
– 3 Major data entities: Insurance, wealth, and
banking
– Data access via in-house or out-sourced solution
– Impedes analysis
• Regulatory compliance
– Technical Debt
– Auditing
– 3rd party dependencies
Data Architecture Goals
• Abstraction from source systems
• Scale horizontally, not vertically
• Complete ownership of depth and breadth of
our data
• Improve data quality and stewardship
• Drive iterative analytics throughout the
enterprise
• “Make the bank smarter”
Data Architecture
[Diagram: data architecture tiers: Tx, Data Warehouse, Customer Master, Big Data Store]
• Eastern endeavors to be relationship-driven, not
transaction-driven. In a digital economy, face-to-face
interactions continue to decline. We need
to rely on data integration and analytics to know
our customers and best meet their evolving needs
• Our Data Architecture is built on four
interdependent “tiers”, each with its own
capabilities and contributions to the overall
enterprise platform
Hadoop
• Can be a significant driver of customer
intimacy in an increasingly digital world
• Allows us to leverage data we’ve never
thought of as “Customer Data” before
• Goes beyond what a customer has with us –
gives visibility into what a customer does with
us through behavioral analytics
• Scales ability to store with ability to process
• Platform natively supports data analytics
languages and machine learning tools
• Fast processing enables iterative exploration
Architecture Diagram
Big Data Challenges
Challenges
• Governance!
– Ingestion
– Data Lineage
– Data Quality
– Managing growth
• Balancing what data we “can” keep vs data we “should” keep
• Security
– Personally Identifiable Information (PII)
– Mask and limit view of data
• Driving Consumption
– “If you build it, they will come” does not work by itself
– Constant evangelism
– Need to demonstrate value!
Data Science
Hadoop Data Science
Fraud Detection Proof of Concept
Fraud in the Financial Industry
An Introduction
• In 2012, there were 31.1 million fraudulent
transactions, with a value of $6.1 billion [1]
[1] The 2013 Federal Reserve Payments Study
Debit Card Fraud
• Industry-wide debit card fraud has been rising
at a significant rate
• > 400% in the last 3 years!
• Mostly due to breaches at large, national
retailers
Use Case Generation
• Develop process to work in conjunction with
existing fraud detection tools
– Existing tools are mostly rules-based
• Leverage Hadoop to traverse broad customer
history for anomalous patterns
– Behavioral analysis
Fraud Use Case Workflow
• DATA: sample transactions & claims to build training data
• FEATURES: identify account behavior patterns indicative of fraud
• TRAINING: scoring model will identify suspicious accounts the day after fraud happens
• TESTING: testing and validating features iteratively
Data
• Claims – Customer
reported
• Only use customer’s
first claim
• Model trained on all
available transaction
data
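The labeling rule above (customer-reported claims, first claim only) could be sketched as follows. This is a minimal illustration, not the bank's pipeline: the record layout, the account IDs, and the assumption that a claim date marks the fraud day are all hypothetical.

```python
from datetime import date

# Hypothetical claim records: (account_id, date the customer reported fraud).
claims = [
    ("A1", date(2015, 4, 30)),
    ("A1", date(2015, 6, 2)),   # a later claim on the same account, ignored
    ("B7", date(2015, 5, 11)),
]


def first_claims(claims):
    """Keep only the earliest claim date per account."""
    earliest = {}
    for account, claim_date in claims:
        if account not in earliest or claim_date < earliest[account]:
            earliest[account] = claim_date
    return earliest


def label_account_days(account_days, claims):
    """Label each (account, day) pair 1 if it matches the account's first claim, else 0.

    Assumption: the claim date is treated as the fraud day.
    """
    first = first_claims(claims)
    return [(acct, day, int(first.get(acct) == day)) for acct, day in account_days]


# Hypothetical account-days drawn from transaction history.
account_days = [
    ("A1", date(2015, 4, 30)),
    ("A1", date(2015, 6, 2)),
    ("C3", date(2015, 4, 30)),
]
labeled = label_account_days(account_days, claims)
```

Only the first A1 row is labeled as fraud here; the later A1 claim and the account with no claim are labeled 0.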
Features
• Variables indicative of fraud, formatted for
machine learning
• Example: dollarRatio = ratio of dollar spend today vs. history (hx)
• Values calculated by comparing variables
today vs history
– Ratios, log(n), binary, etc…
• Higher value = more suspicious
• Hadoop performance
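A minimal sketch of the feature types listed above (ratio, log, binary), each comparing today against history. Only the name dollarRatio comes from the slides; the function names, the use of a historical daily average as the baseline, the log(1 + n) form, and the new-merchant flag are assumptions made for illustration.

```python
import math


def dollar_ratio(today_spend, history_daily_spend):
    """Ratio of today's spend to the account's average historical daily spend.

    Assumption: the baseline is a simple mean; an account with no history
    (or zero average spend) yields 0.0 rather than a division error.
    """
    if not history_daily_spend:
        return 0.0
    average = sum(history_daily_spend) / len(history_daily_spend)
    if average == 0:
        return 0.0
    return today_spend / average


def log_count(n):
    """Log-scaled count; log(1 + n) is used so that n = 0 is defined."""
    return math.log(1 + n)


def new_merchant_flag(today_merchants, history_merchants):
    """Binary feature: 1 if any merchant today is absent from history."""
    return int(any(m not in history_merchants for m in today_merchants))


# Hypothetical account: $200 spent today against a $20/day history.
ratio = dollar_ratio(200.0, [10.0, 30.0])
```

All three are oriented so that a higher value means more suspicious, matching the convention stated on the slide.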
Building and Evaluating the Model
[Figure: ROC for TestModel. x-axis: Total Accounts (0% to 100%); y-axis: Fraud Detection Rate (0% to 100%); series: training, testing, reference]
• Receiver operating characteristic shows model tuning.
• Reviewing 20% of accounts finds ~80% of anomalies.
• Reference line shows predicted result of random sample.

Feature      Weight  Std Error  Z       p(>|Z|)
(Intercept)  -3.44   0.051      -66.93  < 2e-16
dollarRatio   0.09   0.007       11.75  < 2e-16

[Figure: False Positive Rate for TestModel. x-axis: Fraud Detection Rate (0% to 100%); y-axis: False Positive Ratio (0 to 140); series: testing]
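The first chart plots the share of fraud found against the share of accounts reviewed. A curve of that kind could be computed as sketched below, assuming accounts are reviewed in descending score order; the function name and the toy scores and labels are hypothetical and do not reproduce the deck's results.

```python
def detection_curve(scores, labels, fractions=(0.1, 0.2, 0.5, 1.0)):
    """Share of fraud accounts found when reviewing the top fraction by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_fraud = sum(labels)
    if total_fraud == 0:
        raise ValueError("at least one fraud label is required")
    curve = {}
    for fraction in fractions:
        reviewed = round(fraction * len(ranked))  # number of accounts reviewed
        found = sum(label for _, label in ranked[:reviewed])
        curve[fraction] = found / total_fraud
    return curve


# Toy data: ten accounts, four of them fraudulent (label 1).
scores = [0.99, 0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05, 0.01]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]
curve = detection_curve(scores, labels)
```

A random ordering would find fraud in proportion to the accounts reviewed, which is what the reference line in the chart represents.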
Scoring
• How anomalous were a day’s transactions?
– Value range: 0.00 – 1.00
– Comparing a day to customer’s history
• Assigned to each unique account
• Function of weights & feature values
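One way to turn weights and feature values into a score between 0 and 1 is a logistic function. The slides do not name the algorithm, so the logistic form is an assumption here, suggested by the shape of the TestModel coefficient table. The two weights are taken from that table; the function name is hypothetical.

```python
import math

# Weights from the TestModel coefficient table on the slides.
WEIGHTS = {"(Intercept)": -3.44, "dollarRatio": 0.09}


def score(features, weights=WEIGHTS):
    """Logistic score in (0, 1): sigmoid(intercept + sum of weight * feature value).

    Assumption: the model is a logistic regression; the deck only says the
    score is a function of weights and feature values.
    """
    z = weights["(Intercept)"]
    for name, value in features.items():
        z += weights[name] * value
    return 1.0 / (1.0 + math.exp(-z))


# An account-day spending at its usual level vs. one far above it.
baseline = score({"dollarRatio": 1.0})
extreme = score({"dollarRatio": 237.747})
```

Under this assumed form, a day that matches history scores close to 0, while a dollarRatio in the hundreds pushes the score close to 1.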
Results & Testing
ACCOUNT   Score   Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
xxxxxxxx  1       0.693      0.105      0.105      0.105      0.105      237.747
xxxxxxxx  0.9997  0.693      0.713      0.316      1.379      0.036      129.467
xxxxxxxx  0.9994  0.693      0.486      4.847      169.688    35.87      0.29
xxxxxxxx  0.9979  0          14.844     3.088      52.461     41.066     1
xxxxxxxx  0.9803  0.693      0.356      0.421      0.224      0.817      86.446
dollarRatio = Feature 6
ACCOUNT   Score   Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
xxxxxxxx  1       0.693      0.105      0.105      0.105      0.105      237.747

Merchant      Amount     Timestamp
JETBLUE AIRW  $2,142.00  4/30/15 9:35 AM
ACCOUNT   Score   Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
xxxxxxxx  0.9979  0          14.844     3.088      52.461     41.066     1

Merchant         Amount  Timestamp
Internet Vendor  $12.25  4/30/15 3:42 AM
Internet Vendor  $3.01   4/30/15 3:42 AM
Internet Vendor  $2.46   4/30/15 3:42 AM
Internet Vendor  $1.49   4/30/15 3:42 AM
Internet Vendor  $18.95  4/30/15 3:42 AM
Iterating
• Build new features
• Remove ineffective features
• Address feature interaction
• Minimize False Positives
• Try Different Algorithms
Next Steps
• Real time with Spark & MLlib
– Get closer to when fraud actually occurs
• Expanded customer reach via notifications
– Improved customer service
• More agile feedback loop based on customer
assessment
Other Uses
• Comparing customer behaviors day over day
carries over to many use cases:
– Predicting churn
– Customer segmentation & personas
– Predicting Customer Lifetime Value (CLV)
Wrap up
• Banking is evolving
• Hadoop addresses a very large gap in our
architecture
• Empowers us to know more about our customers
through all of their interactions with us
• Needs to be governed
• Customer fraud detection only the tip of the
iceberg
Special Thanks
• Mark Leonard (Eastern Bank) – SVP, Data &
Development Director
• Joe Blue (MapR) – Data Scientist
Thank You!
36

More Related Content

What's hot

Big Data Analytics in light of Financial Industry
Big Data Analytics in light of Financial Industry Big Data Analytics in light of Financial Industry
Big Data Analytics in light of Financial Industry
Capgemini
 

What's hot (20)

Big Data & Analytics perspectives in Banking
Big Data & Analytics perspectives in BankingBig Data & Analytics perspectives in Banking
Big Data & Analytics perspectives in Banking
 
BI & Big data use case for banking - by rully feranata
BI & Big data use case for banking - by rully feranataBI & Big data use case for banking - by rully feranata
BI & Big data use case for banking - by rully feranata
 
IBM Banking videocast - 3/20/2013
IBM Banking videocast - 3/20/2013 IBM Banking videocast - 3/20/2013
IBM Banking videocast - 3/20/2013
 
Big Data Banking: Customer vs. Accounting
Big Data Banking: Customer vs. AccountingBig Data Banking: Customer vs. Accounting
Big Data Banking: Customer vs. Accounting
 
Big Data Analytics for Banking, a Point of View
Big Data Analytics for Banking, a Point of ViewBig Data Analytics for Banking, a Point of View
Big Data Analytics for Banking, a Point of View
 
Pi cube banking on predictive analytics151
Pi cube   banking on predictive analytics151Pi cube   banking on predictive analytics151
Pi cube banking on predictive analytics151
 
Big Data: Banking Industry Use Case
Big Data: Banking Industry Use Case Big Data: Banking Industry Use Case
Big Data: Banking Industry Use Case
 
Big Data en Retail
Big Data en RetailBig Data en Retail
Big Data en Retail
 
TechConnex Big Data Series - Big Data in Banking
TechConnex Big Data Series - Big Data in BankingTechConnex Big Data Series - Big Data in Banking
TechConnex Big Data Series - Big Data in Banking
 
How analytics will transform banking in luxembourg
How analytics will transform banking in luxembourgHow analytics will transform banking in luxembourg
How analytics will transform banking in luxembourg
 
BigData in Banking
BigData in BankingBigData in Banking
BigData in Banking
 
Big data &amp; analytics for banking new york lars hamberg
Big data &amp; analytics for banking new york   lars hambergBig data &amp; analytics for banking new york   lars hamberg
Big data &amp; analytics for banking new york lars hamberg
 
Big data in telecom
Big data in telecomBig data in telecom
Big data in telecom
 
Banking Big Data Analytics
Banking Big Data AnalyticsBanking Big Data Analytics
Banking Big Data Analytics
 
Big Data Analytics in light of Financial Industry
Big Data Analytics in light of Financial Industry Big Data Analytics in light of Financial Industry
Big Data Analytics in light of Financial Industry
 
Data-driven Banking: Managing the Digital Transformation
Data-driven Banking: Managing the Digital TransformationData-driven Banking: Managing the Digital Transformation
Data-driven Banking: Managing the Digital Transformation
 
How advanced analytics is impacting the banking sector
How advanced analytics is impacting the banking sectorHow advanced analytics is impacting the banking sector
How advanced analytics is impacting the banking sector
 
Business Intelligence and Analytics in Banking
Business Intelligence and Analytics in BankingBusiness Intelligence and Analytics in Banking
Business Intelligence and Analytics in Banking
 
Analytics driving innovation and efficiency in Banking
Analytics driving innovation and efficiency in BankingAnalytics driving innovation and efficiency in Banking
Analytics driving innovation and efficiency in Banking
 
M&A Trends in Telco Analytics
M&A Trends in Telco AnalyticsM&A Trends in Telco Analytics
M&A Trends in Telco Analytics
 

Viewers also liked

DOES LEARNING ABOUT THE BRAIN HELP BETTER MANAGE UNCERTAINTY?
DOES LEARNING ABOUT THE BRAIN HELP BETTER MANAGE UNCERTAINTY?DOES LEARNING ABOUT THE BRAIN HELP BETTER MANAGE UNCERTAINTY?
DOES LEARNING ABOUT THE BRAIN HELP BETTER MANAGE UNCERTAINTY?
François Bogacz
 
Misty Gilbert Final Portfolio 2013 Information Technology King University
Misty Gilbert Final Portfolio 2013 Information Technology King UniversityMisty Gilbert Final Portfolio 2013 Information Technology King University
Misty Gilbert Final Portfolio 2013 Information Technology King University
Misty Gilbert
 

Viewers also liked (20)

Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
 
Guidelines for making project...........
Guidelines for making project...........Guidelines for making project...........
Guidelines for making project...........
 
IBM Health Innovation Forum 2013 - Smarter Healthcare
IBM Health Innovation Forum 2013 - Smarter HealthcareIBM Health Innovation Forum 2013 - Smarter Healthcare
IBM Health Innovation Forum 2013 - Smarter Healthcare
 
Mantenimiento del pc
Mantenimiento del pcMantenimiento del pc
Mantenimiento del pc
 
Edwards Signaling SA-ETH Data Sheet
Edwards Signaling SA-ETH Data SheetEdwards Signaling SA-ETH Data Sheet
Edwards Signaling SA-ETH Data Sheet
 
La menopausia y las hormonas
La menopausia y las hormonasLa menopausia y las hormonas
La menopausia y las hormonas
 
DOES LEARNING ABOUT THE BRAIN HELP BETTER MANAGE UNCERTAINTY?
DOES LEARNING ABOUT THE BRAIN HELP BETTER MANAGE UNCERTAINTY?DOES LEARNING ABOUT THE BRAIN HELP BETTER MANAGE UNCERTAINTY?
DOES LEARNING ABOUT THE BRAIN HELP BETTER MANAGE UNCERTAINTY?
 
Podium de los equipos locales en el endurance series en el club slot elda
Podium de los equipos locales en el endurance series en el club slot eldaPodium de los equipos locales en el endurance series en el club slot elda
Podium de los equipos locales en el endurance series en el club slot elda
 
1. anorexia
1. anorexia1. anorexia
1. anorexia
 
Misty Gilbert Final Portfolio 2013 Information Technology King University
Misty Gilbert Final Portfolio 2013 Information Technology King UniversityMisty Gilbert Final Portfolio 2013 Information Technology King University
Misty Gilbert Final Portfolio 2013 Information Technology King University
 
Is The Question “For Whom Did You Vote?” Relevant?
Is The Question “For Whom Did You Vote?” Relevant?Is The Question “For Whom Did You Vote?” Relevant?
Is The Question “For Whom Did You Vote?” Relevant?
 
Spanish Inheritance Tax
Spanish Inheritance TaxSpanish Inheritance Tax
Spanish Inheritance Tax
 
2014_HMDA
2014_HMDA2014_HMDA
2014_HMDA
 
Automation of reporting process
Automation of reporting processAutomation of reporting process
Automation of reporting process
 
Reducing Time Spent On Requirements
Reducing Time Spent On RequirementsReducing Time Spent On Requirements
Reducing Time Spent On Requirements
 
Presentacion cetm2011
Presentacion cetm2011Presentacion cetm2011
Presentacion cetm2011
 
Hunter Business Group Overview
Hunter Business Group OverviewHunter Business Group Overview
Hunter Business Group Overview
 
Portafolio
 Portafolio  Portafolio
Portafolio
 
Hostel SS - Karina Rosario
Hostel SS - Karina RosarioHostel SS - Karina Rosario
Hostel SS - Karina Rosario
 
Preguntas de investigación universitaria
Preguntas de investigación universitariaPreguntas de investigación universitaria
Preguntas de investigación universitaria
 

Similar to How Eastern Bank Uses Big Data to Better Serve and Protect its Customers

2016 DSG Webinar Azure HDInsight 2 V4
2016 DSG Webinar Azure HDInsight 2 V42016 DSG Webinar Azure HDInsight 2 V4
2016 DSG Webinar Azure HDInsight 2 V4
Janani Eshwaran
 
2016 DSG Webinar Azure HDInsight 2 V4
2016 DSG Webinar Azure HDInsight 2 V42016 DSG Webinar Azure HDInsight 2 V4
2016 DSG Webinar Azure HDInsight 2 V4
Janani Eshwaran
 
Smarter analytics101 v2.0.1
Smarter analytics101 v2.0.1Smarter analytics101 v2.0.1
Smarter analytics101 v2.0.1
Jenawahl
 
Erfolgreicher agieren mit Analytics_Markus Barmettler_IBM Symposium 2013
Erfolgreicher agieren mit Analytics_Markus Barmettler_IBM Symposium 2013Erfolgreicher agieren mit Analytics_Markus Barmettler_IBM Symposium 2013
Erfolgreicher agieren mit Analytics_Markus Barmettler_IBM Symposium 2013
IBM Switzerland
 
Desai_edinburgh2001
Desai_edinburgh2001Desai_edinburgh2001
Desai_edinburgh2001
Vijay Desai
 

Similar to How Eastern Bank Uses Big Data to Better Serve and Protect its Customers (20)

Mindfull - The Power of Predictive
Mindfull - The Power of PredictiveMindfull - The Power of Predictive
Mindfull - The Power of Predictive
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015
 
Big Data solution for multi-national Bank
Big Data solution for multi-national BankBig Data solution for multi-national Bank
Big Data solution for multi-national Bank
 
DATA BI: put key insights at the finger tip of decision makers.
DATA BI: put key insights at the finger tip of decision makers.DATA BI: put key insights at the finger tip of decision makers.
DATA BI: put key insights at the finger tip of decision makers.
 
2016 DSG Webinar Azure HDInsight 2 V4
2016 DSG Webinar Azure HDInsight 2 V42016 DSG Webinar Azure HDInsight 2 V4
2016 DSG Webinar Azure HDInsight 2 V4
 
2016 DSG Webinar Azure HDInsight 2 V4
2016 DSG Webinar Azure HDInsight 2 V42016 DSG Webinar Azure HDInsight 2 V4
2016 DSG Webinar Azure HDInsight 2 V4
 
Deteo. Data science, Big Data expertise
Deteo. Data science, Big Data expertise Deteo. Data science, Big Data expertise
Deteo. Data science, Big Data expertise
 
Adopting Analytics for decision making in a bank
Adopting Analytics for decision making in a bankAdopting Analytics for decision making in a bank
Adopting Analytics for decision making in a bank
 
Adopting Analytics in BFSI
Adopting Analytics in BFSIAdopting Analytics in BFSI
Adopting Analytics in BFSI
 
Big data
Big dataBig data
Big data
 
Smarter analytics101 v2.0.1
Smarter analytics101 v2.0.1Smarter analytics101 v2.0.1
Smarter analytics101 v2.0.1
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop
 
NZS-4555 - IT Analytics Keynote - IT Analytics for the Enterprise
NZS-4555 - IT Analytics Keynote - IT Analytics for the EnterpriseNZS-4555 - IT Analytics Keynote - IT Analytics for the Enterprise
NZS-4555 - IT Analytics Keynote - IT Analytics for the Enterprise
 
Erfolgreicher agieren mit Analytics_Markus Barmettler_IBM Symposium 2013
Erfolgreicher agieren mit Analytics_Markus Barmettler_IBM Symposium 2013Erfolgreicher agieren mit Analytics_Markus Barmettler_IBM Symposium 2013
Erfolgreicher agieren mit Analytics_Markus Barmettler_IBM Symposium 2013
 
Next Generation Fraud Solutions using Neo4j
Next Generation Fraud Solutions using Neo4jNext Generation Fraud Solutions using Neo4j
Next Generation Fraud Solutions using Neo4j
 
Analytics Service Framework
Analytics Service Framework Analytics Service Framework
Analytics Service Framework
 
Neo4j the Anti Crime Database
Neo4j the Anti Crime DatabaseNeo4j the Anti Crime Database
Neo4j the Anti Crime Database
 
EVOLVING PATTERNS IN BIG DATA - NEIL AVERY
EVOLVING PATTERNS IN BIG DATA - NEIL AVERYEVOLVING PATTERNS IN BIG DATA - NEIL AVERY
EVOLVING PATTERNS IN BIG DATA - NEIL AVERY
 
Inspire2015 Bank of America Merrill Lynch
Inspire2015 Bank of America Merrill LynchInspire2015 Bank of America Merrill Lynch
Inspire2015 Bank of America Merrill Lynch
 
Desai_edinburgh2001
Desai_edinburgh2001Desai_edinburgh2001
Desai_edinburgh2001
 

Recently uploaded

Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
SayantanBiswas37
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 

Recently uploaded (20)

Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 

How Eastern Bank Uses Big Data to Better Serve and Protect its Customers

  • 1. >Eastern Bank Data Engineering How Eastern Bank Uses Big Data to Better Serve & Protect its Customers Brian Griffith Principal Data Engineer
  • 2. >Eastern Bank Data Engineering Agenda • Introduction • Eastern Bank & the banking industry – Data architecture and our big data journey – Challenges – Use Case: • Debit card anomaly detection 2
  • 3. >Eastern Bank Data Engineering @bwgriffith • Database developer and engineer for 15 years • Working in the “big data” space for about 5 years – Blizzard Entertainment – Irvine, CA – Localytics – Boston, MA • Now @ Eastern Bank, helping engineer their next generation data platform b.griffith@easternbank.com 3
  • 4. >Eastern Bank Data Engineering Eastern Bank • 197 year old mutual bank (largest of its kind in the country) – Leader in corporate social responsibility – 8th most charitable business in Massachusetts • ~1 Million customers • 4 Organizations: – Banking: Eastern Bank – Insurance: Eastern Insurance Group – Wealth: Eastern Wealth Management – R & D and Product Dev: Eastern Labs 4
  • 5. >Eastern Bank Data Engineering Banking is Evolving • Customer activity moving more into the mobile space • Diverse services continuously emerging • Customers value personalized service – Relevant value added services – Personal relationships 5
  • 6. >Eastern Bank Data Engineering Positioned for the Best of Both Worlds • Like larger banks, leverage data in a manner that allows us to offer improved features and convenience • Like smaller banks, leverage data in a manner that allows us to offer more customized services and relationships 6
  • 7. >Eastern Bank Data Engineering 7
  • 8. >Eastern Bank Data Engineering Past Data Architecture Issues • Customer data lives in transaction “silos” – 3 Major data entities: Insurance, wealth, and banking – Data access via in-house or out-sourced solution – Impedes analysis • Regulatory compliance – Technical Debt – Auditing – 3rd party dependencies 8
  • 9. >Eastern Bank Data Engineering Data Architecture Goals • Abstraction from source systems • Scale horizontally, not vertically • Complete ownership of depth and breadth of our data • Improve data quality and stewardship • Drive iterative analytics throughout the enterprise • “Make the bank smarter” 9
  • 10. >Eastern Bank Data Engineering Data Architecture 10 Tx Data Warehouse Customer Master Big Data Store • Eastern endeavors to be relationship-driven, not transaction driven. In a digital economy, face to face interactions continue to decline. We need to rely on data integration and analytics to know our customers to best meet their evolving needs • Our Data Architecture is built on four interdependent “tiers” each with its own capabilities and contributions to the overall enterprise platform
  • 11. >Eastern Bank Data Engineering 11 Hadoop Tx Data Warehouse Customer Master Big Data Store • Can be a significant driver of customer intimacy in an increasingly digital world • Allows us to leverage data we’ve never thought of as “Customer Data” before • Goes beyond what a customer has with us – gives visibility into what a customer does with us through behavioral analytics • Scales ability to store with ability to process • Platform natively supports data analytics languages and machine learning tools • Fast processing enables iterative exploration
  • 12. >Eastern Bank Data Engineering Architecture Diagram 12
  • 13. >Eastern Bank Data Engineering Big Data Challenges 13
  • 14. >Eastern Bank Data Engineering Challenges • Governance! – Ingestion – Data Lineage – Data Quality – Managing growth • Balancing what data we “can” keep vs data we “should” keep • Security – Personally Identifiable Information (PII) – Mask and limit view of data • Driving Consumption – “If you build it, they will come” does not work by itself – Constant evangelism – Need to demonstrate value! 14
  • 15. >Eastern Bank Data Engineering Data Science 15
  • 16. >Eastern Bank Data Engineering Hadoop Data Science Fraud Detection Proof of Concept
  • 17. >Eastern Bank Data Engineering Fraud in the Financial Industry: An Introduction • In 2012, there were 31.1 million fraudulent transactions, with a value of $6.1 billion (source: The 2013 Federal Reserve Payments Study) 17
  • 18. >Eastern Bank Data Engineering Debit Card Fraud • Industry wide debit card fraud has been rising at a significant rate • > 400% in the last 3 years! • Mostly due to breaches at large, national retailers 18
  • 19. >Eastern Bank Data Engineering Use Case Generation • Develop process to work in conjunction with existing fraud detection tools – Existing tools mostly rules based • Leverage Hadoop to traverse broad customer history for anomalous patterns – Behavioral analysis 19
  • 20. >Eastern Bank Data Engineering Fraud Use Case Workflow • DATA: sample trans & claims to build training data • FEATURES: identify account behavior patterns indicative of fraud • TRAINING: scoring model will identify suspicious accounts the day after fraud happens • TESTING: testing and validating features iteratively
  • 21. >Eastern Bank Data Engineering Data • Claims – Customer reported • Only use customer’s first claim • Model trained on all available transaction data 21
  • 22. >Eastern Bank Data Engineering Features • Variables indicative of fraud, formatted for machine learning • Example: dollarRatio = Ratio of dollar spend today vs history • Values calculated by comparing variables today vs history – Ratios, log(n), binary, etc. • Higher value = more suspicious • Hadoop performance 22
  • 23. >Eastern Bank Data Engineering Building and Evaluating the Model
      [Chart: ROC for TestModel; x-axis: Total Accounts (0% to 100%); y-axis: Fraud Detection Rate (0% to 100%); series: training, testing, reference]
      Receiver operating characteristic shows model tuning. Reviewing 20% of accounts finds ~80% of anomalies. Reference line shows predicted result of random sample.
      Feature      Weight  Std Error  Z       p(>|Z|)
      (Intercept)  -3.44   0.051      -66.93  < 2e-16
      dollarRatio  0.09    0.007      11.75   < 2e-16
      [Chart: False Positive Rate for TestModel; x-axis: Fraud Detection Rate (0% to 100%); y-axis: False Positive Ratio (0 to 140); series: testing]
  • 24. >Eastern Bank Data Engineering Scoring • How anomalous were a day’s transactions – Value range: 0.00 – 1.00 – Comparing a day to customer’s history • Assigned to each unique account • Function of weights & feature values 24
  • 25. >Eastern Bank Data Engineering 25
  • 26. >Eastern Bank Data Engineering Results & Testing
      ACCOUNT   Score   Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
      xxxxxxxx  1       0.693      0.105      0.105      0.105      0.105      237.747
      xxxxxxxx  0.9997  0.693      0.713      0.316      1.379      0.036      129.467
      xxxxxxxx  0.9994  0.693      0.486      4.847      169.688    35.87      0.29
      xxxxxxxx  0.9979  0          14.844     3.088      52.461     41.066     1
      xxxxxxxx  0.9803  0.693      0.356      0.421      0.224      0.817      86.446
  • 27. >Eastern Bank Data Engineering Results & Testing
      ACCOUNT   Score   Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
      xxxxxxxx  1       0.693      0.105      0.105      0.105      0.105      237.747
      xxxxxxxx  0.9997  0.693      0.713      0.316      1.379      0.036      129.467
      xxxxxxxx  0.9994  0.693      0.486      4.847      169.688    35.87      0.29
      xxxxxxxx  0.9979  0          14.844     3.088      52.461     41.066     1
      xxxxxxxx  0.9803  0.693      0.356      0.421      0.224      0.817      86.446
      dollarRatio = Feature 6
  • 28. >Eastern Bank Data Engineering Results & Testing
      ACCOUNT   Score   Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
      xxxxxxxx  1       0.693      0.105      0.105      0.105      0.105      237.747

      Merchant      Amount     Timestamp
      JETBLUE AIRW  $2,142.00  4/30/15 9:35 AM
  • 29. >Eastern Bank Data Engineering Results & Testing
      ACCOUNT   Score   Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
      xxxxxxxx  1       0.693      0.105      0.105      0.105      0.105      237.747
      xxxxxxxx  0.9997  0.693      0.713      0.316      1.379      0.036      129.467
      xxxxxxxx  0.9994  0.693      0.486      4.847      169.688    35.87      0.29
      xxxxxxxx  0.9979  0          14.844     3.088      52.461     41.066     1
      xxxxxxxx  0.9803  0.693      0.356      0.421      0.224      0.817      86.446
  • 30. >Eastern Bank Data Engineering Results & Testing
      ACCOUNT   Score   Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
      xxxxxxxx  0.9979  0          14.844     3.088      52.461     41.066     1

      Merchant         Amount  Timestamp
      Internet Vendor  $12.25  4/30/15 3:42 AM
      Internet Vendor  $3.01   4/30/15 3:42 AM
      Internet Vendor  $2.46   4/30/15 3:42 AM
      Internet Vendor  $1.49   4/30/15 3:42 AM
      Internet Vendor  $18.95  4/30/15 3:42 AM
  • 31. >Eastern Bank Data Engineering Iterating • Build new features • Remove ineffective features • Address feature interaction • Minimize False Positives • Try Different Algorithms
  • 32. >Eastern Bank Data Engineering Next Steps • Real time w/ Spark & MLlib – Get closer to when fraud actually occurs • Expanded customer reach via notifications – Improved customer service • More agile feedback loop based on customer assessment 32
  • 33. >Eastern Bank Data Engineering Other Uses • Comparing customer behaviors day over day carries over to many use cases: – Predicting churn – Customer segmentation & personas – Predicting Customer Lifetime Value (CLV) 33
  • 34. >Eastern Bank Data Engineering Wrap up • Banking is evolving • Hadoop addresses a very large gap in our architecture • Empowers us to know more about our customers through all of their interactions with us • Needs to be governed • Customer fraud detection only the tip of the iceberg 34
  • 35. >Eastern Bank Data Engineering Special Thanks • Mark Leonard (Eastern Bank) – SVP, Data & Development Director • Joe Blue (MapR) – Data Scientist 35
  • 36. >Eastern Bank Data Engineering Thank You! 36
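
The feature and scoring slides (22 to 24) can be illustrated with a small sketch. It is not the bank's actual model: it assumes dollarRatio is today's spend divided by the account's average daily spend on prior active days, and it plugs in only the two coefficients shown on slide 23 (intercept -3.44, dollarRatio weight 0.09) into a logistic function, the model type named in the speaker notes. The function names, the handling of accounts with no history, and the sample transactions are all hypothetical.

```python
import math
from collections import defaultdict
from datetime import date

# Coefficients as printed on slide 23; the other five features are omitted here.
INTERCEPT = -3.44
DOLLAR_RATIO_WEIGHT = 0.09


def daily_totals(transactions):
    """Sum transaction amounts per calendar day."""
    totals = defaultdict(float)
    for day, amount in transactions:
        totals[day] += amount
    return dict(totals)


def dollar_ratio(transactions, scoring_day):
    """Today's spend divided by the mean daily spend on earlier active days.

    Assumption: an account with no prior history gets 0.0 rather than an
    unbounded ratio.
    """
    totals = daily_totals(transactions)
    today = totals.get(scoring_day, 0.0)
    history = [amt for day, amt in totals.items() if day < scoring_day]
    if not history:
        return 0.0
    baseline = sum(history) / len(history)
    return today / baseline if baseline > 0 else 0.0


def anomaly_score(ratio):
    """Logistic function of the weighted feature; always between 0 and 1."""
    z = INTERCEPT + DOLLAR_RATIO_WEIGHT * ratio
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical account: small daily spend, then one large purchase.
    txns = [
        (date(2015, 4, 27), 10.00),
        (date(2015, 4, 28), 8.00),
        (date(2015, 4, 29), 9.00),
        (date(2015, 4, 30), 2142.00),
    ]
    ratio = dollar_ratio(txns, date(2015, 4, 30))
    print(ratio, anomaly_score(ratio))
```

With a single ratio feature, one large legitimate purchase pushes the score close to 1, which is the false-positive pattern shown on slide 28.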

Editor's Notes

  1. In this presentation we will be talking about Eastern Bank’s journey with Hadoop. This journey is relatively young, as we have only had Hadoop under our roof for about 6 months, but in that time we have done some interesting things, as well as learned some valuable lessons. In this talk I will begin by reviewing the data challenges we face in the banking industry. I’ll then discuss how Hadoop fits into our overall data architecture to help us realize our overall data strategy. Finally, I will dive into a few Hadoop-centric use cases revolving around debit card anomaly detection.
  2. I've been working in databases my whole career. I started out developing small-scale OLTP and reporting databases, then transitioned to more data warehousing (star schema). I then moved out west to Blizzard and helped engineer and build out their first Hadoop deployment. After that I decided to move back east and worked at a local startup, Localytics, on their broad AWS system. I was then presented with an opportunity at Eastern Bank to help build out a completely new data architecture from the ground up.
  3. Being mutual means we are not publicly traded, as our shareholders are our customers. Eastern Bank actually consists of multiple entities: EIG (Insurance), EWM (Wealth Management), and Labs (R & D).
  4. The banking industry, as a whole, is changing. Customer activity is moving more and more into the online and mobile space. Recent marketing research has shown that customers are prioritizing the availability of web and mobile services, even if they do not intend to use them. As a result, new financial services are emerging in the marketplace. Even with all of this talk of the digital space, customer studies continue to show that customers value a personal relationship above all else. Maintaining this relationship gets more difficult with size.
  5. EB finds itself in a unique position. Big banks are all extolling the virtues of big data. Smaller banks can’t compete there, so they are focusing on developing a 360-degree view of the customer. We’re uniquely positioned to do both: we have the skills and the scale to leverage big data, and we’re still at a size where we know our customers well enough to leverage a 360 view of their relationship with us. Targeted products and improved features.
  6. Our old data architecture shared various issues common to financial institutions. Customer data resides in silos, making it difficult to get a clear picture of a complete customer. This is done for both security and performance considerations. These silos present many difficulties. There are 3 “major” processing branches to the bank (Insurance, Wealth and Banking), all with their own set of data sources. Data is accessed within these silos through a mix of in-house and outsourced applications. These silos also make consumption from downstream systems, like BI tools, difficult. Data augmentation is very difficult. As a financial institution, we have to adhere to strict regulatory compliance. Heavy technical debt was incurred due to the vast variety of source and reporting systems. There is lots of auditing overhead. Vendor applications imposed various 3rd-party dependencies, increasing support complexity.
  7. To address these deficiencies in our new architecture, we had several goals. We wanted to abstract ourselves from our source systems, so that as they change over time, our downstream systems remain unaffected. We needed a system that can predictably scale in terms of cost and performance. We wanted to have complete ownership of the depth and breadth of our data, meaning we didn’t want to be limited in terms of what and how much data we can keep. We wanted an architecture that enabled improved data quality and stewardship, pushing data ownership down through the business lines. We also wanted a system that drove iterative analytics throughout the enterprise: if a particular business line wanted to partake in its own data science experiment, we want to be able to provide the optimal platform. Finally, I use this quote as my team’s mantra: we want to make the bank smarter. I’m not saying the bank isn’t smart now, far from it. I’m speaking to the fact that we want to empower the bank to be proactive in leveraging all of the data assets at its disposal to make the most informed decisions possible.
  8. While this may look like a pyramid, this illustration is meant to reflect the granularity, or field of vision, of the data available to each level of our new stack. At the bottom we have our systems of record, which quickly and precisely execute transactions; an example of this is teller transactions. Next up we have our data warehouse layer, which adds history and allows us to report against these transactions. Above that we have our Customer Master system, which shows us the breadth of the relationships our customers have with us; for example, it identifies a customer’s banking account relationships as well as any insurance or wealth relationships.
  9. Hadoop makes us reconsider what we save and what is valuable. Thinking different, etc. Hadoop allows us, even demands us, to think differently about keeping and using data. Exhaust data (log files, transaction detail history, email, any evidence of interactions), regardless of format, can now be “customer data”. Hadoop allows us to store and process vast amounts of this formerly untapped data to achieve customer intimacy at web scale. If we think about what data represents for customer behavior, we can know what a customer does with us, not just what they have with us. This knowledge will allow us to customize services to customers, and fosters a more intimate relationship, even if we don’t see them every day in a branch.
  10. In terms of Hadoop, when implementing a system that can ingest any type of data, we are immediately faced with challenges. Unchecked, your large data store would quickly become cluttered, and you lose sight of what you actually have. Some people equate this to having your big data lake become a swamp. However, I have little girls, so I equate it to trying to find Waldo. At Eastern Bank we need to be mindful of our customers, and make every effort to protect their data. And in case you’re wondering… Waldo is here.
  11. To prevent these issues, we institute some strict governance policies. These policies govern: what data goes into Hadoop; validation of data against source systems where available; data lineage (we need to track what happens to that data once it is in the system: is it manipulated? If so, how and by whom?); and who has access to this data. Finally, we constantly need to balance what data we can keep vs what data we should keep. These policies are developed and managed, not by a bunch of data nerds, but by a multi-disciplinary team drawn from all business lines of the bank (info sec, systems eng, deposit ops, etc.), driving stewardship into the lines. The good news is that larger banks are doing this, so some precedent has been set. We also need to secure Hadoop in terms of who can see what types of data. Sometimes this means that copies of data need to be created to mask certain PII for analytics. And finally, with a new technology like Hadoop, constant evangelism is needed to drive consumption. Speak to making the bank smarter. Speak to security in the banking industry.
  12. We partnered with MapR to build our initial fraud model as a proof of concept. From this POC my team was brought up to speed on how the machine learning process works from an engineering standpoint, and more importantly how we can maintain and iterate on this model for future development. Also, when talking about fraud modeling, there is a lot of “secret sauce”, so with some of the data representations you’re about to see… I made some stuff up. But I can tell you it all reflects what we see on a day-to-day basis.
  13. Every 3 years the Federal Reserve releases a study on financial fraud. In 2012, the estimated number of “third party fraud” transactions was 31.1 million, which equates to a value of $6.1B. A majority of these, as you can see from these charts, were centered around debit card activity. These numbers drive why fraud is an excellent candidate for a proof of concept. The subject matter is high visibility with a known monetary impact.
  14. In the past three years fraud has exploded across the industry. >400%
  15. So for this case study, we wanted to develop a DAILY process to work in conjunction with our existing fraud detection tools, not replace them. We wanted to leverage Hadoop to traverse our individual customers’ histories for anomalous patterns. We wanted to look at individual customer behavior in more detail than what is currently available with vendor solutions, to detect as much fraud as possible. This use case forced us to look at data differently.
  16. The workflow for this use case consists of 4 primary steps. First is collecting data. This includes not only transaction data but claims data as well (known frauds!), which is used for training and testing the fraud model. Next is the design of features, which help identify patterns indicative of fraud. Examples of these may include the $$ transacted today vs history, # of transactions, etc. Next we train our model and start scoring accounts. And finally we analyze suspicious accounts to track feature performance and false positives.
  17. For this exercise we will use two different data sets. Claims data will be used for training and testing our model. Having customers that file claims based on fraudulent activity against their card gives us the ability to train our model against actual fraud patterns. However, we can’t just jump in and use all of it. For model building, it is important that only the first known fraud on an account is used. Otherwise, the model may see the prior fraudulent behavior, giving it an unrealistic advantage. Transactional data is then used to generate feature values and ultimately a “score” for each account.
  18. Features are calculated variables that are predictive of fraud, in a format readily consumed by machine learning algorithms. Think of variables that predict fraud, for example $$/day or # trans/day. Values for a feature are calculated by comparing its value for today vs history. Features are engineered so that high values equate to more suspicious activity. This is why we are using Hadoop: processing vast amounts of data in minutes.
  19. The true goal is to provide a robust estimate of expected model performance. For the purposes of this exercise we are going to use a logistic regression model for several reasons, the most important being that it is easy to implement in code and it offers insight into which features influenced the score. As part of the model building process we can also test its performance against known fraud in our testing set. The performance of the model can be visualized in an ROC chart. The dotted red line represents the “brute force” amount of transactions we’d need to look at to find the corresponding amount of fraud… if we look at half the transactions, we’ll likely find half the fraud. The goal of predictive analytics is to pull that curve as far up and to the left as we can. How can we look at fewer transactions and still find more fraud? When the blue line (training) and the green line (testing) both move up and to the left, we’re onto a model that shows strong correlation between our features and the outcome we’re looking for. Finally, the graph on the right represents our false positive rate.
  20. Once our model is built and tested, we are ready to score accounts. The scoring process takes the following steps: pull a list of unique accounts that transacted on the day being scored; pull all available transactions for each account, up to and including the day being scored; generate features based on transaction values (remember, these features are generated for both “today” and every day in the past); finally, generate a score based on the feature values and their weights.
  21. Segue to validation.
  22. For this validation step we’re going to sort accounts by their score; 1 being the maximum score an account can obtain. The feature values that influenced that score are listed across. The higher the feature value, the more influence it had on the score.
  23. Let’s take a look at the top scorer here. It looks like Feature 6 is heavily influencing this score. For this example, Feature 6 is a ratio that compares the $$ spend today vs the account’s historical daily dollar spend. So let’s take a look at this account’s transactions.
  24. It looks like our top scorer had only one transaction, which was a significant airline purchase, which in all likelihood is legitimate. This is an example of a false positive. Feature 6, while properly calculated, placed too much value on this single, large transaction, which overly influenced the score.
  25. Moving down the list, we see that the 4th record down has a good distribution of feature values. Features 2, 3, 4 and 5 all show some significance.
  26. Pulling this account’s transactions for the day scored shows some highly suspicious activity: a large number of online transactions in a very short period of time. This is clearly an account worth investigating.
  27. Based on this testing we then iterate on the model by: building new features; removing features that don’t perform well; addressing feature interactions (do two features influence each other? For example, having a feature that evaluates $$ spend, and another that evaluates # of trans/day, could potentially cause an interaction. Perhaps a calculation that combines the two, like an average or median, would work better); constantly striving to minimize false positives; and trying other algorithms. At Eastern Bank we use an approach similar to the scientific method during this process: we come up with a hypothesis, test it, and document the outcomes. Talk to examples: higher volume, $$, etc.
  28. The next step for this use case is to bring it out of a batch process and into a real-time framework, using Spark and MLlib. As a result of being faster with scoring (same day), we can look to expand our customer interaction with real-time alerting through email or mobile. And from these interactions we can build a more agile feedback loop that will allow us to retrain our model more quickly, based on feedback from real fraud as well as false positives. The closer to real time we get, the more valuable to customers we’ll be.
  29. So to wrap up: banking is evolving to online and mobile, and customers want value-added services that are relevant to them. Hadoop addresses a very large gap in our data architecture by allowing us to ingest a variety of data that previously was not available to us, or could not be analyzed effectively. Hadoop allows us to become smarter about our customers, which in turn lets us cater services and products in a more effective manner: more benefits to the customer, and a better understanding of them through their transactions. However, all of this power and agility needs to be governed to avoid risk. And finally, the fraud detection use case is only the tip of the iceberg in terms of what Hadoop can do. We now have a solid foundation to let us bridge into newer technologies such as streaming, NoSQL and others.
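
The evaluation described in note 19 (how much fraud is found when only the highest-scoring accounts are reviewed) can be sketched in a few lines. The chart on slide 23 plots fraud detection rate against the share of accounts reviewed, which is closer to what is often called a cumulative gains curve than a textbook ROC curve. The function name, the rounding of the review cutoff, and the toy scores and labels below are assumptions for illustration, not the bank's data.

```python
def detection_rate_at(scores, is_fraud, review_fraction):
    """Share of known fraud found when reviewing the top-scoring accounts.

    scores:          model score per account (higher = more suspicious)
    is_fraud:        True where a fraud claim is known for that account
    review_fraction: share of accounts reviewed, between 0.0 and 1.0
    """
    if len(scores) != len(is_fraud):
        raise ValueError("scores and is_fraud must have the same length")
    total_fraud = sum(1 for flag in is_fraud if flag)
    if total_fraud == 0:
        raise ValueError("need at least one known fraud to compute a rate")

    # Rank accounts from most to least suspicious.
    ranked = sorted(zip(scores, is_fraud), key=lambda pair: pair[0], reverse=True)

    # Number of accounts reviewed, rounded to the nearest whole account.
    n_review = round(len(ranked) * review_fraction)
    caught = sum(1 for _, flag in ranked[:n_review] if flag)
    return caught / total_fraud


if __name__ == "__main__":
    # Toy data: 10 accounts, 4 of them with known fraud.
    scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
    labels = [True, True, False, True, False, False, False, True, False, False]
    for fraction in (0.2, 0.5, 1.0):
        print(fraction, detection_rate_at(scores, labels, fraction))
```

A random ordering would be expected to find fraud in proportion to the share reviewed, which is the reference line in the chart; a useful model finds a larger share of the fraud in the first accounts reviewed.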