How Eastern Bank Uses Big Data to Better Serve and Protect its Customers
1. >Eastern Bank Data Engineering
How Eastern Bank Uses Big Data to
Better Serve & Protect its Customers
Brian Griffith
Principal Data Engineer
2. Agenda
• Introduction
• Eastern Bank & the banking industry
– Data architecture and our big data journey
– Challenges
– Use Case:
• Debit card anomaly detection
3. @bwgriffith
• Database developer and engineer for 15 years
• Working in the “big data” space for about 5
years
– Blizzard Entertainment – Irvine, CA
– Localytics – Boston, MA
• Now @ Eastern Bank, helping engineer their next
generation data platform
b.griffith@easternbank.com
4. Eastern Bank
• 197 year old mutual bank (largest of its kind in the
country)
– Leader in corporate social responsibility
– 8th most charitable business in Massachusetts
• ~1 Million customers
• 4 Organizations:
– Banking: Eastern Bank
– Insurance: Eastern Insurance Group
– Wealth: Eastern Wealth Management
– R & D and Product Dev: Eastern Labs
5. Banking is Evolving
• Customer activity moving more into the mobile
space
• Diverse services continuously emerging
• Customers value personalized service
– Relevant value added services
– Personal relationships
6. Positioned for the Best of Both Worlds
• Like larger banks, leverage data in a manner
that allows us to offer improved features and
convenience
• Like smaller banks, leverage data in a manner
that allows us to offer more customized
services and relationships
8. Past Data Architecture Issues
• Customer data lives in transaction “silos”
– 3 Major data entities: Insurance, wealth, and
banking
– Data access via in-house or out-sourced solution
– Impedes analysis
• Regulatory compliance
– Technical Debt
– Auditing
– 3rd party dependencies
9. Data Architecture Goals
• Abstraction from source systems
• Scale horizontally, not vertically
• Complete ownership of depth and breadth of
our data
• Improve data quality and stewardship
• Drive iterative analytics throughout the
enterprise
• “Make the bank smarter”
10. Data Architecture
[Diagram: four-tier data architecture — Tx, Data Warehouse, Customer Master, Big Data Store]
• Eastern endeavors to be relationship-driven, not transaction-driven. In a digital economy, face-to-face interactions continue to decline. We need to rely on data integration and analytics to know our customers to best meet their evolving needs
• Our Data Architecture is built on four
interdependent “tiers” each with its own
capabilities and contributions to the overall
enterprise platform
11. Hadoop
[Diagram: data architecture with Hadoop tier — Tx, Data Warehouse, Customer Master, Big Data Store]
• Can be a significant driver of customer
intimacy in an increasingly digital world
• Allows us to leverage data we’ve never
thought of as “Customer Data” before
• Goes beyond what a customer has with us –
gives visibility into what a customer does with
us through behavioral analytics
• Scales ability to store with ability to process
• Platform natively supports data analytics
languages and machine learning tools
• Fast processing enables iterative exploration
14. Challenges
• Governance!
– Ingestion
– Data Lineage
– Data Quality
– Managing growth
• Balancing what data we “can” keep vs data we “should” keep
• Security
– Personally Identifiable Information (PII)
– Mask and limit view of data
• Driving Consumption
– “If you build it, they will come” Does not work by itself
– Constant evangelism
– Need to demonstrate value!
16. Hadoop Data Science
Fraud Detection Proof of Concept
17. Fraud in the Financial Industry
An Introduction
• In 2012, there were 31.1 million fraudulent transactions, with a value of $6.1 billion1
1 The 2013 Federal Reserve Payments Study
18. Debit Card Fraud
• Industry-wide debit card fraud has been rising at a significant rate
• > 400% in the last 3 years!
• Mostly due to breaches at large, national
retailers
19. Use Case Generation
• Develop process to work in conjunction with
existing fraud detection tools
– Existing tools mostly rules based
• Leverage Hadoop to traverse broad customer
history for anomalous patterns
– Behavioral analysis
20. Fraud Use Case Workflow
• DATA – sample transactions & claims to build training data
• FEATURES – identify account behavior patterns indicative of fraud
• TRAINING – scoring model will identify suspicious accounts the day after fraud happens
• TESTING – testing and validating features iteratively
21. Data
• Claims – customer reported
• Only use customer's first claim
• Model trained on all available transaction data
22. Features
• Variables indicative of fraud, formatted for
machine learning
• Example: dollarRatio = ratio of dollar spend today vs. history
• Values calculated by comparing variables
today vs history
– Ratios, log(n), binary, etc…
• Higher value = more suspicious
• Hadoop performance
23. Building and Evaluating the Model
[Chart: ROC for TestModel — Fraud Detection Rate vs. Total Accounts; series: training, testing, reference]
Receiver operating characteristic shows model tuning.
Reviewing 20% of accounts finds ~80% of anomalies.
Reference line shows predicted result of random sample.

Feature      Weight   Std Error   Z        p(>|Z|)
(Intercept)  -3.44    0.051       -66.93   < 2e-16
dollarRatio   0.09    0.007        11.75   < 2e-16

[Chart: False Positive Rate for TestModel — False Positive Ratio vs. Fraud Detection Rate; series: testing]
24. Scoring
• How anomalous were a day’s transactions
– Value range: 0.00 – 1.00
– Comparing a day to customer’s history
• Assigned to each unique account
• Function of weights & feature values
30. Results & Testing
ACCOUNT    Score    Feature 1   Feature 2   Feature 3   Feature 4   Feature 5   Feature 6
xxxxxxxx   0.9979   0           14.844      3.088       52.461      41.066      1

Merchant          Amount   Timestamp
Internet Vendor   $12.25   4/30/15 3:42 AM
Internet Vendor   $3.01    4/30/15 3:42 AM
Internet Vendor   $2.46    4/30/15 3:42 AM
Internet Vendor   $1.49    4/30/15 3:42 AM
Internet Vendor   $18.95   4/30/15 3:42 AM
31. Iterating
• Build new features
• Remove ineffective features
• Address feature interaction
• Minimize False Positives
• Try Different Algorithms
32. Next Steps
• Real time w/ Spark & MLlib
– Get closer to when fraud actually occurs
• Expanded customer reach via notifications
– Improved customer service
• More agile feedback loop based on customer
assessment
33. Other Uses
• Comparing customer behaviors day over day carries over to many use cases:
– Predicting churn
– Customer segmentation & personas
– Predicting Customer Lifetime Value (CLV)
34. Wrap up
• Banking is evolving
• Hadoop addresses a very large gap in our
architecture
• Empowers us to know more about our customers
through all of their interactions with us
• Needs to be governed
• Customer fraud detection only the tip of the
iceberg
35. Special Thanks
• Mark Leonard (Eastern Bank) – SVP, Data &
Development Director
• Joe Blue (MapR) – Data Scientist
In this presentation we will be talking about Eastern Bank’s Journey with Hadoop.
This journey is relatively young as we have only had Hadoop under our roof for about 6 months, but in that time we have done some interesting things, as well as learned some valuable lessons.
In this talk I will begin by reviewing the data challenges we face in the banking industry.
I’ll then discuss how hadoop fits into our overall data architecture to help us realize our overall data strategy
Finally, I will dive into a few hadoop-centric use cases, if you will, revolving around debit card anomaly detection
Been working in databases my whole career.
Started out developing small scale OLTP and reporting databases
Transitioned to more data warehousing (star schema)
Then moved out west to Blizzard and helped engineer and build out their first hadoop deployment
Then decided to move back east and worked at a local startup Localytics and their broad AWS system
I then was presented with an opportunity @ EB to help build out a completely new data architecture from the ground up.
Being mutual means we are not publicly traded; our shareholders are our customers
Eastern Bank actually consists of multiple entities:
EIG – Insurance
EWM – Wealth Management
Labs – R & D
Banking industry, as a whole, is changing. Customer activity is moving more and more into the online and mobile space.
Recent marketing research has shown that customers are prioritizing the availability of web and mobile services, even if they do not intend to use them.
As a result, new financial services are emerging in the marketplace.
Even with all of this talk of the digital space, customer studies continue to show that customers value a personal relationship above all else.
Maintaining this relationship gets more difficult with size
EB finds itself in a unique position:
Big Banks all extolling the virtues of big data.
Smaller banks can’t compete there so they are focusing on developing a 360 degree view of the customer.
We’re uniquely positioned to do both.
We have the skills and the scale to leverage big data – and –
We’re still at a size where we know our customers well enough to get to a leverage a 360 view of their relationship with us.
Targeted products and improved features and products
Our old data architecture shared various issues common with financial institutions.
Customer data resides in silos, making it difficult to get a complete picture of the customer. This is done for both security and performance considerations.
These siloes present many difficulties:
There are 3 “major” processing branches to the bank: Insurance, Wealth and Banking; all with their own set of data sources
Data is accessed within these siloes in a mix of in house or out-sourced applications
These siloes also make consumption from downstream systems, like BI tools, difficult
Data augmentation is very difficult
As a financial institution, we have to adhere to strict regulatory compliance
Heavy technical debt was incurred due to the vast variety of source and reporting systems
Lots of auditing overhead
Vendor application imposed various 3rd party dependencies, increasing support complexity
To address these deficiencies in our new architecture, we had several goals
We wanted to abstract ourselves from our source systems so that as they change over time, our downstream systems will remain unaffected
We need a system that can predictably scale in terms of cost and performance
We wanted to have complete ownership of the depth and breadth of our data. Meaning we didn’t want to be limited in terms of what and how much data we can keep.
We wanted an architecture that enabled improved data quality and stewardship. Pushing data ownership down through the business lines.
We also wanted a system that drove iterative analytics throughout the enterprise. If a particular business line wanted to partake in their own data science experiment, we want to be able to provide them the optimal platform.
Finally, I use this quote as my team’s mantra. We want to make the bank smarter. I’m not saying the bank isn’t smart now, far from it. I’m speaking to the fact that we want to empower the bank to be proactive in leveraging all of its data assets at its disposal to make the most informed decisions possible.
While this may look like a pyramid, this illustration is meant to reflect granularity or field of vision of the data available to each level of our new stack.
At the bottom we have our systems of record that quickly and precisely execute transactions.
An example of this being teller transactions
Next up we have our data warehouse layer, which adds history and allows us to report against these transactions
Above that we have our Customer Master system, which shows us the breadth of the relationships our customers have with us.
An example of this identifies a customer's banking account relationships as well as any insurance or wealth relationships
Hadoop
Makes us reconsider what we save and what is valuable. Thinking different, etc..
Hadoop allows us -- demands us to think differently about keeping and using data.
Exhaust data – log files, transaction detail history, email, any evidence of interactions, regardless of format – can now be "customer data"
Hadoop allows us to store and process vast amounts of this formerly untapped data to achieve customer intimacy at web scale. If we think about what data represents for customer behavior, we can know what a customer does with us, not just what they have with us.
This knowledge will allow us to customize services to customers, and fosters a more intimate relationship, even if we don’t see them every day in a branch.
In terms of Hadoop, when implementing a system that can ingest any type of data, we are immediately faced with challenges.
Unchecked, your large data store would quickly become cluttered, and you lose sight of what you actually have.
Some people equate this to having your big data lake become a swamp.
However, I have little girls, so I equate it to trying to find Waldo.
At Eastern Bank we need to be mindful of our customers, and make every effort to protect their data.
And in case you’re wondering… Waldo is here.
To prevent these issues, we institute some strict governance policies. These policies govern:
What data goes into hadoop
Validation of data against source systems where available
Data Lineage - We need to track what happens to that data once in the system.
Is it manipulated? If so, how and by who?
Who has access to this data?
And finally we constantly need to balance what data we can keep vs what data we should keep
These policies are developed and managed, not by a bunch of data nerds, but by a multi-disciplinary team consisting of all business lines of the bank (info sec, systems eng, deposit ops, etc.), driving stewardship into the lines.
The good news is that larger banks are doing this, so some precedent has been set.
We also need to secure hadoop, in terms of who can see what types of data. Sometimes this means that copies of data need to be created to mask certain PII for analytics.
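As a rough illustration of what masking a copy of the data for analytics might look like, the sketch below replaces PII columns with one-way hashes. The field names are hypothetical, and this is not Eastern Bank's actual masking implementation:

```python
import hashlib

def mask_pii(record, pii_fields=("ssn", "name", "address")):
    """Return a copy of the record with PII columns replaced by one-way hashes."""
    masked = dict(record)
    for field in pii_fields:
        if field in masked and masked[field] is not None:
            digest = hashlib.sha256(str(masked[field]).encode("utf-8")).hexdigest()
            # Truncated hash: a stable join key for analysts, but not readable PII.
            masked[field] = digest[:16]
    return masked

row = {"account": "12345", "name": "Jane Doe", "amount": 12.25}
print(mask_pii(row))  # name is hashed; non-PII fields pass through unchanged
```

Because the hash is deterministic, analysts can still group and join on the masked columns without ever seeing the underlying values.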
And finally, with a new technology like hadoop, constant evangelism is needed to drive consumption.
Speak to making the bank smarter
Speak to security in banking industry
We partnered with MapR to build our initial fraud model as a proof of concept. From this POC my team was brought up to speed on how the ML learning process works from an engineering stand point, and more importantly how we can maintain and iterate this model for future development.
Also, when talking about fraud modeling, there is a lot of “secret sauce”, so with some of the data representations you’re about to see…. I made some stuff up. But I can tell you it all reflects what we see on a day to day basis.
Every 3 years the Federal reserve releases a study on financial fraud.
In 2012, the estimated number of “third party fraud” transactions was 31.1 million, which equates to a value of $6.1B
A majority of these, as you can see from these charts, were centered around debit card activity.
These numbers drive why fraud is an excellent candidate for a proof of concept. The subject matter is high visibility with a known monetary impact.
In the past three years fraud has exploded across the industry.
>400%
So for this case study, we wanted to develop a DAILY process to work in conjunction with our existing fraud detection tools, not replace them.
We wanted to leverage hadoop to traverse our individual customer’s histories for anomalous patterns.
We wanted to look at individual customer behavior in more detail than what is currently available with vendor solutions to detect as much fraud as possible.
This use case forced us to look at data differently
The workflow for this use case consists of 4 primary steps.
Collecting data. This includes not only transaction data, but claims data as well (known frauds!), which would be used for training and testing of the fraud model
Next is the design of features, which help identify patterns indicative of fraud. Examples of these may include the $$ transacted today vs history, # transactions, etc…
Next we will train our model and start scoring accounts
And Finally we will analyze suspicious accounts to track feature performance and false positives
For this exercise we will use two different data sets.
Claims data will be used for training and testing our model. Having customers that file claims based on fraudulent activity against their card gives us the ability to train our model against actual fraud patterns. However, we can't just jump in and use all of it. For model building, it is important that only the first known fraud on an account is used. Otherwise, the model may see the prior fraudulent behavior, giving it an unrealistic advantage.
Transactional data is then used to generate feature values and ultimately a “score” for each account.
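The first-claim rule can be sketched in a few lines. This is a simplified stand-in (field names like `account` and `claim_date` are assumed, not from the actual system):

```python
from datetime import date

def first_claims(claims):
    """Keep only the earliest claim per account, so the training set never
    reflects behavior that follows an already-known fraud."""
    earliest = {}
    for c in claims:
        acct = c["account"]
        if acct not in earliest or c["claim_date"] < earliest[acct]["claim_date"]:
            earliest[acct] = c
    return list(earliest.values())

claims = [
    {"account": "A1", "claim_date": date(2015, 3, 1)},
    {"account": "A1", "claim_date": date(2015, 4, 15)},  # later claim, dropped
    {"account": "B2", "claim_date": date(2015, 2, 7)},
]
print(first_claims(claims))  # one claim per account
```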
Features are calculated variables that are predictive of fraud in a format readily consumed by machine learning algorithms
Think of variables that predict fraud. For example:
$$/day or #trans/day
Values for feature are calculated by comparing their value for today vs history
Features are engineered so that high values equate to more suspicious activity
This is why we are using hadoop. Processing vast amount of data in minutes.
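A feature like dollarRatio could be computed along these lines. This is an illustrative sketch, not the bank's actual feature code; the handling of empty or zero history is an assumption:

```python
def dollar_ratio(today_spend, history_daily_spends):
    """Ratio of today's dollar spend to the account's historical average
    daily spend. Engineered so that higher values are more suspicious."""
    if not history_daily_spends:
        return 0.0  # no history: treat as neutral rather than suspicious
    avg = sum(history_daily_spends) / len(history_daily_spends)
    if avg == 0:
        return 0.0
    return today_spend / avg

# An account that usually spends ~$20/day but spends $200 today scores 10x.
print(dollar_ratio(200.0, [18.0, 22.0, 20.0]))  # -> 10.0
```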
The true goal is to provide a robust estimate of expected model performance.
For the purposes of this exercise we are going to use a logistic regression model for several reasons, the most important being:
It is easy to implement in code
It offers insight into which features influenced the score
As part of the model building process we can also test its performance against known fraud in our testing set. The performance of the model can be visualized in an ROC chart.
The dotted red line represents the “brute force” amount of transactions we’d need to look at to find the corresponding amount of fraud… if we look at half the transactions, we’ll likely find half the fraud.
The goal of predictive analytics is to pull that curve as far up and to the left as we can. How can we look at fewer transactions and still find more fraud?
When the blue line (training) and the green line (testing) both move up and to the left, we’re onto a model that shows strong correlation between our features and the outcome we’re looking for.
Finally the graph on the right represents our false positive rate.
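The train-and-evaluate loop described above can be sketched with scikit-learn on synthetic stand-in data. All numbers here are made up for illustration; the real model was trained on the bank's Hadoop platform:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
# Synthetic stand-in: one feature (e.g. dollarRatio), higher for fraud accounts.
X = np.concatenate([rng.normal(1.0, 1.0, (1800, 1)),   # legitimate accounts
                    rng.normal(4.0, 1.0, (200, 1))])   # first-claim fraud accounts
y = np.concatenate([np.zeros(1800), np.ones(200)])

model = LogisticRegression()
model.fit(X, y)

scores = model.predict_proba(X)[:, 1]   # 0.00-1.00 anomaly score per account
fpr, tpr, _ = roc_curve(y, scores)
print(f"AUC: {auc(fpr, tpr):.2f}")      # random reference would be 0.5
print("intercept:", model.intercept_[0], "weight:", model.coef_[0][0])
```

Plotting `tpr` against the fraction of accounts reviewed gives the ROC curve from the slide; pulling it up and to the left means finding more fraud while reviewing fewer accounts.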
Once our model is built and tested, we are ready to score accounts.
In terms of the scoring process the following steps will be taken:
Pull a list of unique accounts that transacted on the day being scored
Pull all available transactions for that account, up to and including, the day being scored.
Generate features based on transaction values. Remember these features are generated for both “today” as well as everyday in the past.
Finally, generate a score based on the feature values and their weights.
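The final step is just the logistic function applied to the weighted feature sum. A sketch using the coefficients from the slide's table (intercept -3.44, dollarRatio weight 0.09); the example feature values are hypothetical:

```python
import math

def score(features, weights, intercept):
    """Logistic-regression score in [0, 1]: higher = more anomalous day."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients from the fitted model's summary table.
intercept, weights = -3.44, [0.09]
print(round(score([1.0], weights, intercept), 4))   # typical day: low score
print(round(score([60.0], weights, intercept), 4))  # ~60x normal spend: high score
```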
Segue to validation
For this validation step we’re going to sort accounts by their score; 1 being the maximum score an account can obtain.
The feature values that influenced that score are listed across. The higher the feature value, the more influence it had on the score.
Let's take a look at the top scorer here.
It looks like Feature 6 is heavily influencing this score.
For this example, Feature 6 is a ratio that compares the $$ spent today vs the account's historical daily dollar spend.
So let's take a look at this account's transactions.
It looks like our top scorer had only one transaction, which was a significant airline purchase, which in all likelihood is legitimate.
This is an example of a false positive. Feature 6, while properly calculated, placed too much value on this single, large transaction, which overly influenced the score.
Moving down the list, we see that the 4th record down has a good distribution of feature values
Features 2, 3, 4 and 5 all show some significance.
Pulling this account's transactions for the day scored shows some highly suspicious activity. A large amount of online transactions in a very short period of time.
This is clearly an account worth investigating.
Based on this testing we then iterate on the model by:
Building new features
Removing features that don’t perform well
Address feature interactions. Do two features influence each other? For example, having a feature that evaluates $$ spend, and another that evaluates # of trans/day, could potentially cause an interaction. Perhaps a calculation that combines the two, like an average or median, would work better?
Constantly strive to minimize false positives.
Try other algorithms
At Eastern Bank we use an approach similar to the scientific method during this process.
We come up with a Hypothesis, test it, and document the outcomes.
Talk to examples higher volume, $$ etc..
The next step for this use case is to bring it out of a batch process and into a real-time framework, using Spark and MLlib
As a result of being faster with scoring (same day) we can look to expand our customer interaction with real time alerting through email or mobile
And from these interactions build a more agile feedback loop that will allow us to retrain our model quicker based on feedback from real fraud as well as false positives.
The closer to real time we get, the more valuable we'll be to customers.
So to wrap up,
Evolution to online and mobile
Customers want value added services that are relevant to them
Hadoop addresses a very large gap in our data architecture, by allowing us to ingest a variety of data that previously was not available to us, or could not be analyzed effectively.
Hadoop allows us to become smarter about our customers, which in turn lets us cater service and products in a more effective manner.
More benefits to the customer
Better understand them through transactions
However, all of this power and agility, needs to be governed to avoid risk.
And finally, the fraud detection use case is only the tip of the iceberg in terms of what hadoop can do. We now have a solid foundation to let us bridge into newer technologies such as streaming, NoSQL and others.