Machine Learning for Fraud Detection

Bigger Data. Better Results.™
Machine Learning for Fraud Detection
Nitesh Kumar, PhD
nitesh@skytree.net

Who Am I?

•  Applied Math PhD
•  Derivative/Options Pricing Background
•  7 years doing analytics
•  Data Science at Skytree for 2 years

Skytree Inc.

•  Came out of Alex Gray’s (CTO) FastLab @ Georgia Tech
•  Software company that provides Machine Learning software
•  Built to function on top of Hadoop
•  Automation, speed, and scalability
•  User can interact through command line interface, APIs, and GUI
•  20 million dollars in Series A
•  TAB: Michael Jordan, James Demmel, Dave Patterson, Pat Hanrahan

What is Skytree?

•  Machine Learning Platform
GBM, K-means, RF, SVD/PCA, Linear/Logistic, SVM, collaborative filtering, etc.
•  Built for Big Data
Scales linearly with data size and compute nodes (MapReduce, Hadoop)
•  Usability
SDK in Python, Java, REST, even GUI
Data preparation through Spark
•  Automation
1-click modeling
•  ML on Bigger Data produces Better Results
Larger datasets lead to higher accuracy

Outline

•  Introduction
Why Skytree, Big Data, and Machine Learning for Fraud?

•  Machine Learning in Financial Services
Issues, methods, and solution
•  Live Demo of Skytree on real-world dataset (command line, API, GUI)
Time and setup permitting

Introduction

•  Fraud is a Big problem (Big Data, Big Cost)

•  Why is Machine Learning necessary?
•  Comprehensive solution?

Fraud is a Big Data Problem

•  “More than 23 billion credit card transactions are processed annually in USA”
CreditCards.com

•  Credit card transactions alone generate multiple terabytes of data a year
•  Each transaction has 100-300 attributes

•  Data is distributed across multiple nodes

Fraud is a Big Cost Problem
•  “Businesses lose an estimated $3.5 billion annually to fraud and financial crime.”
Forbes, 2014
•  “Total value of credit card transactions in the U.S. in 2012: $2.48 trillion”
CreditCards.com
http://www.federalreserve.gov/releases/g19/Current/

Why Machine Learning?

•  Traditional approaches that find patterns through careful, hand-crafted querying do not scale to large datasets

•  Prior rule-based engines do not make use of information from multiple attributes at the same time
•  Machine Learning is concerned with algorithms that can learn from data
Multivariate Statistics
Automated predictive analytics

•  Even a tiny increase in accuracy can lead to millions of dollars in savings

Gap between Machine Learning and Big Data

Ø  Awakening
to

Big
Data,

experimen3ng

with
ML?

Ø  ML
is
necessary

to
derive
value

out
of
Big
Data

ML on Bigger Data produces Better Results
•  Weak and Strong Law of Large Numbers

•  “We have shown that for a prototypical natural language classification task, the performance of learners can benefit significantly from much larger training sets.”
Banko and Brill, Proceedings of ACL, 2001.
•  “Breiman’s procedure (random forest) is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.” Gérard Biau, JMLR, 2012
•  Sometimes Big Data is all you need!
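
The law of large numbers cited on this slide can be illustrated with a short simulation. This is a minimal sketch and not part of the original talk: the Uniform(0, 1) distribution, the seed, and the sample sizes are arbitrary choices made for illustration. It shows only that the sample mean tends to get closer to the true mean as the number of observations grows.

```python
import random


def sample_mean_error(n, seed=0):
    """Absolute error of the mean of n Uniform(0, 1) draws versus the true mean 0.5."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same draws
    total = sum(rng.random() for _ in range(n))
    return abs(total / n - 0.5)


if __name__ == "__main__":
    # Illustrative sample sizes (assumption, not from the slides).
    for n in (100, 10_000, 100_000):
        print(n, sample_mean_error(n))
```

The error of a single run is random, so individual runs need not shrink monotonically; the trend appears on average as n increases.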

Experiment: ML on Bigger Data produces Better Results
•  Source dataset: DNA dataset from Pascal Large Scale Learning Challenge.
•  A 4M-row dataset was held out for testing. Training datasets with 20M, 40M, 80M, 160M, 320M, 640M, 5120M elements, arranged into 200 columns, were used. No featurization was applied.
•  Optimal model for each training dataset size was found by tuning Gradient Boosting Machine on a holdout dataset with Skytree smart-search.
•  AUC (Area under ROC curve) was used for evaluation.
•  Experiment by Skytree Inc, 2015
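
For readers unfamiliar with the metric, AUC can be computed from ranks alone: it equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The following is a minimal pure-Python sketch of that identity. It is not the evaluation code used in the experiment, and the function name `auc` is chosen here for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 (1 = positive, e.g. fraud)
    scores: sequence of floats, higher means "more likely positive"
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block of tied scores pairs[i:j] and give each the average rank.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


if __name__ == "__main__":
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```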

Bigger Data, Better Results on Real World Data

Dataset Size (elements)    AUC
20,000,000                 93.9%
40,000,000                 95.0%
80,000,000                 95.6%
160,000,000                96.2%
320,000,000                96.7%
640,000,000                97.2%
5,120,000,000              98.1%
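
One way to read the table is to ask how many AUC points each doubling of the training data buys. The sketch below does that arithmetic on the figures reported above; the "gain per doubling" measure is a choice made here for illustration and is not a metric from the talk.

```python
import math

# (dataset size in elements, AUC in percent), copied from the table above.
RESULTS = [
    (20_000_000, 93.9),
    (40_000_000, 95.0),
    (80_000_000, 95.6),
    (160_000_000, 96.2),
    (320_000_000, 96.7),
    (640_000_000, 97.2),
    (5_120_000_000, 98.1),
]


def auc_gain_per_doubling(results):
    """For each consecutive pair of rows, AUC points gained per doubling of data."""
    gains = []
    for (n0, auc0), (n1, auc1) in zip(results, results[1:]):
        doublings = math.log2(n1 / n0)
        gains.append((n0, n1, (auc1 - auc0) / doublings))
    return gains


if __name__ == "__main__":
    for n0, n1, gain in auc_gain_per_doubling(RESULTS):
        print(f"{n0:>13,} -> {n1:>13,}: {gain:.2f} AUC points per doubling")
```

On these figures the gain per doubling falls from about 1.1 points at the smallest sizes to about 0.3 points over the last step (an 8x increase), while AUC keeps rising throughout.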

Machine Learning Solution for Financial Services

Multiple algorithms for higher accuracy
•  Gradient Boosting
•  Random Decision Forest
•  SVM
•  Stacked models (combined models)
•  Mixed models (combine supervised and unsupervised models)

Automatic Parameter Selection
•  Automatically create the best performing model for any algorithm in fewer iterations
•  Allow for usage by domain experts (non data scientists)
•  Higher accuracy: a machine can tune better than humans

Speed and Scalability
•  Big Data scale
•  Catch latest trends in fraud
•  Improve accuracy
•  Iterate over multiple algorithms and parameters
•  Faster model creation and model update

Visualization and Optimization
•  Optimize directly for dollars
•  Visualize model performance
•  Provide knobs to choose a model
•  Ensure optimality of models without overfitting
•  Visualize models to interpret results

Machine Learning for Fraud Detection

•  Countering Fraud is a Machine Learning Problem

•  Challenges
•  Solution (GBM and advanced)

Fraud Detection
•  Counter complex and transient fraud patterns

•  Analyze multiple and large datasets to discover and predict fraud
“More than 23 billion credit card transactions are processed annually in USA” CreditCards.com

Machine Learning Problem
Supervised Learning: Predict Fraud
•  Collect historical transactions
•  Learn from past examples of fraud
•  Predict fraud (in real-time)

Unsupervised Learning: Discover Fraud
•  Segment transactions
•  Investigate potentially new fraud
•  Detect outliers

Mixed Approach: Discover and Predict Fraud
•  Detect “Points of Compromise” to prevent fraud
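
As an illustration of the unsupervised "detect outliers" step, the sketch below flags unusual transaction amounts with a modified z-score based on the median and the median absolute deviation (MAD). This is one simple, generic technique chosen here for illustration; the slides do not say which outlier method Skytree uses. The function name, the 3.5 cutoff (a common rule of thumb), and the sample amounts are all assumptions.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * (x - median) / MAD, which is less
    sensitive to the outliers themselves than a mean/standard-deviation z-score.
    """
    values = list(values)
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # More than half the values are identical; the score is undefined here.
        return []
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - median) / mad) > threshold
    ]


if __name__ == "__main__":
    # Hypothetical transaction amounts in dollars.
    amounts = [12.5, 9.9, 14.2, 11.0, 10.4, 13.1, 950.0]
    print(mad_outliers(amounts))  # [6]
```

Flagged transactions are candidates for investigation rather than confirmed fraud; in a mixed approach they could later become labeled examples for a supervised model.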

Common Issues

•  Imbalanced Datasets
Too few examples of ‘known’ fraud
•  What to optimize?
Fraud capture rate
False positive rate: what is the cost associated?
Total loss incurred due to fraud
What loss function to use
•  How to handle missing values?
•  Which algorithm to use?
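
The "what to optimize" question can be made concrete by computing the candidate metrics at a given score threshold. The sketch below is illustrative only. It assumes each transaction has a model score, a known label, and a dollar amount. It also assumes every flagged transaction costs a fixed `review_cost` to investigate, that flagged fraud is fully prevented, and that missed fraud loses the full amount. None of these cost assumptions come from the slides.

```python
def evaluate_threshold(scores, labels, amounts, threshold, review_cost=5.0):
    """Capture rate, false positive rate, and dollar cost at one threshold."""
    tp = fp = fn = tn = 0
    missed_dollars = 0.0
    for score, is_fraud, amount in zip(scores, labels, amounts):
        flagged = score >= threshold
        if flagged and is_fraud:
            tp += 1
        elif flagged:
            fp += 1
        elif is_fraud:
            fn += 1
            missed_dollars += amount  # assumed: missed fraud loses the full amount
        else:
            tn += 1
    return {
        "capture_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "total_cost": missed_dollars + review_cost * (tp + fp),
    }


def best_threshold(scores, labels, amounts, review_cost=5.0):
    """Pick the observed score that minimizes total dollar cost."""
    return min(
        sorted(set(scores)),
        key=lambda t: evaluate_threshold(
            scores, labels, amounts, t, review_cost)["total_cost"],
    )


if __name__ == "__main__":
    # Hypothetical scores, labels (1 = fraud), and transaction amounts.
    scores = [0.9, 0.8, 0.3, 0.2, 0.1]
    labels = [1, 0, 1, 0, 0]
    amounts = [100.0, 20.0, 250.0, 40.0, 10.0]
    print(evaluate_threshold(scores, labels, amounts, threshold=0.5))
    print(best_threshold(scores, labels, amounts))
```

In this toy data the second fraud case has a low score but a large amount, so the dollar-optimal threshold is lower than a threshold chosen on classification accuracy alone would be.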

[Current] Industry Standard Solution

GBM algorithm (Friedman, 2001 and variants)

•  Sequentially combines simple models, with each “new” model correcting the mistakes of the previous ones
•  Base model in this case is decision trees
•  Inspired by gradient descent in optimization
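
The sequential idea described above can be sketched in a few lines. This is a deliberately minimal illustration, not Skytree's implementation and not a full reproduction of Friedman's algorithm. It assumes a single numeric feature, squared-error loss (for which the negative gradient is simply the residual), and depth-1 decision trees (stumps) as base learners. All function names are chosen here for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on one feature by least squares.

    Returns (split, left_value, right_value); points with x < split
    are predicted as left_value, all others as right_value.
    """
    best = None
    for split in sorted(set(xs))[1:]:  # every split leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    split, left_value, right_value = stump
    return left_value if x < split else right_value


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = sum(ys) / len(ys)  # start from the constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of squared loss: the "mistakes" so far.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]
    ys = [0, 0, 0, 1, 1, 1]
    model = fit_gbm(xs, ys)
    print([round(predict_gbm(model, x), 3) for x in xs])
```

A fraud classifier would more typically use a classification loss such as log loss and trees over many features; the structure of the loop stays the same, with the residuals replaced by the negative gradient of the chosen loss.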

GBM Pros

•  Automatically handles missing values
•  Highly accurate models
•  Captures nonlinearity in the data
•  Does not require deep understanding of the data

GBM Cons
•  Does not handle datasets with high dimensions well
•  Minimizes bias, not necessarily variance
•  Chance of overfitting the training data when data is noisy
•  Not the best at handling very high imbalance in the data
•  Requires extensive parameter tuning
•  Not simple to distribute

GBM: overcoming the odds
•  Does not handle datasets with high dimensions well
    •  SVMs handle datasets with high dimensionality
•  Minimizes bias, not necessarily variance
    •  Ensemble of GBM (eGBM, Skytree, 2013) and stochastic GBM (sGBM)
        •  eGBM: Idea is to use ensembles of GBMs where each GBM is built using bootstrap samples
        •  sGBM: Each base learner (decision tree) uses different samples
    •  Mixed Models
        •  Combine Linear/Logistic models with GBM by blending/stacking
•  High chance of overfitting the training data
    •  Carefully check for generalization error
    •  Restrict to simple base learners (shallow decision trees) etc.
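
The two sampling ideas mentioned above, bootstrap samples per model and a different subsample per base learner, can be sketched generically. This is not Skytree's eGBM code. The helper names are hypothetical, `fit` and `predict` stand in for any model trainer (for example a GBM), and averaging is just one simple way to combine ensemble members.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def subsample(rows, fraction, rng):
    """Draw a fraction of the rows without replacement, as in stochastic boosting."""
    k = max(1, int(fraction * len(rows)))
    return rng.sample(rows, k)


def fit_bagged(rows, fit, n_models=10, seed=0):
    """Train n_models models, each on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(rows, rng)) for _ in range(n_models)]


def predict_bagged(models, predict, x):
    """Average the member predictions; averaging is what reduces variance."""
    preds = [predict(model, x) for model in models]
    return sum(preds) / len(preds)


if __name__ == "__main__":
    # Toy rows of (feature, label) and a trivial stand-in "model": the label mean.
    rows = [(i, i % 2) for i in range(20)]
    fit_mean = lambda sample: sum(y for _, y in sample) / len(sample)
    predict_const = lambda model, x: model
    models = fit_bagged(rows, fit_mean, n_models=7, seed=1)
    print(predict_bagged(models, predict_const, x=3))
```

In an eGBM-style setup, `fit` would train one GBM per bootstrap sample. In an sGBM-style setup, `subsample` would instead be called inside the boosting loop before fitting each tree.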

GBM: overcoming the odds

•  Not the best at handling very high imbalance in the data
    •  Ensemble GBMs, stochastic GBMs, Random Forests, etc.
•  Requires extensive parameter tuning
    •  Smart-Search (Skytree Inc., 2014)
        •  Patent-pending technology
        •  Optimization that iteratively learns from the previous iterations
        •  Successively improves the space in which to search for the best solution
        •  Faster way to obtain the optimal set of parameters
•  Not simple to distribute
    •  Bring in High Performance Computing (HPC) distribution techniques
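
The slides do not describe how Smart-Search works internally, so the following is not that algorithm. It is a generic sketch of the broader idea of a search that learns from earlier iterations: a random search that, after each round, narrows every parameter range around the best point found so far. The function name, the shrink factor, and the toy objective are all assumptions made for illustration.

```python
import random


def shrinking_random_search(objective, bounds, n_rounds=5,
                            samples_per_round=20, shrink=0.5, seed=0):
    """Maximize objective(params) by random search over a shrinking box.

    bounds: dict mapping parameter name -> (low, high)
    Returns (best_params, best_score).
    """
    rng = random.Random(seed)
    bounds = dict(bounds)
    best_params, best_score = None, float("-inf")
    for _ in range(n_rounds):
        for _ in range(samples_per_round):
            params = {name: rng.uniform(low, high)
                      for name, (low, high) in bounds.items()}
            score = objective(params)
            if score > best_score:
                best_params, best_score = params, score
        # Narrow each range around the best point, clipped to the current range.
        narrowed = {}
        for name, (low, high) in bounds.items():
            half_width = (high - low) * shrink / 2.0
            center = best_params[name]
            narrowed[name] = (max(low, center - half_width),
                              min(high, center + half_width))
        bounds = narrowed
    return best_params, best_score


if __name__ == "__main__":
    # Toy objective with its maximum at learning_rate=0.3, subsample=0.7.
    def toy_objective(p):
        return -((p["learning_rate"] - 0.3) ** 2 + (p["subsample"] - 0.7) ** 2)

    space = {"learning_rate": (0.0, 1.0), "subsample": (0.0, 1.0)}
    print(shrinking_random_search(toy_objective, space))
```

In practice the objective would be a validation metric such as holdout AUC, and each evaluation would train a model, which is why spending fewer evaluations matters.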

Let's see how it works!

•  Skytree Workspace

•  Demo
•  CLI
•  Python SDK
•  GUI

Unified Data Scientist Workspace
