Machine Learning (ML) for Fraud Detection.
- fraud is a big problem (big data, big cost)
- ML on bigger data produces better results
- Industry standard today (for detecting fraud)
- How to improve fraud detection!
2. Bigger Data. Better Results.™
Who Am I?
• Applied Math PhD
• Derivative/Options Pricing Background
• 7 years doing analytics
• Data Science at Skytree for 2 years
3. Skytree Inc.
• Came out of Alex Gray’s (CTO) FastLab @ Georgia Tech
• Software company that provides machine learning software
• Built to function on top of Hadoop
• Automation, speed, and scalability
• User can interact through command line interface, APIs, and GUI
• $20 million in Series A funding
• Technical Advisory Board (TAB): Michael Jordan, James Demmel, Dave Patterson, Pat Hanrahan
4. What is Skytree?
• Machine Learning Platform
GBM, K-means, RF, SVD/PCA, Linear/Logistic, SVM, collaborative filtering, etc.
• Built for Big Data
Scales linearly with data size and compute nodes (MapReduce, Hadoop)
• Usability
SDK in Python, Java, REST, even GUI
Data preparation through Spark
• Automation
1-click modeling
• ML on Bigger Data produces Better Results
Larger datasets lead to higher accuracy
5. Outline
• Introduction
Why Skytree, Big Data, and Machine Learning for Fraud?
• Machine Learning in Financial Services
Issues, methods, and solution
• Live Demo of Skytree on real-world dataset (command line, API, GUI)
Time and setup permitting
6. Introduction
• Fraud is a Big problem (Big Data, Big Cost)
• Why is Machine Learning necessary?
• Comprehensive solution?
7. Fraud is a Big Data Problem
• “More than 23 billion credit card transactions are processed annually in USA”
CreditCards.com
• Credit card transactions alone generate multiple terabytes of data a year
• Each transaction has 100-300 attributes
• Data is distributed across multiple nodes
8. Fraud is a Big Cost Problem
• “Businesses lose an estimated $3.5 billion annually to fraud and financial crime.”
Forbes, 2014
• “Total value of credit card transactions in the U.S. in 2012: $2.48 trillion”
CreditCards.com
http://www.federalreserve.gov/releases/g19/Current/
9. Why Machine Learning?
• The traditional approach of finding patterns through handcrafted, careful querying does
not scale to large datasets
• Earlier rule-based engines do not make use of information from multiple attributes at
the same time
• Machine learning is concerned with algorithms that can learn from data
Multivariate Statistics
Automated predictive analytics
• Even a tiny increase in accuracy can lead to millions of dollars in savings
10. Gap between Machine Learning and Big Data
• Awakening to Big Data, experimenting with ML?
• ML is necessary to derive value out of Big Data
11. ML on Bigger Data produces Better Results
• Weak and Strong Law of Large Numbers
• “We have shown that for a prototypical natural language classification task, the
performance of learners can benefit significantly from much larger training sets.”
Banko and Brill, Proceedings of ACL, 2001.
• “Breiman’s procedure (random forest) is consistent and adapts to sparsity, in the
sense that its rate of convergence depends only on the number of strong features
and not on how many noise variables are present.” Gérard Biau, JMLR, 2012
• Sometimes Big Data is all you need!
12. Experiment: ML on Bigger Data produces Better Results
• Source dataset: DNA dataset from Pascal Large Scale Learning Challenge.
• A 4M-row dataset was held out for testing. Training datasets with 20M, 40M,
80M, 160M, 320M, 640M, and 5120M elements, arranged into 200 columns, were
used. No featurization was applied.
• Optimal model for each training dataset size was found by tuning Gradient
Boosting Machine on a holdout dataset with Skytree Smart-Search.
• AUC (Area under ROC curve) was used for evaluation.
• Experiment by Skytree Inc, 2015
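The AUC metric used for evaluation above can be computed directly from labels and model scores using the rank-sum (Mann-Whitney) identity. The following is a minimal pure-Python sketch; the function name `auc` and the toy data are illustrative assumptions and are not taken from the Skytree experiment.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 values (1 = positive class, e.g. fraud)
    scores: sequence of floats, higher = more likely positive
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    # Assign 1-based ranks, averaging the rank across tied scores.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example (illustrative data only).
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 1.0 means every positive example is scored above every negative one; 0.5 corresponds to random ranking.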
13. Bigger Data, Better Results on Real World Data
Dataset Size       AUC
20,000,000         93.9%
40,000,000         95.0%
80,000,000         95.6%
160,000,000        96.2%
320,000,000        96.7%
640,000,000        97.2%
5,120,000,000      98.1%
14. Machine Learning Solution for Financial Services
Multiple algorithms for higher accuracy
• Gradient Boosting
• Random Decision Forest
• SVM
• Stacked models (combined models)
• Mixed models (combine supervised and unsupervised models)
Automatic Parameter Selection
• Automatically create the best-performing model for any algorithm in fewer iterations
• Allow for usage by domain experts (non-data scientists)
• Higher accuracy: the machine can tune better than humans
Speed and Scalability
• Big Data scale
• Catch latest trends in fraud
• Improve accuracy
• Iterate over multiple algorithms and parameters
• Faster model creation and model update
Visualization and Optimization
• Optimize directly for dollars
• Visualize model performance
• Provide knobs to choose a model
• Ensure optimality of models without overfitting
• Visualize models to interpret results
15. Machine Learning for Fraud Detection
• Countering Fraud is a Machine Learning Problem
• Challenges
• Solution (GBM and advanced)
16. Fraud Detection
• Counter complex and transient fraud patterns
• Analyze multiple and large datasets to discover and predict fraud
“More than 23 billion credit card transactions are processed annually in USA” CreditCards.com
17. Machine Learning Problem
Supervised Learning: Predict Fraud
• Collect historical transactions
• Learn from past examples of fraud
• Predict fraud (in real time)
Unsupervised Learning: Discover Fraud
• Segment transactions
• Investigate potentially new fraud
• Detect outliers
Mixed Approach: Discover and Predict Fraud
• Detect “Points of Compromise” to prevent fraud
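As one illustration of the unsupervised "detect outliers" step, the sketch below flags transaction amounts that sit far from the median, using a modified z-score based on the median absolute deviation (MAD). This is only one simple technique among many and is not described in the slides; the function name `flag_outliers`, the 3.5 cutoff (a commonly used convention), and the sample amounts are all assumptions made for illustration.

```python
import statistics


def flag_outliers(amounts, threshold=3.5):
    """Return indices of amounts whose modified z-score exceeds `threshold`.

    The modified z-score uses the median and the median absolute deviation
    (MAD), which are less distorted by extreme values than mean and stdev.
    """
    med = statistics.median(amounts)
    mad = statistics.median([abs(a - med) for a in amounts])
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        return []
    return [
        i for i, a in enumerate(amounts)
        if 0.6745 * abs(a - med) / mad > threshold
    ]


# Hypothetical transaction amounts: seven typical values and one extreme one.
amounts = [20, 25, 22, 19, 24, 21, 23, 5000]
print(flag_outliers(amounts))  # [7]
```

Flagged transactions would then go to an investigator; confirmed cases can become labeled examples for the supervised model.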
18. Common Issues
• Imbalanced Datasets
Too few examples of ‘known’ fraud
• What to optimize?
Fraud capture rate
False positive rate: what is the associated cost?
Total loss incurred due to fraud
What loss function to use?
• How to handle missing values?
• Which algorithm to use?
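The "what to optimize?" question above can be made concrete by computing all three quantities (fraud capture rate, false positive rate, and total dollar loss) at a given score threshold, then picking the threshold with the lowest dollar loss. The sketch below assumes a deliberately simple cost model that is not from the slides: a missed fraud costs the full transaction amount, and each false alarm costs a flat, hypothetical `review_cost`. All names and data are illustrative.

```python
def evaluate_threshold(scores, labels, amounts, threshold, review_cost=5.0):
    """Return (capture_rate, false_positive_rate, dollar_loss) at `threshold`.

    Assumed cost model (illustrative only): missed fraud loses the whole
    transaction amount; each flagged legitimate transaction costs `review_cost`.
    """
    tp = fp = fn = tn = 0
    loss = 0.0
    for score, label, amount in zip(scores, labels, amounts):
        flagged = score >= threshold
        if label == 1 and flagged:
            tp += 1
        elif label == 1:
            fn += 1
            loss += amount
        elif flagged:
            fp += 1
            loss += review_cost
        else:
            tn += 1
    capture_rate = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return capture_rate, false_positive_rate, loss


def best_threshold(scores, labels, amounts, review_cost=5.0):
    """Pick the observed score that minimizes dollar loss when used as threshold."""
    return min(
        sorted(set(scores)),
        key=lambda t: evaluate_threshold(scores, labels, amounts, t, review_cost)[2],
    )


# Hypothetical scored transactions (1 = fraud).
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
amounts = [500, 40, 1000, 60, 30]
print(evaluate_threshold(scores, labels, amounts, 0.5))
print(best_threshold(scores, labels, amounts))
```

In this toy data the second fraud case has a low score but a large amount, so a dollar-based objective prefers a lower threshold than the default 0.5 would suggest, at the price of one extra review.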
19. [Current] Industry Standard Solution
GBM algorithm (Friedman, 2001 and variants)
• Sequentially combines simple models, with each “new” model correcting the mistakes of the
previous ones
• The base model in this case is a decision tree
• Inspired by gradient descent in optimization
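The sequential "each new model corrects the previous ones" idea can be sketched in a few lines. The toy version below makes several simplifying assumptions that are not from the slides: a single numeric feature, squared-error loss (so the negative gradient is just the residual), and depth-1 trees (stumps) as base learners. Production GBMs use multi-feature trees and, for fraud classification, typically a logistic-type loss. The names `fit_stump` and `fit_gbm` are illustrative.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one split) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values are identical: no split possible, predict the mean residual.
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting with stumps under squared-error loss."""
    base = sum(ys) / len(ys)          # initial model: a constant
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction that reduces the current error.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


# Toy step-function data (illustrative only).
predict = fit_gbm([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(round(predict(1), 3), round(predict(6), 3))
```

The learning rate plays the role of the step size in gradient descent: smaller values need more rounds but tend to generalize better.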
20. GBM Pros
• Automatically handles missing values
• Highly accurate models
• Captures nonlinearity in the data
• Does not require deep understanding of the data
21. GBM Cons
• Does not handle datasets with high dimensions well
• Minimizes bias, not necessarily variance
• Chance of overfitting the training data when data is noisy
• Not the best at handling very high imbalance in the data
• Requires extensive parameter tuning
• Not simple to distribute
22. GBM: overcoming the odds
• Does not handle datasets with high dimensions well
  • SVMs handle datasets with high dimensionality
• Minimizes bias, not necessarily variance
  • Ensemble of GBMs (eGBM, Skytree, 2013) and stochastic GBM (sGBM)
    • eGBM: the idea is to use ensembles of GBMs where each GBM is built using bootstrap
      samples
    • sGBM: each base learner (decision tree) uses a different sample of the data
  • Mixed models
    • Combine Linear/Logistic models with GBM by blending/stacking
• High chance of overfitting the training data
  • Carefully check for generalization error
  • Restrict to simple base learners (shallow decision trees) etc.
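The bootstrap-ensemble idea described for eGBM above (train each model on a resample drawn with replacement, then average) can be sketched generically. This is not Skytree's eGBM implementation, whose details the slides do not give. The `fit` argument stands in for any learner, such as a GBM, and `fit_mean` is a deliberately trivial placeholder learner used only to keep the example self-contained.

```python
import random


def bootstrap_ensemble(fit, xs, ys, n_models=10, seed=0):
    """Train `n_models` learners on bootstrap resamples and average their predictions.

    fit: function (xs, ys) -> predictor, where predictor maps x to a number.
    """
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Draw n row indices with replacement (a bootstrap sample).
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)


def fit_mean(xs, ys):
    """Placeholder learner: ignores x and predicts the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


# Illustrative data; in practice `fit` would build a GBM on each resample.
predict = bootstrap_ensemble(fit_mean, [1, 2, 3, 4], [0, 0, 1, 1], n_models=25)
print(predict(2))
```

Averaging over resamples reduces variance, which is why it complements boosting's focus on reducing bias.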
23. GBM: overcoming the odds
• Not the best at handling very high imbalance in the data
  • Ensemble GBMs, stochastic GBMs, Random Forests etc.
• Requires extensive parameter tuning
  • Smart-Search (Skytree Inc., 2014)
    • Patent-pending technology
    • Optimization that iteratively learns from the previous iterations
    • Successively improves the space in which to search for the best solution
    • Faster way to obtain the optimal set of parameters
• Not simple to distribute
  • Bring in High Performance Computing (HPC) techniques for distributing the computation
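The general notion of a parameter search that learns from earlier iterations and successively narrows the search space can be illustrated with a simple random search that shrinks its range around the best point found so far. This is a generic sketch and is not Skytree's Smart-Search, which the slides describe only at a high level. The function name, the one-dimensional setting, and the toy objective are all assumptions.

```python
import random


def narrowing_search(objective, low, high, n_rounds=5, n_samples=20,
                     shrink=0.5, seed=0):
    """Minimize a 1-D objective by sampling, then shrinking the range around the best point.

    Each round draws `n_samples` uniform candidates from [low, high], keeps the
    best value seen so far, and narrows the range to `shrink` times its width,
    centered on that best point (clipped to the current range).
    """
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_rounds):
        for _ in range(n_samples):
            x = rng.uniform(low, high)
            val = objective(x)
            if val < best_val:
                best_x, best_val = x, val
        half_width = (high - low) * shrink / 2
        low, high = max(low, best_x - half_width), min(high, best_x + half_width)
    return best_x, best_val


# Toy objective standing in for a validation error as a function of one
# hyperparameter (e.g. a learning rate); its minimum is at 0.3.
best_x, best_val = narrowing_search(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
print(best_x, best_val)
```

In a real tuning run the objective would train a model and return its holdout error, so reducing the number of evaluations matters far more than in this toy case.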
25. Let's see how it works!
• Skytree Workspace
• Demo
• CLI
• Python SDK
• GUI