Bigger Data. Better Results.™
Machine Learning for Fraud Detection
Nitesh Kumar, PhD
nitesh@skytree.net
Who Am I?

• Applied Math PhD
• Derivatives/Options Pricing Background
• 7 years doing analytics
• Data Science at Skytree for 2 years
Skytree Inc.

• Came out of Alex Gray's (CTO) FastLab @ Georgia Tech
• Software company that provides Machine Learning software
• Built to function on top of Hadoop
• Automation, speed, and scalability
• Users can interact through a command line interface, APIs, and a GUI
• $20 million in Series A funding
• TAB: Michael Jordan, James Demmel, Dave Patterson, Pat Hanrahan
What is Skytree?

• Machine Learning Platform
   GBM, K-means, RF, SVD/PCA, Linear/Logistic, SVM, collaborative filtering, etc.
• Built for Big Data
   Scales linearly with data size and compute nodes (MapReduce, Hadoop)
• Usability
   SDK in Python, Java, REST, even GUI
   Data preparation through Spark
• Automation
   1-click modeling
• ML on Bigger Data produces Better Results
   Larger datasets lead to higher accuracy
Outline

• Introduction
   Why Skytree, Big Data, and Machine Learning for Fraud?
• Machine Learning in Financial Services
   Issues, methods, and solution
• Live Demo of Skytree on a real-world dataset (command line, API, GUI)
   Time and setup permitting
Introduction

• Fraud is a Big problem (Big Data, Big Cost)
• Why is Machine Learning necessary?
• Comprehensive solution?
Fraud is a Big Data Problem

• "More than 23 billion credit card transactions are processed annually in USA"
   CreditCards.com
• Credit card transactions alone generate multiple terabytes of data a year
• Each transaction has 100-300 attributes
• Data is distributed across multiple nodes
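
A rough back-of-envelope check of the "multiple terabytes" claim can be sketched as follows. The transaction count and the 100-300 attribute range come from the slide; the 8 bytes per attribute (one 64-bit value) is purely an assumption for illustration, and real storage will differ with encoding and compression.

```python
# Back-of-envelope estimate of raw yearly data volume for card transactions.
TRANSACTIONS_PER_YEAR = 23_000_000_000  # figure quoted on the slide
BYTES_PER_ATTRIBUTE = 8                 # ASSUMPTION: one 64-bit value per attribute


def yearly_volume_tb(attributes_per_txn: int,
                     transactions: int = TRANSACTIONS_PER_YEAR,
                     bytes_per_attribute: int = BYTES_PER_ATTRIBUTE) -> float:
    """Return the uncompressed yearly volume in terabytes (1 TB = 10**12 bytes)."""
    return transactions * attributes_per_txn * bytes_per_attribute / 10**12


if __name__ == "__main__":
    # The slide gives a range of 100-300 attributes per transaction.
    for n_attr in (100, 300):
        print(f"{n_attr} attributes: about {yearly_volume_tb(n_attr):.1f} TB per year")
```

Under these assumptions the raw volume lands in the tens of terabytes per year, which is already too large to handle comfortably on a single machine.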
Fraud is a Big Cost Problem
• "Businesses lose an estimated $3.5 billion annually to fraud and financial crime."
   Forbes, 2014
• "Total value of credit card transactions in the U.S. in 2012: $2.48 trillion"
   CreditCards.com
   http://www.federalreserve.gov/releases/g19/Current/
Why Machine Learning?

• Traditional pattern finding through careful, hand-crafted querying does not scale to large datasets
• Prior rule-based engines do not use information from multiple attributes at the same time
• Machine Learning is concerned with algorithms that can learn from data
   Multivariate statistics
   Automated predictive analytics
• Even a tiny increase in accuracy can lead to millions of dollars in savings
Gap between Machine Learning and Big Data

• Awakening to Big Data, experimenting with ML?
• ML is necessary to derive value out of Big Data
ML on Bigger Data produces Better Results
• Weak and Strong Laws of Large Numbers
• "We have shown that for a prototypical natural language classification task, the performance of learners can benefit significantly from much larger training sets."
   Banko and Brill, Proceedings of ACL, 2001.
• "Breiman's procedure (random forest) is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present."
   Gérard Biau, JMLR, 2012
• Sometimes Big Data is all you need!
Experiment: ML on Bigger Data produces Better Results
• Source dataset: DNA dataset from the Pascal Large Scale Learning Challenge.
• A 4M-row dataset was held out for testing. Training datasets with 20M, 40M, 80M, 160M, 320M, 640M, and 5120M elements, arranged into 200 columns, were used. No featurization was applied.
• The optimal model for each training dataset size was found by tuning a Gradient Boosting Machine on a holdout dataset with Skytree smart-search.
• AUC (area under the ROC curve) was used for evaluation.
• Experiment by Skytree Inc., 2015
Bigger Data, Better Results on Real World Data

Dataset Size       AUC
20,000,000         93.9%
40,000,000         95.0%
80,000,000         95.6%
160,000,000        96.2%
320,000,000        96.7%
640,000,000        97.2%
5,120,000,000      98.1%
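
For readers unfamiliar with the metric reported above: AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes it directly from that definition. It is a minimal illustration in plain Python, not the code used in the Skytree experiment, and the function name `auc` is chosen here for the example.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 ground-truth values (1 = positive class).
    scores: iterable of model scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of examples, so it only suits small samples; production tools compute the same quantity from sorted ranks.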
Machine Learning Solution for Financial Services

Multiple algorithms for higher accuracy
• Gradient Boosting
• Random Decision Forest
• SVM
• Stacked models (combined models)
• Mixed models (combine supervised and unsupervised models)

Automatic Parameter Selection
• Automatically create the best-performing model for any algorithm in fewer iterations
• Allow for usage by domain experts (non data scientists)
• Higher accuracy: a machine can tune better than humans

Speed and Scalability
• Big Data scale
• Catch latest trends in fraud
• Improve accuracy
• Iterate over multiple algorithms and parameters
• Faster model creation and model update

Visualization and Optimization
• Optimize directly for dollars
• Visualize model performance
• Provide knobs to choose a model
• Ensure optimality of models without overfitting
• Visualize models to interpret results
Machine Learning for Fraud Detection

• Countering Fraud is a Machine Learning Problem
• Challenges
• Solution (GBM and advanced)
Fraud Detection
• Counter complex and transient fraud patterns
• Analyze multiple and large datasets to discover and predict fraud
"More than 23 billion credit card transactions are processed annually in USA" CreditCards.com
Machine Learning Problem

Supervised Learning: Predict Fraud
• Collect historical transactions
• Learn from past examples of fraud
• Predict fraud (in real time)

Unsupervised Learning: Discover Fraud
• Segment transactions
• Investigate potentially new fraud
• Detect outliers

Mixed Approach: Discover and Predict Fraud
• Detect "Points of Compromise" to prevent fraud
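
The "detect outliers" step of the unsupervised approach can be illustrated with a very small example. The sketch below flags transaction amounts using a median-based (MAD) modified z-score. This is one simple, generic technique chosen here for illustration, not the method used in the talk; the function name and the conventional 3.5 cutoff are assumptions of this example.

```python
import statistics


def mad_outliers(amounts, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a handful of extreme values does not distort the baseline.
    """
    values = list(amounts)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, nothing can be ranked

    # 0.6745 rescales the MAD so scores are comparable to standard z-scores.
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]


if __name__ == "__main__":
    amounts = [20, 22, 19, 21, 23, 20, 5000]
    print(mad_outliers(amounts))
```

Flagged transactions are candidates for investigation rather than confirmed fraud; labels gathered from those investigations can later feed the supervised model.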
Common Issues

• Imbalanced Datasets
   Too few examples of 'known' fraud
• What to optimize?
   Fraud capture rate
   False positive rate: what is the cost associated?
   Total loss incurred due to fraud
   What loss function to use
• How to handle missing values?
• Which algorithm to use?
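
The "what to optimize?" question can be made concrete by scoring a decision threshold in dollars instead of accuracy. The sketch below assumes a deliberately simple cost model: every flagged transaction costs a fixed review fee, and every missed fraud costs its full amount. Both the cost model and the names `total_cost`, `best_threshold`, and `review_cost` are hypothetical, chosen only for illustration.

```python
def total_cost(labels, scores, amounts, threshold, review_cost=5.0):
    """Dollar cost of flagging every transaction with score >= threshold.

    ASSUMED cost model: a flagged transaction costs `review_cost`
    (fraud or not); an unflagged fraud (label 1) costs its full amount.
    """
    cost = 0.0
    for y, s, amount in zip(labels, scores, amounts):
        if s >= threshold:
            cost += review_cost
        elif y == 1:
            cost += amount
    return cost


def best_threshold(labels, scores, amounts, review_cost=5.0):
    """Try each observed score (plus 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(labels, scores, amounts, t, review_cost))


if __name__ == "__main__":
    labels = [0, 0, 1, 0, 1]
    scores = [0.1, 0.2, 0.9, 0.3, 0.6]
    amounts = [10, 20, 500, 30, 200]
    t = best_threshold(labels, scores, amounts)
    print(t, total_cost(labels, scores, amounts, t))
```

With highly imbalanced data, a threshold picked this way can differ a lot from the one that maximizes accuracy, because a single missed high-value fraud outweighs many cheap reviews.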
[Current] Industry Standard Solution

GBM algorithm (Friedman, 2001 and variants)

• Sequentially combines simple models, with each "new" model correcting the mistakes of the previous ones
• Base model in this case is decision trees
• Inspired by gradient descent in optimization
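
The sequential "each new model corrects the previous ones" idea can be sketched in a few lines. The example below is a toy gradient boosting loop for squared-error loss on a single numeric feature, with one-split trees (stumps) as base learners. It is a teaching sketch under those simplifying assumptions, not Skytree's implementation; a fraud model would normally use a classification loss and deeper trees over many features.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one split, one constant per side."""
    best = None
    for split in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    split, left_value, right_value = stump
    return left_value if x <= split else right_value


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals, which are the
    negative gradient of the squared-error loss, then takes a small step."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


if __name__ == "__main__":
    xs, ys = [1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1]
    model = fit_gbm(xs, ys)
    print([round(predict_gbm(model, x), 3) for x in xs])
```

The learning rate plays the role of the step size in gradient descent: smaller steps need more rounds but usually generalize better.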
GBM Pros

• Automatically handles missing values
• Highly accurate models
• Captures nonlinearity in the data
• Does not require deep understanding of the data
      
GBM Cons
• Does not handle datasets with high dimensions well
• Minimizes bias, not necessarily variance
• Chance of overfitting the training data when data is noisy
• Not the best at handling very high imbalance in the data
• Requires extensive parameter tuning
• Not simple to distribute
GBM: overcoming the odds
• Does not handle datasets with high dimensions well
   • SVMs handle datasets with high dimensionality
• Minimizes bias, not necessarily variance
   • Ensemble of GBM (eGBM, Skytree, 2013) and stochastic GBM (sGBM)
      • eGBM: Idea is to use ensembles of GBMs where each GBM is built using bootstrap samples
      • sGBM: Each base learner (decision tree) uses different samples
   • Mixed Models
      • Combine Linear/Logistic models with GBM by blending/stacking
• High chance of overfitting the training data
   • Carefully check for generalization error
   • Restrict to simple base learners (shallow decision trees), etc.
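
The two sampling ideas above can be sketched generically. In the example below, `bootstrap_sample` draws rows with replacement (the per-model sampling described for eGBM) and `subsample` draws a fraction without replacement (the per-tree sampling of stochastic gradient boosting). The ensemble wrapper accepts any `fit` function; the trivial `fit_mean` learner is a stand-in so the example runs on its own. All names are hypothetical and nothing here reflects Skytree's actual eGBM code.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def subsample(rows, fraction, rng):
    """Draw a fraction of the rows without replacement."""
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


def fit_ensemble(rows, fit, n_models=10, seed=0):
    """Train `n_models` models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(rows, rng)) for _ in range(n_models)]


def predict_ensemble(models, x):
    """Average the member predictions, which reduces variance."""
    return sum(model(x) for model in models) / len(models)


def fit_mean(rows):
    """Stand-in base learner: always predicts the mean target it saw."""
    mean = sum(y for _, y in rows) / len(rows)
    return lambda x: mean


if __name__ == "__main__":
    rows = [(i, float(i)) for i in range(10)]  # (feature, target) pairs
    models = fit_ensemble(rows, fit_mean, n_models=5, seed=1)
    print(predict_ensemble(models, x=3))
```

Swapping `fit_mean` for a real GBM trainer gives the ensemble-of-GBMs pattern; calling `subsample` inside the boosting loop instead gives the stochastic variant.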
GBM: overcoming the odds

• Not the best at handling very high imbalance in the data
   • Ensemble GBMs, stochastic GBMs, Random Forests, etc.
• Requires extensive parameter tuning
   • Smart-Search (Skytree Inc., 2014)
      • Patent-pending technology
      • Optimization that iteratively learns from the previous iterations
      • Successively improves the space in which to search for the best solution
      • Faster way to obtain the optimal set of parameters
• Not simple to distribute
   • Bring in High Performance Computing (HPC) for distribution
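
The slides do not describe how Smart-Search works internally, so the sketch below is only a generic illustration of the stated idea of "successively improving the space in which to search": a random search over one parameter that shrinks its interval around the best point found so far after each round. The function name, the shrink factor, and the one-dimensional setup are all assumptions of this example.

```python
import random


def narrowing_search(objective, low, high, rounds=5,
                     samples_per_round=20, shrink=0.5, seed=0):
    """Minimize `objective` over [low, high] by random search that narrows.

    After each round the interval is shrunk by `shrink` and re-centered
    on the best point seen so far, clipped to the previous interval.
    """
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(rounds):
        for _ in range(samples_per_round):
            x = rng.uniform(low, high)
            val = objective(x)
            if val < best_val:
                best_x, best_val = x, val
        half_width = (high - low) * shrink / 2
        low = max(low, best_x - half_width)
        high = min(high, best_x + half_width)
    return best_x, best_val


if __name__ == "__main__":
    # Hypothetical validation-error curve with its minimum at 0.3.
    def error(x):
        return (x - 0.3) ** 2

    print(narrowing_search(error, 0.0, 1.0))
```

In a real tuning job `objective` would train a model with the candidate parameter and return its holdout error, so every evaluation is expensive and spending them in a shrinking region saves iterations.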



        
Let's see how it works!

• Skytree Workspace
• Demo
   • CLI
   • Python SDK
   • GUI
Unified Data Scientist Workspace

Contenu connexe

Tendances

Credit card fraud detection through machine learning
Credit card fraud detection through machine learningCredit card fraud detection through machine learning
Credit card fraud detection through machine learningdataalcott
 
Credit card fraud detection using python machine learning
Credit card fraud detection using python machine learningCredit card fraud detection using python machine learning
Credit card fraud detection using python machine learningSandeep Garg
 
Credit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning AlgorithmsCredit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning Algorithmsankit panigrahy
 
Credit Card Fraud Detection
Credit Card Fraud DetectionCredit Card Fraud Detection
Credit Card Fraud DetectionBinayakreddy
 
Credit card payment_fraud_detection
Credit card payment_fraud_detectionCredit card payment_fraud_detection
Credit card payment_fraud_detectionPEIPEI HAN
 
A Study on Credit Card Fraud Detection using Machine Learning
A Study on Credit Card Fraud Detection using Machine LearningA Study on Credit Card Fraud Detection using Machine Learning
A Study on Credit Card Fraud Detection using Machine Learningijtsrd
 
Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)k.surya kumar
 
credit card fraud detection
credit card fraud detectioncredit card fraud detection
credit card fraud detectionjagan477830
 
Machine Learning in Banking Sector
Machine Learning in Banking SectorMachine Learning in Banking Sector
Machine Learning in Banking SectorKnoldus Inc.
 
CREDIT CARD FRAUD DETECTION
CREDIT CARD FRAUD DETECTION CREDIT CARD FRAUD DETECTION
CREDIT CARD FRAUD DETECTION K Srinivas Rao
 
Loan approval prediction based on machine learning approach
Loan approval prediction based on machine learning approachLoan approval prediction based on machine learning approach
Loan approval prediction based on machine learning approachEslam Nader
 
IRJET- Credit Card Fraud Detection using Random Forest
IRJET-  	  Credit Card Fraud Detection using Random ForestIRJET-  	  Credit Card Fraud Detection using Random Forest
IRJET- Credit Card Fraud Detection using Random ForestIRJET Journal
 
Machine Learning in Cyber Security
Machine Learning in Cyber SecurityMachine Learning in Cyber Security
Machine Learning in Cyber SecurityRishi Kant
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detectionkalpesh1908
 
Real-Time Fraud Detection in Payment Transactions
Real-Time Fraud Detection in Payment TransactionsReal-Time Fraud Detection in Payment Transactions
Real-Time Fraud Detection in Payment TransactionsChristian Gügi
 
Fraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive AnalyticsFraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive AnalyticsAlejandro Correa Bahnsen, PhD
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Simplilearn
 
Credit Card Fraud Detection Tutorial
Credit Card Fraud Detection TutorialCredit Card Fraud Detection Tutorial
Credit Card Fraud Detection TutorialKNIMESlides
 

Tendances (20)

Credit card fraud dection
Credit card fraud dectionCredit card fraud dection
Credit card fraud dection
 
Credit card fraud detection through machine learning
Credit card fraud detection through machine learningCredit card fraud detection through machine learning
Credit card fraud detection through machine learning
 
Credit card fraud detection using python machine learning
Credit card fraud detection using python machine learningCredit card fraud detection using python machine learning
Credit card fraud detection using python machine learning
 
Credit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning AlgorithmsCredit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning Algorithms
 
Fraud detection
Fraud detectionFraud detection
Fraud detection
 
Credit Card Fraud Detection
Credit Card Fraud DetectionCredit Card Fraud Detection
Credit Card Fraud Detection
 
Credit card payment_fraud_detection
Credit card payment_fraud_detectionCredit card payment_fraud_detection
Credit card payment_fraud_detection
 
A Study on Credit Card Fraud Detection using Machine Learning
A Study on Credit Card Fraud Detection using Machine LearningA Study on Credit Card Fraud Detection using Machine Learning
A Study on Credit Card Fraud Detection using Machine Learning
 
Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)
 
credit card fraud detection
credit card fraud detectioncredit card fraud detection
credit card fraud detection
 
Machine Learning in Banking Sector
Machine Learning in Banking SectorMachine Learning in Banking Sector
Machine Learning in Banking Sector
 
CREDIT CARD FRAUD DETECTION
CREDIT CARD FRAUD DETECTION CREDIT CARD FRAUD DETECTION
CREDIT CARD FRAUD DETECTION
 
Loan approval prediction based on machine learning approach
Loan approval prediction based on machine learning approachLoan approval prediction based on machine learning approach
Loan approval prediction based on machine learning approach
 
IRJET- Credit Card Fraud Detection using Random Forest
IRJET-  	  Credit Card Fraud Detection using Random ForestIRJET-  	  Credit Card Fraud Detection using Random Forest
IRJET- Credit Card Fraud Detection using Random Forest
 
Machine Learning in Cyber Security
Machine Learning in Cyber SecurityMachine Learning in Cyber Security
Machine Learning in Cyber Security
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
 
Real-Time Fraud Detection in Payment Transactions
Real-Time Fraud Detection in Payment TransactionsReal-Time Fraud Detection in Payment Transactions
Real-Time Fraud Detection in Payment Transactions
 
Fraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive AnalyticsFraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive Analytics
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
 
Credit Card Fraud Detection Tutorial
Credit Card Fraud Detection TutorialCredit Card Fraud Detection Tutorial
Credit Card Fraud Detection Tutorial
 

En vedette

Detecting fraud with Python and machine learning
Detecting fraud with Python and machine learningDetecting fraud with Python and machine learning
Detecting fraud with Python and machine learningwgyn
 
PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014Sri Ambati
 
ACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and MitigationACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and MitigationScott Mongeau
 
Anomaly detection in deep learning
Anomaly detection in deep learningAnomaly detection in deep learning
Anomaly detection in deep learningAdam Gibson
 
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for TelecomFraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for TelecomSudarson Roy Pratihar
 
Using Deep Learning to do Real-Time Scoring in Practical Applications
Using Deep Learning to do Real-Time Scoring in Practical ApplicationsUsing Deep Learning to do Real-Time Scoring in Practical Applications
Using Deep Learning to do Real-Time Scoring in Practical ApplicationsGreg Makowski
 
Presentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & controlPresentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & controlDominic Sroda Korkoryi
 
Fraud Prevention in Corporate Environment
Fraud Prevention in Corporate EnvironmentFraud Prevention in Corporate Environment
Fraud Prevention in Corporate EnvironmentSECURITYLLC
 
Big data usage in fraud
Big data usage in fraudBig data usage in fraud
Big data usage in fraudIntellipaat
 
Fraud-Fighting Trends 2017
Fraud-Fighting Trends 2017Fraud-Fighting Trends 2017
Fraud-Fighting Trends 2017Sarah Beldo
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnKan Ouivirach, Ph.D.
 
Current Threat Landscape, Global Trends and Best Practices within Financial F...
Current Threat Landscape, Global Trends and Best Practices within Financial F...Current Threat Landscape, Global Trends and Best Practices within Financial F...
Current Threat Landscape, Global Trends and Best Practices within Financial F...IBM Sverige
 
Lessons-learnt in EA articulation (worksheet)
Lessons-learnt in EA articulation (worksheet)Lessons-learnt in EA articulation (worksheet)
Lessons-learnt in EA articulation (worksheet)Tetradian Consulting
 
Ten Commandments for Tackling Fraud: The Role of Big Data and Predictive Anal...
Ten Commandments for Tackling Fraud: The Role of Big Data and Predictive Anal...Ten Commandments for Tackling Fraud: The Role of Big Data and Predictive Anal...
Ten Commandments for Tackling Fraud: The Role of Big Data and Predictive Anal...CA Technologies
 
Masters thesis - Fraud & Big Data
Masters thesis - Fraud & Big DataMasters thesis - Fraud & Big Data
Masters thesis - Fraud & Big DataStephanie Canovas
 
Graph Database in Graph Intelligence
Graph Database in Graph IntelligenceGraph Database in Graph Intelligence
Graph Database in Graph IntelligenceChen Zhang
 
Operations Management Suite, the Penguins and the others
Operations Management Suite, the Penguins and the othersOperations Management Suite, the Penguins and the others
Operations Management Suite, the Penguins and the othersChristian Heitkamp
 

En vedette (20)

Detecting fraud with Python and machine learning
Detecting fraud with Python and machine learningDetecting fraud with Python and machine learning
Detecting fraud with Python and machine learning
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Deep Learning for Fraud Detection
Deep Learning for Fraud DetectionDeep Learning for Fraud Detection
Deep Learning for Fraud Detection
 
PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014
 
ACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and MitigationACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and Mitigation
 
Anomaly detection in deep learning
Anomaly detection in deep learningAnomaly detection in deep learning
Anomaly detection in deep learning
 
Big Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud DetectionBig Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud Detection
 
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for TelecomFraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
 
Using Deep Learning to do Real-Time Scoring in Practical Applications
Using Deep Learning to do Real-Time Scoring in Practical ApplicationsUsing Deep Learning to do Real-Time Scoring in Practical Applications
Using Deep Learning to do Real-Time Scoring in Practical Applications
 
Presentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & controlPresentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & control
 
Fraud Prevention in Corporate Environment
Fraud Prevention in Corporate EnvironmentFraud Prevention in Corporate Environment
Fraud Prevention in Corporate Environment
 
Big data usage in fraud
Big data usage in fraudBig data usage in fraud
Big data usage in fraud
 
Fraud-Fighting Trends 2017
Fraud-Fighting Trends 2017Fraud-Fighting Trends 2017
Fraud-Fighting Trends 2017
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-Learn
 
Current Threat Landscape, Global Trends and Best Practices within Financial F...
Current Threat Landscape, Global Trends and Best Practices within Financial F...Current Threat Landscape, Global Trends and Best Practices within Financial F...
Current Threat Landscape, Global Trends and Best Practices within Financial F...
 
Lessons-learnt in EA articulation (worksheet)
Lessons-learnt in EA articulation (worksheet)Lessons-learnt in EA articulation (worksheet)
Lessons-learnt in EA articulation (worksheet)
 
Ten Commandments for Tackling Fraud: The Role of Big Data and Predictive Anal...
Ten Commandments for Tackling Fraud: The Role of Big Data and Predictive Anal...Ten Commandments for Tackling Fraud: The Role of Big Data and Predictive Anal...
Ten Commandments for Tackling Fraud: The Role of Big Data and Predictive Anal...
 
Masters thesis - Fraud & Big Data
Masters thesis - Fraud & Big DataMasters thesis - Fraud & Big Data
Masters thesis - Fraud & Big Data
 
Graph Database in Graph Intelligence
Graph Database in Graph IntelligenceGraph Database in Graph Intelligence
Graph Database in Graph Intelligence
 
Operations Management Suite, the Penguins and the others
Operations Management Suite, the Penguins and the othersOperations Management Suite, the Penguins and the others
Operations Management Suite, the Penguins and the others
 

Similaire à Machine Learning for Fraud Detection

Machine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An IntroMachine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An IntroSi Krishan
 
Graphs and Financial Services Analytics
Graphs and Financial Services AnalyticsGraphs and Financial Services Analytics
Graphs and Financial Services AnalyticsNeo4j
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learningTamir Taha
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Roger Barga
 
Share Credit_Card_Fraud_Detection_ML_MP (1).pptx
Share Credit_Card_Fraud_Detection_ML_MP (1).pptxShare Credit_Card_Fraud_Detection_ML_MP (1).pptx
Share Credit_Card_Fraud_Detection_ML_MP (1).pptxyatintaneja6
 
MIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaMIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaRahul Bhatia
 
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...ijdpsjournal
 
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...ijdpsjournal
 
Evaluation of a New Incremental Classification Tree Algorithm for Mining High...
Evaluation of a New Incremental Classification Tree Algorithm for Mining High...Evaluation of a New Incremental Classification Tree Algorithm for Mining High...
Evaluation of a New Incremental Classification Tree Algorithm for Mining High...mlaij
 
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...mlaij
 
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...Alok Singh
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruptionjagan477830
 
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsTraditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsGanesan Narayanasamy
 
Deep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDeep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDatabricks
 
Kaggle Gold Medal Case Study
Kaggle Gold Medal Case StudyKaggle Gold Medal Case Study
Kaggle Gold Medal Case StudyAlon Bochman, CFA
 

Similaire à Machine Learning for Fraud Detection (20)

Machine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An IntroMachine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An Intro
 
credit card.pptx
credit card.pptxcredit card.pptx
credit card.pptx
 
Graphs and Financial Services Analytics
Graphs and Financial Services AnalyticsGraphs and Financial Services Analytics
Graphs and Financial Services Analytics
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015
 
Share Credit_Card_Fraud_Detection_ML_MP (1).pptx
Share Credit_Card_Fraud_Detection_ML_MP (1).pptxShare Credit_Card_Fraud_Detection_ML_MP (1).pptx
Share Credit_Card_Fraud_Detection_ML_MP (1).pptx
 
ICMCSI 2023 PPT 1074.pptx



Machine Learning for Fraud Detection

  • 1. Bigger Data. Better Results.™ Machine Learning for Fraud Detection Nitesh Kumar, PhD nitesh@skytree.net
  • 2. Who Am I? • Applied Math PhD • Derivative/Options Pricing background • 7 years doing analytics • Data Science at Skytree for 2 years
  • 3. Skytree Inc. • Came out of Alex Gray’s (CTO) FastLab @ Georgia Tech • Software company that provides machine learning software • Built to function on top of Hadoop • Automation, speed, and scalability • User can interact through command line interface, APIs, and GUI • 20 million dollars in Series A • TAB: Michael Jordan, James Demmel, Dave Patterson, Pat Hanrahan
  • 4. What is Skytree? • Machine Learning Platform: GBM, K-means, RF, SVD/PCA, Linear/Logistic, SVM, collaborative filtering etc. • Built for Big Data: scales linearly with data size and compute nodes (map-reduce, Hadoop) • Usability: SDK in Python, Java, REST, even GUI; data preparation through Spark • Automation: 1-click modeling • ML on Bigger Data produces Better Results: larger datasets lead to higher accuracy
  • 5. Outline • Introduction: Why Skytree, Big Data, and Machine Learning for Fraud? • Machine Learning in Financial Services: issues, methods, and solution • Live Demo of Skytree on real-world dataset (command line, API, GUI), time and setup permitting
  • 6. Introduction • Fraud is a Big problem (Big Data, Big Cost) • Why is Machine Learning necessary? • Comprehensive solution?
  • 7. Fraud is a Big Data Problem • “More than 23 billion credit card transactions are processed annually in USA” (CreditCards.com) • Credit card transactions alone generate multiple terabytes of data a year • Each transaction has 100-300 attributes • Distributed data across multiple nodes
  • 8. Fraud is a Big Cost Problem • “Businesses lose an estimated $3.5 billion annually to fraud and financial crime.” (Forbes, 2014) • “Total value of credit card transactions in the U.S. in 2012: $2.48 trillion” (CreditCards.com) http://www.federalreserve.gov/releases/g19/Current/
  • 9. Why Machine Learning? • The traditional approach of finding patterns through hand-crafted, careful querying does not scale to large datasets • Earlier rule-based engines do not make use of information from multiple attributes at the same time • Machine learning is concerned with algorithms that can learn from data: multivariate statistics, automated predictive analytics • Even a tiny increase in accuracy can lead to millions of dollars in savings
  • 10. Gap between Machine Learning and Big Data • Awakening to Big Data, experimenting with ML? • ML is necessary to derive value out of Big Data
  • 11. ML on Bigger Data produces Better Results • Weak and Strong Law of Large Numbers • “We have shown that for a prototypical natural language classification task, the performance of learners can benefit significantly from much larger training sets.” Banko and Brill, Proceedings of ACL, 2001. • “Breiman’s procedure (random forest) is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.” Gerard Biau, JMLR, 2012 • Sometimes Big Data is all you need!
  • 12. Experiment: ML on Bigger Data produces Better Results • Source dataset: DNA dataset from Pascal Large Scale Learning Challenge. • A 4M-row dataset was held out for testing. Training datasets with 20M, 40M, 80M, 160M, 320M, 640M, 5120M elements, arranged into 200 columns, were used. No featurization was applied. • Optimal model for each training dataset size was found by tuning Gradient Boosting Machine on a holdout dataset with Skytree smart-search. • AUC (Area under ROC curve) was used for evaluation. • Experiment by Skytree Inc., 2015
  • 13. Bigger Data, Better Results on Real World Data
      Dataset size      AUC
      20,000,000        93.9%
      40,000,000        95.0%
      80,000,000        95.6%
      160,000,000       96.2%
      320,000,000       96.7%
      640,000,000       97.2%
      5,120,000,000     98.1%
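The AUC metric used in the experiment above can be computed without any library. The following is a minimal sketch in plain Python, not the evaluation code used in the experiment. It relies on the standard rank-sum (Mann-Whitney) identity, and the function name `roc_auc` and the toy inputs are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 values (1 = fraud, by assumption).
    scores: model scores, where a higher score means "more suspicious".
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    # Fraction of (positive, negative) pairs that the scores order correctly.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy example: four transactions, two of them fraudulent.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 1.0 means every fraudulent example is scored above every legitimate one, while 0.5 is what a random ranking would achieve.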
  • 14. Machine Learning Solution for Financial Services
      • Multiple algorithms for higher accuracy
          • Gradient Boosting
          • Random Decision Forest
          • SVM
          • Stacked models (combined models)
          • Mixed models (combine supervised and unsupervised models)
      • Automatic Parameter Selection
          • Automatically create the best performing model for any algorithm in fewer iterations
          • Allow for usage by domain experts (non data scientists)
          • Higher accuracy: a machine can tune better than humans
      • Speed and Scalability
          • Big Data scale
          • Catch latest trends in fraud
          • Improve accuracy
          • Iterate over multiple algorithms and parameters
          • Faster model creation and model update
      • Visualization and Optimization
          • Optimize directly for dollars
          • Visualize model performance
          • Provide knobs to choose a model
          • Ensure optimality of models without overfitting
          • Visualize models to interpret results
  • 15. Machine Learning for Fraud Detection • Countering Fraud is a Machine Learning Problem • Challenges • Solution (GBM and advanced)
  • 16. Fraud Detection • Counter complex and transient fraud patterns • Analyze multiple and large datasets to discover and predict fraud • “More than 23 billion credit card transactions are processed annually in USA” (CreditCards.com)
  • 17. Machine Learning Problem
      • Supervised Learning: Predict Fraud
          • Collect historical transactions
          • Learn from past examples of fraud
          • Predict fraud (in real-time)
      • Unsupervised Learning: Discover Fraud
          • Segment transactions
          • Investigate potentially new fraud
          • Detect outliers
      • Mixed Approach: Discover and predict Fraud
          • Detect “Points of Compromise” to prevent fraud
  • 18. Common Issues • Imbalanced datasets: too few examples of ‘known’ fraud • What to optimize? Fraud capture rate; false positive rate (what is the cost associated?); total loss incurred due to fraud; what loss function to use • How to handle missing values? • Which algorithm to use?
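The "what to optimize?" question above (capture rate, false positive rate, total loss) can be made concrete with a small cost calculation. This is a sketch under stated assumptions, not a method from the talk: missed fraud is assumed to cost the full transaction amount, each false alarm is assumed to cost a flat `review_cost`, and all function names and numbers are hypothetical.

```python
def rates(labels, scores, threshold):
    """Return (fraud capture rate, false positive rate) at a score threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


def total_cost(labels, scores, amounts, threshold, review_cost):
    """Dollars lost when every transaction scoring >= threshold is flagged.

    Assumption: a missed fraud costs its transaction amount, and each
    false alarm costs a flat review_cost (hypothetical handling cost).
    """
    cost = 0.0
    for y, s, amount in zip(labels, scores, amounts):
        flagged = s >= threshold
        if y == 1 and not flagged:
            cost += amount
        elif y == 0 and flagged:
            cost += review_cost
    return cost


def best_threshold(labels, scores, amounts, review_cost):
    """Pick the candidate threshold with the lowest total dollar cost."""
    # Every observed score is a candidate; infinity means "flag nothing".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(labels, scores, amounts, t, review_cost))


# Toy data: five transactions, two fraudulent (labels 1), made-up amounts.
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.9]
amounts = [20.0, 50.0, 500.0, 30.0, 1000.0]
threshold = best_threshold(labels, scores, amounts, review_cost=25.0)
```

Choosing the threshold by dollar cost rather than by accuracy is one way to act on a heavily imbalanced dataset, where a model that flags nothing can still look highly accurate.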
  • 19. [Current] Industry Standard Solution: GBM algorithm (Friedman, 2001 and variants) • Sequentially combines simple models, with each “new” model correcting the mistakes of the previous ones • The base model in this case is a decision tree • Inspired by gradient descent in optimization
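The sequential idea described above can be sketched in a few lines of plain Python. This is a simplified illustration, not Skytree's implementation: it assumes squared-error loss (so the negative gradient is just the residual) and one-split decision stumps as base models, whereas a production fraud model would typically use a classification loss and deeper trees. All names are illustrative.

```python
def fit_stump(X, residuals):
    """Fit a one-split decision stump minimizing squared error on residuals."""
    best = None  # (sse, feature, threshold, left_value, right_value)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            lv = sum(left) / len(left)
            rv = sum(right) / len(right)
            sse = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, f, thr, lv, rv)
    if best is None:  # no feature has two distinct values: predict a constant
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    f, thr, lv, rv = stump
    return lv if row[f] <= thr else rv


def fit_gbm(X, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of 0.5 * (y - prediction)^2.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def gbm_predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Toy data: one feature, target switches from 0 to 1 between x=1 and x=2.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.0, 1.0, 1.0]
model = fit_gbm(X, y, n_rounds=100, learning_rate=0.1)
```

The learning rate shrinks each stump's contribution, which is why many rounds are needed. Each round removes only a fraction of the remaining error, in the same spirit as a small step in gradient descent.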
  • 20. GBM Pros • Automatically handles missing values • Highly accurate models • Captures nonlinearity in the data • Does not require deep understanding of the data
  • 21. GBM Cons • Does not handle datasets with high dimensions well • Minimizes bias, not necessarily variance • Chance of overfitting the training data when data is noisy • Not the best at handling very high imbalance in the data • Requires extensive parameter tuning • Not simple to distribute
  • 22. GBM: overcoming the odds
      • Does not handle datasets with high dimensions well
          • SVMs handle datasets with high dimensionality
      • Minimizes bias, not necessarily variance
          • Ensemble of GBM (eGBM, Skytree, 2013) and stochastic GBM (sGBM)
              • eGBM: the idea is to use ensembles of GBMs where each GBM is built using bootstrap samples
              • sGBM: each base learner (decision tree) uses different samples
          • Mixed models
              • Combine Linear/Logistic models with GBM by blending/stacking
      • High chance of overfitting the training data
          • Carefully check for generalization error
          • Restrict to simple base learners (shallow decision trees) etc.
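The bootstrap-ensemble idea behind eGBM can be illustrated generically. The sketch below is not Skytree's eGBM: it only shows the resample-train-average pattern, and `fit` is a hypothetical callable that takes training data and returns a predictor function. A trivial mean predictor stands in for a real GBM trainer so that the example stays self-contained.

```python
import random


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement (a bootstrap sample)."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]


def fit_bagged(fit, X, y, n_models=10, seed=0):
    """Train n_models predictors, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit(*bootstrap_sample(X, y, rng)) for _ in range(n_models)]


def predict_bagged(models, row):
    """Average the member predictions; averaging is what reduces variance."""
    return sum(m(row) for m in models) / len(models)


def fit_mean(X, y):
    """Stand-in base learner (assumption): always predicts the sample mean.

    In an eGBM-style ensemble this would be a function that trains a GBM.
    """
    mean = sum(y) / len(y)
    return lambda row: mean


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.0, 1.0, 1.0]
models = fit_bagged(fit_mean, X, y, n_models=5, seed=1)
prediction = predict_bagged(models, [0.0])
```

Stochastic GBM differs in where the randomness enters: rather than resampling once per ensemble member, each base learner inside a single boosting run is fit on a different random sample of the rows.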
  • 23. GBM: overcoming the odds
      • Not the best at handling very high imbalance in the data
          • Ensemble GBMs, stochastic GBMs, Random Forests etc.
      • Requires extensive parameter tuning
          • Smart-Search (Skytree Inc., 2014)
              • Patent-pending technology
              • Optimization that iteratively learns from the previous iterations
              • Successively improves the space in which to search for the best solution
              • Faster way to obtain the optimal set of parameters
      • Not simple to distribute
          • Apply High Performance Computing (HPC) distribution techniques
  • 24. Machine Learning Solution for Financial Services
      • Multiple algorithms for higher accuracy
          • Gradient Boosting
          • Random Decision Forest
          • SVM
          • Stacked models (combined models)
          • Mixed models (combine supervised and unsupervised models)
      • Automatic Parameter Selection
          • Automatically create the best performing model for any algorithm in fewer iterations
          • Allow for usage by domain experts (non data scientists)
          • Higher accuracy: a machine can tune better than humans
      • Speed and Scalability
          • Big Data scale
          • Catch latest trends in fraud
          • Improve accuracy
          • Iterate over multiple algorithms and parameters
          • Faster model creation and model update
      • Visualization and Optimization
          • Optimize directly for dollars
          • Visualize model performance
          • Provide knobs to choose a model
          • Ensure optimality of models without overfitting
          • Visualize models to interpret results
  • 25. Let’s see how it works! • Skytree Workspace • Demo • CLI • Python SDK • GUI