SlideShare une entreprise Scribd logo
1  sur  27
Startup Machine Learning:
Bootstrapping a fraud
detection system
Michael Manapat
Stripe
@mlmanapat
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/stripe-machine-learning
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
• About me: Engineering Manager of the

Machine Learning Products Team at Stripe
• About Stripe: Payments infrastructure for the
internet
Fraud
• Card numbers are stolen by
hacking, malware, etc.
• “Dumps” are sold in “carding”
forums
• Fraudsters use numbers in
dumps to buy goods, which
they then resell
• Cardholders dispute
transactions
• Merchant ends up bearing
cost of fraud
Machine Learning
• We want to detect fraud in real-time
• Imagine we had a black box “classifier” which we fed all
the properties we have for a transaction (e.g., amount)
• The black box responds with the probability that the
transaction is fraudulent
• We use the black box elsewhere in our system: e.g.,
Stripe’s API will query it for every transaction and
immediately declines a charge if the probability of fraud
is high enough
Input data
Choosing the “features” (feature engineering) is a
hard problem that we won’t cover here
First attempt
Two issues:
• Probability(fraud) needs to be between 0 and 1
• card_country is not numerical (it’s “categorical”)
Logistic regression
• Instead of modeling p = Probability(fraud) as a
linear function, we model the log-odds of fraud
• p is a sigmoidal function of the right side
Categorical variables
• If we have a variable that takes one of N discrete
values, we “encode” that by adding N - 1 “dummy”
variables
• Ex: Let’s say card_country can be “AU,” “GB,” or
“US.” We add booleans for “card = AU” and “card
= GB”
• We don’t want a linear relationship among variables
Our final model is
Fitting a regression
• Guess values for a, b, c, d, and Z
• Compute the “likelihood” of the training observations
given these values for the parameters
• Find a, b, c, d, and Z that maximize likelihood
(optimization problem—gradient descent)
pandas brings
R-like data
frames to
Python
• We want models to generalize well, i.e., to give
accurate predictions on new data
• We don’t want to “overfit” to randomness in the
data we use to train the model, so we evaluate our
performance on data not used to generate the
model
FPR = fraction of
non-fraud predicted
to be fraud
TPR = fraction of
fraud predicted
to be fraud
threshold = 0.52
Evaluating the model - ROC, AUC
Nonlinear models
• (Logistic) regressions are linear models: if you
double one input value, the log-odds also double
• What if the impact of amount depends on another
variable? For example, maybe larger amounts are
more predictive of fraud for GB cards.*
• What if the effect of amount is nonlinear? For
example, maybe small and large charges are more
likely to be fraudulent than charges with moderate
amounts.
Decision Trees
p = 0.34 p = 0.63 p = 0.63 p = 0.85
Fitting a decision tree
• Start with a node (first node is all the data)
• Pick the split that maximizes the decrease in Gini
(weighted by size of child nodes)
• Example gain:

(0.4998) - (

(41064/59893) * 0.4765 +

(18829/59893) * 0.4132)

= 0.043
• Continue recursively until

stopping criterion reached
Random forests
• Decision trees are “easy” to overfit
• We train N trees, each on a (bootstrapped) sample of
the training data
• At each split, we only consider a subset of the available
features—say, sqrt(total # of features) of them
• This reduces correlation among the trees
• The score is the average of the score produced by
each tree
Choosing methods
• Use regression if: the
relationship between the target
and the inputs is linear, or you
want to be able to isolate the
impact of each variable on the
target
• Use a tree/forest if: there are
complex dependencies
between inputs or the impact
on the target of an input is
nonlinear
James, Witten, Hastie, Tibshirani
Introduction to Statistical Learning
Where do you stick the
model?
• Make model scoring a service: work common to all
model evaluations happens in one place (e.g., logging
of scores and feature values for later analysis)
• Easier option: save Python model objects and have
scoring be a Python service (e.g., with Tornado)
• Advantages: easy to set-up
• Disadvantages: all the problems with pickling,
another production runtime (if you’re not already
using Python), GIL (no concurrent model evaluation)
Other option: create (custom) serialization format,
save models in Python, and load in a service in a
different language (e.g., Scala/Go)
• Advantages: Runtime consistency, fun evaluation
optimizations (e.g, concurrently scoring all the trees
in a forest), type checking
• Disadvantages: Have to write serializer/deserializer
(PMML is a “standard” but no scikit support)
Better if your RPC protocol supports type-checking
(e.g. protobuf or thrift)!
• Feature engineering: figuring out what inputs are
valuable to the model (e.g., the “card_use_24h”
input)
• Getting data into the right format in production: say
you generate training data on Hadoop—what do
you do in production?
• Evaluating the production model performance and
training new models? (Counterfactual evaluation)
Harder problems
Thanks
@mlmanapat
Slides, Jupyter notebook, data, and related talks at
http://mlmanapat.com
Shameless plug:

Stripe is hiring engineers and data scientists
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/stripe-
machine-learning

Contenu connexe

Plus de C4Media

Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 

Plus de C4Media (20)

Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 

Dernier

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Dernier (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Startup ML:Bbootstrapping a Fraud Detection System

  • 1. Startup Machine Learning: Bootstrapping a fraud detection system Michael Manapat Stripe @mlmanapat
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /stripe-machine-learning
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  • 4. • About me: Engineering Manager of the
 Machine Learning Products Team at Stripe • About Stripe: Payments infrastructure for the internet
  • 5. Fraud • Card numbers are stolen by hacking, malware, etc. • “Dumps” are sold in “carding” forums • Fraudsters use numbers in dumps to buy goods, which they then resell • Cardholders dispute transactions • Merchant ends up bearing cost of fraud
  • 6. Machine Learning • We want to detect fraud in real-time • Imagine we had a black box “classifier” which we fed all the properties we have for a transaction (e.g., amount) • The black box responds with the probability that the transaction is fraudulent • We use the black box elsewhere in our system: e.g., Stripe’s API will query it for every transaction and immediately declines a charge if the probability of fraud is high enough
  • 7. Input data Choosing the “features” (feature engineering) is a hard problem that we won’t cover here
  • 8. First attempt Two issues: • Probability(fraud) needs to be between 0 and 1 • card_country is not numerical (it’s “categorical”)
  • 9. Logistic regression • Instead of modeling p = Probability(fraud) as a linear function, we model the log-odds of fraud • p is a sigmoidal function of the right side
  • 10. Categorical variables • If we have a variable that takes one of N discrete values, we “encode” that by adding N - 1 “dummy” variables • Ex: Let’s say card_country can be “AU,” “GB,” or “US.” We add booleans for “card = AU” and “card = GB” • We don’t want a linear relationship among variables Our final model is
  • 11. Fitting a regression • Guess values for a, b, c, d, and Z • Compute the “likelihood” of the training observations given these values for the parameters • Find a, b, c, d, and Z that maximize likelihood (optimization problem—gradient descent)
  • 13.
  • 14. • We want models to generalize well, i.e., to give accurate predictions on new data • We don’t want to “overfit” to randomness in the data we use to train the model, so we evaluate our performance on data not used to generate the model
  • 15.
  • 16. FPR = fraction of non-fraud predicted to be fraud TPR = fraction of fraud predicted to be fraud threshold = 0.52 Evaluating the model - ROC, AUC
  • 17. Nonlinear models • (Logistic) regressions are linear models: if you double one input value, the log-odds also double • What if the impact of amount depends on another variable? For example, maybe larger amounts are more predictive of fraud for GB cards.* • What if the effect of amount is nonlinear? For example, maybe small and large charges are more likely to be fraudulent than charges with moderate amounts.
  • 18. Decision Trees p = 0.34 p = 0.63 p = 0.63 p = 0.85
  • 19. Fitting a decision tree • Start with a node (first node is all the data) • Pick the split that maximizes the decrease in Gini (weighted by size of child nodes) • Example gain:
 (0.4998) - (
 (41064/59893) * 0.4765 +
 (18829/59893) * 0.4132)
 = 0.043 • Continue recursively until
 stopping criterion reached
  • 20. Random forests • Decision trees are “easy” to overfit • We train N trees, each on a (bootstrapped) sample of the training data • At each split, we only consider a subset of the available features—say, sqrt(total # of features) of them • This reduces correlation among the trees • The score is the average of the score produced by each tree
  • 21.
  • 22. Choosing methods • Use regression if: the relationship between the target and the inputs is linear, or you want to be able to isolate the impact of each variable on the target • Use a tree/forest if: there are complex dependencies between inputs or the impact on the target of an input is nonlinear James, Witten, Hastie, Tibshirani Introduction to Statistical Learning
  • 23. Where do you stick the model? • Make model scoring a service: work common to all model evaluations happens in one place (e.g., logging of scores and feature values for later analysis) • Easier option: save Python model objects and have scoring be a Python service (e.g., with Tornado) • Advantages: easy to set-up • Disadvantages: all the problems with pickling, another production runtime (if you’re not already using Python), GIL (no concurrent model evaluation)
  • 24. Other option: create (custom) serialization format, save models in Python, and load in a service in a different language (e.g., Scala/Go) • Advantages: Runtime consistency, fun evaluation optimizations (e.g, concurrently scoring all the trees in a forest), type checking • Disadvantages: Have to write serializer/deserializer (PMML is a “standard” but no scikit support) Better if your RPC protocol supports type-checking (e.g. protobuf or thrift)!
  • 25. • Feature engineering: figuring out what inputs are valuable to the model (e.g., the “card_use_24h” input) • Getting data into the right format in production: say you generate training data on Hadoop—what do you do in production? • Evaluating the production model performance and training new models? (Counterfactual evaluation) Harder problems
  • 26. Thanks @mlmanapat Slides, Jupyter notebook, data, and related talks at http://mlmanapat.com Shameless plug:
 Stripe is hiring engineers and data scientists
  • 27. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/stripe- machine-learning