Alex Paino, a Software Engineer at Sift Science, discusses how Sift uses machine learning to prevent several types of abusive user behavior for thousands of customers. Measuring the accuracy of the thousands of classifiers involved, in a way that correctly represents the value provided to customers, is a huge challenge. Alex describes how the team thinks about this problem and what they have done to address it, including an overview of the tools and methodologies that allow them to quickly summarize the results of an experiment, break ties in mixed-result experiments, and drill into specific models and samples.
7. Motivation - Why is this important?
1. Experiments must happen to improve an ML system
2. Evaluation needs to correctly identify positive changes (evaluation as a loss function for your stack)
3. Getting this right is a subtle and tricky problem
10. Running experiments correctly - Background
- Large delay in feedback for Sift - up to 90 days
- → offline experiments over historical data
- Need to simulate the online case as closely as possible
[Figure: timeline of user events - created account, updated credit card info, updated settings, purchased item - followed by a chargeback up to 90 days later]
13. Running experiments correctly - Lessons
Lesson: train & test set creation
- Can’t pick random splits
- Disjoint in time and set of users
- Watch for class skew - ours is over 50:1 → need to downsample
[Figure: train and test sets disjoint along both the time and user axes]
15. Running experiments correctly - Lessons
Lesson: preventing cheating
- External data sources need to be versioned
- Can’t leak ground truth into feature vectors
[Figure: timeline of user events (created account, updated credit card info, logins from IP addresses A and B, transaction), where IP address B is only later added to a Tor exit node DB]
17. Running experiments correctly - Lessons
Lesson: considering scores at key decision points
- Scores given for any event (e.g. user login)
- Need to evaluate the scores our customers use to make decisions
19. Running experiments correctly - Lessons
Lesson: parity with the online system
- Our system does online learning → so should the offline experiments
- Reuse the same code paths
26. Comparing Experiments Correctly - Background
- Thousands of (customer, abuse type) combinations to evaluate
- Each with different features, models, class skew, and noise levels
- → Need some way to consolidate these evaluations
28. Comparing Experiments Correctly - Lessons
Lesson: pitfalls with consolidating results
- Can’t throw all samples together → different score distributions
- Weighted averages are tricky
[Figure: two perfect per-customer score distributions combine into an imperfect one]
36. Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
- Saves the team from drawing incorrect conclusions
41. Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
[Screenshots: ROC curves and score distributions from the evaluation page]
42. Building tools to ensure correctness - Examples
Example: Jupyter notebooks for deep-dives
46. Key Takeaways
1. Need to carefully design experiments to remove biases
2. Require statistical significance when comparing results to filter out noise
3. The right tools can help ensure all of your analyses are correct while improving productivity
...today I’ll be talking to you about how we conduct machine learning experiments here at Sift.
I’ll start with the necessary background on Sift, and then touch on why this is such an important topic before diving into our experiences with this topic, where I’ll cover how we run experiments correctly, how we compare experiments correctly, and how we have built tools that ensure all experiments have this correctness baked in.
First, a little about Sift. Sift uses machine learning to prevent various forms of abuse on the internet for our customers.
To do this, our customers send us three types of data: page view data sent via our Javascript snippet, event data for important events such as the creation of an order or account through our events API, and feedback through our labels API or our web Console. (this console is what our customers’ analysts use to investigate potential cases of abuse)
Especially relevant to this discussion is the fact that we now offer 4 distinct abuse prevention products as of our launch last Tuesday, and that we do this for thousands of customers.
Ok, so here is the motivation for the talk, starting with the basics:
We must conduct experiments to improve a machine learning system
We need our evaluation system to identify experiments that help the system as good and those that hurt it as bad. You can think of your evaluation framework as a sort of meta loss function for your entire ML stack; you want the changes that the evaluation framework allows into your system to minimize error over time.
However, conducting these experiments without introducing bias is often very tricky. Getting this wrong can lead to wasted effort and, in the worst case, optimizing a system away from its ideal operating point. For example, ignoring class skew and using precision/recall of the dominant class leads to the always-positive classifier.
Must run experiments
Experiments must be correct
Easy to get them wrong, which is why you should think about this
Ok, so we’ve said it’s important to get evaluation right. The first step along that path is running correct, representative experiments. Here’s how we do this at Sift.
When I say “correct”, what I mean is that these evaluations are not biased
Unlike a problem like ad targeting, we don’t instantly receive feedback about our predictions -- often takes weeks or months.
Because of this we have to run experiments offline over historical data.
The problem is then: how do we run offline experiments that best simulate the live case? That is, how do we best measure the value that our system is providing online through an offline experiment?
This is a very hard problem; for example, just take a look at how much work goes into backtesting systems for trading.
The first thing you have to get right here is how you divide up your data into train and test sets.
If you want to simulate the live case correctly, you can’t just pick random splits -- that could allow your training set to include information from “the future”, which is especially bad for us because a large source of value for our models is their ability to connect new accounts to accounts previously marked as fraudulent.
For us, we additionally need to segment the users belonging to each of the train and test sets so that we don’t give ourselves credit for just surfacing users we already know to be bad.
Beyond properly segmenting users, you also need to pay attention to class skew. This is especially true in a problem like payment fraud detection, where our customers commonly only see fraud in under 2% of transactions.
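The kind of split described above can be sketched in a few lines. This is a minimal illustration assuming a simple list-of-dicts event schema; the `user`/`time`/`label` field names and the downsampling ratio are hypothetical, not Sift's actual pipeline:

```python
import random

def time_user_disjoint_split(events, cutoff_time):
    """Split events so train and test are disjoint in time AND users.

    Events before the cutoff go to train; events at or after the cutoff
    go to test, but only for users who never appear in train, so the
    model gets no credit for users it already knows to be bad.
    """
    train = [e for e in events if e["time"] < cutoff_time]
    train_users = {e["user"] for e in train}
    test = [e for e in events
            if e["time"] >= cutoff_time and e["user"] not in train_users]
    return train, test

def downsample_negatives(samples, ratio, seed=0):
    """Reduce class skew: keep all positives and at most ratio*|pos| negatives."""
    rng = random.Random(seed)
    positives = [s for s in samples if s["label"] == 1]
    negatives = [s for s in samples if s["label"] == 0]
    kept = rng.sample(negatives, min(len(negatives), ratio * len(positives)))
    return positives + kept
```

With a 50:1 skew, downsampling to a fixed positive:negative ratio keeps the trained models from collapsing toward the always-negative classifier.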
Our knowledge base versions external data so that our evals can’t use information from “the future”.
Ground truth leaking: this comes up, e.g., when computing fraud-rate features from sparse information such as email addresses. One example that hurt us was a social data integration where we had queried for social data primarily for fraudulent accounts.
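One way to get this versioning is point-in-time lookups: store timestamped snapshots of the external source and always query "as of" the event's time. A toy sketch (illustrative only, not Sift's actual knowledge base) using a Tor exit node list as the example source:

```python
import bisect

class VersionedSource:
    """Point-in-time lookup over a versioned external data source.

    An offline experiment queries as_of(event_time), so it can never see
    knowledge (e.g. a Tor exit node list entry) that only became
    available after the event being scored.
    """
    def __init__(self):
        self._times = []
        self._snapshots = []

    def record(self, ts, snapshot):
        # Snapshots must be recorded in chronological order.
        self._times.append(ts)
        self._snapshots.append(snapshot)

    def as_of(self, ts):
        # Latest snapshot recorded at or before `ts`; empty if none yet.
        i = bisect.bisect_right(self._times, ts)
        return self._snapshots[i - 1] if i else set()
```

A login from an IP at time 15 is then checked against the exit node list as it existed at time 15, even if the IP was added to the list at time 20.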
But this train test set split isn’t enough to run correct experiments; we still need to figure out how to analyze the scores given to the test side.
We provide risk scores after any event for a user -- e.g. login, logout, account creation, account updated, item added to cart, etc. => don’t want to use all of them, as this heavily weights active users
But most customers only care about the score after a certain event -- for most payment fraud customers, the score we give to a user when they try to checkout is all that matters
Thus, in our offline experiments we need to only give ourselves credit for producing an accurate score at this point in time; giving a high score to a transaction that will result in a chargeback hours or days after the transaction was completed is of no value to the customer, and shouldn’t affect our evaluation of accuracy
The trick here is knowing which event(s) or scenarios a customer cares about. To date we have hardcoded this set for each of our abuse prevention products, but we hope with the launch of our new Workflows product that we will be able to get more fine-grained information about how each customer is using us.
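Restricting the evaluation to decision points is then a simple filter over the scored event stream. A sketch, assuming a (event_type, score, label) tuple schema and a hardcoded per-product set of decision event types, as described above:

```python
def scores_at_decision_points(scored_events, decision_event_types):
    """Keep only the scores produced at events where the customer decides.

    Scoring every event would over-weight active users, and would credit
    accurate scores produced after the decision (e.g. after checkout)
    that are of no value to the customer.
    """
    return [(etype, score, label)
            for etype, score, label in scored_events
            if etype in decision_event_types]
```

For a typical payment fraud customer, `decision_event_types` would be something like `{"checkout"}`: only the score shown at checkout enters the evaluation.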
The final point on running experiments correctly goes back to the point about accurately simulating the online case.
In the online case, various parts of our modeling stack are learned online.
Thus, to accurately simulate our online accuracy, we must simulate online learning. We actually weren’t doing this for a long time, which was underestimating our accuracy.
We’ve also found it useful in general to aim to reuse the same code paths online and offline -- removes a potential source of difficult bugs and biases in the system
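Simulating online learning offline amounts to a "score first, then learn" replay over chronologically ordered data (sometimes called progressive validation). A sketch with an assumed `predict`/`update` model interface and a toy running-mean model:

```python
def replay_with_online_updates(model, stream):
    """Replay a chronological (features, label) stream, scoring each
    sample before the model sees its label, as the live system would."""
    scores = []
    for x, y in stream:
        scores.append(model.predict(x))
        model.update(x, y)  # online update; ideally the production code path
    return scores

class RunningMean:
    """Toy online model: predicts the running mean of labels seen so far."""
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def predict(self, x):
        return self.total / self.n if self.n else 0.5

    def update(self, x, y):
        self.n += 1
        self.total += y
```

Skipping the `update` call is exactly the mistake described above: the offline model is frozen while the online one keeps learning, so the offline experiment underestimates online accuracy.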
Now that we can execute correct experiments, how do we make sense of their results relative to the current state of the system?
To understand why this is especially challenging for us at Sift, we need a little more background on our modeling setup.
In its most basic form, a Sift Score is a combination of several different global models (for example, random forest and logistic regression models) along with one or more customer-specific models.
However, with the recent launch of our 2 new abuse prevention products...
...we now have 4 of this same setup for each customer, each consisting of distinct models. So we’re up to 4 different scores, with over 10 different models, to evaluate for each customer...
...of which we have several thousand.
As you can see, this is a huge number of distinct evaluations to consider, and we commonly experiment with changes, such as feature engineering, that can affect all of them.
This is made even more complicated by the diverse nature of our customer base -- each customer brings their own unique data, with their own class skew, and level of noise in their evaluations.
To make sense of this, we had to come up with some means of summarizing these diverse results.
But first, here are some things we have tried or considered and found to be flawed in one way or another.
One lesson we learned is that we cannot rely on an evaluation that simply merges all samples across customers; this is because each customer’s score distribution can be shifted or scaled in their own way due to differences in integration, class skew, etc., as you can see in this image.
Relatedly, when comparing two experiments, we need our summary metrics to not be tied to a single threshold as each customer will use their own thresholds dependent upon their fraud prior, appetite for risk, etc.
Another thing we have learned is that it is difficult to correctly weight an average over some summary metric, such as AUC ROC, across all (customer, use case) pairs. One approach we determined to be flawed pretty early on was one that weighted each customer’s results by their overall volume; this led to our evals being heavily biased towards improving things for a very small number of super-large customers. This situation has improved over time as we’ve accumulated more and more customers, but is still problematic.
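The pooling pitfall is easy to reproduce numerically: below, each customer's scores rank its own samples perfectly (AUC 1.0), yet pooling the raw scores gives an imperfect AUC because one customer's distribution sits higher than the other's. The pairwise AUC implementation is just for illustration:

```python
def auc(labels, scores):
    """AUC ROC via pairwise comparisons - O(n^2), fine for illustration."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Customer 1 scores low overall, customer 2 scores high overall,
# but each ranks its own fraud above its own non-fraud.
c1_labels, c1_scores = [0, 1], [0.1, 0.2]
c2_labels, c2_scores = [0, 1], [0.8, 0.9]
pooled = auc(c1_labels + c2_labels, c1_scores + c2_scores)
```

Customer 2's non-fraud (0.8) outranks customer 1's fraud (0.2) in the pooled set, so the combined AUC drops below 1.0 even though both customers are served perfectly.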
Here we have a few techniques that have worked well for us. The most helpful thing we’ve done is to begin requiring statistical significance with all of our comparisons across experiments.
This helps to cut through the noise of having several thousand evaluations to look at by only surfacing those changes that are meaningfully different. Applying this requirement of statistically significant improvements has given rise to a simple summarization technique of counting the number of customers significantly improved and comparing it to the count of those made significantly worse.
We’ve also found that viewing cond
Sometimes, however, an accuracy improving change may not conclusively improve the accuracy for a single customer due to small sample sizes, etc. For these cases, we have designed a separate top-level summary statistic that takes advantage of the thousand semi-correlated trials (i.e. from our thousands of customers) and aims to give us the probability that the expected increase in some summary statistic (e.g. AUC ROC) is non-zero. We can do this by calculating the z-score for the delta in AUC ROC for each customer and running a one-sided t-test over the resulting sample set, as demonstrated by these equations.
Note that this approach could apply to any summary statistic that can yield a confidence interval.
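A sketch of that summary statistic, with the per-customer AUC deltas and their standard errors taken as given (in practice the standard errors might come from an AUC variance estimate such as Hanley-McNeil); stdlib only, so the p-value lookup is left as a rule of thumb rather than an exact t distribution:

```python
import math
import statistics

def delta_z_scores(deltas, std_errs):
    """z-score each per-customer AUC delta by its standard error."""
    return [d / se for d, se in zip(deltas, std_errs)]

def one_sided_t_stat(zs):
    """t statistic for H0: mean z-score <= 0 (no expected improvement).

    With thousands of customers the t distribution is close to normal,
    so a value well above ~1.65 suggests the expected improvement is
    positive at roughly the 5% level.
    """
    n = len(zs)
    mean = statistics.fmean(zs)
    sd = statistics.stdev(zs)
    return mean / (sd / math.sqrt(n))
```

No single customer's delta here is individually significant, but aggregating across customers can still surface a consistent small improvement.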
Ok, so we’ve figured out how to run and analyze experiments correctly in theory, but how do we ensure that this always happens in practice?
The best answer we’ve found is to design the right tools that bake in correctness and make it as difficult as possible for someone to incorrectly analyze an experiment.
Doing this leads to big productivity wins for a machine learning or data science team, and also makes it easier for other engineers to safely conduct experiments. This could be useful, e.g., in the case of an infrastructure engineer wanting to test an idea of theirs that may allow the model size to be doubled without a negative performance impact.
In both of these cases, you don’t want the engineers evaluating experiments to have to rethink all of the hard problems we’ve discussed today.
So how do we do this at Sift?
We’ve found that we need two classes of tools, the first being one that allows for quick, high level analysis of an experiment...
...an example of which is our experiment evaluation page. <describe eval page as depicted in image>
However, for more complicated experimental analysis, we’ve also found it necessary to support some tools that allow us to drill more deeply into an experiment…
...for this use case, we’ve found iPython notebooks to be a perfect fit.
One example where we found these tools useful was when we were investigating pulling in some new external data source at the request of a specific customer.
When we ran an experiment with the new data, it didn’t help in aggregate -- no significant changes.
But our intuition said it would help some, so we dug deeper through iPython to find some users who would be affected by this new data, and sure enough, were able to find a change.
That does it for the topics I want to cover.
I hope you’ll take away from this talk that:
running experiments correctly is very important