Alex Paino, a Software Engineer at Sift Science, discusses how Sift uses machine learning to prevent several types of abusive user behavior for thousands of customers. Measuring the accuracy of the thousands of classifiers involved, in a way that correctly represents the value provided to customers, is a huge challenge. Alex describes how the team thinks about this problem and what they have done to address it, including an overview of the tools and methodologies that allow them to quickly summarize the results of an experiment, break ties in mixed-result experiments, and drill into specific models and samples.
7. Motivation - Why is this important?
1. Experiments must happen to improve an ML system
2. Evaluation needs to correctly identify positive changes (evaluation as a loss function for your stack)
3. Getting this right is a subtle and tricky problem
10. Running experiments correctly - Background
- Large delay in feedback for Sift - up to 90 days
- → offline experiments over historical data
- Need to simulate the online case as closely as possible
[Figure: timeline of user events - created account, updated credit card info, updated settings, purchased item - followed by a chargeback up to 90 days later]
13. Running experiments correctly - Lessons
Lesson: train & test set creation
- Can’t pick random splits
- Disjoint in time and set of users
- Watch for class skew - ours is over 50:1 → need to downsample
[Figure: train and test sets disjoint along both the time and user axes]
15. Running experiments correctly - Lessons
Lesson: preventing cheating
- External data sources need to be versioned
- Can’t leak ground truth into feature vectors
[Figure: timeline of user events (created account, updated credit card info, logins from IP addresses A and B, transaction), where IP address B is only later added to a Tor exit node DB]
17. Running experiments correctly - Lessons
Lesson: considering scores at key decision points
- Scores given for any event (e.g. user login)
- Need to evaluate the scores our customers use to make decisions
19. Running experiments correctly - Lessons
Lesson: parity with the online system
- Our system does online learning → so should the offline experiments
- Reuse the same code paths
26. Comparing Experiments Correctly - Background
- Thousands of (customer, abuse type) combinations to evaluate
- Each with different features, models, class skew, and noise levels
- → Need some way to consolidate these evaluations
28. Comparing Experiments Correctly - Lessons
Lesson: pitfalls with consolidating results
- Can’t throw all samples together → different score distributions
- Weighted averages are tricky
[Figure: two perfect per-customer score distributions combine into an imperfect one]
36. Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
- Saves the team from drawing incorrect conclusions
41. Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
[Screenshots: ROC curves and score distributions from the evaluation page]
42. Building tools to ensure correctness - Examples
Example: Jupyter notebooks for deep-dives
46. Key Takeaways
1. Need to carefully design experiments to remove biases
2. Require statistical significance when comparing results to filter out noise
3. The right tools can help ensure all of your analyses are correct while improving productivity
...today I’ll be talking to you about how we conduct machine learning experiments here at Sift.
I’ll start with the necessary background on Sift, and then touch on why this is such an important topic before diving into our experiences with this topic, where I’ll cover how we run experiments correctly, how we compare experiments correctly, and how we have built tools that ensure all experiments have this correctness baked in.
First, a little about Sift. Sift uses machine learning to prevent various forms of abuse on the internet for our customers.
To do this, our customers send us three types of data: page view data sent via our Javascript snippet, event data for important events such as the creation of an order or account through our events API, and feedback through our labels API or our web Console. (this console is what our customers’ analysts use to investigate potential cases of abuse)
Especially relevant to this discussion is the fact that we now offer 4 distinct abuse prevention products as of our launch last Tuesday, and that we do this for thousands of customers.
Ok, so here is the motivation for the talk, starting with the basics:
We must conduct experiments to improve a machine learning system
We need our evaluation system to identify experiments that help the system as good and those that hurt it as bad. You can think of your evaluation framework as a sort of meta loss function for your entire ML stack; you want the changes that the evaluation framework allows into your system to minimize error over time.
However, conducting these experiments without introducing bias is often very tricky. Getting this wrong can lead to wasted effort and, in the worst case, optimizing a system away from its ideal operating point. For example, ignoring class skew and using precision/recall of the dominant class leads to the always-positive classifier.
Must run experiments
Experiments must be correct
Easy to get them wrong, which is why you should think about this
Ok, so we’ve said it’s important to get evaluation right. The first step along that path is running correct, representative experiments. Here’s how we do this at Sift.
When I say “correct”, what I mean is that these evaluations are not biased
Unlike a problem like ad targeting, we don’t instantly receive feedback about our predictions -- often takes weeks or months.
Because of this we have to run experiments offline over historical data.
The problem is then: how do we run offline experiments that best simulate the live case? That is, how do we best measure the value that our system is providing online through an offline experiment?
This is a very hard problem; for example, just take a look at how much work goes into backtesting systems for trading.
The first thing you have to get right here is how you divide up your data into train and test sets.
If you want to simulate the live case correctly, you can’t just pick random splits -- that could allow your training set to include information from “the future”, which is especially bad for us because a large source of value for our models is their ability to connect new accounts to accounts previously marked as fraudulent.
For us, we additionally need to segment the users belonging to each of the train and test sets so that we don’t give ourselves credit for just surfacing users we already know to be bad.
Beyond properly segmenting users, you also need to pay attention to class skew. This is especially true in a problem like payment fraud detection, where our customers commonly only see fraud in under 2% of transactions.
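The kind of split described above can be sketched in a few lines. This is a minimal illustration assuming a simple list-of-dicts event schema; the `user`/`time`/`label` field names and the downsampling ratio are hypothetical, not Sift's actual pipeline:

```python
import random

def time_user_disjoint_split(events, cutoff_time):
    """Split events so train and test are disjoint in time AND users.

    Events before the cutoff go to train; events at or after the cutoff
    go to test, but only for users who never appear in train, so the
    model gets no credit for users it already knows to be bad.
    """
    train = [e for e in events if e["time"] < cutoff_time]
    train_users = {e["user"] for e in train}
    test = [e for e in events
            if e["time"] >= cutoff_time and e["user"] not in train_users]
    return train, test

def downsample_negatives(samples, ratio, seed=0):
    """Reduce class skew: keep all positives and at most ratio*|pos| negatives."""
    rng = random.Random(seed)
    positives = [s for s in samples if s["label"] == 1]
    negatives = [s for s in samples if s["label"] == 0]
    kept = rng.sample(negatives, min(len(negatives), ratio * len(positives)))
    return positives + kept
```

With a 50:1 skew, downsampling to a fixed positive:negative ratio keeps the trained models from collapsing toward the always-negative classifier.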
Our knowledge base versions external data so that our evals can’t use information from “the future”.
Ground truth leaking: this comes up, e.g., when computing fraud-rate features from sparse information such as email addresses. One example that hurt us was a social data integration where we had queried for social data primarily for fraudulent accounts.
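One way to get this versioning is point-in-time lookups: store timestamped snapshots of the external source and always query "as of" the event's time. A toy sketch (illustrative only, not Sift's actual knowledge base) using a Tor exit node list as the example source:

```python
import bisect

class VersionedSource:
    """Point-in-time lookup over a versioned external data source.

    An offline experiment queries as_of(event_time), so it can never see
    knowledge (e.g. a Tor exit node list entry) that only became
    available after the event being scored.
    """
    def __init__(self):
        self._times = []
        self._snapshots = []

    def record(self, ts, snapshot):
        # Snapshots must be recorded in chronological order.
        self._times.append(ts)
        self._snapshots.append(snapshot)

    def as_of(self, ts):
        # Latest snapshot recorded at or before `ts`; empty if none yet.
        i = bisect.bisect_right(self._times, ts)
        return self._snapshots[i - 1] if i else set()
```

A login from an IP at time 15 is then checked against the exit node list as it existed at time 15, even if the IP was added to the list at time 20.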
But this train test set split isn’t enough to run correct experiments; we still need to figure out how to analyze the scores given to the test side.
We provide risk scores after any event for a user -- e.g. login, logout, account creation, account updated, item added to cart, etc. => don’t want to use all of them, as this heavily weights active users
But most customers only care about the score after a certain event -- for most payment fraud customers, the score we give to a user when they try to checkout is all that matters
Thus, in our offline experiments we need to only give ourselves credit for producing an accurate score at this point in time; giving a high score to a transaction that will result in a chargeback hours or days after the transaction was completed is of no value to the customer, and shouldn’t affect our evaluation of accuracy
The trick here is knowing which event(s) or scenarios a customer cares about. To date we have hardcoded this set for each of our abuse prevention products, but we hope with the launch of our new Workflows product that we will be able to get more fine-grained information about how each customer is using us.
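Restricting the evaluation to decision points is then a simple filter over the scored event stream. A sketch, assuming a (event_type, score, label) tuple schema and a hardcoded per-product set of decision event types, as described above:

```python
def scores_at_decision_points(scored_events, decision_event_types):
    """Keep only the scores produced at events where the customer decides.

    Scoring every event would over-weight active users, and would credit
    accurate scores produced after the decision (e.g. after checkout)
    that are of no value to the customer.
    """
    return [(etype, score, label)
            for etype, score, label in scored_events
            if etype in decision_event_types]
```

For a typical payment fraud customer, `decision_event_types` would be something like `{"checkout"}`: only the score shown at checkout enters the evaluation.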
The final point on running experiments correctly goes back to the point about accurately simulating the online case.
In the online case, various parts of our modeling stack are learned online.
Thus, to accurately simulate our online accuracy, we must simulate online learning. We actually weren’t doing this for a long time, which was underestimating our accuracy.
We’ve also found it useful in general to aim to reuse the same code paths online and offline -- removes a potential source of difficult bugs and biases in the system
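Simulating online learning offline amounts to a "score first, then learn" replay over chronologically ordered data (sometimes called progressive validation). A sketch with an assumed `predict`/`update` model interface and a toy running-mean model:

```python
def replay_with_online_updates(model, stream):
    """Replay a chronological (features, label) stream, scoring each
    sample before the model sees its label, as the live system would."""
    scores = []
    for x, y in stream:
        scores.append(model.predict(x))
        model.update(x, y)  # online update; ideally the production code path
    return scores

class RunningMean:
    """Toy online model: predicts the running mean of labels seen so far."""
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def predict(self, x):
        return self.total / self.n if self.n else 0.5

    def update(self, x, y):
        self.n += 1
        self.total += y
```

Skipping the `update` call is exactly the mistake described above: the offline model is frozen while the online one keeps learning, so the offline experiment underestimates online accuracy.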
Now that we can execute correct experiments, how do we make sense of their results relative to the current state of the system?
To understand why this is especially challenging for us at Sift, we need a little more background on our modeling setup.
In its most basic form, a Sift Score is a combination of several different global models (for example, random forest and logistic regression models) along with one or more customer-specific models.
However, with the recent launch of our 2 new abuse prevention products...
...we now have 4 of this same setup for each customer, each consisting of distinct models. So we’re up to 4 different scores, with over 10 different models, to evaluate for each customer...
...of which we have several thousand.
As you can see, this is a huge number of distinct evaluations to consider, and we commonly experiment with changes, such as feature engineering, that can affect all of them.
This is made even more complicated by the diverse nature of our customer base -- each customer brings their own unique data, with their own class skew, and level of noise in their evaluations.
To make sense of this, we had to come up with some means of summarizing these diverse results.
But first, here are some things we have tried or considered and found to be flawed in one way or another.
One lesson we learned is that we cannot rely on an evaluation that simply merges all samples across customers; this is because each customer’s score distribution can be shifted or scaled in their own way due to differences in integration, class skew, etc., as you can see in this image.
Relatedly, when comparing two experiments, we need our summary metrics to not be tied to a single threshold as each customer will use their own thresholds dependent upon their fraud prior, appetite for risk, etc.
Another thing we have learned is that it is difficult to correctly weight an average over some summary metric, such as AUC ROC, across all (customer, use case) pairs. One approach we determined to be flawed pretty early on was one that weighted each customer’s results by their overall volume; this led to our evals being heavily biased towards improving things for a very small number of super-large customers. This situation has improved over time as we’ve accumulated more and more customers, but is still problematic.
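The pooling pitfall is easy to reproduce numerically: below, each customer's scores rank its own samples perfectly (AUC 1.0), yet pooling the raw scores gives an imperfect AUC because one customer's distribution sits higher than the other's. The pairwise AUC implementation is just for illustration:

```python
def auc(labels, scores):
    """AUC ROC via pairwise comparisons - O(n^2), fine for illustration."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Customer 1 scores low overall, customer 2 scores high overall,
# but each ranks its own fraud above its own non-fraud.
c1_labels, c1_scores = [0, 1], [0.1, 0.2]
c2_labels, c2_scores = [0, 1], [0.8, 0.9]
pooled = auc(c1_labels + c2_labels, c1_scores + c2_scores)
```

Customer 2's non-fraud (0.8) outranks customer 1's fraud (0.2) in the pooled set, so the combined AUC drops below 1.0 even though both customers are served perfectly.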
Here we have a few techniques that have worked well for us. The most helpful thing we’ve done is to begin requiring statistical significance with all of our comparisons across experiments.
This helps to cut through the noise of having several thousand evaluations to look at by only surfacing those changes that are meaningfully different. Applying this requirement of statistically significant improvements has given rise to a simple summarization technique of counting the number of customers significantly improved and comparing it to the count of those made significantly worse.
We’ve also found that viewing cond
Sometimes, however, an accuracy improving change may not conclusively improve the accuracy for a single customer due to small sample sizes, etc. For these cases, we have designed a separate top-level summary statistic that takes advantage of the thousand semi-correlated trials (i.e. from our thousands of customers) and aims to give us the probability that the expected increase in some summary statistic (e.g. AUC ROC) is non-zero. We can do this by calculating the z-score for the delta in AUC ROC for each customer and running a one-sided t-test over the resulting sample set, as demonstrated by these equations.
Note that this approach could apply to any summary statistic that can yield a confidence interval.
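A sketch of that summary statistic, with the per-customer AUC deltas and their standard errors taken as given (in practice the standard errors might come from an AUC variance estimate such as Hanley-McNeil); stdlib only, so the p-value lookup is left as a rule of thumb rather than an exact t distribution:

```python
import math
import statistics

def delta_z_scores(deltas, std_errs):
    """z-score each per-customer AUC delta by its standard error."""
    return [d / se for d, se in zip(deltas, std_errs)]

def one_sided_t_stat(zs):
    """t statistic for H0: mean z-score <= 0 (no expected improvement).

    With thousands of customers the t distribution is close to normal,
    so a value well above ~1.65 suggests the expected improvement is
    positive at roughly the 5% level.
    """
    n = len(zs)
    mean = statistics.fmean(zs)
    sd = statistics.stdev(zs)
    return mean / (sd / math.sqrt(n))
```

No single customer's delta here is individually significant, but aggregating across customers can still surface a consistent small improvement.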
Ok, so we’ve figured out how to run and analyze experiments correctly in theory, but how do we ensure that this always happens in practice?
The best answer we’ve found is to design the right tools that bake in correctness and make it as difficult as possible for someone to incorrectly analyze an experiment.
Doing this leads to big productivity wins for a machine learning or data science team, and also makes it easier for other engineers to safely conduct experiments. This could be useful, e.g., in the case of an infrastructure engineer wanting to test an idea of theirs that may allow the model size to be doubled without a negative performance impact.
In both of these cases, you don’t want the engineers evaluating experiments to have to rethink all of the hard problems we’ve discussed today.
So how do we do this at Sift?
We’ve found that we need two classes of tools, the first being one that allows for quick, high level analysis of an experiment...
...an example of which is our experiment evaluation page. <describe eval page as depicted in image>
However, for more complicated experimental analysis, we’ve also found it necessary to support some tools that allow us to drill more deeply into an experiment…
...for this use case, we’ve found iPython notebooks to be a perfect fit.
One example where we found these tools useful was when we were investigating pulling in some new external data source at the request of a specific customer.
When we ran an experiment with the new data, it didn’t help in aggregate -- no significant changes.
But our intuition said it would help some, so we dug deeper through iPython to find some users who would be affected by this new data, and sure enough, were able to find a change.
That does it for the topics I want to cover.
I hope you’ll take away from this talk that:
running experiments correctly is very important