This document discusses machine learning in production and provides several case studies as examples. It begins with an overview of machine learning and common algorithms such as linear regression and neural networks. It then covers best practices across the machine learning pipeline: getting data, modelling and evaluation, deployment, and maintenance. Several case studies follow: a call-centre model that predicts whom to call, a student performance model, a credit scoring model, a customer deposit prediction model, and a fraud detection model. Together they show how machine learning can be applied across different domains and businesses.
11. Pipeline & Pitfalls
• Get data: garbage in = garbage out
– Ensure all data will be available at the time of prediction
– Use sampling if necessary
– Use the same code to get data for analysis and for prediction
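The last point is worth a sketch: one feature-preparation function shared between offline analysis and online scoring, so the logic cannot drift between the two paths. The function name, the fields and the commented usage are hypothetical, not the actual pipeline behind the case studies.

```r
# prepare_features.R -- hypothetical shared feature-preparation step.
# Both offline analysis and online scoring source this file, so the
# feature logic cannot silently diverge between the two paths.
prepare_features <- function(raw) {
  data.frame(
    # use only fields that are guaranteed to exist at prediction time
    contact_age_days = as.numeric(Sys.Date() - as.Date(raw$first_contact)),
    n_prior_calls    = ifelse(is.na(raw$n_prior_calls), 0, raw$n_prior_calls),
    region           = factor(raw$region)
  )
}

# Offline analysis / training:
#   train <- prepare_features(read.csv("historic_calls.csv"))
# Online prediction reuses exactly the same function:
#   features <- prepare_features(new_record)
```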
12. Pipeline & Pitfalls
• Get data
• Model & Evaluate
– Select the target with business in mind
– Start with simple things and set a benchmark
– Improve, write a notebook
– Test out of sample and out of time
13. Pipeline & Pitfalls
• Get data
• Model & Evaluate
– Select the target with business in mind
– Start with simple things and set a benchmark
– Improve, write a notebook
– Test out of sample and out of time
[Diagram: the whole dataset split along the time axis into a train + out-of-sample portion and an out-of-time holdout; a code sketch of this split follows]
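A minimal sketch of the split in the diagram, assuming a generic data frame `df` with a `date` column; the cut-off date, the 80/20 ratio and the variable names are illustrative assumptions.

```r
# Out-of-time / out-of-sample split, matching the diagram above.
# `df` is a hypothetical data frame with a `date` column.
set.seed(42)
cutoff <- as.Date("2017-01-01")                     # assumed cut-off date

out_of_time <- df[df$date >= cutoff, ]              # newest data, never trained on
historic    <- df[df$date <  cutoff, ]

idx           <- sample(nrow(historic), floor(0.8 * nrow(historic)))
train         <- historic[idx, ]                    # used for fitting
out_of_sample <- historic[-idx, ]                   # random holdout from the same period

# Evaluate on both holdouts: out_of_sample checks generalisation,
# out_of_time checks stability as the population moves forward in time.
```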
14. Pipeline & Pitfalls
• Get data
• Model & Evaluate
• Deploy
– Simpler algorithm = simpler deployment
– For regression, only the variable weights (coefficients) need to be shipped
– For more advanced models, usually a REST API (R tooling shown; a plumber sketch follows the list):
• https://cran.r-project.org/web/packages/AzureML/index.html
• https://www.opencpu.org/
• https://tensorflow.rstudio.com/tools/tfdeploy/articles/introduction.html
• https://github.com/trestletech/plumber
• ...
• https://github.com/danaki/yshanka
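As one concrete example from the list above, a minimal plumber sketch; the saved model file, the feature names and the route are assumptions for illustration, not the deployment used in the case studies.

```r
# api.R -- minimal plumber endpoint around a model fitted offline.
# The model file, feature names and route are hypothetical.
model <- readRDS("model.rds")

#* Score one prospect
#* @param contact_age_days:numeric
#* @param n_prior_calls:numeric
#* @get /score
function(contact_age_days, n_prior_calls) {
  newdata <- data.frame(
    contact_age_days = as.numeric(contact_age_days),
    n_prior_calls    = as.numeric(n_prior_calls)
  )
  list(score = predict(model, newdata = newdata, type = "response"))
}

# Serve it (from a separate script or the console):
#   plumber::plumb("api.R")$run(port = 8000)
```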
15. Pipeline & Pitfalls
• Deploy
– Or a managed training + hosting tool, if the budget allows it and using the cloud is not an issue
– Typical budget: 10 000s – 100 000s
16. Pipeline & Pitfalls
• Get data
• Model & Evaluate
• Deploy
• Maintain
– Test the data: the live population must stay the same as in training (see the PSI sketch below)
– Test the model: track its output and performance over time
– Challenge the model: build an updated challenger and compare it against the current model
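A minimal sketch of the "population must stay the same" check, using a population stability index (PSI) over score bands; the helper name, the number of bands and the 0.1 / 0.25 thresholds quoted in the comment are common conventions and assumptions, not part of the original material.

```r
# Population stability index (PSI) between training-time scores and live scores.
# A common rule of thumb: < 0.1 stable, 0.1-0.25 worth watching, > 0.25 investigate.
psi <- function(expected, actual, bands = 10) {
  cuts <- unique(quantile(expected, probs = seq(0, 1, length.out = bands + 1)))
  cuts[1] <- -Inf
  cuts[length(cuts)] <- Inf
  e <- pmax(as.numeric(table(cut(expected, cuts)) / length(expected)), 1e-6)
  a <- pmax(as.numeric(table(cut(actual,   cuts)) / length(actual)),   1e-6)
  sum((a - e) * log(a / e))
}

# psi(train_scores, live_scores)   # hypothetical vectors of model output
```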
19. Case: a call centre
Setup:
• A company that connects short-term employees with employers
• Data on several thousand calls provided, mainly contact data and an indication of whether the person accepted the employment offer
• Q: Whom do we call?
Result:
• A model with AUC 0.8
model output   took-the-job rate   calls base
0              8%                  49%
10             16%                 20%
20             20%                 12%
30             28%                  8%
40             36%                  5%
50             42%                  3%
60             50%                  2%
70             62%                  1%
80             67%                  0%
90             71%                  0%
overall        16%                 100%
[Chart: accepted-offer rate (0–80%) by model score band (0–90); a sketch of how such a score-band table can be built follows]
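The result table above is a score-band summary. A minimal sketch of how such a table might be produced, assuming a vector `score` of 0–100 model outputs and a 0/1 vector `accepted`; both names are illustrative.

```r
# Score-band summary: acceptance rate and share of calls per band of 10 points.
# `score` (0-100 model output) and `accepted` (0/1 outcome) are hypothetical vectors.
band <- cut(score, breaks = seq(0, 100, by = 10), right = FALSE, include.lowest = TRUE)

band_table <- data.frame(
  band       = levels(band),
  took_rate  = round(100 * tapply(accepted, band, mean)),            # % who took the job
  calls_base = round(100 * as.numeric(table(band)) / length(score))  # % of all calls
)
band_table
```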
20. Case: student performance review
Setup:
• A company that records and keeps all student marks throughout the year
• Data on several thousand marks provided
• The idea: a model that predicts the year-end final mark for each subject
• Q: What will the year-end mark be for each student and subject?
Result:
• A very simple model, 5% MAE
[Chart: predicted vs. actual marks on a 0–10 scale across 32 cases]
month   prediction error
1       12%
2       11%
3       10%
4        8%
5        8%
6        7%
7        7%
8        6%
9        5%
21. Case: credit scoring model for online lender
Setup:
• A company that issues loans in an EU country
• Data on several thousand loans provided
• Q: Will a customer default?
Result:
• An advanced ensemble machine learning pipeline yielded a mere 2% gain over a logistic regression model
[Chart: AUC of the ensemble machine learning model vs. logistic regression, on a 0.5–0.85 scale; a benchmark-comparison sketch follows]
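A minimal sketch of the kind of benchmark comparison implied above (logistic regression vs. a tree ensemble, AUC on the same holdout); the `loans` data frame, its `default` column and the use of randomForest are assumptions standing in for whatever ensemble was actually used.

```r
# Benchmark a simple logistic regression against a tree ensemble on the same holdout.
# `loans` and its 0/1 `default` column are hypothetical.
library(randomForest)
library(pROC)

set.seed(1)
loans$default <- factor(loans$default)
idx   <- sample(nrow(loans), floor(0.8 * nrow(loans)))
train <- loans[idx, ]
test  <- loans[-idx, ]

glm_fit <- glm(default ~ ., data = train, family = binomial)
rf_fit  <- randomForest(default ~ ., data = train)

glm_auc <- auc(test$default, predict(glm_fit, test, type = "response"))
rf_auc  <- auc(test$default, predict(rf_fit,  test, type = "prob")[, 2])

c(logistic = as.numeric(glm_auc), ensemble = as.numeric(rf_auc))
```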
22. Case: Will a customer deposit funds?
Setup:
• A company that trades currency
• Data on several hundred thousand user-agent strings provided
• Q: Will a customer deposit funds to their account?
Result:
• An ensemble machine learning model learned to separate those who will deposit:
score   deposited rate   count
0            3%          7110
10          13%           800
20          16%           341
30          25%           159
40          25%            80
50          43%            23
60          40%            10
70          67%             3
80         100%             2
90         100%             1
23. Case: Is a transaction fraudulent?
Setup:
• A Kaggle dataset with fraudulent transactions from https://www.kaggle.com/dalpozz/creditcardfraud
• Epistatica’s learning pipeline
• Q: Can we build an unsupervised model?
Result:
• AUC 0.75 on the Kaggle data (0.6 hit rate at a 0.3 false-positive rate)
Validated:
• On data from one of the top consulting companies (AUC 0.7)
• On payment provider data (AUC 0.8)
group size   fraud rate
67%          0.1%
33%          0.3%
fraud rate difference between the groups: 302%