5. What is the problem?
• X Store has a retail credit card available to
customers
• There can be a number of sources of loss
from this product, but one is customers
defaulting on their debt
• This prevents the store from collecting
payment for products and services
rendered
10. Is this problem big enough to matter?
• Examining a slice of the customer database
(150,000 customers), we find that 6.6% of
customers were seriously delinquent in
payment over the last two years
• If only 5% of their carried debt was on the
store credit card, this is potentially:
• An average loss of $8.12 per customer
• An overall loss of $1.2 million
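The overall figure can be checked with a few lines of Python. This is only a sketch: the customer count, delinquency rate, and per-customer loss come from the slide, while the variable names and the derived loss per delinquent customer are illustrative assumptions.

```python
# Figures taken from the slide
n_customers = 150_000          # size of the customer slice
delinquency_rate = 0.066       # share seriously delinquent in the last two years
avg_loss_per_customer = 8.12   # dollars, averaged over all customers

n_delinquent = round(n_customers * delinquency_rate)
total_loss = n_customers * avg_loss_per_customer

# Derived for illustration only: the same loss spread over delinquent customers
loss_per_delinquent = total_loss / n_delinquent

print(f"Delinquent customers: {n_delinquent:,}")
print(f"Potential overall loss: ${total_loss:,.0f}")
print(f"Implied loss per delinquent customer: ${loss_per_delinquent:,.2f}")
```

The total comes to roughly $1.2 million, which matches the rounded figure on the slide.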
14. What can be done?
• There are numerous models that can be
used to predict which customers will
default
• This could be used to decrease credit limits
or cancel credit lines for current risky
customers to minimize potential loss
• Or better screen which customers are
approved for the card
18. How will I do this?
• This is a basic classification problem with
important business implications
• We’ll examine a few simple models to
get an idea of performance
• Explore decision tree methods to achieve
better performance
23. How will the models predict delinquency?
Each customer has a number of attributes
John Smith
Delinquent: Yes
Age: 23
Income: $1600
Number of Lines: 4
Mary Rasmussen
Delinquent: No
Age: 73
Income: $2200
Number of Lines: 2
...
We will use the customer attributes to predict
whether they were delinquent
28. How do we make sure that our solution actually
has predictive power?
We have two slices of the customer dataset
Train: 150,000 customers, delinquency in dataset
Test: 101,000 customers, delinquency not in dataset
None of the customers in the test dataset are
used to train the model
32. Internally we validate our model performance
with k-fold cross-validation
Using only the train dataset, we can get a sense of how
well our model performs without externally validating it
The train dataset is split into three parts: Train 1, Train 2, Train 3
Algorithm training: Train 1 and Train 2
Algorithm testing: Train 3
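The procedure can be sketched in plain Python. The slide shows only one arrangement (train on parts 1 and 2, test on part 3). The sketch below also rotates the held-out part, which is the usual cross-validation practice but an assumption here. The helper names and the toy majority-class model are hypothetical stand-ins.

```python
import random

def k_fold_indices(n_rows, k=3, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, labels, fit, predict, k=3):
    """Return the accuracy measured on each held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Toy stand-in for a real model: always predict the majority class.
def fit_majority(train_rows, train_labels):
    return max(set(train_labels), key=train_labels.count)

def predict_majority(model, row):
    return model

rows = [[i] for i in range(30)]
labels = [1 if i % 10 == 0 else 0 for i in range(30)]  # 10% "delinquent"
scores = cross_validate(rows, labels, fit_majority, predict_majority)
print(scores)
```

Any model exposing a fit step and a predict step could be passed in place of the majority-class stand-in.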
33. What matters is how well we can predict
the test dataset
We judge this using the accuracy, which is the number
of correct predictions out of the total number of
predictions made
So with 100,000 customers and an 80% accuracy we
will have correctly predicted whether 80,000
customers will default or not in the next two years
36. Putting accuracy in context
We could save $600,000 over two years if we
correctly predicted 50% of the customers that would
default and changed their account to prevent it
The potential loss is reduced by ~$8,000 for every
100,000 customers with each percentage point
increase in accuracy
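Both figures follow from the $1.2 million potential loss quoted earlier. This is a rough sketch that assumes savings scale linearly with the share of defaults caught. The slides imply that simplification but do not state it, and the variable names are illustrative.

```python
potential_loss = 1_200_000   # dollars over two years, rounded figure from the slides
n_customers = 150_000        # customers in the examined slice

# Catching half of the customers who would default
savings_at_half = 0.50 * potential_loss

# Loss attributable to 100,000 customers, then one percentage point of it
loss_per_100k = potential_loss / n_customers * 100_000
savings_per_point = 0.01 * loss_per_100k

print(f"Savings at 50%: ${savings_at_half:,.0f}")
print(f"Savings per percentage point per 100,000 customers: ${savings_per_point:,.0f}")
```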
54. For simple classification we pick a single attribute
and find the best split in the customers
[Figure: number of customers versus times past due, with candidate splits at 1, 2, ... dividing customers into true positives, true negatives, false positives, and false negatives]
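A minimal sketch of the split search follows. It assumes the rule "predict delinquent when the attribute is at or above the threshold" and picks the threshold by accuracy. The data is made up for illustration and the function name is hypothetical.

```python
def best_split(values, labels):
    """Try every observed value as a threshold t, predicting 1 when value >= t.

    Returns (threshold, accuracy) for the most accurate threshold.
    """
    best_threshold, best_accuracy = None, -1.0
    for t in sorted(set(values)):
        predictions = [1 if v >= t else 0 for v in values]
        correct = sum(p == y for p, y in zip(predictions, labels))
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy

# Made-up data: times past due and whether the customer was delinquent
times_past_due = [0, 0, 0, 1, 1, 1, 2, 3, 4, 5]
delinquent     = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

threshold, accuracy = best_split(times_past_due, delinquent)
print(threshold, accuracy)  # 2 0.9
```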
60. We evaluate possible splits using accuracy,
precision, and sensitivity
Acc = Number correct / Total number
Prec = True positives / Number of people predicted delinquent
Sens = True positives / Number of people actually delinquent
[Figure: accuracy, precision, and sensitivity (0 to 0.8) versus number of times 30-59 days past due (0 to 100)]
0.61 KGI on Test Set
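The three measures can be computed from the four outcome counts (true and false positives and negatives). The sketch below uses made-up predictions, and the function names are illustrative. It does not compute KGI, which the slides do not define.

```python
def confusion_counts(predicted, actual):
    """Count true/false positives and negatives (1 = delinquent)."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == 1 and a == 1 for p, a in pairs)
    tn = sum(p == 0 and a == 0 for p, a in pairs)
    fp = sum(p == 1 and a == 0 for p, a in pairs)
    fn = sum(p == 0 and a == 1 for p, a in pairs)
    return tp, tn, fp, fn

def split_metrics(predicted, actual):
    """Return (accuracy, precision, sensitivity) for one candidate split."""
    tp, tn, fp, fn = confusion_counts(predicted, actual)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against dividing by zero when no one is predicted or actually delinquent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, sensitivity

predicted = [1, 1, 1, 0, 0, 0, 0, 0]
actual    = [1, 1, 0, 1, 0, 0, 0, 0]
acc, prec, sens = split_metrics(predicted, actual)
print(acc, prec, sens)
```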
62. However, not all fields are as informative
Using the number of times 60-89 days past due,
we achieve a KGI of only 0.5, no better than random chance
The approach is naive and could be improved, but
our time is better spent on different algorithms
64. Exploring algorithmic choices further
From simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests
70. A random forest starts from a decision tree
Customer Data
Find the best split in a set of
randomly chosen attributes
Is age <30?
No: 75,000 customers aged 30 or over
Yes: 25,000 customers under 30
...
74. A random forest is composed of many decision trees
[Figure: six decision trees, each finding a best split of the customer data into a “yes” data set and a “no” data set, and so on]
We use a large number of trees so as not to over-fit to the
training data
A customer’s class is assigned by “vote”: each decision tree
predicts a class, and the class with the most votes wins
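The voting step can be illustrated as follows. The three "trees" are hypothetical one-rule stand-ins with made-up thresholds, not trained models. The two customers are the examples from the earlier slide.

```python
from collections import Counter

def forest_predict(trees, customer):
    """Each tree votes for a class; the most common vote wins."""
    votes = [tree(customer) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical "trees", each a simple rule on one attribute
trees = [
    lambda c: "delinquent" if c["age"] < 30 else "ok",
    lambda c: "delinquent" if c["income"] < 2000 else "ok",
    lambda c: "delinquent" if c["lines"] > 5 else "ok",
]

john = {"age": 23, "income": 1600, "lines": 4}
mary = {"age": 73, "income": 2200, "lines": 2}

print(forest_predict(trees, john))  # two of three trees vote "delinquent"
print(forest_predict(trees, mary))  # all three trees vote "ok"
```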
77. The Random Forest algorithm is easily implemented
in Python or R for initial testing and validation
It can also be parallelized with Mahout and Hadoop, since
no tree depends on any other
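As one possible Python route, the sketch below uses scikit-learn. The slides do not name a library, so this choice is an assumption, as are the synthetic data and the parameter values. It is meant only to show the shape of an initial test.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer data, with roughly 7% positives
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.93],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# n_jobs=-1 builds the trees on all available cores, since trees are independent
forest = RandomForestClassifier(n_estimators=150, n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)

# Probability of the positive ("delinquent") class for each held-out row
delinquency_probability = forest.predict_proba(X_test)[:, 1]
print("Held-out accuracy:", forest.score(X_test, y_test))
```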
82. A random forest performs well on the test set
Random Forest
10 trees: 0.779 KGI
150 trees: 0.843 KGI
1000 trees: 0.850 KGI
[Figure: bar chart of accuracy (axis 0.4 to 0.9) for random, classification, and random forests]
84. Exploring algorithmic choices further
From simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting
85. Boosting Trees is similar to a Random Forest
Customer Data
Find the best split in a set of
randomly chosen attributes
Is age <30?
No: data for customers aged 30 or over
Yes: data for customers under 30
...
86. Boosting Trees is similar to a Random Forest
Customer Data
Is age <30?
No: data for customers aged 30 or over
Yes: data for customers under 30
...
Do an exhaustive search for the best split
90. How Gradient Boosting Trees differs from
Random Forest
[Figure: a single decision tree finding a best split of the customer data into a “yes” data set and a “no” data set]
The first tree is optimized to minimize
a loss function describing the data
The next tree is then optimized to
fit whatever variability the first
tree didn’t fit
This is a sequential process, in
contrast to the random forest
We also run the risk of over-fitting
to the data, hence the learning rate, which
limits how much each tree contributes
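The sequential idea can be sketched with one-split trees (stumps) and a squared-error loss, where "what the previous trees didn't fit" is simply the residual. This is an illustrative simplification rather than the method used for the results in these slides. The data and function names are made up.

```python
def fit_stump(xs, residuals):
    """Exhaustively search thresholds for the split with least squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - lmean) ** 2 for r in left)
                 + sum((r - rmean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x < t else rmean

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit stumps one after another, each to what the previous ones missed."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # The learning rate shrinks each stump's contribution
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [0, 0, 1, 1, 2, 3, 4, 5]   # made-up "times past due"
ys = [0, 0, 0, 1, 1, 1, 1, 1]   # 1 = delinquent
model = boost(xs, ys)
print(round(model(0), 2), round(model(5), 2))
```

A smaller learning rate makes each step more cautious, so more trees are needed to reach the same fit.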
92. Implementing Gradient Boosted Trees
In Python or R it is easy to implement for initial testing and validation
There are implementations that use Hadoop but it’s
more complicated to achieve the best performance
96. Gradient Boosting Trees performs well on the dataset
100 trees, 0.1 learning rate: 0.865022 KGI
1000 trees, 0.1 learning rate: 0.865248 KGI
[Figure: KGI (0.75 to 0.85) versus learning rate (0 to 0.8)]
[Figure: bar chart of accuracy (axis 0.4 to 0.9) for random, classification, random forests, and boosting trees]
97. Moving one step further in complexity
From simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting: 0.71-0.8659
Blended Method
104. Or more accurately an ensemble of
ensemble methods
Algorithm progression:
Random Forest
Extremely Random Forest
Gradient Tree Boosting
Train data probabilities (two columns shown):
0.1    0.15
0.5    0.6
0.01   0.0
0.8    0.75
0.7    0.68
...    ...
107. Combine all of the model information
Train data probabilities (two columns shown):
0.1    0.15
0.5    0.6
0.01   0.0
0.8    0.75
0.7    0.68
...    ...
Optimize the set of train probabilities
to the known delinquencies
Apply the same weighting scheme to the
set of test data probabilities
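One way to sketch the blending step in Python is shown below. The slides do not say how the weights are optimized, so the logistic-regression blender, the out-of-fold train probabilities, the scikit-learn library, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the customer data
X, y = make_classification(n_samples=1500, n_features=10, weights=[0.9],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                               random_state=0),
]

# One column of train probabilities per model, predicted out-of-fold so the
# blender never sees a probability produced by a model trained on that row
train_probs = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=3, method="predict_proba")[:, 1]
    for m in models
])

# Fit the weighting scheme against the known delinquencies
blender = LogisticRegression()
blender.fit(train_probs, y_train)

# Refit each model on all train rows, then apply the same weights to the
# test probabilities
test_probs = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in models
])
blended = blender.predict_proba(test_probs)[:, 1]
print("Blend weights:", blender.coef_[0])
```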
108. Implementation can be done in a number of ways
Testing in Python or R is slower, due to the sequential nature
of applying the algorithms
It could be made faster by parallelizing: running each algorithm separately
and combining the results
111. Assessing model performance
Blending Performance, 100 trees: 0.864394 KGI
But this performance and the possibility of
additional gains come at a distinct time cost.
[Figure: bar chart of accuracy (axis 0.4 to 0.9) for random, classification, random forests, boosting trees, and blended]
112. Examining the continuum of choices
From simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting: 0.71-0.8659
Blended Method: 0.864
117. What would be best to implement?
A large amount of optimization could still be
done on the blended method
However, this algorithm takes the longest to run.
This constraint will also apply in testing and validation
Random Forests returns a reasonably good result.
It is quick and easily parallelized
Gradient Tree Boosting returns the best result and
runs reasonably fast.
It is not as easily parallelized though
118. What would be best to implement?
Random Forests returns a reasonably good result.
It is quick and easily parallelized
Gradient Tree Boosting returns the best result and
runs reasonably fast.
It is not as easily parallelized though
120. Increases in predictive performance have real
business value
Using any of the more complex algorithms, we
achieve an increase of about 35 percentage points over
random chance (from 0.50 to roughly 0.85)
Potential decrease of ~$420k in losses by identifying
customers likely to default in the training set alone