6. Not Always a Problem
Switch   Room?
on       bright
on       bright
off      dark
on       bright
on       bright
on       bright
on       bright
off      dark
on       bright
(Bar chart: class counts, bright 7 vs dark 2)
Tightly Correlated
Switch on <=> bright
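A minimal sketch of why this is harmless (scikit-learn and a 0/1 encoding of the table are my assumptions, not the slide's): because the switch state perfectly determines the room, even a trivial model scores 100% despite the 7-vs-2 label split.

```python
from sklearn.tree import DecisionTreeClassifier

# Slide 6's table, encoded: feature 1 = switch on, 0 = off;
# label 1 = bright, 0 = dark.
X = [[1], [1], [0], [1], [1], [1], [1], [0], [1]]
y = [1, 1, 0, 1, 1, 1, 1, 0, 1]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.score(X, y))  # 1.0: the class imbalance never hurts, the feature decides
```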
7. When is it a problem?
Imagine: a fraud dataset with 100 rows… and only ONE fraud instance.
Forget building a model; just always return False.
This is 99% accurate!
…but the recall on the fraud class is 0% (fraud is never predicted, so its precision is undefined).
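A quick sketch of that baseline (hypothetical data; scikit-learn metrics assumed):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical fraud labels: 100 rows, exactly one fraud instance.
y_true = np.zeros(100, dtype=int)
y_true[0] = 1

# "Forget building a model": always return False (never predict fraud).
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))                    # 0.99
print(recall_score(y_true, y_pred))                      # 0.0, the one fraud is missed
print(precision_score(y_true, y_pred, zero_division=0))  # 0/0, reported here as 0
```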
8. What's the Problem?
Front Door   …   Robbed?
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   yes
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
Imagine: a dataset with 10 identical inputs and 9/10 identical outcomes.
What does the model learn? Front Door unlocked => no robbery, with slightly less than perfect confidence.
!!! IMPORTANT !!!
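A sketch of that "slightly less than perfect confidence" (scikit-learn assumed; the model and encoding are my choice, not the slide's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 10 identical inputs ("front door unlocked"), 9 "no" and 1 "yes".
X = np.ones((10, 1))
y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[1.0]]))  # roughly [0.9, 0.1]: "no robbery", not quite certain
```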
10. What's the Problem?
• The ML algorithm treats all instances equally.
• It does not know the relative cost of different outcomes, unless you tell it!
• This matters even if the classes are balanced: one class can still be more important to get right.
• No Free Lunch: there are ways to fix this, but there is always a tradeoff.
11. Sub-sampling
Front Door   …   Robbed?
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   yes
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
(Bar charts: class counts before, Not Robbed 9 vs Robbed 1; after sub-sampling, 1 each)
Throw out instances from the “over-represented” class, either randomly or using clustering.
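A minimal random variant, as a NumPy sketch (the slide also mentions clustering-based selection, not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sub_sample(X, y, majority=0):
    # Keep all minority rows; randomly drop majority rows down to the same count.
    maj = np.flatnonzero(y == majority)
    mino = np.flatnonzero(y != majority)
    keep = np.concatenate([rng.choice(maj, size=len(mino), replace=False), mino])
    return X[keep], y[keep]

# The robbery table: 10 identical inputs, 9 "no" (0) and 1 "yes" (1).
X = np.ones((10, 1))
y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
X_bal, y_bal = sub_sample(X, y)
print(y_bal)  # one 0 and one 1, as in the "after" chart
```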
13. Over-sampling
Front Door   …   Robbed?
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   yes
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
(Bar charts: class counts before, Not Robbed 9 vs Robbed 1; after over-sampling the lone “yes” row is counted 9x, giving 9 each)
Count instances from the “under-represented” class more than once.
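A NumPy sketch of the same idea (repetition by index cycling is my implementation choice):

```python
import numpy as np

def over_sample(X, y, minority=1):
    # Repeat minority rows ("count them more than once") to match the majority.
    mino = np.flatnonzero(y == minority)
    maj = np.flatnonzero(y != minority)
    keep = np.concatenate([maj, np.resize(mino, len(maj))])  # cycle minority indices
    return X[keep], y[keep]

X = np.ones((10, 1))
y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
X_bal, y_bal = over_sample(X, y)
print((y_bal == 0).sum(), (y_bal == 1).sum())  # 9 9: the lone "yes" row used 9 times
```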
15. Weighting
Front Door   …   Robbed?   weight
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   yes       1000
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
Tell the model engine which instances are more “important” to learn from.
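With scikit-learn (assumed here), the table's weight column maps directly onto the sample_weight argument of fit; the features are stand-ins for the elided "…" columns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in features; the table's real columns are elided ("…").
X = np.random.default_rng(0).normal(size=(10, 2))
y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

# Per-instance weights as in the table: 1000 on the lone "yes" row, 1 elsewhere.
w = np.where(y == 1, 1000.0, 1.0)
model = LogisticRegression().fit(X, y, sample_weight=w)
```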
16. Auto Balancing
Front Door   …   Robbed?   weight
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   yes       9
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
Tell the model engine to add weights so all instances have equal representation.
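In scikit-learn (again my assumption for the "model engine"), this is one parameter:

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" derives weights as n_samples / (n_classes * class_count):
# with 9 "no" and 1 "yes" that is 10/(2*9) vs 10/(2*1), a 1-to-9 ratio,
# matching the table's weight of 9 on the minority row.
model = LogisticRegression(class_weight="balanced")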
17. The Trade-off
Positive class: Fraud. Negative class: Not Fraud.
             Evaluation with no weighting   Evaluation with weighting
Accuracy     70%                            60%
Precision    50%                            43%
Recall       66%                            100%
(Diagram: test instances grouped into Classified Fraud vs Classified Not Fraud for each evaluation)
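The slide's numbers are consistent with a 10-instance test set containing 3 frauds; the predictions below are my reconstruction of such a set, used only to reproduce the metrics:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Implied 10-instance test set: 3 fraud (1), 7 not-fraud (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# No weighting: 2 of 3 frauds caught, 2 false alarms.
pred_unweighted = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
# With weighting: all 3 frauds caught, 4 false alarms.
pred_weighted = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

for name, pred in [("no weighting", pred_unweighted), ("weighting", pred_weighted)]:
    print(name,
          accuracy_score(y_true, pred),   # 0.70 / 0.60
          precision_score(y_true, pred),  # 0.50 / 0.43
          recall_score(y_true, pred))     # 0.67 / 1.00
```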
18. The Trade-off
• Weighting is typically a tradeoff between precision and recall.
• What to do depends on what is important in the “business” sense.
• There are some ways to optimize, e.g. the sketch below.
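The slide doesn't name these ways; one common option, shown purely as an illustration with hypothetical scores, is tuning the decision threshold along the precision/recall curve instead of (or in addition to) retraining with new weights:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical fraud probabilities from some trained model.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# Walk the curve and pick the operating point the "business" prefers,
# e.g. thresholds that keep precision at 0.5 or better.
ok = precision[:-1] >= 0.5
print(thresholds[ok], recall[:-1][ok])
```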
19. Sometimes Useful
feature_1   …   feature_n   label   weight
3.4         …   4           TRUE    1
6.7         …   5           FALSE   1
1.0         …   1           FALSE   1
5.5         …   23          TRUE    1
Force an unbalanced dataset to improve a model.