6. Not Always a Problem
Switch   Room?
on       bright
on       bright
off      dark
on       bright
on       bright
on       bright
on       bright
off      dark
on       bright
(Bar chart: class counts, bright 7 vs dark 2)
Tightly Correlated
Switch on <=> bright
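A minimal sketch of why this is harmless (scikit-learn and a 0/1 encoding of the table are my assumptions, not the slide's): because the switch state perfectly determines the room, even a trivial model scores 100% despite the 7-vs-2 label split.

```python
from sklearn.tree import DecisionTreeClassifier

# Slide 6's table, encoded: feature 1 = switch on, 0 = off;
# label 1 = bright, 0 = dark.
X = [[1], [1], [0], [1], [1], [1], [1], [0], [1]]
y = [1, 1, 0, 1, 1, 1, 1, 0, 1]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.score(X, y))  # 1.0: the class imbalance never hurts, the feature decides
```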
7. When is it a problem?
Imagine: a fraud dataset with 100 rows… and only ONE fraud instance.
Forget building a model; just always return False.
This is 99% accurate!
…but the recall on the fraud class is 0% (fraud is never predicted, so its precision is undefined).
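A quick sketch of that baseline (hypothetical data; scikit-learn metrics assumed):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical fraud labels: 100 rows, exactly one fraud instance.
y_true = np.zeros(100, dtype=int)
y_true[0] = 1

# "Forget building a model": always return False (never predict fraud).
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))                    # 0.99
print(recall_score(y_true, y_pred))                      # 0.0, the one fraud is missed
print(precision_score(y_true, y_pred, zero_division=0))  # 0/0, reported here as 0
```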
8. What's the Problem?
Front Door   …   Robbed?
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   yes
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
Imagine: a dataset with 10 identical inputs and 9/10 identical outcomes.
What does the model learn? Front Door unlocked => no robbery, with slightly less than perfect confidence.
!!! IMPORTANT !!!
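A sketch of that "slightly less than perfect confidence" (scikit-learn assumed; the model and encoding are my choice, not the slide's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 10 identical inputs ("front door unlocked"), 9 "no" and 1 "yes".
X = np.ones((10, 1))
y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[1.0]]))  # roughly [0.9, 0.1]: "no robbery", not quite certain
```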
10. What's the Problem?
• The ML algorithm treats all instances equally.
• It does not know the relative cost of different outcomes, unless you tell it!
• This matters even if the classes are balanced: one class can still be more important to get right.
• No Free Lunch: there are ways to fix this, but there is always a tradeoff.
11. Sub-sampling
Front Door   …   Robbed?
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   yes
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
(Bar charts: class counts before, Not Robbed 9 vs Robbed 1; after sub-sampling, 1 each)
Throw out instances from the “over-represented” class, either randomly or using clustering.
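A minimal random variant, as a NumPy sketch (the slide also mentions clustering-based selection, not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sub_sample(X, y, majority=0):
    # Keep all minority rows; randomly drop majority rows down to the same count.
    maj = np.flatnonzero(y == majority)
    mino = np.flatnonzero(y != majority)
    keep = np.concatenate([rng.choice(maj, size=len(mino), replace=False), mino])
    return X[keep], y[keep]

# The robbery table: 10 identical inputs, 9 "no" (0) and 1 "yes" (1).
X = np.ones((10, 1))
y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
X_bal, y_bal = sub_sample(X, y)
print(y_bal)  # one 0 and one 1, as in the "after" chart
```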
13. Over-sampling
Front Door   …   Robbed?
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   yes
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
(Bar charts: class counts before, Not Robbed 9 vs Robbed 1; after over-sampling the lone “yes” row is counted 9x, giving 9 each)
Count instances from the “under-represented” class more than once.
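A NumPy sketch of the same idea (repetition by index cycling is my implementation choice):

```python
import numpy as np

def over_sample(X, y, minority=1):
    # Repeat minority rows ("count them more than once") to match the majority.
    mino = np.flatnonzero(y == minority)
    maj = np.flatnonzero(y != minority)
    keep = np.concatenate([maj, np.resize(mino, len(maj))])  # cycle minority indices
    return X[keep], y[keep]

X = np.ones((10, 1))
y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
X_bal, y_bal = over_sample(X, y)
print((y_bal == 0).sum(), (y_bal == 1).sum())  # 9 9: the lone "yes" row used 9 times
```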
15. Weighting
Front Door   …   Robbed?   weight
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   yes       1000
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
Tell the model engine which instances are more “important” to learn from.
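With scikit-learn (assumed here), the table's weight column maps directly onto the sample_weight argument of fit; the features are stand-ins for the elided "…" columns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in features; the table's real columns are elided ("…").
X = np.random.default_rng(0).normal(size=(10, 2))
y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

# Per-instance weights as in the table: 1000 on the lone "yes" row, 1 elsewhere.
w = np.where(y == 1, 1000.0, 1.0)
model = LogisticRegression().fit(X, y, sample_weight=w)
```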
16. Auto Balancing
Front Door   …   Robbed?   weight
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   yes       9
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
Tell the model engine to add weights so all instances have equal representation.
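In scikit-learn (again my assumption for the "model engine"), this is one parameter:

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" derives weights as n_samples / (n_classes * class_count):
# with 9 "no" and 1 "yes" that is 10/(2*9) vs 10/(2*1), a 1-to-9 ratio,
# matching the table's weight of 9 on the minority row.
model = LogisticRegression(class_weight="balanced")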
17. The Trade-off
Positive class: Fraud. Negative class: Not Fraud.
             Evaluation with no weighting   Evaluation with weighting
Accuracy     70%                            60%
Precision    50%                            43%
Recall       66%                            100%
(Diagram: test instances grouped into Classified Fraud vs Classified Not Fraud for each evaluation)
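The slide's numbers are consistent with a 10-instance test set containing 3 frauds; the predictions below are my reconstruction of such a set, used only to reproduce the metrics:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Implied 10-instance test set: 3 fraud (1), 7 not-fraud (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# No weighting: 2 of 3 frauds caught, 2 false alarms.
pred_unweighted = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
# With weighting: all 3 frauds caught, 4 false alarms.
pred_weighted = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

for name, pred in [("no weighting", pred_unweighted), ("weighting", pred_weighted)]:
    print(name,
          accuracy_score(y_true, pred),   # 0.70 / 0.60
          precision_score(y_true, pred),  # 0.50 / 0.43
          recall_score(y_true, pred))     # 0.67 / 1.00
```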
18. The Trade-off
• Weighting is typically a tradeoff between precision and recall.
• What to do depends on what is important in the “business” sense.
• There are some ways to optimize, e.g. the sketch below.
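The slide doesn't name these ways; one common option, shown purely as an illustration with hypothetical scores, is tuning the decision threshold along the precision/recall curve instead of (or in addition to) retraining with new weights:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical fraud probabilities from some trained model.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# Walk the curve and pick the operating point the "business" prefers,
# e.g. thresholds that keep precision at 0.5 or better.
ok = precision[:-1] >= 0.5
print(thresholds[ok], recall[:-1][ok])
```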
19. Sometimes Useful
feature_1   …   feature_n   label   weight
3.4         …   4           TRUE    1
6.7         …   5           FALSE   1
1.0         …   1           FALSE   1
5.5         …   23          TRUE    1
Force an unbalanced dataset to improve a model.