Unbalanced Datasets
Poul Petersen
BigML
Unbalanced Dataset?
How Does it Happen?
Campus Population: Students, Faculty, Visitors
Consider: a Campus Survey by this guy gives a pretty BIASED SAMPLE
FIX: Re-sample
Sometimes it’s Reality
[Bar chart: Fraud vs. Not Fraud, axis 0 to 3000]
[Bar chart: Earthquake vs. No Earthquake, axis 0 to 500]
Not Always a Problem
Switch Room?
on bright
on bright
off dark
on bright
on bright
on bright
on bright
off dark
on bright
[Bar chart: bright vs. dark, axis 0 to 8]
Tightly Correlated: Switch on <=> bright
When is it a problem?
Imagine: a Fraud dataset with 100 rows… and only ONE fraud instance.
Forget building a model, just always return False: this is 99% Accurate!
…but the Recall of the fraud class is 0%, and its Precision is undefined (0/0, usually reported as 0%).
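The arithmetic behind the always-False baseline can be checked with a few lines of Python. This is a minimal sketch: the `labels` list and the `always_false` function are hypothetical stand-ins for the 100-row dataset described above. Because nothing is ever classified as fraud, precision is 0/0; the sketch reports it as 0.0 by convention.

```python
# Hypothetical data matching the slide: 100 rows, exactly one fraud instance.
labels = [True] + [False] * 99

def always_false(_row):
    """A 'model' that ignores its input and never predicts fraud."""
    return False

predictions = [always_false(i) for i in range(len(labels))]

# Confusion-matrix counts for the positive (fraud) class.
tp = sum(1 for p, y in zip(predictions, labels) if p and y)
fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
tn = sum(1 for p, y in zip(predictions, labels) if not p and not y)

accuracy = (tp + tn) / len(labels)
recall = tp / (tp + fn)
# Nothing was classified as fraud, so precision is 0/0; report 0.0 by convention.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```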
What’s the Problem?
Front Door … Robbed?
unlocked … no
unlocked … no
unlocked … no
unlocked … yes
unlocked … no
unlocked … no
unlocked … no
unlocked … no
unlocked … no
unlocked … no
Imagine: a dataset with 10 identical inputs and 9/10 identical outcomes.
What does the model learn?
Front Door unlocked? No Robbery, with slightly less than perfect confidence
!!! IMPORTANT !!!
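One way to see why the model settles on "No Robbery" is to count outcomes for the only input it has ever seen. The sketch below is an assumption for illustration, not the behaviour of any particular model engine: `predict` simply returns the most frequent outcome and its share of the matching rows.

```python
from collections import Counter

# The ten training rows from the slide: identical input, nine "no" and one "yes".
rows = [("unlocked", "no")] * 3 + [("unlocked", "yes")] + [("unlocked", "no")] * 6

def predict(front_door, training_rows):
    """Return the most common outcome among rows with this input, and its share."""
    outcomes = Counter(label for door, label in training_rows if door == front_door)
    label, count = outcomes.most_common(1)[0]
    return label, count / sum(outcomes.values())

prediction, confidence = predict("unlocked", rows)
print(prediction, confidence)
```

The single "yes" row only lowers the confidence; it never changes the predicted class.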
What’s the Problem?
• The ML algorithm treats all instances equally.
• It does not know the relative cost of different outcomes, unless you tell it!
• This is important even if the classes are balanced. One class can still be more important to get right.
• No Free Lunch: there are ways to fix this, but there is always a tradeoff.
Sub-sampling
Front Door … Robbed?
unlocked … no
unlocked … no
unlocked … no
unlocked … yes
unlocked … no
unlocked … no
unlocked … no
unlocked … no
unlocked … no
unlocked … no
[Bar chart: Robbed vs. Not Robbed, axis 0 to 9]
Throw out instances from the “over-represented” class, either randomly or using clustering
[Bar chart: Robbed vs. Not Robbed, axis 0 to 1]
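Random sub-sampling can be sketched as follows. The `subsample` helper and its `label_of` and `seed` parameters are hypothetical names chosen for this example; it covers only the random variant, not the clustering one.

```python
import random

def subsample(rows, label_of, seed=0):
    """Randomly discard rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    return balanced

# The ten rows from the slide: nine "no" and one "yes".
rows = [("unlocked", "no")] * 3 + [("unlocked", "yes")] + [("unlocked", "no")] * 6
balanced = subsample(rows, label_of=lambda row: row[1])
print(balanced)
```

On this dataset the result keeps a single row per class, which shows the cost of the approach: eight of the ten rows are thrown away.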
Over-sampling
Front Door … Robbed?
unlocked … no
unlocked … no
unlocked … no
unlocked … yes
unlocked … no
unlocked … no
unlocked … no
unlocked … no
unlocked … no
unlocked … no
[Bar chart: Robbed vs. Not Robbed, axis 0 to 9]
Count instances from the “under-represented” class more than once
[Bar chart: Robbed vs. Not Robbed, axis 0 to 1; annotation: 9 X]
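Over-sampling by plain duplication can be sketched like this. The `oversample` helper is a hypothetical name for illustration; it repeats existing rows and does not synthesize new ones.

```python
def oversample(rows, label_of):
    """Repeat rows of smaller classes until each class matches the largest class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        repeats, remainder = divmod(target, len(group))
        balanced.extend(group * repeats + group[:remainder])
    return balanced

# The ten rows from the slide: nine "no" and one "yes".
rows = [("unlocked", "no")] * 3 + [("unlocked", "yes")] + [("unlocked", "no")] * 6
balanced = oversample(rows, label_of=lambda row: row[1])
yes_count = sum(1 for _, label in balanced if label == "yes")
no_count = sum(1 for _, label in balanced if label == "no")
print(yes_count, no_count)
```

Here the single "yes" row is counted nine times, so both classes end up with nine rows.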
Weighting
Tell the model engine which instances are more “important” to learn from
Front Door … Robbed? weight
unlocked … no 1
unlocked … no 1
unlocked … no 1
unlocked … yes 1000
unlocked … no 1
unlocked … no 1
unlocked … no 1
unlocked … no 1
unlocked … no 1
unlocked … no 1
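The effect of an instance weight can be illustrated with a weighted vote over the outcomes. This is a sketch under the assumption that the learner sums weights instead of counting rows; `weighted_vote` is a hypothetical helper, not a model engine's API.

```python
def weighted_vote(rows):
    """rows: (label, weight) pairs. Return the winning label and its weight share."""
    totals = {}
    for label, weight in rows:
        totals[label] = totals.get(label, 0) + weight
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())

# Outcomes and weights from the slide: the single "yes" row carries weight 1000.
rows = [("no", 1)] * 3 + [("yes", 1000)] + [("no", 1)] * 6
winner, share = weighted_vote(rows)
print(winner, round(share, 3))
```

With a weight of 1000 the single robbery outweighs the nine other rows combined, so the prediction flips to "yes".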
Auto Balancing
Tell the model engine to add weights so all classes have equal representation
Front Door … Robbed? weight
unlocked … no 1
unlocked … no 1
unlocked … no 1
unlocked … yes 9
unlocked … no 1
unlocked … no 1
unlocked … no 1
unlocked … no 1
unlocked … no 1
unlocked … no 1
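One simple balancing scheme, shown here as an assumption and not necessarily what a given model engine does internally, gives each class a weight equal to the largest class count divided by its own count. The `balancing_weights` helper is a hypothetical name.

```python
from collections import Counter

def balancing_weights(labels):
    """Weight each class by (largest class count) / (its own count)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / count for label, count in counts.items()}

# Outcomes from the slide: nine "no" and one "yes".
labels = ["no"] * 3 + ["yes"] + ["no"] * 6
weights = balancing_weights(labels)

# Total weight carried by each class after balancing.
class_totals = {label: weights[label] * labels.count(label) for label in weights}
print(weights, class_totals)
```

This reproduces the weight of 9 on the "yes" row, and both classes then carry the same total weight.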
The Trade-off
Positive Class: Fraud
Negative Class: Not Fraud
Evaluation with no weighting: Accuracy = 70%, Precision = 50%, Recall = 66%
Evaluation with weighting: Accuracy = 60%, Precision = 43%, Recall = 100%
[Diagrams: instances marked Fraud or Not Fraud, divided into regions Classified Fraud and Classified Not Fraud]
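The two sets of figures can be reproduced from confusion-matrix counts. The counts below are an assumption: they are one set of values for ten instances (three fraud, seven not fraud) that matches the percentages on the slide, which does not state the counts itself.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall for the positive (fraud) class."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Assumed counts, chosen to be consistent with the percentages on the slide.
no_weighting = metrics(tp=2, fp=2, fn=1, tn=5)
with_weighting = metrics(tp=3, fp=4, fn=0, tn=3)

for name, result in [("no weighting", no_weighting), ("weighting", with_weighting)]:
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Under these counts, weighting catches the one fraud case that was missed, at the price of two more false alarms.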
The Trade-off
• Weighting is typically a tradeoff between precision and recall.
• What to do depends on what is important in the “business” sense.
• There are some ways to optimize
Sometimes Useful
Force an unbalanced dataset to improve a model

Start with equal weights:
feature_1 … feature_n label weight
3.4 … 4 TRUE 1
6.7 … 5 FALSE 1
1.0 … 1 FALSE 1
5.5 … 23 TRUE 1

Predict, mark each row correct or wrong, and choose a new weight:
feature_1 … feature_n label weight predict result new weight
3.4 … 4 TRUE 1 TRUE correct 0.5
6.7 … 5 FALSE 1 TRUE wrong 2
1.0 … 1 FALSE 1 FALSE correct 0.5
5.5 … 23 TRUE 1 FALSE wrong 2

Retrain on the re-weighted dataset:
feature_1 … feature_n label weight
3.4 … 4 TRUE 0.5
6.7 … 5 FALSE 2
1.0 … 1 FALSE 0.5
5.5 … 23 TRUE 2

Repeat … this is a type of Boosting
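A single re-weighting round from the tables above can be sketched as follows. The factors 0.5 and 2 are the ones shown on the slide; real boosting algorithms such as AdaBoost derive the factor from the model's error rate instead of fixing it, so treat this as an illustration only. The `reweight` helper is a hypothetical name.

```python
# Labels, current weights and predictions from the slide's four example rows.
rows = [
    {"label": True, "weight": 1.0, "predict": True},
    {"label": False, "weight": 1.0, "predict": True},
    {"label": False, "weight": 1.0, "predict": False},
    {"label": True, "weight": 1.0, "predict": False},
]

def reweight(rows, correct_factor=0.5, wrong_factor=2.0):
    """Shrink the weight of correctly predicted rows and grow the weight of errors."""
    return [
        row["weight"] * (correct_factor if row["predict"] == row["label"] else wrong_factor)
        for row in rows
    ]

new_weights = reweight(rows)
print(new_weights)
```

Training the next model on these weights makes it focus on the rows the previous model got wrong.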

L6. Unbalanced Datasets
