FeatureSelection
February 15, 2020
1 Feature Selection
A discussion that often comes up in applied Machine Learning work is whether and how to
perform feature selection. In this post, I will consider two standard justifications offered for
doing so and evaluate whether they make sense. In many ways, this discussion centers on one of
the core tradeoffs in Supervised Learning: does increasing predictive accuracy come at the expense
of reducing interpretability?
1.1 Improve model accuracy?
The typical way the bias-variance tradeoff is introduced in textbooks and courses is in the context of
linear regression. The story goes as follows: you can reduce in-sample error to an arbitrarily low
level by increasing the number of parameters in your model. However, when you try
to use the same model to predict out of sample, your accuracy is going to be much lower. This
is because the extra parameters get tuned to the in-sample noise, and when you get data that
doesn’t contain the same noise they don’t work so well. The suggested remedy is to regularize your
model using ridge, lasso, or a combination of the two called elastic net. Regularization proceeds
by shrinking the coefficients of certain variables to very small values (ridge and elastic net) or to zero
(lasso) by imposing a constraint on how large the L2 norm (sum of squared coefficients) or L1 norm
(sum of absolute coefficients) can get.
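As a minimal sketch of these three regularizers in scikit-learn (the synthetic data and the alpha values here are illustrative placeholders, not from the original analysis):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic data with many uninformative features, to give regularization something to do.
X, y = make_regression(n_samples=500, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "ridge (L2)": Ridge(alpha=1.0),                       # shrinks coefficients toward zero
    "lasso (L1)": Lasso(alpha=0.1),                       # drives some coefficients exactly to zero
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),   # mixes the L1 and L2 penalties
}
for name, model in models.items():
    model.fit(X_train, y_train)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{name}: test R^2 = {r2_score(y_test, model.predict(X_test)):.3f}, "
          f"coefficients set to zero = {n_zero}")
```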
But is this true for non-parametric models like Random Forests as well? Although I
wasn’t able to find any formal work that addresses this specific question (suggestions welcome), it’s
possible that it isn’t. One reason might be that during the fitting of each tree, only a random subset
of the variables is considered at each split. The overfitting problem in the context of Random Forests
comes from growing trees that are too deep or allowing too few samples to fall into each leaf. This
can be dealt with by ensembling many trees together so that the variance of the overall estimator
is smaller than that of any individual tree.
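As a rough illustration of these controls (not the post's code), the sketch below contrasts a single unconstrained tree with an ensemble of many trees whose depth and leaf size are limited:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# A single fully grown tree vs. a forest of many trees with a minimum leaf size.
# max_features="sqrt" (the classifier default) is what injects the per-split
# random subsetting of variables mentioned above.
single_tree = RandomForestClassifier(n_estimators=1, max_depth=None,
                                     min_samples_leaf=1, random_state=0)
forest = RandomForestClassifier(n_estimators=300, max_depth=None,
                                min_samples_leaf=5, max_features="sqrt",
                                random_state=0)

for name, clf in [("single deep tree", single_tree), ("forest of 300 trees", forest)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```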
I trained a Random Forest classifier, following the standard Machine Learning workflow, on a
dataset from a Portuguese bank that records the effect of telemarketing campaigns on whether
contacted customers subscribed to the product being marketed. I tried the model with the following
four variations:
• a base model without variable selection
• using Variance Inflation factor for variable selection
• using Hierarchical Clustering for variable selection
• using a mix of the two, where numerical variables were selected by VIF and categorical
variables by Hierarchical Clustering
Four metrics obtained from these models are presented below.
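For reference, a minimal sketch of what the base-model fit might look like. This is an assumption about the workflow, not the post's exact code; the file name (bank-full.csv), the ';' separator, the 'y' target column, and the metric choices are all assumptions about the UCI Bank Marketing data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assumed file and schema: 'y' is "yes"/"no" for whether the customer subscribed.
df = pd.read_csv("bank-full.csv", sep=";")
y = (df["y"] == "yes").astype(int)
X = pd.get_dummies(df.drop(columns=["y"]))   # one-hot encode the categorical columns

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=300, min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

for name, metric in [("accuracy", accuracy_score), ("precision", precision_score),
                     ("recall", recall_score), ("f1", f1_score)]:
    print(name, round(metric(y_test, pred), 3))
```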
1.2 What is Collinearity?
• In the most basic sense, a variable is considered collinear if it can be written as a linear
combination of other variables. In the Linear Regression world this becomes a problem because
it blows up your standard errors: it is not possible to attribute variation in the output
variable to the collinear variables based on the given data alone. In some ways this is a
problem with the dataset, and people worrying about it are confusing a property of the
dataset with the properties of the model. For a more comprehensive discussion consider
reading this and this.
• Variance Inflation Factor (VIF) is a metric that quantifies how much of the variation
in one variable is explained by the other covariates. It is obtained by regressing each
variable on the complement set and computing the R-squared of each regression; the VIF of
variable i is defined as 1 / (1 − R_i²). Typical feature selection routines drop variables
whose VIF exceeds some threshold (see the first sketch after this list).
• Hierarchical Clustering using Spearman’s Rank Correlation lets us learn about dependencies
between categorical as well as numerical features, and captures monotonic (not just linear)
dependencies. A typical routine first fits a model, gets variable importances, gets hierarchical
cluster memberships, and then drops the least important members of each cluster (see the second
sketch after this list).
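A minimal sketch of the VIF computation, assuming statsmodels is available; the drop threshold of 5 is an illustrative choice, not the post's value:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X_num: pd.DataFrame) -> pd.Series:
    """Compute VIF_i = 1 / (1 - R_i^2) for each numeric column."""
    X = sm.add_constant(X_num)   # include an intercept in each auxiliary regression
    return pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
        index=X_num.columns, name="VIF")

# e.g. flag columns whose VIF exceeds an (assumed) threshold of 5
# to_drop = vif_table(X_numeric).loc[lambda s: s > 5].index
```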
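And a sketch of the Spearman-based hierarchical clustering step, assuming `importances` comes from an already fitted model; the linkage method and the distance threshold are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def cluster_representatives(X: pd.DataFrame, importances: pd.Series,
                            threshold: float = 0.3) -> list:
    """Cluster features on 1 - |Spearman rho|; keep each cluster's most important member."""
    corr = spearmanr(X).correlation
    corr = (corr + corr.T) / 2            # enforce symmetry
    np.fill_diagonal(corr, 1.0)
    dist = 1 - np.abs(corr)               # small distance = strongly dependent features
    linkage = hierarchy.ward(squareform(dist))
    labels = hierarchy.fcluster(linkage, t=threshold, criterion="distance")
    keep = []
    for cluster_id in np.unique(labels):
        members = X.columns[labels == cluster_id]
        keep.append(importances[members].idxmax())   # retain the most important member
    return keep
```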
[Figure: Comparison of various feature selection methods on classification metrics]
Looking at the results above, it doesn’t appear that variable selection improves model perfor-
mance. In fact, it seems that having more variables results in better performance. Of course, this is
just one dataset, and a more comprehensive assessment would repeat the same process over several
datasets.
1.3 Improve Interpretability?
Unless you’re building a system where accuracy is all that matters, you don’t care about accuracy
alone: models also need to be interpretable. Interpretability means different things to different people,
and several different use cases are commonly lumped together. These might be as follows:
• the end user should be able to understand how the model arrived at a prediction
• the end user should be able to trust that the model is giving the right amount of importance
to the right variables in arriving at a prediction
• the modeler should be able to debug the model if it starts making predictions that don’t seem
correct
• the end user should be able to derive recommendations for actions from the model.
Having fewer variables in a model helps on all four counts but it doesn’t completely address all
these issues.
• Business recommendations could be derived from such a model by stratifying the population
based on the features the model considers important for prediction, and then applying business
rules to take the action that optimizes the business metric under consideration. If these
variables are non-overlapping, this procedure is probably easier to apply.
• Having fewer and uncorrelated variables doesn’t by itself shed any light on the mechanism for
arriving at the prediction.
• Having fewer and uncorrelated variables changes what the Random Forest’s default importance
measure reports. Below are the default variable importance measures from the models:
[Figure: Variable Importance Comparison]
The base model considers the duration of the call to be the most important variable, but duration is
not known before a call is made, and moreover, once the call ends the outcome is already known.
Including this variable in the model is an example of data leakage. Below are the feature importances
for all four cases, obtained after removing ‘duration’ from the dataset.
[Figure: Variable Importance Comparison]
The three models that implement feature selection consider the categorical feature ‘loan’,
indicating whether the person has a personal loan, to be the most important feature, followed
by the person’s marital status and the type of communication method used, while the base model
considers the balance in their account to be the most important variable, followed by their age and
the date on which they were contacted. This is confusing: two different people using different
variable importance measures and interpreting them as actionable insights might end up taking
completely different actions. So which one should we trust?
In this in-depth study of default variable importances for Random Forests, it was found
that default variable importances can be biased, especially for features that vary in their scale
of measurement or in the number of their categories. Instead, the authors recommend permutation
feature importance: permute a feature’s values randomly and see how much predictive accuracy
drops. For a more comprehensive discussion please read the article.
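A minimal sketch of permutation importance with scikit-learn; `clf`, `X_test`, and `y_test` are assumed to come from a fit like the one sketched earlier, and the scoring choice is illustrative:

```python
import pandas as pd
from sklearn.inspection import permutation_importance

result = permutation_importance(clf, X_test, y_test, n_repeats=10,
                                random_state=0, scoring="f1")
perm_importances = pd.Series(result.importances_mean, index=X_test.columns)
print(perm_importances.sort_values(ascending=False).head(10))
```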
Feature importances only tell you which variables a model considered important. They tell you
neither the magnitude of the output’s dependence on a feature nor its direction. To obtain
these kinds of partial dependences, you might want to look into the interpretability
literature and consider methods such as Partial Dependence Plots, LIME, and SHAP values. To
derive a recommendation from this model, you would want to think about the kinds of actions you
could take and what effects they might have on the outcome, but this requires estimating
the counterfactual, i.e. making predictions under intervention, and that is a totally different analysis
altogether.
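For instance, a partial dependence plot can be produced directly with scikit-learn (recent versions); the feature names here ('age', 'balance') are illustrative placeholders, and `clf` and `X_test` are again assumed from the earlier sketch:

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Shows how the predicted probability varies, on average, with each feature.
PartialDependenceDisplay.from_estimator(clf, X_test, features=["age", "balance"])
plt.tight_layout()
plt.show()
```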
1.4 In conclusion:
1) Feature selection methods may not give you a lift in accuracy.
2) They reduce the number of features and decorrelate them but they don’t help you interpret
the model in any useful way for making actionable business recommendations.
3) The main reason to interpret models is to make causal inferences, and standard remedies for
measuring collinearity won’t help you do that. If all you care about is prediction, you can
just use regularization. If you want to make causal inferences about the effects of a variable,
it is useful to notice that you don’t have much variation in that variable conditional on another
variable, but you should still condition on it.
All the code that goes with this post is available on this github repository.