How to get better performance with less data
Thomas Huijskens, Senior Data Scientist
PyData London, 28th April 2018
It pays off to do feature selection as part of the model development process

Feature collinearity and scarcity of data mean we can't just give a model many features and let it decide which ones are useful and which ones are not. There are multiple reasons to do feature selection when developing machine learning models:
• Computational burden: limiting the number of features reduces the computational cost of running the learning algorithm.
• Risk of overfitting: removing noisy or presumably redundant variables can improve class separation and reduce the risk of overfitting.
• Interpretability: removing redundant variables from the input data makes the results easier to interpret, for the seasoned practitioner as well as for business stakeholders.
What are the components of a good feature selection algorithm?

Feature selection algorithms should:
• remove variables that contain only redundant information about the target variable; and
• reduce the overlap in information between the variables in the subset of selected features.

A good feature selection algorithm also shouldn't look at variables purely in isolation:
• Two variables that are useless by themselves can be useful together.
• Very high variable correlation (or anti-correlation) does not imply an absence of variable complementarity.
Two variables that are useless by themselves can be useful together [1]

[1] Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157-1182.
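A small illustration of this point (my own, not from the slides): with an XOR-style target, each input alone carries (near-)zero mutual information with the class, while the pair determines it exactly.

```python
# Illustration: each XOR input alone has ~zero mutual information with y,
# but the pair of inputs determines y exactly.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.RandomState(0)
x1 = rng.randint(0, 2, size=100_000)
x2 = rng.randint(0, 2, size=100_000)
y = x1 ^ x2

print(mutual_info_score(x1, y))           # ~0 nats: useless by itself
print(mutual_info_score(x2, y))           # ~0 nats: useless by itself
print(mutual_info_score(2 * x1 + x2, y))  # ~log(2) nats: useful together
```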
Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity [1]
Feature selection algorithms can be divided into three categories

[Diagram: three flow charts. Wrapper methods: set of all features → generate a subset → learning algorithm + performance. Filter methods: set of all features → generate a subset → learning algorithm → performance. Embedded methods: set of all features → subset selection performed within the learning algorithm.]

The three categories are wrapper methods, filter methods, and embedded methods.
Wrapper methods

Wrapper models use learning algorithms on the original data, and assess the features by the performance of the learning algorithm.
Mlxtend is an open-source Python package that implements multiple
wrapper methods
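A minimal sketch of greedy forward selection with mlxtend's SequentialFeatureSelector; the estimator, scoring metric, and subset size below are illustrative choices, not ones prescribed by the talk.

```python
# Greedy forward selection with mlxtend: feature subsets are scored by
# the cross-validated performance of the learning algorithm itself.
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    k_features=10,      # stop once 10 features are selected
    forward=True,       # add one feature at a time
    floating=False,     # plain greedy search, no backtracking
    scoring="roc_auc",  # subsets judged by model performance
    cv=5,               # cross-validated, to reduce overfitting to one split
)
selector.fit(X, y)

print(selector.k_feature_idx_)  # indices of the selected features
print(selector.k_score_)        # cross-validated score of that subset
```

Setting floating=True gives the floating variants, which allow previously added features to be dropped again.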
Wrapper methods
Advantages
• Usually provide the best-performing feature set for that particular type of model.
Disadvantages
• Wrapper methods may generate feature sets that are overly specific to the learner used.
• As wrapper methods train a new model for each candidate subset, they are very computationally intensive.

Filter methods
Filter models do not use a learner on the original data, but only consider statistical characteristics of the data set.
Filter methods example – mutual information

The mutual information quantifies the amount of information obtained about one random variable through another random variable. For two variables $X$ and $Y$, the mutual information is given by

$$ I(X; Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy. $$

It measures how similar the joint distribution $p(x, y)$ is to the product of the marginal distributions $p(x)\, p(y)$.
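In the discrete case the integrals become sums, so the definition is easy to check numerically; here is a sketch with a small hypothetical joint distribution.

```python
# Worked check of the definition for discrete variables (sums, not integrals).
import numpy as np

# Hypothetical joint distribution p(x, y) of two binary variables.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y)

# I(X; Y) = sum_x sum_y p(x, y) * log( p(x, y) / (p(x) p(y)) )
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mi)  # > 0 here; it is 0 exactly when p(x, y) = p(x) p(y)
```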
Filter methods example – maximizing joint mutual information

In the feature selection problem, we would like to maximise the mutual information between the selected variables $X_S$ and the target $Y$:

$$ S^{*} = \arg\max_{S} \; I(X_S; Y) \quad \text{s.t.} \quad |S| = k, $$

where $k$ is the number of features we want to select.

This is an NP-hard problem, as the number of possible feature subsets grows exponentially.
Filter methods example – maximizing joint mutual information

A popular heuristic in the literature is to use a greedy forward selection method, where features are selected incrementally, one feature at a time.

Let $S^{t-1} = \{x_{f_1}, \dots, x_{f_{t-1}}\}$ be the set of selected features at time step $t - 1$. The greedy method selects the next feature $f^t$ such that

$$ f^t = \arg\max_{i \notin S^{t-1}} \; I\!\left(X_{S^{t-1} \cup \{i\}};\, Y\right). $$
Filter methods example – maximizing joint mutual information

One can show (proof omitted here) that this is equivalent to the following:

$$ f^t = \arg\max_{i \notin S^{t-1}} \; I(x_i; Y) - \left[ I(x_i; X_{S^{t-1}}) - I(x_i; X_{S^{t-1}} \mid Y) \right]. $$

However, the quantities involving $X_{S^{t-1}}$ quickly become computationally intractable, because they are $(t-1)$-dimensional integrals!
Mutual information based measures trade off the relevancy of a variable against the redundancy of the information the variable contains

We can use an approximation to the multidimensional integrals to make the computation more tractable:

$$ f^t = \arg\max_{i \notin S^{t-1}} \; \underbrace{I(x_i; Y)}_{\text{relevancy}} \; - \; \underbrace{\left[ \beta \sum_{j=1}^{t-1} I(x_{f_j}; x_i) - \gamma \sum_{j=1}^{t-1} I(x_{f_j}; x_i \mid Y) \right]}_{\text{redundancy}}, $$

where $\beta$ and $\gamma$ are to be specified. This greedy algorithm parametrizes a family of mutual information based feature selection algorithms. The most prominent members of this family are:
1. Joint Mutual Information (JMI): $\beta = \gamma = \frac{1}{t-1}$.
2. Maximum relevancy minimum redundancy (MRMR): $\beta = \frac{1}{t-1}$ and $\gamma = 0$.
3. Mutual information maximisation (MIM): $\beta = \gamma = 0$.
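As a hedged sketch (my own illustration, not the speaker's implementation), the greedy family above can be written for discrete features as follows; the per-class averaging used to estimate the conditional mutual information is a simplifying assumption.

```python
# Greedy mutual-information feature selection for discrete features.
# beta = gamma = 1 gives JMI, gamma = 0 gives MRMR, beta = gamma = 0 gives
# MIM (the 1/(t-1) factor is applied by dividing by the subset size).
import numpy as np
from sklearn.metrics import mutual_info_score

def conditional_mi(a, b, y):
    """Estimate I(a; b | y): average I(a; b) within each class of y."""
    n = len(y)
    return sum(
        (y == cls).sum() / n * mutual_info_score(a[y == cls], b[y == cls])
        for cls in np.unique(y)
    )

def greedy_mi_selection(X, y, k, beta=1.0, gamma=1.0):
    selected = []
    relevancy = [mutual_info_score(X[:, i], y) for i in range(X.shape[1])]
    while len(selected) < k:
        best_i, best_score = None, -np.inf
        for i in range(X.shape[1]):
            if i in selected:
                continue
            # Redundancy of candidate i against the already selected set.
            redundancy = sum(
                beta * mutual_info_score(X[:, j], X[:, i])
                - gamma * conditional_mi(X[:, j], X[:, i], y)
                for j in selected
            )
            score = relevancy[i] - redundancy / max(len(selected), 1)
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected
```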
There are many open-source Python modules available that do filter-based
feature selection
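scikit-learn is one such module; a minimal filter using its built-in mutual information scorer implements the MIM criterion from the previous slide (the choice of k is illustrative).

```python
# Filter selection: keep the k features with the highest estimated mutual
# information with the target (MIM, i.e. beta = gamma = 0 above).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=mutual_info_classif, k=10)  # k is a choice
X_reduced = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the kept features
```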
Filter methods
Advantages
• Typically scale better to high-dimensional data sets than wrapper methods.
• Independent of the learning algorithm.
Disadvantages
• Ignore the interaction with the learning algorithm.
• Often employ lower-dimensional approximations to make computations more tractable, which means they may ignore interactions between different features.

Embedded methods
Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process.
Embedded methods example – stability selection

• Stability selection wraps around a base learning algorithm that has a parameter controlling the amount of regularization.
• For every value of this parameter, we can get an estimate of which variables to select.
• Stability selection runs the learner on many bootstrap samples of the original data set, and keeps track of which variables get selected in every sample to form a set of 'stable' variables.

[Diagram: for each bootstrap sample and each value of the penalization parameter — generate a bootstrap sample → estimate the LASSO on the bootstrapped sample → record the features that get selected. Then compute the posterior probability of inclusion and select the set of 'stable' features.]
Stability selection is straightforward to implement in Python, and mature implementations exist for both Python and R

[Code snippet: iterate over the penalization parameter and bootstrap samples.]
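A minimal sketch of that loop, assuming a LASSO base learner; the penalization grid, bootstrap count, and 0.8 stability threshold are illustrative choices (the original stability selection paper subsamples half the data without replacement, whereas this follows the deck's bootstrap description).

```python
# Stability selection sketch: fit the LASSO over bootstrap samples and a
# grid of penalization parameters; keep features that are selected often.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=1.0, random_state=0)
n_samples, n_features = X.shape

alphas = np.logspace(-2, 1, 20)  # grid of penalization parameters
n_bootstraps = 100
rng = np.random.RandomState(0)
counts = np.zeros((len(alphas), n_features))

# Iterate over the penalization parameter and bootstrap samples, recording
# which coefficients the LASSO sets to non-zero in each fit.
for i, alpha in enumerate(alphas):
    for _ in range(n_bootstraps):
        idx = rng.choice(n_samples, size=n_samples, replace=True)
        coef = Lasso(alpha=alpha, max_iter=10_000).fit(X[idx], y[idx]).coef_
        counts[i] += coef != 0

# Probability of inclusion per feature, maximized over the grid.
stability = (counts / n_bootstraps).max(axis=0)
stable_features = np.where(stability >= 0.8)[0]  # threshold is a choice
print(stable_features)
```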
Embedded methods
Advantages
• Take the interaction between the feature subset search and the learning algorithm into account.
Disadvantages
• Computationally more expensive than filter methods.
What type of algorithm should I use in practice?

Each of these three approaches has its advantages and disadvantages, the primary distinguishing factors being speed of computation and the chance of overfitting:
• In terms of speed, filters are faster than embedded methods, which are in turn faster than wrappers.
• In terms of overfitting, wrappers have a higher learning capacity and so are more likely to overfit than embedded methods, which in turn are more likely to overfit than filter methods.
All of this of course changes at the extremes of data and feature availability.