Understanding variable importances
in forests of randomized trees

Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts
Dept. of EE & CS, & GIGA-R
Université de Liège, Belgium
June 6, 2014
Ensembles of randomized trees

[Figure: an ensemble of randomized trees whose individual predictions are combined by majority vote or prediction averaging.]

Ensembles of randomized trees improve standard classification and regression trees by reducing their variance.

Many examples: Bagging (Breiman, 1996), Random Forests (Breiman, 2001), Extremely randomized trees (Geurts et al., 2006).

Standard Random Forests: bootstrap sampling + random selection of K features at each node.
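As a side illustration (not part of the original slides), the following scikit-learn sketch contrasts the two schemes just mentioned; the dataset, n_estimators and the choice K = 3 are arbitrary assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Toy data; any tabular classification task would do.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Standard Random Forest: bootstrap sampling + K randomly selected
# candidate features at each node (K = 3 here, via max_features).
rf = RandomForestClassifier(n_estimators=100, max_features=3,
                            bootstrap=True, random_state=0).fit(X, y)

# Extremely randomized trees: no bootstrap by default, and cut-points
# are drawn at random as well.
et = ExtraTreesClassifier(n_estimators=100, max_features=3,
                          random_state=0).fit(X, y)

# The ensemble prediction is a majority vote / average over the trees.
print(rf.score(X, y), et.score(X, y))
```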
Strengths and weaknesses
Universal approximation
Robustness to outliers
Robustness to irrelevant attributes (to some extent)
Invariance to scaling of inputs
Good computational efficiency and scalability
Very good accuracy
Loss of interpretability w.r.t. single decision trees
Variable importances

Interpretability can be recovered through variable importances.

Two main importance measures:
• The mean decrease of impurity (MDI): summing the total impurity reductions at all tree nodes where the variable appears (Breiman et al., 1984);
• The mean decrease of accuracy (MDA): measuring the accuracy reduction on out-of-bag samples when the values of the variable are randomly permuted (Breiman, 2001).

We focus here on MDI because:
• it is faster to compute;
• it does not require bootstrap sampling;
• in practice, it correlates well with the MDA measure (except in specific conditions); a small empirical comparison is sketched below.
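The comparison mentioned in the last bullet can be illustrated with scikit-learn (my sketch, not from the slides): MDI is exposed as feature_importances_, and an MDA-style score can be obtained with sklearn.inspection.permutation_importance. The dataset and parameters are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# MDI: mean decrease of impurity, accumulated during training.
mdi = forest.feature_importances_

# MDA-style score: drop in accuracy when each feature is permuted.
# (Computed on the training set for brevity; out-of-bag or held-out data
# would be closer to Breiman's original proposal.)
mda = permutation_importance(forest, X, y, n_repeats=10,
                             random_state=0).importances_mean

print(np.corrcoef(mdi, mda)[0, 1])  # usually strongly correlated
```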
Mean decrease of impurity

The importance of variable $X_j$ for an ensemble of $M$ trees $\varphi_m$ is

$$\mathrm{Imp}(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} \mathbb{1}(j_t = j)\, p(t)\, \Delta i(t), \qquad (1)$$

where $j_t$ denotes the variable used at node $t$, $p(t) = N_t/N$, and $\Delta i(t)$ is the impurity reduction at node $t$:

$$\Delta i(t) = i(t) - \frac{N_{t_L}}{N_t}\, i(t_L) - \frac{N_{t_R}}{N_t}\, i(t_R) \qquad (2)$$

The impurity $i(t)$ can be the Shannon entropy, the Gini index, the variance, ...
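As a concrete (and hypothetical, not from the slides) illustration of Eq. (1)-(2), this sketch recomputes MDI importances from the internals of a fitted scikit-learn forest. Note that scikit-learn uses binary splits, whereas the analysis below assumes multi-way splits, and the result agrees with feature_importances_ only up to per-tree normalization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def mdi_importances(forest, n_features):
    """Recompute Eq. (1): average over trees of sum_t 1(j_t = j) p(t) delta_i(t)."""
    imp = np.zeros(n_features)
    for estimator in forest.estimators_:
        tree = estimator.tree_
        n_root = tree.weighted_n_node_samples[0]          # N
        for t in range(tree.node_count):
            left, right = tree.children_left[t], tree.children_right[t]
            if left == -1:                                 # leaf: no split at t
                continue
            n_t = tree.weighted_n_node_samples[t]
            p_t = n_t / n_root                             # p(t) = N_t / N
            delta_i = (tree.impurity[t]                    # Eq. (2), binary splits
                       - tree.weighted_n_node_samples[left] / n_t * tree.impurity[left]
                       - tree.weighted_n_node_samples[right] / n_t * tree.impurity[right])
            imp[tree.feature[t]] += p_t * delta_i
    return imp / len(forest.estimators_)

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(mdi_importances(forest, X.shape[1]))
print(forest.feature_importances_)  # similar ranking; scikit-learn normalizes per tree
```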
Motivation and assumptions

$$\mathrm{Imp}(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} \mathbb{1}(j_t = j)\, p(t)\, \Delta i(t)$$

MDI works well, but it is not well understood theoretically; we would like to better characterize it and derive its main properties from this characterization.

Working assumptions:
• All variables are discrete;
• Multi-way splits à la C4.5 (i.e., one branch per value);
• Shannon entropy as impurity measure (a small sketch of these two ingredients follows this list):
  $$i(t) = -\sum_{c} \frac{N_{t,c}}{N_t} \log \frac{N_{t,c}}{N_t}$$
• Totally randomized trees (RF with K = 1);
• Asymptotic conditions: N → ∞, M → ∞.
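A minimal sketch (mine, not from the slides) of the entropy impurity and of the impurity decrease for a multi-way split with one branch per value, using base-2 logarithms:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy i(t) = -sum_c (N_{t,c}/N_t) log2(N_{t,c}/N_t), in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def impurity_decrease(x, y):
    """Delta i(t) for a multi-way split of node samples (x, y) on a discrete
    variable: one child t_v per value v of x, weighted by N_{t_v} / N_t."""
    n = len(y)
    children = {}
    for xv, yv in zip(x, y):
        children.setdefault(xv, []).append(yv)
    return entropy(y) - sum(len(ys) / n * entropy(ys) for ys in children.values())

# Example: a variable that fully determines y removes all impurity.
y = [0, 0, 1, 1]
print(impurity_decrease([0, 0, 1, 1], y))  # 1.0 bit = H(Y)
print(impurity_decrease([0, 1, 0, 1], y))  # 0.0: independent of y
```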
Result 1: Three-level decomposition

Variable importances provide a three-level decomposition of the information jointly provided by all the input variables about the output, accounting for all interaction terms in a fair and exhaustive way.

i) Decomposition in terms of the MDI importance of each input variable: the information jointly provided by all input variables about the output satisfies

$$I(X_1, \ldots, X_p; Y) = \sum_{j=1}^{p} \mathrm{Imp}(X_j)$$

ii) Decomposition along the degrees k of interaction with the other variables, and iii) decomposition along all interaction terms B of a given degree k:

$$\mathrm{Imp}(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-j})} I(X_j; Y \mid B)$$

E.g., for p = 3:

$$\mathrm{Imp}(X_1) = \tfrac{1}{3} I(X_1; Y) + \tfrac{1}{6}\big(I(X_1; Y \mid X_2) + I(X_1; Y \mid X_3)\big) + \tfrac{1}{3} I(X_1; Y \mid X_2, X_3)$$
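A brute-force sketch of this decomposition (my own illustration, not part of the slides): Imp(Xj) is computed directly from empirical conditional mutual informations, in bits, summed over all conditioning sets B of every size k. It is exponential in p, so it only suits small examples; it is reused on the Breiman et al. illustration below.

```python
import itertools
from collections import Counter
from math import comb, log2

import numpy as np

def cond_mutual_info(xj, y, B_cols):
    """Empirical I(Xj; Y | B) in bits, for discrete xj, y and an (n, |B|)
    array of conditioning variables (possibly with zero columns)."""
    n = len(y)
    b_keys = [tuple(row) for row in B_cols]

    def cond_entropy(values, cond):
        joint = Counter(zip(values, cond))
        marg = Counter(cond)
        return -sum(c / n * log2(c / marg[b]) for (_, b), c in joint.items())

    # I(Xj; Y | B) = H(Xj | B) + H(Y | B) - H(Xj, Y | B)
    return (cond_entropy(xj, b_keys) + cond_entropy(y, b_keys)
            - cond_entropy(list(zip(xj, y)), b_keys))

def mdi_from_decomposition(X, y, j):
    """Imp(Xj) = sum_k 1 / (C(p, k) * (p - k)) * sum_{B in P_k(V^{-j})} I(Xj; Y | B)."""
    p = X.shape[1]
    others = [i for i in range(p) if i != j]
    imp = 0.0
    for k in range(p):
        for B in itertools.combinations(others, k):
            imp += cond_mutual_info(X[:, j], y, X[:, list(B)]) / (comb(p, k) * (p - k))
    return imp
```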
Illustration (Breiman et al., 1984)

[Figure: the seven binary input variables laid out as the segments of a digit display: X1 on top, X2 and X3 on the upper left/right, X4 in the middle, X5 and X6 on the lower left/right, X7 at the bottom.]

y  x1 x2 x3 x4 x5 x6 x7
0  1  1  1  0  1  1  1
1  0  0  1  0  0  1  0
2  1  0  1  1  1  0  1
3  1  0  1  1  0  1  1
4  0  1  1  1  0  1  0
5  1  1  0  1  0  1  1
6  1  1  0  1  1  1  1
7  1  0  1  0  0  1  0
8  1  1  1  1  1  1  1
9  1  1  1  1  0  1  1
Illustration (Breiman et al., 1984)

$$\mathrm{Imp}(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-j})} I(X_j; Y \mid B)$$

Var  Imp
X1   0.412
X2   0.581
X3   0.531
X4   0.542
X5   0.656
X6   0.225
X7   0.372
Σ    3.321

[Figure: bar chart decomposing each Imp(Xj) over the interaction degrees k = 0, ..., 6.]
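As a hypothetical check (not in the slides): encoding the ten examples above and evaluating the decomposition with the mdi_from_decomposition sketch given earlier should reproduce the table, since the ten equiprobable rows define the full joint distribution and the total importance equals H(Y) = log2(10) ≈ 3.321 bits.

```python
# Assumes numpy and mdi_from_decomposition from the earlier sketch are in scope.
data = np.array([
    [0, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [2, 1, 0, 1, 1, 1, 0, 1],
    [3, 1, 0, 1, 1, 0, 1, 1],
    [4, 0, 1, 1, 1, 0, 1, 0],
    [5, 1, 1, 0, 1, 0, 1, 1],
    [6, 1, 1, 0, 1, 1, 1, 1],
    [7, 1, 0, 1, 0, 0, 1, 0],
    [8, 1, 1, 1, 1, 1, 1, 1],
    [9, 1, 1, 1, 1, 0, 1, 1],
])
y, X = data[:, 0], data[:, 1:]
imps = [mdi_from_decomposition(X, y, j) for j in range(X.shape[1])]
print(np.round(imps, 3))    # expected: the Imp column of the table above
print(round(sum(imps), 3))  # expected: ~3.321 = log2(10)
```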
Result 2: Irrelevant variables

Variable importances depend only on the relevant variables.

Definition (Kohavi & John, 1997): A variable X is irrelevant (to Y with respect to V) if, for all B ⊆ V, I(X; Y | B) = 0. A variable is relevant if it is not irrelevant.

A variable Xj is irrelevant if and only if Imp(Xj) = 0.

The importance of a relevant variable is insensitive to the addition or removal of irrelevant variables.
Non-totally randomized trees

Most properties are lost as soon as K > 1.

⇒ There can be relevant variables with zero importances (due to masking effects).

Example: two almost redundant variables with
$$I(X_1; Y) = H(Y), \quad I(X_1; Y) \approx I(X_2; Y), \quad I(X_1; Y \mid X_2) = \epsilon \ \text{(small)}, \quad I(X_2; Y \mid X_1) = 0.$$

K = 1 → $\mathrm{Imp}_{K=1}(X_1) \approx \tfrac{1}{2} I(X_1; Y)$ and $\mathrm{Imp}_{K=1}(X_2) \approx \tfrac{1}{2} I(X_2; Y)$
K = 2 → $\mathrm{Imp}_{K=2}(X_1) = I(X_1; Y)$ and $\mathrm{Imp}_{K=2}(X_2) = 0$

⇒ The importance of relevant variables can be influenced by the number of irrelevant variables: with K = 2, adding a new irrelevant variable X3 gives $\mathrm{Imp}_{K=2}(X_2) > 0$. (A small empirical sketch of the masking effect follows.)
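A hedged empirical sketch of the masking effect using scikit-learn (my construction, not from the slides; max_features plays the role of K, and exact masking as in the asymptotic analysis is not guaranteed on finite data):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=5000)
x1 = y.copy()                                   # I(X1; Y) = H(Y)
x2 = np.where(rng.rand(5000) < 0.05, 1 - y, y)  # slightly noisier copy of X1
X = np.column_stack([x1, x2])

for K in (1, 2):
    forest = ExtraTreesClassifier(n_estimators=500, max_features=K,
                                  random_state=0).fit(X, y)
    print(K, forest.feature_importances_)

# With K = 1, both variables receive substantial importance; with K = 2,
# X2 is largely masked because X1 wins the split at (almost) every node,
# even though X2 is clearly relevant on its own.
```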
Illustration (continued)

[Figure: importance of each of the seven variables X1, ..., X7 as a function of K = 1, ..., 7.]
Conclusions

We propose a theoretical characterization of MDI importance scores in asymptotic conditions;

In the case of totally randomized trees, variable importances show sound and desirable theoretical properties, which are lost when using non-totally randomized trees;

The main results remain valid for a large range of impurity measures (Gini entropy, variance, kernelized variance, etc.).

Future work:
• A more formal treatment of the non-totally random case (K > 1);
• Derive distributions of importances in the finite setting;
• Use these results to design better importance score estimators.