1. Understanding variable importances in forests of randomized trees

Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts
Dept. of EE & CS, & GIGA-R
Université de Liège, Belgium
June 6, 2014
2. Ensembles of randomized trees

Majority vote or prediction average.

Improve standard classification and regression trees by reducing their variance.

Many examples: Bagging (Breiman, 1996), Random Forests (Breiman, 2001), Extremely randomized trees (Geurts et al., 2006).

Standard Random Forests: bootstrap sampling + random selection of K features at each node.
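As an illustration only (not part of the original slides), the two randomization ingredients named above can be sketched with scikit-learn on hypothetical toy data: `bootstrap=True` gives each tree its own bootstrap sample, and `max_features` plays the role of K.

```python
# Sketch of a standard Random Forest (toy data; settings are ours).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,   # M trees; predictions are majority-voted / averaged
    bootstrap=True,     # bagging: each tree sees a bootstrap sample
    max_features=3,     # K features drawn at random at each node
    random_state=0,
).fit(X, y)

print(forest.score(X, y))
```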
4. Strengths and weaknesses

Universal approximation
Robustness to outliers
Robustness to irrelevant attributes (to some extent)
Invariance to scaling of inputs
Good computational efficiency and scalability
Very good accuracy

Loss of interpretability w.r.t. single decision trees
7. Variable importances

Interpretability can be recovered through variable importances.

Two main importance measures:
• The mean decrease of impurity (MDI): summing the total impurity reductions at all tree nodes where the variable appears (Breiman et al., 1984);
• The mean decrease of accuracy (MDA): measuring the accuracy reduction on out-of-bag samples when the values of the variable are randomly permuted (Breiman, 2001).

We focus here on MDI because:
• it is faster to compute;
• it does not require bootstrap sampling;
• in practice, it correlates well with the MDA measure (except in specific conditions).
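For illustration (our code, toy data), both measures have counterparts in scikit-learn: `feature_importances_` is an MDI estimate, and `permutation_importance` implements an MDA-style permutation measure (computed on supplied data rather than on out-of-bag samples).

```python
# Sketch: MDI vs. a permutation (MDA-style) importance in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

mdi = forest.feature_importances_             # mean decrease of impurity
perm = permutation_importance(forest, X, y,   # permutation-based measure
                              n_repeats=10, random_state=0)
print(mdi)
print(perm.importances_mean)
```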
8. Mean decrease of impurity

The importance of variable X_j for an ensemble of M trees \varphi_m is:

Imp(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} 1(j_t = j) \, p(t) \, \Delta i(t)    (1)

where j_t denotes the variable used at node t, p(t) = N_t / N, and \Delta i(t) is the impurity reduction at node t:

\Delta i(t) = i(t) - \frac{N_{t_L}}{N_t} i(t_L) - \frac{N_{t_R}}{N_t} i(t_R)    (2)

The impurity i(t) can be Shannon entropy, the Gini index, variance, ...
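Eq. (1) can be evaluated directly from a fitted scikit-learn tree through its public `tree_` arrays. The sketch below (our code, toy data) accumulates p(t)Δi(t) per split variable and recovers sklearn's own MDI estimate up to normalization.

```python
# Sketch: computing Eq. (1) by hand for a single fitted decision tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

t = tree.tree_
N = t.weighted_n_node_samples[0]          # total weight at the root
imp = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                        # leaf: no split, no reduction
        continue
    N_t = t.weighted_n_node_samples[node]
    p_t = N_t / N                         # p(t) = N_t / N
    # Delta i(t) = i(t) - (N_tL/N_t) i(tL) - (N_tR/N_t) i(tR)
    delta = (t.impurity[node]
             - t.weighted_n_node_samples[left] / N_t * t.impurity[left]
             - t.weighted_n_node_samples[right] / N_t * t.impurity[right])
    imp[t.feature[node]] += p_t * delta   # 1(j_t = j) p(t) Delta i(t)

# Up to normalization, this matches sklearn's own MDI estimate:
print(imp / imp.sum())
print(tree.feature_importances_)
```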
15. Motivation and assumptions

Imp(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} 1(j_t = j) \, p(t) \, \Delta i(t)

MDI works well, but it is not well understood theoretically; we would like to better characterize it and derive its main properties from this characterization.

Working assumptions:
• All variables are discrete;
• Multi-way splits à la C4.5 (i.e., one branch per value);
• Shannon entropy as the impurity measure: i(t) = - \sum_{c} \frac{N_{t,c}}{N_t} \log \frac{N_{t,c}}{N_t}
• Totally randomized trees (RF with K = 1);
• Asymptotic conditions: N → ∞, M → ∞.
19. Result 1: Three-level decomposition

Variable importances provide a three-level decomposition of the information jointly provided by all the input variables about the output, accounting for all interaction terms in a fair and exhaustive way.

i) Decomposition of the information jointly provided by all input variables about the output, in terms of the MDI importance of each input variable:

I(X_1, \ldots, X_p; Y) = \sum_{j=1}^{p} Imp(X_j)

ii) Decomposition along the degrees k of interaction with the other variables, and iii) decomposition along all interaction terms B of a given degree k:

Imp(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in P_k(V^{-j})} I(X_j; Y | B)

E.g., for p = 3:
Imp(X_1) = \frac{1}{3} I(X_1; Y) + \frac{1}{6} \left( I(X_1; Y | X_2) + I(X_1; Y | X_3) \right) + \frac{1}{3} I(X_1; Y | X_2, X_3)
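A small numeric check of the decomposition (our code; the toy distribution and helper names `cond_mi`, `mdi` are ours, not the authors'): for Y = X1 XOR X2 with an irrelevant X3, the formula gives Imp(X1) = Imp(X2) = 1/2 bit, Imp(X3) = 0, and the importances sum to I(X1, X2, X3; Y) = 1 bit.

```python
# Sketch: evaluating the three-level decomposition on discrete data.
from itertools import combinations
from math import comb, log2
from collections import Counter

def cond_mi(rows, j, cond):
    """Empirical I(X_j ; Y | X_B) in bits; Y is the last entry of each row."""
    n = len(rows)
    def H(idx):
        counts = Counter(tuple(r[i] for i in idx) for r in rows)
        return -sum(c / n * log2(c / n) for c in counts.values())
    B = list(cond)
    y = len(rows[0]) - 1
    return H(B + [j]) + H(B + [y]) - H(B + [j, y]) - H(B)

def mdi(rows, j, p):
    """Imp(X_j) via the decomposition over degrees k and subsets B."""
    others = [i for i in range(p) if i != j]
    return sum(
        1 / comb(p, k) / (p - k) * cond_mi(rows, j, B)
        for k in range(p)
        for B in combinations(others, k)
    )

# Toy joint distribution (uniform over rows): Y = X1 XOR X2, X3 irrelevant.
rows = [(x1, x2, x3, x1 ^ x2)
        for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]
imps = [mdi(rows, j, p=3) for j in range(3)]
print(imps)  # -> [0.5, 0.5, 0.0]
```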
21. Illustration (Breiman et al., 1984)

[Figure: seven-input problem, variables X1 to X7]

Imp(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in P_k(V^{-j})} I(X_j; Y | B)

Var | Imp
X1  | 0.412
X2  | 0.581
X3  | 0.531
X4  | 0.542
X5  | 0.656
X6  | 0.225
X7  | 0.372
Sum | 3.321

[Figure: decomposition of each variable's importance along the interaction degrees k = 0, ..., 6]
25. Result 2: Irrelevant variables

Variable importances depend only on the relevant variables.

Definition (Kohavi & John, 1997): A variable X is irrelevant (to Y with respect to V) if, for all B ⊆ V, I(X; Y | B) = 0. A variable is relevant if it is not irrelevant.

A variable X_j is irrelevant if and only if Imp(X_j) = 0.

The importance of a relevant variable is insensitive to the addition or the removal of irrelevant variables.
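As a quick empirical sanity check (our toy setup, not the authors' experiment), totally randomized trees can be approximated in scikit-learn with `ExtraTreesClassifier(max_features=1)`; a pure-noise input then receives a near-zero (though, at finite N, not exactly zero) MDI score, as the result predicts.

```python
# Sketch: an irrelevant variable gets (near-)zero importance with K = 1.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(2000, 3))   # three binary inputs
y = X[:, 0] ^ X[:, 1]                   # Y = X1 XOR X2; X3 is pure noise

trees = ExtraTreesClassifier(n_estimators=300, max_features=1,  # K = 1
                             criterion="entropy",
                             random_state=0).fit(X, y)
print(trees.feature_importances_)       # third score should be near zero
```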
32. Non-totally randomized trees

Most properties are lost as soon as K > 1.
⇒ There can be relevant variables with zero importances (due to masking effects).

Example (with ε a small positive quantity):
I(X_1; Y) = H(Y),  I(X_1; Y) ≈ I(X_2; Y),  I(X_1; Y | X_2) = ε  and  I(X_2; Y | X_1) = 0

K = 1 → Imp_{K=1}(X_1) ≈ \frac{1}{2} I(X_1; Y) + \frac{ε}{2}  and  Imp_{K=1}(X_2) ≈ \frac{1}{2} I(X_2; Y)

K = 2 → Imp_{K=2}(X_1) = I(X_1; Y)  and  Imp_{K=2}(X_2) = 0

⇒ The importance of relevant variables can be influenced by the number of irrelevant variables:
with K = 2, adding a new irrelevant variable X_3 yields Imp_{K=2}(X_2) > 0.
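The masking effect can be reproduced with a small sketch (our setup: X2 is a copy of X1 with 5% of values flipped, and Y = X1). With K = 2 the greedy split always prefers X1, so X2's importance collapses to zero; with K = 1 both copies receive a substantial share.

```python
# Sketch: masking of a near-duplicate variable once K > 1.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
x1 = rng.randint(0, 2, size=2000)
x2 = np.where(rng.rand(2000) < 0.05, 1 - x1, x1)  # noisy copy of X1
X = np.column_stack([x1, x2])
y = x1                                            # Y = X1, so I(X1; Y) = H(Y)

imps = {}
for K in (1, 2):
    trees = ExtraTreesClassifier(n_estimators=300, max_features=K,
                                 criterion="entropy",
                                 random_state=0).fit(X, y)
    imps[K] = trees.feature_importances_
    print(K, imps[K])
# K = 2: every tree splits on X1 at the root and its children are pure,
#        so X2 is never used and its importance is exactly zero.
# K = 1: roughly half the trees split on X2 first, so both copies score.
```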
34. Conclusions

We propose a theoretical characterization of MDI importance scores in asymptotic conditions.

For totally randomized trees, variable importances exhibit sound and desirable theoretical properties, which are lost when using non-totally randomized trees.

The main results remain valid for a large range of impurity measures (Gini index, variance, kernelized variance, etc.).

Future work:
• a more formal treatment of the non-totally random case (K > 1);
• deriving the distributions of importances in the finite setting;
• using these results to design better importance score estimators.