1. Understanding variable importances in forests of randomized trees

Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts
Dept. of EE & CS, & GIGA-R
Université de Liège, Belgium
June 6, 2014
2. Ensembles of randomized trees

Majority vote or prediction average.

Improve standard classification and regression trees by reducing their variance.

Many examples: Bagging (Breiman, 1996), Random Forests (Breiman, 2001), Extremely randomized trees (Geurts et al., 2006).

Standard Random Forests: bootstrap sampling + random selection of K features at each node.
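As an illustration only (not part of the original slides), the two randomization ingredients named above can be sketched with scikit-learn on hypothetical toy data: `bootstrap=True` gives each tree its own bootstrap sample, and `max_features` plays the role of K.

```python
# Sketch of a standard Random Forest (toy data; settings are ours).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,   # M trees; predictions are majority-voted / averaged
    bootstrap=True,     # bagging: each tree sees a bootstrap sample
    max_features=3,     # K features drawn at random at each node
    random_state=0,
).fit(X, y)

print(forest.score(X, y))
```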
4. Strengths and weaknesses

Universal approximation
Robustness to outliers
Robustness to irrelevant attributes (to some extent)
Invariance to scaling of inputs
Good computational efficiency and scalability
Very good accuracy

Loss of interpretability w.r.t. single decision trees
7. Variable importances

Interpretability can be recovered through variable importances.

Two main importance measures:
• The mean decrease of impurity (MDI): summing the total impurity reductions at all tree nodes where the variable appears (Breiman et al., 1984);
• The mean decrease of accuracy (MDA): measuring the accuracy reduction on out-of-bag samples when the values of the variable are randomly permuted (Breiman, 2001).

We focus here on MDI because:
• it is faster to compute;
• it does not require bootstrap sampling;
• in practice, it correlates well with the MDA measure (except in specific conditions).
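For illustration (our code, toy data), both measures have counterparts in scikit-learn: `feature_importances_` is an MDI estimate, and `permutation_importance` implements an MDA-style permutation measure (computed on supplied data rather than on out-of-bag samples).

```python
# Sketch: MDI vs. a permutation (MDA-style) importance in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

mdi = forest.feature_importances_             # mean decrease of impurity
perm = permutation_importance(forest, X, y,   # permutation-based measure
                              n_repeats=10, random_state=0)
print(mdi)
print(perm.importances_mean)
```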
8. Mean decrease of impurity

The importance of variable X_j for an ensemble of M trees \varphi_m is:

Imp(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} 1(j_t = j) \, p(t) \, \Delta i(t)    (1)

where j_t denotes the variable used at node t, p(t) = N_t / N, and \Delta i(t) is the impurity reduction at node t:

\Delta i(t) = i(t) - \frac{N_{t_L}}{N_t} i(t_L) - \frac{N_{t_R}}{N_t} i(t_R)    (2)

The impurity i(t) can be Shannon entropy, the Gini index, variance, ...
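Eq. (1) can be evaluated directly from a fitted scikit-learn tree through its public `tree_` arrays. The sketch below (our code, toy data) accumulates p(t)Δi(t) per split variable and recovers sklearn's own MDI estimate up to normalization.

```python
# Sketch: computing Eq. (1) by hand for a single fitted decision tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

t = tree.tree_
N = t.weighted_n_node_samples[0]          # total weight at the root
imp = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                        # leaf: no split, no reduction
        continue
    N_t = t.weighted_n_node_samples[node]
    p_t = N_t / N                         # p(t) = N_t / N
    # Delta i(t) = i(t) - (N_tL/N_t) i(tL) - (N_tR/N_t) i(tR)
    delta = (t.impurity[node]
             - t.weighted_n_node_samples[left] / N_t * t.impurity[left]
             - t.weighted_n_node_samples[right] / N_t * t.impurity[right])
    imp[t.feature[node]] += p_t * delta   # 1(j_t = j) p(t) Delta i(t)

# Up to normalization, this matches sklearn's own MDI estimate:
print(imp / imp.sum())
print(tree.feature_importances_)
```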
15. Motivation and assumptions

Imp(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} 1(j_t = j) \, p(t) \, \Delta i(t)

MDI works well, but it is not well understood theoretically; we would like to better characterize it and derive its main properties from this characterization.

Working assumptions:
• All variables are discrete;
• Multi-way splits à la C4.5 (i.e., one branch per value);
• Shannon entropy as the impurity measure: i(t) = - \sum_{c} \frac{N_{t,c}}{N_t} \log \frac{N_{t,c}}{N_t}
• Totally randomized trees (RF with K = 1);
• Asymptotic conditions: N → ∞, M → ∞.
19. Result 1: Three-level decomposition

Variable importances provide a three-level decomposition of the information jointly provided by all the input variables about the output, accounting for all interaction terms in a fair and exhaustive way.

i) Decomposition of the information jointly provided by all input variables about the output, in terms of the MDI importance of each input variable:

I(X_1, \ldots, X_p; Y) = \sum_{j=1}^{p} Imp(X_j)

ii) Decomposition along the degrees k of interaction with the other variables, and iii) decomposition along all interaction terms B of a given degree k:

Imp(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in P_k(V^{-j})} I(X_j; Y | B)

E.g., for p = 3:
Imp(X_1) = \frac{1}{3} I(X_1; Y) + \frac{1}{6} \left( I(X_1; Y | X_2) + I(X_1; Y | X_3) \right) + \frac{1}{3} I(X_1; Y | X_2, X_3)
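A small numeric check of the decomposition (our code; the toy distribution and helper names `cond_mi`, `mdi` are ours, not the authors'): for Y = X1 XOR X2 with an irrelevant X3, the formula gives Imp(X1) = Imp(X2) = 1/2 bit, Imp(X3) = 0, and the importances sum to I(X1, X2, X3; Y) = 1 bit.

```python
# Sketch: evaluating the three-level decomposition on discrete data.
from itertools import combinations
from math import comb, log2
from collections import Counter

def cond_mi(rows, j, cond):
    """Empirical I(X_j ; Y | X_B) in bits; Y is the last entry of each row."""
    n = len(rows)
    def H(idx):
        counts = Counter(tuple(r[i] for i in idx) for r in rows)
        return -sum(c / n * log2(c / n) for c in counts.values())
    B = list(cond)
    y = len(rows[0]) - 1
    return H(B + [j]) + H(B + [y]) - H(B + [j, y]) - H(B)

def mdi(rows, j, p):
    """Imp(X_j) via the decomposition over degrees k and subsets B."""
    others = [i for i in range(p) if i != j]
    return sum(
        1 / comb(p, k) / (p - k) * cond_mi(rows, j, B)
        for k in range(p)
        for B in combinations(others, k)
    )

# Toy joint distribution (uniform over rows): Y = X1 XOR X2, X3 irrelevant.
rows = [(x1, x2, x3, x1 ^ x2)
        for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]
imps = [mdi(rows, j, p=3) for j in range(3)]
print(imps)  # -> [0.5, 0.5, 0.0]
```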
21. Illustration (Breiman et al., 1984)

[Figure: seven-input problem, variables X1 to X7]

Imp(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in P_k(V^{-j})} I(X_j; Y | B)

Var | Imp
X1  | 0.412
X2  | 0.581
X3  | 0.531
X4  | 0.542
X5  | 0.656
X6  | 0.225
X7  | 0.372
Sum | 3.321

[Figure: decomposition of each variable's importance along the interaction degrees k = 0, ..., 6]
25. Result 2: Irrelevant variables

Variable importances depend only on the relevant variables.

Definition (Kohavi & John, 1997): A variable X is irrelevant (to Y with respect to V) if, for all B ⊆ V, I(X; Y | B) = 0. A variable is relevant if it is not irrelevant.

A variable X_j is irrelevant if and only if Imp(X_j) = 0.

The importance of a relevant variable is insensitive to the addition or the removal of irrelevant variables.
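As a quick empirical sanity check (our toy setup, not the authors' experiment), totally randomized trees can be approximated in scikit-learn with `ExtraTreesClassifier(max_features=1)`; a pure-noise input then receives a near-zero (though, at finite N, not exactly zero) MDI score, as the result predicts.

```python
# Sketch: an irrelevant variable gets (near-)zero importance with K = 1.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(2000, 3))   # three binary inputs
y = X[:, 0] ^ X[:, 1]                   # Y = X1 XOR X2; X3 is pure noise

trees = ExtraTreesClassifier(n_estimators=300, max_features=1,  # K = 1
                             criterion="entropy",
                             random_state=0).fit(X, y)
print(trees.feature_importances_)       # third score should be near zero
```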
32. Non-totally randomized trees

Most properties are lost as soon as K > 1.
⇒ There can be relevant variables with zero importances (due to masking effects).

Example (with ε a small positive quantity):
I(X_1; Y) = H(Y),  I(X_1; Y) ≈ I(X_2; Y),  I(X_1; Y | X_2) = ε  and  I(X_2; Y | X_1) = 0

K = 1 → Imp_{K=1}(X_1) ≈ \frac{1}{2} I(X_1; Y) + \frac{ε}{2}  and  Imp_{K=1}(X_2) ≈ \frac{1}{2} I(X_2; Y)

K = 2 → Imp_{K=2}(X_1) = I(X_1; Y)  and  Imp_{K=2}(X_2) = 0

⇒ The importance of relevant variables can be influenced by the number of irrelevant variables:
with K = 2, adding a new irrelevant variable X_3 yields Imp_{K=2}(X_2) > 0.
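The masking effect can be reproduced with a small sketch (our setup: X2 is a copy of X1 with 5% of values flipped, and Y = X1). With K = 2 the greedy split always prefers X1, so X2's importance collapses to zero; with K = 1 both copies receive a substantial share.

```python
# Sketch: masking of a near-duplicate variable once K > 1.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
x1 = rng.randint(0, 2, size=2000)
x2 = np.where(rng.rand(2000) < 0.05, 1 - x1, x1)  # noisy copy of X1
X = np.column_stack([x1, x2])
y = x1                                            # Y = X1, so I(X1; Y) = H(Y)

imps = {}
for K in (1, 2):
    trees = ExtraTreesClassifier(n_estimators=300, max_features=K,
                                 criterion="entropy",
                                 random_state=0).fit(X, y)
    imps[K] = trees.feature_importances_
    print(K, imps[K])
# K = 2: every tree splits on X1 at the root and its children are pure,
#        so X2 is never used and its importance is exactly zero.
# K = 1: roughly half the trees split on X2 first, so both copies score.
```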
34. Conclusions

We propose a theoretical characterization of MDI importance scores in asymptotic conditions.

For totally randomized trees, variable importances exhibit sound and desirable theoretical properties, which are lost when using non-totally randomized trees.

The main results remain valid for a large range of impurity measures (Gini index, variance, kernelized variance, etc.).

Future work:
• a more formal treatment of the non-totally random case (K > 1);
• deriving the distributions of importances in the finite setting;
• using these results to design better importance score estimators.