ERCIM'11 - Handling imprecision in graphical models
Classification with decision trees from a nonparametric predictive inference perspective
Joaquín Abellán†, Rebecca M. Baker§, Frank P.A. Coolen§,
Richard J. Crossman¶, Andrés R. Masegosa†
† Department of Computer Science and Artificial Intelligence, University of
Granada, Granada, Spain
§ Department of Mathematical Sciences, Durham University, Durham, UK
¶ Warwick Medical School, University of Warwick, Coventry, UK
London, December 2011
Outline
1 Introduction
2 Imprecise Models for Multinomial Data
3 Uncertainty Measures and Decision Trees
4 Experimental Evaluation
5 Conclusions & Future Work
Part I
Introduction
Imprecise Probabilities
Imprecise Probabilities is a term that subsumes several
theories for reasoning under uncertainty:
Belief Functions
Reachable Probability Intervals
Capacities of various orders
Upper and lower probabilities
Credal Sets
...
Uncertainty Measures
Extensions of classical information theory have been developed for these imprecise models.
Several extensions of the Shannon entropy measure have been proposed:
The maximum entropy measure, proposed by (Abellán and Moral, 2003) as a total uncertainty measure for general credal sets.
The use of this measure with the Imprecise Dirichlet Model (IDM) has led to several successful applications in the field of data mining, especially to decision trees for supervised classification.
The Imprecise Dirichlet Model (IDM)
The IDM was proposed by Walley in 1996 for statistical inference from multinomial data; it was developed to correct shortcomings of earlier alternative objective models.
It satisfies a set of principles which Walley claims to be desirable for inference:
the Representation Invariance Principle (RIP).
It can be expressed via a set of reachable probability intervals and a belief function (Abellán, 2006).
Non-parametric Predictive Inference
The Non-parametric Predictive Inference model for multinomial data (NPI-M) was presented by (Coolen and Augustin, 2005) as an alternative to the IDM.
It learns from data in the absence of prior knowledge and with only a few modeling assumptions.
It combines a post-data exchangeability-like assumption with a latent variable representation of the data.
[Figure: probability wheel whose slices carry the observed B and P outcomes.]
The NPI-M does not satisfy the Representation Invariance Principle (RIP).
This is not a shortcoming for Coolen and Augustin; a weaker principle is proposed instead.
Classification with Decision Trees using the NPI-M
In this work, we employ the NPI-M to build decision trees, using the maximum entropy (ME) measure as the split criterion:
We experimentally compare against the decision trees built with the IDM for different values of s.
Classification accuracy is slightly improved.
The resulting decision trees are notably smaller.
The method is parameter-free.
Part II
Imprecise Models for Multinomial Data
Imprecise Dirichlet Model
The imprecise Dirichlet model (IDM) was introduced by Walley for inference about the probability distribution of a categorical variable.
Let us assume that X is a variable taking values on a finite set $\mathcal{X} = \{x_1, \ldots, x_K\}$ and that we have a sample of N independent and identically distributed outcomes of X.
The goal is to estimate the probabilities $\theta_x = p(x)$ with which X takes its values.
Common Bayesian procedure:
Assume a Dirichlet prior for the parameter vector $(\theta_x)_{x \in \mathcal{X}}$:
$$f((\theta_x)_{x \in \mathcal{X}}) = \frac{\Gamma(s)}{\prod_{x \in \mathcal{X}} \Gamma(s \cdot t_x)} \prod_{x \in \mathcal{X}} \theta_x^{s \cdot t_x - 1}$$
The parameter s > 0, and $t = (t_x)_{x \in \mathcal{X}}$ is a vector of positive real numbers satisfying $\sum_{x \in \mathcal{X}} t_x = 1$.
Take the posterior expectation of the parameters given the sample.
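As a point of reference before moving to the imprecise case, here is a minimal Python sketch of this Bayesian point estimate, E[θ_x | D] = (n(x) + s·t_x) / (N + s); the function name and the dictionary representation of t are our own illustration, not from the paper:

```python
from collections import Counter

def dirichlet_posterior_mean(sample, s, t):
    """Posterior expectation E[theta_x | D] = (n(x) + s*t_x) / (N + s)
    under a Dirichlet prior with parameters s*t_x; `t` maps each
    category to t_x > 0 with sum(t_x) = 1."""
    counts = Counter(sample)           # n(x) for each observed category
    N = len(sample)
    return {x: (counts[x] + s * t_x) / (N + s) for x, t_x in t.items()}

# e.g. uniform t over {B, P, R, Y, G} with s = 1
t = dict.fromkeys("BPRYG", 1 / 5)
print(dirichlet_posterior_mean(list("BBBBPPPPP"), s=1.0, t=t))
```

The IDM, presented next, replaces the single choice of t with all possible choices at once.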
Imprecise Dirichlet Model
Posterior probability:
$$f((\theta_x)_{x \in \mathcal{X}} \mid D) = \frac{\Gamma(N + s)}{\prod_{x \in \mathcal{X}} \Gamma(n(x) + s \cdot t_x)} \prod_{x \in \mathcal{X}} \theta_x^{n(x) + s \cdot t_x - 1}$$
where n(x) is the count frequency of x in the data.
The IDM depends only on the parameter s and considers all possible values of the vector t.
This defines a closed and convex set of prior distributions.
The resulting credal set for a particular variable X can be represented by a system of probability intervals:
$$p(x) \in \left[ \frac{n(x)}{N + s}, \; \frac{n(x) + s}{N + s} \right]$$
The parameter s determines how quickly the lower and upper probabilities converge as more data become available.
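These intervals are immediate to compute from counts. A minimal sketch (the function name is ours; categories must be passed explicitly so that unobserved ones get the interval [0, s/(N+s)]):

```python
from collections import Counter

def idm_intervals(sample, categories, s=1.0):
    """IDM probability intervals [n(x)/(N+s), (n(x)+s)/(N+s)]."""
    counts = Counter(sample)           # Counter returns 0 for unseen x
    N = len(sample)
    return {x: (counts[x] / (N + s), (counts[x] + s) / (N + s))
            for x in categories}

print(idm_intervals(list("BBBBPPPPP"), "BPRYG"))
# with s = 1: B: (0.4, 0.5), P: (0.5, 0.6), R/Y/G: (0.0, 0.1)
```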
The NPI-M model
The NPI model for multinomial data (NPI-M), developed by Coolen and Augustin, is based on a variation of Hill's assumption A(n), which relates to predictive inference involving real-valued data observations.
It combines a post-data exchangeability-like assumption with a latent variable representation of the data.
Suppose there are 5 different categories {B, P, R, Y, G} and the observations are 4 outcomes of B and 5 outcomes of P.
[Figure: probability wheel with nine slices, four labelled B and five labelled P.]
The NPI-M model
Theorem (Coolen and Augustin, 2009). The lower and upper probabilities for the event x are
$$\underline{P}(x) = \max\left(0, \frac{n(x) - 1}{N}\right), \qquad \overline{P}(x) = \min\left(\frac{n(x) + 1}{N}, 1\right)$$
For the previous case {B, P, R, Y, G}, with 4 outcomes of B and 5 of P (N = 9):
B: [3/9, 5/9]; P: [4/9, 6/9]; R: [0, 1/9]; Y: [0, 1/9]; G: [0, 1/9]
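A minimal sketch of these bounds (naming is ours), reproducing the example above with exact fractions:

```python
from collections import Counter
from fractions import Fraction

def npi_intervals(sample, categories):
    """NPI-M singleton bounds: lower = max(0, (n(x)-1)/N),
    upper = min((n(x)+1)/N, 1)."""
    counts = Counter(sample)
    N = Fraction(len(sample))
    return {x: (max(Fraction(0), (counts[x] - 1) / N),
                min((counts[x] + 1) / N, Fraction(1)))
            for x in categories}

print(npi_intervals(list("BBBBPPPPP"), "BPRYG"))
# B: [1/3, 5/9], P: [4/9, 2/3], R/Y/G: [0, 1/9]  (fractions in lowest terms)
```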
The A-NPI-M model
The set of lower and upper probabilities, L, associated with the singleton events x expresses a reachable set of probability intervals, i.e. a credal set:
$$L = \left\{ \left[ \max\left(0, \frac{n(x) - 1}{N}\right), \; \min\left(\frac{n(x) + 1}{N}, 1\right) \right] : x \in \{x_1, \ldots, x_K\} \right\}$$
The set of probability distributions obtained from the NPI-M is not a credal set (see a counterexample in the paper):
p = (3/9, 4/9, 2/27, 2/27, 2/27) belongs to the credal set B: [3/9, 5/9]; P: [4/9, 6/9]; R: [0, 1/9]; Y: [0, 1/9]; G: [0, 1/9], but it is not possible to find a configuration of the probability wheel that corresponds to p.
The A-NPI-M is a simplified, approximate model derived from the NPI-M by considering all probability distributions compatible with the set of lower and upper probabilities, L, obtained from the NPI-M.
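The counterexample is easy to verify numerically: p lies inside the interval credal set, and hence belongs to the A-NPI-M, even though no probability-wheel configuration under the exact NPI-M realizes it. A small self-contained check (names are ours):

```python
from fractions import Fraction as F

# NPI-M singleton bounds L for the example (N = 9)
L = {"B": (F(3, 9), F(5, 9)), "P": (F(4, 9), F(6, 9)),
     "R": (F(0), F(1, 9)), "Y": (F(0), F(1, 9)), "G": (F(0), F(1, 9))}

def in_interval_credal_set(p, intervals):
    """Does p respect every [lower, upper] bound and sum to one?"""
    return (sum(p.values()) == 1
            and all(lo <= p[x] <= up for x, (lo, up) in intervals.items()))

p = {"B": F(3, 9), "P": F(4, 9), "R": F(2, 27), "Y": F(2, 27), "G": F(2, 27)}
print(in_interval_credal_set(p, L))   # True: p is in the A-NPI-M credal set,
                                      # yet no wheel configuration yields it
```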
Part III
Uncertainty Measures and Decision Trees
Decision Trees
A decision tree, also called a classification tree, is a simple structure that can be used as a classifier:
Each node represents an attribute variable.
Each branch represents one of the states of this variable.
Each leaf specifies an expected value of the class variable.
Example: a decision tree for three attribute variables Ai (i = 1, 2, 3), each with two possible values (0, 1), and a class variable C with states c1, c2, c3.
[Figure: example decision tree.]
Building Decision Trees from Data
Basic rule: associate with each node the most informative attribute variable about the class C, among those
not already selected in the path from the root to this node, and
providing more information than if no variable had been included.
We are given a data set D.
Each node of the decision tree defines a set of probabilities for the class variable C, P^σ.
σ is the configuration of a node, i.e. the attribute assignments along the path from the root; e.g. σ = (A1 = 1, A3 = 0) is associated with the node labelled c3 in the example tree.
P^σ is obtained with the IDM, NPI-M or A-NPI-M from the data subset D[σ].
Imprecise Information Gain
$$\mathrm{ImpInfGain}(\sigma, A_i) = TU(\mathcal{P}^{\sigma}) - \sum_{a_i \in A_i} r^{\sigma}_{a_i} \, TU\left(\mathcal{P}^{\sigma \cup (A_i = a_i)}\right)$$
TU is a total uncertainty measure, normally defined on credal sets.
r^σ_{a_i} is the relative frequency with which Ai takes the value ai in D[σ].
σ ∪ (Ai = ai) is the result of adding the assignment Ai = ai to the configuration σ.
The Imprecise Information Gain compares the total uncertainty of configuration σ with the expected total uncertainty of σ ∪ (Ai = ai).
We add a new node, i.e. we split, only if this difference is positive (see the sketch below).
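A minimal sketch of the criterion as a function. The names `rows`, `attr_idx`, `class_idx` and the callable `tu` are ours; `tu` stands for a total uncertainty measure, such as the maximum entropy of the credal set built from the class counts, which is defined on the next slide:

```python
from collections import defaultdict

def imprecise_info_gain(rows, attr_idx, class_idx, tu):
    """ImpInfGain(sigma, A_i): `rows` is the data subset D[sigma];
    `tu` maps a list of class labels to the total uncertainty of the
    credal set built from their counts."""
    base = tu([row[class_idx] for row in rows])           # TU(P^sigma)
    parts = defaultdict(list)                             # split D[sigma] on A_i
    for row in rows:
        parts[row[attr_idx]].append(row[class_idx])
    expected = sum(len(labels) / len(rows) * tu(labels)   # r^sigma_{a_i} weights
                   for labels in parts.values())
    return base - expected                                # split only if > 0
```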
Maximum Entropy
Maximum entropy is an aggregate uncertainty measure that aims to generalize the classic measures (the Hartley measure and the Shannon entropy).
It satisfies all the required properties (additivity, subadditivity, monotonicity, proper range, etc.).
It is defined as a functional S*:
$$S^{*}(\mathcal{P}^{\sigma}) = \max_{p \in \mathcal{P}^{\sigma}} \left\{ -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) \right\}$$
Efficient algorithms for different imprecise models have been proposed (see the sketch below):
credal sets (Abellán and Moral, 2003); the IDM (Abellán and Moral, 2006); the NPI-M and A-NPI-M (Abellán et al., 2010).
This method for building classification trees can be extended in the same way to other imprecise models.
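For credal sets given by reachable probability intervals (as with the IDM and the A-NPI-M), S* can be computed by a water-filling scheme: start every p(x) at its lower bound and repeatedly raise the current smallest values in parallel, respecting the upper bounds, until the remaining mass is spent. The sketch below is our own rendering of that idea, not the authors' published algorithm, and it assumes reachable intervals (sum of lowers ≤ 1 ≤ sum of uppers):

```python
import math

def max_entropy(lower, upper, tol=1e-12):
    """Maximum entropy over {p : lower[i] <= p[i] <= upper[i], sum p = 1},
    for reachable probability intervals, via water-filling."""
    p = list(lower)
    mass = 1.0 - sum(p)                     # probability left to distribute
    while mass > tol:
        free = [i for i, v in enumerate(p) if v < upper[i] - tol]
        low = min(p[i] for i in free)
        group = [i for i in free if p[i] <= low + tol]    # current minima
        # raise the group to the next level: another value, a cap, or done
        levels = [p[i] for i in free if p[i] > low + tol]
        levels += [upper[i] for i in group]
        step = min(min(levels) - low, mass / len(group))
        for i in group:
            p[i] += step
        mass -= step * len(group)
    return -sum(v * math.log2(v) for v in p if v > 0)

# A-NPI-M intervals for the {B, P, R, Y, G} example (N = 9)
print(max_entropy([3/9, 4/9, 0, 0, 0], [5/9, 6/9, 1/9, 1/9, 1/9]))
```

For this example the maximizer is exactly the distribution p = (3/9, 4/9, 2/27, 2/27, 2/27) from the A-NPI-M counterexample slide; note that the sketch computes S* over the interval credal set (A-NPI-M), whereas the exact NPI-M requires the dedicated algorithm of (Abellán et al., 2010).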
Part IV
Experimental Evaluation
Experimental Set-up
The aim is to compare the use of the IDM with the use of the NPI-M and the A-NPI-M for building decision trees.
Decision trees for supervised classification using the imprecise models IDM, NPI-M and A-NPI-M:
maximum entropy as the base uncertainty measure.
The classic Shannon entropy measure is also employed: information gain (IG) and information gain ratio (IGR).
Experiments on 40 UCI data sets using a 10-fold cross-validation scheme.
Statistical tests for the comparison:
corrected paired t-test at the 5% significance level;
Friedman test at the 5% significance level.
Evaluating Classification Accuracy

               IDM     NPI-M   A-NPI-M  IG      IGR
Av. Accuracy   76.73   76.77   76.77    74.33   75.66
Friedman Rank  2.95    2.94    2.76     3.06    3.29

The Friedman test accepts the hypothesis that all algorithms perform equally well.
Evaluating the Size of the Decision Trees
The size of a decision tree is relevant:
A decision tree is composed of a set of classification rules.
Smaller trees imply more representative classification rules (lower risk of over-fitted predictions).

               IDM      NPI-M   A-NPI-M  IG       IGR
Av. N. Nodes   610.16   589.52  589.81   1119.46  1288.56
Friedman Rank  2.96     1.4     1.8      4.18     4.67

The imprecise models provide much more compact representations than the precise methods.
The NPI-M and A-NPI-M build smaller decision trees than the IDM with s = 1 (the default value).
Evaluating the IDM with Different s Values
The IDM depends on the parameter s (which affects the width of the intervals):
s = 0 yields the information gain criterion.
Higher s values yield wider intervals and decrease the size of the trees.

               NPI-M   IDM-S1  IDM-S1.5  IDM-S2.0
Av. N. Nodes   589.5   610.2   480.3     383.8
Friedman Rank  2.4     3.85    2.43      1.28

The IDM with s = 1.5 produces decision trees of a size similar to the NPI-M's.
The IDM with s = 2.0 produces smaller decision trees than the NPI-M.
Evaluating the IDM with Different s Values

               NPI-M   IDM-S1  IDM-S1.5  IDM-S2.0
Av. Accuracy   76.77   76.73   76.23     75.75
Friedman Rank  2.25    2.26    2.46      3.03

The IDM with s > 1 performs worse than the NPI-M and than the IDM with s = 1.
The NPI-M models achieve the best classification accuracy with the smallest decision trees.
Part V
Conclusions and Future Work
Conclusions and Future Work
We have presented an application of the NPI model for multinomial data to classification.
A known scheme for building decision trees is used, via two NPI-based models: an exact model (NPI-M) and an approximate model (A-NPI-M).
Models based on the NPI model achieve accuracy similar to that of the best IDM-based model (obtained by varying its parameter), but with smaller trees.
Future work:
Extend these methods to credal classification.
Thanks for your attention!
Questions?
Contenu connexe

Tendances

Prototype-based models in machine learning
Prototype-based models in machine learningPrototype-based models in machine learning
Prototype-based models in machine learningUniversity of Groningen
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practicetuxette
 
Differential analyses of structures in HiC data
Differential analyses of structures in HiC dataDifferential analyses of structures in HiC data
Differential analyses of structures in HiC datatuxette
 
An introduction to deep learning
An introduction to deep learningAn introduction to deep learning
An introduction to deep learningVan Thanh
 
Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biologyKernel methods for data integration in systems biology
Kernel methods for data integration in systems biologytuxette
 
Bayes Nets Meetup Sept 29th 2016 - Bayesian Network Modelling by Marco Scutari
Bayes Nets Meetup Sept 29th 2016 - Bayesian Network Modelling by Marco ScutariBayes Nets Meetup Sept 29th 2016 - Bayesian Network Modelling by Marco Scutari
Bayes Nets Meetup Sept 29th 2016 - Bayesian Network Modelling by Marco ScutariBayes Nets meetup London
 
Learning from (dis)similarity data
Learning from (dis)similarity dataLearning from (dis)similarity data
Learning from (dis)similarity datatuxette
 
A review on structure learning in GNN
A review on structure learning in GNNA review on structure learning in GNN
A review on structure learning in GNNtuxette
 
17 Machine Learning Radial Basis Functions
17 Machine Learning Radial Basis Functions17 Machine Learning Radial Basis Functions
17 Machine Learning Radial Basis FunctionsAndres Mendez-Vazquez
 
Citython presentation
Citython presentationCitython presentation
Citython presentationAnkit Tewari
 
Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Syed Atif Naseem
 
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...Anirbit Mukherjee
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...butest
 
A preliminary study of diversity in ELM ensembles (HAIS 2018)
A preliminary study of diversity in ELM ensembles (HAIS 2018)A preliminary study of diversity in ELM ensembles (HAIS 2018)
A preliminary study of diversity in ELM ensembles (HAIS 2018)Carlos Perales
 
Cryptography Baby Step Giant Step
Cryptography Baby Step Giant StepCryptography Baby Step Giant Step
Cryptography Baby Step Giant StepSAUVIK BISWAS
 
Reading revue of "Inferring Multiple Graphical Structures"
Reading revue of "Inferring Multiple Graphical Structures"Reading revue of "Inferring Multiple Graphical Structures"
Reading revue of "Inferring Multiple Graphical Structures"tuxette
 
Spatially Coherent Latent Topic Model For Concurrent Object Segmentation and ...
Spatially Coherent Latent Topic Model For Concurrent Object Segmentation and ...Spatially Coherent Latent Topic Model For Concurrent Object Segmentation and ...
Spatially Coherent Latent Topic Model For Concurrent Object Segmentation and ...Shao-Chuan Wang
 
Lecture13 xing fei-fei
Lecture13 xing fei-feiLecture13 xing fei-fei
Lecture13 xing fei-feiTianlu Wang
 

Tendances (20)

Prototype-based models in machine learning
Prototype-based models in machine learningPrototype-based models in machine learning
Prototype-based models in machine learning
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practice
 
Differential analyses of structures in HiC data
Differential analyses of structures in HiC dataDifferential analyses of structures in HiC data
Differential analyses of structures in HiC data
 
An introduction to deep learning
An introduction to deep learningAn introduction to deep learning
An introduction to deep learning
 
Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biologyKernel methods for data integration in systems biology
Kernel methods for data integration in systems biology
 
Bayes Nets Meetup Sept 29th 2016 - Bayesian Network Modelling by Marco Scutari
Bayes Nets Meetup Sept 29th 2016 - Bayesian Network Modelling by Marco ScutariBayes Nets Meetup Sept 29th 2016 - Bayesian Network Modelling by Marco Scutari
Bayes Nets Meetup Sept 29th 2016 - Bayesian Network Modelling by Marco Scutari
 
Estimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample SetsEstimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample Sets
 
Learning from (dis)similarity data
Learning from (dis)similarity dataLearning from (dis)similarity data
Learning from (dis)similarity data
 
A review on structure learning in GNN
A review on structure learning in GNNA review on structure learning in GNN
A review on structure learning in GNN
 
17 Machine Learning Radial Basis Functions
17 Machine Learning Radial Basis Functions17 Machine Learning Radial Basis Functions
17 Machine Learning Radial Basis Functions
 
Citython presentation
Citython presentationCitython presentation
Citython presentation
 
Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)
 
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...
 
A preliminary study of diversity in ELM ensembles (HAIS 2018)
A preliminary study of diversity in ELM ensembles (HAIS 2018)A preliminary study of diversity in ELM ensembles (HAIS 2018)
A preliminary study of diversity in ELM ensembles (HAIS 2018)
 
Cryptography Baby Step Giant Step
Cryptography Baby Step Giant StepCryptography Baby Step Giant Step
Cryptography Baby Step Giant Step
 
Reading revue of "Inferring Multiple Graphical Structures"
Reading revue of "Inferring Multiple Graphical Structures"Reading revue of "Inferring Multiple Graphical Structures"
Reading revue of "Inferring Multiple Graphical Structures"
 
Polynomial Matrix Decompositions
Polynomial Matrix DecompositionsPolynomial Matrix Decompositions
Polynomial Matrix Decompositions
 
Spatially Coherent Latent Topic Model For Concurrent Object Segmentation and ...
Spatially Coherent Latent Topic Model For Concurrent Object Segmentation and ...Spatially Coherent Latent Topic Model For Concurrent Object Segmentation and ...
Spatially Coherent Latent Topic Model For Concurrent Object Segmentation and ...
 
Lecture13 xing fei-fei
Lecture13 xing fei-feiLecture13 xing fei-fei
Lecture13 xing fei-fei
 

En vedette

Lviv Outsourcing Forum Олександра Альхімович “Drivers of change at the begin...
Lviv Outsourcing Forum Олександра Альхімович  “Drivers of change at the begin...Lviv Outsourcing Forum Олександра Альхімович  “Drivers of change at the begin...
Lviv Outsourcing Forum Олександра Альхімович “Drivers of change at the begin...Lviv Startup Club
 
The Hyper Connected Era: Mobile First, Cloud First and Multi Screen
The Hyper Connected Era: Mobile First, Cloud First and Multi Screen The Hyper Connected Era: Mobile First, Cloud First and Multi Screen
The Hyper Connected Era: Mobile First, Cloud First and Multi Screen Jose Papo, MSc
 
The Evolution of Offshoring
The Evolution of OffshoringThe Evolution of Offshoring
The Evolution of Offshoringswiss IT bridge
 
Industry 4.0 and the road towards it
Industry 4.0 and the road towards itIndustry 4.0 and the road towards it
Industry 4.0 and the road towards itRick Bouter
 
Routing protocol for delay tolerant network a survey and comparison
Routing protocol for delay tolerant network   a survey and comparisonRouting protocol for delay tolerant network   a survey and comparison
Routing protocol for delay tolerant network a survey and comparisonPhearin Sok
 
Origins of Public Relations
Origins of Public RelationsOrigins of Public Relations
Origins of Public RelationsKelli Burns
 
The Fourth Industrial Revolution
The Fourth Industrial RevolutionThe Fourth Industrial Revolution
The Fourth Industrial RevolutionLuca Lamera
 
Outsourcing & Offshoring
Outsourcing & OffshoringOutsourcing & Offshoring
Outsourcing & OffshoringArun Khedwal
 
4th Industrial Revolution by Sam Williams
4th Industrial Revolution by Sam Williams4th Industrial Revolution by Sam Williams
4th Industrial Revolution by Sam WilliamsCertus Solutions
 
Huawei parameter strategy v1.4 1st dec
Huawei parameter strategy v1.4  1st decHuawei parameter strategy v1.4  1st dec
Huawei parameter strategy v1.4 1st decKetut Widya
 

En vedette (15)

Lviv Outsourcing Forum Олександра Альхімович “Drivers of change at the begin...
Lviv Outsourcing Forum Олександра Альхімович  “Drivers of change at the begin...Lviv Outsourcing Forum Олександра Альхімович  “Drivers of change at the begin...
Lviv Outsourcing Forum Олександра Альхімович “Drivers of change at the begin...
 
The Hyper Connected Era: Mobile First, Cloud First and Multi Screen
The Hyper Connected Era: Mobile First, Cloud First and Multi Screen The Hyper Connected Era: Mobile First, Cloud First and Multi Screen
The Hyper Connected Era: Mobile First, Cloud First and Multi Screen
 
The fourth industrial revolution
The fourth industrial revolutionThe fourth industrial revolution
The fourth industrial revolution
 
The Future of Work - The Fourth Industrial Revolution
The Future of Work - The Fourth Industrial RevolutionThe Future of Work - The Fourth Industrial Revolution
The Future of Work - The Fourth Industrial Revolution
 
The Evolution of Offshoring
The Evolution of OffshoringThe Evolution of Offshoring
The Evolution of Offshoring
 
Industry 4.0 and the road towards it
Industry 4.0 and the road towards itIndustry 4.0 and the road towards it
Industry 4.0 and the road towards it
 
BlitzTrader_PPT
BlitzTrader_PPTBlitzTrader_PPT
BlitzTrader_PPT
 
Routing protocol for delay tolerant network a survey and comparison
Routing protocol for delay tolerant network   a survey and comparisonRouting protocol for delay tolerant network   a survey and comparison
Routing protocol for delay tolerant network a survey and comparison
 
Origins of Public Relations
Origins of Public RelationsOrigins of Public Relations
Origins of Public Relations
 
Analysis of Business Environment
Analysis of  Business EnvironmentAnalysis of  Business Environment
Analysis of Business Environment
 
The Fourth Industrial Revolution
The Fourth Industrial RevolutionThe Fourth Industrial Revolution
The Fourth Industrial Revolution
 
Outsourcing & Offshoring
Outsourcing & OffshoringOutsourcing & Offshoring
Outsourcing & Offshoring
 
4th Industrial Revolution by Sam Williams
4th Industrial Revolution by Sam Williams4th Industrial Revolution by Sam Williams
4th Industrial Revolution by Sam Williams
 
Huawei parameter strategy v1.4 1st dec
Huawei parameter strategy v1.4  1st decHuawei parameter strategy v1.4  1st dec
Huawei parameter strategy v1.4 1st dec
 
10 facts about jobs in the future
10 facts about jobs in the future10 facts about jobs in the future
10 facts about jobs in the future
 

Similaire à lassification with decision trees from a nonparametric predictive inference perspective

Chapter 19 decision-making under risk
Chapter 19   decision-making under riskChapter 19   decision-making under risk
Chapter 19 decision-making under riskBich Lien Pham
 
Chapter 19 decision-making under risk
Chapter 19   decision-making under riskChapter 19   decision-making under risk
Chapter 19 decision-making under riskBich Lien Pham
 
Multiple estimators for Monte Carlo approximations
Multiple estimators for Monte Carlo approximationsMultiple estimators for Monte Carlo approximations
Multiple estimators for Monte Carlo approximationsChristian Robert
 
Minghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkMinghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkWei Wang
 
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Accelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference CompilationAccelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference CompilationFeynman Liang
 
Gaining Confidence in Signalling and Regulatory Networks
Gaining Confidence in Signalling and Regulatory NetworksGaining Confidence in Signalling and Regulatory Networks
Gaining Confidence in Signalling and Regulatory NetworksMichael Stumpf
 
Lecture7 xing fei-fei
Lecture7 xing fei-feiLecture7 xing fei-fei
Lecture7 xing fei-feiTianlu Wang
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selectionchenhm
 
HOP-Rec_RecSys18
HOP-Rec_RecSys18HOP-Rec_RecSys18
HOP-Rec_RecSys18Matt Yang
 
Prpagation of Error Bounds Across reduction interfaces
Prpagation of Error Bounds Across reduction interfacesPrpagation of Error Bounds Across reduction interfaces
Prpagation of Error Bounds Across reduction interfacesMohammad
 
Representation Learning & Generative Modeling with Variational Autoencoder(VA...
Representation Learning & Generative Modeling with Variational Autoencoder(VA...Representation Learning & Generative Modeling with Variational Autoencoder(VA...
Representation Learning & Generative Modeling with Variational Autoencoder(VA...changedaeoh
 
Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspacePrakash Dubey
 
A Logical Language with a Prototypical Semantics
A Logical Language with a Prototypical SemanticsA Logical Language with a Prototypical Semantics
A Logical Language with a Prototypical SemanticsL. Thorne McCarty
 
  Information Theory and the Analysis of Uncertainties in a Spatial Geologi...
  Information Theory and the Analysis of Uncertainties in a Spatial Geologi...  Information Theory and the Analysis of Uncertainties in a Spatial Geologi...
  Information Theory and the Analysis of Uncertainties in a Spatial Geologi...The University of Western Australia
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet AllocationMarco Righini
 
How to Ground A Language for Legal Discourse In a Prototypical Perceptual Sem...
How to Ground A Language for Legal Discourse In a Prototypical Perceptual Sem...How to Ground A Language for Legal Discourse In a Prototypical Perceptual Sem...
How to Ground A Language for Legal Discourse In a Prototypical Perceptual Sem...L. Thorne McCarty
 

Similaire à lassification with decision trees from a nonparametric predictive inference perspective (20)

Chapter 19 decision-making under risk
Chapter 19   decision-making under riskChapter 19   decision-making under risk
Chapter 19 decision-making under risk
 
Chapter 19 decision-making under risk
Chapter 19   decision-making under riskChapter 19   decision-making under risk
Chapter 19 decision-making under risk
 
Multiple estimators for Monte Carlo approximations
Multiple estimators for Monte Carlo approximationsMultiple estimators for Monte Carlo approximations
Multiple estimators for Monte Carlo approximations
 
Minghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkMinghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation Talk
 
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
 
Accelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference CompilationAccelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference Compilation
 
Gaining Confidence in Signalling and Regulatory Networks
Gaining Confidence in Signalling and Regulatory NetworksGaining Confidence in Signalling and Regulatory Networks
Gaining Confidence in Signalling and Regulatory Networks
 
Lecture7 xing fei-fei
Lecture7 xing fei-feiLecture7 xing fei-fei
Lecture7 xing fei-fei
 
Linear models2
Linear models2Linear models2
Linear models2
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
 
HOP-Rec_RecSys18
HOP-Rec_RecSys18HOP-Rec_RecSys18
HOP-Rec_RecSys18
 
Prpagation of Error Bounds Across reduction interfaces
Prpagation of Error Bounds Across reduction interfacesPrpagation of Error Bounds Across reduction interfaces
Prpagation of Error Bounds Across reduction interfaces
 
AbdoSummerANS_mod3
AbdoSummerANS_mod3AbdoSummerANS_mod3
AbdoSummerANS_mod3
 
Representation Learning & Generative Modeling with Variational Autoencoder(VA...
Representation Learning & Generative Modeling with Variational Autoencoder(VA...Representation Learning & Generative Modeling with Variational Autoencoder(VA...
Representation Learning & Generative Modeling with Variational Autoencoder(VA...
 
Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspace
 
Jz3617271733
Jz3617271733Jz3617271733
Jz3617271733
 
A Logical Language with a Prototypical Semantics
A Logical Language with a Prototypical SemanticsA Logical Language with a Prototypical Semantics
A Logical Language with a Prototypical Semantics
 
  Information Theory and the Analysis of Uncertainties in a Spatial Geologi...
  Information Theory and the Analysis of Uncertainties in a Spatial Geologi...  Information Theory and the Analysis of Uncertainties in a Spatial Geologi...
  Information Theory and the Analysis of Uncertainties in a Spatial Geologi...
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
How to Ground A Language for Legal Discourse In a Prototypical Perceptual Sem...
How to Ground A Language for Legal Discourse In a Prototypical Perceptual Sem...How to Ground A Language for Legal Discourse In a Prototypical Perceptual Sem...
How to Ground A Language for Legal Discourse In a Prototypical Perceptual Sem...
 

Plus de NTNU

Varying parameter in classification based on imprecise probabilities
Varying parameter in classification based on imprecise probabilitiesVarying parameter in classification based on imprecise probabilities
Varying parameter in classification based on imprecise probabilitiesNTNU
 
Bagging Decision Trees on Data Sets with Classification Noise
Bagging Decision Trees on Data Sets with Classification NoiseBagging Decision Trees on Data Sets with Classification Noise
Bagging Decision Trees on Data Sets with Classification NoiseNTNU
 
Locally Averaged Bayesian Dirichlet Metrics
Locally Averaged Bayesian Dirichlet MetricsLocally Averaged Bayesian Dirichlet Metrics
Locally Averaged Bayesian Dirichlet MetricsNTNU
 
Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cel...
Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cel...Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cel...
Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cel...NTNU
 
An interactive approach for cleaning noisy observations in Bayesian networks ...
An interactive approach for cleaning noisy observations in Bayesian networks ...An interactive approach for cleaning noisy observations in Bayesian networks ...
An interactive approach for cleaning noisy observations in Bayesian networks ...NTNU
 
Learning classifiers from discretized expression quantitative trait loci
Learning classifiers from discretized expression quantitative trait lociLearning classifiers from discretized expression quantitative trait loci
Learning classifiers from discretized expression quantitative trait lociNTNU
 
Split Criterions for Variable Selection Using Decision Trees
Split Criterions for Variable Selection Using Decision TreesSplit Criterions for Variable Selection Using Decision Trees
Split Criterions for Variable Selection Using Decision TreesNTNU
 
A Semi-naive Bayes Classifier with Grouping of Cases
A Semi-naive Bayes Classifier with Grouping of CasesA Semi-naive Bayes Classifier with Grouping of Cases
A Semi-naive Bayes Classifier with Grouping of CasesNTNU
 
Combining Decision Trees Based on Imprecise Probabilities and Uncertainty Mea...
Combining Decision Trees Based on Imprecise Probabilities and Uncertainty Mea...Combining Decision Trees Based on Imprecise Probabilities and Uncertainty Mea...
Combining Decision Trees Based on Imprecise Probabilities and Uncertainty Mea...NTNU
 
Interactive Learning of Bayesian Networks
Interactive Learning of Bayesian NetworksInteractive Learning of Bayesian Networks
Interactive Learning of Bayesian NetworksNTNU
 
A Bayesian approach to estimate probabilities in classification trees
A Bayesian approach to estimate probabilities in classification treesA Bayesian approach to estimate probabilities in classification trees
A Bayesian approach to estimate probabilities in classification treesNTNU
 
A Bayesian Random Split to Build Ensembles of Classification Trees
A Bayesian Random Split to Build Ensembles of Classification TreesA Bayesian Random Split to Build Ensembles of Classification Trees
A Bayesian Random Split to Build Ensembles of Classification TreesNTNU
 
An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...
An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...
An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...NTNU
 
Selective Gaussian Naïve Bayes Model for Diffuse Large-B-Cell Lymphoma Classi...
Selective Gaussian Naïve Bayes Model for Diffuse Large-B-Cell Lymphoma Classi...Selective Gaussian Naïve Bayes Model for Diffuse Large-B-Cell Lymphoma Classi...
Selective Gaussian Naïve Bayes Model for Diffuse Large-B-Cell Lymphoma Classi...NTNU
 
Evaluating query-independent object features for relevancy prediction
Evaluating query-independent object features for relevancy predictionEvaluating query-independent object features for relevancy prediction
Evaluating query-independent object features for relevancy predictionNTNU
 
Effects of Highly Agreed Documents in Relevancy Prediction
Effects of Highly Agreed Documents in Relevancy PredictionEffects of Highly Agreed Documents in Relevancy Prediction
Effects of Highly Agreed Documents in Relevancy PredictionNTNU
 
Conference poster 6
Conference poster 6Conference poster 6
Conference poster 6NTNU
 

Plus de NTNU (17)

Varying parameter in classification based on imprecise probabilities
Varying parameter in classification based on imprecise probabilitiesVarying parameter in classification based on imprecise probabilities
Varying parameter in classification based on imprecise probabilities
 
Bagging Decision Trees on Data Sets with Classification Noise
Bagging Decision Trees on Data Sets with Classification NoiseBagging Decision Trees on Data Sets with Classification Noise
Bagging Decision Trees on Data Sets with Classification Noise
 
Locally Averaged Bayesian Dirichlet Metrics
Locally Averaged Bayesian Dirichlet MetricsLocally Averaged Bayesian Dirichlet Metrics
Locally Averaged Bayesian Dirichlet Metrics
 
Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cel...
Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cel...Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cel...
Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cel...
 
An interactive approach for cleaning noisy observations in Bayesian networks ...
An interactive approach for cleaning noisy observations in Bayesian networks ...An interactive approach for cleaning noisy observations in Bayesian networks ...
An interactive approach for cleaning noisy observations in Bayesian networks ...
 
Learning classifiers from discretized expression quantitative trait loci
Learning classifiers from discretized expression quantitative trait lociLearning classifiers from discretized expression quantitative trait loci
Learning classifiers from discretized expression quantitative trait loci
 
Split Criterions for Variable Selection Using Decision Trees
Split Criterions for Variable Selection Using Decision TreesSplit Criterions for Variable Selection Using Decision Trees
Split Criterions for Variable Selection Using Decision Trees
 
A Semi-naive Bayes Classifier with Grouping of Cases
A Semi-naive Bayes Classifier with Grouping of CasesA Semi-naive Bayes Classifier with Grouping of Cases
A Semi-naive Bayes Classifier with Grouping of Cases
 
Combining Decision Trees Based on Imprecise Probabilities and Uncertainty Mea...
Combining Decision Trees Based on Imprecise Probabilities and Uncertainty Mea...Combining Decision Trees Based on Imprecise Probabilities and Uncertainty Mea...
Combining Decision Trees Based on Imprecise Probabilities and Uncertainty Mea...
 
Interactive Learning of Bayesian Networks
Interactive Learning of Bayesian NetworksInteractive Learning of Bayesian Networks
Interactive Learning of Bayesian Networks
 
A Bayesian approach to estimate probabilities in classification trees
A Bayesian approach to estimate probabilities in classification treesA Bayesian approach to estimate probabilities in classification trees
A Bayesian approach to estimate probabilities in classification trees
 
A Bayesian Random Split to Build Ensembles of Classification Trees
A Bayesian Random Split to Build Ensembles of Classification TreesA Bayesian Random Split to Build Ensembles of Classification Trees
A Bayesian Random Split to Build Ensembles of Classification Trees
 
An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...
An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...
An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...
 
Selective Gaussian Naïve Bayes Model for Diffuse Large-B-Cell Lymphoma Classi...
Selective Gaussian Naïve Bayes Model for Diffuse Large-B-Cell Lymphoma Classi...Selective Gaussian Naïve Bayes Model for Diffuse Large-B-Cell Lymphoma Classi...
Selective Gaussian Naïve Bayes Model for Diffuse Large-B-Cell Lymphoma Classi...
 
Evaluating query-independent object features for relevancy prediction
Evaluating query-independent object features for relevancy predictionEvaluating query-independent object features for relevancy prediction
Evaluating query-independent object features for relevancy prediction
 
Effects of Highly Agreed Documents in Relevancy Prediction
Effects of Highly Agreed Documents in Relevancy PredictionEffects of Highly Agreed Documents in Relevancy Prediction
Effects of Highly Agreed Documents in Relevancy Prediction
 
Conference poster 6
Conference poster 6Conference poster 6
Conference poster 6
 

Dernier

❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfTukamushabaBismark
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑Damini Dixit
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxMohamedFarag457087
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxSuji236384
 
Introduction to Viruses
Introduction to VirusesIntroduction to Viruses
Introduction to VirusesAreesha Ahmad
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceAlex Henderson
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
 
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Youngkajalvid75
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
GBSN - Microbiology (Unit 3)
Classification with decision trees from a nonparametric predictive inference perspective

  • 10. Introduction Non-parametric Predictive Inference NPI-M does not satisfy the Representation Invariance Principle (RIP); this is not a shortcoming according to Coolen and Augustin, who propose a weaker principle instead. ERCIM’11 London (UK) 7/28
  • 11. Introduction Classification with Decision Trees using the NPI-M model In this work, we employ the NPI-M model to build decision trees using the maximum entropy (ME) measure as a split criterion: ERCIM’11 London (UK) 8/28
  • 12. Introduction Classification with Decision Trees using the NPI-M model In this work, we employ the NPI-M model to build decision trees using the maximum entropy (ME) measure as a split criterion: We experimentally compare against decision trees built with the IDM for different s values. Classification accuracy is slightly improved. The decision trees are notably smaller. It is a parameter-free method. ERCIM’11 London (UK) 8/28
  • 13. Imprecise Models for Multinomial Data Part II Imprecise Models for Multinomial Data ERCIM’11 London (UK) 9/28
  • 14. Imprecise Models for Multinomial Data Imprecise Dirichlet Model The imprecise Dirichlet model (IDM) was introduced by Walley for inference about the probability distribution of a categorical variable. ERCIM’11 London (UK) 10/28
  • 15. Imprecise Models for Multinomial Data Imprecise Dirichlet Model The imprecise Dirichlet model (IDM) was introduced by Walley for inference about the probability distribution of a categorical variable. Let us assume that X is a variable taking values on a finite set X = {x_1, ..., x_K} and we have a sample of N independent and identically distributed outcomes of X. ERCIM’11 London (UK) 10/28
  • 16. Imprecise Models for Multinomial Data Imprecise Dirichlet Model The imprecise Dirichlet model (IDM) was introduced by Walley for inference about the probability distribution of a categorical variable. Let us assume that X is a variable taking values on a finite set X = {x_1, ..., x_K} and we have a sample of N independent and identically distributed outcomes of X. The goal is to estimate the probabilities θ_x = p(x) with which X takes its values. ERCIM’11 London (UK) 10/28
  • 17. Imprecise Models for Multinomial Data Imprecise Dirichlet Model The imprecise Dirichlet model (IDM) was introduced by Walley for inference about the probability distribution of a categorical variable. Let us assume that X is a variable taking values on a finite set X = {x_1, ..., x_K} and we have a sample of N independent and identically distributed outcomes of X. The goal is to estimate the probabilities θ_x = p(x) with which X takes its values. Common Bayesian procedure: assume a Dirichlet prior for the parameter vector (θ_x)_{x∈X},
  f\big((\theta_x)_{x\in\mathcal{X}}\big) = \frac{\Gamma(s)}{\prod_{x\in\mathcal{X}}\Gamma(s\,t_x)} \prod_{x\in\mathcal{X}} \theta_x^{s\,t_x-1},
  where the parameter s > 0 and t = (t_x)_{x∈X} is a vector of positive real numbers satisfying \sum_{x\in\mathcal{X}} t_x = 1; then take the posterior expectation of the parameters given the sample. ERCIM’11 London (UK) 10/28
  • 18. Imprecise Models for Multinomial Data Imprecise Dirichlet Model Posterior Probability
  f\big((\theta_x)_{x\in\mathcal{X}} \mid D\big) = \frac{\Gamma(N+s)}{\prod_{x\in\mathcal{X}}\Gamma(n(x)+s\,t_x)} \prod_{x\in\mathcal{X}} \theta_x^{\,n(x)+s\,t_x-1},
  where n(x) is the frequency count of x in the data and N = \sum_{x} n(x). ERCIM’11 London (UK) 11/28
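To make the update concrete, here is a minimal Python sketch (not the paper's code; the function name and the particular choice of t are ours). For fixed s and t, the posterior expectation is E[θ_x | D] = (n(x) + s·t_x) / (N + s):

```python
# Minimal sketch: Bayesian point estimate under a Dirichlet(s * t) prior.
# For fixed s and t, E[theta_x | D] = (n(x) + s * t_x) / (N + s).
def dirichlet_posterior_mean(counts, s, t):
    """counts: dict {category: n(x)}; t: positive weights summing to 1."""
    N = sum(counts.values())
    return {x: (n + s * t[x]) / (N + s) for x, n in counts.items()}

counts = {"B": 4, "P": 5, "R": 0, "Y": 0, "G": 0}
t_uniform = {x: 1 / 5 for x in counts}   # one particular choice of t
print(dirichlet_posterior_mean(counts, s=1, t=t_uniform))
```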
  • 19. Imprecise Models for Multinomial Data Imprecise Dirichlet Model Posterior Probability
  f\big((\theta_x)_{x\in\mathcal{X}} \mid D\big) = \frac{\Gamma(N+s)}{\prod_{x\in\mathcal{X}}\Gamma(n(x)+s\,t_x)} \prod_{x\in\mathcal{X}} \theta_x^{\,n(x)+s\,t_x-1},
  where n(x) is the frequency count of x in the data. The IDM only depends on the parameter s and considers all possible values of the vector t. This defines a closed and convex set of prior distributions. ERCIM’11 London (UK) 11/28
  • 20. Imprecise Models for Multinomial Data Imprecise Dirichlet Model Posterior Probability
  f\big((\theta_x)_{x\in\mathcal{X}} \mid D\big) = \frac{\Gamma(N+s)}{\prod_{x\in\mathcal{X}}\Gamma(n(x)+s\,t_x)} \prod_{x\in\mathcal{X}} \theta_x^{\,n(x)+s\,t_x-1},
  where n(x) is the frequency count of x in the data. The IDM only depends on the parameter s and considers all possible values of the vector t. This defines a closed and convex set of prior distributions. This credal set for a particular variable X can be represented by a system of probability intervals:
  p(x) \in \left[ \frac{n(x)}{N+s}, \frac{n(x)+s}{N+s} \right].
  The parameter s determines how quickly the lower and upper probabilities converge as more data become available. ERCIM’11 London (UK) 11/28
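These bounds follow directly from the counts. A minimal sketch (the function name is ours), using for illustration the sample with 4 outcomes of B and 5 of P introduced below:

```python
# Minimal sketch: IDM lower/upper probabilities for each singleton event,
# obtained by letting t range over all valid vectors for a fixed s.
def idm_intervals(counts, s):
    N = sum(counts.values())
    return {x: (n / (N + s), (n + s) / (N + s)) for x, n in counts.items()}

counts = {"B": 4, "P": 5, "R": 0, "Y": 0, "G": 0}
print(idm_intervals(counts, s=1))
# e.g. B -> (4/10, 5/10) and R -> (0, 1/10) for s = 1
```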
  • 21. Imprecise Models for Multinomial Data The NPI-M model The NPI model for multinomial data (NPI-M), developed by Coolen and Augustin, is based on a variation of Hill's assumption A(n), which relates to predictive inference involving real-valued data observations. It combines a post-data exchangeability-like assumption with a latent variable representation of the data. ERCIM’11 London (UK) 12/28
  • 22. Imprecise Models for Multinomial Data The NPI-M model The NPI model for multinomial data (NPI-M), developed by Coolen and Augustin, is based on a variation of Hill's assumption A(n), which relates to predictive inference involving real-valued data observations. It combines a post-data exchangeability-like assumption with a latent variable representation of the data. Suppose there are 5 different categories {B, P, R, Y, G} and the observations are 4 outcomes of B and 5 outcomes of P: B P P B P B B P P ERCIM’11 London (UK) 12/28
  • 23. Imprecise Models for Multinomial Data The NPI-M model The NPI model for multinomial data (NPI-M), developed by Coolen and Augustin, is based on a variation of Hill's assumption A(n), which relates to predictive inference involving real-valued data observations. It combines a post-data exchangeability-like assumption with a latent variable representation of the data. Suppose there are 5 different categories {B, P, R, Y, G} and the observations are 4 outcomes of B and 5 outcomes of P: B P P B P B B P P, which in the latent representation is arranged as B B B B P P P P P. ERCIM’11 London (UK) 12/28
  • 24. Imprecise Models for Multinomial Data The NPI-M model Theorem (Coolen and Augustin 2009) The lower and upper probabilities for the event x are
  \underline{P}(x) = \max\left\{0, \frac{n(x)-1}{N}\right\}, \qquad \overline{P}(x) = \min\left\{\frac{n(x)+1}{N}, 1\right\}.
  ERCIM’11 London (UK) 13/28
  • 25. Imprecise Models for Multinomial Data The NPI-M model Theorem (Coolen and Augustin 2009) The lower and upper probabilities for the event x are
  \underline{P}(x) = \max\left\{0, \frac{n(x)-1}{N}\right\}, \qquad \overline{P}(x) = \min\left\{\frac{n(x)+1}{N}, 1\right\}.
  For the previous case {B, P, R, Y, G} with latent arrangement B B B B P P P P P, the intervals are [3/9, 5/9]; [4/9, 6/9]; [0, 1/9]; [0, 1/9]; [0, 1/9]. ERCIM’11 London (UK) 13/28
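The theorem translates directly into code. A minimal sketch (the function name is ours) that reproduces the intervals above:

```python
# Minimal sketch of the NPI-M singleton bounds from the theorem above:
#   lower(x) = max(0, (n(x) - 1) / N),  upper(x) = min((n(x) + 1) / N, 1).
def npi_intervals(counts):
    N = sum(counts.values())
    return {x: (max(0, (n - 1) / N), min((n + 1) / N, 1))
            for x, n in counts.items()}

counts = {"B": 4, "P": 5, "R": 0, "Y": 0, "G": 0}
print(npi_intervals(counts))
# B -> (3/9, 5/9), P -> (4/9, 6/9), R/Y/G -> (0, 1/9), matching the slide
```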
  • 26. Imprecise Models for Multinomial Data The A-NPI-M model This set of lower and upper probabilities, L, associated with the singleton events x expresses a reachable set of probability intervals, i.e. a credal set:
  \mathcal{L} = \left\{ \left[ \max\left\{0, \frac{n(x)-1}{N}\right\}, \min\left\{\frac{n(x)+1}{N}, 1\right\} \right] : x \in \{x_1, \ldots, x_K\} \right\}
  ERCIM’11 London (UK) 14/28
  • 27. Imprecise Models for Multinomial Data The A-NPI-M model This set of lower and upper probabilities, L, associated with the singleton events x expresses a reachable set of probability intervals, i.e. a credal set:
  \mathcal{L} = \left\{ \left[ \max\left\{0, \frac{n(x)-1}{N}\right\}, \min\left\{\frac{n(x)+1}{N}, 1\right\} \right] : x \in \{x_1, \ldots, x_K\} \right\}
  The set of probability distributions obtained from NPI-M is not a credal set (see a counterexample in the paper): p = (3/9, 4/9, 2/27, 2/27, 2/27) belongs to the credal set defined by [3/9, 5/9]; [4/9, 6/9]; [0, 1/9]; [0, 1/9]; [0, 1/9], yet it is not possible to find a configuration of the probability wheel that corresponds to p. ERCIM’11 London (UK) 14/28
  • 28. Imprecise Models for Multinomial Data The A-NPI-M model This set of lower and upper probabilities, L, associated with the singleton events x expresses a reachable set of probability intervals, i.e. a credal set:
  \mathcal{L} = \left\{ \left[ \max\left\{0, \frac{n(x)-1}{N}\right\}, \min\left\{\frac{n(x)+1}{N}, 1\right\} \right] : x \in \{x_1, \ldots, x_K\} \right\}
  The set of probability distributions obtained from NPI-M is not a credal set (see a counterexample in the paper): p = (3/9, 4/9, 2/27, 2/27, 2/27) belongs to the credal set defined by [3/9, 5/9]; [4/9, 6/9]; [0, 1/9]; [0, 1/9]; [0, 1/9], yet it is not possible to find a configuration of the probability wheel that corresponds to p. The A-NPI-M is a simplified approximate model derived from NPI-M by considering all probability distributions compatible with the set of lower and upper probabilities L obtained from NPI-M. ERCIM’11 London (UK) 14/28
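As a quick sanity check of the counterexample (a sketch of ours, not from the paper), the following verifies that p satisfies every interval in L and sums to one, so it lies in the A-NPI-M credal set even though the exact NPI-M excludes it:

```python
# Check the counterexample: p is inside the interval set L (and hence in
# the A-NPI-M credal set), although no probability-wheel configuration of
# the exact NPI-M realizes it.
intervals = {"B": (3/9, 5/9), "P": (4/9, 6/9),
             "R": (0, 1/9), "Y": (0, 1/9), "G": (0, 1/9)}
p = {"B": 3/9, "P": 4/9, "R": 2/27, "Y": 2/27, "G": 2/27}

in_L = (abs(sum(p.values()) - 1) < 1e-12
        and all(lo <= p[x] <= hi for x, (lo, hi) in intervals.items()))
print(in_L)  # True
```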
  • 29. Uncertainty Measures and Decision Trees Part III Uncertainty Measures and Decision Trees ERCIM’11 London (UK) 15/28
  • 30. Uncertainty Measures and Decision Trees Decision Trees A decision tree, also called a classification tree, is a simple structure that can be used as a classifier. Each node represents an attribute variable; each branch represents one of the states of this variable; each tree leaf specifies an expected value of the class variable. Example of a decision tree for three attribute variables Ai (i = 1, 2, 3), each with two possible values (0, 1), and a class variable C with cases or states c1, c2, c3: ERCIM’11 London (UK) 16/28
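As a rough illustration of this structure, here is a minimal sketch (class and field names are ours, not the authors' implementation):

```python
# Minimal sketch of the tree structure described above (names are ours):
# an internal node tests an attribute and branches on its values; a leaf
# stores the expected state of the class variable C.
class Node:
    def __init__(self, attribute=None, branches=None, leaf_class=None):
        self.attribute = attribute      # attribute tested here (None at a leaf)
        self.branches = branches or {}  # attribute value -> child Node
        self.leaf_class = leaf_class    # e.g. "c1", "c2" or "c3" at a leaf

# One possible tree over A1, A2, A3: the path A1 = 1, A3 = 0 leads to c3,
# matching the configuration sigma = (A1 = 1, A3 = 0) used below.
tree = Node("A1", {1: Node("A3", {0: Node(leaf_class="c3")}),
                   0: Node(leaf_class="c1")})
```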
  • 31. Uncertainty Measures and Decision Trees Building Decision Trees from data Basic Rule Associate with each node the most informative variable about the class C, provided that it has not already been selected in the path from the root to this node, and that it provides more information than if it had not been included. ERCIM’11 London (UK) 17/28
  • 32. Uncertainty Measures and Decision Trees Building Decision Trees from data Basic Rule Associate with each node the most informative variable about the class C, provided that it has not already been selected in the path from the root to this node, and that it provides more information than if it had not been included. We are given a data set D. Each node of the decision tree defines a set of probabilities P^σ for the class variable C, where σ is the configuration of the node, e.g. σ = (A1 = 1, A3 = 0) for the node labelled c3. ERCIM’11 London (UK) 17/28
  • 33. Uncertainty Measures and Decision Trees Building Decision Trees from data Basic Rule Associate with each node the most informative variable about the class C, provided that it has not already been selected in the path from the root to this node, and that it provides more information than if it had not been included. We are given a data set D. Each node of the decision tree defines a set of probabilities P^σ for the class variable C, where σ is the configuration of the node, e.g. σ = (A1 = 1, A3 = 0) for the node labelled c3. P^σ is obtained with the IDM, NPI-M or A-NPI-M from the data set D[σ]. ERCIM’11 London (UK) 17/28
  • 34. Uncertainty Measures and Decision Trees Imprecise Information Gain
  \mathrm{ImpInfGain} = TU(\mathcal{P}^{\sigma}) - \sum_{a_i \in A_i} r^{\sigma}_{a_i}\, TU\big(\mathcal{P}^{\sigma \cup (A_i = a_i)}\big)
  ERCIM’11 London (UK) 18/28
  • 35. Uncertainty Measures and Decision Trees Imprecise Information Gain
  \mathrm{ImpInfGain} = TU(\mathcal{P}^{\sigma}) - \sum_{a_i \in A_i} r^{\sigma}_{a_i}\, TU\big(\mathcal{P}^{\sigma \cup (A_i = a_i)}\big)
  TU is a total uncertainty measure, normally defined on credal sets; r^σ_{a_i} is the relative frequency with which A_i takes the value a_i in D[σ]; σ ∪ (A_i = a_i) is the result of adding the value A_i = a_i to the configuration σ. ERCIM’11 London (UK) 18/28
  • 36. Uncertainty Measures and Decision Trees Imprecise Information Gain
  \mathrm{ImpInfGain} = TU(\mathcal{P}^{\sigma}) - \sum_{a_i \in A_i} r^{\sigma}_{a_i}\, TU\big(\mathcal{P}^{\sigma \cup (A_i = a_i)}\big)
  TU is a total uncertainty measure, normally defined on credal sets; r^σ_{a_i} is the relative frequency with which A_i takes the value a_i in D[σ]; σ ∪ (A_i = a_i) is the result of adding the value A_i = a_i to the configuration σ. The Imprecise Information Gain thus compares the total uncertainty for configuration σ with the expected total uncertainty of σ ∪ (A_i = a_i); we add a new node (i.e. split) only if this difference is positive. ERCIM’11 London (UK) 18/28
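A minimal sketch of this split criterion (helper names are ours; total_uncertainty stands for whatever TU is chosen, e.g. the maximum entropy of the credal set built by the IDM, NPI-M or A-NPI-M from the class counts in D[σ]):

```python
# Minimal sketch of the Imprecise Information Gain split criterion.
# `data` is a list of dicts (one per instance); `total_uncertainty` maps a
# dict of class counts to TU of the corresponding credal set.
def class_counts(data, class_var):
    counts = {}
    for row in data:
        counts[row[class_var]] = counts.get(row[class_var], 0) + 1
    return counts

def imprecise_info_gain(data, attribute, class_var, total_uncertainty):
    tu_parent = total_uncertainty(class_counts(data, class_var))
    expected = 0.0
    for value in {row[attribute] for row in data}:
        subset = [row for row in data if row[attribute] == value]
        # r^sigma_{a_i}: relative frequency of this value in D[sigma]
        expected += (len(subset) / len(data)) * total_uncertainty(
            class_counts(subset, class_var))
    return tu_parent - expected  # split on `attribute` only if positive
```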
  • 37. Uncertainty Measures and Decision Trees Maximum Entropy Maximum Entropy is an aggregate uncertainty measure which aims to generalize the classic measures (the Hartley measure and the Shannon entropy). It satisfies all the required properties (additivity, subadditivity, monotonicity, proper range, etc.). It is defined as a functional S*:
  S^{*}(\mathcal{P}^{\sigma}) = \max_{p \in \mathcal{P}^{\sigma}} \left\{ -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) \right\}
  ERCIM’11 London (UK) 19/28
  • 38. Uncertainty Measures and Decision Trees Maximum Entropy Maximum Entropy is an aggregate uncertainty measure which aims to generalize the classic measures (the Hartley measure and the Shannon entropy). It satisfies all the required properties (additivity, subadditivity, monotonicity, proper range, etc.). It is defined as a functional S*:
  S^{*}(\mathcal{P}^{\sigma}) = \max_{p \in \mathcal{P}^{\sigma}} \left\{ -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) \right\}
  Efficient algorithms for different imprecise models have been proposed: credal sets (Abellán and Moral, 2003); IDM (Abellán and Moral, 2006); NPI-M and A-NPI-M (Abellán et al., 2010). ERCIM’11 London (UK) 19/28
  • 39. Uncertainty Measures and Decision Trees Maximum Entropy Maximum Entropy is an aggregate uncertainty measure which aims to generalize the classic measures (the Hartley measure and the Shannon entropy). It satisfies all the required properties (additivity, subadditivity, monotonicity, proper range, etc.). It is defined as a functional S*:
  S^{*}(\mathcal{P}^{\sigma}) = \max_{p \in \mathcal{P}^{\sigma}} \left\{ -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) \right\}
  Efficient algorithms for different imprecise models have been proposed: credal sets (Abellán and Moral, 2003); IDM (Abellán and Moral, 2006); NPI-M and A-NPI-M (Abellán et al., 2010). This method for building decision trees for classification can be similarly extended to other imprecise models. ERCIM’11 London (UK) 19/28
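The cited algorithms are model-specific and efficient; as a generic illustration only, the sketch below maximizes entropy over a credal set given by singleton probability intervals with an off-the-shelf optimizer (scipy; it assumes proper, non-degenerate intervals with sum of lower bounds at most 1):

```python
# Generic illustration only (not the efficient algorithms cited above):
# maximize Shannon entropy over a credal set defined by singleton
# probability intervals, using a generic constrained optimizer.
import numpy as np
from scipy.optimize import minimize

def max_entropy(intervals):
    """intervals: list of (lower, upper) bounds, one pair per category."""
    lo = np.array([l for l, u in intervals])
    hi = np.array([u for l, u in intervals])
    # Feasible start: spread the remaining mass 1 - sum(lo) over the
    # intervals proportionally to their widths (assumes some width > 0).
    x0 = lo + (1 - lo.sum()) * (hi - lo) / (hi - lo).sum()
    neg_entropy = lambda p: np.sum(p * np.log2(np.clip(p, 1e-12, None)))
    res = minimize(neg_entropy, x0, method="SLSQP",
                   bounds=list(zip(lo, hi)),
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}])
    return -res.fun

# NPI-M intervals from the {B, P, R, Y, G} example:
print(max_entropy([(3/9, 5/9), (4/9, 6/9), (0, 1/9), (0, 1/9), (0, 1/9)]))
```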
  • 40. Experimental Evaluation Part IV Experimental Evaluation ERCIM’11 London (UK) 20/28
  • 41. Experimental Evaluation Experimental Set-up The aim is to compare the use of the IDM with the use of the NPI-M and the A-NPI-M for building decision trees. ERCIM’11 London (UK) 21/28
  • 42. Experimental Evaluation Experimental Set-up The aim is to compare the use of the IDM with the use of the NPI-M and the A-NPI-M for building decision trees. Decision trees for supervised classification are built with the imprecise models IDM, NPI-M and A-NPI-M, using maximum entropy as the base uncertainty measure. The classic Shannon entropy measure is also employed: information gain and information gain ratio. Experiments are run on 40 UCI data sets using a 10-fold cross-validation scheme. Statistical tests for the comparison: corrected paired t-test at the 5% significance level; Friedman test at the 5% significance level. ERCIM’11 London (UK) 21/28
  • 43. Experimental Evaluation Evaluating Classification Accuracy
                 IDM    NPI-M  A-NPI-M  IG     IGR
  Av. Accuracy   76.73  76.77  76.77    74.33  75.66
  Friedman Rank  2.95   2.94   2.76     3.06   3.29
  The Friedman test accepts the hypothesis that all algorithms perform equally well. ERCIM’11 London (UK) 22/28
  • 44. Experimental Evaluation Evaluating the size of the decision trees The size of a decision tree is relevant: a decision tree is composed of a set of classification rules, and smaller trees imply more representative classification rules (lower risk of over-fitted predictions). ERCIM’11 London (UK) 23/28
  • 45. Experimental Evaluation Evaluating the size of the decision trees The size of a decision tree is relevant: a decision tree is composed of a set of classification rules, and smaller trees imply more representative classification rules (lower risk of over-fitted predictions).
                 IDM     NPI-M   A-NPI-M  IG       IGR
  Av. N. Nodes   610.16  589.52  589.81   1119.46  1288.56
  Friedman Rank  2.96    1.4     1.8      4.18     4.67
  The imprecise models provide much more compact representations than the precise methods. NPI-M and A-NPI-M build smaller decision trees than the IDM with s = 1 (its default value). ERCIM’11 London (UK) 23/28
  • 46. Experimental Evaluation Evaluating IDM with different s values The IDM depends on the parameter s, which affects the width of the intervals: s = 0 reduces to the information gain criterion, while higher s values give wider intervals and decrease the size of the trees. ERCIM’11 London (UK) 24/28
  • 47. Experimental Evaluation Evaluating IDM with different s values The IDM depends on the parameter s, which affects the width of the intervals: s = 0 reduces to the information gain criterion, while higher s values give wider intervals and decrease the size of the trees.
                 NPI-M  IDM-S1  IDM-S1.5  IDM-S2.0
  Av. N. Nodes   589.5  610.2   480.3     383.8
  Friedman Rank  2.4    3.85    2.43      1.28
  The IDM with s = 1.5 obtains decision trees of a size similar to NPI-M; the IDM with s = 2.0 obtains decision trees smaller than those of NPI-M. ERCIM’11 London (UK) 24/28
  • 48. Experimental Evaluation Evaluating IDM with different s values
                 NPI-M  IDM-S1  IDM-S1.5  IDM-S2.0
  Av. Accuracy   76.77  76.73   76.23     75.75
  Friedman Rank  2.25   2.26    2.46      3.03
  The IDM with s > 1 performs worse than NPI-M and than the IDM with s = 1. The NPI-M models obtain the best classification accuracy with the smallest decision trees. ERCIM’11 London (UK) 25/28
  • 49. Conclusions and Future Works Part V Conclusions and Future Works ERCIM’11 London (UK) 26/28
  • 50. Conclusions and Future Works Conclusions and Future Works We have presented an application of the NPI model for multinomial data to classification. A known scheme for building decision trees is used with two NPI-based models: an exact model (NPI-M) and an approximate model (A-NPI-M). Models based on the NPI model achieve classification accuracy similar to that of the best IDM-based model (obtained by varying its parameter s), but with smaller trees. ERCIM’11 London (UK) 27/28
  • 51. Conclusions and Future Works Conclusions and Future Works We have presented an application of the NPI model for multinomial data to classification. A known scheme for building decision trees is used with two NPI-based models: an exact model (NPI-M) and an approximate model (A-NPI-M). Models based on the NPI model achieve classification accuracy similar to that of the best IDM-based model (obtained by varying its parameter s), but with smaller trees. Future work: extend these methods to credal classification. ERCIM’11 London (UK) 27/28
  • 52. Conclusions and Future Works Thanks for your attention! Questions? ERCIM’11 London (UK) 28/28