The marginal likelihood of the data, computed using Bayesian score metrics, is at the core of score+search methods for learning Bayesian networks from data. However, common formulations of these Bayesian score metrics depend on free parameters which are hard to assess. Recent theoretical and experimental works have also shown that the commonly employed BDeu score metric is strongly biased by the particular assignment of its free parameter, known as the equivalent sample size, and that an optimal selection of this parameter depends on the underlying distribution. Because of this sensitivity, wrong choices of this parameter lead to inferred models which do not properly represent the distribution generating the data, even with large sample sizes. To overcome this issue we introduce an approach which marginalizes this free parameter with a simple averaging method. As experimentally shown, this approach robustly performs as well as an optimal selection of this parameter while preventing wrong settings of this widely applied Bayesian score metric.
1. Locally averaged Bayesian Dirichlet metrics
A. Cano, M. Gomez-Olmedo, A. R. Masegosa and S. Moral
Department of Computer Science and Artificial Intelligence
University of Granada (Spain)
Belfast, July 2011
European Conference on Symbolic and Quantitative Approaches to Reasoning under Uncertainty
4. Introduction
Bayesian Networks
Bayesian Networks
Excellent models to graphically represent the dependency structure of the
underlying distribution in multivariate domains.
This dependency structure in a multivariate problem domain represents a very
relevant source of knowledge (direct interactions, conditional
independencies...)
5. Introduction
Learning Bayesian Networks from Data
Learning Algorithms
Constraint-based learning, built on hypothesis-testing approaches such as the PC
algorithm.
Score+search methods, which employ a search algorithm guided by a score
function.
The model with the highest score is selected.
7. Introduction
Bayesian Score Metrics
Marginal Likelihood of the data
P(D|G) = \int P(D | \theta, G) \, P(\theta | G) \, d\theta
Bayesian Dirichlet Equivalent Metric (BDe)
It satisfies the likelihood equivalence property.
A global Dirichlet distribution is assumed in order to guarantee the likelihood
equivalence property.
The parametrization depends on the equivalent sample size (ESS) parameter.
score(G : D) = \prod_i \prod_{j=0}^{|U_i|} \frac{\Gamma\left(\frac{ESS}{|U_i|}\right)}{\Gamma\left(\frac{ESS}{|U_i|} + N_{ij}\right)} \prod_{k=1}^{|X_i|} \frac{\Gamma\left(\frac{ESS}{|U_i| |X_i|} + N_{ijk}\right)}{\Gamma\left(\frac{ESS}{|U_i| |X_i|}\right)}
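For concreteness, the following is a minimal sketch of this score in log space. It is our own illustration rather than code from the paper: the name bdeu_log_score and the assumed counts layout (one matrix per variable, with |U_i| parent configurations as rows and |X_i| states as columns) are hypothetical.

```python
# Minimal sketch of the (log) BDeu score under the stated assumptions: for
# each variable X_i the counts N_ijk are stored in a matrix of shape
# (|U_i|, |X_i|), rows indexing parent configurations, columns indexing states.
import numpy as np
from scipy.special import gammaln  # log Gamma, avoids numerical overflow


def bdeu_log_score(counts_per_var, ess):
    """Log of score(G : D) for a fixed structure, given its family counts."""
    log_score = 0.0
    for N in counts_per_var:
        q_i, r_i = N.shape                 # |U_i| parent configs, |X_i| states
        a_ij = ess / q_i                   # Dirichlet mass per parent config
        a_ijk = ess / (q_i * r_i)          # Dirichlet mass per cell
        N_ij = N.sum(axis=1)
        log_score += np.sum(gammaln(a_ij) - gammaln(a_ij + N_ij))
        log_score += np.sum(gammaln(a_ijk + N) - gammaln(a_ijk))
    return log_score
```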
8. Introduction
Sensitivity to ESS parameter
Experimental Evaluations [Silander et al. 2007]
The global MAP BN was computed with an exhaustive search based algorithm
for 20 UCI data sets.
They found that different ESS values lead to different optimal BN models.
For some data sets (e.g. Yeast database) the optimal BN model monotonically
goes from the empty to the fully connected graph.
[Figure: number of arcs in the optimal BN vs. the ESS value]
10. Introduction
Our approach
Solution: Marginalizing the ESS parameter
As first suggested in [Silander et al. 2007], a possible solution is to employ a
Bayesian approach:
Assume a prior distribution on the ESS parameter and marginalize it
out.
Locally Averaged Bayesian Dirichlet Metrics
It is based on a local averaging approach to marginalize the ESS parameter.
We experimentally justify that this approach is superior:
It is able to adapt to more complex parameter spaces.
It removes the sensitivity of the Bayesian Dirichlet metric to the ESS
parameter.
12. Bayesian Dirichlet Metrics
Notation
Let X = (X1, ..., Xn) be a set of n
multinomial random variables.
|Xi| is the number of values of Xi.
We also assume a fully observed
multinomial data set D.
A Bayesian Network B can be described by:
G is a directed acyclic graph.
G = (Pa(X1), ..., Pa(Xn)).
θG is a set of parameter vectors:
P(Xi | Pa(Xi) = j) = θij.
15. Bayesian Dirichlet Metrics
Bayesian Dirichlet equivalent metric
Marginal Likelihood of a graph structure:
P(D|G) = \int P(D | \theta, G) \, P(\theta | G) \, d\theta
It is computed under the following assumptions:
Complete labelled training data.
The prior distributions over the parameters are Dirichlet distributions.
\theta_{ij} \sim Dirichlet(\alpha_{ij1}, ..., \alpha_{ij|X_i|})
Parameters are globally and locally independent:
scoreBDeu(G|D) = \prod_{i=1}^{n} \prod_{j=0}^{|Pa_G(X_i)|} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{|X_i|} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}
The BDe metric sets the alpha values as follows, in order to guarantee the likelihood
equivalence property:
\alpha_{ijk} = \frac{S}{|X_i| \, |Pa(X_i)|}
17. Bayesian Dirichlet Metrics
Sensitivity to the ESS
The problem is that the \alpha_{ijk} values become exponentially small with the
number of parents and with their cardinality: \alpha_{ijk} = \frac{S}{|X_i| \, |Pa(X_i)|}.
For example, with S = 1, a binary X_i with five binary parents already gives \alpha_{ijk} = 1/(2 \cdot 2^5) \approx 0.016.
[Figure: Beta(1,1), Beta(0.5, 0.5), Beta(0.25, 0.25) and Beta(0.125, 0.125) prior densities]
[Steck & Jaakkola 2002, Steck 2008, Ueno 2010]: small \alpha_{ijk} values tend to favor
the absence of an edge Y −→ X over its presence (even if X and Y are not
conditionally independent).
Especially if the empirical \hat{P}(X|Y) is not very extreme (i.e. it does not match
the prior assumptions).
19. Bayesian Dirichlet Metrics
Sensitivity to the ESS
If we increase the S value, we implicitly assume that the marginal distributions
P(Xi) = θi are very symmetrical.
[Figure: Beta(1,1), Beta(2, 2), Beta(4, 4) and Beta(8, 8) prior densities]
[Steck & Jaakkola 2002, Steck 2008, Ueno 2010]: larger S values tend to favor the
presence of an edge Y −→ X over its absence (even if X and Y are conditionally
independent).
Especially if there is notable skewness in both distributions
P(X|PaX) and P(Y|PaY).
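The following toy computation illustrates these two tendencies; the counts are our own hypothetical example (not data from the paper) and it reuses the bdeu_log_score sketch from the Introduction. It prints, for several S values, the log-score difference between keeping and dropping the edge Y → X, so one can see how the decision changes with S.

```python
# Hypothetical counts for two binary variables X and Y: rows index the value
# of Y, columns the value of X.
import numpy as np

N_xy = np.array([[60, 40],
                 [45, 55]])
N_x = N_xy.sum(axis=0, keepdims=True)         # counts of X ignoring Y

for s in (0.1, 1.0, 10.0, 100.0):
    with_edge = bdeu_log_score([N_xy], s)     # family Y -> X
    without_edge = bdeu_log_score([N_x], s)   # X with no parents
    print(f"S={s}: log-score(edge) - log-score(no edge) = "
          f"{with_edge - without_edge:.3f}")
```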
20. Locally Averaged Bayesian Dirichlet Metrics
Part III
Locally Averaged Bayesian
Dirichlet Metrics
23. Locally Averaged Bayesian Dirichlet Metrics
Globally Averaged Bayesian Dirichlet Metrics
[Silander et al. 2007] proposed a Bayesian solution to the problem of selecting an
optimal ESS:
Consider S as a random variable, place a prior on S and marginalize it out:
P(D|G) = \int P(D|G, s) \, P(s|G) \, ds
where P(D|G, s) is the classic marginal likelihood, which depends on the
equivalent sample size.
It is assumed that P(S|G) is uniform and that the integral is approximated by a simple
averaging method:
P(D|G) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \prod_i \prod_{j=0}^{|U_i|} \frac{\Gamma\left(\frac{s}{|U_i|}\right)}{\Gamma\left(\frac{s}{|U_i|} + N_{ij}\right)} \prod_{k=1}^{|X_i|} \frac{\Gamma\left(\frac{s}{|U_i| |X_i|} + N_{ijk}\right)}{\Gamma\left(\frac{s}{|U_i| |X_i|}\right)}
where \mathcal{S} is a finite set of S values.
It satisfies the likelihood equivalence property but it is not locally decomposable.
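As a sketch, again reusing the hypothetical bdeu_log_score from the Introduction, the globally averaged score can be computed in log space with a log-sum-exp over the grid of S values:

```python
# Sketch of the globally averaged metric: a uniform average of the BDeu
# marginal likelihood over a finite grid of ESS values, computed in log space.
import numpy as np
from scipy.special import logsumexp


def global_avbd_log_score(counts_per_var, s_grid):
    log_terms = [bdeu_log_score(counts_per_var, s) for s in s_grid]
    # log( (1/|S|) * sum_s P(D | G, s) )
    return logsumexp(log_terms) - np.log(len(s_grid))
```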
24. Locally Averaged Bayesian Dirichlet Metrics
Sensitivity to the ESS
A toy example:
Z and Y have very skewed marginal
distributions.
P(X|Z) is not notably far from uniform.
We generate 1000 data samples.
We evaluate the BN with the highest score.
27. Locally Averaged Bayesian Dirichlet Metrics
Globally Averaged Bayesian Dirichlet Metrics
Different averaging set values, SL, were tested:
S_1 = {0.5, 1, 2}, S_2 = {0.25, 0.5, 1, 2, 4}, ...,
S_{10} = {2^{-10}, 2^{-9}, ..., 2^{9}, 2^{10}}.
S << 1 (very skewed), S < 1 (skewed), S ≈ 1
(uniform), S >> 1 (strongly uniform).
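A plausible construction of these averaging grids, assuming S_L = {2^{-L}, ..., 2^{L}} as the listed sets suggest:

```python
# Averaging grid S_L = {2^-L, ..., 1, ..., 2^L}; s_grid(1) == [0.5, 1.0, 2.0].
def s_grid(L):
    return [2.0 ** e for e in range(-L, L + 1)]
```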
Results
It always retrieves the empty graph without any edge.
Reasons:
We assume a global distribution (strongly uniform, uniform, skewed or
very skewed) for all parameters at the same time.
This assumption does not fit the parameter space of this Bayesian
network.
30. Locally Averaged Bayesian Dirichlet Metrics
Locally Averaged Bayesian Dirichlet Metrics
Locally Averaged Bayesian Dirichlet Metrics
The marginalization of the parameter S is carried out locally:
We assume that each parameter vector θij is drawn from a different Dirichlet distribution, and that the
corresponding S parameters are independent of each other.
P(D|G) = \prod_i \prod_{j=0}^{|Pa(X_i)|} \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \frac{\Gamma\left(\frac{s}{|Pa(X_i)|}\right)}{\Gamma\left(\frac{s}{|Pa(X_i)|} + N_{ij}\right)} \prod_{k=1}^{|X_i|} \frac{\Gamma\left(\frac{s}{|Pa(X_i)| |X_i|} + N_{ijk}\right)}{\Gamma\left(\frac{s}{|Pa(X_i)| |X_i|}\right)}
where \mathcal{S} is a finite set of S values.
The metric is now locally decomposable, but it loses the likelihood equivalence
property.
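A sketch of this metric in log space is given below, under the same counts layout assumed in the earlier BDeu sketch; placing the average over the S grid inside the product over parent configurations is what makes the score locally decomposable.

```python
# Sketch of the locally averaged metric: the average over the ESS grid is taken
# independently for each local term (variable i, parent configuration j).
import numpy as np
from scipy.special import gammaln, logsumexp


def local_avbd_log_score(counts_per_var, s_grid):
    log_score = 0.0
    for N in counts_per_var:                  # shape (parent configs, |X_i|)
        q_i, r_i = N.shape
        N_ij = N.sum(axis=1)
        local = np.empty((q_i, len(s_grid)))  # log term per (config j, value s)
        for c, s in enumerate(s_grid):
            a_ij, a_ijk = s / q_i, s / (q_i * r_i)
            local[:, c] = (gammaln(a_ij) - gammaln(a_ij + N_ij)
                           + np.sum(gammaln(a_ijk + N) - gammaln(a_ijk), axis=1))
        # average over the grid locally, then combine the local terms
        log_score += np.sum(logsumexp(local, axis=1) - np.log(len(s_grid)))
    return log_score
```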
33. Locally Averaged Bayesian Dirichlet Metrics
Locally Averaged Bayesian Dirichlet Metrics
Different averaging set values, SL, were tested:
S_1 = {0.5, 1, 2}, S_2 = {0.25, 0.5, 1, 2, 4}, ...,
S_{10} = {2^{-10}, 2^{-9}, ..., 2^{9}, 2^{10}}.
S << 1 (very skewed), S < 1 (skewed), S ≈ 1
(uniform), S >> 1 (strongly uniform).
Results
When L ≥ 5 we always retrieve the right graph.
We assume that each parameter vector follows a different Dirichlet
distribution (strongly uniform, uniform, skewed or very skewed), independently
of the rest of the parameters.
This assumption allows us to fit much more complex parameter spaces.
38. Experimental Evaluation
Experimental Set-up
Bayesian Networks:
alarm (37 nodes), boblo (23 nodes), boerlage-92 (23 nodes), hailfinder (56
nodes), insurance (27 nodes).
Data Sets:
We run 10 times the algorithms with 1000 data samples (other data
samples sizes were evaluated).
Evaluation Measures
Number of missing/extra links, Kullback-Leibler distance....
Algorithms
A greedy search algorithm is used, assuming we are given a correct
topological order of the variables (a sketch is given below).
Different S_L sets are used to perform the averaging: L = 1, ..., 10 (displayed on
the x-axis).
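The sketch below shows one plausible form of such a greedy search given a topological order; the local_score(x, parents) callable and the exact parent-selection loop are our own assumptions, not necessarily the implementation used in the experiments.

```python
# Greedy parent selection along a given topological order, using any locally
# decomposable score. local_score(x, parent_set) is an assumed callable that
# returns the log local score of variable x with the given parent set.
def greedy_parents(local_score, order):
    parents = {}
    for pos, x in enumerate(order):
        pa = set()
        improved = True
        while improved:
            improved = False
            best_gain, best_cand = 0.0, None
            for cand in order[:pos]:          # only earlier variables allowed
                if cand in pa:
                    continue
                gain = local_score(x, pa | {cand}) - local_score(x, pa)
                if gain > best_gain:
                    best_gain, best_cand = gain, cand
            if best_cand is not None:
                pa.add(best_cand)
                improved = True
        parents[x] = pa
    return parents
```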
39. Experimental Evaluation
BDe with different S values I
[Figures: missing+extra links and KL distance vs. the log of the S value, for Alarm, Boblo, Boerlage, Hailfinder and Insurance]
Analysis
The BDe metric is very sensitive to the S value in some problem domains.
There is an optimal S value, which is different for each problem.
40. Experimental Evaluation
BDe with different S values II
[Figures: missing links and extra links vs. the log of the S value, for Alarm, Boblo, Boerlage, Hailfinder and Insurance]
Analysis
The theoretically predicted tendencies appear:
Higher S values have a tendency to add edges.
Lower S values have a tendency to remove edges.
41. Experimental Evaluation
Locally Averaged Bayesian Dirichlet metrics
[Figures: missing+extra links and KL distance vs. the L value, for Alarm, Boblo, Boerlage, Hailfinder and Insurance]
Analysis
The higher the L value, the wider the set of averaged S values.
In some domains, the error measures improve with the size of the averaged S
set.
In other domains, the error does not improve, but it does not get worse either.
43. Experimental Evaluation
Globally vs Locally Averaged Bayesian Dirichlet metrics
Global-AvBD error minus Local-AvBD error
[Figure: Global-AvBD error minus Local-AvBD error (missing+extra links) vs. the L value, for Alarm, Boblo, Boerlage, Hailfinder and Insurance]
Analysis
In Alarm, Boblo and Boerlage, there are hardly any differences between them.
In Hailfinder and Insurance, the Local-AvBD metric performs better.
The performance gap depends on the complexity of the parameter space.
44. Experimental Evaluation
BDe metric vs Locally Averaged Bayesian Dirichlet metrics
BD error minus Local-AvBD error
[Figure: BD error minus Local-AvBD error (missing+extra links) vs. the L value, for Alarm, Boblo, Boerlage, Hailfinder and Insurance]
Analysis
For the BD metric, the model with the lowest error over all S values in the set S_L
is selected.
The Local-AvBD metric performs at least as well as the BD metric with an optimal S
value.
In some domains (Hailfinder and Insurance), the Local-AvBD metric produces
better inferences.
45. Conclusions and Future Works
Part V
Conclusions and Future Works
47. Conclusions and Future Works
Conclusions and Future Works
Conclusions
Locally Averaged Bayesian Dirichlet metrics robustly infer more accurate
models than the BDe metric with an optimal selection of the ESS parameter.
They are able to adapt to complex parameter spaces.
This metric is valuable for knowledge discovery tasks: the inferences do not
depend on any free parameter and it matches the performance of an optimal
choice.
Future Works
Extend this method to the parameter estimation of a BN model:
P(X_i = k \mid Pa(X_i) = j) = \frac{N_{ijk} + \frac{S}{|X_i| \, |Pa(X_i)|}}{N_{ij} + \frac{S}{|Pa(X_i)|}}
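As an illustration of what this estimator computes for a single family, under the same counts layout as the earlier sketches (our own sketch, not code from the paper):

```python
# BDe-style parameter estimate for one family:
#   theta_ijk = (N_ijk + S / (|X_i| * |Pa(X_i)|)) / (N_ij + S / |Pa(X_i)|)
import numpy as np


def smoothed_cpt(N, ess):
    """N has shape (number of parent configurations, number of states of X_i)."""
    q_i, r_i = N.shape
    num = N + ess / (q_i * r_i)
    den = N.sum(axis=1, keepdims=True) + ess / q_i
    return num / den                          # rows sum to 1
```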
48. Conclusions and Future Works
Thanks for your attention!