The marginal likelihood of the data, computed using Bayesian score metrics, is at the core of score+search methods for learning Bayesian networks from data. However, common formulations of these Bayesian score metrics depend on free parameters which are hard to assess. Recent theoretical and experimental works have also shown that the commonly employed BDeu score metric is strongly biased by the particular assignment of its free parameter, known as the equivalent sample size, and that an optimal selection of this parameter depends on the underlying distribution. Because of this sensitivity, wrong choices of this parameter lead to inferred models which do not properly represent the distribution generating the data, even with large sample sizes. To overcome this issue we introduce an approach which marginalizes this free parameter with a simple averaging method. As experimentally shown, this approach robustly performs as well as an optimal selection of this parameter while preventing wrong settings of this widely applied Bayesian score metric.
1. Locally averaged Bayesian Dirichlet metrics
A. Cano, M. Gomez-Olmedo, A. R. Masegosa and S. Moral
Department of Computer Science and Artificial Intelligence
University of Granada (Spain)
Belfast, July 2011
European Conference on Symbolic and Quantitative Approaches to Reasoning under Uncertainty
4. Introduction
Bayesian Networks
Bayesian Networks
Excellent models to graphically represent the dependency structure of the
underlying distribution in multivariate domains.
This dependency structure in a multivariate problem domain represents a very
relevant source of knowledge (direct interactions, conditional
independencies...)
5. Introduction
Learning Bayesian Networks from Data
Learning Algorithms
Constraint-based learning, built on hypothesis-testing approaches such as the PC
algorithm.
Score+search methods, which employ a search algorithm guided by a score
function.
The model with the highest score is selected.
7. Introduction
Bayesian Score Metrics
Marginal Likelihood of the data
P(D|G) = \int P(D | \theta, G) \, P(\theta | G) \, d\theta
Bayesian Dirichlet Equivalent Metric (BDe)
It satisfies the likelihood equivalence property.
A global Dirichlet distribution is assumed in order to guarantee the likelihood
equivalence property.
The parametrization depends on the equivalent sample size (ESS) parameter.
score(G : D) = \prod_i \prod_{j=0}^{|U_i|} \frac{\Gamma\left(\frac{ESS}{|U_i|}\right)}{\Gamma\left(\frac{ESS}{|U_i|} + N_{ij}\right)} \prod_{k=1}^{|X_i|} \frac{\Gamma\left(\frac{ESS}{|U_i| |X_i|} + N_{ijk}\right)}{\Gamma\left(\frac{ESS}{|U_i| |X_i|}\right)}
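For concreteness, the following is a minimal sketch of this score in log space. It is our own illustration rather than code from the paper: the name bdeu_log_score and the assumed counts layout (one matrix per variable, with |U_i| parent configurations as rows and |X_i| states as columns) are hypothetical.

```python
# Minimal sketch of the (log) BDeu score under the stated assumptions: for
# each variable X_i the counts N_ijk are stored in a matrix of shape
# (|U_i|, |X_i|), rows indexing parent configurations, columns indexing states.
import numpy as np
from scipy.special import gammaln  # log Gamma, avoids numerical overflow


def bdeu_log_score(counts_per_var, ess):
    """Log of score(G : D) for a fixed structure, given its family counts."""
    log_score = 0.0
    for N in counts_per_var:
        q_i, r_i = N.shape                 # |U_i| parent configs, |X_i| states
        a_ij = ess / q_i                   # Dirichlet mass per parent config
        a_ijk = ess / (q_i * r_i)          # Dirichlet mass per cell
        N_ij = N.sum(axis=1)
        log_score += np.sum(gammaln(a_ij) - gammaln(a_ij + N_ij))
        log_score += np.sum(gammaln(a_ijk + N) - gammaln(a_ijk))
    return log_score
```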
8. Introduction
Sensitivity to ESS parameter
Experimental Evaluations [Silander et al. 2007]
The global MAP BN was computed with an exhaustive search based algorithm
for 20 UCI data sets.
They found that different ESS values lead to different optimal BN models.
For some data sets (e.g. Yeast database) the optimal BN model monotonically
goes from the empty to the fully connected graph.
[Figure: number of arcs in the optimal BN vs. the ESS value]
10. Introduction
Our approach
Solution: Marginalizing the ESS parameter
As first suggested in [Silander et al. 2007], a possible solution is to employ a
Bayesian approach:
Assume a prior distribution on the ESS parameter and marginalize it
out.
Locally Averaged Bayesian Dirichlet Metrics
It is based on a local averaging approach to marginalize the ESS parameter.
We experimentally justify that this approach is superior:
It is able to adapt to more complex parameter spaces.
It removes the sensitivity of the Bayesian Dirichlet metric to the ESS
parameter.
12. Bayesian Dirichlet Metrics
Notation
Let X = (X1, ..., Xn) be a set of n
multinomial random variables.
|Xi| is the number of values of Xi.
We also assume a fully observed
multinomial data set D.
A Bayesian Network B can be described by:
G is a directed acyclic graph.
G = (Pa(X1), ..., Pa(Xn)).
θG is a set of parameter vectors:
P(Xi | Pa(Xi) = j) = θij.
15. Bayesian Dirichlet Metrics
Bayesian Dirichlet equivalent metric
Marginal Likelihood of a graph structure:
P(D|G) = \int P(D | \theta, G) \, P(\theta | G) \, d\theta
It is computed under the following assumptions:
Complete labelled training data.
The prior distributions over the parameters are Dirichlet distributions.
\theta_{ij} \sim Dirichlet(\alpha_{ij1}, ..., \alpha_{ij|X_i|})
Parameters are globally and locally independent:
scoreBDeu(G|D) = \prod_{i=1}^{n} \prod_{j=0}^{|Pa_G(X_i)|} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{|X_i|} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}
The BDe metric sets the alpha values as follows, in order to guarantee the likelihood
equivalence property:
\alpha_{ijk} = \frac{S}{|X_i| \, |Pa(X_i)|}
17. Bayesian Dirichlet Metrics
Sensitivity to the ESS
The problem is that the \alpha_{ijk} values become exponentially small with the
number of parents and with their cardinality: \alpha_{ijk} = \frac{S}{|X_i| \, |Pa(X_i)|}.
For example, with S = 1, a binary X_i with five binary parents already gives \alpha_{ijk} = 1/(2 \cdot 2^5) \approx 0.016.
[Figure: Beta(1,1), Beta(0.5, 0.5), Beta(0.25, 0.25) and Beta(0.125, 0.125) prior densities]
[Steck & Jaakkola 2002, Steck 2008, Ueno 2010]: small \alpha_{ijk} values tend to favor
the absence of an edge Y −→ X over its presence (even if X and Y are not
conditionally independent).
Especially if the empirical \hat{P}(X|Y) is not very extreme (i.e. it does not match
the prior assumptions).
19. Bayesian Dirichlet Metrics
Sensitivity to the ESS
If we increase the S value, we implicitly assume that the marginal distributions
P(Xi) = θi are very symmetrical.
[Figure: Beta(1,1), Beta(2, 2), Beta(4, 4) and Beta(8, 8) prior densities]
[Steck & Jaakkola 2002, Steck 2008, Ueno 2010]: larger S values tend to favor the
presence of an edge Y −→ X over its absence (even if X and Y are conditionally
independent).
Especially if there is notable skewness in both distributions
P(X|PaX) and P(Y|PaY).
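The following toy computation illustrates these two tendencies; the counts are our own hypothetical example (not data from the paper) and it reuses the bdeu_log_score sketch from the Introduction. It prints, for several S values, the log-score difference between keeping and dropping the edge Y → X, so one can see how the decision changes with S.

```python
# Hypothetical counts for two binary variables X and Y: rows index the value
# of Y, columns the value of X.
import numpy as np

N_xy = np.array([[60, 40],
                 [45, 55]])
N_x = N_xy.sum(axis=0, keepdims=True)         # counts of X ignoring Y

for s in (0.1, 1.0, 10.0, 100.0):
    with_edge = bdeu_log_score([N_xy], s)     # family Y -> X
    without_edge = bdeu_log_score([N_x], s)   # X with no parents
    print(f"S={s}: log-score(edge) - log-score(no edge) = "
          f"{with_edge - without_edge:.3f}")
```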
20. Locally Averaged Bayesian Dirichlet Metrics
Part III
Locally Averaged Bayesian
Dirichlet Metrics
23. Locally Averaged Bayesian Dirichlet Metrics
Globally Averaged Bayesian Dirichlet Metrics
[Silander et al. 2007] proposed a Bayesian solution to the problem of selecting an
optimal ESS:
Consider S as a random variable, place a prior on S and marginalize it out:
P(D|G) = \int P(D|G, s) \, P(s|G) \, ds
where P(D|G, s) is the classic marginal likelihood, which depends on the
equivalent sample size.
It is assumed that P(S|G) is uniform and that the integral is approximated by a simple
averaging method:
P(D|G) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \prod_i \prod_{j=0}^{|U_i|} \frac{\Gamma\left(\frac{s}{|U_i|}\right)}{\Gamma\left(\frac{s}{|U_i|} + N_{ij}\right)} \prod_{k=1}^{|X_i|} \frac{\Gamma\left(\frac{s}{|U_i| |X_i|} + N_{ijk}\right)}{\Gamma\left(\frac{s}{|U_i| |X_i|}\right)}
where \mathcal{S} is a finite set of S values.
It satisfies the likelihood equivalence property but it is not locally decomposable.
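As a sketch, again reusing the hypothetical bdeu_log_score from the Introduction, the globally averaged score can be computed in log space with a log-sum-exp over the grid of S values:

```python
# Sketch of the globally averaged metric: a uniform average of the BDeu
# marginal likelihood over a finite grid of ESS values, computed in log space.
import numpy as np
from scipy.special import logsumexp


def global_avbd_log_score(counts_per_var, s_grid):
    log_terms = [bdeu_log_score(counts_per_var, s) for s in s_grid]
    # log( (1/|S|) * sum_s P(D | G, s) )
    return logsumexp(log_terms) - np.log(len(s_grid))
```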
24. Locally Averaged Bayesian Dirichlet Metrics
Sensitivity to the ESS
A toy example:
Z and Y have very skewed marginal
distributions.
P(X|Z) is not notably far from uniform.
We generate 1000 data samples.
We evaluate the BN with the highest score.
27. Locally Averaged Bayesian Dirichlet Metrics
Globally Averaged Bayesian Dirichlet Metrics
Different averaging set values, SL, were tested:
S_1 = {0.5, 1, 2}, S_2 = {0.25, 0.5, 1, 2, 4}, ...,
S_{10} = {2^{-10}, 2^{-9}, ..., 2^{9}, 2^{10}}.
S << 1 (very skewed), S < 1 (skewed), S ≈ 1
(uniform), S >> 1 (strongly uniform).
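A plausible construction of these averaging grids, assuming S_L = {2^{-L}, ..., 2^{L}} as the listed sets suggest:

```python
# Averaging grid S_L = {2^-L, ..., 1, ..., 2^L}; s_grid(1) == [0.5, 1.0, 2.0].
def s_grid(L):
    return [2.0 ** e for e in range(-L, L + 1)]
```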
Results
It always retrieves the empty graph without any edge.
Reasons:
We assume a global distribution (strongly uniform, uniform, skewed or
very skewed) for all parameters at the same time.
This assumption does not fit the parameter space of this Bayesian
network.
30. Locally Averaged Bayesian Dirichlet Metrics
Locally Averaged Bayesian Dirichlet Metrics
Locally Averaged Bayesian Dirichlet Metrics
The marginalization of the parameter S is carried out locally:
We assume that each parameter vector θij is drawn from a different Dirichlet distribution, and that the
corresponding S parameters are independent of each other.
P(D|G) = \prod_i \prod_{j=0}^{|Pa(X_i)|} \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \frac{\Gamma\left(\frac{s}{|Pa(X_i)|}\right)}{\Gamma\left(\frac{s}{|Pa(X_i)|} + N_{ij}\right)} \prod_{k=1}^{|X_i|} \frac{\Gamma\left(\frac{s}{|Pa(X_i)| |X_i|} + N_{ijk}\right)}{\Gamma\left(\frac{s}{|Pa(X_i)| |X_i|}\right)}
where \mathcal{S} is a finite set of S values.
The metric is now locally decomposable, but it loses the likelihood equivalence
property.
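A sketch of this metric in log space is given below, under the same counts layout assumed in the earlier BDeu sketch; placing the average over the S grid inside the product over parent configurations is what makes the score locally decomposable.

```python
# Sketch of the locally averaged metric: the average over the ESS grid is taken
# independently for each local term (variable i, parent configuration j).
import numpy as np
from scipy.special import gammaln, logsumexp


def local_avbd_log_score(counts_per_var, s_grid):
    log_score = 0.0
    for N in counts_per_var:                  # shape (parent configs, |X_i|)
        q_i, r_i = N.shape
        N_ij = N.sum(axis=1)
        local = np.empty((q_i, len(s_grid)))  # log term per (config j, value s)
        for c, s in enumerate(s_grid):
            a_ij, a_ijk = s / q_i, s / (q_i * r_i)
            local[:, c] = (gammaln(a_ij) - gammaln(a_ij + N_ij)
                           + np.sum(gammaln(a_ijk + N) - gammaln(a_ijk), axis=1))
        # average over the grid locally, then combine the local terms
        log_score += np.sum(logsumexp(local, axis=1) - np.log(len(s_grid)))
    return log_score
```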
33. Locally Averaged Bayesian Dirichlet Metrics
Locally Averaged Bayesian Dirichlet Metrics
Different averaging set values, SL, were tested:
S_1 = {0.5, 1, 2}, S_2 = {0.25, 0.5, 1, 2, 4}, ...,
S_{10} = {2^{-10}, 2^{-9}, ..., 2^{9}, 2^{10}}.
S << 1 (very skewed), S < 1 (skewed), S ≈ 1
(uniform), S >> 1 (strongly uniform).
Results
When L ≥ 5 we always retrieve the right graph.
We assume that each parameter vector follows a different Dirichlet
distribution (strongly uniform, uniform, skewed or very skewed), independently
of the rest of the parameters.
This assumption allows us to fit much more complex parameter spaces.
38. Experimental Evaluation
Experimental Set-up
Bayesian Networks:
alarm (37 nodes), boblo (23 nodes), boerlage-92 (23 nodes), hailfinder (56
nodes), insurance (27 nodes).
Data Sets:
We run 10 times the algorithms with 1000 data samples (other data
samples sizes were evaluated).
Evaluation Measures
Number of missing/extra links, Kullback-Leibler distance....
Algorithms
A greedy search algorithm is used, assuming we are given a correct
topological order of the variables (a sketch is given below).
Different S_L sets are used to perform the averaging: L = 1, ..., 10 (displayed on
the x-axis).
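The sketch below shows one plausible form of such a greedy search given a topological order; the local_score(x, parents) callable and the exact parent-selection loop are our own assumptions, not necessarily the implementation used in the experiments.

```python
# Greedy parent selection along a given topological order, using any locally
# decomposable score. local_score(x, parent_set) is an assumed callable that
# returns the log local score of variable x with the given parent set.
def greedy_parents(local_score, order):
    parents = {}
    for pos, x in enumerate(order):
        pa = set()
        improved = True
        while improved:
            improved = False
            best_gain, best_cand = 0.0, None
            for cand in order[:pos]:          # only earlier variables allowed
                if cand in pa:
                    continue
                gain = local_score(x, pa | {cand}) - local_score(x, pa)
                if gain > best_gain:
                    best_gain, best_cand = gain, cand
            if best_cand is not None:
                pa.add(best_cand)
                improved = True
        parents[x] = pa
    return parents
```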
39. Experimental Evaluation
BDe with different S values I
[Figures: missing+extra links and KL distance vs. the log of the S value, for Alarm, Boblo, Boerlage, Hailfinder and Insurance]
Analysis
The BDe metric is very sensitive to the S value in some problem domains.
There is an optimal S value, which is different for each problem.
40. Experimental Evaluation
BDe with different S values II
[Figures: missing links and extra links vs. the log of the S value, for Alarm, Boblo, Boerlage, Hailfinder and Insurance]
Analysis
The theoretically predicted tendencies appear:
Higher S values have a tendency to add edges.
Lower S values have a tendency to remove edges.
41. Experimental Evaluation
Locally Averaged Bayesian Dirichlet metrics
[Figures: missing+extra links and KL distance vs. the L value, for Alarm, Boblo, Boerlage, Hailfinder and Insurance]
Analysis
The higher the L value, the wider the set of averaged S values.
In some domains, the error measures improve with the size of the averaged S
set.
In other domains, the error does not improve, but it does not get worse either.
43. Experimental Evaluation
Globally vs Locally Averaged Bayesian Dirichlet metrics
Global-AvBD error minus Local-AvBD error
[Figure: Global-AvBD error minus Local-AvBD error (missing+extra links) vs. the L value, for Alarm, Boblo, Boerlage, Hailfinder and Insurance]
Analysis
In Alarm, Boblo and Boerlage, there are hardly any differences between them.
In Hailfinder and Insurance, the Local-AvBD metric performs better.
The performance gap depends on the complexity of the parameter space.
44. Experimental Evaluation
BDe metric vs Locally Averaged Bayesian Dirichlet metrics
BD error minus Local-AvBD error
[Figure: BD error minus Local-AvBD error (missing+extra links) vs. the L value, for Alarm, Boblo, Boerlage, Hailfinder and Insurance]
Analysis
For the BD metric, the model with the lowest error over all S values in the set S_L
is selected.
The Local-AvBD metric performs at least as well as the BD metric with an optimal S
value.
In some domains (Hailfinder and Insurance), the Local-AvBD metric produces
better inferences.
45. Conclusions and Future Works
Part V
Conclusions and Future Works
47. Conclusions and Future Works
Conclusions and Future Works
Conclusions
Locally Averaged Bayesian Dirichlet metrics robustly infer more accurate
models than the BDe metric with an optimal selection of the ESS parameter.
They are able to adapt to complex parameter spaces.
This metric is valuable for knowledge discovery tasks: the inferences do not
depend on any free parameter and it matches the performance of an optimal
choice.
Future Works
Extend this method to the parameter estimation of a BN model:
P(X_i = k \mid Pa(X_i) = j) = \frac{N_{ijk} + \frac{S}{|X_i| \, |Pa(X_i)|}}{N_{ij} + \frac{S}{|Pa(X_i)|}}
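As an illustration of what this estimator computes for a single family, under the same counts layout as the earlier sketches (our own sketch, not code from the paper):

```python
# BDe-style parameter estimate for one family:
#   theta_ijk = (N_ijk + S / (|X_i| * |Pa(X_i)|)) / (N_ij + S / |Pa(X_i)|)
import numpy as np


def smoothed_cpt(N, ess):
    """N has shape (number of parent configurations, number of states of X_i)."""
    q_i, r_i = N.shape
    num = N + ess / (q_i * r_i)
    den = N.sum(axis=1, keepdims=True) + ess / q_i
    return num / den                          # rows sum to 1
```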
48. Conclusions and Future Works
Thanks for your attention!