This document summarizes a paper on using a Bayesian network model for link-based text classification. The authors propose learning the interactions between document categories from data using a Bayesian network, rather than fixing the network structure. They model the probability of a document's category given both its content and the categories of linked documents. Experimental results show their Bayesian network approach improves over baseline classifiers, particularly providing around a 10% boost when used with an OR gate classifier. The authors conclude the model is valuable and parametrizable, and they discuss exploring other base classifiers and more experiments.
Outline: Introduction, Our solution, The Bayesian network model, Results, Conclusions and future works
Link-based text classification using Bayesian networks
Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, Andrés R. Masegosa, Alfonso E. Romero
{lci,jmfluna,jhg,andrew,aeromero}@decsai.ugr.es
Departamento de Ciencias de la Computación e Inteligencia Artificial
E.T.S.I. Informática y de Telecomunicación,
CITIC-UGR, Universidad de Granada
18071 – Granada, Spain
INEX 2009 Workshop, Brisbane
Our participation
Universidad de Granada at INEX 2009
This is the third year we participate in XML mining (classification).
As on previous occasions, we are interested in Bayesian networks.
We have provided a new solution to this problem.
Sorry, no AdHoc track this year.
Our participation
The problem itself
A text (XML) categorization problem with a training/test corpus (same as previous years).
Multilabel: more than one category per document (new this year!).
Links among files (training and test) given in a matrix (same as 2008).
Vectors of indexed terms (normalized tf-idf) provided. The eternal question: what about the XML structure?
Our solution (2008)
Encyclopedia regularity: a document of category Ci tends to link to documents of the same category. Graphically verified on the training set.
In 2008 we combined a flat-text classifier (Naïve Bayes) with a Bayesian network of fixed structure which modelled the interaction among categories, using learnt probabilities P(ci|cj).
Results were modest (the worst model among the three, and improvements over our baseline were not significant).
Our starting point (2009)
We detected the same regularity among categories (no matrix plot this year).
There is a possible (hidden) hierarchy (for example Portal:Religion, Portal:Christianity and Portal:Catholicism).
This year we learn the interactions among categories from the data: the structure is not fixed in advance, but can be any structure over the set of categories.
Modeling link structure I
We assume there is a global probability distribution over all these variables, and we model it with a Bayesian network.
Variables: categories Ci (39), categories of incoming links Ej (39) and terms Tk (many).
Main assumption: the content of a document and the categories of the documents that link to it are independent given the document's category. Symbolically:
p(dj, ej | ci) = p(dj | ci) p(ej | ci).
Modeling link structure II
We then derive the conditional probability p(ci | dj, ej), applying Bayes' rule and the independence assumption:

\begin{align*}
p(c_i \mid d_j, e_j) &= \frac{p(d_j, e_j \mid c_i)\, p(c_i)}{p(d_j, e_j)}
 = \frac{p(d_j \mid c_i)\, p(e_j \mid c_i)\, p(c_i)}{p(d_j, e_j)} \\
&= \frac{p(c_i \mid d_j)\, p(d_j)\, p(e_j \mid c_i)\, p(c_i)}{p(c_i)\, p(d_j, e_j)} \\
&= \frac{p(c_i \mid d_j)\, p(d_j)\, p(c_i \mid e_j)\, p(e_j)}{p(c_i)\, p(d_j, e_j)} \\
&= \frac{p(d_j)\, p(e_j)}{p(d_j, e_j)} \cdot \frac{p(c_i \mid d_j)\, p(c_i \mid e_j)}{p(c_i)}.
\end{align*}

Since the factor p(dj) p(ej) / p(dj, ej) does not depend on the category,

\[
p(c_i \mid d_j, e_j) \propto \frac{p(c_i \mid d_j)\, p(c_i \mid e_j)}{p(c_i)},
\]

and, normalizing over the two values of the binary category variable (ci and its complement $\bar{c}_i$),

\[
p(c_i \mid d_j, e_j) = \frac{p(c_i \mid d_j)\, p(c_i \mid e_j)/p(c_i)}
{p(c_i \mid d_j)\, p(c_i \mid e_j)/p(c_i) + p(\bar{c}_i \mid d_j)\, p(\bar{c}_i \mid e_j)/p(\bar{c}_i)}.
\]
Modeling link structure III
p(ci|dj): the output of a probabilistic classifier. Any probabilistic classifier will do.
p(ci|ej): the probability of belonging to Ci given the set of categories of the incoming (known) links. This is modeled by the Bayesian network.
The problem then reduces to the following procedure.
Modeling link structure IV
We have a vector of 39+39 binary variables for each document: 39 for the categories (1 if the document is of that category, 0 if not), and 39 more for the links (1 if the document is linked to by documents of that category, 0 if not).
With a learning algorithm, we learn a Bayesian network from these data.
For each document to classify and for each category Ci, we compute its content probability p(ci|dj) (with the base classifier) and the probability of belonging to Ci given the categories of certain neighbours, p(ci|ej) (with the learnt Bayesian network).
We combine them using the normalization equation above.
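As an illustration, the combination step fits in a few lines of Python. This is a sketch of the normalization equation, not code from the paper; the function name and the example numbers are hypothetical:

```python
def combine(p_c_given_d, p_c_given_e, p_c):
    """Combine the content-based posterior p(ci|dj) and the link-based
    posterior p(ci|ej) via p(ci|dj,ej) ∝ p(ci|dj) p(ci|ej) / p(ci),
    normalizing over the binary variable (category vs. its complement)."""
    num = p_c_given_d * p_c_given_e / p_c
    den = num + (1 - p_c_given_d) * (1 - p_c_given_e) / (1 - p_c)
    return num / den

# Hypothetical example: the content classifier says 0.7, the link model
# says 0.8, and the category prior is 0.1; two agreeing sources push the
# combined posterior well above either input.
print(combine(0.7, 0.8, 0.1))  # ≈ 0.988
```

Note how the low prior p(ci) in the denominator rewards agreement between the two sources: each posterior individually exceeds the prior, so their product, discounted by the prior only once, is stronger evidence than either alone.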
Learning link structure
We learn the Bayesian network using the WEKA package:
Hill-climbing algorithm (easy and fast).
BDeu metric.
Three parents maximum per node.
Propagation is done with Elvira (WEKA does not provide propagation algorithms):
We compute p(ci) (once) and p(ci|ej) (for each document j).
Exact propagation was slow!
We used an Importance Sampling algorithm (approximate) instead.
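The structure-learning step can be illustrated with a self-contained sketch, not the WEKA implementation: greedy hill-climbing that only adds arcs, scores candidates with a simplified BDeu local score for binary variables, and enforces the three-parent limit. All names are hypothetical:

```python
import math
from itertools import product

def bdeu_score(data, child, parents, ess=1.0):
    # Simplified BDeu local score for a binary child with binary parents.
    # data: list of dicts mapping variable name -> 0/1.
    q = 2 ** len(parents)          # number of parent configurations
    a_ij = ess / q                 # Dirichlet prior mass per configuration
    a_ijk = ess / (2 * q)          # ... per configuration and child state
    counts = {}
    for row in data:
        key = (tuple(row[p] for p in parents), row[child])
        counts[key] = counts.get(key, 0) + 1
    score = 0.0
    for cfg in product((0, 1), repeat=len(parents)):
        n_ij = sum(counts.get((cfg, k), 0) for k in (0, 1))
        score += math.lgamma(a_ij) - math.lgamma(a_ij + n_ij)
        for k in (0, 1):
            score += math.lgamma(a_ijk + counts.get((cfg, k), 0)) - math.lgamma(a_ijk)
    return score

def creates_cycle(parents, u, v):
    # Would adding the arc u -> v close a directed cycle (u reachable from v)?
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(w for w in parents if node in parents[w])
    return False

def hill_climb(data, variables, max_parents=3, ess=1.0):
    # Greedy search: repeatedly add the single arc with the best score gain.
    parents = {v: [] for v in variables}
    scores = {v: bdeu_score(data, v, [], ess) for v in variables}
    while True:
        best = None
        for u in variables:
            for v in variables:
                if (u == v or u in parents[v]
                        or len(parents[v]) >= max_parents
                        or creates_cycle(parents, u, v)):
                    continue
                gain = bdeu_score(data, v, parents[v] + [u], ess) - scores[v]
                if gain > 1e-9 and (best is None or gain > best[0]):
                    best = (gain, u, v)
        if best is None:
            return parents
        gain, u, v = best
        parents[v].append(u)
        scores[v] += gain
```

On data like ours this would run over the 39+39 binary category and link variables; since the BDeu score is decomposable, only the local score of the arc's target node needs recomputing at each step.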
Base classifiers
We have used Multinomial Naïve Bayes (binary) and the Bayesian OR Gate (a model presented by our group at INEX 2007).
They are extensively described in the paper (read it if you want to learn more about these two classifiers).
Any other probabilistic classifier can be used to obtain p(ci|dj) first (any suggestions or preferences?).
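For illustration, a minimal binary (one-vs-rest) multinomial Naïve Bayes returning p(ci|dj) could look as follows. This is a generic textbook sketch, not the exact classifier from the paper, and all names and example terms are hypothetical:

```python
import math

def train_mnb(docs, y, vocab_size):
    # docs: list of {term: count}; y: 1 if the document belongs to Ci, else 0.
    # Assumes both classes are present in the training data.
    prior = sum(y) / len(y)
    counts, totals = {0: {}, 1: {}}, {0: 0, 1: 0}
    for doc, label in zip(docs, y):
        for term, c in doc.items():
            counts[label][term] = counts[label].get(term, 0) + c
            totals[label] += c
    return prior, counts, totals, vocab_size

def p_c_given_d(model, doc):
    # Posterior p(ci|dj) from Laplace-smoothed per-class term probabilities.
    prior, counts, totals, vocab_size = model
    log_odds = math.log(prior / (1 - prior))
    for term, c in doc.items():
        p1 = (counts[1].get(term, 0) + 1) / (totals[1] + vocab_size)
        p0 = (counts[0].get(term, 0) + 1) / (totals[0] + vocab_size)
        log_odds += c * math.log(p1 / p0)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical toy corpus: two sports documents (in Ci) and two religion
# documents (not in Ci).
docs = [{'ball': 3}, {'ball': 2, 'game': 1}, {'church': 3}, {'faith': 2, 'church': 1}]
model = train_mnb(docs, [1, 1, 0, 0], vocab_size=4)
print(p_c_given_d(model, {'ball': 2}))  # high: clearly inside the category
```

One such model per category yields the 39 content probabilities p(ci|dj) that the combination equation consumes.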
Results

                MACC     µACC     MROC     µROC     MPRF     µPRF     MAP
N. Bayes        0.95142  0.93284  0.80260  0.81992  0.49613  0.52670  0.64097
N. Bayes + BN   0.95235  0.93386  0.80209  0.81974  0.50015  0.53029  0.64235
OR gate         0.75420  0.67806  0.92526  0.92163  0.25310  0.26268  0.72955
OR gate + BN    0.84768  0.81891  0.92810  0.92739  0.31611  0.36036  0.72508

Initial results.

Problem with the OR gate! The evaluation assumes dj ∈ Ci ⇔ p(ci|dj) > 0.5. This is not, in general, true for the OR gate, so some scaling procedure (like the SCut strategy) is needed.

                MACC     µACC     MROC     µROC     MPRF     µPRF     MAP
OR gate         0.92932  0.92612  0.92526  0.92163  0.45966  0.50407  0.72955
OR gate + BN    0.96607  0.95588  0.92810  0.92739  0.51729  0.55116  0.72508

Scaled results (see the paper for details).
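The scaling idea can be sketched as follows: per category, pick the validation-set threshold that maximizes F1 (an SCut-style choice), then assign the category when the score exceeds that threshold instead of 0.5. A hypothetical illustration, not the exact procedure from the paper:

```python
def scut_threshold(scores, y_true, grid=None):
    # SCut-style per-category tuning: choose the decision threshold that
    # maximizes F1 on a validation set (scores may be arbitrarily scaled,
    # as with the OR gate, so 0.5 is not a meaningful cutoff).
    grid = grid or [i / 100 for i in range(1, 100)]
    def f1(th):
        tp = sum(1 for s, y in zip(scores, y_true) if s >= th and y == 1)
        fp = sum(1 for s, y in zip(scores, y_true) if s >= th and y == 0)
        fn = sum(1 for s, y in zip(scores, y_true) if s < th and y == 1)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return max(grid, key=f1)

# Hypothetical validation scores from the OR gate for one category: a
# document is then assigned to the category iff its score >= th.
th = scut_threshold([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
```

Only the threshold-dependent measures (accuracy, precision/recall F) change under this rescaling; ROC and MAP depend on the ranking alone, which is why those columns are identical in the two tables.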
Conclusions
The model is new, parametrizable (learning algorithm, parameters of the algorithm, base classifier, ...) and valuable by itself (it always improves the baseline).
Using the Bayesian network over the OR gate provides around a 10% improvement in some measures.
Good results on ROC (we ranked third).
Other base classifiers? SVM with probabilistic outputs, Logistic Regression, ...
More experiments for the final version of the paper!
Thank you for your attention!
Questions, comments, criticism?
<SPAM>Expecting to defend my PhD by April 2010; searching for a PostDoc position (in Europe) for 2010 on ML/IR related topics. Any offers?</SPAM>