Language Model Information Retrieval with Document Expansion
1. A paper by Tao Tao, Xuanhui Wang, Qiaozhu Mei, ChengXiang Zhai
Presented by Kumar Ashish
INF384H/CS395T: Concepts of Information Retrieval (and Web Search), Fall 2011
2. Zero Count Problem: a term that is a plausible word for the information need does not occur in the document.
General Problem of Estimation: terms occurring only once are overestimated, since their single occurrence is partly due to chance.
To solve both problems, high-quality extra data is required to enlarge the document sample.
3. KL-divergence ranking scores a document by the negative divergence −D(ΘQ || Θd) between the unigram query language model and the unigram document language model. This gives the average logarithmic distance between the probabilities that a word would be observed at random from the two models.
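The scoring rule above can be sketched as follows, assuming simple dict-based unigram models (all names are illustrative):

```python
import math

def kl_divergence(query_model, doc_model):
    """D(Theta_Q || Theta_d): average log distance between the query and
    document unigram models. doc_model must give non-zero mass to every
    query word, which is why smoothing (next slides) is essential."""
    return sum(p_q * math.log(p_q / doc_model[w])
               for w, p_q in query_model.items() if p_q > 0)

def score(query_model, doc_model):
    """Rank documents by negative KL-divergence: higher = better match."""
    return -kl_divergence(query_model, doc_model)
```

A document whose word distribution is closer to the query model gets a higher score.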
4. Maximum-likelihood estimation: p(w|Θd) = c(w, d) / |d|, where c(w, d) is the number of times word w occurs in document d and |d| is the length of the document.
Problem:
•Assigns zero probability to any word not present in the document, causing a problem when scoring a document with KL-Divergence.
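A minimal sketch of the maximum-likelihood estimate and its zero-probability problem (the example tokens are made up):

```python
from collections import Counter

def mle_model(doc_tokens):
    """Maximum-likelihood unigram model: p(w|Theta_d) = c(w, d) / |d|."""
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    return {w: c / length for w, c in counts.items()}

doc = ["the", "cat", "sat", "the", "mat"]
model = mle_model(doc)
# "the" gets 2/5; an unseen word such as "dog" gets probability zero,
# which makes log p(w|Theta_d) undefined in KL-divergence scoring.
```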
6. Jelinek-Mercer (JM) Smoothing proposes a fixed parameter λ to control the interpolation:
p(w|Θd) = (1 − λ)·c(w, d)/|d| + λ·p(w|Θc),
where p(w|Θc) is the probability of word w given by the collection model Θc.
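Jelinek-Mercer interpolation can be sketched as a one-line function (parameter names are illustrative):

```python
def jm_smooth(c_wd, doc_len, p_wc, lam=0.5):
    """Jelinek-Mercer: p(w|Theta_d) = (1 - lam)*c(w,d)/|d| + lam*p(w|Theta_c).
    Every word with non-zero collection probability now gets non-zero mass."""
    return (1 - lam) * (c_wd / doc_len) + lam * p_wc
```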
7. Dirichlet Smoothing uses a document-dependent coefficient (parameterized by μ) to control the interpolation:
p(w|Θd) = (c(w, d) + μ·p(w|Θc)) / (|d| + μ).
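Dirichlet smoothing is an interpolation with an effective λ = μ / (|d| + μ) that depends on document length, so shorter documents are smoothed more heavily. A minimal sketch:

```python
def dirichlet_smooth(c_wd, doc_len, p_wc, mu=2000):
    """Dirichlet prior smoothing:
    p(w|Theta_d) = (c(w,d) + mu*p(w|Theta_c)) / (|d| + mu)."""
    return (c_wd + mu * p_wc) / (doc_len + mu)
```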
8. Cluster-based smoothing uses clustering information to smooth a document.
Divides all documents into K clusters.
First smooths each cluster model with the collection model using Dirichlet smoothing.
Then takes the smoothed cluster model as a new reference model to smooth the document using JM smoothing.
9. ΘLd stands for document d's cluster model, and λ and β are the smoothing parameters.
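The two-stage scheme from the previous slide can be sketched as one function (a sketch with illustrative parameter names, assuming raw counts for the document and its cluster are available):

```python
def cluster_smooth(c_wd, doc_len, c_wL, cluster_len, p_wc, lam=0.5, beta=2000):
    """Cluster-based document model, in two stages:
    1) Dirichlet-smooth the cluster model Theta_Ld with the collection:
         p(w|Theta_Ld) = (c(w, L_d) + beta*p(w|Theta_c)) / (|L_d| + beta)
    2) JM-interpolate the document MLE with the smoothed cluster model:
         p(w|Theta_d) = (1 - lam)*c(w, d)/|d| + lam*p(w|Theta_Ld)
    """
    p_w_cluster = (c_wL + beta * p_wc) / (cluster_len + beta)
    return (1 - lam) * (c_wd / doc_len) + lam * p_w_cluster
```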
10. Better than JM or Dirichlet smoothing alone: it expands a document with more data from its cluster instead of using the same collection language model for every document.
11. A cluster D may be good for smoothing document a but not for document d.
Ideally, each document should have its own cluster centered around itself.
12. Expand each document using a probabilistic neighborhood to estimate a virtual document d'.
Apply any interpolation-based smoothing method (e.g., JM or Dirichlet) to this virtual document, treating the word counts given by the virtual document as if they were the original word counts.
13. Cosine similarity can be used to determine the documents in the neighborhood of the original document.
Problems:
◦ A narrowly defined neighborhood would contain only a few documents, whereas a widely defined one may include the whole collection.
◦ Neighbor documents cannot be treated as if they were sampled from the same model as the original document.
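Neighborhood selection by cosine similarity can be sketched as follows, assuming documents are sparse term-frequency dicts (function names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def neighbors(doc, collection, m=100):
    """Indices of the top-M closest documents to `doc` (excluding itself)."""
    sims = [(cosine(doc, other), i) for i, other in enumerate(collection)
            if other is not doc]
    return [i for _, i in sorted(sims, reverse=True)[:m]]
```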
14. Associates a Confidence Value with every
document in the collection
◦ This Confidence Value reflects the belief that the
document is sampled from the same underlying
model as the original one.
15. A confidence value (γd) is associated with every document to indicate how strongly it is believed to be sampled from d's document model.
The confidence value is assumed to follow a normal distribution, decaying as a document's distance from d grows.
16. Shorter documents require more help from their neighbors, while longer documents can rely more on themselves.
A parameter α is introduced to control this balance.
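Slides 14-16 can be combined into one sketch of the pseudo-count construction for the virtual document. This is a sketch under stated assumptions: the Gaussian-shaped kernel, its width `sigma`, and the normalization are illustrative choices, not the paper's exact formulation.

```python
import math

def confidence(sim, sigma=0.3):
    """Gaussian-shaped confidence gamma_d(b), decaying with cosine distance
    (1 - sim): near-duplicates of d count strongly, far documents barely.
    (Illustrative kernel; the paper's exact form may differ.)"""
    return math.exp(-((1.0 - sim) ** 2) / (2 * sigma ** 2))

def pseudo_counts(doc_counts, neighbor_counts, sims, alpha=0.5):
    """Virtual-document counts:
    c(w, d') = alpha*c(w, d) + (1 - alpha)*sum_b gamma_d(b)*c(w, b) / Z,
    where Z normalizes the confidences. Smaller alpha gives neighbors more
    weight, which is what short documents need."""
    gammas = [confidence(s) for s in sims]
    z = sum(gammas) or 1.0
    words = set(doc_counts)
    for nb in neighbor_counts:
        words |= set(nb)
    expanded = {}
    for w in words:
        nb_mass = sum(g * nb.get(w, 0) for g, nb in zip(gammas, neighbor_counts))
        expanded[w] = alpha * doc_counts.get(w, 0) + (1 - alpha) * nb_mass / z
    return expanded
```

Any interpolation-based smoothing (JM or Dirichlet) can then be applied to these expanded counts as if they were the original ones.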
17. For efficiency, the pseudo term counts can be calculated using only the top M closest neighbors (since the confidence value follows a decaying shape).
18. For performance comparison:
◦ Uses four TREC data sets:
AP (Associated Press news, 1988-90)
LA (LA Times)
WSJ (Wall Street Journal, 1987-92)
SJMN (San Jose Mercury News, 1991)
For testing how the algorithm scales up:
◦ Uses TREC8
For testing the effect on short documents:
◦ Uses DOE (Department of Energy)
19. Comparison of DELM + (Dirichlet/JM) with Dirichlet/JM alone.
λ for JM and μ for Dirichlet are set to their optimal values, and the same values of λ or μ are used for DELM without further tuning; M is 100 and α is 0.5 for DELM.
DELM outperforms JM and Dirichlet on every data set, with improvements of as much as 15% in the case of Associated Press News (AP).
20. Compares precision values at different levels of recall on the AP data set.
DELM + Dirichlet outperforms Dirichlet alone at every precision point.
Precision-Recall curve on AP data
21. Compares the performance trend with respect to M (the top M closest neighbors for each document).
Performance change with respect to M
Conclusions:
Neighborhood information improves retrieval accuracy.
Performance becomes insensitive to M once M is sufficiently large.
23. Documents in AP88-89 were shrunk to 30% of their original length in the first experiment, 50% in the second, and 70% in the third.
Results show that DELM helps shorter documents more than longer ones (a 41% improvement on the 30%-length corpus versus 16% on the full-length corpus).
24. Performance change with respect to α
The optimal point migrates as documents become shorter (the full-length corpus is optimal at α = 0.4, but the 30% corpus has to use α = 0.2).
25. Combination of DELM with pseudo feedback.
DELM is combined with the model-based feedback proposed in (Zhai and Lafferty, 2001a).
Experiment performed by:
Retrieving documents with the DELM method
Choosing the top five documents for model-based feedback
Using the expanded query model to retrieve documents again
Result: DELM can be combined with pseudo feedback to further improve performance.