Language Model Information Retrieval with Document Expansion
1. A paper by Tao Tao, Xuanhui Wang, Qiaozhu Mei, ChengXiang Zhai
Presented by Kumar Ashish
INF384H/CS395T: Concepts of Information Retrieval (and Web Search), Fall 2011
2. Zero Count Problem: a term that is a plausible word for the information need does not occur in the document.
General Problem of Estimation: terms occurring only once are overestimated, since their single occurrence is partly due to chance.
To solve both problems, high-quality extra data is required to enlarge the document sample.
3. KL-divergence ranking scores a document by the negative divergence −D(ΘQ || Θd) between the unigram query language model and the unigram document language model. This gives the average logarithmic distance between the probabilities that a word would be observed at random from the two models.
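The scoring rule above can be sketched as follows, assuming simple dict-based unigram models (all names are illustrative):

```python
import math

def kl_divergence(query_model, doc_model):
    """D(Theta_Q || Theta_d): average log distance between the query and
    document unigram models. doc_model must give non-zero mass to every
    query word, which is why smoothing (next slides) is essential."""
    return sum(p_q * math.log(p_q / doc_model[w])
               for w, p_q in query_model.items() if p_q > 0)

def score(query_model, doc_model):
    """Rank documents by negative KL-divergence: higher = better match."""
    return -kl_divergence(query_model, doc_model)
```

A document whose word distribution is closer to the query model gets a higher score.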
4. Maximum-likelihood estimation: p(w|Θd) = c(w, d) / |d|, where c(w, d) is the number of times word w occurs in document d and |d| is the length of the document.
Problem:
•Assigns zero probability to any word not present in the document, causing a problem when scoring a document with KL-Divergence.
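A minimal sketch of the maximum-likelihood estimate and its zero-probability problem (the example tokens are made up):

```python
from collections import Counter

def mle_model(doc_tokens):
    """Maximum-likelihood unigram model: p(w|Theta_d) = c(w, d) / |d|."""
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    return {w: c / length for w, c in counts.items()}

doc = ["the", "cat", "sat", "the", "mat"]
model = mle_model(doc)
# "the" gets 2/5; an unseen word such as "dog" gets probability zero,
# which makes log p(w|Theta_d) undefined in KL-divergence scoring.
```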
6. Jelinek-Mercer (JM) Smoothing proposes a fixed parameter λ to control the interpolation:
p(w|Θd) = (1 − λ)·c(w, d)/|d| + λ·p(w|Θc),
where p(w|Θc) is the probability of word w given by the collection model Θc.
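Jelinek-Mercer interpolation can be sketched as a one-line function (parameter names are illustrative):

```python
def jm_smooth(c_wd, doc_len, p_wc, lam=0.5):
    """Jelinek-Mercer: p(w|Theta_d) = (1 - lam)*c(w,d)/|d| + lam*p(w|Theta_c).
    Every word with non-zero collection probability now gets non-zero mass."""
    return (1 - lam) * (c_wd / doc_len) + lam * p_wc
```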
7. Dirichlet Smoothing uses a document-dependent coefficient (parameterized by μ) to control the interpolation:
p(w|Θd) = (c(w, d) + μ·p(w|Θc)) / (|d| + μ).
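Dirichlet smoothing is an interpolation with an effective λ = μ / (|d| + μ) that depends on document length, so shorter documents are smoothed more heavily. A minimal sketch:

```python
def dirichlet_smooth(c_wd, doc_len, p_wc, mu=2000):
    """Dirichlet prior smoothing:
    p(w|Theta_d) = (c(w,d) + mu*p(w|Theta_c)) / (|d| + mu)."""
    return (c_wd + mu * p_wc) / (doc_len + mu)
```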
8. Cluster-based smoothing uses clustering information to smooth a document.
Divides all documents into K clusters.
First smooths each cluster model with the collection model using Dirichlet smoothing.
Then takes the smoothed cluster model as a new reference model to smooth the document using JM smoothing.
9. ΘLd stands for document d's cluster model, and λ and β are the smoothing parameters.
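The two-stage scheme from the previous slide can be sketched as one function (a sketch with illustrative parameter names, assuming raw counts for the document and its cluster are available):

```python
def cluster_smooth(c_wd, doc_len, c_wL, cluster_len, p_wc, lam=0.5, beta=2000):
    """Cluster-based document model, in two stages:
    1) Dirichlet-smooth the cluster model Theta_Ld with the collection:
         p(w|Theta_Ld) = (c(w, L_d) + beta*p(w|Theta_c)) / (|L_d| + beta)
    2) JM-interpolate the document MLE with the smoothed cluster model:
         p(w|Theta_d) = (1 - lam)*c(w, d)/|d| + lam*p(w|Theta_Ld)
    """
    p_w_cluster = (c_wL + beta * p_wc) / (cluster_len + beta)
    return (1 - lam) * (c_wd / doc_len) + lam * p_w_cluster
```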
10. Better than JM or Dirichlet smoothing alone: it expands a document with more data from its cluster instead of using the same collection language model for every document.
11. A cluster D may be good for smoothing document a but not for document d.
Ideally, each document should have its own cluster centered around itself.
12. Expand each document using a probabilistic neighborhood to estimate a virtual document d'.
Apply any interpolation-based smoothing method (e.g., JM or Dirichlet) to this virtual document, treating the word counts given by the virtual document as if they were the original word counts.
13. Cosine similarity can be used to determine the documents in the neighborhood of the original document.
Problems:
◦ A narrowly defined neighborhood would contain only a few documents, whereas a widely defined one may include the whole collection.
◦ Neighbor documents cannot be treated as if they were sampled from the same model as the original document.
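Neighborhood selection by cosine similarity can be sketched as follows, assuming documents are sparse term-frequency dicts (function names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def neighbors(doc, collection, m=100):
    """Indices of the top-M closest documents to `doc` (excluding itself)."""
    sims = [(cosine(doc, other), i) for i, other in enumerate(collection)
            if other is not doc]
    return [i for _, i in sorted(sims, reverse=True)[:m]]
```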
14. Associates a Confidence Value with every
document in the collection
◦ This Confidence Value reflects the belief that the
document is sampled from the same underlying
model as the original one.
15. A confidence value (γd) is associated with every document to indicate how strongly it is believed to be sampled from d's document model.
The confidence value is assumed to follow a normal distribution, decaying as a document's distance from d grows.
16. Shorter documents require more help from their neighbors, while longer documents can rely more on themselves.
A parameter α is introduced to control this balance.
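Slides 14-16 can be combined into one sketch of the pseudo-count construction for the virtual document. This is a sketch under stated assumptions: the Gaussian-shaped kernel, its width `sigma`, and the normalization are illustrative choices, not the paper's exact formulation.

```python
import math

def confidence(sim, sigma=0.3):
    """Gaussian-shaped confidence gamma_d(b), decaying with cosine distance
    (1 - sim): near-duplicates of d count strongly, far documents barely.
    (Illustrative kernel; the paper's exact form may differ.)"""
    return math.exp(-((1.0 - sim) ** 2) / (2 * sigma ** 2))

def pseudo_counts(doc_counts, neighbor_counts, sims, alpha=0.5):
    """Virtual-document counts:
    c(w, d') = alpha*c(w, d) + (1 - alpha)*sum_b gamma_d(b)*c(w, b) / Z,
    where Z normalizes the confidences. Smaller alpha gives neighbors more
    weight, which is what short documents need."""
    gammas = [confidence(s) for s in sims]
    z = sum(gammas) or 1.0
    words = set(doc_counts)
    for nb in neighbor_counts:
        words |= set(nb)
    expanded = {}
    for w in words:
        nb_mass = sum(g * nb.get(w, 0) for g, nb in zip(gammas, neighbor_counts))
        expanded[w] = alpha * doc_counts.get(w, 0) + (1 - alpha) * nb_mass / z
    return expanded
```

Any interpolation-based smoothing (JM or Dirichlet) can then be applied to these expanded counts as if they were the original ones.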
17. For efficiency, the pseudo term counts can be calculated using only the top M closest neighbors (since the confidence value follows a decaying shape).
18. For performance comparison:
◦ Uses four TREC data sets:
AP (Associated Press news, 1988-90)
LA (LA Times)
WSJ (Wall Street Journal, 1987-92)
SJMN (San Jose Mercury News, 1991)
For testing how the algorithm scales up:
◦ Uses TREC8
For testing the effect on short documents:
◦ Uses DOE (Department of Energy)
19. Comparison of DELM + (Dirichlet/JM) with Dirichlet/JM alone.
λ for JM and μ for Dirichlet are set to their optimal values, and the same values of λ or μ are used for DELM without further tuning; M is 100 and α is 0.5 for DELM.
DELM outperforms JM and Dirichlet on every data set, with improvements of as much as 15% in the case of Associated Press News (AP).
20. Compares precision values at different levels of recall on the AP data set.
DELM + Dirichlet outperforms Dirichlet alone at every precision point.
Precision-Recall curve on AP data
21. Compares the performance trend with respect to M (the top M closest neighbors for each document).
Performance change with respect to M
Conclusions:
Neighborhood information improves retrieval accuracy.
Performance becomes insensitive to M once M is sufficiently large.
23. Documents in AP88-89 were shrunk to 30% of their original length in the first experiment, 50% in the second, and 70% in the third.
Results show that DELM helps shorter documents more than longer ones (a 41% improvement on the 30%-length corpus versus 16% on the full-length corpus).
24. Performance change with respect to α
The optimal point migrates as documents become shorter (the full-length corpus is optimal at α = 0.4, but the 30% corpus has to use α = 0.2).
25. Combination of DELM with pseudo feedback.
DELM is combined with the model-based feedback proposed in (Zhai and Lafferty, 2001a).
Experiment performed by:
Retrieving documents with the DELM method
Choosing the top five documents for model-based feedback
Using the expanded query model to retrieve documents again
Result: DELM can be combined with pseudo feedback to further improve performance.