2. OUTLINE
• INTRODUCTION
• MODEL OF QUERY GENERATION
• PREVIOUS WORK USING 2-POISSON MODEL
• STATISTICAL TRANSLATION
• MODELS OF DOCUMENT-QUERY TRANSLATION
• WORD-FOR-WORD TRANSLATION
• EXPERIMENTAL RESULTS
• CRITIQUE
3. INTRODUCTION
• Information Retrieval (IR): Obtaining information resources relevant to an information need from a
collection of information resources (documents).
• Predicting relevance is the central goal of IR.
• A new probabilistic approach to IR based upon the ideas and methods of statistical machine translation.
• Model: a medium between the data and our understanding of it.
• Ultimately, document retrieval systems must be sophisticated enough to handle polysemy and
synonymy.
4. INTRODUCTION (…cont.)
SOME BASIC TERMINOLOGIES
PRECISION is the fraction of the documents retrieved that are relevant to the user's information
need.
RECALL is the fraction of the documents that are relevant to the query that are successfully
retrieved.
There is an inverse relationship between precision and recall: tuning a system to retrieve more of the
relevant documents (higher recall) typically also retrieves more irrelevant ones (lower precision).
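The two measures above can be computed directly from the retrieved and relevant document sets; a minimal Python sketch (the function name and document ids are illustrative, not from the paper):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a single query, computed from sets of ids."""
    hits = len(retrieved & relevant)          # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 4 retrieved documents are relevant; 2 of the 3 relevant were found
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5})
```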
5. MODEL OF QUERY GENERATION
• The user ‘U’ has an information need ‘I’.
• From this need, the user generates an ideal document ‘d’.
• Ideal Document: a perfect fit for the user’s need, but almost certainly not present in the retrieval
system’s collection of documents.
• The user selects a set of key terms from ‘d’ and generates a query ‘q’ from this set.
In this setting, the task of a retrieval system is to find those documents most similar to ‘d’.
7. The Retrieval System’s task
To find the most likely documents given the query; that is, those ‘d’ for which p(d | q, U) is
highest. By Bayes’ law:

p(d | q, U) = p(q | d, U) p(d | U) / p(q | U)

Since the denominator p(q | U) is fixed for a given query and user, we can ignore it for the purpose of
ranking documents, and define the relevance of a document to a query as

r(d; q) = p(q | d, U) p(d | U)
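The Bayes-rule ranking just described can be sketched as follows, with a toy query-likelihood and a uniform document prior standing in for p(q | d, U) and p(d | U) (all names and data are illustrative):

```python
docs = {
    "d1": "statistical translation of text",
    "d2": "poisson distribution of terms",
}

def query_likelihood(query, doc_id):
    # toy stand-in for p(q | d, U): fraction of query words found in the document
    words = docs[doc_id].split()
    q_words = query.split()
    return sum(w in words for w in q_words) / len(q_words)

def prior(doc_id):
    # uniform document prior p(d | U)
    return 1.0 / len(docs)

def rank_documents(query, doc_ids):
    # score by p(q | d, U) * p(d | U); the constant p(q | U) is dropped
    return sorted(doc_ids,
                  key=lambda d: prior(d) * query_likelihood(query, d),
                  reverse=True)

ranking = rank_documents("statistical translation", ["d1", "d2"])
```

Because p(q | U) is the same for every document, dropping it leaves the ordering unchanged.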
8. 2-POISSON MODEL (PREVIOUS WORK)
The 2-Poisson model is a mixture, that is a linear combination, of two Poisson distributions:

p(X = k) = π · e^(−λ1) λ1^k / k! + (1 − π) · e^(−λ2) λ2^k / k!

where π is the probability that a document belongs to Et – the elite set of the term t, the few
documents in which t occurs more densely and non-randomly (with the higher rate λ1).
In the context of IR, the 2-Poisson model is used to model the probability distribution of the frequency X
of a term in a collection of documents.
The effectiveness of the 2-Poisson model for document retrieval was never tested, for two reasons. The
first is that learning the three parameters (π, λ1, λ2) for each term using the Expectation Maximization
(EM) algorithm is expensive, and large collections in general contain millions of terms. The second is
that the model does not take document size into account, so it would have to be extended to
normalize for different document lengths.
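The mixture above can be illustrated concretely (the parameter values below are made up for illustration, not taken from the paper):

```python
import math

def two_poisson_pmf(k, pi, lam_elite, lam_other):
    """P(X = k) as a mixture of two Poissons: with probability pi the
    document is in the elite set (rate lam_elite), else rate lam_other."""
    def poisson(k, lam):
        return math.exp(-lam) * lam ** k / math.factorial(k)
    return pi * poisson(k, lam_elite) + (1 - pi) * poisson(k, lam_other)

# the mixture is a proper distribution: its mass sums to ~1 over k
total = sum(two_poisson_pmf(k, 0.2, 5.0, 0.1) for k in range(50))
```

High term frequencies draw almost all their probability mass from the elite component, which is how the model captures "dense, non-random" occurrence.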
9. STATISTICAL MACHINE TRANSLATION
Automatic translation by computer was first contemplated by Warren Weaver when modern
computers were in their infancy.
The central problem of statistical MT is to build a system that automatically learns how to
translate text, by inspecting a large set of sentences in one language along with their
translations into another language.
Let the translation probability of each English word ‘e’ translating to each French word ‘f’ be
denoted t( f | e).
10. STATISTICAL MT (..cont.)
The probability that an English sentence e = {e1, e2, …, el} translates to a French sentence f =
{f1, f2, …, fm} is calculated as

p(f | e) = γ · Π_{j=1..m} Σ_{k=1..l} t(fj | ek)

where γ (gamma) is a normalizing factor. The hidden variable in this model is the alignment a
between the French and English words: aj = k means that the kth English word translates to the
jth French word; the inner sum over k marginalizes over all possible alignments.
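A minimal sketch of this likelihood, with a toy translation table whose probabilities are invented for illustration (summing over the English words realizes the marginalization over alignments):

```python
def model1_likelihood(f_words, e_words, t, gamma=1.0):
    """p(f | e) = gamma * prod_j sum_k t(f_j | e_k); the sum over English
    words e_k marginalizes out the hidden alignment variable."""
    p = gamma
    for f in f_words:
        p *= sum(t.get((f, e), 0.0) for e in e_words)
    return p

# toy translation table t(f | e); the values are made up for illustration
t = {("la", "the"): 0.8, ("maison", "house"): 0.9, ("maison", "the"): 0.05}
p = model1_likelihood(["la", "maison"], ["the", "house"], t)
```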
11. MODEL OF DOCUMENT-QUERY TRANSLATION
First, a word ‘w’ is chosen at random from the document d according to a distribution l( w | d)
that we call the document language model.
Next, ‘w’ is translated into the word or phrase ‘q’ according to a translation model with
parameters t( q | w).
Thus, the probability of choosing q as a representative of the document d is

p(q | d) = Σ_w t(q | w) l(w | d)

Now assume the sample size model φ( n | d) is the Poisson distribution with mean λ(d):

φ(n | d) = e^(−λ(d)) λ(d)^n / n!
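The two distributions just defined can be sketched directly; the document language model and translation table below are toy values, not fitted parameters:

```python
import math

def term_translation_prob(q, lm, t):
    """p(q | d) = sum_w t(q | w) * l(w | d); lm maps w -> l(w | d)."""
    return sum(t.get((q, w), 0.0) * p_w for w, p_w in lm.items())

def sample_size_prob(n, lam):
    """Poisson sample-size model phi(n | d) with mean lam = lambda(d)."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

# toy document language model and translation table (illustrative values)
lm = {"car": 0.5, "auto": 0.5}
t = {("automobile", "car"): 0.4, ("automobile", "auto"): 0.6}
p = term_translation_prob("automobile", lm, t)
```

Note how "automobile" gets probability mass even if it never occurs in the document, via translation from "car" and "auto": this is how the model addresses synonymy.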
12. MODEL OF DOCUMENT-QUERY TRANSLATION (…cont.)
Under the assumption that the number of samples ‘n’ is Poisson distributed, the
probability that a particular query q = q1, q2, …, qm is generated is given by

p(q | d) = φ(m | d) · Π_{i=1..m} Σ_w t(qi | w) l(w | d)

This is Model 1 of document-query translation, inspired by the IBM statistical
translation models.
To fit the translation probabilities of Model 1, the Expectation Maximization (EM) algorithm is used.
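The EM fit can be sketched in the style of IBM Model 1 training; this is an illustrative re-implementation on toy (document, query) word pairs, not the paper's actual training procedure, and the document language model is taken as uniform for simplicity:

```python
from collections import defaultdict

def em_model1(pairs, iterations=20):
    """Fit translation probabilities t(q | w) by EM from a list of
    (document_words, query_words) pairs (IBM Model 1 style sketch)."""
    # initialize t(q | w) uniformly over all co-occurring (q, w) pairs
    t = {(q, w): 1.0 for d, qs in pairs for q in qs for w in d}
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts of (q, w)
        total = defaultdict(float)   # expected counts of w
        for d, qs in pairs:
            for q in qs:
                z = sum(t[(q, w)] for w in d)       # E-step normalizer
                for w in d:
                    c = t[(q, w)] / z               # responsibility of w for q
                    count[(q, w)] += c
                    total[w] += c
        for q, w in t:
            t[(q, w)] = count[(q, w)] / total[w]    # M-step re-estimate
    return t

pairs = [(["the", "house"], ["la", "maison"]),
         (["the", "book"], ["le", "livre"]),
         (["a", "house"], ["une", "maison"])]
t = em_model1(pairs)
```

Even from three pairs, co-occurrence statistics pull t(maison | house) above t(maison | the); the cost the critique mentions comes from repeating the E-step over every term in every pair on each iteration.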
13. MODEL 0 – THE SIMPLEST CASE: WORD-FOR-WORD TRANSLATION
The simplest version of Model 1, which we distinguish as Model 0, is one where each
word ‘w’ can be translated only as itself; that is, the translation probabilities are “diagonal”:

t(q | w) = 1 if q = w, and 0 otherwise

Under this model, the query terms are chosen simply according to their frequency of occurrence
in the document.
The probability of the query in this case is simply

p(q | d) = φ(m | d) · Π_{i=1..m} l(qi | d)
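Under Model 0 the per-term factor reduces to the term's relative frequency in the document; a minimal sketch (ignoring the Poisson sample-size factor and any smoothing):

```python
def model0_score(query_words, doc_words):
    """Model 0: t(q | w) is diagonal, so each query term's probability is
    just its relative frequency l(q | d) in the document."""
    n = len(doc_words)
    score = 1.0
    for q in query_words:
        score *= doc_words.count(q) / n
    return score

s = model0_score(["house"], ["the", "house", "by", "the", "river"])
```

A query term absent from the document zeroes the whole score, which is why practical language-model retrieval smooths l(q | d); Model 1's translation step is another way to avoid such zeros.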
14. EXPERIMENTAL RESULTS
Precision-Recall plots (the figure itself is not reproduced here). The left plot compares Model 1 to
Model 0 on the SDR data. The right plot compares the same language model scored according to
Model 0, demonstrating that the approximations used are very good.
15. CRITIQUE
The 2-Poisson model was never tested, one reason being that learning the three
parameters for each term is expensive: the Expectation Maximization algorithm
needs several iterations to converge, and this must be repeated for every term.
According to this paper, the EM algorithm is also used to fit the translation probabilities of Model 1,
so this is likewise an expensive operation. The efficiency of EM in Model 1 is not discussed well
and should be elaborated further.
16. REFERENCES
[1] Adam Berger and John Lafferty, “Information Retrieval as Statistical Translation”, 1999.
[2] Giambattista Amati, “Two Poisson model”, Fondazione Ugo Bordoni.
[3] Robert Barbey, “Information Retrieval as Statistical Translation”.
[4] Wikipedia article on “Information Retrieval”.