2. Instructors
Dr. David Carmel
Research Staff Member at the Information Retrieval group at IBM Haifa
Research Lab
Ph.D. in Computer Science from the Technion, Israel, 1997
Research Interests: search in the enterprise, query performance
prediction, social search, and text mining
carmel@il.ibm.com,
https://researcher.ibm.com/researcher/view.php?person=il-CARMEL
Dr. Oren Kurland
Senior lecturer at the Technion --- Israel Institute of Technology
Ph.D. in Computer Science from Cornell University, 2006
Research Interests: information retrieval
kurland@ie.technion.ac.il,
http://iew3.technion.ac.il/~kurland
3. About This Tutorial
This tutorial is based in part on the book Estimating the Query Difficulty for Information Retrieval, Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan & Claypool Publishers
The tutorial presents the opinions of the
presenters only, and does not
necessarily reflect the views of IBM or
the Technion
Algorithms, techniques, features, etc.
mentioned here might or might not be in
use by IBM
4. Query Difficulty Estimation – Main Challenge
Estimating the query difficulty is an attempt to quantify the quality of search results
retrieved for a query from a given collection of documents, when no relevance
feedback is given
Even for systems that succeed
very well on average, the quality
of results returned for some of
the queries is poor
Understanding why some queries
are inherently more difficult than
others is essential for IR
A good answer to this question
will help search engines to
reduce the variance in
performance
5. Estimating the Query Difficulty – Some Benefits
Feedback to users:
The IR system can provide the users with an estimate of the expected quality of the
results retrieved for their queries.
Users can then rephrase “difficult” queries or resubmit a “difficult” query to alternative search resources.
Feedback to the search engine:
The IR system can invoke alternative retrieval strategies for different queries
according to their estimated difficulty.
For example, intensive query analysis procedures may be invoked selectively for difficult
queries only
Feedback to the system administrator:
For example, administrators can identify missing content queries
Then expand the collection of documents to better answer these queries.
For IR applications:
For example, a federated (distributed) search application
Merging the results of queries employed distributively over different datasets
Weighing the results returned from each dataset by the predicted difficulty
6. Contents
Introduction – The Robustness Problem of (ad hoc) Information Retrieval
Basic Concepts
Query Performance Prediction Methods
Pre-Retrieval Prediction Methods
Post-Retrieval Prediction Methods
Combining Predictors
A Unified Framework for Post-Retrieval Query-Performance
Prediction
A General Model for Query Difficulty
A Probabilistic Framework for Query-Performance Prediction
Applications of Query Difficulty Estimation
Summary
Open Challenges
8. The Robustness Problem of IR
Most IR systems suffer from a radical variance in retrieval
performance when responding to users’ queries
Even for systems that succeed very well on average, the quality of results
returned for some of the queries is poor
This may lead to user dissatisfaction
Variability in performance relates to various factors:
The query itself (e.g., term ambiguity “Golf”)
The vocabulary mismatch problem - the discrepancy between the query
vocabulary and the document vocabulary
Missing content queries - there is no relevant information in the corpus
that can satisfy the information needs.
9. An Example of a Difficult Query: “The Hubble Telescope achievements”
Hubble Telescope Achievements:
Great eye sets sights sky high
Simple test would have found flaw in Hubble telescope
Nation in brief
Hubble space telescope placed aboard shuttle
Cause of Hubble telescope defect reportedly found
Flaw in Hubble telescope
Flawed mirror hampers Hubble space telescope
Touchy telescope torments controllers
NASA scrubs launch of discovery
Hubble builders got award fees, magazine says
Retrieved results deal with issues related to the Hubble telescope project in
general; but the gist of that query, achievements, is lost
10. The Variance in Performance across Queries and Systems (TREC-7)
Queries are sorted in decreasing order according to average precision attained
among all TREC participants (green bars).
The performance of two different systems per query is shown by the two curves.
11. The Reliable Information Access (RIA) Workshop
The first attempt to rigorously investigate the reasons for performance variability
across queries and systems.
Extensive failure analysis of the results of
6 IR systems
45 TREC topics
Main reason for failures – the systems' inability to identify all important aspects of the query
The failure to emphasize one aspect of a query over another, or emphasize one aspect
and neglect other aspects
“What disasters have occurred in tunnels used for transportation?”
Emphasizing only one of these terms will deteriorate performance because each
term on its own does not fully reflect the information need
If systems could estimate what failure categories the query may belong to
Systems could apply specific automated techniques that correspond to the failure
mode in order to improve performance
13. Instability in Retrieval – The TREC’s Robust Tracks
The diversity in performance among topics and systems led to the
TREC Robust tracks (2003 – 2005)
Encouraging systems to decrease variance in query performance by
focusing on poorly performing topics
Systems were challenged with 50 old TREC topics found to be
“difficult” for most systems over the years
A topic is considered difficult when the median of the average precision
scores of all participants for that topic is below a given threshold
A new measure, GMAP, uses the geometric mean instead of the
arithmetic mean when averaging precision values over topics
Emphasizes the lowest-performing topics, and is thus a useful measure that can attest to the robustness of a system's performance (see the sketch below)
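To make the GMAP/MAP contrast concrete, here is a minimal sketch (not from the tutorial) that computes both from per-topic AP values; clamping zero AP values to a small epsilon is an assumption, since the geometric mean is undefined at zero:

    import math

    def map_score(ap_values):
        """Arithmetic mean of per-topic average precision (MAP)."""
        return sum(ap_values) / len(ap_values)

    def gmap_score(ap_values, eps=1e-5):
        """Geometric mean of per-topic AP (GMAP); zero values are clamped to eps."""
        return math.exp(sum(math.log(max(ap, eps)) for ap in ap_values) / len(ap_values))

    aps = [0.45, 0.38, 0.02, 0.51]          # hypothetical per-topic AP values
    print(map_score(aps), gmap_score(aps))  # GMAP is pulled down by the 0.02 topic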
14. The Robust Tracks – Decreasing Variability across Topics
Several approaches to improving the poor effectiveness for some
topics were tested
Selective query processing strategy based on performance prediction
Post-retrieval reordering
Selective weighting functions
Selective query expansion
None of these approaches was able to show consistent
improvement over traditional non-selective approaches
Apparently, expanding the query by appropriate terms extracted
from an external collection (the Web) improves the effectiveness
for many queries, including poorly performing queries
15. The Robust Tracks – Query Performance Prediction
As a second challenge, systems were asked to predict their performance for each of the test topics
Then the TREC topics were ranked
First, by their predicted performance value
Second, by their actual performance value
Evaluation was done by measuring the similarity between the predicted
performance-based ranking and the actual performance-based ranking
Most systems failed to exhibit reasonable prediction capability
14 runs had a negative correlation between the predicted and actual topic
rankings, demonstrating that measuring performance prediction is
intrinsically difficult
On the positive side, the difficulty in developing reliable prediction
methods raised the awareness of the IR community to this challenge
16. How Difficult Is the Performance Prediction Task for Human Experts?
TREC-6 experiment: Estimating whether human experts can predict the query
difficulty
A group of experts was asked to classify a set of TREC topics into three degrees of difficulty (easy, middle, hard) based on the query expression only
The manual judgments were compared to the median of the average precision
scores, as determined after evaluating the performance of all participating
systems
Results
The Pearson correlation between the expert judgments and the “true” values was
very low (0.26).
The agreement between experts, as measured by the correlation between their
judgments, was very low too (0.39)
The low correlation illustrates how difficult this task is and how little is known
about what makes a query difficult
17. How Difficult Is the Performance Prediction Task for Human Experts? (contd.)
Hauff et al. (CIKM 2010) found
a low level of agreement between humans with regard to which queries are
more difficult than others (median kappa = 0.36)
there was high variance in the ability of humans to estimate query difficulty although
they shared “similar” backgrounds
a low correlation between true performance and humans’ estimates of query-
difficulty (performed in a pre-retrieval fashion)
median Kendall's tau was 0.31, which is considerably lower than that posted by the best performing pre-retrieval predictors
however, overall the humans did manage to differentiate between “good” and “bad”
queries
a low correlation between humans’ predictions and those of query-performance
predictors with some exceptions
These findings further demonstrate how difficult the query-performance prediction task is and how little is known about what makes a query difficult
18. Are Queries Found to Be Difficult in One Collection Still Considered Difficult in Another Collection?
Robust track 2005: Difficult topics in the ROBUST collection were tested against another
collection (AQUAINT)
The median average precision over the ROBUST collection is 0.126
Compared to 0.185 for the same topics over the AQUAINT collection
Apparently, the AQUAINT collection is “easier” than the ROBUST collection
probably due to
Collection size
Many more relevant documents per topic in AQUAINT
Document features such as structure and coherence
However, the relative difficulty of the topics is preserved over the two datasets
The Pearson correlation between topics’ performance in both datasets is 0.463
This illustrates some dependency between the topic’s median scores on both
collections
Even when topics are somewhat easier in one collection than another,
the relative difficulty among topics is preserved, at least to some extent
20. The Retrieval Task
Given:
A document set D (the corpus)
A query q
Retrieve Dq (the result list), a ranked list of documents from D,
which are most likely to be relevant to q
Some widely used retrieval methods:
Vector space tf-idf based ranking, which estimates relevance by the
similarity between the query and a document in the vector space
The probabilistic OKAPI-BM25 method, which estimates the probability
that the document is relevant to the query
Language-model-based approaches, which estimate the probability that
the query was generated by a language model induced from the
document
And more: divergence from randomness (DFR) approaches, inference networks, Markov random fields (MRF), …
21. Text REtrieval Conference (TREC)
A series of workshops for large-scale evaluation of (mostly) text retrieval technology:
Realistic test collections
Uniform, appropriate scoring procedures
Started in 1992
A TREC task usually comprises:
A document collection (corpus)
A list of topics (information needs)
A list of relevant documents for each topic (QRELs)
Example topic:
Title: African Civilian Deaths
Description: How many civilian non-combatants have been killed in the various civil wars in Africa?
Narrative: A relevant document will contain specific casualty information for a given area, country, or region. It will cite numbers of civilian deaths caused directly or indirectly by armed conflict.
22. Precision Measures
Precision at k (P@k): The fraction of relevant documents
among the top-k results.
Average precision: AP is the average of precision values
computed at the ranks of each of the relevant documents
in the ranked list:
AP(q) = (1/|Rq|) · Σ_{r ∈ Rq} P@rank(r)
Rq is the set of documents in the corpus that are relevant to q
Average Precision is usually computed using a ranking which is truncated at
some position (typically 1000 in TREC).
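As a concrete reference, here is a minimal sketch (assumed, not from the tutorial) of P@k and truncated AP computed from a ranked list of document ids and a set of relevant ids:

    def precision_at_k(ranking, relevant, k):
        """P@k: fraction of relevant documents among the top-k results."""
        return sum(1 for d in ranking[:k] if d in relevant) / k

    def average_precision(ranking, relevant, cutoff=1000):
        """AP over a ranking truncated at `cutoff` (typically 1000 in TREC);
        relevant documents that are never retrieved contribute zero."""
        hits, precisions = 0, []
        for i, d in enumerate(ranking[:cutoff], start=1):
            if d in relevant:
                hits += 1
                precisions.append(hits / i)
        return sum(precisions) / len(relevant) if relevant else 0.0

    print(average_precision(["d3", "d7", "d1"], {"d3", "d1", "d9"}))  # (1/1 + 2/3) / 3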
23. Prediction Quality Measures
Given:
Query q
Result list Dq that contains n documents
Goal: Estimate the retrieval effectiveness of Dq, in terms of
satisfying Iq, the information need behind q
specifically, the prediction task is to predict AP(q) when no
relevance information (Rq) is given.
In practice: Estimate, for example, the expected average
precision for q.
The quality of a performance predictor can be measured by the
correlation between the expected average precision and the
corresponding actual precision values
24. Measures of Correlation
Linear correlation (Pearson)
considers the true AP values as well as the predicted AP values of queries
Non-linear correlation (Kendall’s tau, Spearman’s rho)
considers only the ranking of queries by their true AP values and by their
predicted AP values
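A minimal sketch of how prediction quality is typically scored, assuming SciPy is available; the predicted and actual AP values below are purely illustrative:

    from scipy.stats import pearsonr, kendalltau, spearmanr

    actual_ap = [0.42, 0.13, 0.55, 0.08, 0.31]   # actual per-query AP (hypothetical)
    predicted = [0.60, 0.20, 0.70, 0.15, 0.40]   # predictor values for the same queries

    print("Pearson r:   ", pearsonr(actual_ap, predicted)[0])    # uses the values themselves
    print("Kendall tau: ", kendalltau(actual_ap, predicted)[0])  # uses only the induced rankings
    print("Spearman rho:", spearmanr(actual_ap, predicted)[0])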
25. Evaluating a Prediction System
[Scatter plot: predicted AP (y-axis) vs. actual AP (x-axis) for queries Q1–Q5]
Different correlation metrics will measure different things
If there is no a priori reason to use Pearson, prefer Spearman or Kendall's tau
In practice, the difference is not large, because there are enough queries and because of the distribution of AP
26. Evaluating a Prediction System (contd.)
It is important to note that it was found that state-of-the-art
query-performance predictors might not be correlated (at all)
with measures of users’ performance (e.g., the time it takes to
reach the first relevant document)
see Turpin and Hersh ADC ’04 and Zhao and Scholer ADC ‘07
However,
This finding might be attributed, as suggested by Turpin and Hersh in ADC
’04, to the fact that standard evaluation measures (e.g., average
precision) and users’ performance are not always strongly correlated
Hersh et al. ’00, Turpin and Hersh ’01, Turpin and Scholer ’06, Smucker and
Parkash Jethani ‘10
29. Pre-Retrieval Prediction Methods
Pre-retrieval prediction approaches estimate the quality of the
search results before the search takes place.
Provide effective instantiation of query performance prediction for search
applications that must efficiently respond to search requests
Only the query terms, associated with some pre-defined statistics
gathered at indexing time, can be used for prediction
Pre-retrieval methods can be split into linguistic and statistical methods.
Linguistic methods apply natural language processing (NLP)
techniques and use external linguistic resources to identify
ambiguity and polysemy in the query.
Statistical methods analyze the distribution of the query
terms within the collection
30. Linguistic Approaches (Mothe & Tanguy, SIGIR 2005 Query Difficulty Workshop; Hauff 2010)
Most linguistic features do not correlate well with the system performance.
Features include, for example, morphological (avg. # of morphemes per query term),
syntactic link span (which relates to the average distance between query words in the
parse tree), semantic (polysemy- the avg. # of synsets per word in the WordNet
dictionary)
Only the syntactic links span and the polysemy value were shown to have
some (low) correlation
This is quite surprising as intuitively poor performance can be expected for
ambiguous queries
Apparently, term ambiguity should be measured using corpus-based approaches,
since a term that might be ambiguous with respect to the general vocabulary, may
have only a single interpretation in the corpus
31. Pre-Retrieval Statistical Methods
Analyze the distribution of the query term frequencies within the
collection.
Two frequently used term statistics:
Inverse document frequency idf(t) = log(N/Nt)
Inverse collection term frequency ictf(t) = log(|D|/tf(t,D))
Specificity based predictors
Measure the query terms' distribution over the collection
Specificity-based predictors:
avgIDF, avgICTF - Queries composed of infrequent terms are easier to satisfy
maxIDF, maxICTF – Similarly
varIDF, varICTF - Low variance reflects the lack of dominant terms in the query
A query composed of non-specific terms is deemed to be more difficult
“Who and Whom”
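A minimal sketch of the specificity-based predictors above; document frequencies (df), collection term frequencies (ctf), the number of documents N, and the corpus length are assumed to be available from the index:

    import math
    from statistics import mean, pvariance

    def idf(t, N, df):
        return math.log(N / df[t])            # idf(t) = log(N / Nt)

    def ictf(t, corpus_len, ctf):
        return math.log(corpus_len / ctf[t])  # ictf(t) = log(|D| / tf(t, D))

    def specificity_predictors(query_terms, N, df, corpus_len, ctf):
        """avg/max/var of IDF and ICTF over the query terms."""
        idfs  = [idf(t, N, df) for t in query_terms]
        ictfs = [ictf(t, corpus_len, ctf) for t in query_terms]
        return {
            "avgIDF": mean(idfs),   "maxIDF": max(idfs),   "varIDF": pvariance(idfs),
            "avgICTF": mean(ictfs), "maxICTF": max(ictfs), "varICTF": pvariance(ictfs),
        }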
32. Specificity-Based Predictors
Query Scope (He & Ounis 2004)
The percentage of documents in the corpus that contain at least one
query term; high value indicates many candidates for retrieval, and
thereby a difficult query
Query scope exhibits marginal prediction quality for short queries only, while for long queries its quality drops significantly
QS is not a ``pure'' pre-retrieval predictor as it requires finding the
documents containing query terms (consider dynamic corpora)
Simplified Clarity Score (He & Ounis 2004)
The Kullback-Leibler (KL) divergence between the (simplified) query
language model and the corpus language model
SCS(q) = Σ_{t∈q} p(t|q) · log( p(t|q) / p(t|D) )
SCS is strongly related to the avgICTF predictor; assuming each term appears only once in the query: SCS(q) = log(1/|q|) + avgICTF(q)
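A minimal sketch of Query Scope and the Simplified Clarity Score under the definitions above (uniform query model p(t|q) = 1/|q|; posting lists and collection term frequencies are assumed to be available):

    import math

    def query_scope(query_terms, postings, N):
        """Fraction of the N corpus documents containing at least one query term."""
        docs = set()
        for t in query_terms:
            docs.update(postings[t])          # postings[t]: ids of documents containing t
        return len(docs) / N

    def simplified_clarity_score(query_terms, corpus_len, ctf):
        """SCS: KL divergence between the uniform query model and the corpus model."""
        p_q = 1.0 / len(query_terms)
        return sum(p_q * math.log(p_q / (ctf[t] / corpus_len)) for t in query_terms)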
33. Similarity-, Coherency-, and Variance-Based Predictors
Similarity: SCQ (Zhao et al. 2008): high similarity to the corpus indicates effective retrieval
SCQ(t) = (1 + log(tf(t, D))) · idf(t)
maxSCQ(q)/avgSCQ(q) are the maximum/average over the query terms
contradicts(?) the specificity idea (e.g., SCS)
Coherency: CS (He et al. 2008): The average inter-document
similarity between documents containing the query terms,
averaged over query terms
This is a pre-retrieval concept reminiscent of the post-retrieval autocorrelation approach (Diaz 2007) that we will discuss later
A demanding computation that requires constructing a pairwise similarity matrix for all pairs of documents in the index
Variance (Zhao et al. 2008): var(t) – the variance of term t's weights (e.g., tf·idf) over the documents containing it
maxVar and sumVar are the maximum/sum over query terms
Hypothesis: low variance implies a difficult query due to low discriminative power (see the sketch below)
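A minimal sketch of the SCQ and variance-based predictors; collection term frequencies, document frequencies, and per-term lists of document weights (e.g., tf·idf) are assumed to be precomputed:

    import math
    from statistics import mean, pvariance

    def scq(t, ctf, N, df):
        """SCQ(t) = (1 + log(tf(t, D))) * idf(t)."""
        return (1.0 + math.log(ctf[t])) * math.log(N / df[t])

    def scq_predictors(query_terms, ctf, N, df):
        scores = [scq(t, ctf, N, df) for t in query_terms]
        return {"maxSCQ": max(scores), "avgSCQ": mean(scores)}

    def variance_predictors(query_terms, term_doc_weights):
        """term_doc_weights[t]: t's weights in the documents containing it."""
        variances = [pvariance(term_doc_weights[t]) for t in query_terms]
        return {"maxVar": max(variances), "sumVar": sum(variances)}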
34. Term Relatedness (Hauff 2010)
Hypothesis: If the query terms co-occur frequently in the collection we expect good
performance
Pointwise mutual information (PMI) is a popular measure of co-occurrence
statistics of two terms in the collection
It requires efficient tools for gathering collocation statistics from the corpus, to
allow dynamic usage at query run-time.
avgPMI(q)/maxPMI(q) measure the average and the maximum PMI over all pairs of query terms
PMI(t1, t2) = log( p(t1, t2 | D) / ( p(t1 | D) · p(t2 | D) ) )
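A minimal sketch of avgPMI/maxPMI estimated from document-level (co-)occurrence counts; estimating the probabilities this way is an assumption, as the tutorial does not fix a particular estimation method:

    import math
    from itertools import combinations
    from statistics import mean

    def pmi(t1, t2, df, co_df, N):
        """PMI with p(t|D) ~ df[t]/N and p(t1, t2|D) ~ co_df[pair]/N."""
        p12 = co_df.get((t1, t2), co_df.get((t2, t1), 0)) / N
        if p12 == 0:
            return float("-inf")
        return math.log(p12 / ((df[t1] / N) * (df[t2] / N)))

    def pmi_predictors(query_terms, df, co_df, N):
        vals = [pmi(a, b, df, co_df, N) for a, b in combinations(query_terms, 2)]
        return {"avgPMI": mean(vals), "maxPMI": max(vals)} if vals else {}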
35. Evaluating Pre-Retrieval Methods (Hauff 2010)
Insights
maxVAR and maxSCQ dominate other predictors, and are most stable over the
collections and topic sets
However, their performance significantly drops for one of the query sets (301-350)
Prediction is harder over the Web collections (WT10G and GOV2) than over the
news collection (ROBUST), probably due to higher heterogeneity of the data
37. Post-Retrieval Predictors
Post-retrieval methods analyze the search results in addition to the query
Usually more complex, as the top results are retrieved and analyzed
Prediction quality depends on the retrieval process
As different results are expected for the same query when using different retrieval methods
In contrast to pre-retrieval methods, the search results may depend on query-independent factors
Such as document authority scores, search personalization, etc.
[Diagram: taxonomy of post-retrieval methods – Clarity, Score Analysis, and Robustness (query / document / retrieval-method perturbations)]
Post-retrieval methods can be categorized into three main paradigms:
Clarity-based methods directly measure the coherence of the search results.
Robustness-based methods evaluate how robust the results are to perturbations in the query, the result list, and the retrieval method.
Score-distribution-based methods analyze the score distribution of the search results
38. Clarity (Cronen-Townsend et al. SIGIR 2002)
Clarity measures the coherence (clarity) of the result-list with
respect to the corpus
Good results are expected to be focused on the query's topic
Clarity considers the discrepancy between the likelihood of
words most frequently used in retrieved documents and their
likelihood in the whole corpus
Good results - The language of the retrieved documents should be
distinct from the general language of the whole corpus
Bad results - The language of retrieved documents tends to be more
similar to the general language
Accordingly, clarity measures the KL divergence between a
language model induced from the result list and that induced
from the corpus
39. Clarity Computation
Clarity(q) = KL( p(·|Dq) || p(·|D) )    (KL divergence between Dq and D)
Clarity(q) = Σ_{t∈V(D)} p(t|Dq) · log( p(t|Dq) / pMLE(t|D) )
pMLE(t|X) = tf(t, X) / |X|    (X's unsmoothed LM, MLE)
p(t|d) = λ · pMLE(t|d) + (1 − λ) · p(t|D)    (d's smoothed LM)
p(t|Dq) = Σ_{d∈Dq} p(t|d) · p(d|q)    (Dq's LM, RM1)
p(d|q) = p(q|d) · p(d) / Σ_{d'∈Dq} p(q|d') · p(d')  ∝  p(q|d)    (d's “relevance” to q)
p(q|d) = Π_{t∈q} p(t|d)    (query likelihood model)
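A minimal sketch of the Clarity computation under the definitions above, assuming bag-of-words documents, Jelinek-Mercer smoothing, and a uniform document prior; the corpus and result list are passed in as token lists:

    import math
    from collections import Counter

    def clarity(query, result_docs, corpus_docs, lam=0.6):
        """KL divergence between the result-list language model and the corpus model."""
        corpus_counts = Counter(t for d in corpus_docs for t in d)
        corpus_len = sum(corpus_counts.values())
        p_corpus = {t: c / corpus_len for t, c in corpus_counts.items()}

        def p_t_d(t, counts, length):               # smoothed document model p(t|d)
            return lam * counts[t] / length + (1 - lam) * p_corpus.get(t, 1e-12)

        doc_stats = [(Counter(d), len(d)) for d in result_docs]
        likelihoods = [math.prod(p_t_d(t, c, n) for t in query) for c, n in doc_stats]
        z = sum(likelihoods)
        weights = [l / z for l in likelihoods]       # p(d|q) with a uniform prior

        score = 0.0
        for t in p_corpus:                           # p(t|Dq) = sum_d p(t|d) p(d|q)
            p_t_dq = sum(w * p_t_d(t, c, n) for w, (c, n) in zip(weights, doc_stats))
            score += p_t_dq * math.log(p_t_dq / p_corpus[t])
        return score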
40. Example
Consider two query variants for TREC topic 56 (from the TREC query track):
Query A: Show me any predictions for changes in the prime lending rate
and any changes made in the prime lending rates
Query B: What adjustments should be made once federal action occurs?
41. The Clarity of 1) Relevant Results, 2) Non-Relevant Results, 3) Random Documents
[Figure: query clarity score (y-axis, 0–10) per topic number (x-axis, 300–450) for relevant results, non-relevant results, and collection-wide random documents]
42. Novel Interpretation of Clarity – Hummel et al. 2012 (see the Clarity Revisited poster!)
Clarity(q) = KL( p(·|Dq) || p(·|D) ) = CE( p(·|Dq) || p(·|D) ) − H( p(·|Dq) )
(cross entropy minus entropy)
Clarity(q) = Distance(q) − Diversity(q)
43. Clarity Variants
Divergence from randomness approach (Amati et al.
’02)
Emphasize query terms (Cronen-Townsend et al. ’04, Hauff et al. ’08)
Only consider terms that appear in a very low
percentage of all documents in the corpus (Hauff et al.
’08)
Beneficial for noisy Web settings
44. Robustness
Robustness can be measured with respect to perturbations of the
Query
The robustness of the result list to small modifications of the query (e.g.,
perturbation of term weights)
Documents
Small random perturbations of the document representation are unlikely to
result in major changes to documents' retrieval scores
If scores of documents are spread over a wide range, then these
perturbations are unlikely to result in significant changes to the ranking
Retrieval method
In general, different retrieval methods tend to retrieve different results for the
same query, when applied over the same document collection
A high overlap in results retrieved by different methods may be related to high
agreement on the (usually sparse) set of relevant results for the query.
A low overlap may indicate no agreement on the relevant results; hence, query
difficulty
45. Query Perturbations
Overlap between the query and its sub-queries (Yom Tov et al.
SIGIR 2005)
Observation: Some query terms have little or no influence on the
retrieved documents, especially in difficult queries.
The query feedback (QF) method (Zhou and Croft SIGIR 2007)
models retrieval as a communication channel problem
The input is the query, the channel is the search system, and the set of
results is the noisy output of the channel
A new query is generated from the list of results, using the terms with
maximal contribution to the Clarity score, and then a second list of results
is retrieved for that query
The overlap between the two lists is used as a robustness score.
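A minimal sketch of the overlap-based robustness score used by QF-style predictors; generating the new query from the result list is assumed to happen elsewhere:

    def overlap_at_k(original_list, feedback_list, k=50):
        """Fraction of the top-k original results that also appear in the top-k
        results retrieved for the query regenerated from the result list."""
        return len(set(original_list[:k]) & set(feedback_list[:k])) / k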
[Diagram: the query q is run by the search engine to produce the original result list; a new query q' is generated from it and a second result list is retrieved; the overlap between the two lists serves as the prediction of AP(q)]
46. Document Perturbation (Zhou & Croft CIKM 06)
(A conceptually similar approach, using a different technique, was presented by Vinay et al. in SIGIR 06)
How stable is the ranking in the presence of uncertainty in the
ranked documents?
Compare a ranked list from the original collection to the corresponding
ranked list from the corrupted collection using the same query and ranking
function
[Diagram: each document d in collection D is perturbed via its language model into a corrupted counterpart d' in collection D'; the query q is run against both collections and the resulting ranked lists d1…dk and d'1…d'k are compared]
47. Cohesion of the Result List – Clustering Tendency (Vinay et al. SIGIR 2006)
The cohesion of the result list can be measured by its clustering
patterns
Following the ``cluster hypothesis'' which implies that documents
relevant to a given query are likely to be similar to one another
A good retrieval returns a single, tight cluster, while poor retrieval returns
a loosely related set of documents covering many topics
The ``clustering tendency'' of the result set
Corresponds to the Cox-Lewis statistic which measures the
``randomness'' level of the result list
Measured by the distance between a randomly selected document and its nearest neighbor from the result list.
When the list contains “inherent” clusters, the distance between the random
document and its closest neighbor is likely to be much larger than the
distance between this neighbor and its own nearest neighbor in the list
48. Retrieval Method Perturbation (Aslam & Pavlu, ECIR 2007)
Query difficulty is predicted by submitting the query to different
retrieval methods and measuring the diversity of the retrieved
ranked lists
Each ranking is mapped to a distribution over the document collection
JSD distance is used to measure the diversity of these distributions
Evaluation: Submissions of all participants to several TREC tracks
were analyzed
The agreement between submissions highly correlates with the query
difficulty, as measured by the median performance (AP) of all participants
The more submissions are analyzed, the better the prediction quality.
49. Score Distribution Analysis
Often, the retrieval scores reflect the similarity of documents to a
query
Hence, the distribution of retrieval scores can potentially help predict
query performance.
Naïve predictors:
The highest retrieval score or the mean of top scores (Thomlison 2004)
The difference between query-independent scores and query-dependent scores
Reflects the ``discriminative power'' of the query (Bernstein et al. 2005)
50. Spatial Autocorrelation (Diaz SIGIR 2007)
Query performance is correlated with the extent to which the result list
“respects” the cluster hypothesis
The extent to which similar documents receive similar retrieval scores
In contrast, a difficult query might be detected when similar documents are
scored differently.
A document's ``regularized'' retrieval score is determined based on the
weighted sum of the scores of its most similar documents
Score_Reg(q, d) = Σ_{d'} Sim(d, d') · Score(q, d')
The linear correlation of the regularized scores with the original scores is used for
query-performance prediction.
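A minimal sketch of the autocorrelation predictor; cosine similarity between term-weight vectors and Pearson correlation are assumed as the concrete choices here:

    import math

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0

    def autocorrelation(doc_vectors, scores):
        """Regularize each document's score by the similarity-weighted scores of the
        other retrieved documents, then correlate regularized and original scores."""
        regularized = [
            sum(cosine(doc_vectors[i], doc_vectors[j]) * scores[j]
                for j in range(len(doc_vectors)) if j != i)
            for i in range(len(doc_vectors))
        ]
        return pearson(scores, regularized)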
51. Weighted Information Gain – WIG (Zhou & Croft 2007)
WIG measures the divergence between the mean retrieval score of top-
ranked documents and that of the entire corpus.
The more similar these documents are to the query, with respect to the
corpus, the more effective the retrieval
The corpus represents a general non-relevant document
WIG(q) = (1/k) · Σ_{d∈Dq^k} Σ_{t∈q} λt · log( p(t|d) / p(t|D) ) = avg_{d∈Dq^k}( Score(d) ) − Score(D)
Dq^k is the list of the k highest-ranked documents
λt reflects the weight of the term's type
When all query terms are simple keywords, this parameter collapses to 1/sqrt(|q|)
WIG was originally proposed and employed in the MRF framework
However, for a bag-of-words representation MRF reduces to the query likelihood
model and is effective (Shtok et al. ’07, Zhou ’07)
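A minimal sketch of WIG for a bag-of-words query under the definitions above, with λt = 1/sqrt(|q|); smoothed document models and the corpus model are assumed to be precomputed:

    import math

    def wig(query, top_docs, doc_models, corpus_model, k=5):
        """Average divergence of the top-k documents' query-term likelihoods
        from the corpus likelihoods, weighted by 1/sqrt(|q|)."""
        lam = 1.0 / math.sqrt(len(query))
        total = 0.0
        for d in top_docs[:k]:
            for t in query:
                p_d = doc_models[d].get(t, 1e-12)   # smoothed p(t|d)
                p_c = corpus_model.get(t, 1e-12)    # p(t|D)
                total += lam * math.log(p_d / p_c)
        return total / k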
52. Normalized Query Commitment – NQC (Shtok et al. ICTIR 2009 and TOIS 2012)
NQC estimates the presumed amount of query drift in the list of top-retrieved documents
Query expansion often uses a centroid of the list Dq as an expanded query model
The centroid usually manifests query drift (Mitra et al. ‘98)
The centroid could be viewed as a prototypical misleader as it exhibits (some)
similarity to the query
This similarity is dominated by non-query-related aspects that lead to query drift
Shtok et al. showed that the mean retrieval score of documents in the result list
corresponds, in several retrieval methods, to the retrieval score of some centroid-based
representation of Dq
Thus, the mean score represents the score of a prototypical misleader
The standard deviation of scores which reflects their dispersion around the mean,
represents the divergence of retrieval scores of documents in the list from that of a non-
relevant document that exhibits high query similarity (the centroid).
NQC(q) = sqrt( (1/k) · Σ_{d∈Dq^k} ( Score(d) − μ )² ) / Score(D)
where μ is the mean retrieval score of the documents in Dq^k
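A minimal sketch of NQC over the retrieval scores of the top-k documents; taking the absolute value of the corpus score is an assumption that keeps the sign sensible for log-likelihood scores:

    from statistics import pstdev

    def nqc(top_scores, corpus_score, k=100):
        """Standard deviation of the top-k retrieval scores, normalized by the corpus score."""
        return pstdev(top_scores[:k]) / abs(corpus_score)

    # hypothetical log query-likelihood scores
    print(nqc([-4.1, -4.3, -4.9, -5.6, -6.2], corpus_score=-9.0, k=5))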
53. A Geometric Interpretation of NQC
54. Evaluating Post-Retrieval Methods (Shtok et al. TOIS 2012)
Prediction quality (Pearson correlation and Kendall’s tau)
Similarly to pre-retrieval methods, there is no clear winner
QF and NQC exhibit comparable results
NQC exhibits good performance over most collections but does
not perform very well on GOV2.
QF performs well over some of the collections but is inferior to
other predictors on ROBUST.
55. Additional Predictors That Analyze the Retrieval Score Distribution
Computing the standard deviation of retrieval scores at a query-
dependent cutoff
Perez-Iglesias and Araujo SPIRE 2010
Cummins et al. SIGIR 2011
Computing expected ranks for documents
Vinay et al. CIKM 2008
Inferring AP directly from the retrieval score distribution
Cummins AIRS 2011
57. Combining Post-Retrieval Predictors
Some integration efforts of a few predictors based on linear
regression:
Yom-Tov et al (SIGIR’05) combined avgIDF with the Overlap predictor
Zhou and Croft (SIGIR’07) integrated WIG and QF using a simple linear
combination
Diaz (SIGIR’07) incorporated the spatial autocorrelation predictor with
Clarity and with the document perturbation based predictor
In all those trials, the results of the combined predictor were
much better than the results of the single predictors
This suggests that these predictors measure (at least semi)
complementary properties of the retrieved results
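A minimal sketch of combining predictors with linear regression, assuming scikit-learn; the per-query predictor values and AP labels below are purely illustrative:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # each row: predictor values for one training query, e.g., [Clarity, WIG, QF]
    X = np.array([[1.2, 0.8, 0.4],
                  [0.3, 0.2, 0.1],
                  [2.1, 1.5, 0.7],
                  [0.9, 0.4, 0.5]])
    y = np.array([0.42, 0.10, 0.63, 0.35])      # actual AP of the training queries

    combined = LinearRegression().fit(X, y)     # learn the combined predictor
    print(combined.predict([[1.0, 0.9, 0.3]]))  # predicted AP for a new query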
58. Utility Estimation Framework (UEF) for Query Performance Prediction (Shtok et al. SIGIR 2010)
Suppose that we have a true model of relevance, R_Iq, for the information need Iq that is represented by the query q
Then, the ranking π(Dq, R_Iq), induced over the given result list using R_Iq, is the most effective for these documents
Accordingly, the utility (with respect to the information need) provided by the given ranking, which can be thought of as reflecting query performance, can be defined as
U(Dq | Iq) = Similarity( Dq, π(Dq, R_Iq) )
In practice, we have no explicit knowledge of the underlying information need and of R_Iq
Using statistical decision theory principles, we can approximate the utility by estimating R_Iq:
U(Dq | Iq) ≈ ∫_{R̂q} Similarity( Dq, π(Dq, R̂q) ) · p(R̂q | Iq) dR̂q
59. Instantiating Predictors
U(Dq | Iq) ≈ ∫_{R̂q} Similarity( Dq, π(Dq, R̂q) ) · p(R̂q | Iq) dR̂q
Relevance-model estimates (R̂q)
relevance language models constructed from documents sampled from the highest ranks of some initial ranking
Estimate the relevance model's presumed “representativeness” of the information need (p(R̂q | Iq))
apply previously proposed predictors (Clarity, WIG, QF, NQC) upon the sampled documents from which the relevance model is constructed
Inter-list similarity measures
Pearson, Kendall's tau, Spearman
A specific, highly effective, instantiated predictor:
Construct a single relevance model from the given result list, Dq
Use a previously proposed predictor upon Dq to estimate relevance-model effectiveness
Use Pearson's correlation between retrieval scores for the similarity measure
60. The UEF Framework – A Flow Diagram
[Flow diagram of UEF: the query q is submitted to the search engine, producing the result list Dq = d1…dk; documents are sampled from Dq to construct a relevance model R̂q;S; Dq is re-ranked using R̂q;S, giving π(Dq, R̂q;S); the relevance-model quality estimate and the similarity between the original and re-ranked lists are combined by the performance predictor into the prediction of AP(q)]
63. Are the Various Post-Retrieval Predictors That Different from Each Other?
A unified framework for explaining post-retrieval predictors (Kurland et al. ICTIR 2011)
64. A Unified Post-Retrieval Prediction Framework
We want to predict the effectiveness of a ranking π_M(q; D) of the corpus that was induced by retrieval method M in response to query q
Assume a true model of relevance R_Iq that can be used for retrieval ⇒ the resultant ranking, π_opt(q; D), is of optimal utility
Utility(π_M(q; D); Iq) = sim( π_M(q; D), π_opt(q; D) )
Use as reference comparisons a pseudo-effective (PE) and a pseudo-ineffective (PIE) ranking (cf. Rocchio '71):
Û(π_M(q; D); Iq) = α(q) · sim( π_M(q; D), π_PE(q; D) ) − β(q) · sim( π_M(q; D), π_PIE(q; D) )
[Diagram: the corpus ranking π_M(q; D) is compared against a pseudo-effective ranking π_PE(q; D) and a pseudo-ineffective ranking π_PIE(q; D)]
65. Instantiating Predictors
Focus on the result lists of the documents most highly ranked by each ranking
Û(π_M(q; D); Iq) = α(q) · sim(Dq, L_PE) − β(q) · sim(Dq, L_PIE)
Deriving predictors:
1. “Guess” a PE result list (L_PE) and/or a PIE result list (L_PIE)
2. Select weights α(q) and β(q)
3. Select an inter-list (ranking) similarity measure – Pearson's r, Kendall's τ, Spearman's ρ
66. Instantiations Based on a Pseudo-Ineffective Result List
Û(π_M(q; D); Iq) = α(q) · sim(L_M, L_PE) − β(q) · sim(L_M, L_PIE)
Basic idea: the PIE result list is composed of k copies of a pseudo-ineffective document
Clarity (Cronen-Townsend '02, '04) – pseudo-ineffective document: the corpus; description: estimates the focus of the result list with respect to the corpus; similarity measure: negative KL divergence between language models; α(q) = 0, β(q) = 1
WIG (Weighted Information Gain; Zhou & Croft '07) – pseudo-ineffective document: the corpus; description: measures the difference between the retrieval scores of documents in the result list and the score of the corpus; similarity measure: negative L1 distance of retrieval scores; α(q) = 0, β(q) = 1
NQC (Normalized Query Commitment; Shtok et al. '09) – pseudo-ineffective document: the result-list centroid; description: measures the standard deviation of the retrieval scores of documents in the result list; similarity measure: negative L2 distance of retrieval scores; α(q) = 0, β(q) = 1
67. Instantiations Based on a Pseudo-Effective Result List
Û(π_M(q; D); Iq) = α(q) · sim(L_M, L_PE) − β(q) · sim(L_M, L_PIE)
QF (Query Feedback; Zhou & Croft '07) – pseudo-effective result list: obtained by using a relevance model for retrieval over the corpus; description: measures the “amount of noise” (non-query-related aspects) in the result list; similarity measure: overlap at top ranks; α(q) = 1, β(q) = 0
UEF (Utility Estimation Framework; Shtok et al. '10) – pseudo-effective result list: the given result list re-ranked using a relevance model; description: estimates the potential utility of the result list using relevance models; similarity measure: rank/score correlation; α(q) = presumed representativeness of the relevance model, β(q) = 0
Autocorrelation (Diaz '07) – pseudo-effective result list: (1) score regularization, (2) a fusion-based result list; description: (1) the degree to which retrieval scores “respect the cluster hypothesis”, (2) similarity with a fusion-based result list; similarity measure: Pearson correlation between scores; α(q) = 1, β(q) = 0
68. A General Model for Query Difficulty
69. A Model of Query Difficulty (Carmel et al., SIGIR 2006)
A user with a given information need (a topic):
Submits a query to a search engine
Judges the search results according to their relevance to this information need.
Thus, the query/ies and the Qrels are two sides of the same information need
Qrels also depend on the existing collection (C)
[Diagram: a topic links the queries (Q) and the relevant documents (R), which are connected through the search engine]
Define: Topic = (Q, R | C)
70. A Theoretical Model of Topic Difficulty
Main Hypothesis
Topic difficulty is induced from the distances between the model parts
71. Model Validation: The Pearson Correlation between Average Precision and the Model Distances (see the paper for estimates of the various distances)
Based on the .gov2 collection (25M docs) and 100 topics of the Terabyte tracks 04/05
[Figure: Pearson correlations between AP and the individual distances among the model components (topic, queries, documents) range from −0.06 to 0.32; combining the distances yields a correlation of 0.45]
72. A Probabilistic Framework for QPP (Kurland et al. CIKM 2012, to appear)
73. Post-Retrieval Prediction (Kurland et al. 2012)
[Slide shows the derived post-retrieval prediction expression; annotations: query-independent result-list properties (e.g., cohesion/dispersion); WIG and Clarity are derived from this expression]
74. Using Reference Lists (Kurland et al. 2012)
The presumed quality of the reference list (a prediction task in its own right)
76. Main Applications for QDE
Feedback to the user and the system
Federation and metasearch
Content enhancement using missing content analysis
Selective query expansion/ query selection
Others
77. Feedback to the User and the System
Direct feedback to the user
“We were unable to find relevant
documents to your query”
Estimating the value of terms for query refinement
Suggest which terms to add, and
estimate their value
Personalization
Which queries would benefit from
personalization? (Teevan et al. ‘08)
78. Federation and Metasearch: Problem Definition
Given several databases that might contain information relevant to a given question, how do we construct a good unified list of answers from all these datasets?
Similarly, given a set of search engines employed for the same query, how do we merge their results in an optimal manner?
[Diagram: several search engines, each over its own collection, return results that are federated into a merged ranking]
79. Prediction-Based Federation & Metasearch
Weight each result set by its predicted precision (Yom-Tov et al., 2005; Sheldon et al. '11)
or select the best predicted search engine (White et al., 2008; Berger & Savoy, 2007)
[Diagram: as before, each search engine's results are weighted by predicted quality before federation into the merged ranking]
80. Metasearch and Federation Using Query Prediction
Train a predictor for each engine/collection pair
For a given query: Predict its difficulty for each engine/collection
pair
Weight the results retrieved from each engine/collection pair
accordingly, and generate the federated list
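A minimal sketch of prediction-based merging: each engine's scores are weighted by the query's predicted performance on that engine before the lists are fused (a CombSUM-style choice made here for illustration):

    from collections import defaultdict

    def federate(result_lists, predicted_quality):
        """result_lists[engine] = [(doc_id, score), ...];
        predicted_quality[engine] = predicted performance of the query on that engine."""
        merged = defaultdict(float)
        for engine, results in result_lists.items():
            w = predicted_quality[engine]
            for doc_id, score in results:
                merged[doc_id] += w * score
        return sorted(merged.items(), key=lambda x: x[1], reverse=True)

    lists = {"SE1": [("d1", 0.9), ("d2", 0.7)], "SE2": [("d2", 0.8), ("d3", 0.6)]}
    print(federate(lists, {"SE1": 0.3, "SE2": 0.7}))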
81. Metasearch Experiment
The LA Times collection was indexed using four search engines.
A predictor was trained for each search engine.
For each query, the result list from each search engine was weighted using the prediction for that query.
The final ranking is a ranking of the union of the result lists, weighted by the predictions.
P@10 %no
Single search engine – SE 1 0.139 47.8
Single search engine – SE 2 0.153 43.4
Single search engine – SE 3 0.094 55.2
Single search engine – SE 4 0.171 37.4
Metasearch – Round-robin 0.164 45.0
Metasearch – MetaCrawler 0.163 34.9
Metasearch – Prediction-based 0.183 31.7
82. Federation for the TREC 2004 Terabyte Track
GOV2 collection: 426GB, 25 million documents
50 topics
For federation, divided into 10 partitions of roughly equal size
P@10 MAP
One collection 0.522 0.292
Prediction-based federation 0.550 0.264
Score-based federation 0.498 0.257
* 10-fold cross-validation
83. Content Enhancement Using Missing Content Analysis
[Flow diagram: information needs are submitted to the search engine over a repository; missing-content estimation identifies needs that are not covered, data gathering pulls in external content, and quality estimation controls what is added to the repository]
84. Story Line
A helpdesk system where system administrators try to find
relevant solutions in a database
The system administrators’ queries are logged and analyzed to
find topics of interest that are not covered in the current
repository
Additional content for topics which are lacking is obtained from
the internet, and is added to the local repository
85. Adding Content Where It Is Missing Improves Precision
Cluster user queries
Add content according to:
Lacking content
Size
Random
Adding content where it is missing
brings the best improvement in
precision
86. Selective Query Expansion
Automatic query expansion: Improving average quality by adding search
terms
Pseudo-relevance feedback (PRF): Considers the (few) top-ranked
documents to be relevant, and uses them to expand the query.
PRF works well on average, but can significantly degrade retrieval
performance for some queries.
PRF fails because of query drift: non-relevant documents infiltrate into the top results, or relevant results contain aspects irrelevant to the query's original intention.
Rocchio’s method: The expanded query is obtained by combining the
original query and the centroid of the top results, adding new terms to
the query and reweighting the original terms.
Qm = α · Q_Original + β · Σ_{Dj ∈ D_Relevant} Dj − γ · Σ_{Dk ∈ D_Irrelevant} Dk
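A minimal sketch of Rocchio-style expansion over term-weight vectors; in pseudo-relevance feedback the "relevant" set is simply the top-ranked results, and the negative (irrelevant) term is often dropped. All weights below are illustrative:

    from collections import Counter

    def rocchio(query_vec, relevant_docs, irrelevant_docs=(), alpha=1.0, beta=0.75, gamma=0.15):
        """Expanded query: alpha*q + beta*centroid(relevant) - gamma*centroid(irrelevant)."""
        expanded = Counter({t: alpha * w for t, w in query_vec.items()})
        for docs, sign, coef in ((relevant_docs, 1, beta), (irrelevant_docs, -1, gamma)):
            for d in docs:
                for t, w in d.items():
                    expanded[t] += sign * coef * w / len(docs)
        return {t: w for t, w in expanded.items() if w > 0}

    q = {"hubble": 1.0, "achievements": 1.0}
    top = [{"hubble": 2.0, "telescope": 3.0, "mirror": 1.0}]   # pseudo-relevant documents
    print(rocchio(q, top))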
87. Some Work on Selective Query Expansion
Expand only “easy” queries
Predict which are the easy queries and only expand them (Amati et al.
2004, Yom-Tov et al. 2005)
Estimate query drift (Cronen-Townsend et al., 2006)
Compare the expanded list with the unexpanded list to estimate if too
much noise was added by using expansion
Selecting a specific query expansion form from a list of candidates
(Winaver et al. 2007)
Integrating various query-expansion forms by weighting them using
query-performance predictors (Soskin et al., 2009)
Estimate how much to expand (Lv and Zhai, 2009)
Predict the best α in the Rocchio formula using features of the query, the
documents, and the relationship between them
But, see Azzopardi and Hauff 2009
88. Other Uses of Query Difficulty Estimation
Collaborative filtering (Bellogin & Castells, 2009): Measure the lack of ambiguity in a user's preferences.
An “easy” user is one whose preferences are clear-cut, and who thus contributes much to a neighbor.
Term selection (Kumaran & Carvalho, 2009): Identify irrelevant
query terms
Queries with fewer irrelevant terms tend to have better results
Reducing long queries (Cummins et al., 2011)
90. Summary
In this tutorial, we surveyed the current state-of-the-art research on query
difficulty estimation for IR
We discussed the reasons that
Cause search engines to fail for some of the queries
Bring about a high variability in performance among queries as well as among
systems
We summarized several approaches for query performance prediction
Pre-retrieval methods
Post-retrieval methods
Combining predictors
We reviewed evaluation metrics for prediction quality and the results of various
evaluation studies conducted over several TREC benchmarks.
These results show that existing state-of-the-art predictors are able to identify difficult queries, demonstrating reasonable prediction quality
However, prediction quality is still moderate and should be substantially improved in order for predictors to be widely used in IR tasks
91. Summary – Main Results
Current linguistic-based predictors do not exhibit meaningful
correlation with query performance
This is quite surprising as intuitively poor performance can be expected
for ambiguous queries
In contrast, statistical pre-retrieval predictors such as SumSCQ and maxVAR have relatively significant predictive ability
These pre-retrieval predictors, and a few others, exhibit comparable
performance to post-retrieval methods such as Clarity, WIG, NQC, QF
over large scale Web collections
This is counter-intuitive as post-retrieval methods are exposed to much
more information than pre-retrieval methods
However, current state-of-the-art predictors still suffer from low
robustness in prediction quality
This robustness problem and the moderate prediction quality of existing predictors are two of the greatest challenges in query difficulty prediction that should be further explored in the future
92. Summary (contd.)
We tested whether combining several predictors together may improve
prediction quality
Especially when the different predictors are independent and measure different
aspects of the query and the search results
An example of a combination method is linear regression
The regression task is to learn how to optimally combine the predicted
performance values in order to best fit them to the actual performance values
Results were moderate, probably due to the sparseness of the training data, which
over-represents the lower end of the performance values
We discussed three frameworks for query-performance prediction
Utility Estimation Framework (UEF); Shtok et al. 2010
State-of-the-art prediction quality
A unified framework for post-retrieval prediction that sets common grounds for various
previously-proposed predictors; Kurland et al. 2011
A fundamental framework for estimating query difficulty; Carmel et al. 2006
93. Summary (contd.)
We discussed a few applications that utilize query difficulty
estimators
Handling each query individually based on its estimated difficulty
Find the best terms for query refinement by measuring the expected gain in
performance for each candidate term
Expand the query or not based on predicted performance of the expanded
query
Personalize the query selectively, only in cases where personalization is expected to bring value
Collection enhancement guided by the identification of missing content
queries
Fusion of search results from several sources based on their predicted
quality
94. What’s Next?
Predicting the performance for other query types
Navigational Queries
XML queries (Xquery, Xpath)
Domain-specific queries (e.g., healthcare)
Considering other factors that may affect query difficulty
Who is the person behind the query? in what context?
Geo-spatial features
Temporal aspects
Personal parameters
Query difficulty in other search paradigms
Multifaceted search
Exploratory Search
95. Concluding Remarks
Research on query difficulty estimation began only ten years ago with the pioneering work on the Clarity predictor (2002)
Since then this subfield has found its place at the center of IR research
These studies have revealed alternative prediction approaches, new evaluation
methodologies, and novel applications
In this tutorial we covered
Existing performance prediction methods
Some evaluation studies
Potential applications
Some anticipated future directions in the field
While the progress we see is enormous already, performance prediction is still
challenging and far from being solved
Much more accurate predictors are required in order for them to be widely adopted in IR tasks
We hope that this tutorial will contribute to increasing the interest in query difficulty estimation