Query Performance Prediction
            for IR
               David Carmel, IBM Haifa Research Lab
               Oren Kurland, Technion


          SIGIR Tutorial
          Portland Oregon, August 12, 2012
Instructors
   Dr. David Carmel
              Research Staff Member at the Information Retrieval group at IBM Haifa
              Research Lab
              Ph.D. in Computer Science from the Technion, Israel, 1997
              Research Interests: search in the enterprise, query performance
              prediction, social search, and text mining
              carmel@il.ibm.com,
              https://researcher.ibm.com/researcher/view.php?person=il-CARMEL

   Dr. Oren Kurland
              Senior lecturer at the Technion --- Israel Institute of Technology
              Ph.D. in Computer Science from Cornell University, 2006
              Research Interests: information retrieval
              kurland@ie.technion.ac.il,
              http://iew3.technion.ac.il/~kurland






   This tutorial is based in part on the book
              Estimating the query difficulty
              for Information Retrieval
                   Synthesis Lectures on Information
                   Concepts, Retrieval, and Services,
                   Morgan & Claypool publishers
   The tutorial presents the opinions of the
   presenters only, and does not
   necessarily reflect the views of IBM or
   the Technion
   Algorithms, techniques, features, etc.
   mentioned here might or might not be in
   use by IBM

Query Difficulty Estimation – Main Challenge
     Estimating the query difficulty is an attempt to quantify the quality of search results
     retrieved for a query from a given collection of documents, when no relevance
     feedback is given

   Even for systems that succeed
   very well on average, the quality
   of results returned for some of
   the queries is poor
   Understanding why some queries
   are inherently more difficult than
   others is essential for IR
   A good answer to this question
   will help search engines to
   reduce the variance in
   performance



Estimating the Query Difficulty – Some Benefits
   Feedback to users:
              The IR system can provide the users with an estimate of the expected quality of the
              results retrieved for their queries.
              Users can then rephrase “difficult” queries or resubmit them to
              alternative search resources.
   Feedback to the search engine:
              The IR system can invoke alternative retrieval strategies for different queries
              according to their estimated difficulty.
                   For example, intensive query analysis procedures may be invoked selectively for difficult
                   queries only
   Feedback to the system administrator:
       For example, administrators can identify missing content queries
                   Then expand the collection of documents to better answer these queries.
   For IR applications:
              For example, a federated (distributed) search application
                   Merging the results of queries employed distributively over different datasets
                   Weighing the results returned from each dataset by the predicted difficulty

Contents
   Introduction - The Robustness Problem of (ad hoc) Information Retrieval
   Basic Concepts
   Query Performance Prediction Methods
       Pre-Retrieval Prediction Methods
       Post-Retrieval Prediction Methods
       Combining Predictors
       A Unified Framework for Post-Retrieval Query-Performance
       Prediction
       A General Model for Query Difficulty
       A Probabilistic Framework for Query-Performance Prediction
       Applications of Query Difficulty Estimation
       Summary
        Open Challenges
Introduction –
  The Robustness Problem of Information
  Retrieval





The Robustness problem of IR

   Most IR systems suffer from a radical variance in retrieval
   performance when responding to users’ queries
              Even for systems that succeed very well on average, the quality of results
              returned for some of the queries is poor
              This may lead to user dissatisfaction

   Variability in performance relates to various factors:
              The query itself (e.g., term ambiguity “Golf”)
              The vocabulary mismatch problem - the discrepancy between the query
              vocabulary and the document vocabulary
              Missing content queries - there is no relevant information in the corpus
              that can satisfy the information needs.


An example of a difficult query:


      “The Hubble Telescope achievements”
              Hubble Telescope Achievements:
               Great eye sets sights sky high
               Simple test would have found flaw in Hubble telescope
               Nation in brief
               Hubble space telescope placed aboard shuttle
               Cause of Hubble telescope defect reportedly found
               Flaw in Hubble telescope
               Flawed mirror hampers Hubble space telescope
               Touchy telescope torments controllers
               NASA scrubs launch of discovery
               Hubble builders got award fees, magazine says

     Retrieved results deal with issues related to the Hubble telescope project in
     general; but the gist of that query, achievements, is lost

The variance in performance across queries
and systems (TREC-7)




     Queries are sorted in decreasing order according to average precision attained
     among all TREC participants (green bars).
     The performance of two different systems per query is shown by the two curves.

The Reliable Information Access (RIA) workshop
   The first attempt to rigorously investigate the reasons for performance variability
   across queries and systems.
   Extensive failure analysis of the results of
           6 IR systems
           45 TREC topics
   Main reason for failures - the systems’ inability to identify all important aspects of
   the query
           The failure to emphasize one aspect of a query over another, or emphasize one aspect
           and neglect other aspects
                    “What disasters have occurred in tunnels used for transportation?”
                    Emphasizing only one of these terms will deteriorate performance because each
                    term on its own does not fully reflect the information need
   If systems could estimate what failure categories the query may belong to
           Systems could apply specific automated techniques that correspond to the failure
           mode in order to improve performance

Instability in Retrieval - The TREC’s Robust Tracks
      The diversity in performance among topics and systems led to the
      TREC Robust tracks (2003 – 2005)
                Encouraging systems to decrease variance in query performance by
                focusing on poorly performing topics
      Systems were challenged with 50 old TREC topics found to be
      “difficult” for most systems over the years
                A topic is considered difficult when the median of the average precision
                scores of all participants for that topic is below a given threshold
      A new measure, GMAP, uses the geometric mean instead of the
      arithmetic mean when averaging precision values over topics
                Emphasizes the lowest performing topics, and is thus a useful measure
                that can attest to the robustness of a system’s performance


The Robust Tracks – Decreasing Variability across Topics
   Several approaches to improving the poor effectiveness for some
   topics were tested
           Selective query processing strategy based on performance prediction
           Post-retrieval reordering
           Selective weighting functions
                 Selective query expansion
   None of these approaches was able to show consistent
   improvement over traditional non-selective approaches

   Apparently, expanding the query by appropriate terms extracted
   from an external collection (the Web) improves the effectiveness
   for many queries, including poorly performing queries

The Robust Tracks – Query Performance Prediction
      As a second challenge: systems were asked to predict their performance for
      each of the test topics
      Then the TREC topics were ranked
           First, by their predicted performance value
           Second, by their actual performance value
      Evaluation was done by measuring the similarity between the predicted
      performance-based ranking and the actual performance-based ranking

      Most systems failed to exhibit reasonable prediction capability
          14 runs had a negative correlation between the predicted and actual topic
          rankings, demonstrating that measuring performance prediction is
          intrinsically difficult


     On the positive side, the difficulty in developing reliable prediction
     methods raised the awareness of the IR community to this challenge

How difficult is the performance prediction task for human experts?
   TREC-6 experiment: Estimating whether human experts can predict the query
   difficulty
   A group of experts was asked to classify a set of TREC topics into three degrees of
   difficulty based on the query expression only
          (easy, medium, hard)
   The manual judgments were compared to the median of the average precision
   scores, as determined after evaluating the performance of all participating
   systems
   Results
           The Pearson correlation between the expert judgments and the “true” values was
           very low (0.26).
           The agreement between experts, as measured by the correlation between their
           judgments, was very low too (0.39)


     The low correlation illustrates how difficult this task is and how little is known
     about what makes a query difficult
How difficult is the performance prediction task for human experts? (contd.)
Hauff et al. (CIKM 2010) found
    a low level of agreement between humans with regard to which queries are
    more difficult than others (median kappa = 0.36)
            there was high variance in the ability of humans to estimate query difficulty although
            they shared “similar” backgrounds
    a low correlation between true performance and humans’ estimates of query-
    difficulty (performed in a pre-retrieval fashion)
            median Kendall’s tau was 0.31, considerably lower than that attained by the best
            performing pre-retrieval predictors
            however, overall the humans did manage to differentiate between “good” and “bad”
            queries
    a low correlation between humans’ predictions and those of query-performance
    predictors with some exceptions

These findings further demonstrate how difficult the query-performance
   prediction task is and how little is known about what makes a query difficult
Are queries found to be difficult in one collection
still considered difficult in another collection?
   Robust track 2005: Difficult topics in the ROBUST collection were tested against another
   collection (AQUAINT)
   The median average precision over the ROBUST collection is 0.126
         Compared to 0.185 for the same topics over the AQUAINT collection
   Apparently, the AQUAINT collection is “easier” than the ROBUST collection
   probably due to
           Collection size
           Many more relevant documents per topic in AQUAINT
           Document features such as structure and coherence
   However, the relative difficulty of the topics is preserved over the two datasets
       The Pearson correlation between topics’ performance in both datasets is 0.463
              This illustrates some dependency between the topics’ median scores on both
              collections

      Even when topics are somewhat easier in one collection than another,
      the relative difficulty among topics is preserved, at least to some extent
Basic concepts




The retrieval task
       Given:
                 A document set D (the corpus)
                 A query q
       Retrieve Dq (the result list), a ranked list of documents from D,
       which are most likely to be relevant to q
       Some widely used retrieval methods:
                 Vector space tf-idf based ranking, which estimates relevance by the
                 similarity between the query and a document in the vector space
                 The probabilistic OKAPI-BM25 method, which estimates the probability
                 that the document is relevant to the query
                 Language-model-based approaches, which estimate the probability that
                 the query was generated by a language model induced from the
                 document
                  And more: divergence from randomness (DFR) approaches, inference
                  networks, Markov random fields (MRF), …
Text REtrieval Conference (TREC)
      A series of workshops for large-scale evaluation of (mostly) text
      retrieval technology:
                Realistic test collections
                 Uniform, appropriate scoring procedures
                 Started in 1992

       A TREC task usually comprises:
                 A document collection (corpus)
                 A list of topics (information needs)
                 A list of relevant documents for each topic (QRELs)

       Example topic:
                 Title: African Civilian Deaths
                 Description: How many civilian non-combatants have been killed
                 in the various civil wars in Africa?
                 Narrative: A relevant document will contain specific casualty
                 information for a given area, country, or region. It will cite numbers
                 of civilian deaths caused directly or indirectly by armed conflict.


Precision measures
      Precision at k (P@k): The fraction of relevant documents
      among the top-k results.
      Average precision: AP is the average of precision values
      computed at the ranks of each of the relevant documents
      in the ranked list:
                         AP(q) = (1/|Rq|) · Σ_{r∈Rq} P@rank(r)

                         Rq is the set of documents in the corpus that are relevant to q

      Average Precision is usually computed using a ranking which is truncated at
      some position (typically 1000 in TREC).
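
      A minimal sketch of these two measures (not TREC's official trec_eval tool); the
      document ids and relevance judgments below are invented for illustration.

          def precision_at_k(ranked_docs, relevant, k):
              """Fraction of relevant documents among the top-k results."""
              return sum(1 for d in ranked_docs[:k] if d in relevant) / k

          def average_precision(ranked_docs, relevant):
              """Average of P@rank(r) over the ranks of the retrieved relevant documents,
              normalized by the total number of relevant documents |Rq|."""
              if not relevant:
                  return 0.0
              hits, precision_sum = 0, 0.0
              for rank, doc in enumerate(ranked_docs, start=1):
                  if doc in relevant:
                      hits += 1
                      precision_sum += hits / rank   # P@rank(r)
              return precision_sum / len(relevant)

          # Toy example (hypothetical ids): three relevant documents, truncated ranking
          ranking = ["d3", "d7", "d1", "d9", "d4"]
          relevant = {"d3", "d1", "d8"}
          print(precision_at_k(ranking, relevant, 3))   # 2/3
          print(average_precision(ranking, relevant))   # (1/1 + 2/3) / 3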

Prediction quality measures
      Given:
          Query q
          Result list Dq that contains n documents

      Goal: Estimate the retrieval effectiveness of Dq, in terms of
      satisfying Iq, the information need behind q
           specifically, the prediction task is to predict AP(q) when no
           relevance information (Rq) is given.
           In practice: Estimate, for example, the expected average
           precision for q.
      The quality of a performance predictor can be measured by the
      correlation between the predicted average precision values and the
      corresponding actual AP values
Measures of correlation

      Linear correlation (Pearson)
                considers the true AP values as well as the predicted AP values of queries


      Non-linear correlation (Kendall’s tau, Spearman’s rho)
                considers only the ranking of queries by their true AP values and by their
                predicted AP values
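
      For instance, given predicted and true AP values for a set of queries, all three
      correlation measures can be computed with scipy; the per-query numbers below are
      hypothetical.

          from scipy.stats import pearsonr, kendalltau, spearmanr

          # Hypothetical per-query values: true AP and a predictor's output
          true_ap   = [0.42, 0.10, 0.55, 0.03, 0.27, 0.61]
          predicted = [0.35, 0.20, 0.50, 0.05, 0.15, 0.70]

          # Pearson: linear correlation between the actual values
          print("Pearson ", pearsonr(true_ap, predicted)[0])
          # Kendall's tau / Spearman's rho: only the ordering of the queries matters
          print("Kendall ", kendalltau(true_ap, predicted)[0])
          print("Spearman", spearmanr(true_ap, predicted)[0])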




Evaluating a prediction system


   (Scatter plot: predicted AP (y-axis) vs. actual AP (x-axis) for queries Q1-Q5)

      Different correlation metrics will measure different things
      If there is no a priori reason to use Pearson, prefer Spearman or Kendall's tau
      In practice, the difference is not large, because there are enough queries and
      because of the distribution of AP


Evaluating a prediction system (contd.)
      It is important to note that it was found that state-of-the-art
      query-performance predictors might not be correlated (at all)
      with measures of users’ performance (e.g., the time it takes to
      reach the first relevant document)
                 see Turpin and Hersh ADC ’04 and Zhao and Scholer ADC ‘07
       However,
                 This finding might be attributed, as suggested by Turpin and Hersh in ADC
                 ’04, to the fact that standard evaluation measures (e.g., average
                 precision) and users’ performance are not always strongly correlated
                      Hersh et al. ’00, Turpin and Hersh ’01, Turpin and Scholer ’06, Smucker and
                      Parkash Jethani ‘10




Query Performance Prediction
                   Methods








                                                  Performance Prediction Methods

                          Pre-Retrieval Methods
                              Linguistics: morphologic, syntactic, semantic
                              Statistics: specificity, similarity, coherency, relatedness

                          Post-Retrieval Methods
                              Clarity
                              Robustness: query perturbation, document perturbation, retrieval perturbation
                              Score Analysis: top score, avg. score, variance of scores, cluster hypothesis

Pre-Retrieval Prediction Methods
      Pre-retrieval prediction approaches estimate the quality of the
      search results before the search takes place.
                Provide effective instantiation of query performance prediction for search
                applications that must efficiently respond to search requests
                Only the query terms, associated with some pre-defined statistics
                gathered at indexing time, can be used for prediction


      Pre-retrieval methods can be split into linguistic and statistical
      methods.
          Linguistic methods apply natural language processing (NLP)
          techniques and use external linguistic resources to identify
          ambiguity and polysemy in the query.
          Statistical methods analyze the distribution of the query
          terms within the collection
Linguistic Approaches (Mothe & Tanguy, SIGIR QD workshop 2005; Hauff 2010)
   Most linguistic features do not correlate well with the system performance.
           Features include, for example, morphological (avg. # of morphemes per query term),
           syntactic link span (which relates to the average distance between query words in the
           parse tree), semantic (polysemy- the avg. # of synsets per word in the WordNet
           dictionary)
   Only the syntactic links span and the polysemy value were shown to have
   some (low) correlation
   This is quite surprising as intuitively poor performance can be expected for
   ambiguous queries
           Apparently, term ambiguity should be measured using corpus-based approaches,
           since a term that might be ambiguous with respect to the general vocabulary, may
           have only a single interpretation in the corpus




Pre-retrieval Statistical methods
   Analyze the distribution of the query term frequencies within the
   collection.
    Two widely used term statistics:
           Inverse document frequency                                      idf(t) = log(N/Nt)
           Inverse collection term frequency                               ictf(t) = log(|D|/tf(t,D))
   Specificity based predictors
           Measure the query terms' distribution over the collection
           Specificity-based predictors:
                    avgIDF, avgICTF - Queries composed of infrequent terms are easier to satisfy
                    maxIDF, maxICTF – Similarly
                    varIDF, varICTF - Low variance reflects the lack of dominant terms in the query
           A query composed of non-specific terms is deemed to be more difficult
                    “Who and Whom”
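
   A minimal sketch of the specificity-based predictors, assuming the document frequencies,
   collection term frequencies, and corpus sizes were gathered at indexing time; the
   statistics below are invented.

       import math
       from statistics import mean, pvariance

       def specificity_features(query_terms, df, ctf, num_docs, num_tokens):
           """Compute idf/ictf per query term and the avg/max/var aggregates
           used as pre-retrieval predictors (avgIDF, maxIDF, varIDF, ...)."""
           idf  = [math.log(num_docs / df[t]) for t in query_terms]
           ictf = [math.log(num_tokens / ctf[t]) for t in query_terms]
           return {
               "avgIDF": mean(idf),   "maxIDF": max(idf),   "varIDF": pvariance(idf),
               "avgICTF": mean(ictf), "maxICTF": max(ictf), "varICTF": pvariance(ictf),
           }

       # Toy corpus statistics (hypothetical numbers)
       df  = {"hubble": 120, "telescope": 300, "achievements": 2500}
       ctf = {"hubble": 400, "telescope": 900, "achievements": 6000}
       print(specificity_features(["hubble", "telescope", "achievements"],
                                  df, ctf, num_docs=100000, num_tokens=5000000))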



Specificity-based predictors
      Query Scope (He & Ounis 2004)
                The percentage of documents in the corpus that contain at least one
                query term; high value indicates many candidates for retrieval, and
                thereby a difficult query
                Query scope provides marginal prediction quality, and only for short
                queries; for long queries its quality drops significantly
                QS is not a ``pure'' pre-retrieval predictor as it requires finding the
                documents containing query terms (consider dynamic corpora)
      Simplified Clarity Score (He & Ounis 2004)
                The Kullback-Leibler (KL) divergence between the (simplified) query
                language model and the corpus language model
                           SCS(q) = Σ_{t∈q} p(t|q) · log( p(t|q) / p(t|D) )
      SCS is strongly related to the avgICTF predictor assuming each term appears only once
      in the query :     SCS(q) = log(1/|q|) + avgICTF(q)
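
      A sketch of the Simplified Clarity Score, under the common assumption that p(t|q)
      is estimated from the query text itself and p(t|D) from collection term frequencies;
      the numbers below are invented.

          import math
          from collections import Counter

          def simplified_clarity_score(query_terms, ctf, num_tokens):
              """SCS(q) = sum_t p(t|q) * log( p(t|q) / p(t|D) ), with p(t|q) estimated
              from the query itself and p(t|D) from collection term frequencies."""
              q_counts = Counter(query_terms)
              q_len = len(query_terms)
              scs = 0.0
              for t, c in q_counts.items():
                  p_t_q = c / q_len
                  p_t_d = ctf[t] / num_tokens
                  scs += p_t_q * math.log(p_t_q / p_t_d)
              return scs

          # Toy example (hypothetical collection frequencies)
          ctf = {"prime": 5000, "lending": 1200, "rate": 20000}
          print(simplified_clarity_score(["prime", "lending", "rate"], ctf,
                                         num_tokens=10_000_000))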
Similarity, coherency, and variance -based predictors

      Similarity: SCQ (Zhao et al. 2008): High similarity to the corpus
      indicates effective retrieval:   SCQ(t) = (1 + log(tf(t,D))) · idf(t)
                maxSCQ(q)/avgSCQ(q) are the maximum/average over the query terms
                contradicts(?) the specificity idea (e.g., SCS)
      Coherency: CS (He et al. 2008): The average inter-document
      similarity between documents containing the query terms,
      averaged over query terms
                This is a pre-retrieval concept reminiscent of the post-retrieval
                autocorrelation approach (Diaz 2007) that we will discuss later
                Demanding computation that requires the construction of a pointwise
                similarity matrix for all pairs of documents in the index
      Variance (Zhao et al. 2008): var(t) - Variance of term t’s
      weights (e.g., tf.idf) over the documents containing it
                maxVar and sumVar are the maximum/average over query terms
                hypothesis: low variance implies a difficult query due to low
                discriminative power
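
      A sketch of the variance-based predictors, assuming that for each query term we can
      look up the tf.idf weights it receives in the documents containing it; the
      posting-list weights below are invented.

          from statistics import pvariance

          def variance_predictors(query_terms, term_doc_weights):
              """var(t) is the variance of t's weights (e.g., tf.idf) over the documents
              containing t; maxVar / sumVar aggregate these values over the query terms."""
              variances = []
              for t in query_terms:
                  weights = term_doc_weights.get(t, [])
                  variances.append(pvariance(weights) if len(weights) > 1 else 0.0)
              return {"maxVar": max(variances), "sumVar": sum(variances)}

          # Toy posting-list weights (hypothetical tf.idf values per document)
          term_doc_weights = {
              "golf": [2.1, 0.4, 3.7, 0.9],        # spread-out weights -> discriminative
              "club": [1.0, 1.1, 0.9, 1.0, 1.05],  # flat weights -> low discriminative power
          }
          print(variance_predictors(["golf", "club"], term_doc_weights))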
Term Relatedness (Hauff 2010)
      Hypothesis: If the query terms co-occur frequently in the collection we expect good
      performance
      Pointwise mutual information (PMI) is a popular measure of co-occurrence
      statistics of two terms in the collection
            It requires efficient tools for gathering collocation statistics from the corpus, to
            allow dynamic usage at query run-time.
      avgPMI(q)/maxPMI(q) measure the average and the maximum PMI over all pairs of
      terms in the query



            PMI(t1, t2) = log( p(t1, t2 | D) / ( p(t1 | D) · p(t2 | D) ) )
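
      A sketch of avgPMI/maxPMI over document-level co-occurrence statistics; the counts
      below are invented, and the frozenset-keyed co-occurrence table is just one possible
      representation.

          import math
          from itertools import combinations

          def pmi(t1, t2, df, co_df, num_docs):
              """Document-level PMI: log( p(t1,t2|D) / (p(t1|D) * p(t2|D)) )."""
              p1 = df[t1] / num_docs
              p2 = df[t2] / num_docs
              p12 = co_df[frozenset((t1, t2))] / num_docs
              return math.log(p12 / (p1 * p2))

          def pmi_predictors(query_terms, df, co_df, num_docs):
              values = [pmi(t1, t2, df, co_df, num_docs)
                        for t1, t2 in combinations(query_terms, 2)]
              return {"avgPMI": sum(values) / len(values), "maxPMI": max(values)}

          # Toy statistics (hypothetical)
          df = {"hubble": 120, "telescope": 300, "achievements": 2500}
          co_df = {frozenset(("hubble", "telescope")): 100,
                   frozenset(("hubble", "achievements")): 5,
                   frozenset(("telescope", "achievements")): 12}
          print(pmi_predictors(["hubble", "telescope", "achievements"],
                               df, co_df, num_docs=100000))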

Evaluating pre-retrieval methods (Hauff 2010)




Insights
      maxVAR and maxSCQ dominate other predictors, and are most stable over the
      collections and topic sets
      However, their performance significantly drops for one of the query sets (301-350)
      Prediction is harder over the Web collections (WT10G and GOV2) than over the
      news collection (ROBUST), probably due to higher heterogeneity of the data




Post-retrieval Predictors




Post-Retrieval Predictors
       Analyze the search results in addition to the query
       Usually more complex, as the top results must be retrieved and analyzed
             Prediction quality depends on the retrieval process
                      As different results are expected for the same query when using
                      different retrieval methods
                 In contrast to pre-retrieval methods, the search results may depend on
                 query-independent factors
                      Such as document authority scores, search personalization etc.
       Post-retrieval methods can be categorized into three main paradigms:
          Clarity-based methods directly measure the coherence of the search results
          Robustness-based methods evaluate how robust the results are to perturbations in the
          query, the result list, and the retrieval method
          Score-distribution-based methods analyze the score distribution of the search results


Clarity (Cronen-Townsend et al. SIGIR 2002)
     Clarity measures the coherence (clarity) of the result-list with
     respect to the corpus
                Good results are expected to be focused on the query's topic
     Clarity considers the discrepancy between the likelihood of
     words most frequently used in retrieved documents and their
     likelihood in the whole corpus
                Good results - The language of the retrieved documents should be
                distinct from the general language of the whole corpus
                Bad results - The language of retrieved documents tends to be more
                similar to the general language
     Accordingly, clarity measures the KL divergence between a
     language model induced from the result list and that induced
     from the corpus

Clarity Computation
 Clarity(q) = KL( p(·|Dq) || p(·|D) )                                          KL divergence between Dq and D

 Clarity(q) = Σ_{t∈V(D)} p(t|Dq) · log( p(t|Dq) / p_MLE(t|D) )

 p_MLE(t|X) = tf(t,X) / |X|                                                    X’s unsmoothed LM (MLE)

 p(t|d) = λ · p_MLE(t|d) + (1−λ) · p(t|D)                                      d’s smoothed LM

 p(t|Dq) = Σ_{d∈Dq} p(t|d) · p(d|q)                                            Dq’s LM (RM1)

 p(d|q) = p(q|d) · p(d) / Σ_{d'∈Dq} p(q|d') · p(d')  ∝  p(q|d)                 d’s “relevance” to q

 p(q|d) = Π_{t∈q} p(t|d)                                                       query likelihood model
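
 A simplified sketch of the Clarity computation, assuming already-smoothed unigram
 language models for the top-retrieved documents, their query likelihoods, and a corpus
 language model are available; the vocabulary is restricted to the terms appearing in
 the result list for brevity, and all numbers below are invented.

     import math

     def clarity(doc_lms, query_likelihoods, corpus_lm):
         """Clarity(q) = KL( p(.|Dq) || p(.|D) ), where p(.|Dq) is a relevance-model-like
         mixture of document LMs weighted by their (normalized) query likelihood."""
         total = sum(query_likelihoods.values())
         p_d_given_q = {d: ql / total for d, ql in query_likelihoods.items()}

         # p(t|Dq) = sum_d p(t|d) * p(d|q)
         vocab = {t for lm in doc_lms.values() for t in lm}
         p_t_dq = {t: sum(doc_lms[d].get(t, 0.0) * p_d_given_q[d] for d in doc_lms)
                   for t in vocab}

         # KL divergence against the corpus language model
         return sum(p * math.log(p / corpus_lm[t])
                    for t, p in p_t_dq.items() if p > 0)

     # Toy input (hypothetical, already-smoothed probabilities)
     doc_lms = {"d1": {"hubble": 0.05, "telescope": 0.04, "nasa": 0.02},
                "d2": {"hubble": 0.03, "mirror": 0.04, "telescope": 0.03}}
     query_likelihoods = {"d1": 1e-6, "d2": 4e-7}
     corpus_lm = {"hubble": 1e-4, "telescope": 2e-4, "nasa": 5e-4, "mirror": 3e-4}
     print(clarity(doc_lms, query_likelihoods, corpus_lm))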


Example
      Consider the two queries variants for TREC topic 56 (from the
      TREC query track):
                Query A: Show me any predictions for changes in the prime lending rate
                and any changes made in the prime lending rates
                Query B: What adjustments should be made once federal action occurs?





The Clarity of 1) relevant results, 2) non-relevant results, 3) random documents

            (Figure: query clarity score per TREC topic, topics ~300-450, for three result
             sets per topic: relevant documents, non-relevant documents, and collection-wide
             random documents)

Novel interpretation of clarity - Hummel et al. 2012
                          (See the Clarity Revisited poster!)
                       Clarity(q) = KL( p(·|Dq) || p(·|D) )
                                  = CE( p(·|Dq) || p(·|D) )  −  H( p(·|Dq) )
                                        [cross entropy]          [entropy]

                       Clarity(q) = C-Distance(q) − L-Diversity(q)




Clarity variants
      Divergence from randomness approach (Amati et al.
      ’02)

      Emphasize query terms (Cronen Townsend et al. ’04,
      Hauff et al. ’08)

      Only consider terms that appear in a very low
      percentage of all documents in the corpus (Hauff et al.
      ’08)
                Beneficial for noisy Web settings


Robustness
      Robustness can be measured with respect to perturbations of the
                Query
                      The robustness of the result list to small modifications of the query (e.g.,
                      perturbation of term weights)
                Documents
                      Small random perturbations of the document representation are unlikely to
                      result in major changes to documents' retrieval scores
                            If scores of documents are spread over a wide range, then these
                            perturbations are unlikely to result in significant changes to the ranking
                Retrieval method
                      In general, different retrieval methods tend to retrieve different results for the
                      same query, when applied over the same document collection
                      A high overlap in results retrieved by different methods may be related to high
                      agreement on the (usually sparse) set of relevant results for the query.
                      A low overlap may indicate no agreement on the relevant results; hence, query
                      difficulty

Query Perturbations
      Overlap between the query and its sub-queries (Yom-Tov et al.
      SIGIR 2005)
               Observation: Some query terms have little or no influence on the
               retrieved documents, especially in difficult queries.
      The query feedback (QF) method (Zhou and Croft SIGIR 2007)
      models retrieval as a communication channel problem
               The input is the query, the channel is the search system, and the set of
               results is the noisy output of the channel
               A new query is generated from the list of results, using the terms with
               maximal contribution to the Clarity score, and then a second list of results
               is retrieved for that query
               The overlap between the two lists is used as a robustness score.

          (Flow: q -> search engine -> original result list; a new query q’ is generated
           from the results -> new result list; the overlap between the original and new
           lists is the predictor of AP(q))
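
          A sketch of the overlap step of QF, assuming a hypothetical search(query, k)
          function that returns a ranked list of document ids and that the new query has
          already been generated from the terms contributing most to the Clarity score.

              def qf_overlap(original_query, new_query, search, k=50):
                  """Query feedback robustness score: the overlap between the results of
                  the original query and of the query regenerated from its result list."""
                  original_results = set(search(original_query, k))
                  new_results = set(search(new_query, k))
                  return len(original_results & new_results) / k

              # Hypothetical usage (my_search_engine is an assumed retrieval function):
              # overlap = qf_overlap("hubble telescope achievements",
              #                      "hubble telescope mirror flaw nasa",
              #                      my_search_engine, k=50)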

Document Perturbation (Zhou & Croft CIKM 06)

(A conceptually similar approach, but which uses a different technique, was presented by
Vinay et al. in SIGIR 06)

   How stable is the ranking in the presence of uncertainty in the
   ranked documents?
           Compare a ranked list from the original collection to the corresponding
           ranked list from the corrupted collection using the same query and ranking
           function
                 (Flow: each document d in the corpus D is corrupted via its language model
                  into d’, yielding a corrupted corpus D’; the same query q and ranking
                  function produce a ranked list d1..dk from D and d’1..d’k from D’, and the
                  two lists are compared)
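
    A sketch of the document-perturbation idea under simplifying assumptions: documents
    are corrupted here by randomly dropping terms (rather than by perturbing their
    language models), rank_fn is a hypothetical ranking function, and the two rankings
    are compared with Kendall's tau.

        import random
        from scipy.stats import kendalltau

        def corrupt(doc_terms, drop_prob=0.2, seed=0):
            """Randomly delete terms to simulate noise in the document representation."""
            rng = random.Random(seed)
            return [t for t in doc_terms if rng.random() > drop_prob]

        def ranking_robustness(query, docs, rank_fn, k=50):
            """Compare the top-k ranking over the original corpus with the ranking obtained
            from the corrupted corpus (higher correlation -> presumably easier query)."""
            corrupted = {doc_id: corrupt(terms) for doc_id, terms in docs.items()}
            original_ranking = rank_fn(query, docs)[:k]
            corrupted_ranking = rank_fn(query, corrupted)[:k]
            # Kendall's tau over the relative order of documents retrieved in both lists
            ranks_in_corrupted = {d: r for r, d in enumerate(corrupted_ranking)}
            common = [d for d in original_ranking if d in ranks_in_corrupted]
            if len(common) < 2:
                return 0.0
            tau, _ = kendalltau(range(len(common)),
                                [ranks_in_corrupted[d] for d in common])
            return tau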

Cohesion of the Result List – Clustering Tendency (Vinay et al. SIGIR 2006)
      The cohesion of the result list can be measured by its clustering
      patterns
               Following the ``cluster hypothesis'' which implies that documents
               relevant to a given query are likely to be similar to one another
               A good retrieval returns a single, tight cluster, while poor retrieval returns
               a loosely related set of documents covering many topics
      The ``clustering tendency'' of the result set
               Corresponds to the Cox-Lewis statistic which measures the
               ``randomness'' level of the result list
                    Measured by the distance between a randomly selected document and its
                   nearest neighbor from the result list.
                   When the list contains “inherent” clusters, the distance between the random
                   document and its closest neighbor is likely to be much larger than the
                   distance between this neighbor and its own nearest neighbor in the list



Retrieval Method Perturbation (Aslam & Pavlu, ECIR 2007)
      Query difficulty is predicted by submitting the query to different
      retrieval methods and measuring the diversity of the retrieved
      ranked lists
               Each ranking is mapped to a distribution over the document collection
                The Jensen-Shannon divergence (JSD) is used to measure the diversity of these distributions
      Evaluation: Submissions of all participants to several TREC tracks
      were analyzed
               The agreement between submissions highly correlates with the query
               difficulty, as measured by the median performance (AP) of all participants
                The more submissions are analyzed, the better the prediction quality.
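
      A sketch of how the diversity of ranked lists produced by different retrieval methods
      might be quantified; the rank-based mapping from a ranking to a distribution over
      documents is an assumption here, not necessarily the exact mapping used by Aslam & Pavlu.

          import math
          from itertools import combinations

          def ranking_to_distribution(ranked_docs):
              """Map a ranked list to a probability distribution over documents,
              giving higher mass to higher-ranked documents (rank-based weighting)."""
              weights = {d: 1.0 / (rank + 1) for rank, d in enumerate(ranked_docs)}
              z = sum(weights.values())
              return {d: w / z for d, w in weights.items()}

          def jsd(p, q):
              """Jensen-Shannon divergence between two sparse distributions."""
              docs = set(p) | set(q)
              m = {d: 0.5 * (p.get(d, 0.0) + q.get(d, 0.0)) for d in docs}
              def kl(a, b):
                  return sum(a[d] * math.log(a[d] / b[d]) for d in a if a[d] > 0)
              return 0.5 * kl(p, m) + 0.5 * kl(q, m)

          def ranking_diversity(rankings):
              """Average pairwise JSD over the rankings produced by different systems."""
              dists = [ranking_to_distribution(r) for r in rankings]
              pairs = list(combinations(dists, 2))
              return sum(jsd(p, q) for p, q in pairs) / len(pairs)

          # Toy example: three systems' top-5 results (hypothetical ids)
          print(ranking_diversity([["d1", "d2", "d3", "d4", "d5"],
                                   ["d1", "d3", "d2", "d6", "d7"],
                                   ["d8", "d9", "d1", "d2", "d3"]]))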




Score Distribution Analysis
      Often, the retrieval scores reflect the similarity of documents to a
      query
                Hence, the distribution of retrieval scores can potentially help predict
                query performance.
      Naïve predictors:
                 The highest retrieval score or the mean of top scores (Tomlinson 2004)
                 The difference between query-independent scores and query-
                 dependent scores
                       Reflects the ``discriminative power'' of the query (Bernstein et al. 2005)




Spatial autocorrelation (Diaz SIGIR 2007)
      Query performance is correlated with the extent to which the result list
      “respects” the cluster hypothesis
                The extent to which similar documents receive similar retrieval scores
                In contrast, a difficult query might be detected when similar documents are
                scored differently.
      A document's ``regularized'' retrieval score is determined based on the
      weighted sum of the scores of its most similar documents



                  Score_reg(q, d) = Σ_{d'} Sim(d, d') · Score(q, d')

The linear correlation of the regularized scores with the original scores is used for
query-performance prediction.
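
 A sketch of the autocorrelation predictor, assuming an inter-document similarity matrix
 (e.g., cosine similarity of tf.idf vectors) for the top-retrieved documents is available;
 the scores and similarities below are invented.

     import numpy as np
     from scipy.stats import pearsonr

     def autocorrelation_predictor(scores, sim):
         """Regularize each document's score by the similarity-weighted sum of the scores
         of the other top-retrieved documents, then correlate the regularized scores with
         the original ones (higher correlation -> better predicted performance)."""
         scores = np.asarray(scores, dtype=float)
         sim = np.asarray(sim, dtype=float)
         np.fill_diagonal(sim, 0.0)    # do not let a document "vote" for itself
         regularized = sim @ scores    # Score_reg(q,d) = sum_d' Sim(d,d') * Score(q,d')
         return pearsonr(scores, regularized)[0]

     # Toy example (hypothetical retrieval scores and similarities of 4 top documents)
     scores = [3.2, 2.9, 2.1, 1.0]
     sim = [[1.0, 0.8, 0.3, 0.1],
            [0.8, 1.0, 0.4, 0.2],
            [0.3, 0.4, 1.0, 0.6],
            [0.1, 0.2, 0.6, 1.0]]
     print(autocorrelation_predictor(scores, sim))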


Weighted Information Gain – WIG (Zhou & Croft 2007)
      WIG measures the divergence between the mean retrieval score of top-
      ranked documents and that of the entire corpus.
                   The more similar these documents are to the query, with respect to the
                   corpus, the more effective the retrieval
                         The corpus represents a general non-relevant document

           WIG(q) = (1/k) Σ_{d∈Dq^k} Σ_{t∈q} λ_t · log( p(t|d) / p(t|D) )
                  = (1/√|q|) · ( avg_{d∈Dq^k} Score(d) − Score(D) )     [for simple keyword queries]

           Dq^k is the list of the k highest ranked documents
       λt reflects the weight of the term's type
                   When all query terms are simple keywords, this parameter collapses to
                   1/sqrt(|q|)
        WIG was originally proposed and employed in the MRF framework
                    However, for a bag-of-words query representation MRF reduces to the query-
                    likelihood model, and WIG is still effective in that setting (Shtok et al. ’07, Zhou ’07)
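
        A minimal sketch of WIG in its reduced form for simple keyword queries, assuming
        the log query-likelihood scores of the top-k documents and the corpus score are
        given; the numbers below are invented.

            import math

            def wig(top_k_scores, corpus_score, query_length):
                """WIG for bag-of-words queries: the (1/sqrt(|q|))-normalized difference
                between the mean score of the top-k documents and the corpus score."""
                mean_top = sum(top_k_scores) / len(top_k_scores)
                return (mean_top - corpus_score) / math.sqrt(query_length)

            # Toy example: log query-likelihood scores of the top-5 documents (hypothetical)
            print(wig([-6.1, -6.4, -6.8, -7.0, -7.3], corpus_score=-9.5, query_length=3))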
Normalized Query Commitment - NQC

(Shtok et al. ICTIR 2009 and TOIS 2012)
   NQC estimates the presumed amount of query drift in the list of top-retrieved documents
   Query expansion often uses a centroid of the list Dq as an expanded query model
       The centroid usually manifests query drift (Mitra et al. ‘98)
       The centroid could be viewed as a prototypical misleader as it exhibits (some)
       similarity to the query
               This similarity is dominated by non-query-related aspects that lead to query drift
   Shtok et al. showed that the mean retrieval score of documents in the result list
   corresponds, in several retrieval methods, to the retrieval score of some centroid-based
   representation of Dq
        Thus, the mean score represents the score of a prototypical misleader
   The standard deviation of scores which reflects their dispersion around the mean,
   represents the divergence of retrieval scores of documents in the list from that of a non-
   relevant document that exhibits high query similarity (the centroid).
                                NQC(q) = sqrt( (1/k) · Σ_{d∈Dq^k} ( Score(d) − μ )² )  /  Score(D)

                                (μ is the mean retrieval score of the documents in Dq^k)
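
    A minimal sketch of NQC over the retrieval scores of the top-k documents; taking the
    absolute value of the corpus score is an implementation choice made here because log
    query-likelihood scores are negative, and the numbers below are invented.

        import statistics

        def nqc(top_k_scores, corpus_score):
            """NQC(q): (population) standard deviation of the top-k retrieval scores,
            normalized by the corpus ("non-relevant document") retrieval score."""
            return statistics.pstdev(top_k_scores) / abs(corpus_score)

        # Toy example (hypothetical log query-likelihood scores)
        print(nqc([-6.1, -6.4, -6.8, -7.0, -7.3], corpus_score=-9.5))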
                   A geometric interpretation of NQC




Evaluating post-retrieval methods (Shtok et al. TOIS 2012)




                       Prediction quality (Pearson correlation and Kendall’s tau)



      Similarly to pre-retrieval methods, there is no clear winner
               QF and NQC exhibit comparable results
      NQC exhibits good performance over most collections but does
      not perform very well on GOV2.
      QF performs well over some of the collections but is inferior to
      other predictors on ROBUST.

 Additional predictors that analyze the retrieval score distribution
      Computing the standard deviation of retrieval scores at a query-
      dependent cutoff
               Perez-Iglesias and Araujo SPIRE 2010
               Cummins et al. SIGIR 2011

      Computing expected ranks for documents
               Vinay et al. CIKM 2008


      Inferring AP directly from the retrieval score distribution
               Cummins AIRS 2011




Combining Predictors




Combining post-retrieval predictors
      Some integration efforts of a few predictors based on linear
      regression:
                Yom-Tov et al (SIGIR’05) combined avgIDF with the Overlap predictor
                Zhou and Croft (SIGIR’07) integrated WIG and QF using a simple linear
                combination
                Diaz (SIGIR’07) incorporated the spatial autocorrelation predictor with
                Clarity and with the document perturbation based predictor
       In all these studies, the combined predictor performed much better
       than the individual predictors
                 This suggests that these predictors measure (at least partially)
                 complementary properties of the retrieved results
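
       As an illustration, predictor values can be combined by fitting a linear regression
       on a set of training queries with known AP; scikit-learn is used here only for
       convenience, and all numbers are invented.

           import numpy as np
           from sklearn.linear_model import LinearRegression

           # Hypothetical training data: per-query values of two predictors and the true AP
           X_train = np.array([[0.8, 2.1], [0.2, 0.9], [0.5, 1.5], [0.9, 2.4], [0.3, 1.0]])
           ap_train = np.array([0.41, 0.08, 0.22, 0.55, 0.12])

           model = LinearRegression().fit(X_train, ap_train)

           # Predict the performance of new queries from their predictor values
           X_new = np.array([[0.7, 1.9], [0.25, 1.1]])
           print(model.predict(X_new))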




Utility estimation framework (UEF) for query performance prediction (Shtok et al. SIGIR 2010)
       Suppose that we have a true model of relevance, R_Iq, for the
       information need I_q that is represented by the query q
       Then, the ranking π(Dq, R_Iq), induced over the given result list
       using R_Iq, is the most effective ranking for these documents

       Accordingly, the utility (with respect to the information need)
       provided by the given ranking, which can be thought of as
       reflecting query performance, can be defined as

              U(Dq | Iq) = Similarity( Dq, π(Dq, R_Iq) )

       In practice, we have no explicit knowledge of the underlying
       information need and of R_Iq
       Using statistical decision theory principles, we can approximate
       the utility by estimating R_Iq:

              U(Dq | Iq) ≈ ∫_{R̂q} Similarity( Dq, π(Dq, R̂q) ) · p(R̂q | Iq) dR̂q
Instantiating predictors

                U(Dq | Iq) ≈ ∫_R̂q Similarity(Dq, π(Dq, R̂q)) · p(R̂q | Iq) dR̂q

      Relevance-model estimates (R̂q)
                relevance language models constructed from documents sampled from the highest ranks of some initial ranking
      Estimate the relevance model's presumed "representativeness" of the information need (p(R̂q | Iq))
                apply previously proposed predictors (Clarity, WIG, QF, NQC) to the sampled documents from which the relevance model is constructed
      Inter-list similarity measures
                Pearson, Kendall's-tau, Spearman
      A specific, highly effective, instantiated predictor (sketched below):
                Construct a single relevance model from the given result list, Dq
                Use a previously proposed predictor upon Dq to estimate relevance-model effectiveness
                Use Pearson's correlation between retrieval scores as the similarity measure
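A minimal sketch of that instantiation, assuming we already have (i) the original retrieval scores of the documents in Dq, (ii) the scores the same documents receive when re-ranked by a relevance model built from the list, and (iii) a base predictor value (e.g., NQC) serving as the relevance-model representativeness estimate; all numbers are hypothetical placeholders.

```python
# UEF-style prediction: similarity between the original ranking and the
# relevance-model-induced ranking, scaled by the presumed quality of that model.
import numpy as np
from scipy.stats import pearsonr

original_scores = np.array([-4.1, -4.6, -4.8, -5.2, -5.9])   # retrieval scores of Dq
rm_scores       = np.array([-3.9, -4.9, -4.7, -5.5, -5.8])   # scores under the relevance model
base_predictor  = 0.31                                        # e.g., NQC over the sampled docs

similarity = pearsonr(original_scores, rm_scores)[0]          # inter-ranking similarity
uef_score = base_predictor * similarity                       # UEF(NQC)-style prediction
print("UEF-style prediction: %.3f" % uef_score)
```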
The UEF framework - A flow diagram

      [Flow diagram: the query q is submitted to the search engine (SE), producing the result list Dq = d1, ..., dk; documents sampled from Dq are used to build a relevance model R̂q;S, which re-ranks the list into π(Dq, R̂q;S); a relevance estimator assesses R̂q;S, and the performance predictor combines this estimate with the ranking similarity between the original and re-ranked lists to output the predicted performance ÂP(q)]
Prediction quality of UEF (Pearson correlation with true AP)

      [Four bar charts comparing each base predictor with its UEF counterpart over TREC4, TREC5, WT10G, ROBUST, and GOV2; panel titles:]
            Clarity -> UEF(Clarity) (+31.4%)
            WIG -> UEF(WIG) (+27.8%)
            NQC -> UEF(NQC) (+17.7%)
            QF -> UEF(QF) (+24.7%)
Prediction quality for ClueWeb (Category A)*

                                      TREC 2009              TREC 2010
                                    LM    LM+SpamRm         LM    LM+SpamRm
SumVar (Zhao et al. '08)          .465      .526          .302      .312
SumIDF                            .463      .524          .294      .292
Clarity                           .017     -.178         -.385     -.111
UEF(Clarity)                      .124      .473          .303      .295
ImpClarity (Hauff et al. '08)     .348      .133          .072     -.008
UEF(ImpClarity)                   .221      .580          .340      .366
WIG                               .423      .542          .269      .349
UEF(WIG)                          .236      .651          .375      .414
NQC                               .083      .430          .269      .214
UEF(NQC)                          .154      .633          .342      .460
QF                                .494      .630          .368      .617
UEF(QF)                           .637      .708          .358      .649

  * Thanks to Fiana Raiber for producing the results
Are the various post-retrieval predictors that different from each other?

A unified framework for explaining post-retrieval predictors (Kurland et al. ICTIR 2011)
A unified post-retrieval prediction framework

    We want to predict the effectiveness of a ranking π_M(q; D) of the corpus that was induced by retrieval method M in response to query q

    Assume a true model of relevance R_Iq that can be used for retrieval ⇒ the resultant ranking, π_opt(q; D), is of optimal utility
            Utility(π_M(q; D); I_q) ≜ sim(π_M(q; D), π_opt(q; D))

    Use as reference comparisons pseudo-effective (PE) and pseudo-ineffective (PIE) rankings (cf. Rocchio '71):
            Ûtility(π_M(q; D); I_q) ≜ α(q)·sim(π_M(q; D), π_PE(q; D)) − β(q)·sim(π_M(q; D), π_PIE(q; D))

    [Diagram: the corpus ranking π_M(q;D) is compared against a pseudo-effective ranking π_PE(q;D) and a pseudo-ineffective ranking π_PIE(q;D)]
Instantiating predictors

  Focus on the result lists composed of the documents most highly ranked by each ranking
            Û(π_M(q; D); I_q) ≜ α(q)·sim(Dq, L_PE) − β(q)·sim(Dq, L_PIE)

  Deriving predictors (a sketch of this generic form follows below):
  1. "Guess" a PE result list (L_PE) and/or a PIE result list (L_PIE)
  2. Select weights α(q) and β(q)
  3. Select an inter-list (ranking) similarity measure
     - Pearson's r, Kendall's-τ, Spearman's-ρ
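A minimal sketch of this generic form, using Kendall's τ as the inter-list similarity; the "guessed" PE/PIE lists, the weights, and the document ids are all hypothetical.

```python
# Generic unified-framework predictor: similarity to a pseudo-effective list
# minus similarity to a pseudo-ineffective list (toy, hypothetical data).
from scipy.stats import kendalltau

result_list = ["d3", "d1", "d7", "d2", "d9"]          # documents most highly ranked by M
pe_list     = ["d3", "d7", "d1", "d9", "d2"]          # guessed pseudo-effective list
pie_list    = ["d9", "d2", "d1", "d7", "d3"]          # guessed pseudo-ineffective list
alpha, beta = 1.0, 1.0                                # per-query weights (constants here)

def rank_similarity(list_a, list_b):
    # Kendall's tau between the ranks the two lists assign to their shared documents.
    shared = [d for d in list_a if d in list_b]
    ranks_a = [list_a.index(d) for d in shared]
    ranks_b = [list_b.index(d) for d in shared]
    return kendalltau(ranks_a, ranks_b)[0]

prediction = (alpha * rank_similarity(result_list, pe_list)
              - beta * rank_similarity(result_list, pie_list))
print("predicted effectiveness: %.3f" % prediction)
```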
            Û(π_M(q; D); I_q) ≜ α(q)·sim(L_M, L_PE) − β(q)·sim(L_M, L_PIE)

      Basic idea: the PIE result list is composed of k copies of a pseudo-ineffective document

      Clarity (Cronen-Townsend '02, '04)
            Pseudo-ineffective document: the corpus
            Description: estimates the focus of the result list with respect to the corpus
            Sim. measure: -KL divergence between language models;  α(q) = 0, β(q) = 1
      WIG (Weighted Information Gain; Zhou & Croft '07)
            Pseudo-ineffective document: the corpus
            Description: measures the difference between the retrieval scores of documents in the result list and the score of the corpus
            Sim. measure: -L1 distance of retrieval scores;  α(q) = 0, β(q) = 1
      NQC (Normalized Query Commitment; Shtok et al. '09)
            Pseudo-ineffective document: the result-list centroid
            Description: measures the standard deviation of the retrieval scores of documents in the result list
            Sim. measure: -L2 distance of retrieval scores;  α(q) = 0, β(q) = 1
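A minimal sketch of how WIG and NQC read under this "corpus as pseudo-ineffective document" view, assuming query-likelihood retrieval scores for the top-k documents and for the corpus are already available; the normalizations below follow the commonly used formulations, and all numbers are placeholders.

```python
# WIG and NQC both compare result-list retrieval scores against the corpus score.
import numpy as np

top_k_scores = np.array([-4.1, -4.6, -4.8, -5.2, -5.9])  # scores of documents in Dq
corpus_score = -7.3                                        # score of the corpus "document"
query_length = 3

wig = np.mean(top_k_scores - corpus_score) / np.sqrt(query_length)  # mean gap from the corpus score
nqc = np.std(top_k_scores) / abs(corpus_score)                       # scaled score standard deviation
print("WIG=%.3f  NQC=%.3f" % (wig, nqc))
```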
            Û(π_M(q; D); I_q) ≜ α(q)·sim(L_M, L_PE) − β(q)·sim(L_M, L_PIE)

      QF (Query Feedback; Zhou & Croft '07)
            Pseudo-effective result list: a list retrieved from the corpus using a relevance model
            Description: measures the "amount of noise" (non-query-related aspects) in the result list
            Sim. measure: overlap at top ranks;  α(q) = 1, β(q) = 0
      UEF (Utility Estimation Framework; Shtok et al. '10)
            Pseudo-effective result list: the given result list re-ranked using a relevance model
            Description: estimates the potential utility of the result list using relevance models
            Sim. measure: rank/score correlation;  α(q) = presumed representativeness of the relevance model, β(q) = 0
      Autocorrelation (Diaz '07)
            Pseudo-effective result list: (1) score regularization; (2) fusion
            Description: (1) the degree to which retrieval scores "respect the cluster hypothesis"; (2) similarity with a fusion-based result list
            Sim. measure: Pearson correlation between scores;  α(q) = 1, β(q) = 0
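A minimal sketch of the QF idea from the table above: retrieve a second list with a relevance (feedback) model and measure its overlap with the original list at the top ranks, where larger overlap suggests less query drift; the lists are hypothetical.

```python
# QF-style prediction: overlap at the top ranks between the original result list
# and a list retrieved using a relevance model built from it (toy data).
original_list = ["d3", "d1", "d7", "d2", "d9", "d4"]
feedback_list = ["d3", "d7", "d5", "d1", "d8", "d6"]   # retrieved with the relevance model

def overlap_at_k(list_a, list_b, k=5):
    # Fraction of the top-k documents shared by the two lists.
    return len(set(list_a[:k]) & set(list_b[:k])) / float(k)

print("QF-style prediction (overlap@5): %.2f" % overlap_at_k(original_list, feedback_list))
```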
A General Model for Query Difficulty
A model of query difficulty (Carmel et al., SIGIR 2006)

      A user with a given information need (a topic):
            Submits a query to a search engine
            Judges the search results according to their relevance to this information need
      Thus, the query/ies and the Qrels are two sides of the same information need
      Qrels also depend on the existing collection (C)

      [Diagram: a topic gives rise to queries (Q) and relevant documents (R), connected through the search engine over the collection]

                        Define: Topic = (Q, R | C)
A Theoretical Model of Topic Difficulty

                                  Main Hypothesis
      Topic difficulty is induced from the distances between the model parts (the queries, the relevant documents, and the collection)
Model validation: The Pearson correlation between Average Precision and the model distances (see the paper for estimates of the various distances)

      Based on the .gov2 collection (25M docs) and 100 topics of the Terabyte tracks 04/05

      [Diagram: the individual distance estimates correlate with AP at -0.06, 0.15, 0.17, and 0.32; combining the distances yields a correlation of 0.45]
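A minimal sketch of the combination step, assuming the per-topic distance estimates and true AP values are already computed; the feature layout (one column per distance estimate) and all numbers are placeholders, and linear regression simply stands in for whatever combination method is used.

```python
# Combine several distance estimates into a single difficulty score via regression
# learned against true per-topic average precision (toy, hypothetical data).
import numpy as np
from sklearn.linear_model import LinearRegression

distances = np.array([[0.21, 0.35, 0.10],    # one row per topic, one column per distance estimate
                      [0.40, 0.20, 0.25],
                      [0.15, 0.50, 0.05],
                      [0.33, 0.28, 0.18]])
ap = np.array([0.30, 0.45, 0.12, 0.38])      # true average precision per topic

model = LinearRegression().fit(distances, ap)                     # learn how to weigh the distances
new_topic = np.array([[0.25, 0.30, 0.15]])                        # distances of an unseen topic
print("predicted AP for a new topic: %.3f" % model.predict(new_topic)[0])
```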
A probabilistic framework for QPP (Kurland et al. CIKM 2012, to appear)
Post-retrieval prediction (Kurland et al. 2012)

      [Equation omitted: the probabilistic expression for post-retrieval prediction; the slide's annotations highlight query-independent result-list properties (e.g., cohesion/dispersion) as one of its components and note that WIG and Clarity are derived from this expression]
Using reference lists (Kurland et al. 2012)

      [Equation omitted: prediction using reference lists; one of its components is the presumed quality of the reference list (a prediction task in its own right)]
Applications of Query Difficulty Estimation




IBM Haifa Research Lab           © 2010 IBM Corporation
Main applications for QDE
      Feedback to the user and the system

      Federation and metasearch

      Content enhancement using missing content analysis

      Selective query expansion/ query selection

      Others




Feedback to the user and the system
      Direct feedback to the user
               “We were unable to find relevant
               documents to your query”


      Estimating the value of terms from
      query refinement
               Suggest which terms to add, and
               estimate their value


      Personalization
               Which queries would benefit from
               personalization? (Teevan et al. ‘08)




Federation and metasearch: Problem definition

      Given several databases that might contain information relevant to a given question, how do we construct a good unified list of answers from all these datasets?
      Similarly, given a set of search engines employed for the same query, how do we merge their results in an optimal manner?

      [Diagram: each of several search engines queries its own collection; the per-engine results are federated into a single merged ranking]
Prediction-based Federation & Metasearch

      Weight each result set by its predicted precision (Yom-Tov et al., 2005; Sheldon et al., 2011)
      or
      select the search engine with the best predicted performance (White et al., 2008; Berger & Savoy, 2007)

      [Diagram: as before, the per-engine results are federated into a merged ranking, now guided by the per-query predictions]
Metasearch and Federation using Query Prediction

      Train a predictor for each engine/collection pair

      For a given query: predict its difficulty for each engine/collection pair

      Weight the results retrieved from each engine/collection pair accordingly, and generate the federated list (a sketch follows below)
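A minimal sketch of prediction-based federation, assuming per-engine result lists with retrieval scores and a per-query, per-engine performance prediction are already available; document ids, scores, and prediction values are hypothetical.

```python
# Weight each engine's result scores by the performance predicted for the query
# on that engine/collection pair, then merge into a single federated ranking.
from collections import defaultdict

engine_results = {
    "engine_A": [("d1", 0.92), ("d2", 0.87), ("d3", 0.81)],   # (doc, retrieval score)
    "engine_B": [("d4", 0.84), ("d1", 0.71), ("d5", 0.69)],
}
predicted_quality = {"engine_A": 0.62, "engine_B": 0.21}       # per-query, per-engine prediction

merged = defaultdict(float)
for engine, results in engine_results.items():
    for doc, score in results:
        # In practice, scores would first be normalized to a common scale per engine.
        merged[doc] += predicted_quality[engine] * score

federated_list = sorted(merged, key=merged.get, reverse=True)
print(federated_list)
```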
Metasearch experiment

      The LA Times collection was indexed using four search engines.
      A predictor was trained for each search engine.
      For each query, the result list from each search engine was weighted using the prediction for that specific query.
      The final ranking is a ranking of the union of the result lists, weighted by the predictions.

                                                 P@10    %no
      Single search      SE 1                    0.139   47.8
      engine             SE 2                    0.153   43.4
                         SE 3                    0.094   55.2
                         SE 4                    0.171   37.4
      Metasearch         Round-robin             0.164   45.0
                         MetaCrawler             0.163   34.9
                         Prediction-based        0.183   31.7
Federation for the TREC 2004 Terabyte track

      GOV2 collection: 426GB, 25 million documents
      50 topics

      For federation, the collection was divided into 10 partitions of roughly equal size

                                                   P@10     MAP
                  One collection                   0.522    0.292
                  Prediction-based federation      0.550    0.264
                  Score-based federation           0.498    0.257

                                                           * 10-fold cross-validation
Content enhancement using missing content analysis

      [Flow diagram: users' information needs are submitted to the search engine over the local repository; a missing-content estimation component identifies needs that are not well covered; a data-gathering component obtains external content for them, which is added to the repository subject to quality estimation]
Story line
      A helpdesk system where system administrators try to find
      relevant solutions in a database
      The system administrators’ queries are logged and analyzed to
      find topics of interest that are not covered in the current
      repository
      Additional content for topics which are lacking is obtained from
      the internet, and is added to the local repository




Adding content where it is missing improves precision

      Cluster user queries

      Add content according to:
           Lacking content
           Size
           Random

      Adding content where it is missing brings the best improvement in precision
Selective query expansion

      Automatic query expansion: improving average quality by adding search terms
      Pseudo-relevance feedback (PRF): considers the (few) top-ranked documents to be relevant, and uses them to expand the query.
      PRF works well on average, but can significantly degrade retrieval performance for some queries.
      PRF fails because of query drift: non-relevant documents infiltrate the top results, or relevant results contain aspects irrelevant to the query's original intention.
      Rocchio's method: the expanded query is obtained by combining the original query and the centroid of the top results, adding new terms to the query and reweighting the original terms (a sketch follows below).

                  Q_m = α·Q_Original + β·Σ_{Dj ∈ D_Relevant} Dj − γ·Σ_{Dk ∈ D_Irrelevant} Dk
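A minimal sketch of Rocchio-style expansion over bag-of-words vectors; in the pseudo-relevance setting the "relevant" set is simply the top of the result list, and a true irrelevant set is usually unavailable. The terms, weights, and toy documents below are purely illustrative.

```python
# Rocchio-style combination of the original query with weighted document centroids.
from collections import Counter

alpha, beta, gamma = 1.0, 0.75, 0.15
original_query = Counter({"jaguar": 1.0, "speed": 1.0})
top_docs = [Counter({"jaguar": 3, "car": 2, "mph": 1}),        # pseudo-relevant documents
            Counter({"jaguar": 2, "cat": 1, "speed": 2})]
irrelevant_docs = [Counter({"jaguar": 1, "os": 3, "apple": 2})]

def weighted_centroid(weight, vectors):
    # Weighted centroid of a set of term-frequency vectors.
    centroid = Counter()
    for vec in vectors:
        for term, tf in vec.items():
            centroid[term] += weight * tf / len(vectors)
    return centroid

expanded = Counter()
expanded.update(weighted_centroid(alpha, [original_query]))    # keep (reweighted) original terms
expanded.update(weighted_centroid(beta, top_docs))             # add terms from the top results
expanded.subtract(weighted_centroid(gamma, irrelevant_docs))   # penalize terms from irrelevant docs
print(expanded.most_common(6))
```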


Some work on selective query expansion
      Expand only “easy” queries
                Predict which are the easy queries and only expand them (Amati et al.
                2004, Yom-Tov et al. 2005)
      Estimate query drift (Cronen-Townsend et al., 2006)
                Compare the expanded list with the unexpanded list to estimate if too
                much noise was added by using expansion
      Selecting a specific query expansion form from a list of candidates
      (Winaver et al. 2007)
      Integrating various query-expansion forms by weighting them using
      query-performance predictors (Soskin et al., 2009)
      Estimate how much expansion (Lv and Zhai, 2009)
                Predict the best α in the Rocchio formula using features of the query, the
                documents, and the relationship between them
      But, see Azzopardi and Hauff 2009
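A minimal sketch of the selective-expansion decision itself: expand only when a performance predictor says the expanded query is likely to do at least as well as the original. The predictor and query strings are hypothetical placeholders for whatever predictor and expansion method are actually used.

```python
# Choose between the original and the expanded query by predicted performance.
def choose_query(original_query, expanded_query, predict_performance):
    """Return the query variant with the higher predicted performance."""
    if predict_performance(expanded_query) >= predict_performance(original_query):
        return expanded_query
    return original_query

# Toy predictor that simply favors shorter queries (purely illustrative):
toy_predictor = lambda q: 1.0 / len(q.split())
print(choose_query("jaguar speed", "jaguar speed car mph fast", toy_predictor))
```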
Other uses of query difficulty estimation

      Collaborative filtering (Bellogin & Castells, 2009): measure the lack of ambiguity in a user's preferences
            An "easy" user is one whose preferences are clear-cut, and who thus contributes much to a neighbor

      Term selection (Kumaran & Carvalho, 2009): identify irrelevant query terms
            Queries with fewer irrelevant terms tend to have better results
      Reducing long queries (Cummins et al., 2011)
Summary & Conclusions




IBM Haifa Research Lab            © 2010 IBM Corporation
Summary

      In this tutorial, we surveyed the current state-of-the-art research on query difficulty estimation for IR
      We discussed the reasons that
            Cause search engines to fail for some of the queries
            Bring about high variability in performance among queries as well as among systems
      We summarized several approaches for query performance prediction
            Pre-retrieval methods
            Post-retrieval methods
            Combining predictors
      We reviewed evaluation metrics for prediction quality and the results of various evaluation studies conducted over several TREC benchmarks
            These results show that existing state-of-the-art predictors are able to identify difficult queries, demonstrating reasonable prediction quality
            However, prediction quality is still moderate and should be substantially improved in order for prediction to be widely used in IR tasks
Summary – Main Results

      Current linguistic-based predictors do not exhibit meaningful correlation with query performance
            This is quite surprising, as intuitively poor performance can be expected for ambiguous queries
      In contrast, statistical pre-retrieval predictors such as SumSCQ and maxVAR have relatively significant predictive ability
            These pre-retrieval predictors, and a few others, exhibit performance comparable to post-retrieval methods such as Clarity, WIG, NQC, and QF over large-scale Web collections
            This is counter-intuitive, as post-retrieval methods are exposed to much more information than pre-retrieval methods
      However, current state-of-the-art predictors still suffer from low robustness in prediction quality
            This robustness problem, as well as the moderate prediction quality of existing predictors, are two of the greatest challenges in query difficulty prediction that should be further explored in the future
Summary (cont.)

      We tested whether combining several predictors may improve prediction quality
            Especially when the different predictors are independent and measure different aspects of the query and the search results
      An example of a combination method is linear regression
            The regression task is to learn how to optimally combine the predicted performance values in order to best fit them to the actual performance values
            Results were moderate, probably due to the sparseness of the training data, which over-represents the lower end of the performance values
      We discussed three frameworks for query-performance prediction
            Utility Estimation Framework (UEF); Shtok et al. 2010
                  State-of-the-art prediction quality
            A unified framework for post-retrieval prediction that sets common grounds for various previously-proposed predictors; Kurland et al. 2011
            A fundamental framework for estimating query difficulty; Carmel et al. 2006
Summary (cont.)

      We discussed a few applications that utilize query difficulty estimators
            Handling each query individually based on its estimated difficulty
                  Find the best terms for query refinement by measuring the expected gain in performance for each candidate term
                  Expand the query or not based on the predicted performance of the expanded query
                  Personalize the query selectively, only in cases where personalization is expected to bring value
            Collection enhancement guided by the identification of missing-content queries
            Fusion of search results from several sources based on their predicted quality
What's Next?

   Predicting the performance for other query types
           Navigational queries
           XML queries (XQuery, XPath)
           Domain-specific queries (e.g., healthcare)
   Considering other factors that may affect query difficulty
           Who is the person behind the query? In what context?
                    Geo-spatial features
                    Temporal aspects
                    Personal parameters
   Query difficulty in other search paradigms
           Multifaceted search
           Exploratory search
Concluding Remarks

      Research on query difficulty estimation began only about ten years ago, with the pioneering work on the Clarity predictor (2002)
      Since then this subfield has found its place at the center of IR research
            These studies have revealed alternative prediction approaches, new evaluation methodologies, and novel applications
      In this tutorial we covered
            Existing performance prediction methods
            Some evaluation studies
            Potential applications
            Some anticipated future directions for the field
      While the progress so far is substantial, performance prediction is still challenging and far from being solved
            Much more accurate predictors are required in order for prediction to be widely adopted in IR tasks
      We hope that this tutorial will contribute to increasing interest in query difficulty estimation
Thank You!




IBM Haifa Research Lab                © 2010 IBM Corporation

CNv6 Instructor Chapter 6 Quality of Service
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

SIGIR 2012 tutorial: Query Performance Prediction for IR

  • 5. IBM Labs in Haifa Estimating the Query Difficulty – Some Benefits Feedback to users: The IR system can provide the users with an estimate of the expected quality of the results retrieved for their queries. Users can then rephrase “difficult” queries or, resubmit a “difficult” query to alternative search resources. Feedback to the search engine: The IR system can invoke alternative retrieval strategies for different queries according to their estimated difficulty. For example, intensive query analysis procedures may be invoked selectively for difficult queries only Feedback to the system administrator: For example, administrators can identify missing content queries Then expand the collection of documents to better answer these queries. For IR applications: For example, a federated (distributed) search application Merging the results of queries employed distributively over different datasets Weighing the results returned from each dataset by the predicted difficulty 5 512/07/2012 5 SIGIR 2012 Tutorial: Query Performance Prediction
  • 6. IBM Labs in Haifa – Contents: Introduction - The Robustness Problem of (ad hoc) Information Retrieval; Basic Concepts; Query Performance Prediction Methods (Pre-Retrieval Prediction Methods, Post-Retrieval Prediction Methods, Combining Predictors); A Unified Framework for Post-Retrieval Query-Performance Prediction; A General Model for Query Difficulty; A Probabilistic Framework for Query-Performance Prediction; Applications of Query Difficulty Estimation; Summary; Open Challenges
  • 7. Introduction – The Robustness Problem of Information Retrieval IBM Haifa Research Lab © 2010 IBM Corporation
  • 8. IBM Labs in Haifa The Robustness problem of IR Most IR systems suffer from a radical variance in retrieval performance when responding to users’ queries Even for systems that succeed very well on average, the quality of results returned for some of the queries is poor This may lead to user dissatisfaction Variability in performance relates to various factors: The query itself (e.g., term ambiguity “Golf”) The vocabulary mismatch problem - the discrepancy between the query vocabulary and the document vocabulary Missing content queries - there is no relevant information in the corpus that can satisfy the information needs. 8 812/07/2012 SIGIR 2010, Geneva 8 SIGIR 2012 Tutorial: Query Performance Prediction
  • 9. An example for a difficult query: IBM Labs in Haifa “The Hubble Telescope achievements” Hubble Telescope Achievements: Great eye sets sights sky high Simple test would have found flaw in Hubble telescope Nation in brief Hubble space telescope placed aboard shuttle Cause of Hubble telescope defect reportedly found Flaw in Hubble telescope Flawed mirror hampers Hubble space telescope Touchy telescope torments controllers NASA scrubs launch of discovery Hubble builders got award fees, magazine says Retrieved results deal with issues related to the Hubble telescope project in general; but the gist of that query, achievements, is lost 9 912/07/2012 SIGIR 2010, Geneva 9 SIGIR 2012 Tutorial: Query Performance Prediction
  • 10. IBM Labs in Haifa The variance in performance across queries and systems (TREC-7) Queries are sorted in decreasing order according to average precision attained among all TREC participants (green bars). The performance of two different systems per query is shown by the two curves. 10 1012/07/2012 SIGIR 2010, Geneva 10 SIGIR 2012 Tutorial: Query Performance Prediction
  • 11. IBM Labs in Haifa The Reliable Information Access (RIA) workshop The first attempt to rigorously investigate the reasons for performance variability across queries and systems. Extensive failure analysis of the results of 6 IR systems 45 TREC topics Main reason to failures - the systems’ inability to identify all important aspects of the query The failure to emphasize one aspect of a query over another, or emphasize one aspect and neglect other aspects “What disasters have occurred in tunnels used for transportation?” Emphasizing only one of these terms will deteriorate performance because each term on its own does not fully reflect the information need If systems could estimate what failure categories the query may belong to Systems could apply specific automated techniques that correspond to the failure mode in order to improve performance 11 1112/07/2012 SIGIR 2010, Geneva 11 SIGIR 2012 Tutorial: Query Performance Prediction
  • 12. IBM Labs in Haifa (no textual content on this slide)
  • 13. Instability in Retrieval - The TREC’s Robust IBM Labs in Haifa Tracks The diversity in performance among topics and systems led to the TREC Robust tracks (2003 – 2005) Encouraging systems to decrease variance in query performance by focusing on poorly performing topics Systems were challenged with 50 old TREC topics found to be “difficult” for most systems over the years A topic is considered difficult when the median of the average precision scores of all participants for that topic is below a given threshold A new measure, GMAP, uses the geometric mean instead of the arithmetic mean when averaging precision values over topics Emphasizes the lowest performing topics, and is thus a useful measure that can attest to the robustness of system’s performance 13 1312/07/2012 SIGIR 2010, Geneva 13 SIGIR 2012 Tutorial: Query Performance Prediction
  • 14. The Robust Tracks – Decreasing Variability IBM Labs in Haifa across Topics Several approaches to improving the poor effectiveness for some topics were tested Selective query processing strategy based on performance prediction Post-retrieval reordering Selective weighting functions Selective query expansion None of these approaches was able to show consistent improvement over traditional non-selective approaches Apparently, expanding the query by appropriate terms extracted from an external collection (the Web) improves the effectiveness for many queries, including poorly performing queries 14 1412/07/2012 SIGIR 2010, Geneva 14 SIGIR 2012 Tutorial: Query Performance Prediction
  • 15. The Robust Tracks – Query Performance IBM Labs in Haifa Prediction As a second challenge: systems were asked to predict their performance for each of the test topics Then the TREC topics were ranked First, by their predicted performance value Second, by their actual performance value Evaluation was done by measuring the similarity between the predicted performance-based ranking and the actual performance-based ranking Most systems failed to exhibit reasonable prediction capability 14 runs had a negative correlation between the predicted and actual topic rankings, demonstrating that measuring performance prediction is intrinsically difficult On the positive side, the difficulty in developing reliable prediction methods raised the awareness of the IR community to this challenge 15 1512/07/2012 SIGIR 2010, Geneva 15 SIGIR 2012 Tutorial: Query Performance Prediction
  • 16. IBM Labs in Haifa – How difficult is the performance prediction task for human experts? TREC-6 experiment: estimating whether human experts can predict query difficulty. A group of experts was asked to classify a set of TREC topics into three degrees of difficulty (easy, middle, hard) based on the query expression only. The manual judgments were compared to the median of the average precision scores, as determined after evaluating the performance of all participating systems. Results: the Pearson correlation between the expert judgments and the “true” values was very low (0.26), and the agreement between experts, as measured by the correlation between their judgments, was very low too (0.39). The low correlation illustrates how difficult this task is and how little is known about what makes a query difficult.
  • 17. IBM Labs in Haifa – How difficult is the performance prediction task for human experts? (contd.) Hauff et al. (CIKM 2010) found: a low level of agreement between humans with regard to which queries are more difficult than others (median kappa = 0.36); high variance in the ability of humans to estimate query difficulty, although they shared “similar” backgrounds; a low correlation between true performance and humans' estimates of query difficulty (performed in a pre-retrieval fashion), with a median Kendall tau of 0.31, quite a bit lower than that posted by the best performing pre-retrieval predictors, although overall the humans did manage to differentiate between “good” and “bad” queries; and a low correlation between humans' predictions and those of query-performance predictors, with some exceptions. These findings further demonstrate how difficult the query-performance prediction task is and how little is known about what makes a query difficult.
  • 18. IBM Labs in Haifa Are queries found to be difficult in one collection still considered difficult in another collection? Robust track 2005: Difficult topics in the ROBUST collection were tested against another collection (AQUAINT) The median average precision over the ROBUST collection is 0.126 Compared to 0.185 for the same topics over the AQUAINT collection Apparently, the AQUAINT collection is “easier” than the ROBUST collection probably due to Collection size Many more relevant documents per topic in AQUAINT Document features such as structure and coherence However, the relative difficulty of the topics is preserved over the two datasets The Pearson correlation between topics’ performance in both datasets is 0.463 This illustrates some dependency between the topic’s median scores on both collections Even when topics are somewhat easier in one collection than another, the relative difficulty among topics is preserved, at least to some extent 18 1812/07/2012 SIGIR 2010, Geneva 18 SIGIR 2012 Tutorial: Query Performance Prediction
  • 19. Basic concepts IBM Haifa Research Lab © 2010 IBM Corporation
  • 20. IBM Labs in Haifa The retrieval task Given: A document set D (the corpus) A query q Retrieve Dq (the result list), a ranked list of documents from D, which are most likely to be relevant to q Some widely used retrieval methods: Vector space tf-idf based ranking, which estimates relevance by the similarity between the query and a document in the vector space The probabilistic OKAPI-BM25 method, which estimates the probability that the document is relevant to the query Language-model-based approaches, which estimate the probability that the query was generated by a language model induced from the document And more, divergence from randomness (DFR) approaches, inference networks, markov random fields (MRF), … 20 2012/07/2012 SIGIR 2010, Geneva 20 SIGIR 2012 Tutorial: Query Performance Prediction
  • 21. IBM Labs in Haifa – Text REtrieval Conference (TREC): a series of workshops, started in 1992, for large-scale evaluation of (mostly) text retrieval technology: realistic test collections; uniform, appropriate scoring procedures. A TREC task usually comprises: a document collection (corpus); a list of topics (information needs); a list of relevant documents for each topic (QRELs). Example topic: Title: African Civilian Deaths. Description: How many civilian non-combatants have been killed in the various civil wars in Africa? Narrative: A relevant document will contain specific casualty information for a given area, country, or region. It will cite numbers of civilian deaths caused directly or indirectly by armed conflict.
  • 22. IBM Labs in Haifa – Precision measures. Precision at k (P@k): the fraction of relevant documents among the top-k results. Average precision (AP): the average of the precision values computed at the ranks of each of the relevant documents in the ranked list: $AP(q) = \frac{1}{|R_q|} \sum_{r \in R_q} P@rank(r)$, where $R_q$ is the set of documents in the corpus that are relevant to q. Average precision is usually computed using a ranking that is truncated at some position (typically 1000 in TREC).
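A minimal Python sketch of the two measures just defined (function and variable names are illustrative only):

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of relevant documents among the top-k results."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant, cutoff=1000):
    """AP: mean of P@rank(r) over the ranks of the relevant documents.
    Relevant documents not retrieved within the cutoff contribute 0."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank   # P@rank of this relevant document
    return precision_sum / len(relevant) if relevant else 0.0

# Toy example: 3 relevant documents, 2 of them retrieved in the top ranks.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}
print(precision_at_k(ranking, relevant, 5))   # 0.4
print(average_precision(ranking, relevant))   # (1/1 + 2/3) / 3 ≈ 0.556
```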
  • 23. IBM Labs in Haifa Prediction quality measures Given: Query q Result list Dq that contains n documents Goal: Estimate the retrieval effectiveness of Dq, in terms of satisfying Iq, the information need behind q specifically, the prediction task is to predict AP(q) when no relevance information (Rq) is given. In practice: Estimate, for example, the expected average precision for q. The quality of a performance predictor can be measured by the correlation between the expected average precision and the corresponding actual precision values 23 2312/07/2012 SIGIR 2010, Geneva 23 SIGIR 2012 Tutorial: Query Performance Prediction
  • 24. IBM Labs in Haifa Measures of correlation Linear correlation (Pearson) considers the true AP values as well as the predicted AP values of queries Non-linear correlation (Kendall’s tau, Spearman’s rho) considers only the ranking of queries by their true AP values and by their predicted AP values 24 2412/07/2012 SIGIR 2010, Geneva 24 SIGIR 2012 Tutorial: Query Performance Prediction
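These correlation measures are available directly in SciPy; a small sketch (with made-up AP values) of how prediction quality would typically be computed over a set of queries:

```python
from scipy import stats

# Hypothetical true AP values and predicted scores for five queries.
true_ap   = [0.12, 0.45, 0.30, 0.08, 0.52]
predicted = [0.90, 2.10, 1.70, 0.40, 2.50]

pearson_r, _    = stats.pearsonr(true_ap, predicted)    # linear correlation
kendall_tau, _  = stats.kendalltau(true_ap, predicted)  # rank-based
spearman_rho, _ = stats.spearmanr(true_ap, predicted)   # rank-based
print(pearson_r, kendall_tau, spearman_rho)
```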
  • 25. IBM Labs in Haifa – Evaluating a prediction system. Different correlation metrics will measure different things; if there is no a priori reason to use Pearson, prefer Spearman or Kendall's tau. In practice, the difference is not large, because there are enough queries and because of the distribution of AP. (Figure: scatter plot of predicted AP vs. actual AP for queries Q1–Q5.)
  • 26. IBM Labs in Haifa – Evaluating a prediction system (contd.) It is important to note that it was found that state-of-the-art query-performance predictors might not be correlated (at all) with measures of users' performance (e.g., the time it takes to reach the first relevant document); see Turpin and Hersh ADC '04 and Zhao and Scholer ADC '07. However, this finding might be attributed, as suggested by Turpin and Hersh in ADC '04, to the fact that standard evaluation measures (e.g., average precision) and users' performance are not always strongly correlated (Hersh et al. '00, Turpin and Hersh '01, Turpin and Scholer '06, Smucker and Parkash Jethani '10).
  • 27. Query Performance Prediction Methods IBM Haifa Research Lab © 2010 IBM Corporation
  • 28. IBM Labs in Haifa – Performance Prediction Methods (taxonomy). Pre-Retrieval Methods: Linguistics (Morphologic, Syntactic, Semantic); Statistics (Specificity, Similarity, Coherency, Relatedness). Post-Retrieval Methods: Clarity; Robustness Analysis (Query Perturb., Doc Perturb., Retrieval Perturb.); Score Analysis (Top score, Avg. score, Variance of scores, Cluster hypo.).
  • 29. IBM Labs in Haifa Pre-Retrieval Prediction Methods Pre-retrieval prediction approaches estimate the quality of the search results before the search takes place. Provide effective instantiation of query performance prediction for search applications that must efficiently respond to search requests Only the query terms, associated with some pre-defined statistics gathered at indexing time, can be used for prediction Pre-retrieval methods can be split to linguistic and statistical methods. Linguistic methods apply natural language processing (NLP) techniques and use external linguistic resources to identify ambiguity and polysemy in the query. Statistical methods analyze the distribution of the query terms within the collection 29 2912/07/2012 SIGIR 2010, Geneva 29 SIGIR 2012 Tutorial: Query Performance Prediction
  • 30. Linguistic Approaches (Mothe & Tangui (SIGIR QD IBM Labs in Haifa workshop, SIGIR 2005), Hauff (2010)) Most linguistic features do not correlate well with the system performance. Features include, for example, morphological (avg. # of morphemes per query term), syntactic link span (which relates to the average distance between query words in the parse tree), semantic (polysemy- the avg. # of synsets per word in the WordNet dictionary) Only the syntactic links span and the polysemy value were shown to have some (low) correlation This is quite surprising as intuitively poor performance can be expected for ambiguous queries Apparently, term ambiguity should be measured using corpus-based approaches, since a term that might be ambiguous with respect to the general vocabulary, may have only a single interpretation in the corpus 30 3012/07/2012 SIGIR 2010, Geneva 30 SIGIR 2012 Tutorial: Query Performance Prediction
  • 31. IBM Labs in Haifa Pre-retrieval Statistical methods Analyze the distribution of the query term frequencies within the collection. Two major frequent term statistics : Inverse document frequency idf(t) = log(N/Nt) Inverse collection term frequency ictf(t) = log(|D|/tf(t,D)) Specificity based predictors Measure the query terms' distribution over the collection Specificity-based predictors: avgIDF, avgICTF - Queries composed of infrequent terms are easier to satisfy maxIDF, maxICTF – Similarly varIDF, varICTF - Low variance reflects the lack of dominant terms in the query A query composed of non-specific terms is deemed to be more difficult “Who and Whom” 31 3112/07/2012 SIGIR 2010, Geneva 31 SIGIR 2012 Tutorial: Query Performance Prediction
  • 32. IBM Labs in Haifa – Specificity-based predictors. Query Scope (He & Ounis 2004): the percentage of documents in the corpus that contain at least one query term; a high value indicates many candidates for retrieval, and thereby a difficult query. Query scope exhibits marginal prediction quality for short queries only, while for long queries its quality drops significantly. QS is not a ``pure'' pre-retrieval predictor as it requires finding the documents containing query terms (consider dynamic corpora). Simplified Clarity Score (He & Ounis 2004): the Kullback-Leibler (KL) divergence between the (simplified) query language model and the corpus language model: $SCS(q) = \sum_{t \in q} p(t|q) \log \frac{p(t|q)}{p(t|D)}$. SCS is strongly related to the avgICTF predictor; assuming each term appears only once in the query: SCS(q) = log(1/|q|) + avgICTF(q).
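A minimal sketch of the specificity-based pre-retrieval predictors, assuming simple corpus statistics are available as Python dictionaries; all names and numbers are illustrative:

```python
import math

def idf(term, doc_freq, num_docs):
    """Inverse document frequency: log(N / N_t); unseen terms get maximal idf."""
    return math.log(num_docs / doc_freq.get(term, 1))

def ictf(term, coll_tf, coll_size):
    """Inverse collection term frequency: log(|D| / tf(t, D))."""
    return math.log(coll_size / coll_tf.get(term, 1))

def avg_idf(query_terms, doc_freq, num_docs):
    return sum(idf(t, doc_freq, num_docs) for t in query_terms) / len(query_terms)

def avg_ictf(query_terms, coll_tf, coll_size):
    return sum(ictf(t, coll_tf, coll_size) for t in query_terms) / len(query_terms)

def scs(query_terms, coll_tf, coll_size):
    """Simplified Clarity Score; with each term appearing once in the query,
    SCS(q) = log(1/|q|) + avgICTF(q)."""
    return math.log(1.0 / len(query_terms)) + avg_ictf(query_terms, coll_tf, coll_size)

# Hypothetical corpus statistics.
doc_freq = {"hubble": 1200, "telescope": 5000, "achievements": 30000}
coll_tf  = {"hubble": 2500, "telescope": 9000, "achievements": 80000}
query = ["hubble", "telescope", "achievements"]
print(avg_idf(query, doc_freq, num_docs=1_000_000))
print(scs(query, coll_tf, coll_size=500_000_000))
```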
  • 33. IBM Labs in Haifa – Similarity-, coherency-, and variance-based predictors. Similarity: SCQ (Zhao et al. 2008): high similarity to the corpus indicates effective retrieval; $SCQ(t) = (1 + \log(tf(t,D))) \cdot idf(t)$; maxSCQ(q)/avgSCQ(q) are the maximum/average over the query terms; contradicts(?) the specificity idea (e.g., SCS). Coherency: CS (He et al. 2008): the average inter-document similarity between documents containing a query term, averaged over the query terms; a conceptual pre-retrieval analogue of the post-retrieval autocorrelation approach (Diaz 2007) discussed later; demanding computation that requires the construction of a pairwise similarity matrix for all pairs of documents in the index. Variance (Zhao et al. 2008): var(t) - the variance of term t's weights (e.g., tf.idf) over the documents containing it; maxVar and sumVar are the maximum/sum over the query terms; hypothesis: low variance implies a difficult query due to low discriminative power.
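A rough sketch (illustrative, not the authors' code) of the SCQ and term-weight-variance predictors, assuming per-term corpus frequencies and per-document tf.idf weights are supplied:

```python
import math

def scq(term, coll_tf, doc_freq, num_docs):
    """SCQ(t) = (1 + log tf(t,D)) * idf(t); 0 for terms absent from the corpus."""
    if term not in coll_tf:
        return 0.0
    return (1 + math.log(coll_tf[term])) * math.log(num_docs / doc_freq[term])

def max_scq(query_terms, coll_tf, doc_freq, num_docs):
    return max(scq(t, coll_tf, doc_freq, num_docs) for t in query_terms)

def term_weight_variance(weights):
    """var(t): variance of a term's weights (e.g., tf.idf) over the docs containing it."""
    mean = sum(weights) / len(weights)
    return sum((w - mean) ** 2 for w in weights) / len(weights)

def sum_var(query_terms, term_weights):
    """sumVar: sum of per-term weight variances over the query terms."""
    return sum(term_weight_variance(term_weights[t]) for t in query_terms if t in term_weights)

# term_weights maps a term to the list of its tf.idf weights in the documents containing it.
term_weights = {"hubble": [2.1, 3.4, 1.9], "telescope": [1.0, 1.1, 0.9, 1.2]}
print(sum_var(["hubble", "telescope"], term_weights))
```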
  • 34. IBM Labs in Haifa – Term Relatedness (Hauff 2010). Hypothesis: if the query terms co-occur frequently in the collection, we expect good performance. Pointwise mutual information (PMI) is a popular measure of the co-occurrence statistics of two terms in the collection: $PMI(t_1,t_2) = \log \frac{p(t_1,t_2|D)}{p(t_1|D)\,p(t_2|D)}$. It requires efficient tools for gathering collocation statistics from the corpus, to allow dynamic usage at query run-time. avgPMI(q)/maxPMI(q) measure the average and the maximum PMI over all pairs of terms in the query.
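A small sketch of the PMI-based predictors, assuming term and co-occurrence counts have already been gathered from the corpus (the counts below are made up):

```python
import math

def pmi(t1, t2, cooc_count, term_count, total_terms):
    """Pointwise mutual information estimated from simple corpus counts."""
    p_t1 = term_count[t1] / total_terms
    p_t2 = term_count[t2] / total_terms
    p_joint = cooc_count.get((t1, t2), 0) / total_terms
    if p_joint == 0:
        return float("-inf")   # the pair never co-occurs
    return math.log(p_joint / (p_t1 * p_t2))

def avg_max_pmi(query_terms, cooc_count, term_count, total_terms):
    pairs = [(a, b) for i, a in enumerate(query_terms) for b in query_terms[i + 1:]]
    scores = [pmi(a, b, cooc_count, term_count, total_terms) for a, b in pairs]
    return sum(scores) / len(scores), max(scores)

# Hypothetical counts (e.g., gathered from a window-based co-occurrence index).
term_count = {"prime": 12000, "lending": 3000, "rate": 50000}
cooc_count = {("prime", "lending"): 900, ("prime", "rate"): 1500, ("lending", "rate"): 800}
print(avg_max_pmi(["prime", "lending", "rate"], cooc_count, term_count, total_terms=10_000_000))
```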
  • 35. IBM Labs in Haifa Evaluating pre-retrieval methods (Hauff 2010) Insights maxVAR and maxSCQ dominate other predictors, and are most stable over the collections and topic sets However, their performance significantly drops for one of the query sets (301-350) Prediction is harder over the Web collections (WT10G and GOV2) than over the news collection (ROBUST), probably due to higher heterogeneity of the data 35 3512/07/2012 SIGIR 2010, Geneva 35 SIGIR 2012 Tutorial: Query Performance Prediction
  • 36. Post-retrieval Predictors IBM Haifa Research Lab © 2010 IBM Corporation
  • 37. IBM Labs in Haifa – Post-Retrieval Predictors. Analyze the search results in addition to the query; usually more complex, as the top results are retrieved and analyzed. Prediction quality depends on the retrieval process, as different results are expected for the same query when using different retrieval methods. In contrast to pre-retrieval methods, the search results may depend on query-independent factors such as document authority scores, search personalization, etc. Post-retrieval methods can be categorized into three main paradigms: Clarity-based methods directly measure the coherence of the search results; robustness-based methods evaluate how robust the results are to perturbations in the query, the result list, and the retrieval method; score-distribution-based methods analyze the score distribution of the search results. (Diagram: Post-Retrieval Methods branch into Clarity, Robustness Analysis (Query Perturb., Doc Perturb., Retrieval Perturb.), and Score Analysis.)
  • 38. IBM Labs in Haifa Clarity (Cronen-Townsend et al. SIGIR 2002) Clarity measures the coherence (clarity) of the result-list with respect to the corpus Good results are expected to be focused on the query's topic Clarity considers the discrepancy between the likelihood of words most frequently used in retrieved documents and their likelihood in the whole corpus Good results - The language of the retrieved documents should be distinct from the general language of the whole corpus Bad results - The language of retrieved documents tends to be more similar to the general language Accordingly, clarity measures the KL divergence between a language model induced from the result list and that induced from the corpus 38 3812/07/2012 SIGIR 2010, Geneva 38 SIGIR 2012 Tutorial: Query Performance Prediction
  • 39. IBM Labs in Haifa – Clarity Computation.
  $Clarity(q) \triangleq KL(p(\cdot|D_q)\,||\,p(\cdot|D)) = \sum_{t \in V(D)} p(t|D_q) \log \frac{p(t|D_q)}{p_{MLE}(t|D)}$ (KL divergence between Dq and D)
  $p_{MLE}(t|X) = \frac{tf(t,X)}{|X|}$ (X's unsmoothed language model, MLE)
  $p(t|d) = \lambda\, p_{MLE}(t|d) + (1-\lambda)\, p(t|D)$ (d's smoothed language model)
  $p(t|D_q) = \sum_{d \in D_q} p(t|d)\, p(d|q)$ (Dq's language model, RM1)
  $p(d|q) = \frac{p(q|d)\,p(d)}{\sum_{d' \in D_q} p(q|d')\,p(d')} \propto p(q|d)$ (d's “relevance” to q)
  $p(q|d) = \prod_{t \in q} p(t|d)$ (query likelihood model)
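A compact Python sketch of the computation above (an illustrative implementation, not the original code): documents are bags of words, the result-list model is an RM1 built from the top-retrieved documents weighted by their query likelihood, and Clarity is the KL divergence from the corpus model; the smoothing parameter, the restriction to the result-list vocabulary, and the tokenization are simplifying assumptions.

```python
import math
from collections import Counter

LAMBDA = 0.6  # Jelinek-Mercer smoothing weight (an assumed setting)

def lm(counts, total):
    return {t: c / total for t, c in counts.items()}

def smoothed_p(t, doc_lm, corpus_lm):
    return LAMBDA * doc_lm.get(t, 0.0) + (1 - LAMBDA) * corpus_lm.get(t, 1e-12)

def clarity(query, result_docs, corpus_lm):
    """result_docs: list of token lists for the top-retrieved documents."""
    doc_lms = [lm(Counter(d), len(d)) for d in result_docs]
    # p(d|q) proportional to p(q|d) = product over t of p(t|d), with smoothed models.
    weights = [math.exp(sum(math.log(smoothed_p(t, dlm, corpus_lm)) for t in query))
               for dlm in doc_lms]
    z = sum(weights) or 1.0
    weights = [w / z for w in weights]
    # p(t|Dq) = sum over d of p(t|d) p(d|q), over the vocabulary of the result list.
    vocab = set(t for d in result_docs for t in d)
    p_list = {t: sum(w * smoothed_p(t, dlm, corpus_lm)
                     for w, dlm in zip(weights, doc_lms)) for t in vocab}
    # KL divergence from the corpus language model.
    return sum(p * math.log(p / corpus_lm.get(t, 1e-12)) for t, p in p_list.items() if p > 0)

# Toy usage with a tiny "corpus" made of the two retrieved documents.
docs = [["hubble", "telescope", "mirror", "flaw"], ["hubble", "telescope", "launch"]]
corpus_tokens = [t for d in docs for t in d]
corpus_lm = lm(Counter(corpus_tokens), len(corpus_tokens))
print(clarity(["hubble", "telescope"], docs, corpus_lm))
```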
  • 40. IBM Labs in Haifa Example Consider the two queries variants for TREC topic 56 (from the TREC query track): Query A: Show me any predictions for changes in the prime lending rate and any changes made in the prime lending rates Query B: What adjustments should be made once federal action occurs? 40 4012/07/2012 SIGIR 2010, Geneva 40 SIGIR 2012 Tutorial: Query Performance Prediction
  • 41. IBM Labs in Haifa – The Clarity of 1) relevant results, 2) non-relevant results, 3) random documents. (Figure: query clarity score (0–10) vs. topic number (300–450), plotted separately for relevant, non-relevant, and collection-wide document sets.)
  • 42. IBM Labs in Haifa – Novel interpretation of clarity - Hummel et al. 2012 (See the Clarity Revisited poster!) $Clarity(q) \triangleq KL(p(\cdot|D_q)\,||\,p(\cdot|D)) = CE(p(\cdot|D_q)\,||\,p(\cdot|D)) - H(p(\cdot|D_q))$, i.e., cross entropy minus entropy; equivalently, $Clarity(q) = CDistance(q) - LDiversity(q)$: the distance of the result list from the corpus minus the diversity of the result list.
  • 43. IBM Labs in Haifa Clarity variants Divergence from randomness approach (Amati et al. ’02) Emphasize query terms (Cronen Townsend et al. ’04, Hauff et al. ’08) Only consider terms that appear in a very low percentage of all documents in the corpus (Hauff et al. ’08) Beneficial for noisy Web settings 43 4312/07/2012 SIGIR 2010, Geneva 43 SIGIR 2012 Tutorial: Query Performance Prediction
  • 44. IBM Labs in Haifa Robustness Robustness can be measured with respect to perturbations of the Query The robustness of the result list to small modifications of the query (e.g., perturbation of term weights) Documents Small random perturbations of the document representation are unlikely to result in major changes to documents' retrieval scores If scores of documents are spread over a wide range, then these perturbations are unlikely to result in significant changes to the ranking Retrieval method In general, different retrieval methods tend to retrieve different results for the same query, when applied over the same document collection A high overlap in results retrieved by different methods may be related to high agreement on the (usually sparse) set of relevant results for the query. A low overlap may indicate no agreement on the relevant results; hence, query difficulty 44 4412/07/2012 SIGIR 2010, Geneva 44 SIGIR 2012 Tutorial: Query Performance Prediction
  • 45. IBM Labs in Haifa – Query Perturbations. Overlap between the query and its sub-queries (Yom-Tov et al. SIGIR 2005). Observation: some query terms have little or no influence on the retrieved documents, especially in difficult queries. The query feedback (QF) method (Zhou and Croft SIGIR 2007) models retrieval as a communication channel problem: the input is the query, the channel is the search system, and the set of results is the noisy output of the channel. A new query q′ is generated from the list of results, using the terms with maximal contribution to the Clarity score, and then a second list of results is retrieved for that query. The overlap between the two lists is used as a robustness score, from which AP(q) is predicted. (Diagram: q → search engine → original result list R; q′ → search engine → new result list; overlap of the two lists → predicted AP(q).)
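A tiny illustrative helper (not the authors' implementation) for the overlap computation that QF-style robustness predictors rely on: the fraction of shared documents among the top-k of two ranked lists.

```python
def overlap_at_k(list_a, list_b, k=50):
    """Fraction of the top-k documents of list_a that also appear in the top-k of list_b."""
    top_a, top_b = set(list_a[:k]), set(list_b[:k])
    return len(top_a & top_b) / k

original  = ["d1", "d2", "d3", "d4", "d5"]
perturbed = ["d2", "d1", "d7", "d4", "d9"]
print(overlap_at_k(original, perturbed, k=5))  # 0.6
```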
  • 46. IBM Labs in Haifa – Document Perturbation (Zhou & Croft CIKM 06). (A conceptually similar approach, using a different technique, was presented by Vinay et al. in SIGIR 06.) How stable is the ranking in the presence of uncertainty in the ranked documents? Compare a ranked list from the original collection to the corresponding ranked list from the corrupted collection, using the same query and ranking function. (Diagram: the documents d of collection D are perturbed via their language models into documents d′ of collection D′; the query q is run against both, and the two ranked lists d1…dk and d′1…d′k are compared.)
  • 47. Cohesioninof the Result list – Clustering tendency IBM Labs Haifa (Vinay et al. SIGIR 2006) The cohesion of the result list can be measured by its clustering patterns Following the ``cluster hypothesis'' which implies that documents relevant to a given query are likely to be similar to one another A good retrieval returns a single, tight cluster, while poor retrieval returns a loosely related set of documents covering many topics The ``clustering tendency'' of the result set Corresponds to the Cox-Lewis statistic which measures the ``randomness'' level of the result list Measured by the distance between a randomly selected document and it’s nearest neighbor from the result list. When the list contains “inherent” clusters, the distance between the random document and its closest neighbor is likely to be much larger than the distance between this neighbor and its own nearest neighbor in the list 47 4712/07/2012 SIGIR 2010, Geneva 47 SIGIR 2012 Tutorial: Query Performance Prediction
  • 48. Retrieval in Haifa IBM Labs Method Perturbation (Aslam & Pavlu, ECIR 2007) Query difficulty is predicted by submitting the query to different retrieval methods and measuring the diversity of the retrieved ranked lists Each ranking is mapped to a distribution over the document collection JSD distance is used to measure the diversity of these distributions Evaluation: Submissions of all participants to several TREC tracks were analyzed The agreement between submissions highly correlates with the query difficulty, as measured by the median performance (AP) of all participants The more submissions are analyzed, the prediction quality improves. 48 4812/07/2012 SIGIR 2010, Geneva 48 SIGIR 2012 Tutorial: Query Performance Prediction
  • 49. IBM Labs in Haifa Score Distribution Analysis Often, the retrieval scores reflect the similarity of documents to a query Hence, the distribution of retrieval scores can potentially help predict query performance. Naïve predictors: The highest retrieval score or the mean of top scores (Thomlison 2004) The difference between query-independent scores and a query- dependent scores Reflects the ``discriminative power'' of the query (Bernstein et al. 2005) 49 4912/07/2012 SIGIR 2010, Geneva 49 SIGIR 2012 Tutorial: Query Performance Prediction
  • 50. IBM Labs in Haifa – Spatial autocorrelation (Diaz SIGIR 2007). Query performance is correlated with the extent to which the result list “respects” the cluster hypothesis, i.e., the extent to which similar documents receive similar retrieval scores. In contrast, a difficult query might be detected when similar documents are scored differently. A document's ``regularized'' retrieval score is determined based on the weighted sum of the scores of its most similar documents: $Score_{Reg}(q,d) \triangleq \sum_{d'} Sim(d,d') \cdot Score(q,d')$. The linear correlation of the regularized scores with the original scores is used for query-performance prediction.
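A sketch of the autocorrelation idea under simplifying assumptions (cosine similarity over supplied document vectors, top-n neighbours restricted to the result list); the names and parameters are illustrative, not Diaz's implementation:

```python
import numpy as np
from scipy import stats

def autocorrelation_predictor(doc_vectors, scores, n_neighbors=5):
    """doc_vectors: (n_docs, dim) array for the result list; scores: original retrieval scores.
    Returns the Pearson correlation between regularized and original scores."""
    X = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, 0.0)                    # ignore self-similarity
    regularized = np.zeros_like(scores, dtype=float)
    for i in range(len(scores)):
        nn = np.argsort(sim[i])[-n_neighbors:]    # most similar documents in the list
        regularized[i] = np.dot(sim[i, nn], scores[nn])
    r, _ = stats.pearsonr(regularized, scores)
    return r

rng = np.random.default_rng(0)
vectors = rng.random((20, 50))
scores = rng.random(20)
print(autocorrelation_predictor(vectors, scores))
```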
  • 51. IBM Labs in Haifa – Weighted Information Gain - WIG (Zhou & Croft 2007). WIG measures the divergence between the mean retrieval score of top-ranked documents and that of the entire corpus; the more similar these documents are to the query, with respect to the corpus, the more effective the retrieval. The corpus represents a general non-relevant document. $WIG(q) \triangleq \frac{1}{k} \sum_{d \in D_q^k} \sum_{t \in q} \lambda_t \log \frac{p(t|d)}{p(t|D)}$, i.e., up to the $\frac{1}{\sqrt{|q|}}$ factor, the difference between the average retrieval score of the top-k documents, $avg_{d \in D_q^k}(Score(d))$, and the corpus score $Score(D)$. $D_q^k$ is the list of the k highest-ranked documents; $\lambda_t$ reflects the weight of the term's type, and when all query terms are simple keywords this parameter collapses to $1/\sqrt{|q|}$. WIG was originally proposed and employed in the MRF framework; however, for a bag-of-words representation MRF reduces to the query likelihood model and is effective (Shtok et al. '07, Zhou '07).
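A simplified WIG sketch (illustrative only), assuming all query terms are simple keywords (so λ_t = 1/√|q|) and that smoothed document and corpus language-model probabilities are supplied:

```python
import math

def wig(query_terms, top_doc_lms, corpus_lm, smooth=1e-6):
    """top_doc_lms: list of {term: p(t|d)} dicts for the k top-ranked documents.
    corpus_lm: {term: p(t|D)}.  Returns the WIG score for the query."""
    k = len(top_doc_lms)
    lam = 1.0 / math.sqrt(len(query_terms))        # keyword-only term weight
    total = 0.0
    for doc_lm in top_doc_lms:
        for t in query_terms:
            total += lam * math.log(doc_lm.get(t, smooth) / corpus_lm.get(t, smooth))
    return total / k

# Hypothetical smoothed language models for the two top-ranked documents.
corpus_lm = {"hubble": 1e-5, "telescope": 5e-5}
docs = [{"hubble": 0.02, "telescope": 0.03}, {"hubble": 0.01, "telescope": 0.02}]
print(wig(["hubble", "telescope"], docs, corpus_lm))
```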
  • 52. IBM Labs in Haifa – Normalized Query Commitment - NQC (Shtok et al. ICTIR 2009 and TOIS 2012). NQC estimates the presumed amount of query drift in the list of top-retrieved documents. Query expansion often uses a centroid of the list Dq as an expanded query model; the centroid usually manifests query drift (Mitra et al. '98) and could be viewed as a prototypical misleader, as it exhibits (some) similarity to the query while this similarity is dominated by non-query-related aspects that lead to query drift. Shtok et al. showed that the mean retrieval score of documents in the result list corresponds, in several retrieval methods, to the retrieval score of some centroid-based representation of Dq; thus, the mean score represents the score of a prototypical misleader. The standard deviation of scores, which reflects their dispersion around the mean, represents the divergence of the retrieval scores of documents in the list from that of a non-relevant document that exhibits high query similarity (the centroid): $NQC(q) \triangleq \frac{\sqrt{\frac{1}{k}\sum_{d \in D_q^k}(Score(d) - \mu)^2}}{Score(D)}$
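A minimal NQC sketch (illustrative): the standard deviation of the top-k retrieval scores, normalized by the corpus score of the query, which is taken here as a given number.

```python
import math

def nqc(top_scores, corpus_score):
    """top_scores: retrieval scores of the k top-ranked documents.
    corpus_score: Score(D), e.g., the query log-likelihood given the corpus model.
    abs() is used since log-likelihood corpus scores are negative (an assumed convention)."""
    k = len(top_scores)
    mu = sum(top_scores) / k
    std = math.sqrt(sum((s - mu) ** 2 for s in top_scores) / k)
    return std / abs(corpus_score)

# Hypothetical log query-likelihood scores for the top-5 documents.
print(nqc([-8.1, -8.4, -9.0, -9.7, -10.2], corpus_score=-14.5))
```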
  • 53. IBM Labs in Haifa A geometric interpretation of NQC 53 5312/07/2012 SIGIR 2010, Geneva 53 SIGIR 2012 Tutorial: Query Performance Prediction
  • 54. EvaluatingHaifa IBM Labs in post-retrieval methods (Shtok et al. TOIS 2012) Prediction quality (Pearson correlation and Kendall’s tau) Similarly to pre-retrieval methods, there is no clear winner QF and NQC exhibit comparable results NQC exhibits good performance over most collections but does not perform very well on GOV2. QF performs well over some of the collections but is inferior to other predictors on ROBUST. 54 5412/07/2012 SIGIR 2010, Geneva 54 SIGIR 2012 Tutorial: Query Performance Prediction
  • 55. IBM Labs in Haifa Additional predictors that analyze the retrieval scores distribution Computing the standard deviation of retrieval scores at a query- dependent cutoff Perez-Iglesias and Araujo SPIRE 2010 Cummins et al. SIGIR 2011 Computing expected ranks for documents Vinay et al. CIKM 2008 Inferring AP directly from the retrieval score distribution Cummins AIRS 2011 55 5512/07/2012 SIGIR 2010, Geneva 55 SIGIR 2012 Tutorial: Query Performance Prediction
  • 56. Combining Predictors IBM Haifa Research Lab © 2010 IBM Corporation
  • 57. IBM Labs in Haifa Combining post-retrieval predictors Some integration efforts of a few predictors based on linear regression: Yom-Tov et al (SIGIR’05) combined avgIDF with the Overlap predictor Zhou and Croft (SIGIR’07) integrated WIG and QF using a simple linear combination Diaz (SIGIR’07) incorporated the spatial autocorrelation predictor with Clarity and with the document perturbation based predictor In all those trials, the results of the combined predictor were much better than the results of the single predictors This suggests that these predictors measure (at least semi) complementary properties of the retrieved results 57 5712/07/2012 SIGIR 2010, Geneva 57 SIGIR 2012 Tutorial: Query Performance Prediction
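As a hedged illustration of the linear-regression combination idea (not any particular published setup), one could fit a regression from several predictor values to true AP on training queries and apply it to new queries:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: rows are queries, columns are predictor values
# (e.g., WIG, Clarity, QF-overlap), and y holds the true AP of each query.
X_train = np.array([[0.8, 1.2, 0.3],
                    [0.2, 0.4, 0.1],
                    [1.1, 2.0, 0.6],
                    [0.5, 0.9, 0.2]])
y_train = np.array([0.35, 0.10, 0.55, 0.22])

combined = LinearRegression().fit(X_train, y_train)

# Predicted performance for two unseen queries.
X_test = np.array([[0.9, 1.5, 0.4], [0.3, 0.5, 0.1]])
print(combined.predict(X_test))
```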
  • 58. IBM Labs in Haifa – Utility estimation framework (UEF) for query performance prediction (Shtok et al. SIGIR 2010). Suppose that we have a true model of relevance, $R_{I_q}$, for the information need $I_q$ that is represented by the query q. Then the ranking $\pi(D_q, R_{I_q})$, induced over the given result list using $R_{I_q}$, is the most effective for these documents. Accordingly, the utility (with respect to the information need) provided by the given ranking, which can be thought of as reflecting query performance, can be defined as $U(D_q|I_q) \triangleq Similarity(D_q, \pi(D_q, R_{I_q}))$. In practice, we have no explicit knowledge of the underlying information need and of $R_{I_q}$; using statistical decision theory principles, we can approximate the utility by estimating $R_{I_q}$: $U(D_q|I_q) \approx \int_{\hat{R}_q} Similarity(D_q, \pi(D_q, \hat{R}_q))\, p(\hat{R}_q|I_q)\, d\hat{R}_q$
  • 59. IBM Labs in Haifa – Instantiating predictors. $U(D_q|I_q) \approx \int_{\hat{R}_q} Similarity(D_q, \pi(D_q, \hat{R}_q))\, p(\hat{R}_q|I_q)\, d\hat{R}_q$. Relevance-model estimates ($\hat{R}_q$): relevance language models constructed from documents sampled from the highest ranks of some initial ranking. Estimate the relevance model's presumed “representativeness” of the information need ($p(\hat{R}_q|I_q)$): apply previously proposed predictors (Clarity, WIG, QF, NQC) upon the sampled documents from which the relevance model is constructed. Inter-list similarity measures: Pearson, Kendall's tau, Spearman. A specific, highly effective, instantiated predictor: construct a single relevance model from the given result list Dq; use a previously proposed predictor upon Dq to estimate relevance-model effectiveness; use Pearson's correlation between retrieval scores as the similarity measure.
  • 60. IBM Labs in Haifa – The UEF framework - A flow diagram. (Diagram: the query q is run by the search engine to produce the result list Dq = d1…dk; documents are sampled to construct a relevance model $\hat{R}_{q;S}$; the relevance model re-ranks Dq into $\pi(D_q, \hat{R}_{q;S})$ = d′1…d′k; a performance predictor estimates the quality of the relevance model; the ranking similarity between the original and re-ranked lists, combined with that estimate, yields the predicted AP(q).)
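The following sketch (a rough illustration of the UEF idea, not the authors' code) instantiates the single-relevance-model variant: re-rank the result list by a relevance model built from it, estimate the model's quality with a base predictor such as NQC, and combine that estimate with the rank/score correlation between the original and re-ranked lists. All inputs are assumed to be supplied by the retrieval system.

```python
import numpy as np
from scipy import stats

def uef(original_scores, rm_scores, base_predictor_value):
    """original_scores: retrieval scores of the result list Dq (in rank order).
    rm_scores: scores assigned to the same documents by a relevance model built from Dq.
    base_predictor_value: e.g., NQC computed on the documents used to build the model.
    Returns the UEF prediction: quality-of-model estimate times inter-ranking similarity."""
    similarity, _ = stats.pearsonr(original_scores, rm_scores)
    return base_predictor_value * similarity

# Toy usage with made-up numbers.
original = np.array([-8.1, -8.4, -9.0, -9.7, -10.2])
reranked = np.array([-7.9, -8.6, -9.1, -9.5, -10.4])
print(uef(original, reranked, base_predictor_value=0.42))
```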
  • 61. IBM Labs in Haifa – Prediction quality of UEF (Pearson correlation with true AP). (Figure: four bar charts over TREC4, TREC5, WT10G, ROBUST, and GOV2 comparing each base predictor with its UEF version: Clarity → UEF(Clarity) (+31.4%), WIG → UEF(WIG) (+27.8%), NQC → UEF(NQC) (+17.7%), QF → UEF(QF) (+24.7%).)
  • 62. IBM Labs in Haifa – Prediction quality for ClueWeb (Category A)*:
  Predictor | TREC 2009 LM | TREC 2009 LM+SpamRm | TREC 2010 LM | TREC 2010 LM+SpamRm
  SumVar (Zhao et al. 08) | .465 | .526 | .302 | .312
  SumIDF | .463 | .524 | .294 | .292
  Clarity | .017 | -.178 | -.385 | -.111
  UEF(Clarity) | .124 | .473 | .303 | .295
  ImpClarity (Hauff et al. '08) | .348 | .133 | .072 | -.008
  UEF(ImpClarity) | .221 | .580 | .340 | .366
  WIG | .423 | .542 | .269 | .349
  UEF(WIG) | .236 | .651 | .375 | .414
  NQC | .083 | .430 | .269 | .214
  UEF(NQC) | .154 | .633 | .342 | .460
  QF | .494 | .630 | .368 | .617
  UEF(QF) | .637 | .708 | .358 | .649
  * Thanks to Fiana Raiber for producing the results
  • 63. IBM Labs in Haifa Are the various post-retrieval predictors that different from each other? A unified framework for explaining post-retrieval predictors (Kurland et al. ICTIR 2011) 63 63
  • 64. IBM Labs in Haifa – A unified post-retrieval prediction framework. We want to predict the effectiveness of a ranking $\pi_M(q;D)$ of the corpus that was induced by retrieval method M in response to query q. Assume a true model of relevance $R_{I_q}$ that can be used for retrieval ⇒ the resultant ranking, $\pi_{opt}(q;D)$, is of optimal utility: $Utility(\pi_M(q;D); I_q) \triangleq sim(\pi_M(q;D), \pi_{opt}(q;D))$. Use as reference comparisons pseudo-effective (PE) and pseudo-ineffective (PIE) rankings (cf. Rocchio '71): $\widehat{Utility}(\pi_M(q;D); I_q) \triangleq \alpha(q)\, sim(\pi_M(q;D), \pi_{PE}(q;D)) - \beta(q)\, sim(\pi_M(q;D), \pi_{PIE}(q;D))$. (Diagram: the corpus ranking $\pi_M(q;D)$ is compared against a pseudo-effective ranking $\pi_{PE}(q;D)$ and a pseudo-ineffective ranking $\pi_{PIE}(q;D)$.)
  • 65. IBM Labs in Haifa – Instantiating predictors. Focus on the result lists of the documents most highly ranked by each ranking: $\hat{U}(\pi_M(q;D); I_q) \triangleq \alpha(q)\, sim(D_q, L_{PE}) - \beta(q)\, sim(D_q, L_{PIE})$. Deriving predictors: 1. "Guess" a PE result list ($L_{PE}$) and/or a PIE result list ($L_{PIE}$); 2. Select weights $\alpha(q)$ and $\beta(q)$; 3. Select an inter-list (ranking) similarity measure - Pearson's r, Kendall's-τ, Spearman's-ρ.
  • 66. IBM Labs in Haifa – $\hat{U}(\pi_M(q;D); I_q) \triangleq \alpha(q)\, sim(L_M, L_{PE}) - \beta(q)\, sim(L_M, L_{PIE})$. Basic idea: the PIE result list is composed of k copies of a pseudo-ineffective document.
  Predictor | Pseudo-ineffective document | Predictor's description | Sim. measure | α(q) | β(q)
  Clarity (Cronen-Townsend '02, '04) | Corpus | Estimates the focus of the result list with respect to the corpus | -KL divergence between language models | 0 | 1
  WIG (Weighted Information Gain; Zhou & Croft '07) | Corpus | Measures the difference between the retrieval scores of documents in the result list and the score of the corpus | -L1 distance of retrieval scores | 0 | 1
  NQC (Normalized Query Commitment; Shtok et al. '09) | Result list centroid | Measures the standard deviation of the retrieval scores of documents in the result list | -L2 distance of retrieval scores | 0 | 1
  • 67. IBM Labs in Haifa – $\hat{U}(\pi_M(q;D); I_q) \triangleq \alpha(q)\, sim(L_M, L_{PE}) - \beta(q)\, sim(L_M, L_{PIE})$.
  Predictor | Pseudo-effective result list | Predictor's description | Sim. measure | α(q) | β(q)
  QF (Query Feedback; Zhou & Croft '07) | Use a relevance model for retrieval over the corpus | Measures the “amount of noise” (non-query-related aspects) in the result list | Overlap at top ranks | 1 | 0
  UEF (Utility Estimation Framework; Shtok et al. '10) | Re-rank the given result list using a relevance model | Estimates the potential utility of the result list using relevance models | Rank/scores correlation | Presumed representativeness of the relevance model | 0
  Autocorrelation (Diaz '07) | 1. Score regularization; 2. Fusion-based result list | 1. The degree to which retrieval scores “respect the cluster hypothesis”; 2. Similarity with a fusion-based result list | Pearson correlation between scores | 1 | 0
  • 68. IBM Labs in Haifa A General Model for Query Difficulty 68 6812/07/2012 SIGIR 2010, Geneva 68 SIGIR 2012 Tutorial: Query Performance Prediction
  • 69. A model of query difficulty (Carmel et al, SIGIR IBM Labs in Haifa 2006) A user with a given information need (a topic): Submits a query to a search engine Judges the search results Topic according to their relevance to this information need. Thus, the query/ies and the Qrels Queries (Q) Documents (R) are two sides of the same information need Qrels also depend on the existing collection (C) Search engine Define: Topic = (Q,R|C) 69 6912/07/2012 SIGIR 2010, Geneva 69 SIGIR 2012 Tutorial: Query Performance Prediction
  • 70. IBM Labs in Haifa A Theoretical Model of Topic Difficulty Main Hypothesis Topic difficulty is induced from the distances between the model parts 70 7012/07/2012 SIGIR 2010, Geneva 70 SIGIR 2012 Tutorial: Query Performance Prediction
  • 71. IBM Labs in Haifa – Model validation: the Pearson correlation between Average Precision and the model distances (see the paper for estimates of the various distances). Based on the .gov2 collection (25M docs) and 100 topics of the Terabyte tracks 04/05. (Figure: correlations for the individual distances between the Topic, Queries, and Documents components are -0.06, 0.15, 0.17, and 0.32; combining the distances yields 0.45.)
  • 72. A probabilistic framework for QPP (Kurland et al. IBM Labs in Haifa CIKM 2012, to appear) 72 7212/07/2012 SIGIR 2010, Geneva 72 SIGIR 2012 Tutorial: Query Performance Prediction
  • 73. IBM Labs in Haifa Post-retrieval prediction (Kurland et al. 2012) query-independent result list properties (e.g., cohesion/dispersion) WIG and Clarity are derived from this expression 73 7312/07/2012 SIGIR 2010, Geneva 73 SIGIR 2012 Tutorial: Query Performance Prediction
  • 74. IBM Labs in Haifa Using reference lists (Kurland et al. 2012) The presumed quality of the reference list (a prediction task at its own right) 74 7412/07/2012 SIGIR 2010, Geneva 74 SIGIR 2012 Tutorial: Query Performance Prediction
  • 75. Applications of Query Difficulty Estimation IBM Haifa Research Lab © 2010 IBM Corporation
  • 76. IBM Labs in Haifa Main applications for QDE Feedback to the user and the system Federation and metasearch Content enhancement using missing content analysis Selective query expansion/ query selection Others 76 7612/07/2012 SIGIR 2010, Geneva 76 SIGIR 2012 Tutorial: Query Performance Prediction
  • 77. IBM Labs in Haifa Feedback to the user and the system Direct feedback to the user “We were unable to find relevant documents to your query” Estimating the value of terms from query refinement Suggest which terms to add, and estimate their value Personalization Which queries would benefit from personalization? (Teevan et al. ‘08) 77 7712/07/2012 SIGIR 2010, Geneva 77 SIGIR 2012 Tutorial: Query Performance Prediction
  • 78. IBM Labs in Haifa – Federation and metasearch: problem definition. Given several databases that might contain information relevant to a given question, how do we construct a good unified list of answers from all these datasets? Similarly, given a set of search engines employed for the same query, how do we merge their results in an optimal manner? (Diagram: several collections, each searched by its own engine; the per-engine results are federated into a merged ranking.)
  • 79. IBM Labs in Haifa – Prediction-based Federation & Metasearch. Weight each result set by its predicted precision (Yom-Tov et al., 2005; Sheldon et al. '11), or select the best predicted search engine (White et al., 2008; Berger & Savoy, 2007). (Diagram: per-collection search engines feed a federation step that produces the merged ranking.)
  • 80. Metasearch and Federation using Query IBM Labs in Haifa Prediction Train a predictor for each engine/collection pair For a given query: Predict its difficulty for each engine/collection pair Weight the results retrieved from each engine/collection pair accordingly, and generate the federated list 80 8012/07/2012 SIGIR 2010, Geneva 80 SIGIR 2012 Tutorial: Query Performance Prediction
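A toy sketch of the prediction-weighted merging step (an illustration of the idea, not a specific published system): each engine's result list is weighted by its predicted performance for the query, and documents are merged by their weighted scores.

```python
from collections import defaultdict

def prediction_weighted_merge(result_lists, predicted_quality):
    """result_lists: {engine: [(doc_id, score), ...]}; predicted_quality: {engine: weight}.
    Returns a merged ranking of (doc_id, combined_score) pairs."""
    combined = defaultdict(float)
    for engine, results in result_lists.items():
        w = predicted_quality[engine]
        for doc_id, score in results:
            combined[doc_id] += w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

runs = {
    "SE1": [("d1", 0.9), ("d2", 0.7), ("d3", 0.5)],
    "SE2": [("d2", 0.8), ("d4", 0.6), ("d1", 0.3)],
}
predicted = {"SE1": 0.25, "SE2": 0.60}   # e.g., predicted AP per engine/collection pair
print(prediction_weighted_merge(runs, predicted))
```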
  • 81. IBM Labs in Haifa – Metasearch experiment. The LA Times collection was indexed using four search engines; a predictor was trained for each search engine. For each query, the results list from each search engine was weighted using the prediction for the specific query. The final ranking is a ranking of the union of the results lists, weighted by the prediction.
  Setting | P@10 | %no
  Single search engine: SE 1 | 0.139 | 47.8
  Single search engine: SE 2 | 0.153 | 43.4
  Single search engine: SE 3 | 0.094 | 55.2
  Single search engine: SE 4 | 0.171 | 37.4
  Metasearch: Round-robin | 0.164 | 45.0
  Metasearch: MetaCrawler | 0.163 | 34.9
  Metasearch: Prediction-based | 0.183 | 31.7
  • 82. IBM Labs in Haifa – Federation for the TREC 2004 Terabyte track. GOV2 collection: 426GB, 25 million documents; 50 topics. For federation, the collection was divided into 10 partitions of roughly equal size.
  Setting | P@10 | MAP
  One collection | 0.522 | 0.292
  Prediction-based federation | 0.550 | 0.264
  Score-based federation | 0.498 | 0.257
  * 10-fold cross-validation
  • 83. IBM Labs in Haifa – Content enhancement using missing content analysis. (Diagram: information needs are submitted to the search engine over the repository; missing content estimation identifies needs that are not covered; data gathering brings in external content, whose quality is estimated before it is added to the repository.)
  • 84. IBM Labs in Haifa Story line A helpdesk system where system administrators try to find relevant solutions in a database The system administrators’ queries are logged and analyzed to find topics of interest that are not covered in the current repository Additional content for topics which are lacking is obtained from the internet, and is added to the local repository 84 8412/07/2012 SIGIR 2010, Geneva 84 SIGIR 2012 Tutorial: Query Performance Prediction
  • 85. IBM Labs in Haifa – Adding content where it is missing improves precision. Cluster user queries; add content according to: lacking content, size, random. Adding content where it is missing brings the best improvement in precision.
  • 86. IBM Labs in Haifa – Selective query expansion. Automatic query expansion: improving average quality by adding search terms. Pseudo-relevance feedback (PRF): considers the (few) top-ranked documents to be relevant, and uses them to expand the query. PRF works well on average, but can significantly degrade retrieval performance for some queries. PRF fails because of query drift: non-relevant documents infiltrate the top results, or relevant results contain aspects irrelevant to the query's original intention. Rocchio's method: the expanded query is obtained by combining the original query and the centroid of the top results, adding new terms to the query and reweighting the original terms: $Q_m = \alpha Q_{original} + \beta \sum_{D_j \in D_{relevant}} D_j - \gamma \sum_{D_k \in D_{irrelevant}} D_k$
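A small vector-space sketch of Rocchio-style pseudo-relevance feedback (illustrative; the weights α, β, γ, the use of the centroid (mean) rather than the plain sum, and the use of the top-k retrieved documents as the pseudo-relevant set are assumptions):

```python
import numpy as np

def rocchio_expand(query_vec, top_doc_vecs, nonrel_doc_vecs=None,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """query_vec and *_doc_vecs are term-weight vectors over a shared vocabulary.
    In PRF the top-k retrieved documents play the role of the relevant set."""
    expanded = alpha * query_vec + beta * np.mean(top_doc_vecs, axis=0)
    if nonrel_doc_vecs is not None and len(nonrel_doc_vecs) > 0:
        expanded -= gamma * np.mean(nonrel_doc_vecs, axis=0)
    return np.clip(expanded, 0.0, None)   # keep term weights non-negative

query = np.array([1.0, 0.0, 1.0, 0.0])          # e.g., tf.idf weights of the query terms
top_docs = np.array([[0.8, 0.4, 0.9, 0.1],
                     [0.7, 0.5, 0.6, 0.0]])
print(rocchio_expand(query, top_docs))
```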
  • 87. IBM Labs in Haifa Some work on selective query expansion Expand only “easy” queries Predict which are the easy queries and only expand them (Amati et al. 2004, Yom-Tov et al. 2005) Estimate query drift (Cronen-Townsend et al., 2006) Compare the expanded list with the unexpanded list to estimate if too much noise was added by using expansion Selecting a specific query expansion form from a list of candidates (Winaver et al. 2007) Integrating various query-expansion forms by weighting them using query-performance predictors (Soskin et al., 2009) Estimate how much expansion (Lv and Zhai, 2009) Predict the best α in the Rocchio formula using features of the query, the documents, and the relationship between them But, see Azzopardi and Hauff 2009 87 8712/07/2012 SIGIR 2010, Geneva 87 SIGIR 2012 Tutorial: Query Performance Prediction
  • 88. IBM Labs in Haifa Other uses of query difficulty estimation Collaborative filtering (Bellogin & Castells, 2009): Measure the lack of ambiguity in a users’ preference. An “easy” user is one whose preferences are clear-cut, and thus contributes much to a neighbor. Term selection (Kumaran & Carvalho, 2009): Identify irrelevant query terms Queries with fewer irrelevant terms tend to have better results Reducing long queries (Cummins et al., 2011) 88 8812/07/2012 SIGIR 2010, Geneva 88 SIGIR 2012 Tutorial: Query Performance Prediction
• 89. Summary & Conclusions
• 90. Summary
      In this tutorial, we surveyed the current state-of-the-art research on query difficulty
      estimation for IR
      We discussed the reasons that cause search engines to fail for some queries and that bring
      about high variability in performance among queries as well as among systems
      We summarized several approaches for query performance prediction: pre-retrieval methods,
      post-retrieval methods, and combining predictors
      We reviewed evaluation metrics for prediction quality and the results of various evaluation
      studies conducted over several TREC benchmarks
      These results show that existing state-of-the-art predictors are able to identify difficult
      queries with reasonable prediction quality
      However, prediction quality is still moderate and should be substantially improved before
      prediction can be widely used in IR tasks
• 91. Summary – Main Results
      Current linguistic-based predictors do not exhibit meaningful correlation with query
      performance; this is quite surprising, as poor performance can intuitively be expected for
      ambiguous queries
      In contrast, statistical pre-retrieval predictors such as SumSCQ and maxVAR have relatively
      significant predictive ability
      These pre-retrieval predictors, and a few others, exhibit performance comparable to
      post-retrieval methods such as Clarity, WIG, NQC, and QF over large-scale Web collections;
      this is counter-intuitive, as post-retrieval methods are exposed to much more information
      than pre-retrieval methods
      However, current state-of-the-art predictors still suffer from low robustness in prediction
      quality
      This robustness problem and the moderate prediction quality of existing predictors are two
      of the greatest challenges in query difficulty prediction and should be further explored in
      the future
• 92. Summary (cont.)
      We tested whether combining several predictors may improve prediction quality, especially
      when the predictors are independent and measure different aspects of the query and the
      search results
      An example of a combination method is linear regression: the regression task is to learn how
      to optimally combine the predicted performance values so that they best fit the actual
      performance values (a sketch follows below)
      Results were moderate, probably due to the sparseness of the training data, which
      over-represents the lower end of the performance values
      We discussed three frameworks for query-performance prediction:
      Utility Estimation Framework (UEF), Shtok et al. 2010 – state-of-the-art prediction quality
      A unified framework for post-retrieval prediction that sets common grounds for various
      previously proposed predictors, Kurland et al. 2011
      A fundamental framework for estimating query difficulty, Carmel et al. 2006
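A minimal sketch of the linear-regression combination mentioned above, assuming that for a set of training queries we already have the values of a few predictors (the columns of X, e.g., Clarity, WIG, NQC) and the actual average precision per query (y); the numbers below are toy values for illustration only.

```python
import numpy as np

# Rows of X: training queries; columns: individual predictor values.
X = np.array([[0.42, 1.30, 0.21],
              [0.10, 0.40, 0.05],
              [0.35, 1.10, 0.18],
              [0.05, 0.20, 0.02]])
y = np.array([0.31, 0.07, 0.25, 0.04])   # actual average precision per query

# Least-squares fit with an intercept term appended as a constant column.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
weights, *_ = np.linalg.lstsq(X1, y, rcond=None)

def combined_predictor(predictor_scores):
    """Predicted performance of a new query from its individual predictor scores."""
    return float(np.dot(np.append(predictor_scores, 1.0), weights))

print(combined_predictor([0.30, 0.90, 0.15]))
```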
• 93. Summary (cont.)
      We discussed a few applications that utilize query difficulty estimators:
      Handling each query individually based on its estimated difficulty
      Finding the best terms for query refinement by measuring the expected gain in performance
      for each candidate term
      Expanding the query or not, based on the predicted performance of the expanded query
      Personalizing the query selectively, only in cases where personalization is expected to
      bring value
      Collection enhancement guided by the identification of missing-content queries
      Fusion of search results from several sources based on their predicted quality
• 94. What's Next?
      Predicting the performance for other query types: navigational queries, XML queries (XQuery,
      XPath), domain-specific queries (e.g., healthcare)
      Considering other factors that may affect query difficulty: who is the person behind the
      query, and in what context? Geo-spatial features, temporal aspects, personal parameters
      Query difficulty in other search paradigms: multifaceted search, exploratory search
• 95. Concluding Remarks
      Research on query difficulty estimation began only about ten years ago, with the pioneering
      work on the Clarity predictor (2002); since then, this subfield has found its place at the
      center of IR research
      These studies have revealed alternative prediction approaches, new evaluation methodologies,
      and novel applications
      In this tutorial we covered existing performance-prediction methods, some evaluation
      studies, potential applications, and some anticipated future directions for the field
      While the progress already made is considerable, performance prediction is still challenging
      and far from being solved; much more accurate predictors are required before they can be
      widely adopted for IR tasks
      We hope that this tutorial will contribute to increasing interest in query difficulty
      estimation
• 96. Thank You!