4. IIR 08: Table of contents
• 8 Evaluation in information retrieval 151
– 8.1 Information retrieval system evaluation 152
– 8.2 Standard test collections 153
– 8.3 Evaluation of unranked retrieval sets 154
– 8.4 Evaluation of ranked retrieval results 158
– 8.5 Assessing relevance 164
• 8.5.1 Critiques and justifications of the concept of relevance 166
– 8.6 A broader perspective: System quality and user utility 168
• 8.6.1 System issues 168
• 8.6.2 User utility 169
• 8.6.3 Refining a deployed system 170
– 8.7 Results snippets 170
– 8.8 References and further reading 173
5. IIR 08 KEYWORDS
• relevance, gold standard = ground truth, information need,
  development test collections, TREC, precision, recall,
  accuracy, F measure, precision-recall curve,
  interpolated precision, eleven-point interpolated
  average precision, mean average precision (MAP),
  precision at k, R-precision, break-even point,
  ROC curve, sensitivity, specificity, cumulative gain,
  normalized discounted cumulative gain (NDCG), pooling,
  kappa statistic, marginal, marginal relevance,
  A/B testing, clickthrough log analysis = clickstream mining,
  snippet, static summary <-> dynamic summary,
  text summarization, keyword-in-context (KWIC)
7. Easily measurable metrics
• How fast does it index
– Number of documents/hour
– (Average document size)
• How fast does it search
– Latency as a function of index size
• Expressiveness of query language
– Ability to express complex information needs
– Speed on complex queries
• Uncluttered UI
• Is it free?
These are easy to evaluate (see the benchmarking sketch below)
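A minimal benchmarking sketch for these metrics, assuming hypothetical index_documents(docs) and search(index, query) functions that stand in for the engine under test (neither is part of any real library):

    import time

    def measure_engine(index_documents, search, docs, queries):
        # Indexing throughput: documents indexed per hour.
        t0 = time.perf_counter()
        index = index_documents(docs)
        indexing_time = time.perf_counter() - t0
        docs_per_hour = len(docs) / indexing_time * 3600

        # Search speed: average latency per query on an index of this size.
        t0 = time.perf_counter()
        for q in queries:
            search(index, q)
        avg_latency = (time.perf_counter() - t0) / len(queries)

        return {"docs_per_hour": docs_per_hour,
                "avg_query_latency_s": avg_latency,
                "index_size_docs": len(docs)}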
9. Happiness: elusive to measure
• Most common proxy: relevance of search results
– But how do you measure relevance?
• We will detail a methodology here, then examine its issues
• Relevance measurement requires 3 elements (a toy sketch follows below):
1. A benchmark document collection
2. A benchmark suite of queries
3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document
• Some work on more-than-binary, but not the standard
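A toy sketch of those three elements and of precision/recall against binary judgments; the document IDs, query text, and qrels dictionary below are invented for illustration:

    # 1. benchmark documents, 2. benchmark queries, 3. binary judgments
    docs = ["d1", "d2", "d3", "d4", "d5"]
    queries = {"q1": "red wine white heart attack effective"}
    qrels = {"q1": {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 0}}

    def precision_recall(retrieved, judgments):
        # Precision and recall of a retrieved set for one query,
        # given binary Relevant (1) / Nonrelevant (0) judgments.
        relevant = {d for d, rel in judgments.items() if rel == 1}
        retrieved = set(retrieved)
        true_positives = len(retrieved & relevant)
        precision = true_positives / len(retrieved) if retrieved else 0.0
        recall = true_positives / len(relevant) if relevant else 0.0
        return precision, recall

    print(precision_recall(["d1", "d2"], qrels["q1"]))  # (0.5, 0.5)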
10. Evaluating an IR system
• Note: the information need is translated into a query
• Relevance is assessed relative to the information need, not the query
– E.g.,
• Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
• Query: wine red white heart attack effective
query ⊂ information need
• ∴ human-made relevance judgment data is required
23. Evaluation
• Graphs are good, but people want summary measures!
– Precision at fixed retrieval level
• Precision-at-k: Precision of top k results (see the code sketch after this list)
• Perhaps appropriate for most of web search: all people want are good matches on the first one or two results pages
• But: averages badly and has an arbitrary parameter of k
– 11-point interpolated average precision
• The standard measure in the early TREC competitions: you take the precision at 11 recall levels, from 0 to 1 in steps of 0.1, using interpolation (the value for recall 0 is always interpolated!), and average them
• Evaluates performance at all recall levels
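A sketch of both summary measures for a single query, assuming ranked is the system's result list and relevant is the set of judged-relevant doc IDs (at least one relevant document is assumed to exist):

    def precision_at_k(ranked, relevant, k):
        # Precision of the top k results of a ranked list.
        return sum(1 for d in ranked[:k] if d in relevant) / k

    def eleven_point_interpolated_ap(ranked, relevant):
        # (recall, precision) after each rank position.
        points, hits = [], 0
        for i, d in enumerate(ranked, start=1):
            if d in relevant:
                hits += 1
            points.append((hits / len(relevant), hits / i))
        # Interpolated precision at recall level r is the maximum
        # precision at any recall >= r; average over r = 0.0, 0.1, ..., 1.0.
        interpolated = []
        for level in [i / 10 for i in range(11)]:
            candidates = [p for r, p in points if r >= level]
            interpolated.append(max(candidates) if candidates else 0.0)
        return sum(interpolated) / 11

    # e.g. precision_at_k(["d3", "d2", "d1", "d4"], {"d1", "d3"}, 2) == 0.5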
36. Can we avoid human judgment?
• No
• Makes experimental work hard
– Especially on a large scale
• In some very specific settings, can use proxies
– E.g.: for approximate vector space retrieval, we can compare how close, in cosine similarity, the docs returned by an approximate retrieval algorithm are to the truly closest docs (see the sketch below)
• But once we have test collections, we can reuse them (so long as we don't overtrain too badly)
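A sketch of that proxy comparison, assuming doc_vecs maps doc IDs to numpy vectors and exact_ids / approx_ids are the result lists of an exact and an approximate vector space retrieval run (all names are placeholders):

    import numpy as np

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def approximation_quality(query_vec, doc_vecs, exact_ids, approx_ids):
        # Average cosine similarity to the query of the docs each run returned;
        # a ratio near 1.0 means the approximate run's docs are nearly as
        # close to the query as the exact nearest docs.
        exact_avg = np.mean([cosine(query_vec, doc_vecs[d]) for d in exact_ids])
        approx_avg = np.mean([cosine(query_vec, doc_vecs[d]) for d in approx_ids])
        return approx_avg / exact_avg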
37. Fine.
• See also
– Tetsuya Sakai (Toshiba), "よりよい検索システム実現のために:正解の良し悪しを考慮した情報検索評価動向" [Toward better search systems: trends in IR evaluation that consider the quality of relevance data], IPSJ Magazine, Vol. 47, No. 2, Feb. 2006
• http://voice.fresheye.com/sakai/IPSJ-MGN470211.pdf