4. IIR 08: Table of contents
• 8 Evaluation in information retrieval 151
– 8.1 Information retrieval system evaluation 152
– 8.2 Standard test collections 153
– 8.3 Evaluation of unranked retrieval sets 154
– 8.4 Evaluation of ranked retrieval results 158
– 8.5 Assessing relevance 164
• 8.5.1 Critiques and justifications of the concept of relevance 166
– 8.6 A broader perspective: System quality and user utility 168
• 8.6.1 System issues 168
• 8.6.2 User utility 169
• 8.6.3 Refining a deployed system 170
– 8.7 Results snippets 170
– 8.8 References and further reading 173
5. IIR 08 KEYWORDS
• relevance, gold standard = ground truth, information need,
  development test collections, TREC, precision, recall,
  accuracy, F measure, precision-recall curve,
  interpolated precision, eleven-point interpolated
  average precision, mean average precision (MAP),
  precision at k, R-precision, break-even point,
  ROC curve, sensitivity, specificity, cumulative gain,
  normalized discounted cumulative gain (NDCG), pooling,
  kappa statistic, marginal, marginal relevance,
  A/B testing, clickthrough log analysis = clickstream mining,
  snippet, static summary <-> dynamic summary,
  text summarization, keyword-in-context (KWIC)
7. Easily measurable metrics
• How fast does it index
– Number of documents/hour
– (Average document size)
• How fast does it search
– Latency as a function of index size
• Expressiveness of query language
– Ability to express complex information needs
– Speed on complex queries
• Uncluttered UI
• Is it free?
These are easy to evaluate (see the benchmarking sketch below)
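A minimal benchmarking sketch for these metrics, assuming hypothetical index_documents(docs) and search(index, query) functions that stand in for the engine under test (neither is part of any real library):

    import time

    def measure_engine(index_documents, search, docs, queries):
        # Indexing throughput: documents indexed per hour.
        t0 = time.perf_counter()
        index = index_documents(docs)
        indexing_time = time.perf_counter() - t0
        docs_per_hour = len(docs) / indexing_time * 3600

        # Search speed: average latency per query on an index of this size.
        t0 = time.perf_counter()
        for q in queries:
            search(index, q)
        avg_latency = (time.perf_counter() - t0) / len(queries)

        return {"docs_per_hour": docs_per_hour,
                "avg_query_latency_s": avg_latency,
                "index_size_docs": len(docs)}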
9. Happiness: elusive to measure
• Most common proxy: relevance of search results
– But how do you measure relevance?
• We will detail a methodology here, then examine its issues
• Relevance measurement requires 3 elements (a toy sketch follows below):
1. A benchmark document collection
2. A benchmark suite of queries
3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document
• Some work on more-than-binary, but not the standard
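A toy sketch of those three elements and of precision/recall against binary judgments; the document IDs, query text, and qrels dictionary below are invented for illustration:

    # 1. benchmark documents, 2. benchmark queries, 3. binary judgments
    docs = ["d1", "d2", "d3", "d4", "d5"]
    queries = {"q1": "red wine white heart attack effective"}
    qrels = {"q1": {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 0}}

    def precision_recall(retrieved, judgments):
        # Precision and recall of a retrieved set for one query,
        # given binary Relevant (1) / Nonrelevant (0) judgments.
        relevant = {d for d, rel in judgments.items() if rel == 1}
        retrieved = set(retrieved)
        true_positives = len(retrieved & relevant)
        precision = true_positives / len(retrieved) if retrieved else 0.0
        recall = true_positives / len(relevant) if relevant else 0.0
        return precision, recall

    print(precision_recall(["d1", "d2"], qrels["q1"]))  # (0.5, 0.5)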
10. Evaluating an IR system
• Note: the information need is translated into a query
• Relevance is assessed relative to the information need, not the query
– E.g.,
• Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
• Query: wine red white heart attack effective
query ⊂ information need
• ∴ human-made relevance judgment data is required
23. Evaluation
• Graphs are good, but people want summary measures!
– Precision at fixed retrieval level
• Precision-at-k: Precision of top k results (see the code sketch after this list)
• Perhaps appropriate for most of web search: all people want are good matches on the first one or two results pages
• But: averages badly and has an arbitrary parameter of k
– 11-point interpolated average precision
• The standard measure in the early TREC competitions: you take the precision at 11 recall levels, from 0 to 1 in steps of 0.1, using interpolation (the value for recall 0 is always interpolated!), and average them
• Evaluates performance at all recall levels
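A sketch of both summary measures for a single query, assuming ranked is the system's result list and relevant is the set of judged-relevant doc IDs (at least one relevant document is assumed to exist):

    def precision_at_k(ranked, relevant, k):
        # Precision of the top k results of a ranked list.
        return sum(1 for d in ranked[:k] if d in relevant) / k

    def eleven_point_interpolated_ap(ranked, relevant):
        # (recall, precision) after each rank position.
        points, hits = [], 0
        for i, d in enumerate(ranked, start=1):
            if d in relevant:
                hits += 1
            points.append((hits / len(relevant), hits / i))
        # Interpolated precision at recall level r is the maximum
        # precision at any recall >= r; average over r = 0.0, 0.1, ..., 1.0.
        interpolated = []
        for level in [i / 10 for i in range(11)]:
            candidates = [p for r, p in points if r >= level]
            interpolated.append(max(candidates) if candidates else 0.0)
        return sum(interpolated) / 11

    # e.g. precision_at_k(["d3", "d2", "d1", "d4"], {"d1", "d3"}, 2) == 0.5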
36. Can we avoid human judgment?
• No
• Makes experimental work hard
– Especially on a large scale
• In some very specific settings, can use proxies
– E.g.: for approximate vector space retrieval, we can compare how close, in cosine similarity, the docs returned by an approximate retrieval algorithm are to the truly closest docs (see the sketch below)
• But once we have test collections, we can reuse them (so long as we don't overtrain too badly)
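A sketch of that proxy comparison, assuming doc_vecs maps doc IDs to numpy vectors and exact_ids / approx_ids are the result lists of an exact and an approximate vector space retrieval run (all names are placeholders):

    import numpy as np

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def approximation_quality(query_vec, doc_vecs, exact_ids, approx_ids):
        # Average cosine similarity to the query of the docs each run returned;
        # a ratio near 1.0 means the approximate run's docs are nearly as
        # close to the query as the exact nearest docs.
        exact_avg = np.mean([cosine(query_vec, doc_vecs[d]) for d in exact_ids])
        approx_avg = np.mean([cosine(query_vec, doc_vecs[d]) for d in approx_ids])
        return approx_avg / exact_avg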
37. Fine.
• See also
– Tetsuya Sakai (Toshiba), "よりよい検索システム実現のために:正解の良し悪しを考慮した情報検索評価動向" [Toward better search systems: trends in IR evaluation that consider the quality of relevance data], IPSJ Magazine, Vol. 47, No. 2, Feb. 2006
• http://voice.fresheye.com/sakai/IPSJ-MGN470211.pdf