This document discusses evaluation in information retrieval. It describes standard test collections, which consist of a document collection, a set of queries on the collection, and relevance judgments. It also discusses evaluation measures used in information retrieval, such as precision, recall, the F-measure, and mean average precision, as well as the kappa statistic, which measures the reliability of relevance judgments. R-precision and normalized discounted cumulative gain are also summarized as important single-number evaluation measures.
1. Evaluation in Information Retrieval
(Book chapter from C.D. Manning, P. Raghavan, and H. Schütze,
Introduction to Information Retrieval)
Dishant Ailawadi
INF384H / CS395T: Concepts of Information Retrieval (and Web Search), Fall 2011
5. Standard Test Collections
● Cranfield: 1950s in the UK. Too small to be used nowadays.
● TREC (Text REtrieval Conference)
● Early TRECs had 50 information needs; TRECs 6-8 provide 150 information needs over more than 500 thousand articles.
● Recent work on the 25 million pages of GOV2 is now available for research.
● NTCIR: East Asian language and cross-language IR systems
● Cross-Language Evaluation Forum (CLEF)
● Reuters-21578: the collection most used for text classification.
6. Evaluation Measures

                  Relevant                  Non-relevant
Retrieved         true positives (tp)       false positives (fp)
Not retrieved     false negatives (fn)      true negatives (tn)

recall = (number of relevant documents retrieved) / (total number of relevant documents) = tp/(tp + fn)
precision = (number of relevant documents retrieved) / (total number of documents retrieved) = tp/(tp + fp)
Accuracy = (tp + tn)/(tp + fp + fn + tn)   (How many correct selections?)
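A minimal Python sketch (my illustration, not part of the slides) of these measures computed from the contingency counts; the example counts are hypothetical:

def precision(tp, fp):
    # Fraction of retrieved documents that are relevant.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of relevant documents that are retrieved.
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    # Fraction of all retrieve/not-retrieve decisions that are correct.
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical counts: 4 relevant retrieved, 3 non-relevant retrieved,
# 2 relevant missed, 91 non-relevant correctly not retrieved.
print(precision(4, 3))        # 0.571...
print(recall(4, 2))           # 0.666...
print(accuracy(4, 3, 2, 91))  # 0.95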
7. An Example

Let total # of relevant docs = 6. Check each new recall point:

n   doc #   relevant
1   588     x         R = 1/6 = 0.167;  P = 1/1 = 1
2   589     x         R = 2/6 = 0.333;  P = 2/2 = 1
3   576
4   590     x         R = 3/6 = 0.5;    P = 3/4 = 0.75
5   986
6   592     x         R = 4/6 = 0.667;  P = 4/6 = 0.667
7   984
8   988
9   578
10  985
11  103
12  591
13  772     x         R = 5/6 = 0.833;  P = 5/13 = 0.38
14  990

One relevant document is missing from the ranked list, so we never reach 100% recall.
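A short Python sketch (mine, not from the slides) that reproduces the recall/precision values above at each relevant document retrieved:

# Ranked list from the example; True marks the relevant documents.
ranked = [(588, True), (589, True), (576, False), (590, True), (986, False),
          (592, True), (984, False), (988, False), (578, False), (985, False),
          (103, False), (591, False), (772, True), (990, False)]
total_relevant = 6  # one relevant document never appears in the ranking

found = 0
for rank, (doc, relevant) in enumerate(ranked, start=1):
    if relevant:
        found += 1
        r = found / total_relevant
        p = found / rank
        print(f"rank {rank:2d}  doc {doc}  R = {r:.3f}  P = {p:.3f}")
# Recall stops at 5/6 = 0.833: the missing relevant document means
# we never reach 100% recall.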
8. Combining Precision & Recall
F-measure: weighted harmonic mean of precision and recall.
Value of β controls the tradeoff:
● β = 1: Weight precision and recall equally.
● β > 1: Weight recall more.
● β < 1: Weight precision more.

F_β = (β² + 1)PR / (β²P + R)

For β = 1 this is the balanced F-measure:

F = 2PR / (P + R) = 2 / (1/P + 1/R)
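A hedged Python sketch (not from the slides) of the general F_β measure; the precision/recall inputs are illustrative:

def f_measure(precision, recall, beta=1.0):
    # Weighted harmonic mean of precision and recall.
    # beta > 1 weights recall more; beta < 1 weights precision more.
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.75, 0.5))             # balanced F1 = 0.6
print(f_measure(0.75, 0.5, beta=2.0))   # recall-weighted F2 ≈ 0.536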
12. Assessing Relevance
● Pooling: to obtain a subset of the collection related to a query
  – Use a set of search engines/algorithms
  – The top-k results (k is typically between 20 and 50 in TREC) are merged into a pool and duplicates are removed
  – Present the documents in a random order to the analysts for relevance judgments
● Kappa statistic: if we have multiple judges on one information need, how consistent are those judges?
  kappa = (P(A) – P(E)) / (1 – P(E))
  – P(A) is the proportion of the times that the judges agreed
  – P(E) is the proportion of the times they would be expected to agree by chance
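To make the kappa computation concrete, here is a Python sketch (my illustration; the judgment counts are only an example) for two judges making yes/no relevance judgments, with P(E) estimated from pooled marginals:

def kappa(yes_yes, yes_no, no_yes, no_no):
    # Agreement between two judges on yes/no relevance judgments.
    # yes_no counts documents judge 1 marked relevant and judge 2 did not, etc.
    total = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / total                    # P(A)
    # Pooled marginal probability of a "yes" judgment, used for P(E).
    p_yes = (2 * yes_yes + yes_no + no_yes) / (2 * total)
    p_chance = p_yes ** 2 + (1 - p_yes) ** 2               # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

# Example counts for 400 judged documents.
print(kappa(300, 20, 10, 70))  # ≈ 0.776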
14. Evaluation

R-precision:
R = # of relevant docs = 7
R-precision = precision at rank R = 4/7 = 0.571

n   doc #   relevant
1   588     x
2   589     x
3   576
4   590     x
5   986
6   592     x
7   984
8   988
9   578
10  985
11  103
12  591
13  772     x
14  990

A/B test: precisely one change between the current and previous system; we evaluate the effect of that change on the system.
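An illustrative Python sketch (not from the slides) of R-precision for the ranked list above:

def r_precision(ranked_relevance, num_relevant):
    # Precision at rank R, where R is the number of relevant documents.
    # ranked_relevance: booleans by rank, True if that document is relevant.
    return sum(ranked_relevance[:num_relevant]) / num_relevant

# Relevance flags for the ranked list above (x at ranks 1, 2, 4, 6, 13).
flags = [True, True, False, True, False, True, False,
         False, False, False, False, False, True, False]
print(r_precision(flags, 7))  # 4/7 ≈ 0.571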