Evaluation in Information Retrieval


      (Book chapter from C.D. Manning, P. Raghavan, and H. Schütze,
                Introduction to Information Retrieval)



                            Dishant Ailawadi
    INF384H / CS395T: Concepts of Information Retrieval (and Web Search) Fall11




                                         
Outline

● Why Evaluation?
● Standard Test Collections
● Precision and Recall
● Mean Average Precision
● Kappa Statistic
● R-Precision
● Summary




                           
Why Evaluation?


● There are many retrieval models, algorithms, and systems. Which one is the best?
● Measure the effect of adding new features.
● How far down the ranked list will a user need to look to find some or all relevant documents?
● Difficulty: relevance is not binary but continuous. How do we decide whether a document is relevant?



                                  
Standard Test Collections
A standard test collection consists of three things:
1. A document collection.
2. A set of queries on this collection.
3. A set of relevance judgments on those queries.

Each document in the test collection is given a binary classification (relevant or nonrelevant) with respect to a query. This decision is referred to as the gold standard or ground truth judgment of relevance.




                                  
Standard Test Collections

● Cranfield: collected in the UK in the 1950s. Too small to be used nowadays.
● TREC (Text REtrieval Conference)
    ● Early TREC had 50 information needs; TREC 6-8 provide 150 information needs over more than 500 thousand articles.
    ● More recently, the 25-million-page GOV2 collection has become available for research.
● NTCIR: East Asian language and cross-language IR systems.
● Cross Language Evaluation Forum (CLEF)
● Reuters-21578: the collection most used for text classification.



                                           
Evaluation Measures
                  Relevant               Non-relevant
Retrieved         True positives (tp)    False positives (fp)
Not retrieved     False negatives (fn)   True negatives (tn)

$$\text{recall} = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents}} = \frac{tp}{tp + fn}$$

$$\text{precision} = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}} = \frac{tp}{tp + fp}$$

Accuracy (how many correct selections?):

$$\text{accuracy} = \frac{tp + tn}{tp + fp + fn + tn}$$
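The three measures above translate directly into code. The following is a minimal sketch; the function names and the example counts are illustrative assumptions, not taken from the slides.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of retrieved documents that are relevant: tp / (tp + fp)."""
    retrieved = tp + fp
    return tp / retrieved if retrieved else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of relevant documents that were retrieved: tp / (tp + fn)."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all decisions that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts: 1000 documents, 60 of them relevant, 50 retrieved.
tp, fp, fn, tn = 40, 10, 20, 930
print(precision(tp, fp), recall(tp, fn), accuracy(tp, fp, fn, tn))
```

With counts like these, accuracy is dominated by the large number of true negatives, which is one reason precision and recall are usually preferred for retrieval.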
                                     
An Example
Let the total number of relevant docs = 6. Check each new recall point:

    n   doc #   relevant   recall and precision
    1   588     x          R = 1/6 = 0.167; P = 1/1 = 1
    2   589     x          R = 2/6 = 0.333; P = 2/2 = 1
    3   576
    4   590     x          R = 3/6 = 0.5;   P = 3/4 = 0.75
    5   986
    6   592     x          R = 4/6 = 0.667; P = 4/6 = 0.667
    7   984
    8   988
    9   578
   10   985
   11   103
   12   591
   13   772     x          R = 5/6 = 0.833; P = 5/13 = 0.38
   14   990

One relevant document is missing from the ranking, so 100% recall is never reached.
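The walk down the ranked list can be sketched as a short loop that records recall and precision each time a relevant document is found. The function name and the 0/1 encoding of the "relevant" column are assumptions made for illustration.

```python
def recall_precision_points(relevance_flags, total_relevant):
    """Return (rank, recall, precision) at every rank holding a relevant document."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            points.append((rank, hits / total_relevant, hits / rank))
    return points


# The 14-document ranking from the example; 1 marks a relevant document.
flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
points = recall_precision_points(flags, total_relevant=6)
for rank, r, p in points:
    print(f"rank {rank:2d}: R = {r:.3f}, P = {p:.3f}")
```

Because only five of the six relevant documents appear in the ranking, the last recall value stays below 1.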

                                 
Combining Precision & Recall
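F-Measure: weighted harmonic mean (HM) of precision and recall.

$$F = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$$

Value of β controls the trade-off:
● β = 1: Equally weight precision and recall.
● β > 1: Weight recall more.
● β < 1: Weight precision more.

For β = 1:

$$F = \frac{2PR}{P + R} = \frac{2}{\frac{1}{R} + \frac{1}{P}}$$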
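A small sketch of the weighted F-measure, assuming the usual definition from the cited book chapter, F = (β² + 1)PR / (β²P + R). The function name and the sample precision and recall values are illustrative.

```python
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision p and recall r."""
    b2 = beta * beta
    denominator = b2 * p + r
    if denominator == 0.0:
        # No meaningful score when the denominator vanishes.
        return 0.0
    return (b2 + 1.0) * p * r / denominator


p, r = 0.75, 0.5  # hypothetical precision and recall
f1 = f_measure(p, r)             # balanced
f2 = f_measure(p, r, beta=2.0)   # recall weighted more
f05 = f_measure(p, r, beta=0.5)  # precision weighted more
print(f1, f2, f05)
```

Since recall is the weaker of the two values here, emphasizing recall (β = 2) lowers the score and emphasizing precision (β = 0.5) raises it.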

                                   
Precision-Recall curve




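Interpolated precision: the interpolated precision at recall level r is the highest precision found at any recall level r' ≥ r. It is used to get a smooth curve.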
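As a sketch, interpolation can be written as a maximum over all measured points at or beyond a recall level, assuming the usual definition p_interp(r) = max precision over recall r' ≥ r. The function name is hypothetical, and the points are the recall/precision pairs from the 14-document example.

```python
def interpolated_precision(points, recall_level):
    """Highest precision among (recall, precision) points with recall >= recall_level."""
    candidates = [p for r, p in points if r >= recall_level]
    # Recall levels that are never reached get precision 0.
    return max(candidates) if candidates else 0.0


# (recall, precision) at each relevant document of the earlier example.
points = [(1 / 6, 1.0), (2 / 6, 1.0), (3 / 6, 0.75), (4 / 6, 4 / 6), (5 / 6, 5 / 13)]

for level in (0.0, 0.4, 0.7, 0.9):
    print(level, interpolated_precision(points, level))
```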

                                  
11-point Interpolated Average Precision

Recall   Interpolated precision
  0.0      1.00
  0.1      0.67
  0.2      0.63
  0.3      0.55
  0.4      0.45
  0.5      0.41
  0.6      0.36
  0.7      0.29
  0.8      0.13
  0.9      0.10
  1.0      0.08
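The 11-point figure is the arithmetic mean of the interpolated precision at the eleven recall levels 0.0, 0.1, ..., 1.0. A minimal sketch using the values from the table above; the function name is an assumption.

```python
def eleven_point_average(interpolated_precisions):
    """Mean of interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    if len(interpolated_precisions) != 11:
        raise ValueError("expected exactly 11 values, one per recall level")
    return sum(interpolated_precisions) / 11


# Interpolated precision at recall 0.0 through 1.0, copied from the table.
table = [1.00, 0.67, 0.63, 0.55, 0.45, 0.41, 0.36, 0.29, 0.13, 0.10, 0.08]
average = eleven_point_average(table)
print(f"11-point interpolated average precision: {average:.3f}")
```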

                         
Single Figure Measures

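Mean Average Precision (MAP): the mean of average precision over all queries.
Example: average precision for the single query in the earlier example: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633. The 0 term stands for the relevant document that is never retrieved.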
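A sketch of average precision and MAP follows. The function names and the second query are hypothetical. Relevant documents that are never retrieved contribute 0 because the sum is divided by the total number of relevant documents. With unrounded terms the example query comes out slightly above the 0.633 shown, since the slide rounds each precision value first.

```python
def average_precision(relevance_flags, total_relevant):
    """Sum of precision at each relevant rank, divided by all relevant docs."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


def mean_average_precision(queries):
    """queries: list of (relevance_flags, total_relevant), one entry per query."""
    return sum(average_precision(f, n) for f, n in queries) / len(queries)


example = ([1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], 6)  # ranking from the example
second = ([0, 1, 0, 1], 2)                                  # hypothetical second query
ap = average_precision(*example)
map_score = mean_average_precision([example, second])
print(f"AP = {ap:.4f}, MAP = {map_score:.4f}")
```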



Normalized Discounted Cumulative Gain (NDCG): for non-binary notions of relevance.



                              
Assessing Relevance
● Pooling: to obtain a subset of the collection related to a query
    – Use a set of search engines/algorithms
    – The top-k results (k is between 20 and 50 in TREC) are merged into a pool, duplicates are removed
    – Present the documents in a random order to analysts for relevance judgments
● Kappa statistic: if we have multiple judges on one information need, how consistent are those judges?

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$

    – P(A) is the proportion of the times that the judges agreed
    – P(E) is the proportion of the times they would be expected to agree by chance
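The pooling procedure described above can be sketched in a few lines: take the top-k results of each system, merge them without duplicates, and shuffle before handing the pool to the assessors. The function name, the `seed` parameter, and the three sample runs are assumptions for illustration.

```python
import random


def build_pool(runs, k, seed=None):
    """Merge the top-k documents of every run, drop duplicates, shuffle the result."""
    pool = []
    seen = set()
    for ranking in runs:
        for doc_id in ranking[:k]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    # Random order, so assessors cannot tell how any system ranked a document.
    random.Random(seed).shuffle(pool)
    return pool


# Hypothetical rankings from three systems for one query.
runs = [
    [588, 589, 576, 590],
    [589, 986, 588, 592],
    [984, 588, 590, 103],
]
pool = build_pool(runs, k=3, seed=0)
print(pool)
```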
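Example: Kappa Statistic
                        Judge 2 Relevance
                        Yes     No      Total
Judge 1       Yes       300     20      320
Relevance     No        10      70      80
              Total     310     90      400

Observed proportion of the times the judges agreed:
$$P(A) = \frac{300 + 70}{400} = 0.925$$

Pooled marginals:
$$P(\text{nonrelevant}) = \frac{80 + 90}{400 + 400} = 0.2125, \qquad P(\text{relevant}) = \frac{320 + 310}{400 + 400} = 0.7875$$

Probability that the two judges agreed by chance (max value = 1, min = 0.5):
$$P(E) = P(\text{nonrelevant})^2 + P(\text{relevant})^2 = 0.2125^2 + 0.7875^2 = 0.665$$

Kappa statistic:
$$\kappa = \frac{P(A) - P(E)}{1 - P(E)} = \frac{0.925 - 0.665}{1 - 0.665} = 0.776$$

A kappa value between 0.67 and 0.8 is taken as fair agreement, but a value below 0.67 is seen as data providing a dubious basis for evaluation.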
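The kappa computation for two judges can be sketched from the four cells of the agreement table, using pooled marginals for the chance-agreement term. The function and parameter names are assumptions; the counts are the ones in the table above.

```python
def kappa_two_judges(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two judges and binary relevance, using pooled marginals.

    yes_no counts documents judge 1 called relevant and judge 2 did not.
    """
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n
    # Pool both judges' "relevant" votes over all 2n judgments.
    p_relevant = ((yes_yes + yes_no) + (yes_yes + no_yes)) / (2 * n)
    p_nonrelevant = 1.0 - p_relevant
    p_chance = p_relevant ** 2 + p_nonrelevant ** 2
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_agree - p_chance) / (1.0 - p_chance)


kappa = kappa_two_judges(yes_yes=300, yes_no=20, no_yes=10, no_no=70)
print(f"kappa = {kappa:.3f}")
```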
Evaluation
R-Precision: R = number of relevant docs = 7

    n   doc #   relevant
    1   588     x
    2   589     x
    3   576
    4   590     x
    5   986
    6   592     x
    7   984
    8   988
    9   578
   10   985
   11   103
   12   591
   13   772     x
   14   990

Of the top R = 7 results, 4 are relevant, so R-Precision = 4/7 = 0.571.

A/B test: precisely one change between the current and previous system. We evaluate the effect of that change on the system.
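R-precision is precision measured at rank R, where R is the number of relevant documents for the query. A minimal sketch using the ranking above; the function name and the 0/1 encoding are illustrative assumptions.

```python
def r_precision(relevance_flags, total_relevant):
    """Precision over the top R results, where R = total_relevant."""
    if total_relevant <= 0:
        return 0.0
    top_r = relevance_flags[:total_relevant]
    return sum(1 for flag in top_r if flag) / total_relevant


# 1 marks a relevant document in the 14-document ranking; R = 7.
flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
score = r_precision(flags, total_relevant=7)
print(f"R-Precision = {score:.3f}")
```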




                               
Summary
● F-Measure: combines precision and recall.
● Recall-precision graph: conveys more information than a single-number measure.
● Mean average precision: single-number value, popular measure.
● Normalized Discounted Cumulative Gain (NDCG): single-number summary for each rank level, emphasizing top-ranked documents; relevance judgments are only needed to a specific rank depth (e.g., 10).
● Kappa measure: judgment reliability.
● R-Precision: only the top R ranked documents need to be examined, where R is the number of relevant documents.




                                 
THANK YOU!




         
