The Philosophy of Information Retrieval Evaluation (2001)

       by Ellen Voorhees
The Author

• Computer scientist, Retrieval Group,
  NIST (15 years)
    o   TREC, TRECVid, and TAC: large-scale evaluation of
        technologies for processing natural language text and
        searching diverse media types
•   Research focus: "developing and validating
    appropriate evaluation schemes to measure system
    effectiveness in these areas"

• Siemens Corporate Research (9 years)
    o   factory automation, intelligent agents, agents
        applied to information access




                  http://www.linkedin.com/pub/ellen-voorhees/6/115/3b8
NIST (National Institute of Standards and
Technology)
• Non-regulatory agency of U.S. Dept of Commerce

• "Promote U.S. innovation and industrial competitiveness [...]
  enhance economic security and improve our quality of life"

• Estimated 2011 budget: $722 million

• Standards Reference Materials (experimental control samples,
  quality control benchmarks), election technology, ID cards

• 3 Nobel Prize Winners




          http://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology
Premises

• User-based evaluation (p.1)

  o   better, more direct measure of user needs
  o   BUT very expensive and difficult to execute properly

• System evaluation (p.1)

  o   less expensive
  o   abstraction of retrieval process
  o   can control variables
         ▪ increases power of comparative experiments
  o   diagnostic information about system behavior
The Cranfield Paradigm
• Dominant model for 4 decades (p.1)

• Cranfield 2 experiment (1960s) - first laboratory testing of IR systems
  (p.2)

   o   investigated which indexing language was best
   o   design: measure the performance of index languages
       free from contamination by operational variables
   o   aeronautics experts, aeronautics collection
   o   test collection: documents, information needs/topics,
       relevance judgment set (see the sketch below)
   o   assumptions:
          ▪ relevance approximated by topical similarity
          ▪ single judgment set representative of user population
          ▪ lists of relevant documents for each topic complete
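
To make the test-collection abstraction concrete, here is a minimal
sketch (illustrative Python; the names are ours, not the paper's):

    # Minimal sketch of a Cranfield-style test collection.
    from dataclasses import dataclass

    @dataclass
    class TestCollection:
        documents: dict   # doc_id -> document text
        topics: dict      # topic_id -> information-need statement
        qrels: dict       # topic_id -> set of doc_ids judged relevant;
                          # assumed complete under Cranfield's assumptions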
Modern Adaptations to the Cranfield Paradigm
• Assumptions no longer strictly true; need to decrease noise (p.3)
   o   modern collections larger and more diverse
   o   less complete relevance judgments

• Adaptations:
   o Ranked list of documents returned for each topic
        ▪ ordered by decreasing likelihood of relevance
   o Overall effectiveness computed as the average across
     topics (see the sketch after this list)
   o Large number of topics
   o Use pooling (judging only a subset of documents) instead of
     complete judgments (p.4)
   o Assumptions don't need to be strictly true for a test
     collection to be viable
        ▪ scores of different retrieval runs compared on the same
          test collection
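
The averaging adaptation, sketched with mean average precision (MAP), a
common TREC measure; an illustrative helper, not code from the paper:

    # Sketch: score each topic's ranked list, then average across topics.
    def average_precision(ranking, relevant):
        # ranking: doc_ids, best first; relevant: judged-relevant doc_ids
        hits, precision_sum = 0, 0.0
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant) if relevant else 0.0

    def mean_average_precision(run, qrels):
        # run: topic_id -> ranked list; qrels: topic_id -> relevant set
        return sum(average_precision(run[t], qrels[t]) for t in run) / len(run)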
How to Build a Test Collection
(TREC example)
• Set of documents and topics (reflective of operational setting
  and real tasks) (p.4)
   o e.g. law articles for law library

• Participants run topics against documents
   o return top documents per topic

• Pool formed, then judged by relevance assessors (pooling sketched
  below)
   o runs evaluated using the resulting relevance judgments (binary)

• Results returned to participants

• Relevance judgments turn documents and topics into test
  collection (p.5)
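
A sketch of the pooling step (TREC traditionally pools to depth 100; the
depth and names here are illustrative):

    # Sketch: form the judging pool for one topic from all submitted runs.
    def form_pool(runs_for_topic, depth=100):
        # runs_for_topic: ranked doc_id lists, one per submitted run
        pool = set()
        for ranking in runs_for_topic:
            pool.update(ranking[:depth])  # union of each run's top docs
        return pool                       # only these go to the assessors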
Effects of Pooling and Incomplete Judgments
• Pooling doesn't produce complete judgments (p.5)
   o Some relevant documents are never judged
   o Documents judged later tend to come from lower in the
     system rankings

• Incompleteness is skewed across topics (p.6)
   o topics with many relevant documents initially also gain the
     most new ones later

• What to do?
  o build a deep and diverse pool (p.9)
  o add recall-oriented manual runs to supplement the pool
  o opt for a smaller, fair judgment set rather than a larger,
    biased one (a fairness check is sketched below)
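
One way to test whether a judgment set is fair, in the spirit of the
uniques analyses discussed in the pooling literature (a hedged sketch,
not the paper's procedure): find the relevant documents only one run
contributed to the pool, re-score that run without them, and check that
its score barely moves.

    # Sketch: relevant documents that only `run` contributed to the pool.
    def unique_relevant(run, other_runs, relevant, depth=100):
        pooled_by_others = set()
        for other in other_runs:
            pooled_by_others.update(other[:depth])
        return {d for d in run[:depth]
                if d in relevant and d not in pooled_by_others}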
Assessor Relevance Judgments

• Judgments differ across judges and over time (p.9)

• Different assessors produce different relevance sets for the same
  topics (relevance is subjective)

• TREC: 3 judges (p.10)

• Overlap < 50%: assessors genuinely disagreed (overlap measured as
  sketched below)
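
Overlap here is the size of the intersection divided by the size of the
union of the assessors' relevant sets; a pairwise sketch:

    # Sketch: overlap between two assessors' relevant sets for one topic.
    def overlap(rel_a, rel_b):
        union = rel_a | rel_b
        return len(rel_a & rel_b) / len(union) if union else 1.0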
Evaluating with Assessor Inconsistency
• Rank the systems by the score each obtains (p.10)

• Query-relevance sets (qrels): built from different combinations of
  assessor judgments per topic

• Repeat experiments several times: (p.13)
  o different measures
  o different topic sets
  o different systems
  o different assessor groups

• Result: comparative evaluation is stable across these variations
  (ranking stability sketched below)
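
Stability means the ranking of systems changes little when the qrels
change; a rank correlation such as Kendall's tau quantifies this (a
sketch using scipy, which the paper does not prescribe):

    # Sketch: correlation between system rankings induced by two qrel sets.
    from scipy.stats import kendalltau

    def ranking_stability(scores_a, scores_b):
        # scores_a, scores_b: system name -> score under each qrel set
        systems = sorted(scores_a)  # same systems scored both ways
        tau, _ = kendalltau([scores_a[s] for s in systems],
                            [scores_b[s] for s in systems])
        return tau  # 1.0 = identical ranking, -1.0 = reversed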
Cross-Language Collections

• More difficult to build than monolingual collections (p.13)
  o separate set of assessors for each language
  o multiple assessors needed for a single topic
  o need diverse pools for all languages
       ▪ minority-language pools smaller and less diverse (p.14)

• What to do?
  o close coordination for consistency (p.13)
  o proceed with care
Discussion

• Do laboratory experiments translate to operational settings?

• Which metrics or evaluation scores are more meaningful to
  you?

• Are there other ways to reduce noise and error?
