Information Retrieval

   James Melzer

   June 15, 2006




How Does Search Work?




The basics of search

• A search engine mediates between a user’s query and metadata surrogates for
  documents


• Documents are reduced to metadata


• User’s need is translated into a query


• Query terms are used to find matching metadata terms


• Lots and lots of room for error...




The search process

1. Crawl content for metadata


2. Index document terms into an inverted file;
   an inverted file is very fast to search


3. Search the index to identify the result set;
   search the index - not the documents


4. Rank the results for display;
   ranking is the hardest part
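
Steps 2 and 3 above can be illustrated with a minimal sketch. It assumes whitespace tokenization and boolean AND matching, and the names `build_index` and `search` are hypothetical; a real engine would add stemming, stop words, and ranking on top.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Only the index is consulted here, never the documents themselves.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Illustrative documents, not taken from the slides.
docs = {
    1: "search engines index documents",
    2: "ranking is the hardest part of search",
    3: "an inverted file maps terms to documents",
}
index = build_index(docs)
print(search(index, "documents index"))
```

Each lookup is a dictionary access plus a set intersection, which is why searching the inverted file is fast compared with scanning every document.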




Search algorithm 1

Term-based Ranking (tf/idf)


• tf = term frequency
  documents that use the query terms most are presumed to be most relevant


• idf = inverse document frequency
  rarer terms are better indicators of relevance


• Assumptions
  1) relevance can be measured with document terms
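
A rough sketch of tf/idf scoring follows. It assumes one common weighting (term count divided by document length, times the natural log of total documents over documents containing the term); many other variants exist, and the function name `tf_idf_scores` is hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query):
    """Score each document by summing tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized.values() if term in other)
            if df == 0:
                continue
            tf = counts[term] / max(len(tokens), 1)
            idf = math.log(n_docs / df)
            score += tf * idf
        scores[doc_id] = score
    return scores

# Illustrative documents, not taken from the slides.
docs = {
    "a": "fridge repair manual for fridge owners",
    "b": "manual for oven owners",
    "c": "garden tools manual",
}
scores = tf_idf_scores(docs, "fridge manual")
print(scores)
```

In this example "manual" appears in every document, so its idf is zero and it contributes nothing; only the rarer term "fridge" separates the documents.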




Search algorithm 2

PageRank (Google)


• Relevant set is still identified by term matching


• A revolution in ranking:
  based on linking between documents


• Assumptions:
  1) important sites link to other important sites
  2) if many people link to a site, it is important
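
The two assumptions above can be sketched as a simplified power iteration. This is a textbook approximation, not Google's production ranking; the damping factor of 0.85, the iteration count, and the link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a dict of page -> list of outbound links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "blog": ["home", "products"],
}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Here "home" ranks highest because every other page links to it, and "products" outranks "about" because it receives one extra inbound link.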




Citation Analysis

• Authors carefully select articles to cite


• The more citations an article gets,
  the better it must be


• Citations by authors who have a lot of citations confer their power on those
  they cite


• Aggregate and leverage all these small individual decisions...




How Complex is Google?

    Google has about 36 ranking algorithms

    Examples:

    Citation Analysis

    Statistical Clustering

    Parsing Document Structure

    Parsing Data in the Document

    Microcontent Parsing

How to Make Search Better?




Evaluating Search

Recall


the percentage of all relevant documents retrieved


100% recall means every relevant document is retrieved


Precision


the percentage of documents retrieved that are relevant


100% precision means only relevant documents are retrieved
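
As a worked example of the two definitions above, the sketch below computes both measures for one query. The document ids are invented, and it assumes the full set of relevant documents is known, which is rarely true outside a test collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, each as a fraction from 0 to 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, two of them relevant; five relevant documents exist.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d5", "d6", "d7"],
)
print(precision, recall)
```

Two of the four retrieved documents are relevant (precision 50%), and two of the five relevant documents were retrieved (recall 40%).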



Thoughts & Reservations about Evaluating Search

• Precision and Recall usually trade off against each other, so improving one
  often reduces the other.


• Given a corpus of content like the web (tens of billions of items)...
  Recall is unmeasurable, because nobody can count every relevant document,
  and thus essentially meaningless


• What is relevance?


• Measuring Precision depends on an agreed definition of relevance, which is
  tricky (human cataloging is only about 80% ‘accurate’ - relevance is very hard
  to quantify)

Zipf
Best Bets

• Manually selected results, tied to specific query terms or phrases


• User-driven phrases
  select the most-used phrases from search traffic;
  go for easy wins, because returns diminish sharply


• Business-driven phrases
  select phrases important to the business;
  such as product names or office locations;
  or politically sensitive phrases, so you can control the message people see
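
One possible way to wire best bets into a search front end is sketched below. The phrase table, URLs, and the `demo_engine` stand-in are all hypothetical; the point is only that hand-picked results are looked up by normalized phrase and shown ahead of the engine's own results.

```python
# Hypothetical best-bets table: normalized query phrase -> hand-picked URLs.
BEST_BETS = {
    "vacation policy": ["/hr/vacation-policy"],
    "office locations": ["/about/offices"],
}

def normalize(query):
    """Lowercase and collapse whitespace so minor variations still match."""
    return " ".join(query.lower().split())

def search_with_best_bets(query, organic_search):
    """Put manually selected results first, then the engine's own results."""
    pinned = BEST_BETS.get(normalize(query), [])
    organic = [url for url in organic_search(query) if url not in pinned]
    return pinned + organic

def demo_engine(query):
    """Stand-in for the real search engine (an assumption for this sketch)."""
    return ["/news/2006/offices-move", "/about/offices", "/contact"]

results = search_with_best_bets("  Office   Locations ", demo_engine)
print(results)
```

Because the table is keyed on exact phrases, it only pays off for the handful of most-used or business-critical queries, which matches the "easy wins" advice above.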




Relevance Feedback

• The user provides direct or indirect feedback on the search results


• Click tracking


• “More like this” or “Find similar”


• Clustering




Structured Search

• Designers use patterns in search behavior to guess the user’s intent;
  this requires a substantial understanding of user behavior;
  it may require structured content (although not necessarily)


Examples

• Zip Code -> Zip Code Lookup Tool

• Person’s name -> Directory Listing

• Product Name -> Shop or Support?

• Address -> Map this?

• Topic -> Introduction, Forms, Policies or Reports?
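
A few of the examples above could be detected with simple pattern rules, as in this sketch. The patterns assume US-style ZIP codes and street addresses, and the intent labels and name lists are hypothetical; real systems would derive such rules from observed search behavior.

```python
import re

# Assumed formats: 5-digit US ZIP (optionally ZIP+4) and "number street-name suffix".
ZIP_CODE = re.compile(r"^\d{5}(-\d{4})?$")
STREET_ADDRESS = re.compile(
    r"^\d+\s+\w+.*\b(st|street|ave|avenue|rd|road|blvd|court|ct)\b",
    re.IGNORECASE,
)

def guess_intent(query, staff_names=frozenset(), product_names=frozenset()):
    """Guess what the user wants from the shape of the query string."""
    q = " ".join(query.lower().split())
    if ZIP_CODE.match(q):
        return "zip_code_lookup"
    if STREET_ADDRESS.match(q):
        return "map"
    if q in staff_names:
        return "directory_listing"
    if q in product_names:
        return "shop_or_support"
    return "general_search"

print(guess_intent("22033"))
print(guess_intent("123 Main Street"))
```

Queries that match no rule fall through to ordinary search, so a wrong guess costs the user nothing more than the default behavior.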


Controlled Vocabularies

• Classification with a controlled vocabulary is the best way to ensure 100%
  Recall


• Lead-in synonyms
  enter “fridge”; get “refrigerator” instead;
  best if the collection is well-cataloged;
  increases precision (e.g. in a library)


• Term-expansion synonyms
  enter “refrigerator”; get “fridge” too;
  best if the collection is not well-cataloged;
  increases recall at the cost of precision (e.g. on eBay)


• Spell check on query phrases
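
The difference between the two synonym strategies, plus a naive spell check, might look like the sketch below. The synonym tables and vocabulary are invented for illustration, and the spell check simply uses the standard library's `difflib.get_close_matches` rather than any particular search product's feature.

```python
import difflib

# Hypothetical tables: lead-in maps variants to the one preferred term,
# while expansion adds variants alongside the term the user typed.
LEAD_IN = {"fridge": "refrigerator", "icebox": "refrigerator"}
EXPANSIONS = {"refrigerator": ["fridge", "icebox"]}
VOCABULARY = ["refrigerator", "freezer", "oven"]

def apply_lead_in(terms):
    """Replace each variant with its preferred term (favors precision)."""
    return [LEAD_IN.get(term, term) for term in terms]

def expand_terms(terms):
    """Keep each term and add its synonyms (favors recall)."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(EXPANSIONS.get(term, []))
    return expanded

def suggest_spelling(term):
    """Return the closest vocabulary term, or the term itself if none is close."""
    matches = difflib.get_close_matches(term, VOCABULARY, n=1)
    return matches[0] if matches else term

print(apply_lead_in(["fridge", "repair"]))
print(expand_terms(["refrigerator", "repair"]))
print(suggest_spelling("refridgerator"))
```

Lead-in substitution only works if documents are consistently cataloged under the preferred term, which is why it suits a well-cataloged collection; expansion makes no such assumption but returns more marginal matches.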

Why is search important?

IF:
About half of all users prefer to search first*


THEN:
What percentage of a content site’s development effort should be devoted to
search?


* This statistic is highly context-dependent. People’s behavior depends on the
  context of their actions. The stat is from Jared Spool.

Questions?
James Melzer
Information Architect
SRA International
james_melzer@sra.com




