Investigating the Semantic Gap through Query Log Analysis

Peter Mika, Yahoo! Research
Edgar Meij, University of Amsterdam
Hugo Zaragoza, Yahoo! Research

The Semantic Gap

  • Significant efforts have focused on creating data
     – See Linking Open Data
  • Time to consider the extent to which data serves a purpose
     – Our purpose is fulfilling the information needs of our users
  • Using query logs to study the mismatch between
     – Data on the Web and information needs
     – Ontologies on the Web and expressions of information needs



     Demand (= needs)  vs.  Supply (= information)

The Data Gap
Lots of data or very little?




Figure: Linked Data cloud (March 2009)

RDFa on the rise

  • 510% increase between March 2009 and October 2010
  • Another five-fold increase between October 2010 and January 2012

Figure: Percentage of URLs with embedded metadata in various formats

Investigating the data gap through query logs

   • How big a Semantic Web do we need?
   • Just big enough… to answer all the questions that users may
     want to ask
      – Query logs are a record of what users want to know as a whole


   • Research questions:
      – How much of this data would ever surface through search?
      – What categories of queries can be answered?
      – What’s the role of large sites?




Method
  • Simulate the average search behavior of users by replaying
    query logs
     – Reproducible experiments (given query log data)
         • BOSS web search API returns RDF/XML metadata for search
           result URLs
  • Caveats
     – For us, and for the time being, search = document search
         • For this experiment, we assume current bag-of-words document
           retrieval is a reasonable approximation of semantic search
     – For us, search = web search
         • We are dealing with the average search user
         • There are many queries the users have learned not to ask
     – Volume is a rough approximation of value
         • There are rare information needs with high pay-offs, e.g. patent
           search, financial data, biomedical data…
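
The replay procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `search` and `metadata_formats` are hypothetical stand-ins (the BOSS API is not called here), and the toy index, URLs and queries are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy index: query -> ranked result URLs (stand-in for a search API).
TOY_RESULTS = {
    "pizza palo alto": ["http://a.example/1", "http://b.example/2", "http://c.example/3"],
    "aspirin side effects": ["http://d.example/1", "http://a.example/9"],
}

# Hypothetical metadata lookup: URL -> set of formats embedded in the page.
TOY_METADATA = {
    "http://a.example/1": {"hcard", "adr"},
    "http://b.example/2": {"hcard"},
    "http://a.example/9": {"rel-tag"},
}


def search(query, k=10):
    """Return the top-k result URLs for a query (toy stand-in)."""
    return TOY_RESULTS.get(query, [])[:k]


def metadata_formats(url):
    """Return the set of metadata formats found at a URL (toy stand-in)."""
    return TOY_METADATA.get(url, set())


def replay(queries, k=10):
    """For each format, count how many queries had 1, 2, ... top-k results with it."""
    histogram = defaultdict(Counter)  # format -> {number of results: number of queries}
    for query in queries:
        per_format = Counter()
        for url in search(query, k):
            formats = metadata_formats(url)
            if formats:
                per_format["ANY"] += 1  # result carries at least one format
            for fmt in formats:
                per_format[fmt] += 1
        for fmt, n_results in per_format.items():
            histogram[fmt][n_results] += 1
    return histogram


hist = replay(["pizza palo alto", "aspirin side effects", "unknown query"])
for fmt in sorted(hist):
    print(fmt, dict(hist[fmt]))
```

Queries with no metadata-bearing results simply do not appear in the histogram, which matches the convention of leaving out the zero column.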

Data

  • Microformats, eRDF, RDFa data
  • Query log data
       – US query log
       – Random sample of 7k queries
       – Recent query log covering a period of over a month
  • Query classification data
       – US query log
       – 1000 queries classified into various categories




Number of queries with a given number of results with particular formats (N=7081)

  • On average, a query has at least one result with metadata.
  • Are tags as useful as hCard?
  • license: that's only 1 in every 16 queries.

Columns 1 to 10 give the number of queries with that many results in the given format.

Format                   1     2     3     4     5     6     7     8     9    10  Impressions  Average impressions per query
ANY                   2127  1164   492   244    85    24    10     5     3     1         7623  1.08
hcard                 1457   370    93    11     3     0     0     0     0     0         2535  0.36
rel-tag               1317   350    95    44    14     8     6     3     1     1         2681  0.38
adr                    456    77    21     6     1     0     0     0     0     0          702  0.10
hatom                  450    52     8     1     0     0     0     0     0     0          582  0.08
license                359    21     1     1     0     0     0     0     0     0          408  0.06
xfn                    339    26     1     1     0     0     0     1     0     0          406  0.06

Notes:
- Queries with 0 results with metadata are not shown
- Numbers in a column cannot be added up: a query may return documents with different formats
- Assume queries return more than 10 results

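
The last two columns follow from the histogram columns: impressions are the sum of (number of results × number of queries), and the average divides that by N. A short check, using three rows copied from the table (the function name `impressions` is ours):

```python
N = 7081  # number of sampled queries

# Number of queries with exactly 1, 2, ..., 10 results in the given format.
rows = {
    "ANY":     [2127, 1164, 492, 244, 85, 24, 10, 5, 3, 1],
    "hcard":   [1457, 370, 93, 11, 3, 0, 0, 0, 0, 0],
    "license": [359, 21, 1, 1, 0, 0, 0, 0, 0, 0],
}


def impressions(counts):
    """Total results with the format: each query contributes its number of results."""
    return sum(n_results * n_queries
               for n_results, n_queries in enumerate(counts, start=1))


for name, counts in rows.items():
    total = impressions(counts)
    print(name, total, round(total / N, 2))
```

This reproduces the table values: 7623 and 1.08 for ANY, 2535 and 0.36 for hcard, 408 and 0.06 for license.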
The influence of head sites (N=7081)

  • If YouTube came up with a microformat, it would be the fifth most important.

Format / site            1     2     3     4     5     6     7     8     9    10  Impressions  Average impressions per query
ANY                   2127  1164   492   244    85    24    10     5     3     1         7623  1.08
hcard                 1457   370    93    11     3     0     0     0     0     0         2535  0.36
rel-tag               1317   350    95    44    14     8     6     3     1     1         2681  0.38
wikipedia.org         1676     1     0     0     0     0     0     0     1     0         1687  0.24
adr                    456    77    21     6     1     0     0     0     0     0          702  0.10
hatom                  450    52     8     1     0     0     0     0     0     0          582  0.08
youtube.com            475     1     0     0     0     0     0     2     0     0          493  0.07
license                359    21     1     1     0     0     0     0     0     0          408  0.06
xfn                    339    26     1     1     0     0     0     1     0     0          406  0.06
amazon.com             345     3     0     0     0     0     1     0     0     0          358  0.05

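
The YouTube remark can be checked against the impression counts on this slide. A small sketch, assuming the ranking is by impressions and considering only the formats listed here (other formats may exist that the slide does not show); `rank_among_formats` is a hypothetical helper name:

```python
# Impressions per microformat, copied from the table (N=7081).
microformats = {
    "rel-tag": 2681, "hcard": 2535, "adr": 702,
    "hatom": 582, "license": 408, "xfn": 406,
}

# Impressions for results from individual sites, also from the table.
sites = {"wikipedia.org": 1687, "youtube.com": 493, "amazon.com": 358}


def rank_among_formats(site):
    """Rank the site would hold if its pages counted as a format of their own."""
    return 1 + sum(1 for count in microformats.values() if count > sites[site])


for site in sites:
    print(site, rank_among_formats(site))
```

With these numbers youtube.com lands in fifth place, behind rel-tag, hcard, adr and hatom, and wikipedia.org in third.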
Restricted by category: local queries (N=129)

  • The query category largely determines which sites are important.

Format / site            1     2     3     4     5     6     7     8     9    10  Impressions  Average impressions per query
ANY                     36    16    10     0     4     1     0     0     0     0          124  0.96
hcard                   31     7     5     1     0     0     0     0     0     0           64  0.50
adr                     15     8     2     1     0     0     0     0     0     0           41  0.32
local.yahoo.com         24     0     0     0     0     0     0     0     0     0           24  0.19
en.wikipedia.org        24     0     0     0     0     0     0     0     0     0           24  0.19
rel-tag                 19     2     0     0     0     0     0     0     0     0           23  0.18
geo                     16     5     0     0     0     0     0     0     0     0           26  0.20
www.yelp.com            16     0     0     0     0     0     0     0     0     0           16  0.12
www.yellowpages.com     14     0     0     0     0     0     0     0     0     0           14  0.11

Summary

  • Time to start looking at the demand side of semantic search
     – Size is not a measure of usefulness
  • For us, and for now, it’s a matter of who is looking for it
         • “We would trade a gene bank for fajita recipes any day”
         • Reality of web monetization: pay per eyeball
  • Measure different aspects of usefulness
     – Usefulness for improving presentation but also usefulness for
       ranking, reasoning, disambiguation…
  • Site-based analysis
  • Linked Data will need to be studied separately




The Ontology Gap
Investigating the ontology gap through query logs
   •   Does the language of users match the ontology of the data?
         – Initial step: what is the language of users?
   •   Observation: objects of the same type often share the same query
       context
         – Users asking for the same aspect of the type

       Query                            Entity           Context            Class
       aspirin side effects             ASPIRIN          +side effects      Anti-inflammatory drugs
       ibuprofen side effects           IBUPROFEN        +side effects      Anti-inflammatory drugs
       how to take aspirin              ASPIRIN          -how to take       Anti-inflammatory drugs
       britney spears video             BRITNEY SPEARS   +video             American film actors
       britney spears shaves her head   BRITNEY SPEARS   +shaves her head   American film actors


   •   Idea: mine the context words (prefixes and postfixes) that are
       common to a class of objects
         – These are potential attributes or relationships
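
The mining idea can be sketched as below. This is a simplified illustration, not the authors' implementation: the entity dictionary `ENTITY_CLASS` and the longest-match splitting in `split_query` are assumptions, seeded with the example queries from this slide.

```python
from collections import Counter, defaultdict

# Hypothetical entity dictionary: entity string -> class (examples from the slide).
ENTITY_CLASS = {
    "aspirin": "Anti-inflammatory drugs",
    "ibuprofen": "Anti-inflammatory drugs",
    "britney spears": "American film actors",
}


def split_query(query):
    """Return (entity, prefix, postfix) for the longest known entity, or None."""
    tokens = query.lower().split()
    for length in range(len(tokens), 0, -1):  # try the longest spans first
        for start in range(len(tokens) - length + 1):
            candidate = " ".join(tokens[start:start + length])
            if candidate in ENTITY_CLASS:
                prefix = " ".join(tokens[:start])
                postfix = " ".join(tokens[start + length:])
                return candidate, prefix, postfix
    return None


def fixes_by_class(queries):
    """Count prefixes and postfixes per class of the matched entity."""
    counts = defaultdict(Counter)
    for query in queries:
        parsed = split_query(query)
        if parsed is None:
            continue  # no known entity in this query
        entity, prefix, postfix = parsed
        cls = ENTITY_CLASS[entity]
        if prefix:
            counts[cls][("prefix", prefix)] += 1
        if postfix:
            counts[cls][("postfix", postfix)] += 1
    return counts


queries = [
    "aspirin side effects",
    "ibuprofen side effects",
    "how to take aspirin",
    "britney spears video",
    "britney spears shaves her head",
]
counts = fixes_by_class(queries)
for cls, fixes in counts.items():
    print(cls, fixes.most_common())
```

On these five queries, "side effects" is the only context shared by two different entities of the same class, which is the kind of signal the slide proposes to mine.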
Models

  • Desirable properties:
     – P1: Fix (a prefix or postfix) is frequent within type
     – P2: Fix has frequencies well-distributed across entities
     – P3: Fix is infrequent outside of the type
  • Models:
     – Example query: "apple ipod nano review"
         • entity: apple ipod nano (type: product)
         • fix: review

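
The model formulas are not preserved in this transcript, so the sketch below does not reproduce M1 to M6. It only illustrates, on invented toy data, one simple statistic per desirable property: the relative frequency of a fix within a type (P1, and in the spirit of "most likely fix given type"), a normalised entropy over entities (P2), and the share of a fix's occurrences that fall inside the type (P3). All function names and the observations are assumptions.

```python
import math
from collections import Counter

# Toy observations (type, entity, fix); invented for illustration only.
OBSERVATIONS = [
    ("product", "apple ipod nano", "review"),
    ("product", "apple ipod nano", "review"),
    ("product", "canon eos 450d", "review"),
    ("product", "canon eos 450d", "price"),
    ("musical artist", "madonna", "lyrics"),
    ("musical artist", "madonna", "review"),
]


def p_fix_given_type(fix, type_):
    """P1: share of the type's observations that carry this fix."""
    in_type = [o for o in OBSERVATIONS if o[0] == type_]
    if not in_type:
        return 0.0
    return sum(1 for o in in_type if o[2] == fix) / len(in_type)


def entity_spread(fix, type_):
    """P2: entropy of the fix over the type's entities, normalised to [0, 1]."""
    entities = {e for t, e, _ in OBSERVATIONS if t == type_}
    counts = Counter(e for t, e, f in OBSERVATIONS if t == type_ and f == fix)
    total = sum(counts.values())
    if total == 0 or len(entities) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(entities))


def p_type_given_fix(fix, type_):
    """P3: share of the fix's observations that fall inside this type."""
    with_fix = [o for o in OBSERVATIONS if o[2] == fix]
    if not with_fix:
        return 0.0
    return sum(1 for o in with_fix if o[0] == type_) / len(with_fix)


for fix in ("review", "price"):
    print(fix,
          round(p_fix_given_type(fix, "product"), 2),
          round(entity_spread(fix, "product"), 2),
          round(p_type_given_fix(fix, "product"), 2))
```

In the toy data "review" is frequent and spread over both product entities but also occurs outside the type, while "price" is exclusive to the type but tied to a single entity; the three properties pull in different directions, which is why several models are worth comparing.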
Models cont.




Demo




Qualitative evaluation

   • Four Wikipedia templates of different sizes
   • Information needs shown in bold are those that would actually be
     fulfilled by infobox data

   Settlement         Musical artist            Drug              Football club
   hotels             lyrics                    buy               forum
   map                buy                       what is           news
   map of             pictures of               tablets           website
   weather            what is                   what is           homepage
   weather in         video                     side effects of   tickets
   flights to         download                  hydrochloride     official website
   weather            hotel                     online            badge
   hotel              dvd                       overdose          fixtures
   property in        mp3                       capsules          free
   cheap flights to   best                      addiction         logo
Evaluation by query prediction
   •   Idea: use this method for type-based query completion
       –   Expectation is that it improves infrequent queries
   •   Three days of UK query log for training, three days for testing
   •   Entity-based frequency as baseline (~current search suggest)
   •   Measures
       –   Recall at K, MRR (also per type)
   •   Variables
       –   models (M1-M6)
       –   number of fixes (1, 5, 10)
       –   mapping (templates vs. categories)
       –   type to use for a given entity
            •   Random
            •   Most frequent type
            •   Best type
            •   Combination
       –   To do: number of days of training
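
The two measures can be computed as in the sketch below. It assumes one correct completion per test query, which the slides do not spell out; under that assumption recall at K is the fraction of test queries whose true completion appears in the top K suggestions. The test cases are invented.

```python
def reciprocal_rank(ranked, target):
    """1 / rank of the target in the ranked list, or 0.0 if it is absent."""
    for rank, item in enumerate(ranked, start=1):
        if item == target:
            return 1.0 / rank
    return 0.0


def evaluate(predictions, k):
    """predictions: list of (ranked completions, true completion) pairs."""
    n = len(predictions)
    recall_at_k = sum(1 for ranked, target in predictions
                      if target in ranked[:k]) / n
    mrr = sum(reciprocal_rank(ranked, target)
              for ranked, target in predictions) / n
    return recall_at_k, mrr


# Toy test set: suggested fixes for an entity, and the fix the user actually typed.
test_cases = [
    (["side effects", "dosage", "overdose"], "side effects"),  # hit at rank 1
    (["lyrics", "video", "pictures of"], "video"),             # hit at rank 2
    (["hotels", "map", "weather"], "flights to"),              # miss
]

recall_1, mrr = evaluate(test_cases, k=1)
recall_5, _ = evaluate(test_cases, k=5)
print(round(recall_1, 2), round(recall_5, 2), round(mrr, 2))
```

Computing the same numbers separately per type, as the slide suggests, only requires grouping the test cases by the type of their entity before calling `evaluate`.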
Results: success rate (binned)




Summary

  •   Most likely fix given type (M1) works best
  •   Some improvement on query completion task
       – Win for rare queries
       – Raw frequency wins quickly because of entity-specific completions
  •   Potentially highly valuable resource for other applications
       – Facets
       – Automated or semi-automated construction of query-trigger patterns
       – Query classification
       – Characterizing and measuring the similarity of websites based on the
         entities, fixes and types that lead to the site
  •   Further work needed to turn this into a vocabulary engineering
      method



Open Questions

  • Measure information utility, not just volume
     – Looking at the demand side of information retrieval and how
       well it matches the supply of information
  • Data
     – Is the Semantic Web just growing or becoming more useful?
         • How well does it match the information needs of users?
         • Other measures of utility?
  • Ontologies
     – Do the properties of objects that we capture match what users
       are looking for?
         • Mismatch of language? Mismatch of needs?




Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Investigating the Semantic Gap through Query Log Analysis

  • 1. Investigating the Semantic Gap through Query Log Analysis • Peter Mika (Yahoo! Research), Edgar Meij (University of Amsterdam), Hugo Zaragoza (Yahoo! Research)
  • 2. The Semantic Gap • Significant efforts have focused on creating data – See Linking Open Data • Time to consider the extent to which data serves a purpose – Our purpose is fulfilling the information needs of our users • Using query logs to study the mismatch between – Data on the Web and information needs – Ontologies on the Web and expressions of information needs • Demand = needs; Supply = information
  • 4. Lots of data or very little? • Linked Data cloud (March 2009)
  • 5. RDFa on the rise • Percentage of URLs with embedded metadata in various formats – 510% increase between March 2009 and October 2010 – Another five-fold increase between October 2010 and January 2012
  • 6. Investigating the data gap through query logs • How big a Semantic Web do we need? • Just big enough… to answer all the questions that users may want to ask – Query logs are a record of what users want to know as a whole • Research questions: – How much of this data would ever surface through search? – What categories of queries can be answered? – What’s the role of large sites? -6-
  • 7. Method • Simulate the average search behavior of users by replaying query logs – Reproducible experiments (given query log data) • BOSS web search API returns RDF/XML metadata for search result URLs • Caveats – For us, and for the time being, search = document search • For this experiment, we assume current bag-of-words document retrieval is a reasonable approximation of semantic search – For us, search = web search • We are dealing with the average search user • There are many queries the users have learned not to ask – Volume is a rough approximation of value • There are rare information needs with high pay-offs, e.g. patent search, financial data, biomedical data… -7-
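The replay procedure on the Method slide can be sketched roughly as follows. This is only an illustration under stated assumptions: `fake_search` and `FAKE_INDEX` are hypothetical stand-ins for a search backend that reports which metadata formats each result URL carries (the slides name the BOSS web search API, whose actual interface is not reproduced here), and the queries are invented.

```python
from collections import Counter, defaultdict

def tally(queries, search, top_n=10):
    """For each format, count queries by how many of their top_n results carry it."""
    table = defaultdict(Counter)
    for query in queries:
        per_format = Counter()
        for formats in search(query)[:top_n]:
            for fmt in set(formats):
                per_format[fmt] += 1
            if formats:
                # "ANY" counts results carrying at least one metadata format.
                per_format["ANY"] += 1
        for fmt, k in per_format.items():
            table[fmt][k] += 1  # this query has exactly k results with fmt
    return table

# Hypothetical stand-in for a search backend: query -> list of results,
# each result being the set of metadata formats found on that URL.
FAKE_INDEX = {
    "pizza palo alto": [{"hcard", "adr"}, set(), {"hcard"}],
    "aspirin side effects": [set(), {"rel-tag"}],
    "tax forms": [set(), set()],
}

def fake_search(query):
    return FAKE_INDEX.get(query, [])

table = tally(FAKE_INDEX, fake_search)
for fmt in sorted(table):
    print(fmt, dict(table[fmt]))
```

Queries with no metadata-bearing results add nothing to the table, which matches the later slide's note that queries with 0 such results are not shown.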
  • 8. Data • Microformats, eRDF, RDFa data • Query log data – US query log – Random sample of 7k queries – Recent query log covering over a month period • Query classification data – US query log – 1000 queries classified into various categories -8-
  • 9. Number of queries with a given number of results with particular formats (N=7081)
      Format       1     2    3    4   5   6   7   8   9  10  Impressions  Avg. per query
      ANY       2127  1164  492  244  85  24  10   5   3   1         7623            1.08
      hcard     1457   370   93   11   3   0   0   0   0   0         2535            0.36
      rel-tag   1317   350   95   44  14   8   6   3   1   1         2681            0.38
      adr        456    77   21    6   1   0   0   0   0   0          702            0.10
      hatom      450    52    8    1   0   0   0   0   0   0          582            0.08
      license    359    21    1    1   0   0   0   0   0   0          408            0.06
      xfn        339    26    1    1   0   0   0   1   0   0          406            0.06
      Callouts: On average, a query has at least one result with metadata. Are tags as useful as hCard? That’s only 1 in every 16 queries.
      Notes: Queries with 0 results with metadata are not shown. You cannot add numbers in columns: a query may return documents with different formats. Assume queries return more than 10 results.
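A small worked check of the table's last two columns. The figures are consistent with Impressions being the sum over k of k times the number of queries with exactly k matching results, and with the average being Impressions divided by N=7081. That reading is inferred from the numbers rather than stated on the slide, and the helper names below are illustrative.

```python
N_QUERIES = 7081  # sample size given on the slide

# counts[k-1] = number of queries with exactly k results carrying the format
rows = {
    "ANY":     [2127, 1164, 492, 244, 85, 24, 10, 5, 3, 1],
    "hcard":   [1457, 370, 93, 11, 3, 0, 0, 0, 0, 0],
    "license": [359, 21, 1, 1, 0, 0, 0, 0, 0, 0],
}

def impressions(counts):
    """Total number of results carrying the format, summed over all queries."""
    return sum(k * c for k, c in enumerate(counts, start=1))

def avg_impressions(counts, n_queries=N_QUERIES):
    """Impressions per query, averaged over the whole sample."""
    return impressions(counts) / n_queries

for name, counts in rows.items():
    print(name, impressions(counts), round(avg_impressions(counts), 2))
# ANY 7623 1.08
# hcard 2535 0.36
# license 408 0.06
```

An average of about 0.06 is roughly one impression per 16 queries, which is where the "1 in every 16 queries" callout fits.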
  • 10. The influence of head sites (N=7081)
      Format / site      1     2    3    4   5   6   7   8   9  10  Impressions  Avg. per query
      ANY             2127  1164  492  244  85  24  10   5   3   1         7623            1.08
      hcard           1457   370   93   11   3   0   0   0   0   0         2535            0.36
      rel-tag         1317   350   95   44  14   8   6   3   1   1         2681            0.38
      wikipedia.org   1676     1    0    0   0   0   0   0   1   0         1687            0.24
      adr              456    77   21    6   1   0   0   0   0   0          702            0.10
      hatom            450    52    8    1   0   0   0   0   0   0          582            0.08
      youtube.com      475     1    0    0   0   0   0   2   0   0          493            0.07
      license          359    21    1    1   0   0   0   0   0   0          408            0.06
      xfn              339    26    1    1   0   0   0   1   0   0          406            0.06
      amazon.com       345     3    0    0   0   0   1   0   0   0          358            0.05
      Callout: If YouTube came up with a microformat, it would be the fifth most important.
  • 11. Restricted by category: local queries (N=129)
      Format / site           1    2    3    4   5   6   7   8   9  10  Impressions  Avg. per query
      ANY                    36   16   10    0   4   1   0   0   0   0          124            0.96
      hcard                  31    7    5    1   0   0   0   0   0   0           64            0.50
      adr                    15    8    2    1   0   0   0   0   0   0           41            0.32
      local.yahoo.com        24    0    0    0   0   0   0   0   0   0           24            0.19
      en.wikipedia.org       24    0    0    0   0   0   0   0   0   0           24            0.19
      rel-tag                19    2    0    0   0   0   0   0   0   0           23            0.18
      geo                    16    5    0    0   0   0   0   0   0   0           26            0.20
      www.yelp.com           16    0    0    0   0   0   0   0   0   0           16            0.12
      www.yellowpages.com    14    0    0    0   0   0   0   0   0   0           14            0.11
      Callout: The query category largely determines which sites are important.
  • 12. Summary • Time to start looking at the demand side of semantic search – Size is not a measure of usefulness • For us, and for now, it’s a matter of who is looking for it • “We would trade a gene bank for fajita recipes any day” • Reality of web monetization: pay per eyeball • Measure different aspects of usefulness – Usefulness for improving presentation but also usefulness for ranking, reasoning, disambiguation… • Site-based analysis • Linked Data will need to be studied separately - 12 -
  • 14. Investigating the ontology gap through query logs • Does the language of users match the ontology of the data? – Initial step: what is the language of users? • Observation: objects of the same type often have the same query context – Users ask for the same aspect of the type
      Query                           Entity          Context           Class
      aspirin side effects            ASPIRIN         +side effects     Anti-inflammatory drugs
      ibuprofen side effects          IBUPROFEN       +side effects     Anti-inflammatory drugs
      how to take aspirin             ASPIRIN         -how to take      Anti-inflammatory drugs
      britney spears video            BRITNEY SPEARS  +video            American film actors
      britney spears shaves her head  BRITNEY SPEARS  +shaves her head  American film actors
      • Idea: mine the context words (prefixes and postfixes) that are common to a class of objects – These are potential attributes or relationships
  • 15. Models • Desirable properties: – P1: Fix is frequent within type – P2: Fix has frequencies well-distributed across entities – P3: Fix is infrequent outside of the type • Models (example query “apple ipod nano review”: type = product, entity = apple ipod nano, fix = review)
  • 16. Models cont. - 16 -
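The model formulas on these two slides did not survive extraction. The editor's notes describe M1 as the probability of a fix given the type, and M2 as that probability normalized by the overall probability of the fix. A minimal sketch of those two scores follows, assuming probabilities are estimated from raw counts of (type, entity, fix) observations; the estimation method and the toy observations are assumptions, not the authors' data or implementation.

```python
from collections import Counter

# Hypothetical toy observations (type, entity, fix); not data from the study.
observations = [
    ("drug", "aspirin", "side effects"),
    ("drug", "ibuprofen", "side effects"),
    ("drug", "aspirin", "buy"),
    ("musical artist", "britney spears", "video"),
    ("musical artist", "britney spears", "buy"),
    ("musical artist", "madonna", "lyrics"),
]

type_fix = Counter((t, f) for t, _, f in observations)
type_total = Counter(t for t, _, _ in observations)
fix_total = Counter(f for _, _, f in observations)
n_obs = len(observations)

def m1(fix, typ):
    """P(fix | type): share of the type's observations that carry this fix."""
    return type_fix[(typ, fix)] / type_total[typ]

def m2(fix, typ):
    """P(fix | type) / P(fix): discounts fixes that are common across all types."""
    return m1(fix, typ) / (fix_total[fix] / n_obs)

def rank(typ, score):
    """Fixes seen with the type, best score first (ties broken alphabetically)."""
    fixes = {f for (t, f) in type_fix if t == typ}
    return sorted(fixes, key=lambda f: (-score(f, typ), f))

print(rank("drug", m1))
print(m2("side effects", "drug"), m2("buy", "drug"))
```

In this toy data "buy" occurs with both types, so M2 scores it lower for drugs than "side effects", which relates to property P3 (a fix should be infrequent outside the type).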
  • 17. Demo - 17 -
  • 18. Qualitative evaluation • Four wikipedia templates of different sizes • Bold are the information needs that would be actually fulfilled by infobox data Settlement Musical artist Drug Football club hotels lyrics buy forum map buy what is news map of pictures of tablets website weather what is what is homepage weather in video side effects of tickets flights to download hydrochloride official website weather hotel online badge hotel dvd overdose fixtures property in mp3 capsules free cheap flights to best addiction logo - 18 -
  • 19. Evaluation by query prediction • Idea: use this method for type-based query completion – Expectation is that it improves infrequent queries • Three days of UK query log for training, three days of testing • Entity-based frequency as baseline (~current search suggest) • Measures – Recall at K, MRR (also per type) • Variables – models (M1-M6) – number of fixes (1, 5, 10) – mapping (templates vs. categories) – type to use for a given entity • Random • Most frequent type • Best type • Combination – To do: number of days of training - 19 -
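For reference, the two measures named on this slide can be computed as below. This is a sketch under the assumption that each test case has exactly one correct completion, in which case recall at K reduces to the fraction of cases whose correct fix appears in the top K suggestions. The slide does not spell out the exact protocol, and the suggestion lists here are hypothetical.

```python
def recall_at_k(ranked_lists, truths, k):
    """Fraction of test cases whose true completion is among the top k suggestions."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k])
    return hits / len(truths)

def mean_reciprocal_rank(ranked_lists, truths):
    """Mean of 1/rank of the true completion; a miss contributes 0."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(truths)

# Hypothetical suggestions for three test queries and the fixes users actually typed.
ranked_lists = [
    ["side effects", "buy", "dosage"],
    ["lyrics", "video", "buy"],
    ["hotels", "map", "weather"],
]
truths = ["buy", "lyrics", "flights to"]

print(recall_at_k(ranked_lists, truths, 1))
print(recall_at_k(ranked_lists, truths, 2))
print(mean_reciprocal_rank(ranked_lists, truths))
```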
  • 20. Results: success rate (binned) - 20 -
  • 21. Summary • Most likely fix given type (M1) works best • Some improvement on query completion task – Win for rare queries – Raw frequency wins quickly because of entity-specific completions • Potentially highly valuable resource for other applications – Facets – Automated or semi-automated construction of query-trigger patterns – Query classification – Characterizing and measuring the similarity of websites based on the entities, fixes and types that lead to the site • Further work needed to turn this into a vocabulary engineering method - 21 -
  • 22. Open Questions • Measure information utility, not just volume – Looking at the demand side of information retrieval and how well it matches the supply of information • Data – Is the Semantic Web just growing or becoming more useful? • How well does it match the information needs of users? • Other measures of utility? • Ontologies – Do the properties of objects that we capture match what users are looking for? • Mismatch of language? Mismatch of needs? - 22 -

Editor's Notes

  1. First intro, and then work with many different people… and I’ve learned a lot.
  2. First intro, and then work with many different people… and I’ve learned a lot.
  3. Entity-independent measures: M1: probability of fix given type; M2: probability of fix given type, normalized by probability of fix (the more uncommon the fix, the better); M3: binary entropy function
  4. Entity-dependent measures: M4: