SlideShare une entreprise Scribd logo
1  sur  109
Télécharger pour lire hors ligne
Semantic Search Tutorial
     Introduction


      Peter Mika|Yahoo! Research, Spain
            pmika@yahoo-inc.com

   Thanh Tran | Institute AIFB, KIT, Germany
         Tran@aifb.uni-karlsruhe.de
About the speakers
Peter Mika
  Senior Research Scientist
  Head of Semantic Search group at Yahoo! Research in Barcelona
  Semantic Search, Web Object Retrieval, Natural Language Processing
Tran Duc Thanh
  Assistant Professor at AIFB, Karlsruhe Institute ofTechnology
  Head of Semantic Search group
  Semantic Search, Semantic / Linked Data Management
Agenda

Introduction (10 min)
Semantic Web data (50 min)
  The RDF data model
  Publishing RDF
  Crawling and indexing RDF data
Query processing (40 min)
Ranking (30 min)
Result presentation (15 min)
Semantic Search evaluation (15 min)
Questions (5 min)
Why Semantic Search? I.
“We are at the beginning of search.“ (Marissa Mayer)
  Solved large classes of queries, e.g. navigational
  Heavy investment in computational power
  Remaining queries are hard, not solvable by brute force, and require a
  deep understanding of the world and human cognition
Background knowledge and metadata can help to address poorly
solved queries
Poorly solved information needs
Ambiguous searches                                     Many of these queries would not
  paris hilton                                         be asked by users, who learned
Long tail queries                                      over time what search technology
                                                       can and can not do.
  george bush (and I mean the beer brewer in Arizona)
Multimedia search
  paris hilton sexy
Imprecise or overly precise searches
  jim hendler
  pictures of strong adventures people
Precise searches for descriptions
  countries in africa
  32 year old computer scientist living in barcelona
  reliable digital camera under 300 dollars
Example: multiple interpretations
Why Semantic Search? II.
The Semantic Web is now a reality
  Large amounts of data published in RDF
  Heterogeneous data of varying quality
  Users who are not skilled in writing complex queries (e.g. SPARQL)
  and may not be experts in the domain
Searching data instead or in addition to searching documents
  Direct answers
  Novel search tasks
Example: direct answers in search




  Points of                   Faceted search
                    Information
  interest in
                    from the for Shopping Information box with
  Vienna, Austria             results
                    Knowledge               content from and links
                    Graph                   to Yahoo! Travel

                                  Since Aug, 2010,
                                  ‘regular’ search
                                  results are
                                  ‘Powered by Bing’
Novel search tasks
Aggregation of search results
  e.g. price comparison across websites
Analysis and prediction
  e.g. world temperature by 2020
Semantic profiling
  recommendations based on particular interests
Semantic log analysis
  understanding user behavior in terms of objects
Support for complex tasks
  e.g. booking a vacation using a combination of services
Contextual (pervasive, ambient) search
                                            Yahoo! Connected TV:
                                            Widget engine
                                            embedded into the TV




           Yahoo! IntoNow: recognize
           audio and show related content
Interactive search and task completion
Document retrieval and data retrieval
Information Retrieval (IR) support the retrieval of documents
(document retrieval)
  Representation based on lightweight syntax-centric models
  Work well for topical search
  Not so well for more complex information needs
  Web scale
Database (DB) and Knowledge-based Systems (KB) deliver more precise
answers (data retrieval)
  More expressive models
  Allow for complex queries
  Retrieve concrete answers that precisely match queries
  Not just matching and filtering, but also joins
  Limitations in scalability
Combination of document and data retrieval
Documents with metadata
  Metadata may be embedded inside the document
  I’m looking for documents that mention countries in Africa.
Data retrieval
  Structured data, but searchable text fields
  I’m looking for directors, who have directed movies where the synopsis
  mentions dinosaurs.
Semantic Search
Target (combination of) document and data retrieval
Semantic search is a retrieval paradigm that
  Exploits the structure/semantics of the data or explicit background
  knowledge to understand user intent and the meaning of content
  Incorporates the intent of the query and the meaning of content into
  the search process (semantic models)
Wide range of semantic search systems
  Employ different semantic models, possibly at different steps of the
  search process and in order to support different tasks
Semantic models
Semantics is concerned with the meaning of the resources made
available for search
  Various representations of meaning
Linguistic models: models of relationships among words
  Taxonomies, thesauri, dictionaries of entity names
  Inference along linguistic relations, e.g. broader/narrower terms
Conceptual models: models of relationships among objects
  Ontologies capture entities in the world and their relationships
  Inference along domain-specific relations
We will focus on conceptual models in this tutorial
  In particular, the RDF/OWL conceptual model for representing
  classes in a domain, and describing their instances
Semantic Search – a process view

                                                          Knowledge Representation


                                                 Semantic Models
               • Keywords                                                            Resources
               • Forms
  Query        • NL
Construction   • Formal language



               •IR-style matching & ranking
               •DB-style precise matching
   Query       •KB-style matching & inferences
 Processing



               •Query visualization
               •Document and data presentation                                        Documents
   Result      •Summarization
Presentation


               •Implicit feedback
               •Explicit feedback
  Query        •Incentives
Refinement



                                                           Document Representation
Semantic Search systems
For data / document retrieval, semantic search systems might
combine a range of techniques, ranging from statistics-based IR
methods for ranking, database methods for efficient
indexing and query processing, up to complex reasoning
techniques for making inferences!
Example: Information Workbench
Addressing the lifecycle of
interacting with the Web of Data
  Integration of data sources
  Content generation by the end user
  Search and Exploration
  Visualization
  Publishing
                                                User-
                                              generated   Wikipedia
Integrated management of
heterogeneous data sources                                DBpedia, Yago
  Structured and unstructured
                                                          Earthquake (Data.gov)
  Published and user-generated
  Static and dynamic
  Open domain
                                 Structured                Dynamic
Data Sources in the Application
Entire English Wikipedia

Data from Linked Open Data
  DBpedia
  YAGO
  …


Data from Data.gov (US Government)
  E.g. live data about earthquakes

Many more
Semantic Search
Hybrid Search: Structured queries combined with keywords
across structured and unstructured data sources

Query interpretation: Translation of keywords into hybrid
queries

Keyword search/query interpretation combined with
faceted search: iterative refinement process based on keywords
and operations on facets
Search, Refinement and Navigation
                                                     Keywords




                                                                Query
                                                                Translations

                                                                  Term
                                                                  Completions




     Vorlesung Knowledge Discovery - Institut AIFB
                   Facets

21
Result Inspection, Analysis and Browsing
Semantic Web data
Data on the Web
Data on the Web is not directly accessible
  Most web pages are generated from databases, but formatted for
  human consumption
  APIs offer limited views over data
Two solutions
  Extraction using Information Extraction (IE) techniques
    Out of scope for this tutorial
  Relying on publishers to expose structured data using standard
  Semantic Web formats
Information extraction
Natural Language Processing
  Named entity recognition and disambiguation, sentiment analysis etc.
Extraction of information about entities
  Suchanek et al. YAGO: A Core of Semantic Knowledge UnifyingWordNet and
  Wikipedia,WWW, 2007.
  Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007.
Extraction from HTML tables
  Cafarella et al. WebTables: Exploring the Power of Tables on theWeb.VLDB 2008
Wrapper induction
  Kushmerick et al.Wrapper Induction for Information ExtractionText extraction. IJCAI
  2007
Filling web forms automatically (form-filling)
  Madhavan et al. Google's Deep-Web Crawl.VLDB 2008
Semantic Web
Sharing data across the Web
  Standard data model
    RDF
  A number of syntaxes (file formats)
    RDF/XML, RDFa
  Powerful, logic-based languages for schemas
    OWL, RIF
  Query languages and protocols
    HTTP, SPARQL
Resource Description Framework (RDF)
Each resource (thing, entity) is identified by a URI
  Globally unique identifiers
  URLs a subset of URIs
  Often abbreviated using namespaces
     e.g. example:roi = http://example.org/roi
RDF represents knowledge as a set of triples
  Each triple is a single fact about the entity (an attribute or a relationship)
A set of triples forms an RDF graph

  RDF document
                                 type        foaf:Person

                example:roi         name

                                            “Roi Blanco”
Linking resources
Roi’s homepage                                Friend-of-a-Friend ontology
                                  type
   example:roi                                      foaf:Person
                     name


                            “Roi Blanco”                     knows
          sameAs



Yahoo!’s website                             type

                   worksWith
  #roi2                         #peter

                                           email

                                              “pmika@yahoo-inc.com”
Ontologies
Ontologies are the schemas for RDF graphs
  Define the intended meaning of certain classes and relationships in a
  domain
    e.g. the FOAF ontology defines the foaf:Person class and relationships such as
    foaf:knows
Ontologies are published in standard languages such as OWL
  Language defines the constructs for modeling and their meaning
    e.g. subClassOf, sameAs
  Tools for editing ontologies such as Protégé, TopBraid Composer
Reusing and extending well-known ontologies helps to interpret
unknown data
  e.g. if X is subClassOf foaf:Person, and A is of type X, then we know
  that A is a person
Example: schema.org
Agreement on a shared set of schemas for common types of web
content
  Bing, Google, and Yahoo! as initial supporters
  Similar in intent to sitemaps.org (2006)
    Use a single format to communicate the same information to all three search
    engines
Support for microdata
schema.org covers areas of interest to all search engines
  Business listings (local), creative works (video), recipes, reviews
  User defined extensions
Each search engine continues to develop its products
Documentation and OWL ontology
Publishing RDF

Interlinked RDF documents (Linked Data)
  Each document describes a single resource with URIs pointing to
  related resources
  Common RDF file formats are RDF/XML and Turtle
  Mostly implemented as a wrapper around a database or Web service
Embedding RDF inside HTML
  RDFa, microdata
SPARQL endpoints
  Triple stores are databases for managing RDF data
  SPARQL is a standard protocol and query language for accessing
  triple stores using HTTP
Linked Data
Grass-roots community effort to (re)publish open datasets in RDF
  Centered around the Dbpedia project
  Directory of datasets at linkeddata.org, thedatahub.org
Metadata in HTML anno 1995

                      What does this term mean?
<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-
  html/">
<META name="DC.author" content="Peter Mika">
<LINK rel="DC.rights copyright"
      href="http://www.example.org/rights.html" />
<LINK rel="meta" type="application/rdf+xml" title="FOAF"
     href= "http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>

                        How is this data related?
Microformats (anno 2003)
  Mark-up for data in HTML pages
     Reuse existing HTML elements (class, rel)
  Microformats exist for a limited set of objects
     Persons, organizations, reviews, events etc., see microformats.org
  No formal schemas
     Limited reuse, extensibility of schemas
     Unclear which combinations are allowed
  No identifiers for entities
     No interlinking between entities

<div class="vcard">
 <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
 <div class="tel">+1-919-555-7878</div>
 <div class="title">Area Administrator, Assistant</div>
</div>
RDFa and RDFa Lite (2008-2012)
W3C standards for embedding RDF data in HTML documents
  A set of new HTML attributes to be used in head or body
  A specification of how to extract the data from these attributes
RDFa is just a syntax, you have to choose a vocabulary separately
RDFa family
  RDFa 1.0
    Recommendation since October, 2008
  RDFa 1.1 is a small update on RDFa to make it easier to use
    RDFa Primer (Working Draft, May 8, 2012)
  RDFa 1.1 Lite is a subset of RDFa 1.1 with the most commonly used
  features
    Proposed Recommendation (May 8, 2012)
Example: Facebook’s Open Graph Protocol
The ‘Like’ button provides publishers with a way to promote their
content on Facebook and build communities
  Shows up in profiles and news feed
  Site owners can later reach users who have liked an object
  Facebook Graph API allows 3rd party developers to access the data
Open Graph Protocol is an RDFa-based format that allows to
describe the object that the user ‘Likes’
Example: Facebook’s Open Graph Protocol
 RDF vocabulary to be used in conjunction with RDFa
    Simplify the work of developers by restricting the freedom in RDFa
 Activities, Businesses, Groups, Organizations, People, Places, Products and
 Entertainment
 Only HTML <head> accepted

<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
   <title>The Rock (1996)</title>
   <meta property="og:title" content="The Rock" />
   <meta property="og:type" content="movie" />
   <meta property="og:url" content="http://www.imdb.com/title/tt0117500/"
   />
   <meta property="og:image" content="http://ia.media-
   imdb.com/images/rock.jpg" /> …
</head> ...
Microdata
 HTML5
    Currently under standardization at the W3C
    Originally part of the HTML5 spec, but now a separate document
 Comparable to RDFa 1.1 Lite
    Key extensibility features (such as multiple types) missing
 HTML5 also has a number of “semantic” elements such as <time>,
 <video>, <article>…

<div itemscope itemid=“http://www.yahoo.com/resource/person”>
<p>My name is <span itemprop="name">Neil</span>.</p>
<p>My band is called <span itemprop="band">Four Parts Water</span>.
   I was born on <time itemprop="birthday" datetime="2009-05-10"> May 10th
   2009</time>.
   <img itemprop="image" src=”me.png" alt=”me”></p>
</div>
Current state of metadata on the Web
31% of webpages, 5% of domains contain some metadata
  Analysis of the Bing Crawl (US crawl, January, 2012)
  RDFa is most common format
    By URL: 25% RDFa, 7% microdata, 9% microformat
    By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
  Adoption is stronger among large publishers
    Especially for RDFa and microdata
See also
  P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW
  2012
  H.Mühleisen, C.Bizer.Web Data Commons - Extracting Structured
  Data from Two Large Web Corpora, LDOW 2012
Exponential growth in RDFa data

                    Another five-fold increase between
                    October 2010 and January, 2012




                Five-fold increase between
                March, 2009 and October, 2010




Percentage of URLs with embedded metadata in various formats
Crawling and indexing
Crawling the Semantic Web
Linked Data
  Similar to HTML crawling, but the the crawler needs to parse
  RDF/XML (and others) to extract URIs to be crawled
  Semantic Sitemap/VOID descriptions
RDFa
  Same as HTML crawling, but data is extracted after crawling
  Mika et al. Investigating the Semantic Gap through Query Log
  Analysis, ISWC 2010.
SPARQL endpoints
  Endpoints are not linked, need to be discovered by other means
  Semantic Sitemap/VOID descriptions
Data fusion
Ontology matching
  Widely studied in Semantic Web research, see e.g. list of publications at
  ontologymatching.org
     Unfortunately, not much of it is applicable in a Web context due to the quality of
     ontologies
Entity resolution
  Logic-based approaches in the Semantic Web
  Studied as record linkage in the database literature
     Machine learning based approaches, focusing on attributes
  Graph-based approaches, see e.g. the work of Lisa Getoor are applicable to
  RDF data
     Improvements over only attribute based matching
Blending
  Merging objects that represent the same real world entity and reconciling
  information from multiple sources
Data quality assessment and curation
Heterogeneity, quality of data is an even larger issue
  Quality ranges from well-curated data sets (e.g. Freebase) to microformats
     In the worst of cases, the data becomes a graph of words
  Short amounts of text: prone to mistakes in data entry or extraction
     Example: mistake in a phone number or state code
Quality assessment and data curation
  Quality varies from data created by experts to user-generated content
  Automated data validation
     Against known-good data or using triangulation
     Validation against the ontology or using probabilistic models
  Data validation by trained professionals or crowdsourcing
     Sampling data for evaluation
  Curation based on user feedback
Indexing
Search requires matching and ranking
  Matching selects a subset of the elements to be scored
The goal of indexing is to speed up matching
  Retrieval needs to be performed in milliseconds
  Without an index, retrieval would require streaming through the
  collection
The type of index depends on the query model to support
  DB-style indexing
  IR-style indexing
IR-style indexing
Index data as text
  Create virtual documents from data
  One virtual document per subgraph, resource or triple
    typically: resource
Key differences to Text Retrieval
  RDF data is structured
  Minimally, queries on property values are required
Horizontal index structure

Two fields (indices): one for terms, one for properties
For each term, store the property on the same position in the property
index
  Positions are required even without phrase queries
Query engine needs to support the alignment operator
✓ Dictionary is number of unique terms + number of properties
Occurrences is number of tokens * 2
Vertical index structure

One field (index) per property
Positions are not required
  But useful for phrase queries
Query engine needs to support fields
Dictionary is number of unique terms
Occurrences is number of tokens
✗ Number of fields is a problem for merging, query performance
Distributed indexing
MapReduce is ideal for building inverted indices
  Map creates (term, {doc1}) pairs
  Reduce collects all docs for the same term: (term, {doc1, doc2…}
  Sub-indices are merged separately
    Term-partitioned indices
Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
Query Processing / Matching
Structure
Taxonomy of search approaches
Query processing / matching techniques for Semantic Search
Types of semantic data
Formalisms for querying semantic data
Approaches
  General task: hybrid graph pattern matching
  Matching keyword query against text
  Matching structured query against structured data
  Matching keyword query against structured data
  Matching structured query against text (a hybrid case)
Main tasks, challenges and opportunities
Taxonomy of search approaches (1)
The search problem
  A collection of resources, called data
  Information needs expressed as queries
  Search is the task of efficiently computing results from data that
  are relevant to queries
Document data retrieval vs. structured data retrieval
  Differences in query and data representation and matching
  Efficiently retrieve structured data that exactly match formal
  information needs expressed as structured queries
  Effectively rank textual results that match ambiguous NL / keyword
  queries to a certain degree (notions of relevance)
Semantic search: ranked retrieval of document and structured
data (given ambiguous queries / data)
Taxonomy of search approaches (2)
                              Query engines (of databases)
                                      Exact
                                         Complete
    Query
                                         Sound

                                   • Approximate
      Matching




                                      • Not complete
                                      • Not sound

                                   • Ranked
     Data
                                      • Best effort
                                      • Top-k
                              Search engines (stand-alone/database extensons)
Query processing mainly focuses on efficiency of matching whereas ranking
deals with degree of matching (relevance)!
Query processing for Semantic Search (1)
Resources represented by semantic data ranging from
  Structured data with well defined schemas
  Semi-structured data with incomplete or no schemas
  Data that largely comprise text
  Hybrid / embedded data
Information needs of varying complexity, captured using different
formalisms and querying paradigms
  Natural language texts and keywords
  Form-based inputs
  Formal structured queries
(Search is end-user oriented paradigm, requires “natural”, intuitive querying interfaces)
Semantic search: efficiently computing results (query processing)
from data that are relevant to queries (ranking)
Query processing for Semantic Search (2)
                            NL                    Form- / facet-   Structured Queries
        Keywords
                          Questions                based Inputs        (SPARQL)
Ambiquities


                                          Query




                                             Matching
                                           Data



    RDF data
                        Semi-Structured          Structured        OWL ontologies with
 embedded in text
                           RDF data               RDF data         rich, formal semantics
     (RDFa)
Ambiquities: confidence degree, truth/trust value…
Query processing for Semantic Search (3)
                                   Textual Data



                                      Structured query
               Keyword query on
                                       on textual data ,
                textual data, e.g.
                                        e.g. querying
                  Web search
                                        extension for
                    systems Search target different
                  Semantic             search systems?
                 group of users, information needs,
Unstructured    and types of data. Query processing                Structured
Query                                                              Query
                   for semantic searchStructured query
                                       is hybrid
               Keyword query on
                  combination of techniques!
                                      on structured data
                structured data,
                                                e.g. standard
                   e.g. search
                                              querying interface
                 extensions for
                                               for databases /
                    databases
                                                 RDF stores


                               Structured Data
Types of data models (1)
Textual
  Bag-of-words
  Represent documents, text in structured data,…, real-world
  objects (captured as structured data)
  Lacks “structure”
      in text, e.g. linguistic structure, hyperlinks, (positional information)
      Structure in structured data representation



   term (statistics)
    combination
    Cloud
    In combination with Cloud
    Computing technologies,
    Technologies
    promising solutions for the
    management of `big data'
    solutions
    management
    have emerged. Existing
    `big data'
    industry solutions are able to
    support complex queries and
    industry
    solutions
    analytics tasks with terabytes
    of data. For example, using a
    support
    complex
    Greenplum.
    ……
Types of data models (2)
Textual
Structured
  Resource Description Framework (RDF)
  Represent real-world objects, services, applications, …. documents
  Resource attribute values and relationships between resources
  Schema

                    creator   Picture
           Person




     Bob
Types of data models (3)
Textual
Structured
Hybrid
  RDF data embedded in text (RDFa)
Types of data models – RDFa (1)
…
<div about="/alice/posts/trouble_with_bob">
   <h2 property="dc:title">The trouble with Bob</h2>
   <h3 property="dc:creator">Alice</h3>

          Bob is a good friend of mine. We went to the same university, and
          also shared an apartment in Berlin in 2008. The trouble with Bob is               that
he takes much better photos than I do:

  <div about="http://example.com/bob/photos/sunset.jpg">
    <img src="http://example.com/bob/photos/sunset.jpg" />
    <span property="dc:title">Beautiful Sunset</span>
    by <span property="dc:creator">Bob</span>.
  </div>
</div>
…
                                       adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
Types of semantic data – RDFa (2)

Bob is a good friend of mine. We went to        content
the same university, and also shared an
apartment in Berlin in 2008. The trouble
with Bob is that he takes much better
photos than I do:

                                      content




                                                adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
Types of semantic data - conclusion




Semantic data in general can be conceived as a graph
with text and structured data items as nodes, and
   edges represent different types of relationships
  including explicit semantic relationships and
     vaguely specified ones such as hyperlinks!
Formalisms for querying semantic data (1)

Example information need
“Information about a friend of Alice, who shared an apartment with
her in Berlin and knows someone working at KIT.”


  Unstructured queries
  Fully-structured queries
  Hybrid queries: unstructured + structured
Formalisms for querying semantic data (2)

Example information need
“Information about a friend of Alice, who shared an apartment
with her in Berlin and knows someone working at KIT.”


  Unstructured
    NL
    Keywords
                       shared   apartment   Berlin   Alice
Formalisms for querying semantic data (3)

Example information need
“Information about a friend of Alice, who shared an apartment
with her in Berlin and knows someone working at KIT.”


  Unstructured
  Fully-structured
    SPARQL: BGP, filter, optional, union, select, construct, ask, describe
       PREFIX ns: <http://example.org/ns#>
       SELECT ?x
       WHERE { ?x ns:knows ? y. ?y ns:name “Alice”.
          ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” }
Formalisms for querying semantic data (4)
Fully-structured
Unstructured
Hybrid: content and structure constraints




                                                  “shared apartment Berlin Alice”

           ?x ns:knows ? y. ?y ns:name “Alice”.
           ?x ns:knows ?z. ?z ns: works ?v.
           ?v ns:name “KIT”
Formalisms for querying semantic data (5)
Fully-structured
Unstructured
Hybrid: content and structure constraints




                                                  “shared apartment Berlin Alice”

           ?x ns:knows ? y. ?y ns:name “Alice”.
           ?x ns:knows ?z. ?z ns: works ?v.
           ?v ns:name “KIT”
Formalisms for querying semantic data - conclusion




Semantic search queries can be conceived as graph
    patterns with nodes referring to text and
  structured data items, and edges referring to
        relationships between these items!
Processing hybrid graph patterns (1)
 Example information need
 “Information about a friend of Alice, who shared an apartment with her in Berlin and
 knows someone working at KIT.”


apartment shared Berlin Alice                       ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”

                                 ?y ns:name “Alice”. ?x ns:knows ? y

                        trouble with bob                                       FluidOps                              34
                                                                                                   Peter
                                                       sunset.jpg
                        Bob is a good friend of
                                                                          Beautiful
                        mine. We went to the                              Sunset
                        same university, and                                                  Germany         Semantic
                                                       Alice                                                  Search
                        also shared an apartment
                        in Berlin in 2008. The
                        trouble with Bob is that
                        he takes much better                                                               Germany       2009
                                                                    Bob
                        photos than I do:                                             Thanh


                                                                                                  KIT
Processing hybrid graph patterns (2)
Matching hybrid graph patterns against data
Matching keyword query against text
 • Retrieve documents
    • Inverted list (inverted index)
            keyword       {<doc1, pos, score, ...>,
                              <doc2, pos, score, ...>, ...}
 • AND-semantics: top-k join

shared          Berlin          Alice                         shared   Berlin   Alice


 D1             D1            D1


  shared    =    berlin   =    alice




  shared
Matching structured query against structured data
    • Retrieve data for triple patterns
       • Index on tables
       • Multiple “redundant” indexes to cover different access patterns
    • Join (conjunction of triples)
       • Blocking, e.g. linear merge join (required sorted input)
       • Non-blocking, e.g. symmetric hash-join
       • Materialized join indexes
                                                                        ?x ns:knows ?y. ?x ns:knows ?z.
Per1 ns:works ?v     ?v ns:name “KIT”
       SP-index      PO-index                                           ?z ns: works ?v. ?v ns:name “KIT”




                                                         =
                                               =                  =

                                Per1 ns:works Ins1   Ins1 ns:name KIT
Per1 ns:works Ins1    Ins1 ns:name KIT
Matching keyword query against structured data
     • Retrieve keyword elements
        • Using inverted index
            keyword {<el1, score, ...>, <el2, score, ...>,…}
     • Exploration / “Join”
        • Data indexes for triple lookup
        • Materialized index (paths up to graphs)
        • Top-k Steiner tree search, top-k subgraph exploration


           Alice          Bob             KIT                 Alice   Bob   KIT

                     ↔            ↔


                                     =
                         =
Alice ns:knows Bob                        Inst1 ns:name KIT
                     Bob ns:works Inst1
Matching structured query against text
• Based on offline IE (offline see Peter’s slides)
• Based on online IE, i.e., “retrieve “ is as follows
   • Derive keywords to retrieve relevant documents
   • On-the-fly information extraction, i.e., phrase pattern matching “X name Y”
   • Retrieve extracted data for structured part
   • Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.

                                                              ?x ns:knows ?y. ?x ns:knows ?z.
                                                              ?z ns: works ?v. ?v ns:name “KIT”

                                                          knows


                                                                                        name   KIT
Matching structured query against text
• Index
   • Inverted index for document retrieval and pattern matching
   • Join index inverted index for storing materialized joins between keywords
   • Neighborhood indexes for phrase patterns




                                                           ?x ns:knows ?y. ?x ns:knows ?z.
                                                           ?z ns: works ?v. ?v ns:name “KIT”
 KIT   name
                                                       knows


                                                                                     name   KIT
Query processing – main tasks
                Retrieval
                  Documents , data elements, triples, paths,
 Query            graphs
                  Inverted index,…, but also other (B+ tree)
                  Index documents, triples, materialized paths
  Matching




                Join
                  Different join implementations, efficiency
                  depends on availability of indexes
                  Non-blocking join good for early result
  Data            reporting and for “unpredictable” Linked
                  Data / data streams scenario
Query processing – more tasks
               More complex queries: disjunction,
               aggregation, grouping, analytics…
 Query         Join order optimization
               Approximate
                 Approximate the search space
                 Approximate the results (matching, join)
  Matching




               Parallelization
               Top-k
                 Use only some entries in the input streams to
                 produce k results
  Data         Multiple sources
                 Federation, routing
                 On-the-fly mapping, similarity join
               Hybrid
                 Join text and data
Query processing on the Web -
 research challenges and opportunities

Large amount of semantic
data
                               • Optimization, parallelization
Data inconsistent,
                               • Approximation
redundant, and low quality
                               • Hybrid querying and data
Large amount of data             management
embedded in text
                               • Federation, routing
Large amount of sources        • Online schema mappings
Large amount of links          • Similarity join
between sources
Ranking
Structure
Problem definition
Types of ambiguities
Ranking paradigms
Model construction
  Content-based
  Structure-based
Ranking – problem definition
  Query
                      • Ambiguities arise when representation is
                        incomplete / imprecise
                      • Ambiguities at the level of
   Matching




                         • elements (content ambiguity)
                         • structure between elements (structure
                            ambiguity)
  Data

Due to ambiguities in the representation of the information needs and
the underlying resources, the results cannot be guaranteed to exactly
match the query. Ranking is the problem of determining the degree
of matching using some notions of relevance.
Content ambiguity
apartment shared Berlin Alice                       ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”

                                 ?y ns:name “Alice”. ?x ns:knows ? y

                        trouble with bob                                       FluidOps                              34
                                                                                                   Peter
                                                       sunset.jpg
                        Bob is a good friend of
                                                                          Beautiful
                        mine. We went to the                              Sunset
                        same university, and                                                  Germany         Semantic
                                                       Alice                                                  Search
                        also shared an apartment
                        in Berlin in 2008. The
                        trouble with Bob is that
                        he takes much better                                                               Germany       2009
                                                                    Bob
                        photos than I do:                                             Thanh


                                                                                                  KIT

 What is meant by “Berlin” in the query?                 What is meant by “KIT” in the query?
 What is meant by “Berlin” in the data?                  What is meant by “KIT” in the data?
 A city with the name Berlin? a person?                  A research group? a university? a location?
Structure ambiguity
apartment shared Berlin Alice                       ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”

                                 ?y ns:name “Alice”. ?x ns:knows ? y

                        trouble with bob                                       FluidOps                              34
                                                                                                   Peter
                                                       sunset.jpg
                        Bob is a good friend of
                                                                          Beautiful
                        mine. We went to the                              Sunset
                        same university, and                                                  Germany         Semantic
                                                       Alice                                                  Search
                        also shared an apartment
                        in Berlin in 2008. The
                        trouble with Bob is that
                        he takes much better                                                               Germany       2009
                                                                    Bob
                        photos than I do:                                             Thanh


                                                                                                  KIT


 What is the connection between “Berlin” and             What is meant by “works”?
 “Alice”?                                                Works at? employed?
 Friend? Co-worker?
Ambiguity
Recall: query processing is matching at the level of syntax and
semantics
Ambiguities arise when data or query allow for multiple
interpretations, i.e. multiple matches
  Syntactic, e.g. works vs. works at
  Semantic, e.g. works vs. employ
“Aboutness”, i.e., contain some elements which represent the
correct interpretation
  Ambiguities arise when matching elements of different granularities
  Does i contains the interpretation for j, given some part(s) of i
  (syntactically/semantically) match j
  E.g. Berlin vs. “…we went to the same university, and also, we shared an
  apartment in Berlin in 2008…”
Strictly speaking, ranking is performed after syntactic / semantic matching is
done!
Features: What to use to deal with ambiguities?
What is meant by “Berlin”? What is the connection between
“Berlin” and “Alice”?
  Content features
    Frequencies of terms: d more likely to be “about” a query term k
    when d more often, mentions k (probabilistic IR)
    Co-occurrences: terms K that often co-occur form a contextual
    interpretation, i.e., topics (cluster hypothesis)
  Structure features
    Consider relevance at level of fields
    Linked-based popularity
Ranking paradigms
Explicit relevance model
  Foundation: probability ranking principle
  Ranking results by the posterior probability (odds) of being
  observed in the relevant class:
  P(w|R) varies in different approaches, e.g., binary independence
  model, 2-poisson model, relevance model




                           P( D | R ) = ∏ P ( w | R)∏ (1 − P( w | N ))
       P(D|R)
       P(D/N)                             w∈D      w∉D

                                                           k
   P( w | R) ≈ P ( w | q1 ,..., qk ) =   ∑ P(m) P(w | m)∑ P(q     k   | m)
                                         m∈M              i =1
Ranking paradigms
   No explicit notion of relevance: similarity between the query and
   the document model
     Vector space model (cosine similarity)
     Language models (KL divergence)




   Sim ( q , d ) = Cos ( ( w1, d ,..., w t , d ), ( w1, q ,..., w k , q ))

                                                                     P (t | θ q )
Sim ( q , d ) = − KL (θ q ||θ d ) = − ∑ P (t | θ q ) log(
                                            t ∈V                     P (t | θ d )
Model construction
How to obtain
  Relevance models?
  Weights for query / document terms?
  Language models for document / queries?
Content-based model construction
   Document statistics, e.g.            • An object is more likely about
     Term frequency                     “Berlin”?
     Document length                        • When it contains a
   Collection statistics, e.g.              relatively high number of
                                            mentions of the term
     Inverse document frequency
                                            “Berlin”
     Background language models
                                            • When the number of
                                            mentions of this term in the
                                            overall collection is relatively
                tf
     wt ,d   =      ∗ idf                   low
               |d |
                  tf
P (t | θ d ) = λ      + (1 − λ ) P ( t | C )
                 |d |
Structure-based model construction
 Consider structure of objects during content-based modeling,
 i.e., to obtain structured content-based model
    Content-based model for structured objects, documents and for
    general tuples




        P (t | θ d ) =   ∑α       f   P (t | θ f )
                         f ∈ Fd


• An object is more likely about “Berlin”?
    • When one of its (important) fields contains a relatively high
    number of mentions of the term “Berlin”
Structure-based model construction
   PageRank
      Link analysis algorithm
      Measuring relative importance of nodes
      Link counts as a vote of support
      The PageRank of a node recursively depends on the number and
      PageRank of all nodes that link to it (incoming links)
   ObjectRank
      Types and semantics of links vary in structured data setting
      Authority transfer schema graph specifies connection strengths
      Recursively compute authority transfer data graph
• An object about “Berlin” is more important than one another?
    •When a relatively large number of objects are linked to it
Taxonomy of ranking approaches
Explicitly vs. non-explicitly relevance-based
Content-based ranking
Structure-based ranking
Content- and-structure-based ranking
Result Presentation
Search interface
Input and output functionality
  helping the user to formulate complex queries
  presenting the results in an intelligent manner
Semantic Search brings improvements in
  Query formulation
  Snippet generation
  Suggesting related entities
  Adaptive and interactive presentation
    Presentation adapts to the kind of query and results presented
    Object results can be actionable, e.g. buy this product
  Aggregated search
    Grouping similar items, summarizing results in various ways
    Filtering (facets), possibly across different dimensions
  Task completion
    Help the user to fulfill the task by placing the query in a task context
Query formulation
“Snap-to-grid”: suggest the most likely interpretation of the query
  Given the ontology or a summary of the data
  While the user is typing or after issuing the query
  Example: Freebase suggest, TrueKnowledge
Enhanced results/Rich Snippets
Use mark-up from the webpage to generate search snippets
  Originally invented at Yahoo! (SearchMonkey)
Google,Yahoo!, Bing, Yandex now consume schema.org markup
  Validators available from Google and Bing
Other result presentation tasks
Select the most relevant resources within an RDF document
  Penin et al. Snippet Generation for Semantic Web Search Engines,
  ASWC 2010
For each resource, rank the properties to be displayed
Natural Language Generation (NLG)
  Verbalize, explain results
Aggregated search: facets
Aggregated search: Sig.ma
Related entities




                   Related actors and
                   movies
Adaptive presentation:
housing search
Evaluation
Semantic Search challenge (2010/2011)
Two tasks
  Entity Search
    Queries where the user is looking for a single real world object
    Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010.
  List search (new in 2011)
    Queries where the user is looking for a class of objects
Billion Triples Challenge 2009 dataset
Evaluated using Amazon’s Mechanical Turk
  Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010
  Blanco et al. Repeatable and Reliable Search System Evaluation using
  Crowd-Sourcing, SIGIR2011
Evaluation form
Catching the bad guys
Payment can be rejected for workers who try to game the system
  An explanation is commonly expected, though cheaters rarely complain
We opted to mix control questions into the real results
  Gold-win cases that are known to be perfect
  Gold-loose cases that are known to be bad
Metrics
  Avg. and std. dev on gold-win and gold-loose results
  Time to complete
Worker     Known          Real               Known Good      Total       Time to
           bad                                               N           complete
           N     Mean     N      Mean        N      Mean                 (sec)


badguy     20 2.556       200    2.738       20     2.684    240         29.6
goodguy    13 1           130    2.038       13     3        156         95
whoknows 1       1        21     1.571       2      3        24          83.5
Other evaluations

TREC Entity Track
  Related Entity Finding
    Entities related to a given entity through a particular relationship
    Retrieval over documents (ClueWeb 09 collection)
    Example: (Homepages of) airlines that fly Boeing 747
  Entity List Completion
    Given some elements of a list of entities, complete the list
Question Answering over Linked Data
  Retrieval over specific datasets (Dbpedia and MusicBrainz)
  Full natural language questions of different forms
  Correct results defined by an equivalent SPARQL query
  Example: Give me all actors starring in Batman Begins.
Resources
Resources
Books
  Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information
  Retrieval. ACM Press. 2011
Survey papers
  ThanhTran, Peter Mika. Survey of Semantic Search Approaches.
  Under submission, 2012.
Conferences and workshops
  ISWC, ESWC, WWW, SIGIR, CIKM, SemTech
  Semantic Search workshop series
  Exploiting Semantic Annotations in Information Retrieval (ESAIR)
  Entity-oriented Search (EOS) workshop

Contenu connexe

Tendances

Information retrieval concept, practice and challenge
Information retrieval   concept, practice and challengeInformation retrieval   concept, practice and challenge
Information retrieval concept, practice and challengeGan Keng Hoon
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Roi Blanco
 
Semantic Search
Semantic SearchSemantic Search
Semantic Searchsssw2012
 
An Introduction to Information Retrieval and Applications
 An Introduction to Information Retrieval and Applications An Introduction to Information Retrieval and Applications
An Introduction to Information Retrieval and Applications sathish sak
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introductionnimmyjans4
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
Controlled Vocabularies & Cataloging
Controlled Vocabularies & Cataloging Controlled Vocabularies & Cataloging
Controlled Vocabularies & Cataloging robin fay
 
Better Search With Structured Knowledge
Better Search With Structured KnowledgeBetter Search With Structured Knowledge
Better Search With Structured KnowledgeMichel Dumontier
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genrobin fay
 
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment Paris Sud University
 
Corrib.org - OpenSource and Research
Corrib.org - OpenSource and ResearchCorrib.org - OpenSource and Research
Corrib.org - OpenSource and Researchadameq
 
SDA2013 Pundit: Creating, Exploring and Consuming Annotations
SDA2013 Pundit: Creating, Exploring and Consuming AnnotationsSDA2013 Pundit: Creating, Exploring and Consuming Annotations
SDA2013 Pundit: Creating, Exploring and Consuming AnnotationsMarco Grassi
 
How to be successful with search in your organisation
How to be successful with search in your organisationHow to be successful with search in your organisation
How to be successful with search in your organisationvoginip
 
Tovek Presentation 2 by Livio Costantini
Tovek Presentation 2 by Livio CostantiniTovek Presentation 2 by Livio Costantini
Tovek Presentation 2 by Livio Costantinimaxfalc
 
SemWeb Fundamentals - Info Linking & Layering in Practice
SemWeb Fundamentals - Info Linking & Layering in PracticeSemWeb Fundamentals - Info Linking & Layering in Practice
SemWeb Fundamentals - Info Linking & Layering in PracticeDan Brickley
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval ssilambu111
 
Cataloging101 foundations: Authorities
Cataloging101 foundations: AuthoritiesCataloging101 foundations: Authorities
Cataloging101 foundations: Authoritiesrobin fay
 
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsIntent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsTrey Grainger
 

Tendances (20)

Information retrieval concept, practice and challenge
Information retrieval   concept, practice and challengeInformation retrieval   concept, practice and challenge
Information retrieval concept, practice and challenge
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search
 
Semantic Search
Semantic SearchSemantic Search
Semantic Search
 
An Introduction to Information Retrieval and Applications
 An Introduction to Information Retrieval and Applications An Introduction to Information Retrieval and Applications
An Introduction to Information Retrieval and Applications
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
Controlled Vocabularies & Cataloging
Controlled Vocabularies & Cataloging Controlled Vocabularies & Cataloging
Controlled Vocabularies & Cataloging
 
Better Search With Structured Knowledge
Better Search With Structured KnowledgeBetter Search With Structured Knowledge
Better Search With Structured Knowledge
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services gen
 
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
 
Corrib.org - OpenSource and Research
Corrib.org - OpenSource and ResearchCorrib.org - OpenSource and Research
Corrib.org - OpenSource and Research
 
Web Mining
Web MiningWeb Mining
Web Mining
 
SDA2013 Pundit: Creating, Exploring and Consuming Annotations
SDA2013 Pundit: Creating, Exploring and Consuming AnnotationsSDA2013 Pundit: Creating, Exploring and Consuming Annotations
SDA2013 Pundit: Creating, Exploring and Consuming Annotations
 
How to be successful with search in your organisation
How to be successful with search in your organisationHow to be successful with search in your organisation
How to be successful with search in your organisation
 
Tovek Presentation 2 by Livio Costantini
Tovek Presentation 2 by Livio CostantiniTovek Presentation 2 by Livio Costantini
Tovek Presentation 2 by Livio Costantini
 
SemWeb Fundamentals - Info Linking & Layering in Practice
SemWeb Fundamentals - Info Linking & Layering in PracticeSemWeb Fundamentals - Info Linking & Layering in Practice
SemWeb Fundamentals - Info Linking & Layering in Practice
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
 
Role of Semantic Web in Health Informatics
Role of Semantic Web in Health InformaticsRole of Semantic Web in Health Informatics
Role of Semantic Web in Health Informatics
 
Cataloging101 foundations: Authorities
Cataloging101 foundations: AuthoritiesCataloging101 foundations: Authorities
Cataloging101 foundations: Authorities
 
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsIntent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
 

En vedette

Keyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance ModelsKeyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance ModelsThanh Tran
 
Гастро-тур в Италию
Гастро-тур в ИталиюГастро-тур в Италию
Гастро-тур в ИталиюEasyWays
 
Lifecycle support in architectures for ontology-based information systems - iswc
Lifecycle support in architectures for ontology-based information systems - iswcLifecycle support in architectures for ontology-based information systems - iswc
Lifecycle support in architectures for ontology-based information systems - iswcThanh Tran
 
поляризация диэлектриков
поляризация диэлектриковполяризация диэлектриков
поляризация диэлектриковAndronovaAnna
 
Query Processing Using Structure Index for RDF Data on the Web
Query Processing Using Structure Index for RDF Data on the WebQuery Processing Using Structure Index for RDF Data on the Web
Query Processing Using Structure Index for RDF Data on the WebThanh Tran
 
Summary Models for Routing Keywords to Linked Data Sources
Summary Models for Routing Keywords to Linked Data SourcesSummary Models for Routing Keywords to Linked Data Sources
Summary Models for Routing Keywords to Linked Data SourcesThanh Tran
 
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...Thanh Tran
 
Semantic Web Search - Searching Documents and Semantic Data on the Web
Semantic Web Search - Searching Documents and Semantic Data on the WebSemantic Web Search - Searching Documents and Semantic Data on the Web
Semantic Web Search - Searching Documents and Semantic Data on the WebThanh Tran
 
Recent Trends in Semantic Search Technologies
Recent Trends in Semantic Search TechnologiesRecent Trends in Semantic Search Technologies
Recent Trends in Semantic Search TechnologiesThanh Tran
 
Searching Linked Data
Searching Linked DataSearching Linked Data
Searching Linked DataThanh Tran
 
ESSIR 2011 Semantic Search Tutorial
ESSIR 2011 Semantic Search TutorialESSIR 2011 Semantic Search Tutorial
ESSIR 2011 Semantic Search TutorialThanh Tran
 
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web Thanh Tran
 
Big data search
Big data search Big data search
Big data search Thanh Tran
 
SERIMI: Class-based Disambiguation for Effective Instance Matching over Heter...
SERIMI: Class-based Disambiguation for Effective Instance Matching over Heter...SERIMI: Class-based Disambiguation for Effective Instance Matching over Heter...
SERIMI: Class-based Disambiguation for Effective Instance Matching over Heter...Thanh Tran
 
Index Structures and Top-k Joins for Native Keyword Search Databases
Index Structures and Top-k Joins for Native Keyword Search DatabasesIndex Structures and Top-k Joins for Native Keyword Search Databases
Index Structures and Top-k Joins for Native Keyword Search DatabasesThanh Tran
 
Linked Data Query Processing Strategies
Linked Data Query Processing StrategiesLinked Data Query Processing Strategies
Linked Data Query Processing StrategiesThanh Tran
 
Graphinder semantic search
Graphinder semantic searchGraphinder semantic search
Graphinder semantic searchThanh Tran
 
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...Thanh Tran
 

En vedette (18)

Keyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance ModelsKeyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance Models
 
Гастро-тур в Италию
Гастро-тур в ИталиюГастро-тур в Италию
Гастро-тур в Италию
 
Lifecycle support in architectures for ontology-based information systems - iswc
Lifecycle support in architectures for ontology-based information systems - iswcLifecycle support in architectures for ontology-based information systems - iswc
Lifecycle support in architectures for ontology-based information systems - iswc
 
поляризация диэлектриков
поляризация диэлектриковполяризация диэлектриков
поляризация диэлектриков
 
Query Processing Using Structure Index for RDF Data on the Web
Query Processing Using Structure Index for RDF Data on the WebQuery Processing Using Structure Index for RDF Data on the Web
Query Processing Using Structure Index for RDF Data on the Web
 
Summary Models for Routing Keywords to Linked Data Sources
Summary Models for Routing Keywords to Linked Data SourcesSummary Models for Routing Keywords to Linked Data Sources
Summary Models for Routing Keywords to Linked Data Sources
 
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...
 
Semantic Web Search - Searching Documents and Semantic Data on the Web
Semantic Web Search - Searching Documents and Semantic Data on the WebSemantic Web Search - Searching Documents and Semantic Data on the Web
Semantic Web Search - Searching Documents and Semantic Data on the Web
 
Recent Trends in Semantic Search Technologies
Recent Trends in Semantic Search TechnologiesRecent Trends in Semantic Search Technologies
Recent Trends in Semantic Search Technologies
 
Searching Linked Data
Searching Linked DataSearching Linked Data
Searching Linked Data
 
ESSIR 2011 Semantic Search Tutorial
ESSIR 2011 Semantic Search TutorialESSIR 2011 Semantic Search Tutorial
ESSIR 2011 Semantic Search Tutorial
 
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
 
Big data search
Big data search Big data search
Big data search
 
SERIMI: Class-based Disambiguation for Effective Instance Matching over Heter...
SERIMI: Class-based Disambiguation for Effective Instance Matching over Heter...SERIMI: Class-based Disambiguation for Effective Instance Matching over Heter...
SERIMI: Class-based Disambiguation for Effective Instance Matching over Heter...
 
Index Structures and Top-k Joins for Native Keyword Search Databases
Index Structures and Top-k Joins for Native Keyword Search DatabasesIndex Structures and Top-k Joins for Native Keyword Search Databases
Index Structures and Top-k Joins for Native Keyword Search Databases
 
Linked Data Query Processing Strategies
Linked Data Query Processing StrategiesLinked Data Query Processing Strategies
Linked Data Query Processing Strategies
 
Graphinder semantic search
Graphinder semantic searchGraphinder semantic search
Graphinder semantic search
 
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...
 

Similaire à Semantic Search Tutorial at SemTech 2012

Semantic Search using RDF Metadata (SemTech 2005)
Semantic Search using RDF Metadata (SemTech 2005)Semantic Search using RDF Metadata (SemTech 2005)
Semantic Search using RDF Metadata (SemTech 2005)Bradley Allen
 
Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011sssw2011
 
NetIKX Semantic Search Presentation
NetIKX Semantic Search PresentationNetIKX Semantic Search Presentation
NetIKX Semantic Search Presentationurvics
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrievalKU Leuven
 
Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...NALESVPMEngg
 
Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Stuart Weibel
 
Inteligent Catalogue Final
Inteligent Catalogue FinalInteligent Catalogue Final
Inteligent Catalogue Finalguestcaef1d
 
Making things findable
Making things findableMaking things findable
Making things findablePeter Mika
 
Index nominum to ontology
Index nominum to ontologyIndex nominum to ontology
Index nominum to ontologyGautier Poupeau
 
Automatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles DuncanAutomatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles DuncanJISC CETIS
 
Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and RetrievalChapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and Retrievalcaptainmactavish1996
 
Sem tech2013 tutorial
Sem tech2013 tutorialSem tech2013 tutorial
Sem tech2013 tutorialThengo Kim
 
DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0John Breslin
 
Slawek Korea
Slawek KoreaSlawek Korea
Slawek KoreaSlawek
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialLeeFeigenbaum
 
Chapter 1 Introduction to Information Storage and Retrieval.pdf
Chapter 1 Introduction to Information Storage and Retrieval.pdfChapter 1 Introduction to Information Storage and Retrieval.pdf
Chapter 1 Introduction to Information Storage and Retrieval.pdfHabtamu100
 
Using metadata repositories with search
Using metadata repositories with searchUsing metadata repositories with search
Using metadata repositories with searchJean Graef
 
Humanidades digitales por Ryan Shaw (University of North Carolina at Chapel H...
Humanidades digitales por Ryan Shaw (University of North Carolina at Chapel H...Humanidades digitales por Ryan Shaw (University of North Carolina at Chapel H...
Humanidades digitales por Ryan Shaw (University of North Carolina at Chapel H...innovatics
 
Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)Bradley Allen
 

Similaire à Semantic Search Tutorial at SemTech 2012 (20)

Semantic Search using RDF Metadata (SemTech 2005)
Semantic Search using RDF Metadata (SemTech 2005)Semantic Search using RDF Metadata (SemTech 2005)
Semantic Search using RDF Metadata (SemTech 2005)
 
Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011
 
NetIKX Semantic Search Presentation
NetIKX Semantic Search PresentationNetIKX Semantic Search Presentation
NetIKX Semantic Search Presentation
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrieval
 
Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...
 
Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?
 
Inteligent Catalogue Final
Inteligent Catalogue FinalInteligent Catalogue Final
Inteligent Catalogue Final
 
Making things findable
Making things findableMaking things findable
Making things findable
 
Index nominum to ontology
Index nominum to ontologyIndex nominum to ontology
Index nominum to ontology
 
Haystacks slides
Haystacks slidesHaystacks slides
Haystacks slides
 
Automatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles DuncanAutomatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles Duncan
 
Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and RetrievalChapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and Retrieval
 
Sem tech2013 tutorial
Sem tech2013 tutorialSem tech2013 tutorial
Sem tech2013 tutorial
 
DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0
 
Slawek Korea
Slawek KoreaSlawek Korea
Slawek Korea
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web Tutorial
 
Chapter 1 Introduction to Information Storage and Retrieval.pdf
Chapter 1 Introduction to Information Storage and Retrieval.pdfChapter 1 Introduction to Information Storage and Retrieval.pdf
Chapter 1 Introduction to Information Storage and Retrieval.pdf
 
Using metadata repositories with search
Using metadata repositories with searchUsing metadata repositories with search
Using metadata repositories with search
 
Humanidades digitales por Ryan Shaw (University of North Carolina at Chapel H...
Humanidades digitales por Ryan Shaw (University of North Carolina at Chapel H...Humanidades digitales por Ryan Shaw (University of North Carolina at Chapel H...
Humanidades digitales por Ryan Shaw (University of North Carolina at Chapel H...
 
Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)
 

Semantic Search Tutorial at SemTech 2012

  • 1. Semantic Search Tutorial Introduction Peter Mika|Yahoo! Research, Spain pmika@yahoo-inc.com Thanh Tran | Institute AIFB, KIT, Germany Tran@aifb.uni-karlsruhe.de
  • 2. About the speakers Peter Mika Senior Research Scientist Head of Semantic Search group at Yahoo! Research in Barcelona Semantic Search, Web Object Retrieval, Natural Language Processing Tran Duc Thanh Assistant Professor at AIFB, Karlsruhe Institute ofTechnology Head of Semantic Search group Semantic Search, Semantic / Linked Data Management
  • 3. Agenda Introduction (10 min) Semantic Web data (50 min) The RDF data model Publishing RDF Crawling and indexing RDF data Query processing (40 min) Ranking (30 min) Result presentation (15 min) Semantic Search evaluation (15 min) Questions (5 min)
  • 4. Why Semantic Search? I. “We are at the beginning of search.“ (Marissa Mayer) Solved large classes of queries, e.g. navigational Heavy investment in computational power Remaining queries are hard, not solvable by brute force, and require a deep understanding of the world and human cognition Background knowledge and metadata can help to address poorly solved queries
  • 5. Poorly solved information needs Ambiguous searches Many of these queries would not paris hilton be asked by users, who learned Long tail queries over time what search technology can and can not do. george bush (and I mean the beer brewer in Arizona) Multimedia search paris hilton sexy Imprecise or overly precise searches jim hendler pictures of strong adventures people Precise searches for descriptions countries in africa 32 year old computer scientist living in barcelona reliable digital camera under 300 dollars
  • 7. Why Semantic Search? II. The Semantic Web is now a reality Large amounts of data published in RDF Heterogeneous data of varying quality Users who are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domain Searching data instead or in addition to searching documents Direct answers Novel search tasks
  • 8. Example: direct answers in search Points of Faceted search Information interest in from the for Shopping Information box with Vienna, Austria results Knowledge content from and links Graph to Yahoo! Travel Since Aug, 2010, ‘regular’ search results are ‘Powered by Bing’
  • 9. Novel search tasks Aggregation of search results e.g. price comparison across websites Analysis and prediction e.g. world temperature by 2020 Semantic profiling recommendations based on particular interests Semantic log analysis understanding user behavior in terms of objects Support for complex tasks e.g. booking a vacation using a combination of services
  • 10. Contextual (pervasive, ambient) search Yahoo! Connected TV: Widget engine embedded into the TV Yahoo! IntoNow: recognize audio and show related content
  • 11. Interactive search and task completion
  • 12. Document retrieval and data retrieval Information Retrieval (IR) support the retrieval of documents (document retrieval) Representation based on lightweight syntax-centric models Work well for topical search Not so well for more complex information needs Web scale Database (DB) and Knowledge-based Systems (KB) deliver more precise answers (data retrieval) More expressive models Allow for complex queries Retrieve concrete answers that precisely match queries Not just matching and filtering, but also joins Limitations in scalability
  • 13. Combination of document and data retrieval Documents with metadata Metadata may be embedded inside the document I’m looking for documents that mention countries in Africa. Data retrieval Structured data, but searchable text fields I’m looking for directors, who have directed movies where the synopsis mentions dinosaurs.
  • 14. Semantic Search Target (combination of) document and data retrieval Semantic search is a retrieval paradigm that Exploits the structure/semantics of the data or explicit background knowledge to understand user intent and the meaning of content Incorporates the intent of the query and the meaning of content into the search process (semantic models) Wide range of semantic search systems Employ different semantic models, possibly at different steps of the search process and in order to support different tasks
  • 15. Semantic models Semantics is concerned with the meaning of the resources made available for search Various representations of meaning Linguistic models: models of relationships among words Taxonomies, thesauri, dictionaries of entity names Inference along linguistic relations, e.g. broader/narrower terms Conceptual models: models of relationships among objects Ontologies capture entities in the world and their relationships Inference along domain-specific relations We will focus on conceptual models in this tutorial In particular, the RDF/OWL conceptual model for representing classes in a domain, and describing their instances
  • 16. Semantic Search – a process view Knowledge Representation Semantic Models • Keywords Resources • Forms Query • NL Construction • Formal language •IR-style matching & ranking •DB-style precise matching Query •KB-style matching & inferences Processing •Query visualization •Document and data presentation Documents Result •Summarization Presentation •Implicit feedback •Explicit feedback Query •Incentives Refinement Document Representation
  • 17. Semantic Search systems For data / document retrieval, semantic search systems might combine a range of techniques, ranging from statistics-based IR methods for ranking, database methods for efficient indexing and query processing, up to complex reasoning techniques for making inferences!
  • 18. Example: Information Workbench Addressing the lifecycle of interacting with the Web of Data Integration of data sources Content generation by the end user Search and Exploration Visualization Publishing User- generated Wikipedia Integrated management of heterogeneous data sources DBpedia, Yago Structured and unstructured Earthquake (Data.gov) Published and user-generated Static and dynamic Open domain Structured Dynamic
  • 19. Data Sources in the Application Entire English Wikipedia Data from Linked Open Data DBpedia YAGO … Data from Data.gov (US Government) E.g. live data about earthquakes Many more
  • 20. Semantic Search Hybrid Search: Structured queries combined with keywords across structured and unstructured data sources Query interpretation: Translation of keywords into hybrid queries Keyword search/query interpretation combined with faceted search: iterative refinement process based on keywords and operations on facets
  • 21. Search, Refinement and Navigation Keywords Query Translations Term Completions Vorlesung Knowledge Discovery - Institut AIFB Facets 21
  • 24. Data on the Web Data on the Web is not directly accessible Most web pages are generated from databases, but formatted for human consumption APIs offer limited views over data Two solutions Extraction using Information Extraction (IE) techniques Out of scope for this tutorial Relying on publishers to expose structured data using standard Semantic Web formats
  • 25. Information extraction Natural Language Processing Named entity recognition and disambiguation, sentiment analysis etc. Extraction of information about entities Suchanek et al. YAGO: A Core of Semantic Knowledge UnifyingWordNet and Wikipedia,WWW, 2007. Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007. Extraction from HTML tables Cafarella et al. WebTables: Exploring the Power of Tables on theWeb.VLDB 2008 Wrapper induction Kushmerick et al.Wrapper Induction for Information ExtractionText extraction. IJCAI 2007 Filling web forms automatically (form-filling) Madhavan et al. Google's Deep-Web Crawl.VLDB 2008
  • 26. Semantic Web Sharing data across the Web Standard data model RDF A number of syntaxes (file formats) RDF/XML, RDFa Powerful, logic-based languages for schemas OWL, RIF Query languages and protocols HTTP, SPARQL
  • 27. Resource Description Framework (RDF) Each resource (thing, entity) is identified by a URI Globally unique identifiers URLs a subset of URIs Often abbreviated using namespaces e.g. example:roi = http://example.org/roi RDF represents knowledge as a set of triples Each triple is a single fact about the entity (an attribute or a relationship) A set of triples forms an RDF graph RDF document type foaf:Person example:roi name “Roi Blanco”
  • 28. Linking resources Roi’s homepage Friend-of-a-Friend ontology type example:roi foaf:Person name “Roi Blanco” knows sameAs Yahoo!’s website type worksWith #roi2 #peter email “pmika@yahoo-inc.com”
  • 29. Ontologies Ontologies are the schemas for RDF graphs Define the intended meaning of certain classes and relationships in a domain e.g. the FOAF ontology defines the foaf:Person class and relationships such as foaf:knows Ontologies are published in standard languages such as OWL Language defines the constructs for modeling and their meaning e.g. subClassOf, sameAs Tools for editing ontologies such as Protégé, TopBraid Composer Reusing and extending well-known ontologies helps to interpret unknown data e.g. if X is subClassOf foaf:Person, and A is of type X, then we know that A is a person
  • 30. Example: schema.org Agreement on a shared set of schemas for common types of web content Bing, Google, and Yahoo! as initial supporters Similar in intent to sitemaps.org (2006) Use a single format to communicate the same information to all three search engines Support for microdata schema.org covers areas of interest to all search engines Business listings (local), creative works (video), recipes, reviews User defined extensions Each search engine continues to develop its products
  • 32. Publishing RDF Interlinked RDF documents (Linked Data) Each document describes a single resource with URIs pointing to related resources Common RDF file formats are RDF/XML and Turtle Mostly implemented as a wrapper around a database or Web service Embedding RDF inside HTML RDFa, microdata SPARQL endpoints Triple stores are databases for managing RDF data SPARQL is a standard protocol and query language for accessing triple stores using HTTP
  • 33. Linked Data Grass-roots community effort to (re)publish open datasets in RDF Centered around the Dbpedia project Directory of datasets at linkeddata.org, thedatahub.org
  • 34. Metadata in HTML anno 1995 What does this term mean? <HTML> <HEAD profile="http://dublincore.org/documents/dcq- html/"> <META name="DC.author" content="Peter Mika"> <LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" /> <LINK rel="meta" type="application/rdf+xml" title="FOAF" href= "http://www.cs.vu.nl/~pmika/foaf.rdf"> </HEAD> … </HTML> How is this data related?
  • 35. Microformats (anno 2003) Mark-up for data in HTML pages Reuse existing HTML elements (class, rel) Microformats exist for a limited set of objects Persons, organizations, reviews, events etc., see microformats.org No formal schemas Limited reuse, extensibility of schemas Unclear which combinations are allowed No identifiers for entities No interlinking between entities <div class="vcard"> <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a> <div class="tel">+1-919-555-7878</div> <div class="title">Area Administrator, Assistant</div> </div>
  • 36. RDFa and RDFa Lite (2008-2012) W3C standards for embedding RDF data in HTML documents A set of new HTML attributes to be used in head or body A specification of how to extract the data from these attributes RDFa is just a syntax, you have to choose a vocabulary separately RDFa family RDFa 1.0 Recommendation since October, 2008 RDFa 1.1 is a small update on RDFa to make it easier to use RDFa Primer (Working Draft, May 8, 2012) RDFa 1.1 Lite is a subset of RDFa 1.1 with the most commonly used features Proposed Recommendation (May 8, 2012)
  • 37. Example: Facebook’s Open Graph Protocol The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities Shows up in profiles and news feed Site owners can later reach users who have liked an object Facebook Graph API allows 3rd party developers to access the data Open Graph Protocol is an RDFa-based format that allows to describe the object that the user ‘Likes’
  • 38. Example: Facebook’s Open Graph Protocol RDF vocabulary to be used in conjunction with RDFa Simplify the work of developers by restricting the freedom in RDFa Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment Only HTML <head> accepted <html xmlns:og="http://opengraphprotocol.org/schema/"> <head> <title>The Rock (1996)</title> <meta property="og:title" content="The Rock" /> <meta property="og:type" content="movie" /> <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" /> <meta property="og:image" content="http://ia.media- imdb.com/images/rock.jpg" /> … </head> ...
  • 39. Microdata HTML5 Currently under standardization at the W3C Originally part of the HTML5 spec, but now a separate document Comparable to RDFa 1.1 Lite Key extensibility features (such as multiple types) missing HTML5 also has a number of “semantic” elements such as <time>, <video>, <article>… <div itemscope itemid=“http://www.yahoo.com/resource/person”> <p>My name is <span itemprop="name">Neil</span>.</p> <p>My band is called <span itemprop="band">Four Parts Water</span>. I was born on <time itemprop="birthday" datetime="2009-05-10"> May 10th 2009</time>. <img itemprop="image" src=”me.png" alt=”me”></p> </div>
  • 40. Current state of metadata on the Web 31% of webpages, 5% of domains contain some metadata Analysis of the Bing Crawl (US crawl, January, 2012) RDFa is most common format By URL: 25% RDFa, 7% microdata, 9% microformat By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat Adoption is stronger among large publishers Especially for RDFa and microdata See also P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012 H.Mühleisen, C.Bizer.Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012
  • 41. Exponential growth in RDFa data Another five-fold increase between October 2010 and January, 2012 Five-fold increase between March, 2009 and October, 2010 Percentage of URLs with embedded metadata in various formats
  • 43. Crawling the Semantic Web Linked Data Similar to HTML crawling, but the the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled Semantic Sitemap/VOID descriptions RDFa Same as HTML crawling, but data is extracted after crawling Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010. SPARQL endpoints Endpoints are not linked, need to be discovered by other means Semantic Sitemap/VOID descriptions
  • 44. Data fusion Ontology matching Widely studied in Semantic Web research, see e.g. list of publications at ontologymatching.org Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies Entity resolution Logic-based approaches in the Semantic Web Studied as record linkage in the database literature Machine learning based approaches, focusing on attributes Graph-based approaches, see e.g. the work of Lisa Getoor are applicable to RDF data Improvements over only attribute based matching Blending Merging objects that represent the same real world entity and reconciling information from multiple sources
  • 45. Data quality assessment and curation Heterogeneity, quality of data is an even larger issue Quality ranges from well-curated data sets (e.g. Freebase) to microformats In the worst of cases, the data becomes a graph of words Short amounts of text: prone to mistakes in data entry or extraction Example: mistake in a phone number or state code Quality assessment and data curation Quality varies from data created by experts to user-generated content Automated data validation Against known-good data or using triangulation Validation against the ontology or using probabilistic models Data validation by trained professionals or crowdsourcing Sampling data for evaluation Curation based on user feedback
  • 46. Indexing Search requires matching and ranking Matching selects a subset of the elements to be scored The goal of indexing is to speed up matching Retrieval needs to be performed in milliseconds Without an index, retrieval would require streaming through the collection The type of index depends on the query model to support DB-style indexing IR-style indexing
  • 47. IR-style indexing Index data as text Create virtual documents from data One virtual document per subgraph, resource or triple typically: resource Key differences to Text Retrieval RDF data is structured Minimally, queries on property values are required
  • 48. Horizontal index structure Two fields (indices): one for terms, one for properties For each term, store the property on the same position in the property index Positions are required even without phrase queries Query engine needs to support the alignment operator ✓ Dictionary is number of unique terms + number of properties Occurrences is number of tokens * 2
  • 49. Vertical index structure One field (index) per property Positions are not required But useful for phrase queries Query engine needs to support fields Dictionary is number of unique terms Occurrences is number of tokens ✗ Number of fields is a problem for merging, query performance
  • 50. Distributed indexing MapReduce is ideal for building inverted indices Map creates (term, {doc1}) pairs Reduce collects all docs for the same term: (term, {doc1, doc2…} Sub-indices are merged separately Term-partitioned indices Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
  • 51. Query Processing / Matching
  • 52. Structure Taxonomy of search approaches Query processing / matching techniques for Semantic Search Types of semantic data Formalisms for querying semantic data Approaches General task: hybrid graph pattern matching Matching keyword query against text Matching structured query against structured data Matching keyword query against structured data Matching structured query against text (a hybrid case) Main tasks, challenges and opportunities
  • 53. Taxonomy of search approaches (1) The search problem A collection of resources, called data Information needs expressed as queries Search is the task of efficiently computing results from data that are relevant to queries Document data retrieval vs. structured data retrieval Differences in query and data representation and matching Efficiently retrieve structured data that exactly match formal information needs expressed as structured queries Effectively rank textual results that match ambiguous NL / keyword queries to a certain degree (notions of relevance) Semantic search: ranked retrieval of document and structured data (given ambiguous queries / data)
  • 54. Taxonomy of search approaches (2) Query engines (of databases) Exact Complete Query Sound • Approximate Matching • Not complete • Not sound • Ranked Data • Best effort • Top-k Search engines (stand-alone/database extensons) Query processing mainly focuses on efficiency of matching whereas ranking deals with degree of matching (relevance)!
  • 55. Query processing for Semantic Search (1) Resources represented by semantic data ranging from Structured data with well defined schemas Semi-structured data with incomplete or no schemas Data that largely comprise text Hybrid / embedded data Information needs of varying complexity, captured using different formalisms and querying paradigms Natural language texts and keywords Form-based inputs Formal structured queries (Search is end-user oriented paradigm, requires “natural”, intuitive querying interfaces) Semantic search: efficiently computing results (query processing) from data that are relevant to queries (ranking)
  • 56. Query processing for Semantic Search (2) NL Form- / facet- Structured Queries Keywords Questions based Inputs (SPARQL) Ambiquities Query Matching Data RDF data Semi-Structured Structured OWL ontologies with embedded in text RDF data RDF data rich, formal semantics (RDFa) Ambiquities: confidence degree, truth/trust value…
  • 57. Query processing for Semantic Search (3) Textual Data Structured query Keyword query on on textual data , textual data, e.g. e.g. querying Web search extension for systems Search target different Semantic search systems? group of users, information needs, Unstructured and types of data. Query processing Structured Query Query for semantic searchStructured query is hybrid Keyword query on combination of techniques! on structured data structured data, e.g. standard e.g. search querying interface extensions for for databases / databases RDF stores Structured Data
  • 58. Types of data models (1) Textual Bag-of-words Represent documents, text in structured data,…, real-world objects (captured as structured data) Lacks “structure” in text, e.g. linguistic structure, hyperlinks, (positional information) Structure in structured data representation term (statistics) combination Cloud In combination with Cloud Computing technologies, Technologies promising solutions for the management of `big data' solutions management have emerged. Existing `big data' industry solutions are able to support complex queries and industry solutions analytics tasks with terabytes of data. For example, using a support complex Greenplum. ……
  • 59. Types of data models (2) Textual Structured Resource Description Framework (RDF) Represent real-world objects, services, applications, …. documents Resource attribute values and relationships between resources Schema creator Picture Person Bob
  • 60. Types of data models (3) Textual Structured Hybrid RDF data embedded in text (RDFa)
  • 61. Types of data models – RDFa (1) … <div about="/alice/posts/trouble_with_bob"> <h2 property="dc:title">The trouble with Bob</h2> <h3 property="dc:creator">Alice</h3> Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: <div about="http://example.com/bob/photos/sunset.jpg"> <img src="http://example.com/bob/photos/sunset.jpg" /> <span property="dc:title">Beautiful Sunset</span> by <span property="dc:creator">Bob</span>. </div> </div> … adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
  • 62. Types of semantic data – RDFa (2) Bob is a good friend of mine. We went to content the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: content adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
  • 63. Types of semantic data - conclusion Semantic data in general can be conceived as a graph with text and structured data items as nodes, and edges represent different types of relationships including explicit semantic relationships and vaguely specified ones such as hyperlinks!
  • 64. Formalisms for querying semantic data (1) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured queries Fully-structured queries Hybrid queries: unstructured + structured
  • 65. Formalisms for querying semantic data (2) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured NL Keywords shared apartment Berlin Alice
  • 66. Formalisms for querying semantic data (3) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured Fully-structured SPARQL: BGP, filter, optional, union, select, construct, ask, describe PREFIX ns: <http://example.org/ns#> SELECT ?x WHERE { ?x ns:knows ? y. ?y ns:name “Alice”. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” }
  • 67. Formalisms for querying semantic data (4) Fully-structured Unstructured Hybrid: content and structure constraints “shared apartment Berlin Alice” ?x ns:knows ? y. ?y ns:name “Alice”. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
  • 68. Formalisms for querying semantic data (5) Fully-structured Unstructured Hybrid: content and structure constraints “shared apartment Berlin Alice” ?x ns:knows ? y. ?y ns:name “Alice”. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
  • 69. Formalisms for querying semantic data - conclusion Semantic search queries can be conceived as graph patterns with nodes referring to text and structured data items, and edges referring to relationships between these items!
  • 70. Processing hybrid graph patterns (1) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” ?y ns:name “Alice”. ?x ns:knows ? y trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend of Beautiful mine. We went to the Sunset same university, and Germany Semantic Alice Search also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better Germany 2009 Bob photos than I do: Thanh KIT
  • 71. Processing hybrid graph patterns (2) Matching hybrid graph patterns against data
  • 72. Matching keyword query against text • Retrieve documents • Inverted list (inverted index) keyword {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...} • AND-semantics: top-k join shared Berlin Alice shared Berlin Alice D1 D1 D1 shared = berlin = alice shared
  • 73. Matching structured query against structured data • Retrieve data for triple patterns • Index on tables • Multiple “redundant” indexes to cover different access patterns • Join (conjunction of triples) • Blocking, e.g. linear merge join (required sorted input) • Non-blocking, e.g. symmetric hash-join • Materialized join indexes ?x ns:knows ?y. ?x ns:knows ?z. Per1 ns:works ?v ?v ns:name “KIT” SP-index PO-index ?z ns: works ?v. ?v ns:name “KIT” = = = Per1 ns:works Ins1 Ins1 ns:name KIT Per1 ns:works Ins1 Ins1 ns:name KIT
  • 74. Matching keyword query against structured data • Retrieve keyword elements • Using inverted index keyword {<el1, score, ...>, <el2, score, ...>,…} • Exploration / “Join” • Data indexes for triple lookup • Materialized index (paths up to graphs) • Top-k Steiner tree search, top-k subgraph exploration Alice Bob KIT Alice Bob KIT ↔ ↔ = = Alice ns:knows Bob Inst1 ns:name KIT Bob ns:works Inst1
  • 75. Matching structured query against text • Based on offline IE (offline see Peter’s slides) • Based on online IE, i.e., “retrieve “ is as follows • Derive keywords to retrieve relevant documents • On-the-fly information extraction, i.e., phrase pattern matching “X name Y” • Retrieve extracted data for structured part • Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp. ?x ns:knows ?y. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” knows name KIT
  • 76. Matching structured query against text • Index • Inverted index for document retrieval and pattern matching • Join index inverted index for storing materialized joins between keywords • Neighborhood indexes for phrase patterns ?x ns:knows ?y. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” KIT name knows name KIT
  • 77. Query processing – main tasks Retrieval Documents , data elements, triples, paths, Query graphs Inverted index,…, but also other (B+ tree) Index documents, triples, materialized paths Matching Join Different join implementations, efficiency depends on availability of indexes Non-blocking join good for early result Data reporting and for “unpredictable” Linked Data / data streams scenario
  • 78. Query processing – more tasks More complex queries: disjunction, aggregation, grouping, analytics… Query Join order optimization Approximate Approximate the search space Approximate the results (matching, join) Matching Parallelization Top-k Use only some entries in the input streams to produce k results Data Multiple sources Federation, routing On-the-fly mapping, similarity join Hybrid Join text and data
  • 79. Query processing on the Web - research challenges and opportunities Large amount of semantic data • Optimization, parallelization Data inconsistent, • Approximation redundant, and low quality • Hybrid querying and data Large amount of data management embedded in text • Federation, routing Large amount of sources • Online schema mappings Large amount of links • Similarity join between sources
  • 81. Structure Problem definition Types of ambiguities Ranking paradigms Model construction Content-based Structure-based
  • 82. Ranking – problem definition Query • Ambiguities arise when representation is incomplete / imprecise • Ambiguities at the level of Matching • elements (content ambiguity) • structure between elements (structure ambiguity) Data Due to ambiguities in the representation of the information needs and the underlying resources, the results cannot be guaranteed to exactly match the query. Ranking is the problem of determining the degree of matching using some notions of relevance.
  • 83. Content ambiguity apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” ?y ns:name “Alice”. ?x ns:knows ? y trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend of Beautiful mine. We went to the Sunset same university, and Germany Semantic Alice Search also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better Germany 2009 Bob photos than I do: Thanh KIT What is meant by “Berlin” in the query? What is meant by “KIT” in the query? What is meant by “Berlin” in the data? What is meant by “KIT” in the data? A city with the name Berlin? a person? A research group? a university? a location?
  • 84. Structure ambiguity apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” ?y ns:name “Alice”. ?x ns:knows ? y trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend of Beautiful mine. We went to the Sunset same university, and Germany Semantic Alice Search also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better Germany 2009 Bob photos than I do: Thanh KIT What is the connection between “Berlin” and What is meant by “works”? “Alice”? Works at? employed? Friend? Co-worker?
  • 85. Ambiguity Recall: query processing is matching at the level of syntax and semantics Ambiguities arise when data or query allow for multiple interpretations, i.e. multiple matches Syntactic, e.g. works vs. works at Semantic, e.g. works vs. employ “Aboutness”, i.e., contain some elements which represent the correct interpretation Ambiguities arise when matching elements of different granularities Does i contains the interpretation for j, given some part(s) of i (syntactically/semantically) match j E.g. Berlin vs. “…we went to the same university, and also, we shared an apartment in Berlin in 2008…” Strictly speaking, ranking is performed after syntactic / semantic matching is done!
  • 86. Features: What to use to deal with ambiguities? What is meant by “Berlin”? What is the connection between “Berlin” and “Alice”? Content features Frequencies of terms: d more likely to be “about” a query term k when d more often, mentions k (probabilistic IR) Co-occurrences: terms K that often co-occur form a contextual interpretation, i.e., topics (cluster hypothesis) Structure features Consider relevance at level of fields Linked-based popularity
  • 87. Ranking paradigms Explicit relevance model Foundation: probability ranking principle Ranking results by the posterior probability (odds) of being observed in the relevant class: P(w|R) varies in different approaches, e.g., binary independence model, 2-poisson model, relevance model P( D | R ) = ∏ P ( w | R)∏ (1 − P( w | N )) P(D|R) P(D/N) w∈D w∉D k P( w | R) ≈ P ( w | q1 ,..., qk ) = ∑ P(m) P(w | m)∑ P(q k | m) m∈M i =1
  • 88. Ranking paradigms No explicit notion of relevance: similarity between the query and the document model Vector space model (cosine similarity) Language models (KL divergence) Sim ( q , d ) = Cos ( ( w1, d ,..., w t , d ), ( w1, q ,..., w k , q )) P (t | θ q ) Sim ( q , d ) = − KL (θ q ||θ d ) = − ∑ P (t | θ q ) log( t ∈V P (t | θ d )
  • 89. Model construction How to obtain Relevance models? Weights for query / document terms? Language models for document / queries?
  • 90. Content-based model construction Document statistics, e.g. • An object is more likely about Term frequency “Berlin”? Document length • When it contains a Collection statistics, e.g. relatively high number of mentions of the term Inverse document frequency “Berlin” Background language models • When the number of mentions of this term in the overall collection is relatively tf wt ,d = ∗ idf low |d | tf P (t | θ d ) = λ + (1 − λ ) P ( t | C ) |d |
  • 91. Structure-based model construction Consider structure of objects during content-based modeling, i.e., to obtain structured content-based model Content-based model for structured objects, documents and for general tuples P (t | θ d ) = ∑α f P (t | θ f ) f ∈ Fd • An object is more likely about “Berlin”? • When one of its (important) fields contains a relatively high number of mentions of the term “Berlin”
  • 92. Structure-based model construction PageRank Link analysis algorithm Measuring relative importance of nodes Link counts as a vote of support The PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links) ObjectRank Types and semantics of links vary in structured data setting Authority transfer schema graph specifies connection strengths Recursively compute authority transfer data graph • An object about “Berlin” is more important than one another? •When a relatively large number of objects are linked to it
  • 93. Taxonomy of ranking approaches Explicitly vs. non-explicitly relevance-based Content-based ranking Structure-based ranking Content- and-structure-based ranking
  • 95. Search interface Input and output functionality helping the user to formulate complex queries presenting the results in an intelligent manner Semantic Search brings improvements in Query formulation Snippet generation Suggesting related entities Adaptive and interactive presentation Presentation adapts to the kind of query and results presented Object results can be actionable, e.g. buy this product Aggregated search Grouping similar items, summarizing results in various ways Filtering (facets), possibly across different dimensions Task completion Help the user to fulfill the task by placing the query in a task context
  • 96. Query formulation “Snap-to-grid”: suggest the most likely interpretation of the query Given the ontology or a summary of the data While the user is typing or after issuing the query Example: Freebase suggest, TrueKnowledge
  • 97. Enhanced results/Rich Snippets Use mark-up from the webpage to generate search snippets Originally invented at Yahoo! (SearchMonkey) Google,Yahoo!, Bing, Yandex now consume schema.org markup Validators available from Google and Bing
  • 98. Other result presentation tasks Select the most relevant resources within an RDF document Penin et al. Snippet Generation for Semantic Web Search Engines, ASWC 2010 For each resource, rank the properties to be displayed Natural Language Generation (NLG) Verbalize, explain results
  • 101. Related entities Related actors and movies
  • 104. Semantic Search challenge (2010/2011) Two tasks Entity Search Queries where the user is looking for a single real world object Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010. List search (new in 2011) Queries where the user is looking for a class of objects Billion Triples Challenge 2009 dataset Evaluated using Amazon’s Mechanical Turk Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010 Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR2011
  • 106. Catching the bad guys Payment can be rejected for workers who try to game the system An explanation is commonly expected, though cheaters rarely complain We opted to mix control questions into the real results Gold-win cases that are known to be perfect Gold-loose cases that are known to be bad Metrics Avg. and std. dev on gold-win and gold-loose results Time to complete Worker Known Real Known Good Total Time to bad N complete N Mean N Mean N Mean (sec) badguy 20 2.556 200 2.738 20 2.684 240 29.6 goodguy 13 1 130 2.038 13 3 156 95 whoknows 1 1 21 1.571 2 3 24 83.5
  • 107. Other evaluations TREC Entity Track Related Entity Finding Entities related to a given entity through a particular relationship Retrieval over documents (ClueWeb 09 collection) Example: (Homepages of) airlines that fly Boeing 747 Entity List Completion Given some elements of a list of entities, complete the list Question Answering over Linked Data Retrieval over specific datasets (Dbpedia and MusicBrainz) Full natural language questions of different forms Correct results defined by an equivalent SPARQL query Example: Give me all actors starring in Batman Begins.
  • 109. Resources Books Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press. 2011 Survey papers ThanhTran, Peter Mika. Survey of Semantic Search Approaches. Under submission, 2012. Conferences and workshops ISWC, ESWC, WWW, SIGIR, CIKM, SemTech Semantic Search workshop series Exploiting Semantic Annotations in Information Retrieval (ESAIR) Entity-oriented Search (EOS) workshop