Peter Mika | Yahoo Research, Spain pmika@yahoo-inc.com Thanh Tran | Institute AIFB, KIT, Germany Tran@aifb.uni-karlsruhe.de Semantic Search Tutorial: Introduction
About the speakers Peter Mika Semantic Search group at Yahoo! Barcelona Semantic Search, Web Object Retrieval, Natural Language Processing Tran Duc Thanh Semantic Search group at AIFB Semantic Search, Semantic Data Management, Linked Data Query Processing
Agenda Introduction (5 min) Semantic Web data (50 min) The RDF data model Publishing RDF Crawling and indexing RDF data Query processing (35 min) Ranking (25 min)  Result presentation (15 min) Semantic Search evaluation (15 min) Demos (15 min) Questions (5 min)
Why Semantic Search? I. “We are at the beginning of search.“ (Marissa Mayer) Solved large classes of queries, e.g. navigational Heavy investment in computational power Remaining queries are hard, not solvable by brute force, and require a deep understanding of the world and human cognition Background knowledge and metadata can help to address poorly solved queries
Poorly solved information needs Ambiguous searches paris hilton Long tail queries george bush (and I mean the beer brewer in Arizona) Multimedia search paris hilton sexy Imprecise or overly precise searches  jim hendler pictures of strong adventures people Precise searches for descriptions countries in africa 32 year old computer scientist living in barcelona reliable digital camera under 300 dollars Many of these queries would not be asked by users, who learned over time what search technology can and cannot do.
Example: multiple interpretations
Why Semantic Search? II. The Semantic Web is now a reality Large amounts of data published in RDF Heterogeneous data of varying quality Users who are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domain Searching data instead of or in addition to searching documents Direct answers Novel search tasks
Information box with content from and links to Yahoo! Travel Example: direct answers in search Points of interest in Vienna, Austria Shopping results from  Yahoo! Shopping Since Aug, 2010, ‘regular’ search results are ‘Powered by Bing’
Novel search tasks Aggregation of search results e.g. price comparison across websites Analysis and prediction e.g. world temperature by 2020 Semantic profiling recommendations based on particular interests Semantic log analysis understanding user behavior in terms of objects  Support for complex tasks e.g. booking a vacation using a combination of services
Document retrieval and data retrieval Information Retrieval (IR) supports the retrieval of documents (document retrieval) Representation based on lightweight syntax-centric models  Works well for topical search Not so well for more complex information needs Web scale Database (DB) and Knowledge-based Systems (KB) deliver more precise answers (data retrieval) More expressive models  Allow for complex queries Retrieve concrete answers that precisely match queries Not just matching and filtering, but also joins Limitations in scalability
Combination of document and data retrieval  Documents with metadata Metadata may be embedded inside the document I’m looking for documents that mention countries in Africa. Data retrieval Structured data, but searchable text fields I’m looking for directors, who have directed movies where the synopsis mentions dinosaurs.
Semantic Search Target (combination of) document and data retrieval Semantic search is a retrieval paradigm that Exploits the structure/semantics of the data or explicit background knowledge to understand user intent and the meaning of content Incorporates the intent of the query and the meaning of content into the search process (semantic models) Wide range of semantic search systems Employ different semantic models, possibly at different steps of the search process and in order to support different tasks
Semantic models Semantics is concerned with the meaning of the resources made available for search Various representations of meaning Linguistic models: models of relationships among words Taxonomies, thesauri, dictionaries of entity names Inference along linguistic relations, e.g. broader/narrower terms Natural language search Conceptual models: models of relationships among objects Ontologies capture entities in the world and their relationships Inference along domain-specific relations Knowledge-based search We will focus on conceptual models in this tutorial In particular, the RDF/OWL conceptual model for representing classes in a domain, and describing their instances
Semantic Search –  a process view Knowledge Representation Semantic Models Resources Documents DocumentRepresentation
Semantic Search systems 	For data / document retrieval, semantic search systems might combine a range of techniques, ranging from statistics-based IR methods for ranking, database methods for efficient indexing and query processing, up to complex reasoning techniques for making inferences!
Semantic Web data
Semantic Web Sharing data across the Web Standard data model RDF A number of syntaxes (file formats) RDF/XML, RDFa Powerful, logic-based languages for schemas OWL, RIF Query languages and protocols HTTP, SPARQL
Resource Description Framework (RDF) Each resource (thing, entity) is identified by a URI Globally unique identifiers Locators of information Data is broken down into individual facts Triples of (subject, predicate, object) A set of triples (an RDF graph) is published together in an RDF document RDF document foaf:Person type example:roi name “Roi Blanco”
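The triple model on this slide can be illustrated with a minimal Python sketch. It is not an RDF library: the prefixed names are kept as plain strings, and the `match` helper and the `rdf:type` / `foaf:name` property spellings are assumptions made for illustration.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# The example RDF document from the slide, as a set of triples.
graph: Set[Triple] = {
    ("example:roi", "rdf:type", "foaf:Person"),
    ("example:roi", "foaf:name", "Roi Blanco"),
}


def match(graph: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that agrees with the given pattern.

    None acts as a wildcard for that position.
    """
    for triple in graph:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


# Look up the name of example:roi.
names = [o for (_, _, o) in match(graph, s="example:roi", p="foaf:name")]
```

Each fact stands on its own, so merging two RDF documents amounts to taking the union of their triple sets.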
Linking resources Friend-of-a-Friend ontology Roi’s homepage type example:roi foaf:Person name “Roi Blanco” sameAs Yahoo type worksWith example:roi2 example:peter email “pmika@yahoo-inc.com”
Publishing RDF Linked Data Data published as RDF documents linked to other RDF documents Community effort to re-publish large public datasets (e.g. DBpedia, open government data) RDFa Data embedded inside HTML pages Recommended for site owners by Yahoo, Google, Facebook SPARQL endpoints Triple stores (RDF databases) that can be queried through the web
Linked Data A web of RDF documents in parallel to the current Web often implemented as wrappers around databases or APIs The four rules of Linked Data: Use URIs to identify things. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents. Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
Linked Data Advantages:  No change to the publishing of the HTML documents Data can be published by third party (e.g. DBpedia) Disadvantages: Web servers need to be configured to properly handle URIs that identify concepts instead of documents Not favored by search engines  Lack of use cases Crawling needs to be changed Authority is difficult to determine Tools Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby) RDB-to-RDF mappers (e.g. D2RQ, Triplify) Validators (Vapour) Linked Data browsers (many)
The state of Linked Data Rapidly growing community effort to (re)publish open datasets as Linked Data In particular, scientific and government datasets see linkeddata.org Less commercial interest, real usage
Metadata in HTML 1995: HTML meta tags 1996: Simple HTML Ontology Extensions (SHOE) 1998: RDF/XML RDF/XML in HTML RDF linked from HTML 2003: Web 2.0 Tagging Microformats Metadata in Wikipedia Machine tags in Flickr 2005: eRDF  2008: RDFa 1.0 2011: RDFa 1.1 2012: Microdata?
HTML meta tags <HTML> <HEAD profile="http://dublincore.org/documents/dcq-html/"> <META name="DC.author" content="Peter Mika"> <LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />  <LINK rel="meta" type="application/rdf+xml" title="FOAF"  	   href= "http://www.cs.vu.nl/~pmika/foaf.rdf">  </HEAD>  … </HTML>
Microformats (μf) Agreements on the way to encode certain kinds of metadata in HTML Reuse of semantic-bearing HTML elements Based on existing standards Minimality Microformats exist for a limited set of objects hCard (persons and organizations) hCalendar (events) hResume hProduct hRecipe Varying degrees of support and stability hCard and rel-tag are widely supported Community centered around microformats.org Specifications and discussions are hosted there
Microformats: limitations No shared syntax Each microformat has a separate syntax tailored to the vocabulary  No formal schemas Limited reuse, extensibility of schemas Unclear which combinations are allowed No datatypes No namespaces, unique identifiers (URIs)  no interlinking mapping between instances is required Always appears in the HTML <body>
Example: the hCard microformat <div class="vcard">    <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>      <div class="tel">+1-919-555-7878</div>    <div class="title">Area Administrator, Assistant</div>  </div>  <cite class="vcard"> <a class="fn url" rel="friend colleague met" href="http://meyerweb.com/"> Eric Meyer</a> </cite> wrote a post (<cite> <a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/"> Tax Relief</a></cite>) about an unintentionally humorous letter he received from  the <span class="vcard"> <a class="fn org url" href="http://irs.gov/"> Internal Revenue Service</a>  </span>.
RDFa W3C standard for embedding RDF data in HTML documents A set of new HTML attributes to be used in head or body A specification of how to extract the data from these attributes  RDFa is just a syntax, you have to choose a vocabulary separately RDFa 1.0 is a W3C Recommendation since October, 2008 RDFa Primer RDFa 1.1 is a small update on RDFa to make it easier to use Currently Working Draft (March 31, 2011) Updated version of the RDFa Primer (April 19, 2011) RDFa API for accessing RDFa data in a webpage in the browser from JavaScript Currently Working Draft (April 19, 2011)
RDFa 1.1 Changes New vocab attribute to define the default namespace for the document or subtree Profile documents to define multiple namespace prefixes The prefix attribute as a recommended replacement of xmlns You can use URIs even where only CURIEs were allowed before RDFa 1.1 is backward compatible with RDFa 1.0 RDFa 1.1 is recommended if you want to use HTML5
When to use RDFa Choose microformats when you find a microformat that fits your needs and is supported by your consumers Microformats are the first option because they are simple Yahoo supports all major microformats, see the documentation It’s a common misconception that RDFa requires XHTML or that it’s incompatible with HTML5 It’s compatible with HTML4, HTML5, XHTML If you find none that perfectly fits your needs then you need RDFa Microformats have a fixed schema: you cannot add your own attributes Example: a social networking site with user profiles VCard is a good candidate, but for example it doesn’t have a way to express the user’s social connections You either live without this, or go with RDFa
Example: Facebook’s Like and the Open Graph Protocol The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities  Shows up in profiles and news feed Site owners can later reach users who have liked an object Facebook Graph API allows 3rd party developers to access the data  Open Graph Protocol is an RDFa-based format that allows publishers to describe the object that the user ‘Likes’
Example: Facebook’s Open Graph Protocol RDF vocabulary to be used in conjunction with RDFa Simplify the work of developers by restricting the freedom in RDFa Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment Only HTML <head> accepted <html xmlns:og="http://opengraphprotocol.org/schema/">  <head>  	<title>The Rock (1996)</title>  	<meta property="og:title" content="The Rock" />  	<meta property="og:type" content="movie" />  	<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />  	<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> … </head> ...
Example: Yahoo! Enhanced Results (was: SearchMonkey) Guide for publishers to mark-up their pages for common types of objects Product, Local, News, Video, Events, Documents, Discussion, Games Using popular microformats and RDF vocabularies Copy-paste code  Validator Yahoo as a consumer See later
Example: Google’s Rich Snippets Google accepts popular microformats and its own RDFa vocabulary Similar approach to RDFa as Facebook Validator to check if the markup is correct Google displays enhanced results based on this metadata Rich Snippets
RDFa on the rise 510% increase between March, 2009 and October, 2010 Percentage of URLs with embedded metadata in various formats
Microdata Currently under standardization at the W3C Originally part of the HTML5 spec, but now a separate document Similar to microformats, but with the extensibility of RDFa Introduce new terms using reverse domain names or full URIs HTML5 also has a number of “semantic” elements such as <time>, <video>, <article>…
Microdata example <div itemscope itemid="http://www.yahoo.com/resource/person">  <p>My name is <span itemprop="name">Neil</span>.</p>  <p>My band is called   <span itemprop="band">Four Parts Water</span>.  I was born on   <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.  <img itemprop="image" src="me.png" alt="me">  </p> </div>
The state of metadata in HTML 5-10% of webpages contain some explicit metadata Depending on how you count… Too many competing approaches Too many formats: microformats vs RDFa vs Microdata When using RDFa, publishers may need to use multiple different vocabularies to satisfy everyone
Crawling the Semantic Web Linked Data Similar to HTML crawling, but the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled Semantic Sitemap/VOID descriptions RDFa Same as HTML crawling, but data is extracted after crawling Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010. SPARQL endpoints Endpoints are not linked, need to be discovered by other means Semantic Sitemap/VOID descriptions
Data fusion Ontology matching Widely studied in Semantic Web research, see e.g. list of publications at ontologymatching.org Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies Entity resolution Logic-based approaches in the Semantic Web Studied as record linkage in the database literature  Machine learning based approaches, focusing on attributes Graph-based approaches, see e.g. the work of Lise Getoor, are applicable to RDF data Improvements over only attribute based matching Blending Merging objects that represent the same real world entity and reconciling information from multiple sources
Data quality assessment and curation Heterogeneity, quality of data is an even larger issue Quality ranges from well-curated data sets (e.g. Freebase) to microformats  In the worst of cases, the data becomes a graph of words Short amounts of text: prone to mistakes in data entry or extraction Example: mistake in a phone number or state code Quality assessment and data curation Quality varies from data created by experts to user-generated content Automated data validation Against known-good data or using triangulation Validation against the ontology or using probabilistic models Data validation by trained professionals or crowdsourcing Sampling data for evaluation Curation based on user feedback
Indexing Search requires matching and ranking Matching selects a subset of the elements to be scored The goal of indexing is to speed up matching Retrieval needs to be performed in milliseconds Without an index, retrieval would require streaming through the collection The type of index depends on the query model to support DB-style indexing IR-style indexing
IR-style indexing Index data as text Create virtual documents from data One virtual document per subgraph, resource or triple typically: resource Key differences to Text Retrieval RDF data is structured Minimally, queries on property values are required
Horizontal index structure Two fields (indices): one for terms, one for properties For each term, store the property on the same position in the property index Positions are required even without phrase queries Query engine needs to support the alignment operator ✓  Dictionary is number of unique terms + number of properties Occurrences is number of tokens * 2
Vertical index structure One field (index) per property Positions are not required But useful for phrase queries Query engine needs to support fields Dictionary is number of unique terms Occurrences is number of tokens ✗ Number of fields is a problem for merging, query performance
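The horizontal and vertical layouts described on the last two slides can be compared with a small Python sketch. This is an illustration under stated assumptions, not the indexing code behind the tutorial: the two virtual documents, the whitespace tokenizer, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical virtual documents: one per resource, property -> text value.
docs = {
    "example:roi": {"name": "Roi Blanco", "worksWith": "Peter Mika"},
    "example:peter": {"name": "Peter Mika", "email": "pmika@yahoo-inc.com"},
}


def build_horizontal(docs):
    """Two fields: a term index and a property index sharing positions."""
    terms = defaultdict(set)  # term -> {(doc, position)}
    props = defaultdict(set)  # property -> {(doc, position)}
    for doc_id, fields in docs.items():
        pos = 0
        for prop, text in fields.items():
            for token in text.lower().split():
                terms[token].add((doc_id, pos))
                props[prop].add((doc_id, pos))  # same position as the term
                pos += 1
    return terms, props


def build_vertical(docs):
    """One field (index) per property; no positions needed."""
    index = defaultdict(lambda: defaultdict(set))  # property -> term -> {doc}
    for doc_id, fields in docs.items():
        for prop, text in fields.items():
            for token in text.lower().split():
                index[prop][token].add(doc_id)
    return index


def search_horizontal(terms, props, prop, term):
    """Alignment operator: term and property must share doc and position."""
    aligned = terms.get(term, set()) & props.get(prop, set())
    return {doc_id for doc_id, _ in aligned}


def search_vertical(index, prop, term):
    """Field-restricted lookup in the per-property index."""
    return set(index.get(prop, {}).get(term, set()))


terms, props = build_horizontal(docs)
vertical = build_vertical(docs)
```

With these seven tokens the horizontal layout stores 14 postings (tokens * 2) in two fields, while the vertical layout stores 7 postings spread over three fields, which mirrors the trade-off stated on the slides.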
Indexing using MapReduce MapReduce is the perfect model for building inverted indices Map creates (term, {doc1}) pairs Reduce collects all docs for the same term: (term, {doc1, doc2…}) Sub-indices are merged separately Term-partitioned indices Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
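The map and reduce steps can be simulated in a single Python process. This sketch only mimics the data flow of the two phases; it does not use Hadoop or any real MapReduce framework, and the function names and sample documents are assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce: collect all docs for the same term into one posting list."""
    return term, sorted(set(doc_ids))


# Hypothetical two-document collection.
docs = {"doc1": "semantic search tutorial", "doc2": "semantic web data"}

pairs = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
index = dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())
```

Because each reducer sees all postings of a term, the output is naturally term-partitioned, which matches the last bullet of the slide.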
Query Processing
Structure Taxonomy of retrieval approaches Query processing for semantic search Types of semantic data Formalisms for querying semantic data Approaches General task: hybrid graph pattern matching Matching keyword query against text Matching structured query against structured data Matching keyword query against structured data Matching structured query against text (a hybrid case) Main tasks, challenges and opportunities
Taxonomy of retrieval approaches (1) Data retrieval problem A collection of resources represented by the data model G Information needs expressed as queries in Q Retrieval is the task of efficiently computing results from G that are relevant to the queries in Q Document retrieval vs. data retrieval  Differences in query and data representation and matching Efficiently retrieve structured data that exactly match formal information needs expressed as structured queries Effectively rank textual results that match ambiguous NL / keyword queries to a certain degree (notions of relevance)
Taxonomy of retrieval approaches (2) Matching a query against data ranges from exact, complete, and sound matching to matching that is not complete, not sound, ranked, best effort, or top-k. Query processing mainly focuses on efficiency of matching whereas ranking deals with degree of matching (relevance)!
Query processing for Semantic Search (1) The underlying collection of resources is represented by semantic data G ranging from Structured data with well defined schemas Semi-structured data with incomplete or no schemas Data that largely comprise text Hybrid / embedded data Targeted information needs Q are of varying complexity, captured using different formalisms and querying paradigms  Natural language texts and keywords  Form-based inputs  Formal structured queries Semantic search, mainly, is the task of efficiently computing results (query processing) from G that are relevant to the queries in Q (ranking)
Query processing for Semantic Search (2) Keywords NL Questions Form- / facet-based Inputs Structured Queries (SPARQL) Query Matching Data OWL ontologies with rich, formal semantics Structured RDF data Semi-Structured RDF data RDF data embedded in text (RDFa)
Query processing for Semantic Search (3) Four combinations of query and data type: keyword query on textual data (IR/document retrieval); structured query on textual data; keyword query on structured data; structured query on structured data (DB/data retrieval). Semantic search targets different groups of users, information needs, and types of data. Query processing for semantic search is a hybrid combination of techniques!
Types of data models (1) Textual Bag-of-words Represent documents, text in structured data,…, real-world objects (captured as structured data) Lacks “structure”  in text, e.g. linguistic structure, hyperlinks, (positional information) Structure in structured data representation term (statistics) In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum. combination  Cloud  Computing  Technologies solutions  management  `big data'  industry  solutions  support  complex  ……
Types of data models (2) Textual Structured  Resource Description Framework (RDF)  Represent real-world objects, services, applications, …. documents Resource attribute values and relationships between resources Schema Picture creator Person Bob
Types of data models (3) Textual Structured Hybrid RDF data embedded in text (RDFa)
Types of data models – RDFa (1) … <div about="/alice/posts/trouble_with_bob">       <h2 property="dc:title">The trouble with Bob</h2> <h3 property="dc:creator">Alice</h3> Bob is a good friend of mine. We went to the same university, and   	also shared an apartment in Berlin in 2008. The trouble with Bob is 	that he takes much better photos than I do:       <div about="http://example.com/bob/photos/sunset.jpg"> <img src="http://example.com/bob/photos/sunset.jpg" /> <span property="dc:title">Beautiful Sunset</span>         by <span property="dc:creator">Bob</span>. </div> </div> … adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
Types of semantic data – RDFa (2)  Bob is a good friend of mine. We went to the same university, and  also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: content content adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
Types of semantic data - conclusion Semantic data in general can be conceived as a graph with text and structured data items as nodes, and edges represent different types of relationships including explicit semantic relationships and vaguely specified ones such as hyperlinks!
Formalisms for querying semantic data (1) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured queries Fully-structured queries Hybrid queries: unstructured + structured
Formalisms for querying semantic data (2) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured NL Keywords  apartment Berlin Alice shared
Formalisms for querying semantic data (3) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured Fully-structured SPARQL: BGP, filter, optional, union, select, construct, ask, describe  PREFIX ns: <http://example.org/ns#>  SELECT ?x  WHERE { ?x ns:knows ?y . ?y ns:name "Alice" . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT" }
Formalisms for querying semantic data (4) Fully-structured Unstructured  Hybrid: content and structure constraints “shared apartment Berlin Alice” ?x ns:knows ?y . ?y ns:name "Alice" . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT"
Formalisms for querying semantic data (5) Fully-structured Unstructured  Hybrid: content and structure constraints “shared apartment Berlin Alice” ?x ns:knows ?y . ?y ns:name "Alice" . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT"
Formalisms for querying semantic data - conclusion Semantic search queries can be conceived as graph patterns with nodes referring to text and structured data items, and edges referring to relationships between these items!
Processing hybrid graph patterns (1) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Hybrid query: keywords “apartment shared Berlin Alice” plus the triple patterns ?y ns:name "Alice" . ?x ns:knows ?y . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT" [Figure: example data graph with nodes such as Alice, Bob, Thanh, Peter, KIT, FluidOps, Germany, Semantic Search, 2009, 34, trouble with bob, and sunset.jpg (titled Beautiful Sunset); edges labeled knows, works, located, creator, author, title, year, and age; and the embedded text “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:”]
Processing hybrid graph patterns (2) Matching hybrid graph patterns against data
Matching keyword query against text
Inverted list (inverted index): keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...}
Example: the query keywords shared, Berlin, and Alice each match terms in document D1
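A minimal Python sketch of keyword matching over an inverted list follows. The documents D1 and D2, the whitespace tokenizer, and the function names are assumptions for illustration; scores are omitted and only positions are stored.

```python
from collections import defaultdict


def build_index(docs):
    """Build keyword -> {doc_id: [positions]} from raw text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def match_all(index, query):
    """Conjunctive matching: documents containing every query keyword."""
    postings = [set(index.get(keyword, {})) for keyword in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents; D1 paraphrases the running example.
docs = {
    "D1": "Alice shared an apartment in Berlin",
    "D2": "Bob works in Berlin",
}
index = build_index(docs)
hits = match_all(index, "shared Berlin Alice")
```

Intersecting posting lists is the cheap matching step; deciding how well D1 answers the query is left to ranking.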
Matching structured query against structured data
Index on tables
Multiple “redundant” indexes to cover different access patterns
Join (conjunction of triples)
Blocking, e.g. linear merge join (requires sorted input)
Non-blocking, e.g. symmetric hash-join
Materialized join indexes
Example query: ?x ns:knows ?y . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT"
Example: Per1 ns:works ?v is answered from the SP-index (Per1 ns:works Ins1), ?v ns:name "KIT" from the PO-index (Ins1 ns:name KIT), and the two are joined on ?v
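The lookup-then-join example can be sketched in Python. The triples beyond the two shown on the slide, the dictionary-based SP and PO indexes, and the function names are assumptions. The join is a plain build-and-probe hash join, used here as a simpler stand-in for the symmetric hash join named on the slide.

```python
from collections import defaultdict

# Hypothetical triple table; the first and third rows come from the slide.
triples = [
    ("Per1", "ns:works", "Ins1"),
    ("Per1", "ns:knows", "Per2"),
    ("Ins1", "ns:name", "KIT"),
    ("Ins2", "ns:name", "AIFB"),
]

# Two "redundant" indexes covering different access patterns.
sp_index = defaultdict(list)  # (subject, predicate) -> [object]
po_index = defaultdict(list)  # (predicate, object) -> [subject]
for s, p, o in triples:
    sp_index[(s, p)].append(o)
    po_index[(p, o)].append(s)


def hash_join(left, right, var):
    """Join two lists of variable bindings on a shared variable."""
    table = defaultdict(list)
    for row in left:                      # build phase
        table[row[var]].append(row)
    joined = []
    for row in right:                     # probe phase
        for partner in table.get(row[var], []):
            joined.append({**partner, **row})
    return joined


# Per1 ns:works ?v  -> SP-index lookup
left = [{"?v": o} for o in sp_index.get(("Per1", "ns:works"), [])]
# ?v ns:name "KIT"  -> PO-index lookup
right = [{"?v": s} for s in po_index.get(("ns:name", "KIT"), [])]

result = hash_join(left, right, "?v")
```

A symmetric hash join would keep a hash table on both inputs and emit results as rows arrive from either side, which is what makes it non-blocking.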
Matching keyword query against structured data
Using inverted index: keyword → {<el1, score, ...>, <el2, score, ...>, …}
Data indexes for triple lookup
Materialized index (paths up to graphs)
Top-k Steiner tree search, top-k subgraph exploration
Example: the keywords Alice, Bob, and KIT match data elements that are connected through the triples Alice ns:knows Bob, Bob ns:works Inst1, Inst1 ns:name KIT
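The two steps (map keywords to data elements, then find a connecting subgraph) can be sketched in Python. Breadth-first search for a single shortest connecting path is used here as a deliberately simple stand-in for top-k Steiner tree search or top-k subgraph exploration; the keyword index over node labels and all function names are assumptions.

```python
from collections import defaultdict, deque

# The example triples from the slide.
triples = [
    ("Alice", "ns:knows", "Bob"),
    ("Bob", "ns:works", "Inst1"),
    ("Inst1", "ns:name", "KIT"),
]

adjacency = defaultdict(set)      # undirected view of the data graph
keyword_index = defaultdict(set)  # lowercased label -> matching elements
for s, p, o in triples:
    adjacency[s].add(o)
    adjacency[o].add(s)
    for node in (s, o):
        keyword_index[node.lower()].add(node)


def connect(start, goal):
    """Return a shortest path between two elements, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(adjacency[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


def keyword_search(kw1, kw2):
    """Map both keywords to elements and return a connecting path."""
    for a in sorted(keyword_index.get(kw1.lower(), ())):
        for b in sorted(keyword_index.get(kw2.lower(), ())):
            path = connect(a, b)
            if path is not None:
                return path
    return None


answer = keyword_search("alice", "kit")
```

A real system would score candidate subgraphs and return the best k rather than the first path found.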
Matching structured query against text
Based on online IE; retrieval proceeds as follows
Derive keywords to retrieve relevant documents
On-the-fly information extraction, i.e., phrase pattern matching “X title Y”
Retrieve extracted data for structured part
Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.
Index
Inverted index for document retrieval and pattern matching
Join index: inverted index for storing materialized joins between keywords
Neighborhood indexes for phrase patterns
Example query (hybrid case): ?x ns:knows ?y . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT" [Figure: text patterns derived from the query, labeled KIT, name, and knows]
Query processing – main tasks Retrieval Documents, data elements, triples, paths, graphs Inverted index,…, but also other indexes (B+ tree) Index documents, triples, materialized join paths Join Different join implementations, efficiency depends on availability of indexes Non-blocking join good for early result reporting and for “unpredictable” linked data scenario Query Matching Data
Query processing – more tasks Disjunction, aggregation, grouping Join order optimization Approximate  Approximate the search space  Approximate the results (matching, join) Parallelization Top-k  Use only some entries in the input streams to produce k results Multiple sources Federation, routing  On-the-fly mapping, similarity join  Hybrid Join text and data Query Matching Data
Query processing on the Web - research challenges and opportunities  Large amount of semantic data Data inconsistent, redundant, and low quality  Large amount of data embedded in text  Large amount of sources Large amount of links between sources
Approximation
Hybrid querying and data management
Federation, routing
Online schema mappings
Similarity join
Structure Problem definition Types of ambiguities Ranking paradigms Model construction Content-based Structure-based
Ranking – problem definition [Diagram: Query, Matching, Data]
Ambiguities at the level of
elements (content ambiguity)
structure between elements (structure ambiguity)
Due to ambiguities in the representation of the information needs and the underlying resources, the results cannot be guaranteed to exactly match the query. Ranking is the problem of determining the degree of matching using some notions of relevance.
Content ambiguity [Figure: the hybrid query (keywords “apartment shared Berlin Alice” plus the triple patterns ?y ns:name "Alice" . ?x ns:knows ?y . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT") shown against the example data graph] What is meant by “Berlin” in the query? What is meant by “Berlin” in the data? A city with the name Berlin? A person? What is meant by “KIT” in the query? What is meant by “KIT” in the data? A research group? A university? A location?
Structure ambiguity [Figure: the hybrid query (keywords “apartment shared Berlin Alice” plus the triple patterns ?y ns:name "Alice" . ?x ns:knows ?y . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT") shown against the example data graph] What is the connection between “Berlin” and “Alice”? Friend? Co-worker? What is meant by “works”? Works at? Employed?
Ambiguity Recall: query processing is matching at the level of syntax and semantics  Ambiguities arise when data or query allow for multiple interpretations, i.e. multiple matches Syntactic, e.g. works vs. works at Semantic, e.g. works vs. employ “Aboutness”, i.e., contain some elements which represent the correct interpretation  Ambiguities arise when matching elements of different granularities Does i contain the interpretation for j, given that some part(s) of i (syntactically/semantically) match j? E.g. Berlin vs. “…we went to the same university, and also, we shared an apartment in Berlin in 2008…” Strictly speaking, ranking is performed after syntactic / semantic matching is done!
Features: What to use to deal with ambiguities? What is meant by “Berlin”? What is the connection between “Berlin” and “Alice”?  Content features Frequencies of terms: d is more likely to be “about” a query term k when d mentions k more often (probabilistic IR) Co-occurrences: terms K that often co-occur form a contextual interpretation, i.e., topics (cluster hypothesis)  Structure features Consider relevance at level of fields Link-based popularity
Ranking paradigms Explicit relevance model  Foundation: probability ranking principle Ranking results by the posterior probability (odds) of being observed in the relevant class: P(w|R) varies in different approaches, e.g., binary independence model, 2-Poisson model, relevance model
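The formula that followed the colon on this slide did not survive extraction. As a hedged reminder, the textbook form of odds-based ranking is given below; it is not necessarily the exact expression shown in the talk. It assumes that terms are independent given the class, considers only the terms w occurring in document D, and writes \bar{R} for the non-relevant class.

```latex
\[
O(R \mid D)
  = \frac{P(R \mid D)}{P(\bar{R} \mid D)}
  = \frac{P(D \mid R)}{P(D \mid \bar{R})} \cdot \frac{P(R)}{P(\bar{R})}
  \;\propto\; \prod_{w \in D} \frac{P(w \mid R)}{P(w \mid \bar{R})}
\]
```

The prior odds P(R)/P(\bar{R}) are the same for every document, so they do not affect the ranking. The approaches listed on the slide differ in how P(w|R) is estimated.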
Ranking paradigms No explicit notion of relevance: similarity between the query and the document model Vector space model (cosine similarity) Language models (KL divergence)
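Both similarity-based paradigms can be illustrated with a short Python sketch. It assumes raw term-frequency vectors for the vector space model and additively smoothed unigram language models for the KL divergence; the smoothing parameter `mu`, the sample texts, and the function names are hypothetical choices, not taken from the tutorial.

```python
import math
from collections import Counter


def cosine(query, doc):
    """Cosine similarity between raw term-frequency vectors."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


def language_model(text, vocab, mu=1.0):
    """Unigram language model with additive smoothing over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + mu * len(vocab)
    return {t: (counts[t] + mu) / total for t in vocab}


def kl_divergence(p, q):
    """KL(p || q); lower means the document model is closer to the query model."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


query = "berlin apartment"
d1 = "alice shared an apartment in berlin"
d2 = "bob takes better photos"

vocab = set((query + " " + d1 + " " + d2).split())
q_model = language_model(query, vocab)
kl_d1 = kl_divergence(q_model, language_model(d1, vocab))
kl_d2 = kl_divergence(q_model, language_model(d2, vocab))
```

Documents are ranked by descending cosine similarity or by ascending KL divergence; in this example both orderings place d1 ahead of d2.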
Model construction How to obtain Relevance models? Weights for query / document terms? Language models for document / queries?
Content-based model construction Document statistics, e.g.  Term frequency  Document length  Collection statistics, e.g.  Inverse document frequency Background language models

SemTech 2011 Semantic Search tutorial

  • 1. Peter Mika | Yahoo Research, Spain pmika@yahoo-inc.com Thanh Tran | Institute AIFB, KIT, Germany Tran@aifb.uni-karlsruhe.de Semantic Search Tutorial: Introduction
  • 2. About the speakers Peter Mika Semantic Search group at Yahoo! Barcelona Semantic Search, Web Object Retrieval, Natural Language Processing Tran Duc Thanh Semantic Search group at AIFB Semantic Search, Semantic Data Management, Linked Data Query Processing
  • 3. Agenda Introduction (5 min) Semantic Web data (50 min) The RDF data model Publishing RDF Crawling and indexing RDF data Query processing (35 min) Ranking (25 min) Result presentation (15 min) Semantic Search evaluation (15 min) Demos (15 min) Questions (5 min)
  • 4. Why Semantic Search? I. “We are at the beginning of search.” (Marissa Mayer) Solved large classes of queries, e.g. navigational Heavy investment in computational power Remaining queries are hard, not solvable by brute force, and require a deep understanding of the world and human cognition Background knowledge and metadata can help to address poorly solved queries
  • 5. Poorly solved information needs Ambiguous searches paris hilton Long tail queries george bush (and I mean the beer brewer in Arizona) Multimedia search paris hilton sexy Imprecise or overly precise searches jim hendler pictures of strong adventures people Precise searches for descriptions countries in africa 32 year old computer scientist living in barcelona reliable digital camera under 300 dollars Many of these queries would not be asked by users, who have learned over time what search technology can and cannot do.
  • 7. Why Semantic Search? II. The Semantic Web is now a reality Large amounts of data published in RDF Heterogeneous data of varying quality Users who are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domain Searching data instead or in addition to searching documents Direct answers Novel search tasks
  • 8. Information box with content from and links to Yahoo! Travel Example: direct answers in search Points of interest in Vienna, Austria Shopping results from Yahoo! Shopping Since Aug, 2010, ‘regular’ search results are ‘Powered by Bing’
  • 9. Novel search tasks Aggregation of search results e.g. price comparison across websites Analysis and prediction e.g. world temperature by 2020 Semantic profiling recommendations based on particular interests Semantic log analysis understanding user behavior in terms of objects Support for complex tasks e.g. booking a vacation using a combination of services
  • 10. Document retrieval and data retrieval Information Retrieval (IR) supports the retrieval of documents (document retrieval) Representation based on lightweight syntax-centric models Works well for topical search Not so well for more complex information needs Web scale Database (DB) and Knowledge-based Systems (KB) deliver more precise answers (data retrieval) More expressive models Allow for complex queries Retrieve concrete answers that precisely match queries Not just matching and filtering, but also joins Limitations in scalability
  • 11. Combination of document and data retrieval Documents with metadata Metadata may be embedded inside the document I’m looking for documents that mention countries in Africa. Data retrieval Structured data, but searchable text fields I’m looking for directors who have directed movies where the synopsis mentions dinosaurs.
  • 12. Semantic Search Target (combination of) document and data retrieval Semantic search is a retrieval paradigm that Exploits the structure/semantics of the data or explicit background knowledge to understand user intent and the meaning of content Incorporates the intent of the query and the meaning of content into the search process (semantic models) Wide range of semantic search systems Employ different semantic models, possibly at different steps of the search process and in order to support different tasks
  • 13. Semantic models Semantics is concerned with the meaning of the resources made available for search Various representations of meaning Linguistic models: models of relationships among words Taxonomies, thesauri, dictionaries of entity names Inference along linguistic relations, e.g. broader/narrower terms Natural language search Conceptual models: models of relationships among objects Ontologies capture entities in the world and their relationships Inference along domain-specific relations Knowledge-based search We will focus on conceptual models in this tutorial In particular, the RDF/OWL conceptual model for representing classes in a domain, and describing their instances
  • 14. Semantic Search – a process view Knowledge Representation Semantic Models Resources Documents Document Representation
  • 15. Semantic Search systems For data / document retrieval, semantic search systems might combine a range of techniques, ranging from statistics-based IR methods for ranking, database methods for efficient indexing and query processing, up to complex reasoning techniques for making inferences!
  • 17. Semantic Web Sharing data across the Web Standard data model RDF A number of syntaxes (file formats) RDF/XML, RDFa Powerful, logic-based languages for schemas OWL, RIF Query languages and protocols HTTP, SPARQL
  • 18. Resource Description Framework (RDF) Each resource (thing, entity) is identified by a URI Globally unique identifiers Locators of information Data is broken down into individual facts Triples of (subject, predicate, object) A set of triples (an RDF graph) is published together in an RDF document RDF document foaf:Person type example:roi name “Roi Blanco”
  • 19. Linking resources Friend-of-a-Friend ontology Roi’s homepage type example:roi foaf:Person name “Roi Blanco” sameAs Yahoo type worksWith example:roi2 example:peter email “pmika@yahoo-inc.com”
  • 20. Publishing RDF Linked Data Data published as RDF documents linked to other RDF documents Community effort to re-publish large public datasets (e.g. DBpedia, open government data) RDFa Data embedded inside HTML pages Recommended for site owners by Yahoo, Google, Facebook SPARQL endpoints Triple stores (RDF databases) that can be queried through the web
  • 21. Linked Data A web of RDF documents in parallel to the current Web often implemented as wrappers around databases or APIs The four rules of Linked Data: Use URIs to identify things. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents. Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
  • 22. Linked Data Advantages: No change to the publishing of the HTML documents Data can be published by third party (e.g. DBpedia) Disadvantages: Web servers need to be configured to properly handle URIs that identify concepts instead of documents Not favored by search engines Lack of use cases Crawling needs to be changed Authority is difficult to determine Tools Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby) RDB-to-RDF mappers (e.g. D2RQ, Triplify) Validators (Vapour) Linked Data browsers (many)
  • 23. The state of Linked Data Rapidly growing community effort to (re)publish open datasets as Linked Data In particular, scientific and government datasets see linkeddata.org Less commercial interest, real usage
  • 24. Metadata in HTML 1995: HTML meta tags 1996: Simple HTML Ontology Extensions (SHOE) 1998: RDF/XML RDF/XML in HTML RDF linked from HTML 2003: Web 2.0 Tagging Microformats Metadata in Wikipedia Machine tags in Flickr 2005: eRDF 2008: RDFa 1.0 2011: RDFa 1.1 2012: Microdata?
  • 25. HTML meta tags <HTML> <HEAD profile="http://dublincore.org/documents/dcq-html/"> <META name="DC.author" content="Peter Mika"> <LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" /> <LINK rel="meta" type="application/rdf+xml" title="FOAF" href= "http://www.cs.vu.nl/~pmika/foaf.rdf"> </HEAD> … </HTML>
  • 26. Microformats (μf) Agreements on the way to encode certain kinds of metadata in HTML Reuse of semantic-bearing HTML elements Based on existing standards Minimality Microformats exist for a limited set of objects hCard (persons and organizations) hCalendar (events) hResume hProduct hRecipe Varying degrees of support and stability hCard and rel-tag are widely supported Community centered around microformats.org Specifications and discussions are hosted there
  • 27. Microformats: limitations No shared syntax Each microformat has a separate syntax tailored to the vocabulary No formal schemas Limited reuse, extensibility of schemas Unclear which combinations are allowed No datatypes No namespaces or unique identifiers (URIs): no interlinking, so mapping between instances is required Always appears in the HTML <body>
  • 28. Example: the hCard microformat <div class="vcard"> <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a> <div class="tel">+1-919-555-7878</div> <div class="title">Area Administrator, Assistant</div> </div> <cite class="vcard"> <a class="fn url" rel="friend colleague met" href="http://meyerweb.com/"> Eric Meyer</a> </cite> wrote a post (<cite> <a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/"> Tax Relief</a></cite>) about an unintentionally humorous letter he received from the <span class="vcard"> <a class="fn org url" href="http://irs.gov/"> Internal Revenue Service</a> </span>.
  • 29. RDFa W3C standard for embedding RDF data in HTML documents A set of new HTML attributes to be used in head or body A specification of how to extract the data from these attributes RDFa is just a syntax, you have to choose a vocabulary separately RDFa 1.0 is a W3C Recommendation since October, 2008 RDFa Primer RDFa 1.1 is a small update on RDFa to make it easier to use Currently Working Draft (March 31, 2011) Updated version of the RDFa Primer (April 19, 2011) RDFa API for accessing RDFa data in a webpage in the browser from JavaScript Currently Working Draft (April 19, 2011)
  • 30. RDFa 1.1 Changes New vocab attribute to define the default namespace for the document or subtree Profile documents to define multiple namespace prefixes The prefix attribute as a recommended replacement of xmlns You can use URIs even where only CURIEs were allowed before RDFa 1.1 is backward compatible with RDFa 1.0 RDFa 1.1 is recommended if you want to use HTML5
  • 31. When to use RDFa Choose microformats when you find a microformat that fits your needs and is supported by your consumers Microformats are the first option because they are simple Yahoo supports all major microformats, see the documentation It’s a common misconception that RDFa requires XHTML or that it’s incompatible with HTML5 It’s compatible with HTML4, HTML5, XHTML If you find none that perfectly fits your needs then you need RDFa Microformats have a fixed schema: you cannot add your own attributes Example: a social networking site with user profiles VCard is a good candidate, but for example it doesn’t have a way to express the user’s social connections You either live without this, or go with RDFa
  • 32. Example: Facebook’s Like and the Open Graph Protocol The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities Shows up in profiles and news feed Site owners can later reach users who have liked an object Facebook Graph API allows 3rd party developers to access the data Open Graph Protocol is an RDFa-based format that allows publishers to describe the object that the user ‘Likes’
  • 33. Example: Facebook’s Open Graph Protocol RDF vocabulary to be used in conjunction with RDFa Simplify the work of developers by restricting the freedom in RDFa Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment Only HTML <head> accepted <html xmlns:og="http://opengraphprotocol.org/schema/"> <head> <title>The Rock (1996)</title> <meta property="og:title" content="The Rock" /> <meta property="og:type" content="movie" /> <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" /> <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> … </head> ...
  • 34. Example: Yahoo! Enhanced Results (was: SearchMonkey) Guide for publishers to mark-up their pages for common types of objects Product, Local, News, Video, Events, Documents, Discussion, Games Using popular microformats and RDF vocabularies Copy-paste code Validator Yahoo as a consumer See later
  • 35. Example: Google’s Rich Snippets Google accepts popular microformats and its own RDFa vocabulary Similar approach to RDFa as Facebook Validator to check if the markup is correct Google displays enhanced results based on this metadata Rich Snippets
  • 36. RDFa on the rise 510% increase between March, 2009 and October, 2010 Percentage of URLs with embedded metadata in various formats
  • 37. Microdata Currently under standardization at the W3C Originally part of the HTML5 spec, but now a separate document Similar to microformats, but with the extensibility of RDFa Introduce new terms using reverse domain names or full URIs HTML5 also has a number of “semantic” elements such as <time>, <video>, <article>…
  • 38. Microdata example <div itemscope itemid="http://www.yahoo.com/resource/person"> <p>My name is <span itemprop="name">Neil</span>.</p> <p>My band is called <span itemprop="band">Four Parts Water</span>. I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>. <img itemprop="image" src="me.png" alt="me"> </p> </div>
  • 39. The state of metadata in HTML 5-10% of webpages contain some explicit metadata Depending on how you count… Too many competing approaches Too many formats: microformats vs RDFa vs Microdata When using RDFa, publishers may need to use multiple different vocabularies to satisfy everyone
  • 40. Crawling the Semantic Web Linked Data Similar to HTML crawling, but the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled Semantic Sitemap/VoID descriptions RDFa Same as HTML crawling, but data is extracted after crawling Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010. SPARQL endpoints Endpoints are not linked, need to be discovered by other means Semantic Sitemap/VoID descriptions
  • 41. Data fusion Ontology matching Widely studied in Semantic Web research, see e.g. list of publications at ontologymatching.org Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies Entity resolution Logic-based approaches in the Semantic Web Studied as record linkage in the database literature Machine learning based approaches, focusing on attributes Graph-based approaches, see e.g. the work of Lise Getoor, are applicable to RDF data Improvements over only attribute based matching Blending Merging objects that represent the same real world entity and reconciling information from multiple sources
  • 42. Data quality assessment and curation Heterogeneity, quality of data is an even larger issue Quality ranges from well-curated data sets (e.g. Freebase) to microformats In the worst of cases, the data becomes a graph of words Short amounts of text: prone to mistakes in data entry or extraction Example: mistake in a phone number or state code Quality assessment and data curation Quality varies from data created by experts to user-generated content Automated data validation Against known-good data or using triangulation Validation against the ontology or using probabilistic models Data validation by trained professionals or crowdsourcing Sampling data for evaluation Curation based on user feedback
  • 43. Indexing Search requires matching and ranking Matching selects a subset of the elements to be scored The goal of indexing is to speed up matching Retrieval needs to be performed in milliseconds Without an index, retrieval would require streaming through the collection The type of index depends on the query model to support DB-style indexing IR-style indexing
  • 44. IR-style indexing Index data as text Create virtual documents from data One virtual document per subgraph, resource or triple typically: resource Key differences to Text Retrieval RDF data is structured Minimally, queries on property values are required
  • 45. Horizontal index structure Two fields (indices): one for terms, one for properties For each term, store the property on the same position in the property index Positions are required even without phrase queries Query engine needs to support the alignment operator ✓ Dictionary is number of unique terms + number of properties Occurrences is number of tokens * 2
  • 46. Vertical index structure One field (index) per property Positions are not required But useful for phrase queries Query engine needs to support fields Dictionary is number of unique terms Occurrences is number of tokens ✗ Number of fields is a problem for merging, query performance
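The difference between the horizontal and vertical layouts can be sketched on a toy collection. This is a minimal in-memory illustration, not the indexing code behind the slides: the resources, the property `ex:employer`, and all function names are hypothetical, and real engines store compressed posting lists rather than Python sets.

```python
from collections import defaultdict

# Hypothetical virtual documents: resource -> list of (property, literal value).
DOCS = {
    "ex:roi": [("foaf:name", "Roi Blanco"), ("ex:employer", "Yahoo")],
    "ex:peter": [("foaf:name", "Peter Mika")],
}


def build_horizontal(docs):
    """Two aligned fields: position i of 'token' pairs with position i of 'property'."""
    index = {"token": defaultdict(list), "property": defaultdict(list)}
    for doc_id, pairs in docs.items():
        pos = 0
        for prop, value in pairs:
            for term in value.lower().split():
                # Every token produces two occurrences (hence tokens * 2).
                index["token"][term].append((doc_id, pos))
                index["property"][prop].append((doc_id, pos))
                pos += 1
    return index


def build_vertical(docs):
    """One field per property; positions are not needed for plain term queries."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, pairs in docs.items():
        for prop, value in pairs:
            for term in value.lower().split():
                index[prop][term].add(doc_id)
    return index


def match_horizontal(index, prop, term):
    """Alignment operator: the term and the property must share a position."""
    aligned = set(index["token"].get(term, [])) & set(index["property"].get(prop, []))
    return {doc_id for doc_id, _ in aligned}


def match_vertical(index, prop, term):
    """Field-restricted lookup: consult only the field of the given property."""
    return set(index.get(prop, {}).get(term, set()))


horizontal = build_horizontal(DOCS)
vertical = build_vertical(DOCS)
print(match_horizontal(horizontal, "foaf:name", "roi"))
print(match_vertical(vertical, "foaf:name", "peter"))
```

The horizontal variant keeps the number of fields fixed at two but needs positions for every query; the vertical variant answers property-restricted queries directly but grows one field per property.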
  • 47. Indexing using MapReduce MapReduce is the perfect model for building inverted indices Map creates (term, {doc1}) pairs Reduce collects all docs for the same term: (term, {doc1, doc2…}) Sub-indices are merged separately Term-partitioned indices Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
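The map and reduce steps described above can be sketched in a single process. This is only an illustration of the data flow, assuming whitespace tokenization; it does not use a MapReduce framework, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per token."""
    return [(term, doc_id) for term in text.lower().split()]


def reduce_phase(pairs):
    """Reduce: collect all documents for the same term into a sorted posting list."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


docs = {"doc1": "semantic search", "doc2": "semantic web data"}
pairs = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
index = reduce_phase(pairs)
print(index["semantic"])
```

In a distributed setting the framework groups pairs by term between the two phases, so each reducer builds the posting lists for its own partition of the vocabulary.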
  • 49. Structure Taxonomy of retrieval approaches Query processing for semantic search Types of semantic data Formalisms for querying semantic data Approaches General task: hybrid graph pattern matching Matching keyword query against text Matching structured query against structured data Matching keyword query against structured data Matching structured query against text (a hybrid case) Main tasks, challenges and opportunities
  • 50. Taxonomy of retrieval approaches (1) Data retrieval problem A collection of resources represented by the data model G Information needs expressed as queries in Q Retrieval is the task of efficiently computing results from G that are relevant to the queries in Q Document retrieval vs. data retrieval Differences in query and data representation and matching Efficiently retrieve structured data that exactly match formal information needs expressed as structured queries Effectively rank textual results that match ambiguous NL / keyword queries to a certain degree (notions of relevance)
  • 56. Top-k Data Query processing mainly focuses on the efficiency of matching, whereas ranking deals with the degree of matching (relevance)!
  • 57. Query processing for Semantic Search (1) The underlying collection of resources is represented by semantic data G ranging from Structured data with well defined schemas Semi-structured data with incomplete or no schemas Data that largely comprise text Hybrid / embedded data Targeted information needs Q are of varying complexity, captured using different formalisms and querying paradigms Natural language texts and keywords Form-based inputs Formal structured queries Semantic search, mainly, is the task of efficiently computing results (query processing) from G that are relevant to the queries in Q (ranking)
  • 58. Query processing for Semantic Search (2) Keywords NL Questions Form- / facet-based Inputs Structured Queries (SPARQL) Query Matching Data OWL ontologies with rich, formal semantics Structured RDF data Semi-Structured RDF data RDF data embedded in text (RDFa)
  • 59. Query processing for Semantic Search (3) Textual Data Keyword query on textual data (IR/document retrieval) Structured query on textual data Semantic Search targets different groups of users, information needs, and types of data. Query processing for semantic search is a hybrid combination of techniques! Unstructured Query Structured Query Keyword query on structured data Structured query on structured data (DB/data retrieval) Structured Data
  • 60. Types of data models (1) Textual Bag-of-words Represent documents, text in structured data,…, real-world objects (captured as structured data) Lacks “structure” in text, e.g. linguistic structure, hyperlinks, (positional information) Structure in structured data representation term (statistics) In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum. combination Cloud Computing Technologies solutions management `big data' industry solutions support complex ……
  • 61. Types of data models (2) Textual Structured Resource Description Framework (RDF) Represent real-world objects, services, applications, …. documents Resource attribute values and relationships between resources Schema Picture creator Person Bob
  • 62. Types of data models (3) Textual Structured Hybrid RDF data embedded in text (RDFa)
  • 63. Types of data models – RDFa (1) … <div about="/alice/posts/trouble_with_bob"> <h2 property="dc:title">The trouble with Bob</h2> <h3 property="dc:creator">Alice</h3> Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: <div about="http://example.com/bob/photos/sunset.jpg"> <img src="http://example.com/bob/photos/sunset.jpg" /> <span property="dc:title">Beautiful Sunset</span> by <span property="dc:creator">Bob</span>. </div> </div> … adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
  • 64. Types of semantic data – RDFa (2) Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: content content adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
  • 65. Types of semantic data - conclusion Semantic data in general can be conceived as a graph with text and structured data items as nodes, and edges represent different types of relationships including explicit semantic relationships and vaguely specified ones such as hyperlinks!
  • 66. Formalisms for querying semantic data (1) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured queries Fully-structured queries Hybrid queries: unstructured + structured
  • 67. Formalisms for querying semantic data (2) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured NL Keywords apartment Berlin Alice shared
  • 68. Formalisms for querying semantic data (3) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured Fully-structured SPARQL: BGP, filter, optional, union, select, construct, ask, describe PREFIX ns: <http://example.org/ns#> SELECT ?x WHERE { ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" }
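To make the basic graph pattern (BGP) in the query above concrete, here is a naive matcher that evaluates it over a handful of triples. It is a sketch of the semantics only, assuming triples are plain string tuples and variables start with `?`; the sample data and the function name `match_bgp` are hypothetical, and a real SPARQL engine would rely on indexes and join ordering instead of backtracking over every triple.

```python
def match_bgp(patterns, triples, binding=None):
    """Yield variable bindings that satisfy all triple patterns (naive backtracking)."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for p_term, t_term in zip(first, triple):
            if p_term.startswith("?"):
                # Bind the variable, or check consistency with an earlier binding.
                if candidate.setdefault(p_term, t_term) != t_term:
                    ok = False
                    break
            elif p_term != t_term:
                ok = False
                break
        if ok:
            yield from match_bgp(rest, triples, candidate)


# Hypothetical data loosely following the running example.
TRIPLES = [
    ("ex:bob", "ns:knows", "ex:alice"),
    ("ex:alice", "ns:name", "Alice"),
    ("ex:bob", "ns:knows", "ex:thanh"),
    ("ex:thanh", "ns:works", "ex:kit"),
    ("ex:kit", "ns:name", "KIT"),
    ("ex:peter", "ns:knows", "ex:alice"),
]

QUERY = [
    ("?x", "ns:knows", "?y"),
    ("?y", "ns:name", "Alice"),
    ("?x", "ns:knows", "?z"),
    ("?z", "ns:works", "?v"),
    ("?v", "ns:name", "KIT"),
]

results = {b["?x"] for b in match_bgp(QUERY, TRIPLES)}
print(results)
```

Only `ex:bob` satisfies every pattern here: `ex:peter` knows Alice but knows nobody who works anywhere.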
  • 69. Formalisms for querying semantic data (4) Fully-structured Unstructured Hybrid: content and structure constraints “shared apartment Berlin Alice” ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
  • 70. Formalisms for querying semantic data (5) Fully-structured Unstructured Hybrid: content and structure constraints “shared apartment Berlin Alice” ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
  • 71. Formalisms for querying semantic data - conclusion Semantic search queries can be conceived as graph patterns with nodes referring to text and structured data items, and edges referring to relationships between these items!
  • 72. Processing hybrid graph patterns (1) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" apartment shared Berlin Alice ?y ns:name "Alice". ?x ns:knows ?y age works 34 trouble with bob FluidOps Peter sunset.jpg Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: Beautiful Sunset author title Semantic Search Germany Alice creator author creator knows year Germany 2009 Bob Thanh knows located works KIT
  • 73. Processing hybrid graph patterns (2) Matching hybrid graph patterns against data
  • 78. Multiple “redundant” indexes to cover different access patterns
  • 80. Blocking, e.g. linear merge join (requires sorted input)
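A linear merge join can be sketched as follows. It assumes both inputs are lists of (key, value) pairs already sorted by key; producing that sorted order is what makes the operator blocking, since no result can be emitted before the inputs are fully sorted. The function name and the sample bindings are hypothetical.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are both sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right-hand entries sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Emit the cross product of the two runs with equal keys.
            while i < len(left) and left[i][0] == left_key:
                out.extend((left_key, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


# Bindings for "?z ns:works ?v" and "?v ns:name ?n", both keyed and sorted on ?v.
works = [("Ins1", "Per1"), ("Ins2", "Per2")]
names = [("Ins1", "KIT")]
print(merge_join(works, names))
```

Each input is scanned once, so the join itself is linear in the input sizes plus the output size.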
  • 82. Materialized join indexes ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" Per1 ns:works ?v ?v ns:name "KIT" SP-index PO-index = = = Per1 ns:works Ins1 Ins1 ns:name KIT Per1 ns:works Ins1 Ins1 ns:name KIT
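The SP-index and PO-index lookups in this example can be sketched with two dictionaries over the same triples. This is an in-memory illustration assuming string triples; the extra triple for `Per2` and the variable names are hypothetical, and the join is computed at query time here rather than materialized in advance.

```python
from collections import defaultdict

TRIPLES = [
    ("Per1", "ns:works", "Ins1"),
    ("Ins1", "ns:name", "KIT"),
    ("Per2", "ns:works", "Ins2"),  # hypothetical extra triple
]

# Redundant indexes over the same data, one per access pattern.
sp_index = defaultdict(list)  # (subject, predicate) -> objects
po_index = defaultdict(list)  # (predicate, object) -> subjects
for s, p, o in TRIPLES:
    sp_index[(s, p)].append(o)
    po_index[(p, o)].append(s)

# Per1 ns:works ?v   is answered by the SP-index.
works_at = set(sp_index[("Per1", "ns:works")])
# ?v ns:name "KIT"   is answered by the PO-index.
named_kit = set(po_index[("ns:name", "KIT")])

# Joining on ?v keeps the bindings that satisfy both patterns.
joined = works_at & named_kit
print(joined)
```

A materialized join index would store the outcome of such joins ahead of time, trading index size and update cost for faster lookups.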
  • 85. Data indexes for triple lookup
  • 87. Top-k Steiner tree search, top-k subgraph exploration Alice Bob Bob KIT Alice KIT ↔ ↔ = = Alice ns:knows Bob Inst1 ns:name KIT Bob ns:works Inst1
  • 89. Based on online IE, i.e., “retrieve “ is as follows
  • 90. Derive keywords to retrieve relevant documents
  • 91. On-the-fly information extraction, i.e., phrase pattern matching “X title Y”
  • 92. Retrieve extracted data for structured part
  • 93. Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.
  • 94. Index
  • 95. Inverted index for document retrieval and pattern matching
  • 96. Join index: inverted index for storing materialized joins between keywords
  • 97. Neighborhood indexes for phrase patterns ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" KIT name knows name KIT Hybrid case
  • 98. Query processing – main tasks Retrieval Documents, data elements, triples, paths, graphs Inverted index,…, but also other indexes (B+ tree) Index documents, triples, materialized join paths Join Different join implementations, efficiency depends on availability of indexes Non-blocking join good for early result reporting and for “unpredictable” linked data scenario Query Matching Data
  • 99. Query processing – more tasks Disjunction, aggregation, grouping Join order optimization Approximate Approximate the search space Approximate the results (matching, join) Parallelization Top-k Use only some entries in the input streams to produce k results Multiple sources Federation, routing On-the-fly mapping, similarity join Hybrid Join text and data Query Matching Data
  • 102. Hybrid querying and data management
  • 106. Structure Problem definition Types of ambiguities Ranking paradigms Model construction Content-based Structure-based
  • 108. Ambiguities at the level of
  • 110. structure between elements (structure ambiguity) Matching Data Due to ambiguities in the representation of the information needs and the underlying resources, the results cannot be guaranteed to exactly match the query. Ranking is the problem of determining the degree of matching using some notions of relevance.
  • 111. Content ambiguity ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" apartment shared Berlin Alice ?y ns:name "Alice". ?x ns:knows ?y age works 34 trouble with bob FluidOps Peter sunset.jpg Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: Beautiful Sunset author title Semantic Search Germany Alice creator author creator knows year Germany 2009 Bob Thanh knows located works KIT What is meant by “Berlin” in the query? What is meant by “Berlin” in the data? A city with the name Berlin? A person? What is meant by “KIT” in the query? What is meant by “KIT” in the data? A research group? A university? A location?
  • 112. Structure ambiguity ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" apartment shared Berlin Alice ?y ns:name "Alice". ?x ns:knows ?y age works 34 trouble with bob FluidOps Peter sunset.jpg Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: Beautiful Sunset author title Semantic Search Germany Alice creator author creator knows year Germany 2009 Bob Thanh knows located works KIT What is the connection between “Berlin” and “Alice”? Friend? Co-worker? What is meant by “works”? Works at? Employed?
  • 113. Ambiguity Recall: query processing is matching at the level of syntax and semantics Ambiguities arise when data or query allow for multiple interpretations, i.e. multiple matches Syntactic, e.g. works vs. works at Semantic, e.g. works vs. employ “Aboutness”, i.e., contain some elements which represent the correct interpretation Ambiguities arise when matching elements of different granularities Does i contain the interpretation for j, given that some part(s) of i (syntactically/semantically) match j? E.g. Berlin vs. “…we went to the same university, and also, we shared an apartment in Berlin in 2008…” Strictly speaking, ranking is performed after syntactic / semantic matching is done!
  • 114. Features: What to use to deal with ambiguities? What is meant by “Berlin”? What is the connection between “Berlin” and “Alice”? Content features Frequencies of terms: d is more likely to be “about” a query term k when d mentions k more often (probabilistic IR) Co-occurrences: terms K that often co-occur form a contextual interpretation, i.e., topics (cluster hypothesis) Structure features Consider relevance at level of fields Link-based popularity
  • 115. Ranking paradigms Explicit relevance model Foundation: probability ranking principle Ranking results by the posterior probability (odds) of being observed in the relevant class: P(w|R) varies in different approaches, e.g., binary independence model, 2-Poisson model, relevance model
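The formula shown on this slide did not survive in the transcript. One common way to write the odds-based ranking under a term-independence assumption is sketched below; it is a textbook form, not necessarily the exact expression on the slide. Here D is a result, R is the relevant class, and N (a symbol introduced for this sketch) is the non-relevant class.

```latex
\[
\frac{P(R \mid D)}{P(N \mid D)}
  \;=\; \frac{P(D \mid R)\,P(R)}{P(D \mid N)\,P(N)}
  \;\propto\; \frac{P(D \mid R)}{P(D \mid N)}
  \;\approx\; \prod_{w \in D} \frac{P(w \mid R)}{P(w \mid N)}
\]
```

The prior odds P(R)/P(N) are the same for every result of a given query, so dropping them does not change the ranking. The approaches named on the slide differ mainly in how P(w|R) is estimated.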
  • 116. Ranking paradigms No explicit notion of relevance: similarity between the query and the document model Vector space model (cosine similarity) Language models (KL divergence)
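As an illustration of the similarity-based paradigm, the sketch below scores a document against a query with cosine similarity in the vector space model. It assumes raw term counts as weights and whitespace tokenization, which are simplifications chosen for this sketch; practical systems typically use TF-IDF or similar weighting. The function name is hypothetical.

```python
import math
from collections import Counter


def cosine(query: str, doc: str) -> float:
    """Cosine similarity between raw term-count vectors of a query and a document."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


doc_a = "we shared an apartment in berlin in 2008"
doc_b = "kit is a university in karlsruhe"
print(cosine("apartment berlin", doc_a), cosine("apartment berlin", doc_b))
```

The first document shares two terms with the query and scores above zero; the second shares none and scores zero. A language-model approach would instead compare term distributions, for example with KL divergence.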
  • 117. Model construction How to obtain Relevance models? Weights for query / document terms? Language models for document / queries?
  • 119. When it contains a relatively high number of mentions of the term “Berlin”
  • 124. Search interface Input and output functionality helping the user to formulate complex queries presenting the results in an intelligent manner Semantic Search brings improvements in Query formulation Snippet generation Adaptive and interactive presentation Presentation adapts to the kind of query and results presented Object results can be actionable, e.g. buy this product Aggregated search Grouping similar items, summarizing results in various ways Filtering (facets), possibly across different dimensions Task completion Help the user to fulfill the task by placing the query in a task context
  • 125. Query interpretation “Snap-to-grid”: find the most likely interpretation of the query given the ontology or a summary of the data See Query Processing Display the system’s interpretation of the user query Offer one or more interpretations, possibly while the user is typing
  • 127. Example: TrueKnowledge Q: “How many people live in Shanghai?” I: What is the population of Shanghai (Shanghainese: Zånhae), the metropolis in eastern China and a direct-controlled municipality of the People's Republic of China? A: The population of Shanghai on November 7th 2010 is approximately 19,300,389. (Extrapolated from a population of 18,884,600 in 2008 and a population of 19,210,000 on June 6th 2010.)
  • 128. Snippet generation using metadata Yahoo displays enriched search results for pages that contain microformat or RDFa markup using recognized ontologies Displaying data, images, video Example: GoodRelations for products Enhanced results also appear for sites from which we extract information ourselves Also used for generating facets that can be used to restrict search results by object type Example: “Shopping sites” facet for products Documentation and validator for developers http://developer.search.yahoo.com Formerly: SearchMonkey allowed developers to customize the result presentation and create new ones for any object type
  • 129. Example: Yahoo! Enhanced Results Enhanced result with deep links, rating, address.
  • 130. Automated snippet summarization Generate search result snippets given a query and a search result Penin et al. Snippet Generation for Semantic Web Search Engines, ASWC 2010 Search results are ontologies
  • 131. Example: Facets in Yahoo! Search Click to restrict results to shopping sites
  • 132. Example: Yahoo! Vertical Intent Search Related actors and movies
  • 133. Adaptive presentation: semantic bookmarking Extract objects from pages tagged/bookmarked by a user Visualize the extracted objects Tabular display Sorting on attributes Map Tracking changes in data Alert me when the price drops below… Prototype: house search application Delicious profiles Extracting housing data from popular Spanish real-estate sites
  • 135. Interactive presentation: Time Explorer Deliverable of the LivingKnowledge European Project Not a Yahoo product http://fbmya01.barcelonamedia.org:8080/future/ Won the HCIR 2010 challenge Tool for understanding current news stories What are the events that led to a particular situation? What are the important entities for a given topic? (people, places, dates, etc.) What entities are important at a given time? How do their relationships change? What are the predictions made of a given topic?
  • 136. Interactive presentation: Time Explorer Technology Named Entity Recognition (persons, organizations) Temporal expression mining Inverted (sentence and document) index Forward index (archive) for retrieving relevant entities Ranking of both documents and relevant entities Display Two synchronized timelines showing relevant documents and the volume of documents Entity relationships Sentiments (future work)
  • 139. Semantic Search evaluation at SemSearch 2010/2011 Started at SemSearch 2010 Two tasks Entity Search Queries where the user is looking for a single real world object Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010. List search (new in 2011) Queries where the user is looking for a class of objects Billion Triples Challenge 2009 dataset Evaluated using Amazon’s Mechanical Turk See Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010 Prizes sponsored by Yahoo! Labs (2011)
  • 140. Entity Search Track Entity Search retrieval of data related to a single entity Queries Selected from the Search Query Tiny Sample v1.0 dataset, provided by the Yahoo! Webscope program Real web search queries sampled from the US query log of January, 2009 Queries asked by at least three different users and with long number sequences removed (privacy reasons) 50 selected queries that name an entity explicitly (but may also provide context) Last year: same type of queries, but a mix of Microsoft and Yahoo! logs
  • 141. List query track List queries Queries that describe a set of entities The answer is a closed set Relatively small number of possible answers The answer is not likely to change Hand-picked but not hand-written Yahoo! Search logs Queries from the Tiny Sample v1.0 dataset Queries with clicks on Wikipedia TrueKnowledge Recent queries
  • 142. Data set Same as Billion Triples Challenge 2009 data set Blank nodes are encoded as URIs A data set combining crawls of multiple semantic search engines doesn’t necessarily match the current state of the Web doesn’t necessarily match the coverage of any particular search engine Final dataset
  • 143. Collecting the results Submissions via semsearch.yahoo.com NTNU, IIIT Hyderabad, DERI, U of Delaware, Daiict max. 3 submissions per team per track Pooling of results Top 20 results are evaluated Despite validation, still problems e.g. N-Triples encoded URIs, lowercased URIs Collecting triples for each result All triples where the URI is the subject Discarded URIs that didn’t appear as subject Rendering result display Values are clipped at 300 chars (last # or / for object-properties) RDF built-ins shown first Preference to English language values
  • 145. Assessment with Amazon Mechanical Turk Evaluation using non-expert judges Paid $0.2 per 12 results Typically done in 1-2 minutes $6-$12 an hour Sponsored by the European SEALS project Each result is evaluated by 5 workers Workers are free to choose how many tasks they do Makes agreement difficult to compute Number of tasks completed per worker (2010)
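The hourly figure follows directly from the pay per task and the completion time. A quick check of the arithmetic, using only the numbers on the slide ($0.2 per task of 12 results, 1 to 2 minutes per task, 5 workers per result); the function names are hypothetical.

```python
def hourly_rate(pay_per_task: float, minutes_per_task: float) -> float:
    """Effective hourly earnings for a worker completing tasks back to back."""
    return pay_per_task * 60.0 / minutes_per_task


def cost_per_result(pay_per_task: float, results_per_task: int, workers_per_result: int) -> float:
    """Payment needed to have one result judged by the given number of workers."""
    return pay_per_task / results_per_task * workers_per_result


fast = hourly_rate(0.2, 1)  # one task per minute
slow = hourly_rate(0.2, 2)  # one task every two minutes
per_result = cost_per_result(0.2, 12, 5)
print(round(fast, 2), round(slow, 2), round(per_result, 4))
```

At one task per minute a worker earns about $12 an hour, at two minutes about $6, matching the range on the slide. Each judged result costs roughly 8 cents before any platform fees, which the slide does not mention.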
  • 148. Catching the bad guys Payment can be rejected for workers who try to game the system An explanation is commonly expected, though cheaters rarely complain We opted to mix control questions into the real results Gold-win cases that are known to be perfect Gold-lose cases that are known to be bad Metrics Avg. and std. dev on gold-win and gold-lose results Time to complete
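The control-question metrics can be sketched as a per-worker profile. This is an illustration only: the 1 to 3 rating scale, the `min_gap` threshold, and the function names are assumptions made for this sketch, not the rule the organizers applied.

```python
from statistics import mean, pstdev


def worker_profile(gold_win_scores, gold_lose_scores):
    """Average and standard deviation of a worker's ratings on the control items."""
    return {
        "win_avg": mean(gold_win_scores),
        "win_std": pstdev(gold_win_scores),
        "lose_avg": mean(gold_lose_scores),
        "lose_std": pstdev(gold_lose_scores),
    }


def looks_suspicious(profile, min_gap=1.0):
    """Hypothetical rule: flag workers who barely separate known-good from known-bad items."""
    return profile["win_avg"] - profile["lose_avg"] < min_gap


# Ratings on an assumed 1 (bad) to 3 (perfect) scale.
careful = worker_profile([3, 3, 2], [1, 1, 1])
careless = worker_profile([2, 2, 2], [2, 2, 2])
print(looks_suspicious(careful), looks_suspicious(careless))
```

A worker who gives the same rating to everything shows no gap between the two gold sets and zero deviation, which is the pattern the control questions are meant to expose. Completion time, the other metric on the slide, could be added to the profile in the same way.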
  • 149. Lessons learned Finding complex queries is not easy Query logs from web search engines contain simple queries Semantic Web search engines are not used by the general public Computing agreement is difficult Each judge evaluates a different number of items Result rendering is critically important The Semantic Web is not necessarily all DBpedia (DBpedia: sub30-RES.3 40.6%, sub31-run3 93.8%) Follow up experiments validated the Mechanical Turk approach Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR 2011
  • 150. Next steps Achieve repeatability Simplify our process and publish our tools Automate as much as possible… except the Turks ;) Web site for evaluation Continuous submission? Positioning compared to other evaluation campaigns TREC Entity Track Question Answering over Linked Data SEALS campaigns Join the discussion at semsearcheval@yahoogroups.com
  • 151. Demos

Editor's notes

  1. Search is a form of content aggregation
  2. - In recent years we have witnessed tremendous interest and substantial economic exploitation of search technologies, both at web and enterprise scale. In this regard, technologies for Information Retrieval (IR) can be distinguished from solutions in the field of Database (DB) and Knowledge-based Expert Systems (KB). Whereas IR applications support the retrieval of documents (document retrieval), DB and KB systems deliver more precise answers (data retrieval). The technical differences between these two paradigms can be broken down into three main dimensions, i.e. the representation of the user need (query model), the underlying resources (data model) and the matching technique. The representation of user need and resource content in current IR systems is still almost exclusively achieved by the lightweight syntax-centric models such as the predominant keyword paradigm (i.e. keyword queries matched against bag-of-words document representation). While these systems have shown to work well for topical search, i.e. retrieve documents based on a topic, they usually fail to address more complex information needs. Using more expressive models for the representation of the user need, and leveraging the structure and semantics inherent in the data, DB and KB systems allow for complex queries, and for the retrieval of concrete answers that precisely match them.
  3. Semantic search can be seen as a retrieval paradigm Centered on the use of semanticsIncorporates the semantics entailed by the query and (or) the resources into the matching process, it essentially performs semantic search.
  4. Another trend resulting from this convergence of textual, structured and semantic data is the need for hybrid semantic search systems. Whilestandard IR focuses on the retrieval of documents, the DB and KB systems are built for data retrieval. As opposed to these types of systems, hybrid search systems manage the different types of data in a holistic way, and support the retrieval of answers that are integrated units of information, possibly assembled from different types of data. In such a system, there is not only a convergence at the
  5. Close to the topic of keyword search in databases, except that knowledge bases have a schema-oblivious design. Different papers assume vastly different query needs, even on the same type of data.
  6. Intro Session: motivation, overview etc.: 10 min; 1) Representation of the Search Space (M) 45; 2) Offline Preprocessing: Crawling and Indexing (P) 60; 3) Query Processing (T), 4) Matching (T) 90; 5) Ranking 60; 6) Result Presentation (45); 7) Evaluation (45); Demo Session (30); Wrap-up Session (10)
  7. Approximate many results  need rankingRanking also needed in the case where qp is complete and sound, but queries and data representation so imprecise such that we have to deal with too many results
  8. Misses structural information in texts: hyperlinks, linguistic structure, positional information.
  9. - Real world objects
  10. SELECT: returns all, or a subset of, the variables bound in a query pattern match. CONSTRUCT: returns an RDF graph constructed by substituting variables in a set of triple templates. ASK: returns a boolean indicating whether a query pattern matches or not. DESCRIBE: returns an RDF graph that describes the resources found. Graph patterns are defined recursively. A graph pattern may have zero or more optional graph patterns, and any part of a query pattern may have an optional part. In this example, there are two optional graph patterns. Section 6 introduces the ability to make portions of a query optional; Section 7 introduces the ability to express the disjunction of alternative graph patterns; and Section 8 introduces the ability to constrain portions of a query to particular source graphs. Section 8 also presents SPARQL's mechanism for defining the source graphs for a query. Basic graph patterns allow applications to make queries where the entire query pattern must match for there to be a solution. For every solution of a query containing only group graph patterns with at least one basic graph pattern, every variable is bound to an RDF Term in a solution. However, regular, complete structures cannot be assumed in all RDF graphs. It is useful to be able to have queries that allow information to be added to the solution where the information is available, but do not reject the solution because some part of the query pattern does not match. Optional matching provides this facility: if the optional part does not match, it creates no bindings but does not eliminate the solution. The UNION pattern combines graph patterns; each alternative possibility can contain more than one triple pattern. SPARQL provides a means of combining graph patterns so that one of several alternative graph patterns may match. If more than one of the alternatives matches, all the possible pattern solutions are found.
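The difference between a basic graph pattern, OPTIONAL and UNION can be illustrated with a small in-memory matcher. This is only a sketch of the semantics described above, not a SPARQL engine: the triples, the `?`-prefixed variable convention and the helper names (`match_bgp`, `optional`, `union`) are all assumptions made for illustration.

```python
# Toy RDF graph as a set of (subject, predicate, object) triples; the data is made up.
TRIPLES = {
    ("alice", "name", "Alice"),
    ("alice", "mbox", "alice@example.org"),
    ("bob", "name", "Bob"),
}

def is_var(term):
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")

def match_pattern(pattern, binding):
    """Yield every extension of `binding` under which `pattern` matches a triple."""
    for triple in TRIPLES:
        new = dict(binding)
        ok = True
        for p, t in zip(pattern, triple):
            if is_var(p):
                # Bind the variable, or check consistency with an earlier binding.
                if new.setdefault(p, t) != t:
                    ok = False
                    break
            elif p != t:
                ok = False
                break
        if ok:
            yield new

def match_bgp(patterns, binding=None):
    """Basic graph pattern: every triple pattern must match."""
    solutions = [binding or {}]
    for pattern in patterns:
        solutions = [b2 for b in solutions for b2 in match_pattern(pattern, b)]
    return solutions

def optional(solutions, patterns):
    """OPTIONAL: extend a solution where possible, otherwise keep it unchanged."""
    out = []
    for b in solutions:
        extended = match_bgp(patterns, b)
        out.extend(extended if extended else [b])
    return out

def union(left, right):
    """UNION: solutions of either alternative are all kept."""
    return left + right

# Roughly: SELECT * WHERE { ?x name ?name . OPTIONAL { ?x mbox ?mbox } }
required = match_bgp([("?x", "name", "?name")])
results = optional(required, [("?x", "mbox", "?mbox")])

# Roughly: SELECT * WHERE { { ?x mbox ?v } UNION { ?x name ?v } }
either = union(match_bgp([("?x", "mbox", "?v")]), match_bgp([("?x", "name", "?v")]))
```

Here the solution for `bob` survives the OPTIONAL part without a `?mbox` binding, whereas adding the `mbox` pattern to the basic graph pattern would have eliminated it.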
  11. Web data: text + Linked Data + semi-structured RDF + hybrid data that can be conceived as forming data graphs. We hear about Bob and Alice all the time (in the computer science literature) and want to find out more… so we build a Semantic Web search engine. To address complex information needs by exploiting Web data: - Query as a set of constraints; match structured data; match text.
  12. - Less than 5 percent of IR papers deal with query processing and the aspect of efficiency
  13. Partitioning has an impact on performance! Blocking: iterator-based approaches. Non-blocking: good for streaming, good when we cannot wait for some parts of the results to be completely worked off. Linked Data: we cannot wait for sources (some are slower than others), so it is better to push data into query processing as it arrives instead of pulling data and waiting (busy waiting). Top-k:
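One well-known way to realize the non-blocking, push-based processing described above is a symmetric hash join. The notes do not name a specific operator, so the choice of operator, the function name and the toy rows below are assumptions; the sketch only shows how results can be emitted as soon as matching rows from both sources have arrived.

```python
from collections import defaultdict

def symmetric_hash_join(stream, key_left, key_right):
    """Non-blocking join over an interleaved stream of ("L", row) / ("R", row) items.

    Each arriving row is stored in its own side's hash table and immediately
    probed against the other side's table, so a result is produced as soon as
    both matching rows are available, regardless of arrival order.
    """
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    for side, row in stream:
        if side == "L":
            key = row[key_left]
            left_table[key].append(row)
            for other in right_table.get(key, []):
                yield {**row, **other}
        else:
            key = row[key_right]
            right_table[key].append(row)
            for other in left_table.get(key, []):
                yield {**other, **row}

# Rows arrive in whatever order the (hypothetical) sources deliver them.
arrivals = [
    ("L", {"person": "bob", "city": "berlin"}),
    ("R", {"city": "paris", "country": "france"}),
    ("R", {"city": "berlin", "country": "germany"}),
    ("L", {"person": "alice", "city": "paris"}),
]
joined = list(symmetric_hash_join(iter(arrivals), "city", "city"))
```

The price of not blocking is memory: both hash tables grow with the input, which is one reason partitioning matters for performance.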
  14. - Phrase patterns (e.g., “X is the capital of Y”) for large-scale extraction. Such simple patterns, when coupled with the richness and redundancy of the Web, can be very useful in scraping millions or even billions of facts from the Web. - Patterns: matched when keywords or data types Xi appear in sequence; matched if all keywords/data types/patterns appear within an m-words window. For extraction: relation patterns. For text search: entity patterns. - When not assuming that all relevant data can be extracted, such matching against text is still needed: hybrid search.
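The two kinds of pattern matching mentioned above can be sketched in a few lines. The regular expression, the function names and the sample sentences are assumptions chosen for illustration; real extraction systems use far more robust entity recognition than a capitalized-word heuristic.

```python
import re

# Sequence pattern "X is the capital of Y", with X and Y approximated
# as single capitalized words (a deliberately naive assumption).
CAPITAL_PATTERN = re.compile(r"\b([A-Z][a-z]+) is the capital of ([A-Z][a-z]+)\b")

def extract_capitals(text):
    """Return (capital, country) pairs found by the phrase pattern."""
    return CAPITAL_PATTERN.findall(text)

def within_window(text, keywords, m):
    """True if all keywords occur within some window of m consecutive words."""
    words = re.findall(r"\w+", text.lower())
    wanted = {k.lower() for k in keywords}
    for start in range(max(1, len(words) - m + 1)):
        if wanted <= set(words[start:start + m]):
            return True
    return False

facts = extract_capitals("Berlin is the capital of Germany. Paris is the capital of France.")
near = within_window("Cheap apartment for rent in central Berlin", ["apartment", "Berlin"], 6)
far = within_window("Cheap apartment for rent in central Berlin", ["apartment", "Berlin"], 3)
```

The first function corresponds to a relation pattern used for extraction, the second to window-based matching as used for text search.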
  15. Given some materialized indexes → no joins at all. Given sorted inputs → sort-merge join.
  16. Given some materialized indexes → no joins at all. Given sorted inputs → sort-merge join.
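When both inputs are already sorted on the join key, a sort-merge join can combine them in a single pass. A minimal sketch, assuming rows are dictionaries and both lists are sorted by `key`; the function name and the sample rows are illustrative.

```python
def sort_merge_join(left, right, key):
    """Join two lists of dicts that are already sorted by `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side ...
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # ... and pair it with every left row that has the same key.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out

works = [
    {"s": "alice", "employer": "kit"},
    {"s": "bob", "employer": "yahoo"},
    {"s": "bob", "employer": "aifb"},
]
names = [
    {"s": "bob", "name": "Bob"},
    {"s": "carol", "name": "Carol"},
]
rows = sort_merge_join(works, names, "s")
```

Neither input needs to be buffered in a hash table, which is why sorted index scans make this join attractive.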
  17. Every task is a challenge in itself, some more and some less well elaborated. There are separate challenges for every problem.
  21. Syntactic: works vs. works at. Semantic: works vs. employ.
  22. A text which contains a large number of mentions of “Berlin” is likely to be about “Berlin”. Interpretation i is more likely to be the correct interpretation of K when the terms in K co-occur in a large number of contexts (bags of words) associated with i. “Berlin” and “apartment” co-occur more often in the geographic location context/topic than in the context of people.
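The co-occurrence heuristic above can be sketched as a simple count: for each candidate interpretation, count the associated contexts in which all query terms appear together. The contexts, interpretation labels and function names below are made up for illustration, and a raw count is only one possible scoring function.

```python
def cooccurrence_score(keywords, contexts):
    """Number of contexts (bags of words) in which all keywords co-occur."""
    wanted = {k.lower() for k in keywords}
    return sum(1 for bag in contexts if wanted <= bag)

def best_interpretation(keywords, contexts_by_interpretation):
    """Pick the interpretation whose contexts most often contain all keywords."""
    return max(
        contexts_by_interpretation,
        key=lambda i: cooccurrence_score(keywords, contexts_by_interpretation[i]),
    )

# Toy contexts per interpretation; purely illustrative data.
CONTEXTS = {
    "Berlin (city)": [
        {"berlin", "apartment", "rent"},
        {"berlin", "apartment", "district"},
        {"berlin", "museum"},
    ],
    "Berlin (person)": [
        {"berlin", "composer", "song"},
        {"berlin", "apartment", "biography"},
    ],
}

query = ["Berlin", "apartment"]
city_score = cooccurrence_score(query, CONTEXTS["Berlin (city)"])
person_score = cooccurrence_score(query, CONTEXTS["Berlin (person)"])
winner = best_interpretation(query, CONTEXTS)
```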