1. Semantic Search Tutorial
Introduction
Peter Mika|Yahoo! Research, Spain
pmika@yahoo-inc.com
Thanh Tran | Institute AIFB, KIT, Germany
Tran@aifb.uni-karlsruhe.de
2. About the speakers
Peter Mika
Senior Research Scientist
Head of Semantic Search group at Yahoo! Research in Barcelona
Semantic Search, Web Object Retrieval, Natural Language Processing
Tran Duc Thanh
Assistant Professor at AIFB, Karlsruhe Institute ofTechnology
Head of Semantic Search group
Semantic Search, Semantic / Linked Data Management
3. Agenda
Introduction (10 min)
Semantic Web data (50 min)
The RDF data model
Publishing RDF
Crawling and indexing RDF data
Query processing (40 min)
Ranking (30 min)
Result presentation (15 min)
Semantic Search evaluation (15 min)
Questions (5 min)
4. Why Semantic Search? I.
“We are at the beginning of search.“ (Marissa Mayer)
Solved large classes of queries, e.g. navigational
Heavy investment in computational power
Remaining queries are hard, not solvable by brute force, and require a
deep understanding of the world and human cognition
Background knowledge and metadata can help to address poorly
solved queries
5. Poorly solved information needs
Ambiguous searches Many of these queries would not
paris hilton be asked by users, who learned
Long tail queries over time what search technology
can and can not do.
george bush (and I mean the beer brewer in Arizona)
Multimedia search
paris hilton sexy
Imprecise or overly precise searches
jim hendler
pictures of strong adventures people
Precise searches for descriptions
countries in africa
32 year old computer scientist living in barcelona
reliable digital camera under 300 dollars
7. Why Semantic Search? II.
The Semantic Web is now a reality
Large amounts of data published in RDF
Heterogeneous data of varying quality
Users who are not skilled in writing complex queries (e.g. SPARQL)
and may not be experts in the domain
Searching data instead or in addition to searching documents
Direct answers
Novel search tasks
8. Example: direct answers in search
Points of Faceted search
Information
interest in
from the for Shopping Information box with
Vienna, Austria results
Knowledge content from and links
Graph to Yahoo! Travel
Since Aug, 2010,
‘regular’ search
results are
‘Powered by Bing’
9. Novel search tasks
Aggregation of search results
e.g. price comparison across websites
Analysis and prediction
e.g. world temperature by 2020
Semantic profiling
recommendations based on particular interests
Semantic log analysis
understanding user behavior in terms of objects
Support for complex tasks
e.g. booking a vacation using a combination of services
10. Contextual (pervasive, ambient) search
Yahoo! Connected TV:
Widget engine
embedded into the TV
Yahoo! IntoNow: recognize
audio and show related content
12. Document retrieval and data retrieval
Information Retrieval (IR) support the retrieval of documents
(document retrieval)
Representation based on lightweight syntax-centric models
Work well for topical search
Not so well for more complex information needs
Web scale
Database (DB) and Knowledge-based Systems (KB) deliver more precise
answers (data retrieval)
More expressive models
Allow for complex queries
Retrieve concrete answers that precisely match queries
Not just matching and filtering, but also joins
Limitations in scalability
13. Combination of document and data retrieval
Documents with metadata
Metadata may be embedded inside the document
I’m looking for documents that mention countries in Africa.
Data retrieval
Structured data, but searchable text fields
I’m looking for directors, who have directed movies where the synopsis
mentions dinosaurs.
14. Semantic Search
Target (combination of) document and data retrieval
Semantic search is a retrieval paradigm that
Exploits the structure/semantics of the data or explicit background
knowledge to understand user intent and the meaning of content
Incorporates the intent of the query and the meaning of content into
the search process (semantic models)
Wide range of semantic search systems
Employ different semantic models, possibly at different steps of the
search process and in order to support different tasks
15. Semantic models
Semantics is concerned with the meaning of the resources made
available for search
Various representations of meaning
Linguistic models: models of relationships among words
Taxonomies, thesauri, dictionaries of entity names
Inference along linguistic relations, e.g. broader/narrower terms
Conceptual models: models of relationships among objects
Ontologies capture entities in the world and their relationships
Inference along domain-specific relations
We will focus on conceptual models in this tutorial
In particular, the RDF/OWL conceptual model for representing
classes in a domain, and describing their instances
16. Semantic Search – a process view
Knowledge Representation
Semantic Models
• Keywords Resources
• Forms
Query • NL
Construction • Formal language
•IR-style matching & ranking
•DB-style precise matching
Query •KB-style matching & inferences
Processing
•Query visualization
•Document and data presentation Documents
Result •Summarization
Presentation
•Implicit feedback
•Explicit feedback
Query •Incentives
Refinement
Document Representation
17. Semantic Search systems
For data / document retrieval, semantic search systems might
combine a range of techniques, ranging from statistics-based IR
methods for ranking, database methods for efficient
indexing and query processing, up to complex reasoning
techniques for making inferences!
18. Example: Information Workbench
Addressing the lifecycle of
interacting with the Web of Data
Integration of data sources
Content generation by the end user
Search and Exploration
Visualization
Publishing
User-
generated Wikipedia
Integrated management of
heterogeneous data sources DBpedia, Yago
Structured and unstructured
Earthquake (Data.gov)
Published and user-generated
Static and dynamic
Open domain
Structured Dynamic
19. Data Sources in the Application
Entire English Wikipedia
Data from Linked Open Data
DBpedia
YAGO
…
Data from Data.gov (US Government)
E.g. live data about earthquakes
Many more
20. Semantic Search
Hybrid Search: Structured queries combined with keywords
across structured and unstructured data sources
Query interpretation: Translation of keywords into hybrid
queries
Keyword search/query interpretation combined with
faceted search: iterative refinement process based on keywords
and operations on facets
21. Search, Refinement and Navigation
Keywords
Query
Translations
Term
Completions
Vorlesung Knowledge Discovery - Institut AIFB
Facets
21
24. Data on the Web
Data on the Web is not directly accessible
Most web pages are generated from databases, but formatted for
human consumption
APIs offer limited views over data
Two solutions
Extraction using Information Extraction (IE) techniques
Out of scope for this tutorial
Relying on publishers to expose structured data using standard
Semantic Web formats
25. Information extraction
Natural Language Processing
Named entity recognition and disambiguation, sentiment analysis etc.
Extraction of information about entities
Suchanek et al. YAGO: A Core of Semantic Knowledge UnifyingWordNet and
Wikipedia,WWW, 2007.
Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007.
Extraction from HTML tables
Cafarella et al. WebTables: Exploring the Power of Tables on theWeb.VLDB 2008
Wrapper induction
Kushmerick et al.Wrapper Induction for Information ExtractionText extraction. IJCAI
2007
Filling web forms automatically (form-filling)
Madhavan et al. Google's Deep-Web Crawl.VLDB 2008
26. Semantic Web
Sharing data across the Web
Standard data model
RDF
A number of syntaxes (file formats)
RDF/XML, RDFa
Powerful, logic-based languages for schemas
OWL, RIF
Query languages and protocols
HTTP, SPARQL
27. Resource Description Framework (RDF)
Each resource (thing, entity) is identified by a URI
Globally unique identifiers
URLs a subset of URIs
Often abbreviated using namespaces
e.g. example:roi = http://example.org/roi
RDF represents knowledge as a set of triples
Each triple is a single fact about the entity (an attribute or a relationship)
A set of triples forms an RDF graph
RDF document
type foaf:Person
example:roi name
“Roi Blanco”
28. Linking resources
Roi’s homepage Friend-of-a-Friend ontology
type
example:roi foaf:Person
name
“Roi Blanco” knows
sameAs
Yahoo!’s website type
worksWith
#roi2 #peter
email
“pmika@yahoo-inc.com”
29. Ontologies
Ontologies are the schemas for RDF graphs
Define the intended meaning of certain classes and relationships in a
domain
e.g. the FOAF ontology defines the foaf:Person class and relationships such as
foaf:knows
Ontologies are published in standard languages such as OWL
Language defines the constructs for modeling and their meaning
e.g. subClassOf, sameAs
Tools for editing ontologies such as Protégé, TopBraid Composer
Reusing and extending well-known ontologies helps to interpret
unknown data
e.g. if X is subClassOf foaf:Person, and A is of type X, then we know
that A is a person
30. Example: schema.org
Agreement on a shared set of schemas for common types of web
content
Bing, Google, and Yahoo! as initial supporters
Similar in intent to sitemaps.org (2006)
Use a single format to communicate the same information to all three search
engines
Support for microdata
schema.org covers areas of interest to all search engines
Business listings (local), creative works (video), recipes, reviews
User defined extensions
Each search engine continues to develop its products
32. Publishing RDF
Interlinked RDF documents (Linked Data)
Each document describes a single resource with URIs pointing to
related resources
Common RDF file formats are RDF/XML and Turtle
Mostly implemented as a wrapper around a database or Web service
Embedding RDF inside HTML
RDFa, microdata
SPARQL endpoints
Triple stores are databases for managing RDF data
SPARQL is a standard protocol and query language for accessing
triple stores using HTTP
33. Linked Data
Grass-roots community effort to (re)publish open datasets in RDF
Centered around the Dbpedia project
Directory of datasets at linkeddata.org, thedatahub.org
34. Metadata in HTML anno 1995
What does this term mean?
<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-
html/">
<META name="DC.author" content="Peter Mika">
<LINK rel="DC.rights copyright"
href="http://www.example.org/rights.html" />
<LINK rel="meta" type="application/rdf+xml" title="FOAF"
href= "http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>
How is this data related?
35. Microformats (anno 2003)
Mark-up for data in HTML pages
Reuse existing HTML elements (class, rel)
Microformats exist for a limited set of objects
Persons, organizations, reviews, events etc., see microformats.org
No formal schemas
Limited reuse, extensibility of schemas
Unclear which combinations are allowed
No identifiers for entities
No interlinking between entities
<div class="vcard">
<a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
<div class="tel">+1-919-555-7878</div>
<div class="title">Area Administrator, Assistant</div>
</div>
36. RDFa and RDFa Lite (2008-2012)
W3C standards for embedding RDF data in HTML documents
A set of new HTML attributes to be used in head or body
A specification of how to extract the data from these attributes
RDFa is just a syntax, you have to choose a vocabulary separately
RDFa family
RDFa 1.0
Recommendation since October, 2008
RDFa 1.1 is a small update on RDFa to make it easier to use
RDFa Primer (Working Draft, May 8, 2012)
RDFa 1.1 Lite is a subset of RDFa 1.1 with the most commonly used
features
Proposed Recommendation (May 8, 2012)
37. Example: Facebook’s Open Graph Protocol
The ‘Like’ button provides publishers with a way to promote their
content on Facebook and build communities
Shows up in profiles and news feed
Site owners can later reach users who have liked an object
Facebook Graph API allows 3rd party developers to access the data
Open Graph Protocol is an RDFa-based format that allows to
describe the object that the user ‘Likes’
38. Example: Facebook’s Open Graph Protocol
RDF vocabulary to be used in conjunction with RDFa
Simplify the work of developers by restricting the freedom in RDFa
Activities, Businesses, Groups, Organizations, People, Places, Products and
Entertainment
Only HTML <head> accepted
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/"
/>
<meta property="og:image" content="http://ia.media-
imdb.com/images/rock.jpg" /> …
</head> ...
39. Microdata
HTML5
Currently under standardization at the W3C
Originally part of the HTML5 spec, but now a separate document
Comparable to RDFa 1.1 Lite
Key extensibility features (such as multiple types) missing
HTML5 also has a number of “semantic” elements such as <time>,
<video>, <article>…
<div itemscope itemid=“http://www.yahoo.com/resource/person”>
<p>My name is <span itemprop="name">Neil</span>.</p>
<p>My band is called <span itemprop="band">Four Parts Water</span>.
I was born on <time itemprop="birthday" datetime="2009-05-10"> May 10th
2009</time>.
<img itemprop="image" src=”me.png" alt=”me”></p>
</div>
40. Current state of metadata on the Web
31% of webpages, 5% of domains contain some metadata
Analysis of the Bing Crawl (US crawl, January, 2012)
RDFa is most common format
By URL: 25% RDFa, 7% microdata, 9% microformat
By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
Adoption is stronger among large publishers
Especially for RDFa and microdata
See also
P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW
2012
H.Mühleisen, C.Bizer.Web Data Commons - Extracting Structured
Data from Two Large Web Corpora, LDOW 2012
41. Exponential growth in RDFa data
Another five-fold increase between
October 2010 and January, 2012
Five-fold increase between
March, 2009 and October, 2010
Percentage of URLs with embedded metadata in various formats
43. Crawling the Semantic Web
Linked Data
Similar to HTML crawling, but the the crawler needs to parse
RDF/XML (and others) to extract URIs to be crawled
Semantic Sitemap/VOID descriptions
RDFa
Same as HTML crawling, but data is extracted after crawling
Mika et al. Investigating the Semantic Gap through Query Log
Analysis, ISWC 2010.
SPARQL endpoints
Endpoints are not linked, need to be discovered by other means
Semantic Sitemap/VOID descriptions
44. Data fusion
Ontology matching
Widely studied in Semantic Web research, see e.g. list of publications at
ontologymatching.org
Unfortunately, not much of it is applicable in a Web context due to the quality of
ontologies
Entity resolution
Logic-based approaches in the Semantic Web
Studied as record linkage in the database literature
Machine learning based approaches, focusing on attributes
Graph-based approaches, see e.g. the work of Lisa Getoor are applicable to
RDF data
Improvements over only attribute based matching
Blending
Merging objects that represent the same real world entity and reconciling
information from multiple sources
45. Data quality assessment and curation
Heterogeneity, quality of data is an even larger issue
Quality ranges from well-curated data sets (e.g. Freebase) to microformats
In the worst of cases, the data becomes a graph of words
Short amounts of text: prone to mistakes in data entry or extraction
Example: mistake in a phone number or state code
Quality assessment and data curation
Quality varies from data created by experts to user-generated content
Automated data validation
Against known-good data or using triangulation
Validation against the ontology or using probabilistic models
Data validation by trained professionals or crowdsourcing
Sampling data for evaluation
Curation based on user feedback
46. Indexing
Search requires matching and ranking
Matching selects a subset of the elements to be scored
The goal of indexing is to speed up matching
Retrieval needs to be performed in milliseconds
Without an index, retrieval would require streaming through the
collection
The type of index depends on the query model to support
DB-style indexing
IR-style indexing
47. IR-style indexing
Index data as text
Create virtual documents from data
One virtual document per subgraph, resource or triple
typically: resource
Key differences to Text Retrieval
RDF data is structured
Minimally, queries on property values are required
48. Horizontal index structure
Two fields (indices): one for terms, one for properties
For each term, store the property on the same position in the property
index
Positions are required even without phrase queries
Query engine needs to support the alignment operator
✓ Dictionary is number of unique terms + number of properties
Occurrences is number of tokens * 2
49. Vertical index structure
One field (index) per property
Positions are not required
But useful for phrase queries
Query engine needs to support fields
Dictionary is number of unique terms
Occurrences is number of tokens
✗ Number of fields is a problem for merging, query performance
50. Distributed indexing
MapReduce is ideal for building inverted indices
Map creates (term, {doc1}) pairs
Reduce collects all docs for the same term: (term, {doc1, doc2…}
Sub-indices are merged separately
Term-partitioned indices
Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
52. Structure
Taxonomy of search approaches
Query processing / matching techniques for Semantic Search
Types of semantic data
Formalisms for querying semantic data
Approaches
General task: hybrid graph pattern matching
Matching keyword query against text
Matching structured query against structured data
Matching keyword query against structured data
Matching structured query against text (a hybrid case)
Main tasks, challenges and opportunities
53. Taxonomy of search approaches (1)
The search problem
A collection of resources, called data
Information needs expressed as queries
Search is the task of efficiently computing results from data that
are relevant to queries
Document data retrieval vs. structured data retrieval
Differences in query and data representation and matching
Efficiently retrieve structured data that exactly match formal
information needs expressed as structured queries
Effectively rank textual results that match ambiguous NL / keyword
queries to a certain degree (notions of relevance)
Semantic search: ranked retrieval of document and structured
data (given ambiguous queries / data)
54. Taxonomy of search approaches (2)
Query engines (of databases)
Exact
Complete
Query
Sound
• Approximate
Matching
• Not complete
• Not sound
• Ranked
Data
• Best effort
• Top-k
Search engines (stand-alone/database extensons)
Query processing mainly focuses on efficiency of matching whereas ranking
deals with degree of matching (relevance)!
55. Query processing for Semantic Search (1)
Resources represented by semantic data ranging from
Structured data with well defined schemas
Semi-structured data with incomplete or no schemas
Data that largely comprise text
Hybrid / embedded data
Information needs of varying complexity, captured using different
formalisms and querying paradigms
Natural language texts and keywords
Form-based inputs
Formal structured queries
(Search is end-user oriented paradigm, requires “natural”, intuitive querying interfaces)
Semantic search: efficiently computing results (query processing)
from data that are relevant to queries (ranking)
56. Query processing for Semantic Search (2)
NL Form- / facet- Structured Queries
Keywords
Questions based Inputs (SPARQL)
Ambiquities
Query
Matching
Data
RDF data
Semi-Structured Structured OWL ontologies with
embedded in text
RDF data RDF data rich, formal semantics
(RDFa)
Ambiquities: confidence degree, truth/trust value…
57. Query processing for Semantic Search (3)
Textual Data
Structured query
Keyword query on
on textual data ,
textual data, e.g.
e.g. querying
Web search
extension for
systems Search target different
Semantic search systems?
group of users, information needs,
Unstructured and types of data. Query processing Structured
Query Query
for semantic searchStructured query
is hybrid
Keyword query on
combination of techniques!
on structured data
structured data,
e.g. standard
e.g. search
querying interface
extensions for
for databases /
databases
RDF stores
Structured Data
58. Types of data models (1)
Textual
Bag-of-words
Represent documents, text in structured data,…, real-world
objects (captured as structured data)
Lacks “structure”
in text, e.g. linguistic structure, hyperlinks, (positional information)
Structure in structured data representation
term (statistics)
combination
Cloud
In combination with Cloud
Computing technologies,
Technologies
promising solutions for the
management of `big data'
solutions
management
have emerged. Existing
`big data'
industry solutions are able to
support complex queries and
industry
solutions
analytics tasks with terabytes
of data. For example, using a
support
complex
Greenplum.
……
59. Types of data models (2)
Textual
Structured
Resource Description Framework (RDF)
Represent real-world objects, services, applications, …. documents
Resource attribute values and relationships between resources
Schema
creator Picture
Person
Bob
60. Types of data models (3)
Textual
Structured
Hybrid
RDF data embedded in text (RDFa)
61. Types of data models – RDFa (1)
…
<div about="/alice/posts/trouble_with_bob">
<h2 property="dc:title">The trouble with Bob</h2>
<h3 property="dc:creator">Alice</h3>
Bob is a good friend of mine. We went to the same university, and
also shared an apartment in Berlin in 2008. The trouble with Bob is that
he takes much better photos than I do:
<div about="http://example.com/bob/photos/sunset.jpg">
<img src="http://example.com/bob/photos/sunset.jpg" />
<span property="dc:title">Beautiful Sunset</span>
by <span property="dc:creator">Bob</span>.
</div>
</div>
…
adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
62. Types of semantic data – RDFa (2)
Bob is a good friend of mine. We went to content
the same university, and also shared an
apartment in Berlin in 2008. The trouble
with Bob is that he takes much better
photos than I do:
content
adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
63. Types of semantic data - conclusion
Semantic data in general can be conceived as a graph
with text and structured data items as nodes, and
edges represent different types of relationships
including explicit semantic relationships and
vaguely specified ones such as hyperlinks!
64. Formalisms for querying semantic data (1)
Example information need
“Information about a friend of Alice, who shared an apartment with
her in Berlin and knows someone working at KIT.”
Unstructured queries
Fully-structured queries
Hybrid queries: unstructured + structured
65. Formalisms for querying semantic data (2)
Example information need
“Information about a friend of Alice, who shared an apartment
with her in Berlin and knows someone working at KIT.”
Unstructured
NL
Keywords
shared apartment Berlin Alice
66. Formalisms for querying semantic data (3)
Example information need
“Information about a friend of Alice, who shared an apartment
with her in Berlin and knows someone working at KIT.”
Unstructured
Fully-structured
SPARQL: BGP, filter, optional, union, select, construct, ask, describe
PREFIX ns: <http://example.org/ns#>
SELECT ?x
WHERE { ?x ns:knows ? y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” }
67. Formalisms for querying semantic data (4)
Fully-structured
Unstructured
Hybrid: content and structure constraints
“shared apartment Berlin Alice”
?x ns:knows ? y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns: works ?v.
?v ns:name “KIT”
68. Formalisms for querying semantic data (5)
Fully-structured
Unstructured
Hybrid: content and structure constraints
“shared apartment Berlin Alice”
?x ns:knows ? y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns: works ?v.
?v ns:name “KIT”
69. Formalisms for querying semantic data - conclusion
Semantic search queries can be conceived as graph
patterns with nodes referring to text and
structured data items, and edges referring to
relationships between these items!
70. Processing hybrid graph patterns (1)
Example information need
“Information about a friend of Alice, who shared an apartment with her in Berlin and
knows someone working at KIT.”
apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
?y ns:name “Alice”. ?x ns:knows ? y
trouble with bob FluidOps 34
Peter
sunset.jpg
Bob is a good friend of
Beautiful
mine. We went to the Sunset
same university, and Germany Semantic
Alice Search
also shared an apartment
in Berlin in 2008. The
trouble with Bob is that
he takes much better Germany 2009
Bob
photos than I do: Thanh
KIT
72. Matching keyword query against text
• Retrieve documents
• Inverted list (inverted index)
keyword {<doc1, pos, score, ...>,
<doc2, pos, score, ...>, ...}
• AND-semantics: top-k join
shared Berlin Alice shared Berlin Alice
D1 D1 D1
shared = berlin = alice
shared
73. Matching structured query against structured data
• Retrieve data for triple patterns
• Index on tables
• Multiple “redundant” indexes to cover different access patterns
• Join (conjunction of triples)
• Blocking, e.g. linear merge join (required sorted input)
• Non-blocking, e.g. symmetric hash-join
• Materialized join indexes
?x ns:knows ?y. ?x ns:knows ?z.
Per1 ns:works ?v ?v ns:name “KIT”
SP-index PO-index ?z ns: works ?v. ?v ns:name “KIT”
=
= =
Per1 ns:works Ins1 Ins1 ns:name KIT
Per1 ns:works Ins1 Ins1 ns:name KIT
74. Matching keyword query against structured data
• Retrieve keyword elements
• Using inverted index
keyword {<el1, score, ...>, <el2, score, ...>,…}
• Exploration / “Join”
• Data indexes for triple lookup
• Materialized index (paths up to graphs)
• Top-k Steiner tree search, top-k subgraph exploration
Alice Bob KIT Alice Bob KIT
↔ ↔
=
=
Alice ns:knows Bob Inst1 ns:name KIT
Bob ns:works Inst1
75. Matching structured query against text
• Based on offline IE (offline see Peter’s slides)
• Based on online IE, i.e., “retrieve “ is as follows
• Derive keywords to retrieve relevant documents
• On-the-fly information extraction, i.e., phrase pattern matching “X name Y”
• Retrieve extracted data for structured part
• Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.
?x ns:knows ?y. ?x ns:knows ?z.
?z ns: works ?v. ?v ns:name “KIT”
knows
name KIT
76. Matching structured query against text
• Index
• Inverted index for document retrieval and pattern matching
• Join index inverted index for storing materialized joins between keywords
• Neighborhood indexes for phrase patterns
?x ns:knows ?y. ?x ns:knows ?z.
?z ns: works ?v. ?v ns:name “KIT”
KIT name
knows
name KIT
77. Query processing – main tasks
Retrieval
Documents , data elements, triples, paths,
Query graphs
Inverted index,…, but also other (B+ tree)
Index documents, triples, materialized paths
Matching
Join
Different join implementations, efficiency
depends on availability of indexes
Non-blocking join good for early result
Data reporting and for “unpredictable” Linked
Data / data streams scenario
78. Query processing – more tasks
More complex queries: disjunction,
aggregation, grouping, analytics…
Query Join order optimization
Approximate
Approximate the search space
Approximate the results (matching, join)
Matching
Parallelization
Top-k
Use only some entries in the input streams to
produce k results
Data Multiple sources
Federation, routing
On-the-fly mapping, similarity join
Hybrid
Join text and data
79. Query processing on the Web -
research challenges and opportunities
Large amount of semantic
data
• Optimization, parallelization
Data inconsistent,
• Approximation
redundant, and low quality
• Hybrid querying and data
Large amount of data management
embedded in text
• Federation, routing
Large amount of sources • Online schema mappings
Large amount of links • Similarity join
between sources
82. Ranking – problem definition
Query
• Ambiguities arise when representation is
incomplete / imprecise
• Ambiguities at the level of
Matching
• elements (content ambiguity)
• structure between elements (structure
ambiguity)
Data
Due to ambiguities in the representation of the information needs and
the underlying resources, the results cannot be guaranteed to exactly
match the query. Ranking is the problem of determining the degree
of matching using some notions of relevance.
83. Content ambiguity
apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
?y ns:name “Alice”. ?x ns:knows ? y
trouble with bob FluidOps 34
Peter
sunset.jpg
Bob is a good friend of
Beautiful
mine. We went to the Sunset
same university, and Germany Semantic
Alice Search
also shared an apartment
in Berlin in 2008. The
trouble with Bob is that
he takes much better Germany 2009
Bob
photos than I do: Thanh
KIT
What is meant by “Berlin” in the query? What is meant by “KIT” in the query?
What is meant by “Berlin” in the data? What is meant by “KIT” in the data?
A city with the name Berlin? a person? A research group? a university? a location?
84. Structure ambiguity
apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
?y ns:name “Alice”. ?x ns:knows ? y
trouble with bob FluidOps 34
Peter
sunset.jpg
Bob is a good friend of
Beautiful
mine. We went to the Sunset
same university, and Germany Semantic
Alice Search
also shared an apartment
in Berlin in 2008. The
trouble with Bob is that
he takes much better Germany 2009
Bob
photos than I do: Thanh
KIT
What is the connection between “Berlin” and What is meant by “works”?
“Alice”? Works at? employed?
Friend? Co-worker?
85. Ambiguity
Recall: query processing is matching at the level of syntax and
semantics
Ambiguities arise when data or query allow for multiple
interpretations, i.e. multiple matches
Syntactic, e.g. works vs. works at
Semantic, e.g. works vs. employ
“Aboutness”, i.e., contain some elements which represent the
correct interpretation
Ambiguities arise when matching elements of different granularities
Does i contains the interpretation for j, given some part(s) of i
(syntactically/semantically) match j
E.g. Berlin vs. “…we went to the same university, and also, we shared an
apartment in Berlin in 2008…”
Strictly speaking, ranking is performed after syntactic / semantic matching is
done!
86. Features: What to use to deal with ambiguities?
What is meant by “Berlin”? What is the connection between
“Berlin” and “Alice”?
Content features
Frequencies of terms: d more likely to be “about” a query term k
when d more often, mentions k (probabilistic IR)
Co-occurrences: terms K that often co-occur form a contextual
interpretation, i.e., topics (cluster hypothesis)
Structure features
Consider relevance at level of fields
Linked-based popularity
87. Ranking paradigms
Explicit relevance model
Foundation: probability ranking principle
Ranking results by the posterior probability (odds) of being
observed in the relevant class:
P(w|R) varies in different approaches, e.g., binary independence
model, 2-poisson model, relevance model
P( D | R ) = ∏ P ( w | R)∏ (1 − P( w | N ))
P(D|R)
P(D/N) w∈D w∉D
k
P( w | R) ≈ P ( w | q1 ,..., qk ) = ∑ P(m) P(w | m)∑ P(q k | m)
m∈M i =1
88. Ranking paradigms
No explicit notion of relevance: similarity between the query and
the document model
Vector space model (cosine similarity)
Language models (KL divergence)
Sim ( q , d ) = Cos ( ( w1, d ,..., w t , d ), ( w1, q ,..., w k , q ))
P (t | θ q )
Sim ( q , d ) = − KL (θ q ||θ d ) = − ∑ P (t | θ q ) log(
t ∈V P (t | θ d )
89. Model construction
How to obtain
Relevance models?
Weights for query / document terms?
Language models for document / queries?
90. Content-based model construction
Document statistics, e.g. • An object is more likely about
Term frequency “Berlin”?
Document length • When it contains a
Collection statistics, e.g. relatively high number of
mentions of the term
Inverse document frequency
“Berlin”
Background language models
• When the number of
mentions of this term in the
overall collection is relatively
tf
wt ,d = ∗ idf low
|d |
tf
P (t | θ d ) = λ + (1 − λ ) P ( t | C )
|d |
91. Structure-based model construction
Consider structure of objects during content-based modeling,
i.e., to obtain structured content-based model
Content-based model for structured objects, documents and for
general tuples
P (t | θ d ) = ∑α f P (t | θ f )
f ∈ Fd
• An object is more likely about “Berlin”?
• When one of its (important) fields contains a relatively high
number of mentions of the term “Berlin”
92. Structure-based model construction
PageRank
Link analysis algorithm
Measuring relative importance of nodes
Link counts as a vote of support
The PageRank of a node recursively depends on the number and
PageRank of all nodes that link to it (incoming links)
ObjectRank
Types and semantics of links vary in structured data setting
Authority transfer schema graph specifies connection strengths
Recursively compute authority transfer data graph
• An object about “Berlin” is more important than one another?
•When a relatively large number of objects are linked to it
93. Taxonomy of ranking approaches
Explicitly vs. non-explicitly relevance-based
Content-based ranking
Structure-based ranking
Content- and-structure-based ranking
95. Search interface
Input and output functionality
helping the user to formulate complex queries
presenting the results in an intelligent manner
Semantic Search brings improvements in
Query formulation
Snippet generation
Suggesting related entities
Adaptive and interactive presentation
Presentation adapts to the kind of query and results presented
Object results can be actionable, e.g. buy this product
Aggregated search
Grouping similar items, summarizing results in various ways
Filtering (facets), possibly across different dimensions
Task completion
Help the user to fulfill the task by placing the query in a task context
96. Query formulation
“Snap-to-grid”: suggest the most likely interpretation of the query
Given the ontology or a summary of the data
While the user is typing or after issuing the query
Example: Freebase suggest, TrueKnowledge
97. Enhanced results/Rich Snippets
Use mark-up from the webpage to generate search snippets
Originally invented at Yahoo! (SearchMonkey)
Google,Yahoo!, Bing, Yandex now consume schema.org markup
Validators available from Google and Bing
98. Other result presentation tasks
Select the most relevant resources within an RDF document
Penin et al. Snippet Generation for Semantic Web Search Engines,
ASWC 2010
For each resource, rank the properties to be displayed
Natural Language Generation (NLG)
Verbalize, explain results
104. Semantic Search challenge (2010/2011)
Two tasks
Entity Search
Queries where the user is looking for a single real world object
Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010.
List search (new in 2011)
Queries where the user is looking for a class of objects
Billion Triples Challenge 2009 dataset
Evaluated using Amazon’s Mechanical Turk
Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010
Blanco et al. Repeatable and Reliable Search System Evaluation using
Crowd-Sourcing, SIGIR2011
106. Catching the bad guys
Payment can be rejected for workers who try to game the system
An explanation is commonly expected, though cheaters rarely complain
We opted to mix control questions into the real results
Gold-win cases that are known to be perfect
Gold-loose cases that are known to be bad
Metrics
Avg. and std. dev on gold-win and gold-loose results
Time to complete
Worker Known Real Known Good Total Time to
bad N complete
N Mean N Mean N Mean (sec)
badguy 20 2.556 200 2.738 20 2.684 240 29.6
goodguy 13 1 130 2.038 13 3 156 95
whoknows 1 1 21 1.571 2 3 24 83.5
107. Other evaluations
TREC Entity Track
Related Entity Finding
Entities related to a given entity through a particular relationship
Retrieval over documents (ClueWeb 09 collection)
Example: (Homepages of) airlines that fly Boeing 747
Entity List Completion
Given some elements of a list of entities, complete the list
Question Answering over Linked Data
Retrieval over specific datasets (Dbpedia and MusicBrainz)
Full natural language questions of different forms
Correct results defined by an equivalent SPARQL query
Example: Give me all actors starring in Batman Begins.