This document discusses web-scale semantic search and knowledge graphs. It introduces semantic search, which is concerned with understanding the meaning of queries, terms, documents, and results; this is achieved by linking text to unambiguous concepts or entities. It then discusses knowledge graphs, which define entities, attributes, types, relations, and more, and which form the backbone of semantic search. It also covers the tasks involved in semantic search, such as information extraction, entity linking, query understanding, and result ranking.
2. Introduction
▪ Web search queries are increasingly entity-oriented
› fact finding
› transactional
› exploratory
› …
▪ Users expect increasingly “smarter” results
› geared towards a user’s personal profile and current context
• device type/user agent, day/time, location, …
› with sub-second response time
› fresh/up to date, buzzy
› more than “just 10 blue links”
3. Semantic search
▪ “Plain” IR works well for simple queries and unambiguous terms
▪ But it gets harder when there is additional context, (implicit) intent, …
› “barack obama height”
› “best philips tv” > “sony tv” > “panasonic”
› “pizza”
› “pizza near me”
▪ Semantic search deals with “meaning”
› essential ingredient: linking text to unambiguous “concepts” (≈ entities)
› Add:
• query understanding
• ranking
• result presentation
6. Semantic search
▪ “Improve search accuracy by understanding searcher intent and the
contextual meaning of queries/terms/documents/results/…”
› “eiffel tower height”
› “brad pitt married”
› “chinese food”
› “obama net worth”
▪ Uncertainties in
› the source(s)
› the query
› the user
› the intent
› …
11. Entities are often not directly observed
▪ They appear in different forms, in different types of data
▪ Unstructured
› queries, documents, tweets, web pages, snippets, …
▪ Semistructured
› inline XML, RDFa, schema.org, … (see the sketch below)
▪ Structured
› Relational DBs, RDF, …
[Diagram: an information need is issued to a retrieval system over data collection(s), which are often organized around entities, producing result(s).]
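To make the semistructured case concrete, here is a minimal schema.org-style JSON-LD entity description, expressed as a Python dict. The entity and its property values are illustrative, not taken from the deck:

```python
# A schema.org-style JSON-LD description of an entity, as a Python dict.
# The values here are illustrative examples, not from the original slides.
movie = {
    "@context": "https://schema.org",
    "@type": "Movie",
    "name": "East of Eden",
    "datePublished": "1955",
    "actor": {"@type": "Person", "name": "James Dean"},
}
```

Markup like this lets an extraction pipeline observe entities and their attributes directly, without guessing them from free text.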
12. KG vision @ Yahoo
▪ A unified knowledge graph for Yahoo
› all entities and topics relevant to Yahoo (users)
› rich information about entities: facts, relationships, features
› identifiers, interlinking across data sources, and links to relevant services
▪ To power knowledge-based services at Yahoo
› search: display, and search for, information about entities
› discovery: relate entities, interconnect data sources, link to relevant services
› understanding: recognize entities in queries and text
▪ Managed and served by a central knowledge team / platform
13. Steps
▪ Knowledge Acquisition: ongoing information extraction, from complementary sources
▪ Knowledge Integration: reconciliation into a unified knowledge repository
▪ Knowledge Consumption: enrichment and serving…
17.
[Pipeline diagram spanning Knowledge Acquisition → Knowledge Integration → Knowledge Consumption, with components: Data Acquisition, Information Extraction, Schema Mapping, Blending, Entity Reconciliation, Knowledge Repository (common ontology), Editorial Curation, Enrichment, Data Quality Monitoring, and Serving Export.]
18. Information Extraction
▪ Extraction of entities, attributes, relationships, features
› deal w/ scale, volatility, heterogeneity, inconsistency, schema complexity, breakage
› expensive to build and maintain (e.g., declarative rules, expert knowledge, ML…)
› being able to measure and monitor data quality is key
▪ Mixed approach
› parsing of large data feeds and online data APIs
› structured data extraction on the Web: markup, web scraping, wrapper induction
› Wikipedia mining, web mining, news mining, open information extraction
20. Entity Reconciliation & Blending
▪ Disambiguate and merge entities across/within data sources (sketched in code after this slide)
› Blocking: select the candidates most likely to refer to the same real-world entity (fast approximate similarity, hashing)
› Scoring: compute a similarity score between all pairs of candidates (ML classifier or heuristics)
› Clustering: decide which candidates refer to the same entity and interlink them (ML clustering or heuristics)
› Merging: build a unified object for each cluster, populated with the best properties (ML selection or heuristics)
▪ Challenges
› not trivial!
› scale and adapt to new entity types, data sources, data sizes, update frequencies…
› ongoing reconciliation/blending/evaluation. Need for consistent entity IDs. Provenance.
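A minimal sketch of the blocking → scoring → clustering → merging flow above. The record fields, the first-token blocking key, the Jaccard scorer, and the threshold are all illustrative assumptions, not the production system:

```python
from itertools import combinations

def blocking_key(rec):
    # Blocking: a cheap key so only likely duplicates are compared
    # (here: first token of the normalized name).
    return rec["name"].lower().split()[0]

def score(a, b):
    # Scoring: token-overlap (Jaccard) heuristic; a real system
    # might use an ML classifier here instead.
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    return len(ta & tb) / len(ta | tb)

def reconcile(records, threshold=0.6):
    # Clustering: union-find over candidate pairs scoring above the threshold.
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    blocks = {}
    for i, rec in enumerate(records):
        blocks.setdefault(blocking_key(rec), []).append(i)
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if score(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    # Merging: one unified object per cluster; "best" property value
    # is naively the longest one here.
    merged = []
    for ids in clusters.values():
        entity = {}
        for i in ids:
            for key, val in records[i].items():
                if len(str(val)) > len(str(entity.get(key, ""))):
                    entity[key] = val
        merged.append(entity)
    return merged
```

For example, `reconcile([{"name": "Barack Obama", "height": "1.85 m"}, {"name": "Barack H. Obama"}])` returns a single merged record: both names share the block "barack" and score 2/3 ≥ 0.6.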
28. Entity linking
▪ Typical steps
› mention detection – which part(s) should we link?
› candidate ranking/selection – where should they be linked to?
› disambiguation – maximize a “global” objective function
▪ Entity linking for web search queries
› pre-retrieval
• needs to be fast, space-efficient, and accurate
› queries are short and noisy
• high level of ambiguity
• limited context
29. Entity linking for web search
▪ Approach
› probabilistic model
• unsupervised
• large-scale set of aliases from Wikipedia and from click logs
› contextual relevance model based on neural language models
› state-of-the-art hashing and bit encoding techniques
30. Entity linking for web search
▪ Idea: jointly model mention detection (segmentation) and entity selection
› compute a probabilistic score for each segment-entity pair
› optimize the score of the whole query
▪ Random variables, over the event space $S \times E$ where $S$ is the set of all term sequences and $E$ the set of all entities known to the system:
› $s$: a sequence of terms $s \in \mathbf{s}$, drawn from $S$, $s \sim \mathrm{Multinomial}(\theta_s)$
› $\mathbf{e}$: a set of entities $e \in \mathbf{e}$, where each $e$ is drawn from $E$, $e \sim \mathrm{Multinomial}(\theta_e)$
› $a_s \sim \mathrm{Bernoulli}(\theta_a^{s})$: indicates whether $s$ is an alias
› $a_{s,e} \sim \mathrm{Bernoulli}(\theta_a^{s,e})$: indicates whether $s$ is an alias pointing (linked/clicked) to $e$
› $c$: the collection acting as source of information, query logs or Wikipedia ($c_q$ or $c_w$)
› $n(s, c)$ and $n(e, c)$: counts of $s$ and of $e$ in $c$
▪ Let $q$ be the input query, represented by the set $S_q$ of all possible segmentations of its tokens $t_1 \cdots t_k$. The algorithm returns the set of entities $\mathbf{e}$, along with their scores, that maximizes

$$\operatorname*{argmax}_{\mathbf{e} \subseteq E} \log P(\mathbf{e} \mid q) = \operatorname*{argmax}_{\mathbf{e} \subseteq E,\; \mathbf{s} \in S_q} \sum_{e \in \mathbf{e}} \log P(e \mid s) \qquad (1)$$

$$\text{s.t. } s \in \mathbf{s}, \qquad \bigcup_{s \in \mathbf{s}} s \subseteq q, \qquad \bigcap_{s \in \mathbf{s}} s = \emptyset \qquad (2)$$

In Eq. 1 we assume independence of the entities given a query segment, and in Eq. 2 we impose that the segments are disjoint. Each individual entity/segment probability is then estimated as on the next slide; a per-entity variant instead picks, for each entity, the segmentation that optimizes it. Both are instances of the segmentation problem: given terms $t = t_1 \cdots t_k$, a segment is any $[t_i \ldots t_{i+j-1}]$, and a function $\sigma$ maps segments to real numbers. When $\sigma(\cdot)$ depends only on its own segment and not on the others, the maximum-score segmentation can be computed in $O(k^2)$ time using dynamic programming; here $\sigma(s) = \mathrm{highestscore}(s, q)$, the score of the best entity for segment $s$.
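A minimal sketch of that $O(k^2)$ dynamic program. `sigma` is a placeholder for a per-segment scorer (in FEL, the score of the best entity for the segment), and the composition function ⊕ is instantiated as addition:

```python
def best_segmentation(tokens, sigma):
    """Find the disjoint segmentation of `tokens` maximizing the sum of segment scores.

    Valid in O(k^2) only because sigma(segment) scores each segment
    independently of the others, as stated above.
    """
    k = len(tokens)
    best = [0.0] + [float("-inf")] * k  # best[i] = best score over tokens[:i]
    back = [0] * (k + 1)                # back[i] = start index of the last segment
    for i in range(1, k + 1):
        for j in range(i):              # candidate last segment: tokens[j:i]
            cand = best[j] + sigma(tuple(tokens[j:i]))
            if cand > best[i]:
                best[i], back[i] = cand, j
    segments, i = [], k
    while i > 0:                        # walk back-pointers to recover the segments
        segments.append(tuple(tokens[back[i]:i]))
        i = back[i]
    return best[k], segments[::-1]

# Usage sketch: score, segments = best_segmentation("barack obama height".split(), sigma)
# where sigma(seg) might return the best entity's score for seg, or 0.0 if seg links to nothing.
```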
31. Main idea
▪ Use dynamic programming to solve
▪ Model background: entities are connected to their aliases (their textual representations, also known as surface forms) by leveraging anchor text, and user queries leading to a click on the web page that represents the entity. Wikipedia is used as the KB, so only anchor text within Wikipedia and clicks from web search results on Wikipedia pages are considered, although the approach is general enough to accommodate other sources of information. The problem is to automatically segment the query and simultaneously select the right entity for each segment. The Fast Entity Linker (FEL) tackles this by computing a probabilistic score for each segment-entity pair and then optimizing the score of the whole query. No supervision is employed; the model and data operate in a parameterless fashion. It is, however, possible to add a layer that uses human-labeled training data to enhance performance; this is the approach followed in Alley-oop, where this ranking model performs a first-phase ranking, followed by a second phase using a supervised, machine-learned model.
▪ Each individual entity/segment probability is estimated as:

$$P(e \mid s) = \sum_{c \in \{c_q, c_w\}} P(c \mid s)\, P(e \mid c, s) = \sum_{c \in \{c_q, c_w\}} P(c \mid s) \sum_{a_s \in \{0,1\}} P(a_s \mid c, s)\, P(e \mid a_s, c, s)$$

$$= \sum_{c \in \{c_q, c_w\}} P(c \mid s)\, \big( P(a_s = 0 \mid c, s)\, P(e \mid a_s = 0, c, s) + P(a_s = 1 \mid c, s)\, P(e \mid a_s = 1, c, s) \big)$$

The maximum-likelihood probabilities are (note that $P(e \mid a_s = 0, c, s) = 0$, so the $a_s = 0$ term cancels out):

$$P(c \mid s) = \frac{n(s, c)}{\sum_{c'} n(s, c')} \qquad (3)$$

$$P(a_s = 1 \mid c, s) = \frac{\sum_{s : a_s = 1} n(s, c)}{n(s, c)} \qquad (4)$$

$$P(e \mid a_s = 1, c, s) = \frac{\sum_{s : a_{s,e} = 1} n(s, c)}{\sum_{s : a_s = 1} n(s, c)} \qquad (5)$$

These maximum-likelihood probabilities can be smoothed appropriately using an entity prior, e.g., Dirichlet smoothing. The resulting objective is implemented efficiently with dynamic programming in $O(k^2)$, where $k$ is the number of query terms.
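A minimal sketch of Eqs. 3–5 computed from count tables. The dict-of-dicts layout and the collection names are illustrative assumptions (they stand in for $c_w$ and $c_q$), and smoothing is omitted for brevity:

```python
def p_entity_given_segment(s, e, n_seg, n_alias, n_alias_e,
                           collections=("wiki", "query_log")):
    """P(e|s) = sum_c P(c|s) * P(a_s=1|c,s) * P(e|a_s=1,c,s);
    the a_s=0 term is zero, per Eqs. 3-5.

    n_seg[c][s]        : n(s, c), occurrences of segment s in collection c
    n_alias[c][s]      : occurrences of s used as an alias (a_s = 1)
    n_alias_e[c][(s,e)]: occurrences of s as an alias linking/clicked to e
    """
    total_s = sum(n_seg[c].get(s, 0) for c in collections)
    if total_s == 0:
        return 0.0
    p = 0.0
    for c in collections:
        n_s = n_seg[c].get(s, 0)
        n_s_alias = n_alias[c].get(s, 0)
        if n_s == 0 or n_s_alias == 0:
            continue
        p_c = n_s / total_s                            # Eq. 3: P(c|s)
        p_alias = n_s_alias / n_s                      # Eq. 4: P(a_s=1|c,s)
        p_e = n_alias_e[c].get((s, e), 0) / n_s_alias  # Eq. 5: P(e|a_s=1,c,s)
        p += p_c * p_alias * p_e   # entity-prior smoothing omitted for brevity
    return p
```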
33. Contextual relevance model
▪ Note: P(e|s) is independent of other s’s in each query
› fast, but might be suboptimal
› e.g., “hollywood lyrics”
▪ Solution: contextual relevance model
› add query “context” t ∈ q ∖ s (i.e., the query remainder) into the model: P(e|s, q)
• boils down to calculating Π_t P(t|e) and merging this back into the model
› naive implementation: LM or NB-based
› more advanced: use word embeddings from neural language models
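One illustrative way to realize the embedding-based variant: compare the entity's embedding with the centroid of the context-term embeddings and blend the result with P(e|s). The blending weight, the [0, 1] rescaling, and the vector lookup are assumptions for the sketch, not the exact production scorer:

```python
import numpy as np

def context_score(entity_vec, context_term_vecs):
    # Centroid of the context terms (the query remainder q \ s),
    # cosine-compared to the entity embedding.
    centroid = np.mean(context_term_vecs, axis=0)
    return float(np.dot(entity_vec, centroid) /
                 (np.linalg.norm(entity_vec) * np.linalg.norm(centroid) + 1e-12))

def contextual_p(p_e_given_s, entity_vec, context_term_vecs, alpha=0.5):
    # Illustrative combination of the static P(e|s) with contextual
    # similarity, mapped from [-1, 1] to [0, 1].
    if len(context_term_vecs) == 0:
        return p_e_given_s
    sim01 = (context_score(entity_vec, context_term_vecs) + 1) / 2
    return (1 - alpha) * p_e_given_s + alpha * sim01
```

For “hollywood lyrics”, the context term “lyrics” should pull the score toward the song entity rather than the district.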
34. Evaluation
▪ Webscope query-to-entities dataset (publicly available)
› http://webscope.sandbox.yahoo.com/
› 2.6k queries with 6k editorially linked Wikipedia entities
▪ 4.6m candidate Wikipedia entities
▪ Entities and aliases
› 13m aliases from Wikipedia hyperlinks (after filtering)
› 100m aliases from click logs
▪ Baselines
› commonness (most likely sense of an alias)
› IR-based (LM)
› Wikifier
› Bing
35. Results
[Figure 2 and its surrounding discussion are elided here: a breakdown of performance by query length, a note that the bulk of aliases (86%) point to the original entity, and embedding details: word2vec vectors with dimensionality D = 200.]
              P@1     MRR     MAP     R-Prec
LM            0.0394  0.1386  0.1053  0.0365
LM-Click      0.4882  0.5799  0.4264  0.3835
Bing          0.6349  0.7018  0.5388  0.5223
Wikifier      0.2983  0.3201  0.2030  0.2086
Commonness    0.7336  0.7798  0.6418  0.6464
FEL           0.7669  0.8092  0.6528  0.6575
FEL+Centroid  0.8035  0.8366  0.6728  0.6765
FEL+LR        0.8352  0.8684  0.6912  0.6883
Table 4: Entity linking efficacy.
Baseline details: the commonness approach (cmns) matches the longest n-grams and ignores the smaller, constituent n-grams; otherwise it recurses and tries to match the (n−1)-grams. Information retrieval-based approaches (denoted LM) make use of a Wikipedia index and rank the pages using their retrieval scores.
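For reference, minimal implementations of the four reported metrics over a single query, assuming a ranked entity list and a set of relevant entities; MRR and MAP are the means of the per-query values:

```python
def p_at_1(ranked, relevant):
    # Precision at rank 1: is the top result relevant?
    return 1.0 if ranked and ranked[0] in relevant else 0.0

def reciprocal_rank(ranked, relevant):
    # 1/rank of the first relevant result (averaged over queries -> MRR).
    for i, e in enumerate(ranked, 1):
        if e in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    # Precision averaged at each relevant hit (averaged over queries -> MAP).
    hits, score = 0, 0.0
    for i, e in enumerate(ranked, 1):
        if e in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def r_precision(ranked, relevant):
    # Precision in the top R results, where R = number of relevant entities.
    r = len(relevant)
    return sum(1 for e in ranked[:r] if e in relevant) / r if r else 0.0
```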
36. Optimizations
▪ Early stopping
▪ Compressing word embedding vectors (idea sketched below)
› Golomb coding + Elias-Fano monotone sequence data structure
• allowing retrieval in constant time
› compression:
• word vectors: 3.44 bits per entry
• centroid vectors: 3.42 bits per entry
• LR vectors: 3.83 bits per entry
• overall: ~10 times smaller than 32-bit floating point
▪ Compressing counts
▪ Sub-millisecond response time
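The Golomb/Elias-Fano encoder itself is out of scope here, but the underlying idea, spending a few bits per vector entry instead of 32, can be sketched with plain uniform scalar quantization. The 4-bit width is an illustrative choice, not the paper's scheme:

```python
import numpy as np

def quantize(vec, bits=4):
    # Uniform scalar quantization: map each float to one of 2^bits levels
    # over the vector's own range. bits must be <= 8 for uint8 storage.
    lo, hi = float(vec.min()), float(vec.max())
    scale = (hi - lo) or 1.0
    levels = (1 << bits) - 1
    codes = np.round((vec - lo) / scale * levels).astype(np.uint8)
    return codes, lo, hi

def dequantize(codes, lo, hi, bits=4):
    # Reconstruct approximate floats from the level codes.
    levels = (1 << bits) - 1
    return lo + codes.astype(np.float32) / levels * ((hi - lo) or 1.0)
```

At 4 bits per entry versus 32-bit floats this already gives an 8× reduction, in the same ballpark as the ~10× figure above; the entropy coding used in the actual system squeezes this further, to ~3.4 bits per entry.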
38. Query understanding
▪ Entails mapping queries to entities, relations, types, attributes, …
▪ Still in its infancy, especially for keyword queries
› QA
› query patterns/templates
› direct displays
› query interpretation
• rank interpretations!
› context context context
<target id="4" text="James Dean">
  <qa><q id="4.1" type="FACTOID">When was James Dean born?</q></qa>
  <qa><q id="4.2" type="FACTOID">When did James Dean die?</q></qa>
  <qa><q id="4.3" type="FACTOID">How did he die?</q></qa>
  <qa><q id="4.4" type="LIST">What movies did he appear in?</q></qa>
  <qa><q id="4.5" type="FACTOID">Which was the first movie that he was in?</q></qa>
  <qa><q id="4.6" type="OTHER">Other</q></qa>
</target>
42. Intents?
▪ Intent = “need behind the query”, “objective”, “task”, etc.
› used for triggering, reranking, selecting, disambiguation, …
› detection + mapping
▪ Search-oriented intents
› navigational, informational, transactional
› domains/verticals
• autos, local, product, recipe, travel, …
› entity/type-centered intents (“attribute intents”), tied to
• attributes
• specific return type(s)
• actions
• facets/refiners
43. How to detect them?
▪ language models
▪ editorial
▪ rules/templates (sketched below)
▪ ML + neural LMs
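A minimal rules/templates baseline for the detection side. The patterns and intent labels are invented for illustration; a production system would layer ML and neural LMs on top of, or in place of, such rules:

```python
import re

# Illustrative query patterns -> intent labels; a real system learns or curates these.
INTENT_TEMPLATES = [
    (re.compile(r"\b(height|net worth|age|born)\b"), "attribute"),
    (re.compile(r"\b(near me|nearby|open now)\b"),   "local"),
    (re.compile(r"\b(buy|price|cheap|deal)\b"),      "transactional"),
    (re.compile(r"\b(how to|what is|why)\b"),        "informational"),
]

def detect_intents(query):
    # Return every matching intent label, or "unknown" if no template fires.
    q = query.lower()
    return [label for pattern, label in INTENT_TEMPLATES if pattern.search(q)] or ["unknown"]

# e.g. detect_intents("obama net worth") -> ["attribute"]
#      detect_intents("pizza near me")   -> ["local"]
```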
49. Semantic search introduces new information access tasks
▪ Users come to expect increasingly
advanced results
› related entity finding
› relationship explanation
• between two “adjacent” entities
• between more than two entities
• for any path between entities in the KG
› relationship ranking
› (contextual) type ranking
› disambiguation
52. Evaluation
▪ How to measure success (and train models)?
› clicks
› A/B testing, bucket testing
› interleaving (sketched below)
› dwell time
› eye/mouse tracking
› editorial assessments
• costly
• hard to generalize
• relevance is not objective (it is highly subjective and contextual)
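Of the signals above, interleaving is the most algorithmic. A minimal team-draft interleaving sketch, under the usual assumptions (clicks credit the system that contributed the clicked result):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Merge two rankings while remembering which system ("team") placed each result."""
    interleaved, team = [], {}
    na = nb = 0  # results contributed so far by A and by B
    while len(interleaved) < k:
        # The team that is behind picks next; ties are broken by a coin flip.
        pick_a = na < nb or (na == nb and random.random() < 0.5)
        pool = ranking_a if pick_a else ranking_b
        doc = next((d for d in pool if d not in team), None)
        if doc is None:  # that side is exhausted; try the other one
            pick_a = not pick_a
            pool = ranking_a if pick_a else ranking_b
            doc = next((d for d in pool if d not in team), None)
            if doc is None:
                break
        team[doc] = "A" if pick_a else "B"
        interleaved.append(doc)
        na += pick_a
        nb += not pick_a
    return interleaved, team

def credit(team, clicked):
    # The system that placed more of the clicked results wins this impression.
    a = sum(1 for d in clicked if team.get(d) == "A")
    b = sum(1 for d in clicked if team.get(d) == "B")
    return "A" if a > b else ("B" if b > a else "tie")
```

Aggregating wins over many impressions gives a preference between the two rankers with far fewer queries than a standard A/B test.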
53. Evaluating semantic search
▪ Semantic search aims to “answer” queries
› show relevant entities
› show the actual answer/fact/…
▪ How do you measure/observe/determine success?
› feedback
› human editors
• how to generalize?
› abandonment
› task/location/context/user/… specific notion of relevance
▪ Need adequate and reliable metrics
54. Moving towards mobile
▪ Limited screen real estate
▪ Costly to scroll/click/back/etc.
› people actually type longer queries on mobile devices!
▪ Rich context
› hyperlocal
• location (lat/lon, home/work/traveling/…)
• time of day
› device type
▪ Move towards “discussion-style” interfaces, i.e., interactive “IR”
› “QA”/Siri
› dialog systems
56. UI/UX
▪ Mobile search, centered around entities
59. Evaluation on mobile
▪ Even trickier…
▪ How do you measure/observe/determine success?
› clicks (“taps”) are not easily interpreted
› neither are swipes, pinches, etc.
▪ Current approaches include
› “field studies”
› observing users (in the lab and in the wild)
› ...
▪ Need adequate and reliable metrics
61. Current challenges
▪ Combining
› text
• queries
• documents
• entity descriptions/inlinks
› structure
• internal and external to documents
• explicit (RDF), implicit (links from KG to web pages), inferred (information extraction)
› context
• users, sessions, and tasks
• hyperlocal, personal, and social
• temporal/popularity (buzziness)
› in rich, complex user interactions
• evaluation?