Web-scale semantic search 
Edgar Meij, Yahoo Labs | October 31, 2014
Introduction 
2 
▪ Web search queries are increasingly entity-oriented 
› fact finding 
› transactional 
› exploratory 
› … 
▪ Users expect increasingly “smarter” results 
› geared towards a user’s personal profile and current context 
• device type/user agent, day/time, location, … 
› with sub-second response time 
› fresh/up to date, buzzy 
› more than “just 10 blue links”
Semantic search 
3 
▪ “Plain” IR works well for simple queries and unambiguous terms 
▪ But it gets harder when there is additional context, (implicit) intent, … 
› “barack obama height” 
› “best philips tv” > “sony tv” > “panasonic” 
› “pizza” 
› “pizza near me” 
▪ Semantic search deals with “meaning” 
› essential ingredient: linking text to unambiguous “concepts” (~entities) 
› Add: 
• query understanding 
• ranking 
• result presentation
Semantic search 
4 
▪ Query understanding 
› entities, relations, types, … 
› intent detection ~ answer type prediction 
• kg-oriented 
– entity/ies 
– property/ies 
– relations 
– summaries 
– ... 
• web pages 
• news 
• POIs 
• ...
Semantic search 
5 
▪ Ranking 
› (relevance) 
› freshness 
› authoritativeness 
› buzziness 
› distance 
› personalization 
› … 
▪ Result presentation 
› serendipity 
› targeted answers 
› …
Semantic search 
6 
▪ “Improve search accuracy by understanding searcher intent and the 
contextual meaning of queries/terms/documents/results/…” 
› “eiffel tower height” 
› “brad pitt married” 
› “chinese food” 
› “obama net worth” 
▪ Uncertainties in 
› the source(s) 
› the query 
› the user 
› the intent 
› …
Semantic search 
7 
▪ Combines IR/NLP/DB/SW 
▪ Tasks 
› information extraction 
› information reconciliation/tracking 
› entity linking 
› query understanding/intent classification 
› retrieving/ranking entities/attributes/relations 
› interleaving/federated search: dedupe, merge, and rank (at runtime) 
› UI/UX 
› personalization 
› …
8 
Knowledge Graphs
Knowledge graphs 
9 
▪ The “backbone” of semantic search 
▪ They define 
› entities 
› attributes 
› types 
› relations 
› (provenance, sometimes) 
› and more 
• external links, homepages, features, …
Entities at the core 
10
Entities are often not directly observed 
11 
▪ They appear in different forms, in different types of data 
▪ Unstructured 
› queries, documents, tweets, web pages, snippets, … 
▪ Semistructured 
› inline XML, RDFa, schema.org, … 
▪ Structured 
› Relational DBs, RDF, … 
Often organized around entities 
[Diagram: information need, retrieval system, data collection(s), result(s)]
KG vision @ Yahoo 
12 
▪ A unified knowledge graph for Yahoo 
› all entities and topics relevant to Yahoo (users) 
› rich information about entities: facts, relationships, features 
› identifiers, interlinking across data sources, and links to relevant services 
▪ To power knowledge-based services at Yahoo 
› search: display, and search for, information about entities 
› discovery: relate entities, interconnect data sources, link to relevant services 
› understanding: recognize entities in queries and text 
▪ Managed and served by a central knowledge team / platform
Steps 
13 
▪ Knowledge Acquisition 
› ongoing information extraction, from complementary sources 
▪ Knowledge Integration 
› reconciliation into a unified knowledge repository 
▪ Knowledge Consumption 
› enrichment and serving…
14 
Key Tasks 
[Diagram: tasks across the three stages (Knowledge Acquisition, Knowledge Integration, Knowledge Consumption): Data Acquisition, Information Extraction, Schema Mapping, Entity Reconciliation, Blending, Knowledge Repository (common ontology), Editorial Curation, Data Quality Monitoring, Enrichment, Serving Export]
15 
Data Acquisition 
Issues 
16 
▪ Multiple complementary data sources 
› combine and cross-validate data from authoritative sources 
› reference data sources such as Wikipedia and Freebase form our backbone 
› specialized data sources such as TMS and MusicBrainz add breadth/depth 
› optimize for relevance, comprehensiveness, correctness, freshness, consistency 
▪ Ongoing acquisition of raw data 
› feed acquisition from open data sources and paid providers 
› Web/targeted crawling, online fetching, ad hoc acquisition (e.g., Wikipedia monitoring) 
› deal w/ operational complexity: data size, bandwidth, update frequency, licensing, copyright
17 
Information Extraction 
Information Extraction 
18 
▪ Extraction of entities, attributes, relationships, features 
› deal w/ scale, volatility, heterogeneity, inconsistency, schema complexity, breakage 
› expensive to build and maintain (e.g., declarative rules, expert knowledge, ML…) 
› being able to measure and monitor data quality is key 
▪ Mixed approach 
› parsing of large data feeds and online data APIs 
› structured data extraction on the Web: markup, Web scraping, wrapper induction 
› Wikipedia mining, Web mining, news mining, open information extraction
20 
Entity Reconciliation & Blending 
▪ Disambiguate and merge entities across/within data sources 
› Blocking: select candidates most likely to refer to the same real-world entity (fast approx. similarity, hashing) 
› Scoring: compute a similarity score between all pairs of candidates (ML classifier or heuristics) 
› Clustering: decide which candidates refer to the same entity and interlink them (ML clustering or heuristics) 
› Merging: build a unified object for each cluster and populate it with the best properties (ML selection or heuristics) 
▪ Challenges 
› not trivial! 
› scale and adapt to new entity types, data sources, data sizes, update frequencies… 
› ongoing reconciliation/blending/evaluation. Need for consistent entity IDs. Provenance.
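The four steps above (blocking, scoring, clustering, merging) can be sketched on toy data as follows. This is a minimal illustration, not the pipeline used at Yahoo: the records, the 4-character blocking key, the 0.7/0.3 score weights, and the 0.8 threshold are all hypothetical choices, and simple heuristics stand in for the ML components named on the slide.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records from three sources; ids and fields are made up.
RECORDS = [
    {"id": "wiki:1", "name": "Brad Pitt", "born": "1963-12-18"},
    {"id": "fb:7", "name": "Brad  Pitt", "born": "1963-12-18", "occupation": "actor"},
    {"id": "tms:3", "name": "Bradley Cooper", "born": "1975-01-05"},
]

def normalize(name):
    """Lowercase and collapse whitespace."""
    return " ".join(name.lower().split())

def blocking(records):
    """Blocking: group records by a cheap key (first 4 letters of the name)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[normalize(record["name"]).replace(" ", "")[:4]].append(record)
    return blocks.values()

def similarity(a, b):
    """Scoring: heuristic mix of name similarity and birth-date agreement."""
    name_sim = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    same_birth = 1.0 if a.get("born") and a.get("born") == b.get("born") else 0.0
    return 0.7 * name_sim + 0.3 * same_birth

def cluster(records, threshold=0.8):
    """Clustering: union-find over pairs within a block that score above the threshold."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for block in blocking(records):
        for a, b in combinations(block, 2):
            if similarity(a, b) >= threshold:
                parent[find(a["id"])] = find(b["id"])
    clusters = defaultdict(list)
    for r in records:
        clusters[find(r["id"])].append(r)
    return list(clusters.values())

def merge(cluster_records):
    """Merging: one object per cluster; first value seen wins, source ids kept as provenance."""
    merged = {"sources": sorted(r["id"] for r in cluster_records)}
    for r in cluster_records:
        for key, value in r.items():
            if key != "id":
                merged.setdefault(key, value)
    return merged

def reconcile(records):
    return [merge(c) for c in cluster(records)]

entities = reconcile(RECORDS)
```

All three toy records share a block, but only the two Brad Pitt records score above the threshold, so they are merged and keep both source ids as provenance.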
Knowledge graphs 
21
Knowledge graphs… 
22 
▪ … are not perfect 
▪ Or: the importance of 
human editors
26 
Entity Linking
Knowledge graphs 
27
Entity linking 
28 
▪ Typical steps 
› mention detection – which part(s) should we link? 
› candidate ranking/selection – where should they be linked to? 
› disambiguation – maximize a “global” objective function 
▪ Entity linking for web search queries 
› pre-retrieval 
• needs to be fast, space-efficient, and accurate 
› queries are short and noisy 
• high level of ambiguity 
• limited context
Entity linking for web search 
29 
▪ Approach 
› probabilistic model 
• unsupervised 
• large-scale set of aliases from Wikipedia and from click logs 
› contextual relevance model based on neural language models 
› state-of-the-art hashing and bit encoding techniques
Entity linking for web search 
30 
▪ Idea: jointly model mention detection (segmentation) and entity selection 
› compute probabilistic score for each segment-entity pair 
› optimize the score of the whole query 
Main idea 
31 
▪ Use dynamic programming to solve 
… implemented efficiently using dynamic programming in $O(k^2)$, where $k$ is the number of query terms. 
2. MODELING ENTITY LINKING 
For our entity linking model we establish a connection between entities and their aliases (which are their textual representations, also known as surface forms) by leveraging anchor text or user queries leading to a click into the Web page that represents the entity. In the context of this paper we focus on using Wikipedia as KB and therefore only consider anchor text within Wikipedia and clicks from web search results on Wikipedia results, although it is general enough to accommodate other sources of information. The problem we address consists of automatically segmenting the query and simultaneously selecting the right entity for each segment. Our Fast Entity Linker (FEL) tackles this problem by computing a probabilistic score for each segment-entity pair and then optimizing the score of the whole query. Note that we do not employ any supervision and let the model and data operate in a parameterless fashion; it is however possible to add an additional layer that makes use of human-labeled training data in order to enhance the performance of the model. This is the approach followed in Alley-oop where the ranking model described in this paper is used to perform a first-phase ranking, followed by a second phase ranking using a supervised, machine-learned model. 
To describe our model we use the following random variables, assuming as an event space $S \times E$, where $S$ is the set of all sequences and $E$ the set of all entities known to the system: 
› $s$: a sequence of terms $s \in \mathbf{s}$ drawn from the set $S$, $s \sim \mathrm{Multinomial}(\theta_s)$ 
› $\mathbf{e}$: a set of entities $e \in \mathbf{e}$, where each $e$ is drawn from the set $E$, $e \sim \mathrm{Multinomial}(\theta_e)$ 
› $a_s \sim \mathrm{Bernoulli}(\theta_a^{s})$: indicates if $s$ is an alias 
› $a_{s,e} \sim \mathrm{Bernoulli}(\theta_a^{s,e})$: indicates if $s$ is an alias pointing (linking/clicked) to $e$ 
› $c$: indicates which collection acts as a source of information, query logs or Wikipedia ($c_q$ or $c_w$) 
› $n(s, c)$: count of $s$ in $c$ 
› $n(e, c)$: count of $e$ in $c$ 
Let $q$ be the input query, which we represent with the set $S_q$ of all possible segmentations of its tokens $t_1 \cdots t_k$. The algorithm will return the set of entities $\mathbf{e}$, along with their scores, that maximizes: 
$$\operatorname*{argmax}_{\mathbf{e} \subseteq E} \log P(\mathbf{e} \mid q) = \operatorname*{argmax}_{\mathbf{e} \subseteq E,\ \mathbf{s} \in S_q} \sum_{e \in \mathbf{e}} \log P(e \mid s) \qquad (1)$$
$$\text{s.t.} \quad s \in \mathbf{s}, \quad \bigcup s \subseteq \mathbf{s}, \quad \bigcap s = \emptyset. \qquad (2)$$
In Eq. 1 we assume independence of the entities given a query segment, and in Eq. 2 we impose that the segmentations are disjoint. Each individual entity/segment probability is then estimated as: 
$$P(e \mid s) = \sum_{c \in \{c_q, c_w\}} P(c \mid s)\, P(e \mid c, s)$$
$$= \sum_{c \in \{c_q, c_w\}} P(c \mid s) \sum_{a_s \in \{0, 1\}} P(a_s \mid c, s)\, P(e \mid a_s, c, s)$$
$$= \sum_{c \in \{c_q, c_w\}} P(c \mid s) \Big( P(a_s = 0 \mid c, s)\, P(e \mid a_s = 0, c, s) + P(a_s = 1 \mid c, s)\, P(e \mid a_s = 1, c, s) \Big).$$
The maximum likelihood probabilities are (note that in this case $P(e \mid a_s = 0, c, s) = 0$ and therefore the $a_s = 0$ term of the summation vanishes): 
$$P(c \mid s) = \frac{n(s, c)}{\sum_{c'} n(s, c')} \qquad (3)$$
$$P(a_s = 1 \mid c, s) = \frac{\sum_{s : a_s = 1} n(s, c)}{n(s, c)} \qquad (4)$$
$$P(e \mid a_s = 1, c, s) = \frac{\sum_{s : a_{s,e} = 1} n(s, c)}{\sum_{s : a_s = 1} n(s, c)} \qquad (5)$$
Those maximum likelihood probabilities can be smoothed appropriately using an entity prior. Using Dirichlet smoothing, … 
… segmentation that optimizes the … entity: 
$$\operatorname*{argmax}_{\mathbf{e} \subseteq E,\ \mathbf{s} \in S_q}\ \max_{e \in \mathbf{e},\ s \in \mathbf{s}}\ \ldots$$
Both Eq. 1 and Eq. 10 are instances … segmentation problem, defined as … terms $t = t_1 \cdots t_k$, denote any segment … $[t_i t_{i+1} \ldots t_{i+j-1}]$, $\forall i, j \geq 0$. Let $\phi$ … that maps segments to real numbers, … score of a segmentation is defined … 
$$m(t_1, t_2, \ldots, t_k) = \max\Big( \oplus\big(m(t_1), m(t_2, \ldots, t_k)\big),\ \ldots,\ \oplus\big(\phi([t_1 \ldots t_{k-1}]), m(\ldots)\big) \Big)$$
where $m(t_1) = \phi([t_1])$ and $\oplus(a, b)$ … function, such as $\oplus(a, b) = a + \ldots$ … $\oplus(a, b) = \max(a, b)$ in the case … function $s(\cdot)$ only depends on the … the others, the segmentation with … computed in $O(k^2)$ time using dynamic … We instantiate the above problem … $\phi(s) = \mathrm{highestscore}(s, q) = \ldots$
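As a rough illustration of how the pieces fit together, the sketch below estimates the unsmoothed $P(e \mid s)$ from alias counts (Eqs. 3 to 5) and then picks the best segmentation with an $O(k^2)$ dynamic program that sums log-probabilities, as in Eq. 1. It is a toy under stated assumptions, not the FEL implementation: the `COUNTS` table and entity names are invented, smoothing is omitted, and the `NIL_LOGP` score for a token left unlinked is a hypothetical device so that every query has a valid segmentation.

```python
import math
from collections import defaultdict

# Hypothetical alias statistics per collection c ("cq" = query logs, "cw" = Wikipedia):
# COUNTS[c][s] = (n(s, c), count of s used as an alias, {entity: count of s linked to it}).
COUNTS = {
    "cq": {
        "new york": (100, 80, {"New_York_City": 60, "New_York_(state)": 20}),
        "pizza": (50, 40, {"Pizza": 40}),
        "york": (10, 5, {"York": 5}),
    },
    "cw": {
        "new york": (200, 150, {"New_York_City": 100, "New_York_(state)": 50}),
        "new": (500, 5, {"New_(album)": 5}),
        "pizza": (30, 30, {"Pizza": 30}),
    },
}

NIL_LOGP = math.log(1e-4)  # assumed score for a single token that stays unlinked

def p_entity_given_segment(segment):
    """Unsmoothed P(e|s) = sum_c P(c|s) * P(a_s=1|c,s) * P(e|a_s=1,c,s)."""
    total = sum(COUNTS[c][segment][0] for c in COUNTS if segment in COUNTS[c])
    probs = defaultdict(float)
    for table in COUNTS.values():
        if segment not in table:
            continue
        n_s, n_alias, linked = table[segment]
        if n_alias == 0:
            continue
        p_c = n_s / total            # Eq. 3
        p_alias = n_alias / n_s      # Eq. 4
        for entity, n_linked in linked.items():
            probs[entity] += p_c * p_alias * (n_linked / n_alias)  # Eq. 5
    return probs

def segment_score(tokens):
    """Score of one segment: log-probability of its best entity, and that entity."""
    probs = p_entity_given_segment(" ".join(tokens))
    if probs:
        entity = max(probs, key=probs.get)
        return math.log(probs[entity]), entity
    if len(tokens) == 1:
        return NIL_LOGP, None
    return -math.inf, None           # multi-token segments must be known aliases

def link_query(query):
    """Best segmentation by dynamic programming over split points (O(k^2) segments)."""
    tokens = query.split()
    k = len(tokens)
    best = [0.0] + [-math.inf] * k   # best[i]: best score for the first i tokens
    back = [None] * (k + 1)          # back[i]: (start of last segment, its entity)
    for end in range(1, k + 1):
        for start in range(end):
            score, entity = segment_score(tokens[start:end])
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = (start, entity)
    links, end = [], k
    while end > 0:
        start, entity = back[end]
        links.append((" ".join(tokens[start:end]), entity))
        end = start
    return list(reversed(links)), best[k]

links, score = link_query("new york pizza")
```

On this toy data the two-segment reading "new york" + "pizza" wins over three single-token segments, because the alias "new" alone is rarely a link.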
Contextual relevance model 
33 
▪ Note: P(e|s) is independent of other s’s in each query 
› fast, but might be suboptimal 
› e.g., “hollywood lyrics” 
▪ Solution: contextual relevance model 
› add query “context” t ∈ qs (i.e., the query remainder) into the model: P(e|s, q) 
• boils down to calculating Πt P(t|e) and merging this back into the model 
› naive implementation: LM or NB-based 
› more advanced: use word embeddings from neural language models
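A minimal sketch of the naive (language-model style) variant, assuming the context is merged in as $\log P(e \mid s) + \sum_t \log P(t \mid e)$. The prior values, entity names, and per-entity word lists are hypothetical toy data, and additive smoothing is an assumption; the embedding-based variant mentioned on the slide is not shown.

```python
import math
from collections import Counter

# Hypothetical prior P(e|s) for the segment "hollywood" and a bag of words
# per entity (for example from its description); all values are made up.
PRIOR = {"Hollywood_(district)": 0.9, "Hollywood_(song)": 0.1}
ENTITY_TEXT = {
    "Hollywood_(district)": "neighborhood los angeles california film studios movie industry".split(),
    "Hollywood_(song)": "song single album lyrics chorus music video lyrics".split(),
}
VOCAB = {word for words in ENTITY_TEXT.values() for word in words}

def p_term_given_entity(term, entity, alpha=0.1):
    """Additively smoothed unigram estimate of P(t|e); one extra slot for unseen terms."""
    counts = Counter(ENTITY_TEXT[entity])
    return (counts[term] + alpha) / (len(ENTITY_TEXT[entity]) + alpha * (len(VOCAB) + 1))

def rank_with_context(prior, context_terms):
    """Rank entities by log P(e|s) + sum over context terms of log P(t|e)."""
    scores = {
        entity: math.log(p) + sum(math.log(p_term_given_entity(t, entity)) for t in context_terms)
        for entity, p in prior.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

without_context = rank_with_context(PRIOR, [])
with_context = rank_with_context(PRIOR, ["lyrics"])
```

Without context the more common sense ranks first; adding the query remainder "lyrics" moves the song to the top.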
Evaluation 
34 
▪ Webscope query-to-entities dataset (publicly available) 
› http://webscope.sandbox.yahoo.com/ 
› 2.6k queries with 6k editorially linked Wikipedia entities 
▪ 4.6m candidate Wikipedia entities 
▪ Entities and aliases 
› 13m aliases from Wikipedia hyperlinks (after filtering) 
› 100m aliases from click logs 
▪ Baselines 
› commonness (most likely sense of an alias) 
› IR-based (LM) 
› Wikifier 
› Bing
Results 
35 
[Paper excerpt, cropped: … word2vec … dimensionality D = 200 …] 
Table 4: Entity linking efficacy. 
Method        P@1     MRR     MAP     R-Prec 
LM            0.0394  0.1386  0.1053  0.0365 
LM-Click      0.4882  0.5799  0.4264  0.3835 
Bing          0.6349  0.7018  0.5388  0.5223 
Wikifier      0.2983  0.3201  0.2030  0.2086 
Commonness    0.7336  0.7798  0.6418  0.6464 
FEL           0.7669  0.8092  0.6528  0.6575 
FEL+Centroid  0.8035  0.8366  0.6728  0.6765 
FEL+LR        0.8352  0.8684  0.6912  0.6883 
… based on cmns and ignore the smaller, constituent n-grams. Otherwise we recurse and try to match the (n-1)-grams. Information retrieval-based approaches (denoted LM) make use of a Wikipedia index and can rank the pages using their …
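The four metrics in the table can be computed per query and then averaged. The sketch below uses the standard textbook definitions; the exact evaluation script behind the table is not shown in the slides, so details such as tie handling are assumptions, and the toy ranking is hypothetical.

```python
def precision_at_1(ranking, relevant):
    """1 if the top-ranked entity is relevant, else 0."""
    return 1.0 if ranking and ranking[0] in relevant else 0.0

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant entity, or 0 if none is returned."""
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant entity appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def r_precision(ranking, relevant):
    """Precision within the top R results, where R is the number of relevant entities."""
    r = len(relevant)
    return sum(1 for item in ranking[:r] if item in relevant) / r if r else 0.0

def mean_over_queries(metric, runs):
    """runs: list of (ranking, relevant set) pairs, one per query."""
    return sum(metric(ranking, relevant) for ranking, relevant in runs) / len(runs)

# Hypothetical single-query run: entities "b" and "c" are the editorial links.
runs = [(["a", "b", "c"], {"b", "c"})]
p1 = mean_over_queries(precision_at_1, runs)
mrr = mean_over_queries(reciprocal_rank, runs)
mean_ap = mean_over_queries(average_precision, runs)
rprec = mean_over_queries(r_precision, runs)
```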
Optimizations 
36 
▪ Early stopping 
▪ Compressing word embedding vectors 
› Golomb coding + Elias-Fano monotone sequence data structure 
• allowing retrieval in constant time 
› compression: 
• word vectors: 3.44 bits per entry 
• centroid vectors: 3.42 bits per entry 
• LR vectors: 3.83 bits per entry 
• overall: ~10 times smaller than 32-bit floating point 
▪ Compressing counts 
▪ Sub-millisecond response time
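To give a feel for the kind of coding involved, here is a sketch of Rice coding, the special case of Golomb coding whose parameter is a power of two, applied to already-quantized vector entries. It is not the encoder used in the talk: the zigzag mapping for signed values, the bit-string representation, and the sample values are assumptions, and the Elias-Fano structure that would give constant-time access to each code is not shown.

```python
def zigzag(value):
    """Map signed integers to unsigned ones: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * value if value >= 0 else -2 * value - 1

def unzigzag(code):
    """Inverse of zigzag."""
    return code >> 1 if code % 2 == 0 else -((code + 1) >> 1)

def rice_encode(values, k):
    """Encode each value as a unary quotient followed by a k-bit remainder."""
    bits = []
    for value in values:
        code = zigzag(value)
        quotient, remainder = code >> k, code & ((1 << k) - 1)
        bits.append("1" * quotient + "0")          # unary part, terminated by 0
        if k:
            bits.append(format(remainder, f"0{k}b"))
    return "".join(bits)

def rice_decode(bits, k, count):
    """Decode `count` values from a bit string produced by rice_encode."""
    values, pos = [], 0
    for _ in range(count):
        quotient = 0
        while bits[pos] == "1":
            quotient += 1
            pos += 1
        pos += 1                                    # skip the terminating 0
        remainder = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append(unzigzag((quotient << k) | remainder))
    return values

# Hypothetical quantized entries of one embedding vector.
quantized = [2, -1, 0, 6, -5]
encoded = rice_encode(quantized, k=1)
decoded = rice_decode(encoded, k=1, count=len(quantized))
```

Small magnitudes get short codes, which is why this family of codes suits entries that cluster around zero after quantization.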
37 
Query Intents
Query understanding 
38 
▪ Entails mapping queries to entities, relations, types, attributes, … 
▪ Still in its infancy, especially for keyword queries 
› QA 
› query patterns/templates 
› direct displays 
› query interpretation 
• rank interpretations! 
› context context context 
<target id="4" text="James Dean"> 
<qa> 
<q id="4.1" type="FACTOID">When was James Dean born?</q> 
</qa> 
<qa> 
<q id="4.2" type="FACTOID">When did James Dean die?</q> 
</qa> 
<qa> 
<q id="4.3" type="FACTOID">How did he die?</q> 
</qa> 
<qa> 
<q id="4.4" type="LIST">What movies did he appear in?</q> 
</qa> 
<qa> 
<q id="4.5" type="FACTOID">Which was the first movie that he was in?</q> 
</qa> 
<qa> 
<q id="4.6" type="OTHER">Other</q> 
</qa> 
</target>
Intents? 
42 
▪ Intent = “need behind the query”, “objective”, “task”, etc. 
› used for triggering, reranking, selecting, disambiguation, … 
› detection + mapping 
▪ Search-oriented intents 
› navigational, informational, transactional 
› domains/verticals 
• autos, local, product, recipe, travel, … 
› entity/type-centered intents (“attribute intents”), tied to 
• attributes 
• specific return type/s 
• actions 
• facets/refiners
How to detect them? 
43 
▪ language models 
▪ editorial 
▪ rules/templates 
▪ ML + neural LMs
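Of the options above, rules/templates are the easiest to illustrate. The sketch below matches a query against a few hand-written patterns and returns the intent plus the captured entity. The patterns, intent labels, and attribute list are hypothetical examples built from queries quoted earlier in the deck; a production system would combine such rules with the learned approaches listed above.

```python
import re

# Hypothetical templates: (compiled pattern, intent label). Order matters.
TEMPLATES = [
    (re.compile(r"(?P<entity>.+) (?P<attribute>height|net worth|age)"), "attribute"),
    (re.compile(r"(?P<entity>.+) near me"), "local"),
    (re.compile(r"(?:buy|cheap|best) (?P<entity>.+)"), "product"),
]

def detect_intent(query):
    """Return the first matching template's intent and its captured slots."""
    normalized = " ".join(query.lower().split())
    for pattern, intent in TEMPLATES:
        match = pattern.fullmatch(normalized)
        if match:
            return {"intent": intent, **match.groupdict()}
    return {"intent": "unknown", "entity": normalized}

examples = [detect_intent(q) for q in
            ["Eiffel Tower height", "pizza near me", "best philips tv", "pizza"]]
```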
44 
UI/UX
Knowledge graphs 
45
Knowledge graphs 
46
Knowledge graphs 
47
Knowledge graphs 
48
Semantic search introduces new information access tasks 
49 
▪ Users come to expect increasingly 
advanced results 
› related entity finding 
› relationship explanation 
• between two “adjacent” entities 
• between more than two entities 
• for any path between entities in the KG 
› relationship ranking 
› (contextual) type ranking 
› disambiguation
Evaluation 
52 
▪ How to measure success (and train models)? 
› clicks 
› A/B testing, bucket testing 
› interleaving 
› dwell time 
› eye/mouse tracking 
› editorial assessments 
• costly 
• hard to generalize 
• relevance is not objective (but very subjective/contextual)
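Interleaving, listed above, compares two rankers by mixing their results into one list and crediting clicks to the ranker that contributed each result. The slide does not say which variant is meant; the sketch below shows team-draft interleaving as one common choice, with hypothetical document ids and a seeded random generator for repeatability.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; the team with fewer picks chooses next, ties by coin flip."""
    rng = rng or random.Random(0)
    interleaved, team = [], {}
    picks_a = picks_b = 0
    while True:
        rest_a = [doc for doc in ranking_a if doc not in team]
        rest_b = [doc for doc in ranking_b if doc not in team]
        if not rest_a and not rest_b:
            break
        a_turn = bool(rest_a) and (
            not rest_b
            or picks_a < picks_b
            or (picks_a == picks_b and rng.random() < 0.5)
        )
        if a_turn:
            doc, picks_a = rest_a[0], picks_a + 1
            team[doc] = "A"
        else:
            doc, picks_b = rest_b[0], picks_b + 1
            team[doc] = "B"
        interleaved.append(doc)
    return interleaved, team

def credit(team, clicked):
    """Count clicks per team; the ranker with more credited clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team:
            wins[team[doc]] += 1
    return wins

ranking_a = ["d1", "d2", "d3", "d4"]
ranking_b = ["d3", "d1", "d4", "d2"]
mixed, team = team_draft_interleave(ranking_a, ranking_b)
wins = credit(team, clicked=[mixed[0]])
```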
Evaluating semantic search 
53 
▪ Semantic search aims to “answer” queries 
› show relevant entities 
› show the actual answer/fact/… 
▪ How do you measure/observe/determine success? 
› feedback 
› human editors 
• how to generalize? 
› abandonment 
› task/location/context/user/… specific notion of relevance 
▪ Need adequate and reliable metrics
Moving towards mobile 
54 
▪ Limited screen real estate 
▪ Costly to scroll/click/back/etc. 
› people actually type longer queries on mobile devices! 
▪ Rich context 
› hyperlocal 
• location (lat/lon, home/work/traveling/…) 
• time of day 
› device type 
▪ Move towards “discussion-style” interfaces, i.e., interactive “IR” 
› “QA”/Siri 
› dialog systems
55
UI/UX 
56 
▪ Mobile search, centered around entities
Evaluation on mobile 
59 
▪ Even more tricky… 
▪ How do you measure/observe/determine success? 
› clicks (“taps”) are not easily interpreted 
› neither are swipes, pinches, etc. 
▪ Current approaches include 
› “field studies” 
› observing users (in the lab and in the wild) 
› ... 
▪ Need adequate and reliable metrics
60 
Challenges
Current challenges 
61 
▪ Combining 
› text 
• queries 
• documents 
• entity descriptions/inlinks 
› structure 
• internal and external to documents 
• explicit (RDF), implicit (links from KG to web pages), inferred (information extraction) 
› context 
• users, sessions, and tasks 
• hyperlocal, personal, and social 
• temporal/popularity (buzziness) 
› in rich, complex user interactions 
• evaluation?
Thanks! 
62 
▪ More info? 
› @edgarmeij 
› emeij@yahoo-inc.com 
› http://edgar.meij.pro

AI, Search, and the Disruption of Knowledge Management
 
Test Trend Analysis : Towards robust, reliable and timely tests
Test Trend Analysis : Towards robust, reliable and timely testsTest Trend Analysis : Towards robust, reliable and timely tests
Test Trend Analysis : Towards robust, reliable and timely tests
 
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and SemanticsQuality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
 

Dernier

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Web-scale semantic search

  • 1. Web-scale semantic search Edgar Meij, Yahoo Labs | October 31, 2014
  • 2. Introduction ▪ Web search queries are increasingly entity-oriented › fact finding › transactional › exploratory › … ▪ Users expect increasingly “smarter” results › geared towards a user’s personal profile and current context • device type/user agent, day/time, location, … › with sub-second response time › fresh/up to date, buzzy › more than “just 10 blue links”
  • 3. Semantic search ▪ “Plain” IR works well for simple queries and unambiguous terms ▪ But it gets harder when there is additional context, (implicit) intent, … › “barack obama height” › “best philips tv” > “sony tv” > “panasonic” › “pizza” › “pizza near me” ▪ Semantic search deals with “meaning” › essential ingredient: linking text to unambiguous “concepts” (~entities) › Add: • query understanding • ranking • result presentation
  • 4. Semantic search ▪ Query understanding › entities, relations, types, … › intent detection ~ answer type prediction • kg-oriented – entity/ies – property/ies – relations – summaries – ... • web pages • news • POIs • ...
  • 5. Semantic search ▪ Ranking › (relevance) › freshness › authoritativeness › buzziness › distance › personalization › … ▪ Result presentation › serendipity › targeted answers › …
  • 6. Semantic search ▪ “Improve search accuracy by understanding searcher intent and the contextual meaning of queries/terms/documents/results/…” › “eiffel tower height” › “brad pitt married” › “chinese food” › “obama net worth” ▪ Uncertainties in › the source(s) › the query › the user › the intent › …
  • 7. Semantic search ▪ Combines IR/NLP/DB/SW ▪ Tasks › information extraction › information reconciliation/tracking › entity linking › query understanding/intent classification › retrieving/ranking entities/attributes/relations › interleaving/federated search: dedupe, merge, and rank (at runtime) › UI/UX › personalization › …
  • 9. Knowledge graphs ▪ The “backbone” of semantic search ▪ They define › entities › attributes › types › relations › (provenance, sometimes) › and more • external links, homepages, features, …
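As a rough illustration of the ingredients listed on the knowledge graph slide (entities, attributes, types, relations, optional provenance, external links), a graph can be sketched as a few small Python classes. All class and field names below are hypothetical and chosen for this sketch; they are not Yahoo's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    """A node in the graph; field names are illustrative assumptions."""
    entity_id: str
    types: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)
    external_links: list = field(default_factory=list)


@dataclass(frozen=True)
class Relation:
    """A directed, labeled edge that may carry provenance."""
    subject: str
    predicate: str
    obj: str
    provenance: Optional[str] = None


class KnowledgeGraph:
    def __init__(self):
        self.entities = {}
        self.relations = []

    def add_entity(self, entity):
        self.entities[entity.entity_id] = entity

    def add_relation(self, relation):
        # Only allow edges between entities that are already known.
        if relation.subject not in self.entities or relation.obj not in self.entities:
            raise KeyError("both endpoints must be added before the relation")
        self.relations.append(relation)

    def neighbors(self, entity_id):
        """Return (predicate, object id) pairs for outgoing edges."""
        return [(r.predicate, r.obj) for r in self.relations if r.subject == entity_id]


# Toy data for illustration only.
kg = KnowledgeGraph()
kg.add_entity(Entity("brad_pitt", {"Person", "Actor"}, {"name": "Brad Pitt"}))
kg.add_entity(Entity("fight_club", {"Film"}, {"name": "Fight Club"}))
kg.add_relation(Relation("brad_pitt", "acted_in", "fight_club", provenance="toy example"))
```

Keeping provenance on the edge rather than on the entity is one possible design; it lets conflicting facts from different sources coexist until they are reconciled.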
  • 10. Entities at the core
  • 11. Entities are often not directly observed ▪ They appear in different forms, in different types of data ▪ Unstructured › queries, documents, tweets, web pages, snippets, … ▪ Semistructured › inline XML, RDFa, schema.org, … ▪ Structured › Relational DBs, RDF, … ▪ Often organized around entities [Figure labels: Data collection(s), Retrieval system, Information need, Result(s)]
  • 12. KG vision @ Yahoo ▪ A unified knowledge graph for Yahoo › all entities and topics relevant to Yahoo (users) › rich information about entities: facts, relationships, features › identifiers, interlinking across data sources, and links to relevant services ▪ To power knowledge-based services at Yahoo › search: display, and search for, information about entities › discovery: relate entities, interconnect data sources, link to relevant services › understanding: recognize entities in queries and text ▪ Managed and served by a central knowledge team / platform
  • 13. Steps ▪ Knowledge Acquisition: ongoing information extraction, from complementary sources ▪ Knowledge Integration: reconciliation into a unified knowledge repository ▪ Knowledge Consumption: enrichment and serving…
  • 14. Key Tasks [Diagram of tasks across Knowledge Acquisition, Knowledge Integration, and Knowledge Consumption: Data Acquisition, Information Extraction, Schema Mapping, Entity Reconciliation, Blending, Knowledge Repository (common ontology), Data Quality Monitoring, Editorial Curation, Enrichment, Serving, Export]
  • 15. Data Acquisition [Key Tasks diagram repeated]
  • 16. Issues ▪ Multiple complementary data sources › combine and cross-validate data from authoritative sources › reference data sources such as Wikipedia and Freebase form our backbone › specialized data sources such as TMS and MusicBrainz add breadth/depth › optimize for relevance, comprehensiveness, correctness, freshness, consistency ▪ Ongoing acquisition of raw data › feed acquisition from open data sources and paid providers › web/targeted crawling, online fetching, ad hoc acquisition (e.g. Wikipedia monitoring) › deal with operational complexity: data size, bandwidth, update frequency, licensing, copyright
  • 17. Information Extraction [Key Tasks diagram repeated]
  • 18. Information Extraction ▪ Extraction of entities, attributes, relationships, features › deal with scale, volatility, heterogeneity, inconsistency, schema complexity, breakage › expensive to build and maintain (e.g. declarative rules, expert knowledge, ML, …) › being able to measure and monitor data quality is key ▪ Mixed approach › parsing of large data feeds and online data APIs › structured data extraction on the Web: markup, Web scraping, wrapper induction › Wikipedia mining, Web mining, news mining, open information extraction
  • 19. Key Tasks [diagram repeated]
  • 20. Entity Reconciliation & Blending ▪ Disambiguate and merge entities across/within data sources › Blocking: select candidates most likely to refer to the same real-world entity (fast approximate similarity, hashing) › Scoring: compute a similarity score between all pairs of candidates (ML classifier or heuristics) › Clustering: decide which candidates refer to the same entity and interlink them (ML clustering or heuristics) › Merging: build a unified object for each cluster and populate it with the best properties (ML selection or heuristics) ▪ Challenges › not trivial! › scale and adapt to new entity types, data sources, data sizes, update frequencies… › ongoing reconciliation/blending/evaluation; need for consistent entity IDs; provenance
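The four reconciliation stages on this slide (blocking, scoring, clustering, merging) can be sketched end to end on toy records. This is a minimal sketch under stated assumptions: the blocking key (first three letters of the name), the Jaccard score standing in for an ML classifier, the union-find clustering, and the "first non-empty value wins" merge rule are all choices made for this example, not the system described in the talk.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record):
    """Blocking: a cheap key so only records sharing it get compared."""
    return record["name"].lower().replace(" ", "")[:3]


def jaccard(a, b):
    """Scoring: token-set Jaccard similarity (stand-in for an ML classifier)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def reconcile(records, threshold=0.5):
    # 1. Blocking: group record indices by blocking key.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    # 2. Scoring and 3. Clustering: union-find over pairs above the threshold.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for members in blocks.values():
        for i, j in combinations(members, 2):
            if jaccard(records[i]["name"], records[j]["name"]) >= threshold:
                parent[find(j)] = find(i)

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(records[idx])

    # 4. Merging: keep the first non-empty value per property, track sources.
    merged = []
    for members in clusters.values():
        unified = {"sources": [m["source"] for m in members]}
        for m in members:
            for key, value in m.items():
                if key != "source" and value and key not in unified:
                    unified[key] = value
        merged.append(unified)
    return merged


# Toy records; the source labels are illustrative only.
records = [
    {"source": "wikipedia", "name": "Brad Pitt", "born": "1963"},
    {"source": "musicbrainz", "name": "brad pitt", "born": ""},
    {"source": "wikipedia", "name": "Bradford", "born": ""},
]
entities = reconcile(records)
```

All three records land in the same block, but only the two "Brad Pitt" records score above the threshold, so the result has two unified objects. Recording the contributing sources in the merged object is one simple way to keep provenance.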
  • 22. Knowledge graphs… ▪ … are not perfect ▪ Or: the importance of human editors
  • 28. Entity linking ▪ Typical steps › mention detection – which part(s) should we link? › candidate ranking/selection – where should they be linked to? › disambiguation – maximize a “global” objective function ▪ Entity linking for web search queries › pre-retrieval • needs to be fast, space-efficient, and accurate › queries are short and noisy • high level of ambiguity • limited context
  • 29. Entity linking for web search ▪ Approach › probabilistic model • unsupervised • large-scale set of aliases from Wikipedia and from click logs › contextual relevance model based on neural language models › state-of-the-art hashing and bit encoding techniques
  • 30. Entity linking for web search ▪ Idea: jointly model mention detection (segmentation) and entity selection › compute a probabilistic score for each segment-entity pair › optimize the score of the whole query
    Paper excerpts shown on the slide (partly cropped; […] marks text that is cut off):
    To describe our model we use the following random variables, assuming as an event space $S \times E$, where $S$ is the set of all sequences and $E$ the set of all entities known to the system:
    › $s$: a sequence of terms $s \in \mathbf{s}$ drawn from the set $S$, $s \sim \mathrm{Multinomial}(\theta_s)$
    › $\mathbf{e}$: a set of entities $e \in \mathbf{e}$, where each $e$ is drawn from the set $E$, $e \sim \mathrm{Multinomial}(\theta_e)$
    › $a_s \sim \mathrm{Bernoulli}(\theta_a^{s})$: indicates if $s$ is an alias
    › $a_{s,e} \sim \mathrm{Bernoulli}(\theta_a^{s,e})$: indicates if $s$ is an alias pointing (linking/clicked) to $e$
    › $c$: indicates which collection acts as a source of information, query logs or Wikipedia ($c_q$ or $c_w$)
    › $n(s, c)$: count of $s$ in $c$
    › $n(e, c)$: count of $e$ in $c$
    Let $q$ be the input query, which we represent with the set $S_q$ of all possible segmentations of its tokens $t_1 \cdots t_k$. The algorithm will return the set of entities $\mathbf{e}$, along with their scores, that maximizes:
    $$\operatorname*{argmax}_{\mathbf{e} \in E} \log P(\mathbf{e} \mid q) = \operatorname*{argmax}_{\mathbf{e} \in E,\ \mathbf{s} \in S_q} \sum_{e \in \mathbf{e}} \log P(e \mid s) \qquad (1)$$
    $$\text{s.t.}\quad s \in \mathbf{s},\quad \bigcup s \subseteq \mathbf{s},\quad \bigcap s = \emptyset. \qquad (2)$$
    In Eq. 1 we assume independence of the entities given a query segment, and in Eq. 2 we impose that the segmentations are disjoint.
    […] segmentation that optimizes the […] entity: $\operatorname*{argmax}_{\mathbf{e} \in E,\ \mathbf{s} \in S_q} \max_{e \in \mathbf{e},\ s \in \mathbf{s}}$ […]
    Both Eq. 1 and Eq. 10 are instances […] segmentation problem, defined as […] terms $t = t_1 \cdots t_k$, denote any segment […] $[t_i t_{i+1} \ldots t_{i+j-1}]$ $\forall i, j \ge 0$. Let $\phi$ […] that maps segments to real numbers, […] score of a segmentation is defined […]:
    $$m(t_1, t_2, \ldots, t_k) = \max\Bigl( \oplus\bigl(m(t_1), m(t_2, \ldots, t_k)\bigr),\ \ldots,\ \oplus\bigl(\phi([t_1 \ldots t_{k-1}]), m(t_k)\bigr),\ \text{[…]} \Bigr)$$
    where $m(t_1) = \phi([t_1])$ and $\oplus(a, b)$ […] function, such as $\oplus(a, b) = a + b$ […] $\oplus(a, b) = \max(a, b)$ in the case […]. […] function $\phi(\cdot)$ only depends on the […] the others, the segmentation with […] computed in $O(k^2)$ time using dynamic […]. We instantiate the above problem […] $\phi(s) = \mathrm{highestscore}(s, q) =$ […]
  • 31. Main idea ▪ Use dynamic programming to solve
    Paper excerpts shown on the slide (partly cropped; […] marks text that is cut off):
    […] implemented efficiently using dynamic programming in $O(k^2)$, where $k$ is the number of query terms.
    2. MODELING ENTITY LINKING
    For our entity linking model we establish a connection between entities and their aliases (which are their textual representations, also known as surface forms) by leveraging anchor text or user queries leading to a click into the Web page that represents the entity. In the context of this paper we focus on using Wikipedia as KB and therefore only consider anchor text within Wikipedia and clicks from web search results on Wikipedia results, although it is general enough to accommodate other sources of information. The problem we address consists of automatically segmenting the query and simultaneously selecting the right entity for each segment. Our Fast Entity Linker (FEL) tackles this problem by computing a probabilistic score for each segment-entity pair and then optimizing the score of the whole query. Note that we do not employ any supervision and let the model and data operate in a parameterless fashion; it is however possible to add an additional layer that makes use of human-labeled training data in order to enhance the performance of the model. This is the approach followed in Alley-oop, where the ranking model described in this paper is used to perform a first-phase ranking, followed by a second-phase ranking using a supervised, machine-learned model.
    Each individual entity/segment probability is then estimated as:
    $$P(e \mid s) = \sum_{c \in \{c_q, c_w\}} P(c \mid s)\, P(e \mid c, s)$$
    $$= \sum_{c \in \{c_q, c_w\}} P(c \mid s) \sum_{a_s \in \{0, 1\}} P(a_s \mid c, s)\, P(e \mid a_s, c, s)$$
    $$= \sum_{c \in \{c_q, c_w\}} P(c \mid s) \Bigl[ P(a_s = 0 \mid c, s)\, P(e \mid a_s = 0, c, s) + P(a_s = 1 \mid c, s)\, P(e \mid a_s = 1, c, s) \Bigr].$$
    The maximum likelihood probabilities are (note that in this case $P(e \mid a_s = 0, c, s) = 0$ and therefore the $a_s = 0$ term of the summation cancels out):
    $$P(c \mid s) = \frac{n(s, c)}{\sum_{c'} n(s, c')} \qquad (3)$$
    $$P(a_s = 1 \mid c, s) = \frac{\sum_{s : a_s = 1} n(s, c)}{n(s, c)} \qquad (4)$$
    $$P(e \mid a_s = 1, c, s) = \frac{\sum_{s : a_{s,e} = 1} n(s, c)}{\sum_{s : a_s = 1} n(s, c)} \qquad (5)$$
    Those maximum likelihood probabilities can be smoothed appropriately using an entity prior. Using Dirichlet smoothing, […]
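To make the model above concrete, here is a small sketch that estimates P(e|s) from counts and then picks the best segmentation with dynamic programming over O(k²) segments. It is not the FEL implementation: the counts are made up, smoothing is omitted, the aggregation is a plain sum of log-probabilities, and `UNLINKED_LOGP` (the score given to a token that is left unlinked) is an assumption of this sketch. With the maximum likelihood estimates of Eqs. 3 to 5 the factors telescope, so P(e|s) reduces to the link counts of s to e summed over collections, divided by the total count of s.

```python
import math

# Toy counts (made up): n(s, c), how often alias s occurs in collection c.
N_S = {
    "new york": {"wiki": 90, "queries": 110},
    "new york pizza": {"wiki": 5, "queries": 15},
    "pizza": {"wiki": 40, "queries": 60},
}
# Toy counts (made up): how often s links or clicks through to entity e in c.
N_LINK = {
    "new york": {"wiki": {"New_York_City": 70}, "queries": {"New_York_City": 90}},
    "new york pizza": {"wiki": {"New_York-style_pizza": 4},
                       "queries": {"New_York-style_pizza": 12}},
    "pizza": {"wiki": {"Pizza": 38}, "queries": {"Pizza": 52}},
}
UNLINKED_LOGP = math.log(1e-3)  # assumption: per-token score when nothing is linked


def p_entity_given_segment(segment):
    """P(e|s) = sum_c P(c|s) P(a_s=1|c,s) P(e|a_s=1,c,s), using ML estimates."""
    total = sum(N_S.get(segment, {}).values())
    probs = {}
    for links in N_LINK.get(segment, {}).values():
        for entity, n in links.items():
            probs[entity] = probs.get(entity, 0.0) + n / total
    return probs


def phi(segment):
    """Score of one segment: (best log-probability, entity or None)."""
    probs = p_entity_given_segment(segment)
    if not probs:
        return UNLINKED_LOGP * len(segment.split()), None
    entity = max(probs, key=probs.get)
    return math.log(probs[entity]), entity


def link_query(query):
    """Return (score, [(segment, entity), ...]) for the best segmentation."""
    tokens = query.split()
    k = len(tokens)
    # best[i] holds the best (score, links) for the suffix tokens[i:].
    best = [None] * k + [(0.0, [])]
    for i in range(k - 1, -1, -1):
        for j in range(i + 1, k + 1):
            segment = " ".join(tokens[i:j])
            score, entity = phi(segment)
            total = score + best[j][0]
            if best[i] is None or total > best[i][0]:
                links = ([(segment, entity)] if entity else []) + best[j][1]
                best[i] = (total, links)
    return best[0]


score, links = link_query("pizza new york")
```

On the toy counts, "pizza new york" is split into "pizza" and "new york", while "new york pizza" is linked as a single segment because that alias has a higher probability than the product of its parts.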
  • 33. Contextual relevance model ▪ Note: P(e|s) is independent of other s’s in each query › fast, but might be suboptimal › e.g., “hollywood lyrics” ▪ Solution: contextual relevance model › add query “context” t ∈ q \ s (i.e., the query remainder) into the model: P(e|s, q) • boils down to calculating ∏_t P(t|e) and merging this back into the model › naive implementation: LM or NB-based › more advanced: use word embeddings from neural language models
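The "naive implementation" bullet can be illustrated with a unigram language model per entity: each candidate is scored by log P(e|s) plus the sum of log P(t|e) over the remaining query terms. The priors, entity descriptions, smoothing constant, and the way the two scores are combined are all assumptions made for this sketch; the embedding-based variant mentioned on the slide is not shown.

```python
import math
from collections import Counter

# Toy prior P(e|s) for the segment "hollywood" (made-up numbers).
PRIOR = {"Hollywood": 0.9, "Hollywood_(song)": 0.1}
# Toy entity descriptions standing in for real entity text.
DESCRIPTIONS = {
    "Hollywood": "neighborhood in los angeles home of the film industry and movie studios",
    "Hollywood_(song)": "song by madonna single with lyrics about fame and music",
}
VOCAB = {w for text in DESCRIPTIONS.values() for w in text.split()}


def log_p_term_given_entity(term, entity, alpha=0.1):
    """Additively smoothed unigram estimate of log P(t|e)."""
    counts = Counter(DESCRIPTIONS[entity].split())
    total = sum(counts.values())
    # One extra vocabulary slot is reserved for unseen terms.
    return math.log((counts[term] + alpha) / (total + alpha * (len(VOCAB) + 1)))


def rerank(context_terms):
    """Rank candidates by log P(e|s) + sum over context terms of log P(t|e)."""
    scores = {
        e: math.log(p) + sum(log_p_term_given_entity(t, e) for t in context_terms)
        for e, p in PRIOR.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


without_context = rerank([])
with_context = rerank(["lyrics"])
```

Without context the more common sense wins; adding the query remainder "lyrics" moves the song ahead, which is the effect the “hollywood lyrics” example is meant to show.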
  • 34. Evaluation ▪ Webscope query-to-entities dataset (publicly available) › http://webscope.sandbox.yahoo.com/ › 2.6k queries with 6k editorially linked Wikipedia entities ▪ 4.6m candidate Wikipedia entities ▪ Entities and aliases › 13m aliases from Wikipedia hyperlinks (after filtering) › 100m aliases from click logs ▪ Baselines › commonness (most likely sense of an alias) › IR-based (LM) › Wikifier › Bing
  • 35. Results
    Paper excerpts shown on the slide (cropped; […] marks text that is cut off):
    […] word2vec […] dimensionality D = 200 […]

    Method         P@1     MRR     MAP     R-Prec
    LM             0.0394  0.1386  0.1053  0.0365
    LM-Click       0.4882  0.5799  0.4264  0.3835
    Bing           0.6349  0.7018  0.5388  0.5223
    Wikifier       0.2983  0.3201  0.2030  0.2086
    Commonness     0.7336  0.7798  0.6418  0.6464
    FEL            0.7669  0.8092  0.6528  0.6575
    FEL+Centroid   0.8035  0.8366  0.6728  0.6765
    FEL+LR         0.8352  0.8684  0.6912  0.6883
    Table 4: Entity linking efficacy.

    […] based on cmns and ignore the smaller, constituent n-grams. Otherwise we recurse and try to match the (n-1)-grams. Information retrieval-based approaches (denoted LM) make use of a Wikipedia index and can rank the pages using their […]
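For readers unfamiliar with the first two columns of the results table, the following sketch computes P@1 and MRR using their standard definitions. The exact evaluation protocol behind the table is not given on the slide, so the toy runs and the handling of queries without a hit are assumptions.

```python
def precision_at_1(ranked, relevant):
    """1.0 if the top-ranked entity is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant entity, or 0.0 if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs):
    """runs: list of (ranked entity list, set of gold entities), one per query."""
    n = len(runs)
    p1 = sum(precision_at_1(r, g) for r, g in runs) / n
    mrr = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return p1, mrr


# Toy runs: a hit at rank 1, a hit at rank 2, and a miss.
runs = [
    (["A", "B"], {"A"}),
    (["C", "D"], {"D"}),
    (["E"], {"F"}),
]
p1, mrr = evaluate(runs)
```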
  • 36. Optimizations ▪ Early stopping ▪ Compressing word embedding vectors › Golomb coding + Elias-Fano monotone sequence data structure • allowing retrieval in constant time › compression: • word vectors: 3.44 bits per entry • centroid vectors: 3.42 bits per entry • LR vectors: 3.83 bits per entry • overall: ~10 times smaller than 32-bit floating point ▪ Compressing counts ▪ Sub-millisecond response time
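The slide names Golomb coding for compressing embedding vectors. The sketch below shows only the basic idea, using Rice coding (Golomb coding with a power-of-two parameter) on quantized vector entries. The quantization step, the zigzag mapping for signed values, the parameter `k`, and the use of a bit string instead of packed bits are all assumptions of this sketch; the Elias-Fano index that gives constant-time access is not shown.

```python
def quantize(vector, step=0.05):
    """Map floats to small integers (the step size is an assumed value)."""
    return [round(x / step) for x in vector]


def zigzag(n):
    """Map signed ints to non-negative ones: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(u):
    return u // 2 if u % 2 == 0 else -(u + 1) // 2


def rice_encode(values, k):
    """Encode each value as a unary quotient followed by a k-bit remainder."""
    bits = []
    for v in values:
        u = zigzag(v)
        q, r = u >> k, u & ((1 << k) - 1)
        bits.append("1" * q + "0")
        if k:
            bits.append(format(r, "0{}b".format(k)))
    return "".join(bits)


def rice_decode(bits, k, count):
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1  # skip the 0 that terminates the unary part
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append(unzigzag((q << k) | r))
    return values


quantized = quantize([0.12, -0.07, 0.0, 0.31])
encoded = rice_encode(quantized, k=2)
decoded = rice_decode(encoded, k=2, count=len(quantized))
```

Small magnitudes get short codes, which is why this family of codes suits embedding entries that cluster around zero. The bit counts in this toy example say nothing about the figures reported on the slide.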
  • 38. Query understanding ▪ Entails mapping queries to entities, relations, types, attributes, … ▪ Still in its infancy, especially for keyword queries › QA › query patterns/templates › direct displays › query interpretation • rank interpretations! › context context context
    <target id="4" text="James Dean">
      <qa><q id="4.1" type="FACTOID">When was James Dean born?</q></qa>
      <qa><q id="4.2" type="FACTOID">When did James Dean die?</q></qa>
      <qa><q id="4.3" type="FACTOID">How did he die?</q></qa>
      <qa><q id="4.4" type="LIST">What movies did he appear in?</q></qa>
      <qa><q id="4.5" type="FACTOID">Which was the first movie that he was in?</q></qa>
      <qa><q id="4.6" type="OTHER">Other</q></qa>
    </target>
  • 42. Intents? ▪ Intent = “need behind the query”, “objective”, “task”, etc. › used for triggering, reranking, selecting, disambiguation, … › detection + mapping ▪ Search-oriented intents › navigational, informational, transactional › domains/verticals • autos, local, product, recipe, travel, … › entity/type-centered intents (“attribute intents”), tied to • attributes • specific return type/s • actions • facets/refiners
  • 43. How to detect them? ▪ language models ▪ editorial ▪ rules/templates ▪ ML + neural LMs
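Of the detection options listed, rules/templates are the easiest to illustrate. The sketch below matches a query against a handful of regular-expression templates and returns an intent, an attribute, and the entity string. The templates, intent labels, and fallback are hypothetical; the example queries are taken from earlier slides.

```python
import re

# Hypothetical templates: pattern -> (intent label, attribute or None).
TEMPLATES = [
    (re.compile(r"^(?P<entity>.+) (height|how tall)$"), ("attribute", "height")),
    (re.compile(r"^(?P<entity>.+) net worth$"), ("attribute", "net_worth")),
    (re.compile(r"^(?P<entity>.+) near me$"), ("local", None)),
    (re.compile(r"^best (?P<entity>.+)$"), ("product", None)),
]


def detect_intent(query):
    """Return the first matching template, or an informational fallback."""
    q = query.lower().strip()
    for pattern, (intent, attribute) in TEMPLATES:
        m = pattern.match(q)
        if m:
            return {"intent": intent, "attribute": attribute, "entity": m.group("entity")}
    return {"intent": "informational", "attribute": None, "entity": None}


result = detect_intent("eiffel tower height")
```

Templates are precise but brittle, which is presumably why the slide lists them next to language models and learned approaches rather than as a complete solution.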
  • 49. Semantic search introduces new information access tasks ▪ Users come to expect increasingly advanced results › related entity finding › relationship explanation • between two “adjacent” entities • between more than two entities • for any path between entities in the KG › relationship ranking › (contextual) type ranking › disambiguation
  • 52. Evaluation ▪ How to measure success (and train models)? › clicks › A/B testing, bucket testing › interleaving › dwell time › eye/mouse tracking › editorial assessments • costly • hard to generalize • relevance is not objective (but very subjective/contextual)
  • 53. Evaluating semantic search ▪ Semantic search aims to “answer” queries › show relevant entities › show the actual answer/fact/… ▪ How do you measure/observe/determine success? › feedback › human editors • how to generalize? › abandonment › task/location/context/user/… specific notion of relevance ▪ Need adequate and reliable metrics
  • 54. Moving towards mobile ▪ Limited screen real estate ▪ Costly to scroll/click/back/etc. › people actually type longer queries on mobile devices! ▪ Rich context › hyperlocal • location (lat/lon, home/work/traveling/…) • time of day › device type ▪ Move towards “discussion-style” interfaces, i.e., interactive “IR” › “QA”/Siri › dialog systems
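Assuming "interleaving" in the evaluation list refers to interleaved comparison of two rankers, a common variant is team-draft interleaving: the two rankers take turns contributing their best unused result, and clicks are credited to the ranker that contributed the clicked result. The sketch below is one way to write it; the function names and the toy rankings are made up for this example.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Return (interleaved list, {doc: 'A' or 'B'}) using team-draft picks."""
    interleaved, team = [], {}
    count = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    all_docs = set(ranking_a) | set(ranking_b)
    while len(interleaved) < len(all_docs):
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if count["A"] != count["B"]:
            side = "A" if count["A"] < count["B"] else "B"
        else:
            side = rng.choice("AB")
        remaining = [d for d in rankings[side] if d not in team]
        if not remaining:
            # This ranker has nothing left to offer, so the other one picks.
            side = "B" if side == "A" else "A"
            remaining = [d for d in rankings[side] if d not in team]
        doc = remaining[0]
        interleaved.append(doc)
        team[doc] = side
        count[side] += 1
    return interleaved, team


def credit(team, clicked):
    """Count clicks per contributing ranker."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team:
            wins[team[doc]] += 1
    return wins


rng = random.Random(0)
interleaved, team = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"], rng)
wins = credit(team, clicked=["d3"])
```

Because both rankers are shown to the same user in one result list, a preference can often be detected with far less traffic than a separate A/B bucket test would need.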
  • 56. UI/UX ▪ Mobile search, centered around entities
  • 59. Evaluation on mobile ▪ Even more tricky… ▪ How do you measure/observe/determine success? › clicks (“taps”) are not easily interpreted › neither are swipes, pinches, etc. ▪ Current approaches include › “field studies” › observing users (in the lab and in the wild) › ... ▪ Need adequate and reliable metrics
  • 61. Current challenges ▪ Combining › text • queries • documents • entity descriptions/inlinks › structure • internal and external to documents • explicit (RDF), implicit (links from KG to web pages), inferred (information extraction) › context • users, sessions, and tasks • hyperlocal, personal, and social • temporal/popularity (buzziness) › in rich, complex user interactions • evaluation?
  • 62. Thanks! ▪ More info? › @edgarmeij › emeij@yahoo-inc.com › http://edgar.meij.pro