SlideShare une entreprise Scribd logo
1  sur  172
Mining Web content for Enhanced Search 
Roi Blanco 
Yahoo! Research
Yahoo! Research Barcelona 
• Established January, 2006 
• Led by Ricardo Baeza-Yates 
• Research areas 
• Web Mining 
• Social Media 
• Distributed Web retrieval 
• Semantic Search 
• NLP/IR 
2
Natural Language Retrieval 
• How to exploit the structure and meaning of 
natural language text to improve search 
• Current search engines perform only limited NLP 
(tokenization, stemming) 
• Automated tools exist for deeper analysis 
• Applications to diversity-aware search 
• Source, Location, Time, Language, Opinion, 
Ranking… 
• Search over semi-structured data, semantic 
search 
• Roll-out user experiences that use higher layers 
of the NLP stack 
3
Agenda 
• Intro to current status of Web Search 
• Semantic Search 
– Annotating documents 
– Entity linking 
– Entity search 
– Applications 
• Time-aware information access 
– Motivation 
– Applications 
4
WEB SEARCH 
5
Some Challenges 
• From user understanding to selecting the right 
hardware/software architecture 
– Modeling user behavior 
– Modeling user engagement/return 
• Providing answers, not links 
– Richer interfaces 
• Enriching document structure using semantic annotations 
– Indexing those annotations 
– Accessing them in real-time 
– What are the right annotations? The right level of granularity? 
Trade-offs? 
• Time-aware information access 
– Changes the game in almost every step of the pipeline 
6
7
8
9
Structured data - Web search 
Top-1 entity with 
structured data 
Related entities 
Structured data 
extracted from HTML 
10
New devices 
• Different interaction (e.g. voice) 
• Different capabilities (e.g. display) 
• More Information (geo-localization) 
• More personalized 
11
SEMANTIC SEARCH 
12
Semantic Search 
• What different kinds of search and 
applications beyond string matching or 
returning 10 blue links? 
• Can we have a better understanding of 
documents and queries? 
• New devices open new possibilities, new 
experiences 
• Is current technology in natural language 
understanding mature enough? 
13
Search is really fast, without necessarily being 
intelligent 
14
Why Semantic Search? Part I 
• Improvements in IR are harder and harder to 
come by 
– Machine learning using hundreds of features 
• Text-based features for matching 
• Graph-based features provide authority 
– Heavy investment in computational power, e.g. real-time 
indexing and instant search 
• Remaining challenges are not computational, but 
in modeling user cognition 
– Need a deeper understanding of the query, the 
content and/or the world at large 
– Could Watson explain why the answer is Toronto? 
15
Semantic Search: a definition 
• Semantic search is a retrieval paradigm that 
– Makes use of the structure of the data or explicit schemas to 
understand user intent and the meaning of content 
– Exploits this understanding at some part of the search process 
• Emerging field of research 
– Exploiting Semantic Annotations in Information Retrieval (2008- 
2012) 
– Semantic Search (SemSearch) workshop series (2008-2011) 
– Entity-oriented search workshop (2010-2011) 
– Joint Intl. Workshop on Semantic and Entity-oriented Search (2012) 
– SIGIR 2012 tracks on Structured Data and Entities 
• Related fields: 
– XML retrieval, Keyword search in databases, NL retrieval 
16
What it’s like to be a machine? 
Roi Blanco 
17
What it’s like to be a machine? 
✜Θ♬♬ţğ 
✜Θ♬♬ţğ√∞§®ÇĤĪ✜★♬☐✓✓ 
ţğ★✜ 
✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫ 
≠=⅚©§★✓♪ΒΓΕ℠ 
✖Γ♫⅜±⏎↵⏏☐ģğğğμλκσςτ 
⏎⌥°¶§ΥΦΦΦ✗✕☐ 
18
Interactive search and task 
completion 
19
Why Semantic Search? Part II 
• The Semantic Web is here 
– Data 
• Large amounts of RDF data 
• Heterogeneous schemas 
• Diverse quality 
– End users 
• Not skilled in writing complex 
queries (e.g. SPARQL) 
• Not familiar with the data 
• Novel applications 
– Complementing document 
search 
• Rich Snippets, related entities, 
direct answers 
– Other novel search tasks 
20
Other novel applications 
• Aggregation of search results 
– e.g. price comparison across websites 
• Analysis and prediction 
– e.g. world temperature by 2020 
• Semantic profiling 
– Ontology-based modeling of user interests 
• Semantic log analysis 
– Linking query and navigation logs to ontologies 
• Task completion 
– e.g. booking a vacation using a combination of services 
• Conversational search 
21
Common tasks in Semantic Search 
semantic search 
list search 
entity search 
SemSearch 2010/11 
related entity finding 
list completion 
SemSearch 2011 
TREC REF-LOD task TREC ELC task 
22
Is NLU that complex? 
”A child of five would understand 
this. Send someone to fetch a child 
of five”. 
Groucho Marx 
23
Some examples… 
• Ambiguous searches 
– paris hilton 
• Multimedia search 
– paris hilton sexy 
• Imprecise or overly precise searches 
– jim hendler 
– pictures of strong adventures people 
• Searches for descriptions 
– 33 year old computer scientist living in barcelona 
– reliable digital camera under 300 dollars 
• Searches that require aggregation 
– height eiffel tower 
– harry potter movie review 
– world temperature 2020 
24
Language is Ambiguous 
The man saw the girl with the telescope 
25
Paraphrases 
• ‘This parrot is dead’ 
• ‘This parrot has kicked the bucket’ 
• ‘This parrot has passed away’ 
• ‘This parrot is no more' 
• 'He's expired and gone to meet his maker,’ 
• 'His metabolic processes are now history’ 
26
Not just search… 
27
Semantics at every step of the IR 
process 
bla bla bla? 
bla 
bla bla 
The IR engine The Web 
Query interpretation 
q=“bla” * 3 
Document processing bla 
bla bla 
bla 
bla 
bla 
Indexing 
Ranking 
θ(q,d) “bla” 
Result presentation 
28
Understanding Queries 
• Query logs are a big source of information & 
knowledge 
To rank better the results (what you click) 
To understand queries better 
Paris Paris Flights 
Paris Paris Hilton 
29
“Understand” Documents 
NLU is 
still an 
open 
issue 
30
NLP for IR 
• Full NLU is AI complete, not scalable to the web 
size (parsing the web is really hard). 
• BUT … what about other shallow NLP 
techniques? 
• Hypothesis/Requirements: 
• Linear extraction/parsing time 
• Error-prone output (e.g. 60-90%) 
• Highly redundant information 
• Explore new ways of browsing 
• Support your answers 
31
Usability 
We also fail at using the technology 
Sometimes 
32
Support your answers 
Errors happen: choose the right ones! 
• Humans need to “verify” unknown facts 
• Multiple sources of evidence 
• Common sense vs. Contradictions 
• are you sure? is this spam? Interesting! 
• Tolerance to errors greatly increases if users can verify 
things fast 
• Importance of snippets, image search 
• Often the context is as important as the fact 
• E.g. “S discovered the transistor in X” 
• There are different kinds of errors 
• Ridiculous result (decreases overall confidence in system) 
• Reasonably wrong result (makes us feel good) 
33
TOOLS FOR GENERATING 
ANNOTATIONS 
34
Entity? 
• Uniquely identifiable “thing” or “object” 
– “A thing with a distinct and independent existence” 
• Properties: 
– ID 
– Name(s) 
– Type(s) 
– Attributes (/Descriptions) 
– Relationships to other entities 
• An entity is only one kind of textual annotation! 
35
Entity? 
36
Named Entity Recognition 
• “Named Entity”, IE context of MUC-6 (R. 
Grishman & Sundheim’1996) 
• Recognize information units like names, including 
PERson, ORGanization, LOCation names, and 
numeric expressions including time, date, money 
[Yahoo!] employee [Roi Blanco] visits [Sapporo] . 
ORG PER LOC 
37
ML methods 
• Supervised 
• HMM 
• CRF 
• SVM 
• … 
• Unsupervised 
• Lexical Resources (e.gWordNet), lexical patterns and 
statistics from large non-annotated corpora. 
• “Semi-supervised” (or “weakly supervised”) 
• Mostly “Bootstrapping” from a set of entities 
38
Entity Linking 
Image taken from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM '08. 
39
NE Linking / NED 
• Linking mentions of Named Entities to a 
Knowledge Base (e.g. Wikipedia, Freebase …) 
• Name Variations 
• Shortened forms 
• Alternative Spellings 
• “New” Entities (not contained in the 
Knowledge Base) 
• As the Knowledge Base grows … everything 
looks like an Entity (a town named web!?) 
40
Entity Ambiguity 
• In Wikipedia Michael Jordan may also refer to: 
– Michael Jordan (mycologist) 
– Michael Jordan (footballer) 
– Michael Jordan (Irish politician) 
– Michael B. Jordan (born 1987), American actor 
– Michael I. Jordan (born 1957), American researcher 
– Michael H. Jordan (1936–2010), American executive 
– Michael-Hakim Jordan (born 1977), basketball player 
41
Traditional IE 
• Relations: assertions linking two or more 
concepts 
– actors-act in-movies 
– cities-capital of-countries 
• Facts: instantiations of relations 
– leonardo dicaprio-act in-inception 
– cairo-capital of-egypt 
• Attributes: facts capturing quantifiable properties 
– actors --> awards, birth date, height 
– movies --> producer, release date, budget 
42
ENTITY LINKING 
43 
http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
Image taken from Mihalcea and Csomai (2007). Wikify!: linking documents to encyclopedic knowledge. In 
CIKM '07. 44
45
Bing 
46
Google 
47
Yahoo! 
48
49 
Yahoo! Homepage
50 
Yahoo! Homepage
51 
Yahoo! Homepage
Why do we need entity linking? 
• (Automatic) document enrichment 
– go-read-here 
– assistance for (Wikipedia) editors 
– inline (microformats, RDFa) 
52
Why do we need entity linking? 
• “Use as feature” 
– to improve 
• classification 
• retrieval 
• word sense disambiguation 
• semantic similarity 
• ... 
– dimensionality reduction (as compared to, 
e.g., term vectors) 
53
Why do we need entity linking? 
• Enable 
– semantic search 
– advanced UI/UX 
– ontology learning, KB population 
– ... 
54
A bit of history 
• Text classification 
• NER 
• WSD 
• NED 
– {person name, geo, movie name, ...} 
disambiguation 
– (Cross-document) co-reference resolution 
• Entity linking 
55
Entity linking? 
• NE normalization / canonicalization / sense 
disambiguation 
• DB record linkage / schema mapping 
• Knowledge base population 
• Entity linking 
– D2W 
– Wikification 
– Semantic linking 
56
Entity Linking: main problem 
• Linking free text to entities 
– Entities (typically) taken from a knowledge 
base 
• Wikipedia 
• Freebase 
• ... 
– Any piece of text 
• news documents 
• blog posts 
• tweets 
• queries 
• ... 
57
Typical steps 
1. Determine “linkable” phrases 
– mention detection – MD 
2. Rank/Select candidate entity links 
– link generation – LG 
– may include NILs (null values, i.e., no target in KB) 
3. (Use “context” to 
disambiguate/filter/improve) 
– disambiguation – DA 
58
Preliminaries 
• Wikipedia 
• Wikipedia-based measures 
– commonness 
– relatedness 
– keyphraseness 
59
Wikipedia 
• Basic element: article (proper) 
• But also 
– redirect pages 
– disambiguation pages 
– category/template pages 
– admin pages 
• Hyperlinks 
– use “unique identifiers” (URLs) 
• [[United States]] or [[United States|American]] 
• [[United States (TV series)]] or 
[[United States (TV series)|TV show]] 60
Wikipedia style guidelines 
• “the lead contains a quick summary of the 
topic's most important points, and each 
major subtopic is detailed in its own section 
of the article” 
– “The lead section (also known as the lead, 
introduction or intro) of a Wikipedia article is the 
section before the table of contents and the first 
heading. The lead serves as an introduction to the 
article and a summary of its most important 
aspects.” 
See http://en.wikipedia.org/wiki/Wikipedia:Summary_style 
61
Some statistics 
• WordNet 
– 80k entity definitions 
– 115k surface forms 
– 142k senses (entity - surface form combinations) 
• Wikipedia (only) 
– ~4M entity definitions 
– ~12M surface forms 
– ~24M senses 
62
Wikipedia-based measures 
• keyphraseness(w) [Mihalcea & Csomai 2007] 
Collection frequency term 
w as a link to another Wikipedia 
article 
Collection frequency term w 
63
Wikipedia-based measures 
• commonness(w,c) [Medelyan et al. 2008] 
Number of links 
with target c’ and anchor text w 
64
Commonness and 
keyphraseness 
Image taken from Li et al. (2013). TSDW: Two-stage word sense disambiguation using Wikipedia. In 
JASIST 2013. 65
Wikipedia-based measures 
• relatedness(c, c’) [Milne & Witten 2008a] 
Image taken from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness 
Obtained from Wikipedia Links. In AAAI WikiAI Workshop. 66
Wikipedia-based measures 
• relatedness(c, c’) [Milne & Witten 2008a] 
Total number of 
Wikipedia articles 
Intersection of inlinks 
with target c and c’ 
Number of links 
with target c 
67
Recall the steps 
1. mention detection – MD 
2. link generation – LG 
3. (disambiguation) – DA 
68
Wikify! 
[Mihalcea & Csomai 2007] 
• MD 
– tf.idf, Χ2, keyphraseness 
• LG 
1.Overlap between definition (Wikipedia page) and 
context (paragraph) [Lesk 1986] 
2.Naive Bayes [Mihalcea 2007] 
• context, POS, entity-specific terms 
3.Voting between (1) and (2) 
69
Large-Scale Named Entity Disambiguation Based 
on Wikipedia Data 
[Cucerzan 2007] 
• Key intuition: leverage context links 
– '''Texas''' is a [[pop music]] band from [[Glasgow]], 
[[Scotland]], [[United Kingdom]]. They were 
founded by [[Johnny McElhone]] in [[1986 in 
music|1986]] and had their performing debut in 
[[March]] [[1988]] at ... 
• Prune the candidates, keep only: 
– appearances in the first paragraph of an article, 
and 
– reciprocal links 
70
Large-Scale Named Entity Disambiguation Based 
on Wikipedia Data 
[Cucerzan 2007] 
• MD 
– NER; rule-based; co-ref resolution 
• LG 
– Represent entities as vectors 
• context, categories 
– Same for all candidate entity links 
– Determine maximally coherent set 
71
Topic Indexing with Wikipedia 
[Medelyan et al. 2008] 
• MD 
– keyphraseness [Mihalcea & Csomai 2007] 
• LG 
• combination of average relatedness & 
commonness 
• LG/DA 
• Naive Bayes 
• TF.IDF, position, length, degree, weighted keyphraseness 
72
Learning to Link with Wikipedia 
[Milne & Witten 2008b] 
• Key idea: disambiguation informs detection 
– compare each possible sense with its relatedness 
to the context sense candidates 
– start with unambiguous senses 
– So, first LG, then base MD on these results 
73
Learning to Link with Wikipedia 
[Milne & Witten 2008b] 
Image taken from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM '08. 
74
Learning to Link with Wikipedia 
[Milne & Witten 2008b] 
• Filter non-informative, non-ambiguous candidates 
(e.g., “the”) 
– based on keyphraseness, i.e., link probability 
• Filter non-central candidates 
– based on average relatedness to all other context senses 
• Combine 
75
Learning to Link with Wikipedia 
[Milne & Witten 2008b] 
• MD 
• Machine learning 
• link probability, relatedness, confidence of LG, 
generality, frequency, location, spread 
• LG 
• Machine learning 
• keyphraseness, average relatedness, sum of average 
weights 
76
Image taken from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM '08. 
77
Local and Global Algorithms for 
Disambiguation to Wikipedia 
[Ratinov et al. 2011] 
• Explicit focus on global versus local 
algorithms 
– “Global,” i.e., disambiguation of the candidate 
graph 
– NP-hard 
• Optimization 
– reduce the search space to a “disambiguation 
context,” e.g., 
• all plausible disambiguations [Cucerzan 2007] 
• unambiguous surface forms [Milne & Witten 2008b] 78
Local and Global Algorithms for 
Disambiguation to Wikipedia 
[Ratinov et al. 2011] 
• Main contribution, in steps 
1.use “local” approach (e.g., commonness) to 
generate a disambiguation context 
2.apply “global” machine learning approach 
• relatedness, PMI 
• {inlinks, outlinks} in various combinations (c and c’) 
• {avg, max} 
• Finally, apply another round of machine 
learning 
79
TAGME: On-the-fly Annotation of Short 
Text Fragments 
[Ferragina & Scaiella 2010] 
• MD 
– keyphraseness [Mihalcea & Csomai 2007] 
• LG 
• use “local” approach to generate a 
disambiguation context, similar to [Ratinov et al. 
2011] 
• Heavy pruning 
• mentions; candidate links; coherence 
• Accessible at http://tagme.di.unipi.it 
80
A Graph-based Method for Entity 
Linking 
[Guo et al. 2011] 
• MD 
– rule-based; prefer longer links 
– generate a disambiguation context 
• LG 
• (weighted interpolation of) in- and outdegree in 
disambiguation context to select entity links 
• edges defined by wikilinks 
• Evaluation on TAC KBP 
81
Recap 
• Essential ingredients 
– MD 
• commonness 
• keyphraseness 
– LG 
• commonness 
• machine learning 
– DA 
• relatedness 
• machine learning 
82
SEARCH OVER ANNOTATED 
DOCUMENTS 
83
Annotated documents 
Barack Obama visited Tokyo this Monday as part of an extended Asian trip. 
He is expected to deliver a speech at the ASEAN conference next Tuesday 
20 May 2009 
28 May 2009 
Barack Obama visited Tokyo this Monday as part of an extended Asian trip. 
He is expected to deliver a speech at the ASEAN conference next Tuesday 
84
85
How does it work? 
Monty 
Python Inverted Index 
(sentence/doc level) 
Forward Index 
(entity level) 
Flying Circus 
John Cleese 
Brian 
86
Efficient element retrieval 
• Goal 
– Given an ad-hoc query, return a list of documents and 
annotations ranked according to their relevance to the query 
• Simple Solution 
– For each document that matches the query, retrieve its 
annotations and return the ones with the highest counts 
• Problems 
– If there are many documents in the result set this will take too 
long - too many disk seeks, too much data to search through 
– What if counting isn’t the best method for ranking elements? 
• Solution 
– Special compressed data structures designed specifically for 
annotation retrieval 
87
Forward Index 
• Access metadata and document contents 
– Length, terms, annotations 
• Compressed (in memory) forward indexes 
– Gamma, Delta, Nibble, Zeta codes (power laws) 
• Retrieving and scoring annotations 
– Sort terms by frequency 
• Random access using an extra compressed 
pointer list (Elias-Fano) 
88
Parallel Indexes 
• Standard index contains only tokens 
• Parallel indices contain annotations on the tokens – the 
annotation indices must be aligned with main token index 
• For example: given the sentence “New York has great 
pizza” where New York has been annotated as a LOCATION 
– Token index has five entries 
(“new”, “york”, “has”, “great”, “pizza”) 
– The annotation index has five entries 
(“LOC”, “LOC”, “O”,”O”,”O”) 
Can optionally encode BIO format (e.g. LOC-B, LOC-I) 
• To search for the New York location entity, we search for: 
“token:New ^ entity:LOC token:York ^ entity:LOC” 
89
Parallel Indices (II) 
Doc #3: The last time Peter exercised was in the XXth century. 
Doc #5: Hope claims that in 1994 she run to Peter Town. 
Peter  D3:1, D5:9 
Town  D5:10 
Hope  D5:1 
1994  D5:5 
… 
Possible Queries: 
“Peter AND run” 
“Peter AND WNS:N_DATE” 
“(WSJ:CITY ^ *) AND run” 
“(WSJ:PERSON ^ Hope) AND run” 
WSJ:PERSON  D3:1, D5:1 
WSJ:CITY  D5:9 
WNS:V_DATE  D5:5 
(Bracketing can also be dealt with) 
90
Pipelined Architecture 
91
Entity Ranking 
92
Entity Ranking 
• Given a topic, find relevant entities 
• Evaluated in TREC and INEX campaigns 
• Most well-known: people and expert 
search 
• Many other applications: dates, events, 
locations, companies, ... 
93
94
Example queries 
• Impressionist art Museums in Holland 
• Countries with the Euro currency 
• German car manufacturers 
• Artists related to Pablo Picasso 
• Actors who played Hamlet 
• English monarchs who married a French women 
• Many examples on 
http://www.ins.cwi.nl/projects/inex-xer/topics/ 
95
Entity Ranking 
• Topical query Q 
– Entity (result) type TX 
– A list of Entity instance Xs 
• Systems employ categories, structure, links: 
– Kaptein et al., CIKM10: Exploits Wikipedia, identifies entity types, anchor text 
index for entity search 
– Bron et al,CIKM10 Entity co‐occurrence, Entity type filtering, Context 
(relation type) 
• See http://ilps.science.uva.nl/trec-entity/resources/ 
• Futher: http://ejmeij.github.io/entity-linking-and-retrieval- 
tutorial/ 
96
Element Containment Graph 
[Zaragoza et al. CIKM 2007] 
• Given passage s in S, element e in E, there is a 
directed edge from s to e if passage s contains 
element e 
• Element containment graph C = S x E where Cij is 
strength of connection between si and ej 
At query time, compute Cq as a subset of the containment 
graph of the q passages matching the query 
Rank entities using HITS, PageRank, passage similarity, 
element similarity, etc 
97
98 
SEARCH OVER RDF (TRIPLES) DATA
Resource Description Framework 
(RDF) 
• Each resource (thing, entity) is identified by a URI 
– Globally unique identifiers 
– Locators of information 
• Data is broken down into individual facts 
– Triples of (subject, predicate, object) 
• A set of triples (an RDF graph) is published together in an 
RDF document 
example:roi 
“Roi Blanco” 
type 
name 
foaf:Person 
RDF document
Linking resources 
example:roi 
“Roi Blanco” 
name 
foaf:Person 
sameAs 
example:roi2 
worksWith 
example:peter 
type 
email 
“pmika@yahoo-inc.com” 
type 
Roi’s homepage 
Yahoo 
Friend-of-a-Friend ontology
Publishing RDF 
• Linked Data 
– Data published as RDF documents linked to other RDF 
documents 
– Community effort to re-publish large public datasets 
(e.g. Dbpedia, open government data) 
• RDFa 
– Data embedded inside HTML pages 
– Recommended for site owners by Yahoo!, Google, 
Facebook 
• SPARQL endpoints 
– Triple stores (RDF databases) that can be queried 
through the web
The state of Linked Data 
• Rapidly growing community effort to (re)publish open datasets as Linked 
Data 
– In particular, scientific and government datasets 
– see linkeddata.org 
• Less commercial interest, real usage (Haas et al. SIGIR 2011)
Glimmer: Architecture overview 
Doc 
1. Download, uncompress, 
convert (if needed) 
2. Sort quads by subject 
3. Compute Minimal Perfect 
Hash (MPH) 
map 
map 
reduce 
reduce 
map reduce 
Index 
3. Each mapper reads part of the 
collection 
4. Each reducer builds an index 
for a subset of the vocabulary 
5. Optionally, we also build an 
archive (forward-index) 
5. The sub-indices are 
merged into a single index 
6. Serving 
and 
Ranking 
https://github.com/yahoo/Glimmer 
103
Horizontal index structure 
• One field per position 
– one for object (token), one for predicates (property), optionally one for context 
• For each term, store the property on the same position in the property 
index 
– Positions are required even without phrase queries 
• Query engine needs to support fields and the alignment operator 
 Dictionary is number of unique terms + number of properties 
 Occurrences is number of tokens * 2 
104
Vertical index structure 
• One field (index) per property 
• Positions are not required 
• Query engine needs to support fields 
 Dictionary is number of unique terms 
 Occurrences is number of tokens 
✗ Number of fields is a problem for merging, query performance 
• In experiments we index the N most common properties 
105
Efficiency improvements 
• r-vertical (reduced-vertical) index 
– One field per weight vs. one field per property 
– More efficient for keyword queries but loses the ability to restrict per 
field 
– Example: three weight levels 
• Pre-computation of alignments 
– Additional term-to-field index 
– Used to quickly determine which fields contain a term (in any 
document) 
106
Run-time efficiency 
• Measured average execution time (including ranking) 
– Using 150k queries that lead to a click on Wikipedia 
– Avg. length 2.2 tokens 
– Baseline is plain text indexing with BM25 
• Results 
– Some cost for field-based retrieval compared to plain text indexing 
– AND is always faster than OR 
• Except in horizontal, where alignment time dominates 
– r-vertical significantly improves execution time in OR mode 
AND mode OR mode 
plain text 46 ms 80 ms 
horizontal 819 ms 847 ms 
vertical 97 ms 780 ms 
r-vertical 78 ms 152 ms 
107
BM25F Ranking 
BM25(F) uses a term-frequency (tf) that accounts for the decreasing 
marginal contribution of terms 
where 
vs is the weight of the field 
tfsi is the frequency of term i in field s 
Bs is the document length normalization factor: 
ls is the length of field s 
avls is the average length of s 
bs is a tunable parameter 
108
BM25F ranking (II) 
• Final term score is a combination of tf and idf 
where 
k1 is a tunable parameter 
wIDF is the inverse-document frequency: 
• Finally, the score of a document D is the sum of 
the scores of query terms q 
109
Effectiveness evaluation 
• Semantic Search Challenge 2010 
– Data, queries, assessments available online 
• Billion Triples Challenge 2009 dataset 
• 92 entity queries from web search 
– Queries where the user is looking for a single entity 
– Sampled randomly from Microsoft and Yahoo! query logs 
• Assessed using Amazon’s Mechanical Turk 
– Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 
2010 
– Blanco et al. Repeatable and Reliable Search System 
Evaluation using Crowd-Sourcing, SIGIR2011 
110
Evaluation form 
111
Effectiveness results 
• Individual features 
– Positive, stat. significant improvement from each feature 
– Even a manual classification of properties and domains helps 
• Combination 
– Positive stat. significant marginal improvement from each additional feature 
– Total improvement of 53% over the baseline 
– Different signals of relevance 
112
Conclusions 
• Indexing and ranking RDF data 
– Novel index structures 
– Ranking method based on BM25F 
• Future work 
– Ranking documents with metadata 
• e.g. microdata/RDFa 
– Exploiting more semantics 
• e.g. sameAs 
– Ranking triples for display 
– Question-answering 
113
Entity Ranking Conclusions 
• Entity ranking provides a richer user 
experience for search-triggered 
applications 
• Ranking entities can be done 
effectively/efficiently using simple 
frequency-based models 
• Explaining things is as important as 
retrieving them (Blanco and Zaragoza SIGIR 2010) 
– Support sentences: several features based on 
scores of sentences and entities 
– The role of the context boosts performance 
114
Related entity ranking in 
web search
Spark: related entity 
recommendations in web search 
• A search assistance tool for exploration 
• Recommend related entities given the user’s current query 
– Cf. Entity Search at SemSearch, TREC Entity Track 
• Ranking explicit relations in a Knowledge Base 
– Cf. TREC Related Entity Finding in LOD (REF-LOD) task 
• A previous version of the system live since 2010 
• van Zwol et al.: Faceted exploration of image search results. WWW 2010: 961- 
970 
– The current version (described here) at 
• Blanco et al: Entity recommendations in Web Search, ISWC 2013 
116
Motivation 
• Some users are short on time 
– Need for direct answers 
– Query expansion, question-answering, 
information boxes, rich results… 
• Other users have time at their hand 
– Long term interests such as sports, celebrities, 
movies and music 
– Long running tasks such as travel planning 
117
Examples 
118
Spark in use 
119
How does it work? 
/user/torzecn/ 
Shared/Entity- 
Relationship_Graphs 
Yalinda feed parser 
(GFFeedMapper, 
GFFeedReducer) 
$VIS_GRID_HOME/ 
gfstorageall/{1,2,3}/ 
yalinda/phyfacet/data 
/projects/gridfaces/ 
feed/yahooomg/ 
Y! OMG feed parser 
(GFFeedMapper, 
GFFeedReducer) 
$VIS_GRID_HOME/ 
gfstorageall/6/yahooomg/ 
phyfacet/data 
/projects/gridfaces/ 
feed/geo 
Y! Geo feed parser 
(GFFeedMapper, 
GFFeedReducer) 
$VIS_GRID_HOME/ 
gfstorageall/10/geo/ 
phyfacet/data 
/projects/gridfaces/ 
feed/yahootv 
Y! TV feed parser 
(GFFeedMapper, 
GFFeedReducer) 
$VIS_GRID_HOME/ 
gfstorageall/5 
/projects/gridfaces/ 
feed/editorialdata/ 
data/objects 
Editorial feed parser 
(GFFeedMapper, 
GFEditorialMergeReducer) 
$VIS_GRID_HOME/ 
gfstorageall/logicalobj/ 
dataArchive/ 
$VIS_GRID_HOME/ 
gfstorageall/ 
logicalfacet/data 
$VIS_GRID_HOME/ 
gfstorageall/logicalobj/ 
data 
/projects/gridfaces/ 
feed/editorialdata/ 
data/facets 
Activated 
Deactivated 
Empty/missing 
GFDumpMain-First 
(DumpFirstMapper, 
DumpFirstReducer) 
$VIS_GRID_HOME/ 
rankinputTmp 
$VIS_GRID_HOME/ 
rankinput 
STEP 1: FEED PARSING PIPELINE 
GFDumpMain-Second 
(DumpSecondMapper, 
DumpSecondReducer) 
STEP 2: PREPROCESSING PIPELINE (before feature extraction and ranking) 
$VIS_GRID_HOME/ 
rankinput 
CreateDictionary 
(DictionaryMapper, 
DistinctReducer) 
$VIS_GRID_HOME/ 
rankprepout/ 
dictionary 
/data/SDS/data/ 
search_US 
CreateIntermediate 
LogFormat 
(NullValueSillyMapp 
er, DistinctReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
logs/week{WNo} 
CreateQuerySessions 
CommonModelFilter 
(FiltererMapper, 
FiltererReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
qsessionsfilterer 
CreateQuerySessions 
CommonModel 
(SillyMapper, 
QuerySessionsCreate 
SessionsReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
qsessions_cm 
CreateQueryTermsJoin 
DictionaryAndLogs 
(LeftJoinMapper, 
LeftJoinReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
qtermsjoindictionarylog 
CreateQueryTermsJoin 
QTermsAndLogs 
(SillyMapper, 
ImplodeReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
qterms_cm 
/projects/gridfaces/ 
flickr/feed 
CreateFlickrIntermediate 
LogFormat 
(FlickrReformatMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
flickrtagsintermediate 
CreateFlickrTagsCommon 
ModelFilter 
(FiltererMapper, 
FiltererReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
flickrtagsfilterer 
CreateFlickrTagsCommon 
ModelFilter 
(SillyMapper, 
ImplodeReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
flickr_cm 
General 
Query logs 
Flickr 
Twitter 
/projects/rtds/twitter/ 
firehose 
CreateTwitterIntermediate 
LogFormat 
(NullValueSillyMapper, 
DistinctReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
twitter_logs/week{WNo} 
CreateTweetsCommon 
ModelFilter 
(FiltererMapper, 
FiltererReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
tweetsfilterer 
CreateTweetsCommon 
Model 
(SillyMapper, 
ImplodeReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
tweets_cm 
DistinctUsersTwitter 
(SillyMapper, 
DistinctReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/tweets/ 
distinctusers 
DistinctUsersflickr 
(SillyMapper, 
DistinctReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/flickr/ 
distinctusers 
DistinctUsersQSessions 
(SillyMapper, 
DistinctReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
qsessions/distinctusers 
DistinctUsersQTerms 
(SillyMapper, 
DistinctReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/qterms/ 
distinctusers 
CountUsers 
(CounterMapper, 
CounterReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
distinctusers 
CountEvents 
(CounterMapper, 
CounterReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
countevents 
STEP 3a: FEATURE EXTRACTION (QUERY TERMS) PIPELINE 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
probability 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
conditionaluserprobability2/ 
qterms/queryfacet 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
countusers 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
qterms_cm 
EventProbabilityQTerms 
(ProbabilityMapper, 
ProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
countevents 
ConditionalUserProbability 
2_3-qt 
(ConditionalUserProbability 
PrepareMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
conditionaluserproba 
bility1/qterms 
EventConditionalProbabilityQterms 
(ConditionalProbabilityMapper, 
ConditionalProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
conditionalprobability 
Extracted features 
Others 
EventJointProbabilityQterms 
(JointProbabilityMapper, 
ProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
jointprobability 
EventJointUserProbabilityQTerms 
(JointUserProbabilityMapper, 
JointUserProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
jointuserprobability 
EventConditionalUserProb 
abilityQterms 
(ConditionalUserProbability 
PrepareMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
conditionaluserprobability2/ 
qterms/query 
EventConditionalUser 
Probability3_qterms 
(SillyMapper, 
ConditionalUserProba 
bilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/qterms/ 
week{WNo}/ 
conditionaluserprobability 
EventEntropyQTerms 
(EntropyMapper, 
EntropyReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
entropy 
EventPMI1QTerms 
(SillyMapper, 
JoinUnaryMetricReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
pmiqterms 
EventPMIQTerms 
(PMIMapper, 
PMIReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
pmi 
EventKLDivergenceUnary1QTerms 
(KLDivergenceUnaryJoinerMapper, 
JoinUnaryMetricReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
klunaryqterms 
EventKLDivergenceUnaryQTerms 
(SillyMapper, 
KLDivergenceUnaryReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
kldivergenceunary 
EventCosineSimilarityQTerms 
(PMIMapper, 
CosineSimilarityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
cosinesimilarity 
STEP 3c: FEATURE EXTRACTION (FLICKR TAGS) PIPELINE 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
probability 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
conditionaluserprobability2/ 
flickr/queryfacet 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
countusers 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
flickr_cm 
EventProbabilityFlickr 
(ProbabilityMapper, 
ProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
countevents 
ConditionalUserProbability 
2_3-fl 
(ConditionalUserProbability 
PrepareMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
conditionaluserproba 
bility1/flickr 
EventConditionalProbabilityFlickr 
(ConditionalProbabilityMapper, 
ConditionalProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
conditionalprobability 
Extracted features 
Others 
EventJointProbabilityFlickr 
(JointProbabilityMapper, 
ProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
jointprobability 
EventJointUserProbabilityFlickr 
(JointUserProbabilityMapper, 
JointUserProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
jointuserprobability 
EventConditionalUserProb 
abilityFlickr 
(ConditionalUserProbability 
PrepareMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
conditionaluserprobability2/ 
flickr/query 
EventConditionalUser 
Probability3_flickr 
(SillyMapper, 
ConditionalUserProba 
bilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/flickr/ 
week{WNo}/ 
conditionaluserprobability 
EventEntropyFlickr 
(EntropyMapper, 
EntropyReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
entropy 
EventPMI1Flickr 
(SillyMapper, 
JoinUnaryMetricReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
pmiflickr 
EventPMIFlickr 
(PMIMapper, 
PMIReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
pmi 
EventKLDivergenceUnary1Flickr 
(KLDivergenceUnaryJoinerMapper, 
JoinUnaryMetricReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
klunaryflickr 
EventKLDivergenceUnaryFlickr 
(SillyMapper, 
KLDivergenceUnaryReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
kldivergenceunary 
EventCosineSimilarityFlickr 
(PMIMapper, 
CosineSimilarityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/flickr/ 
week{WNo}/ 
cosinesimilarity 
STEP 3d: FEATURE EXTRACTION (TWEETS) PIPELINE 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
probability 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
conditionaluserprobability2/ 
tweets/queryfacet 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
countusers 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
tweets_cm 
EventProbabilityTweets 
(ProbabilityMapper, 
ProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
countevents 
ConditionalUserProbability 
2_3-tw 
(ConditionalUserProbability 
PrepareMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
conditionaluserproba 
bility1/tweets 
EventConditionalProbabilityTweets 
(ConditionalProbabilityMapper, 
ConditionalProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
conditionalprobability 
Extracted features 
Others 
EventJointProbabilityTweets 
(JointProbabilityMapper, 
ProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
jointprobability 
EventJointUserProbabilityTweets 
(JointUserProbabilityMapper, 
JointUserProbabilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
jointuserprobability 
EventConditionalUserProb 
abilityTweets 
(ConditionalUserProbability 
PrepareMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
conditionaluserprobability2/ 
tweets/query 
EventConditionalUser 
Probability3_tweets 
(SillyMapper, 
ConditionalUserProba 
bilityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/tweets/ 
week{WNo}/ 
conditionaluserprobability 
EventEntropyTweets 
(EntropyMapper, 
EntropyReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
entropy 
EventPMI1Tweets 
(SillyMapper, 
JoinUnaryMetricReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
pmitweets 
EventPMITweets 
(PMIMapper, 
PMIReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
pmi 
EventKLDivergenceUnary1Tweets 
(KLDivergenceUnaryJoinerMapper, 
JoinUnaryMetricReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
klunarytweets 
EventKLDivergenceUnary1Tweets 
(SillyMapper, 
KLDivergenceUnaryReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
kldivergenceunary 
EventCosineSimilarityTweets 
(PMIMapper, 
CosineSimilarityReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
cosinesimilarity 
graph_sharedconnect_1 
(GRSharedConnectMapper, 
GRSharedConnectReducer) 
(MergeFeaturesMapper, 
SymmetricFeatureMergerReducer) 
STEP 4a: FEATURE MERGING (QUERY TERMS) PIPELINE 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
entropy 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
probability 
unaryfeaturemerger1_qterms 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
join_graph 
$VIS_GRID_HOME/ 
rankinput 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
unary1_qterms 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
jointprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
jointuserprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
pmi 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
conditionaluserprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
conditionalprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
kldivergenceunary 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qterms/week{WNo}/ 
cosinesimilarity 
unaryfeaturemerger2_qterms 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
unary2_qterms 
symmetricfeaturemerger_qterms 
(MergeFeaturesMapper, 
SymmetricFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
symmetric_qterms 
asymmetricfeaturemerger_qterms 
(MergeFeaturesMapper, 
AsymmetricFeatureMergerReducer) 
Features 
Others 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
asymmetric_qterms 
reverseasymmetricfeaturemerger_qterms 
(MergeFeaturesMapper, 
AsymmetricFeatureMergerReducer) 
$VIS_GRID_HOM 
E/rankprepout/tmp/ 
statsmerge/ 
reverseasymetric_ 
qterms 
STEP 4b: FEATURE MERGING (QUERY SESSIONS) PIPELINE 
$VIS_GRID_HOME 
/rankprepout/spark/ 
qsessions/ 
week{WNo}/entropy 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qsessions/ 
week{WNo}/ 
probability 
unaryfeaturemerger1_qsessions 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankinput 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
unary1_qsessions 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qsessions/ 
week{WNo}/ 
jointprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qsessions/ 
week{WNo}/ 
jointuserprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qsessions/ 
week{WNo}/pmi 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qsessions/week{WNo}/ 
conditionaluserprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qsessions/ 
week{WNo}/ 
conditionalprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qsessions/ 
week{WNo}/ 
kldivergenceunary 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
qsessions/week{WNo}/ 
cosinesimilarity 
unaryfeaturemerger2_qsessions 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
unary2_qsessions 
symmetricfeaturemerger_qsessions 
(MergeFeaturesMapper, 
SymmetricFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
symmetric_qsessions 
asymmetricfeaturemerger_qsessions 
(MergeFeaturesMapper, 
AsymmetricFeatureMergerReducer) 
Features 
Others 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
asymmetric_qsessions 
reverseasymmetricfeaturemerger_qsessions 
(MergeFeaturesMapper, 
AsymmetricFeatureMergerReducer) 
$VIS_GRID_HOM 
E/rankprepout/tmp/ 
statsmerge/ 
reverseasymetric_ 
qsessions 
STEP 4c: FEATURE MERGING (FLICKR TAGS) PIPELINE 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
entropy 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
probability 
unaryfeaturemerger1_flickr 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankinput 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
unary1_flickr 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
jointprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
jointuserprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/pmi 
$VIS_GRID_HOME/ 
rankprepout/spark/flickr/ 
week{WNo}/ 
conditionaluserprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
conditionalprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
flickr/week{WNo}/ 
kldivergenceunary 
$VIS_GRID_HOME/ 
rankprepout/spark/flickr/ 
week{WNo}/ 
cosinesimilarity 
unaryfeaturemerger2_flickr 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
unary2_flickr 
symmetricfeaturemerger_flickr 
(MergeFeaturesMapper, 
SymmetricFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
symmetric_flickr 
asymmetricfeaturemerger_flickr 
(MergeFeaturesMapper, 
AsymmetricFeatureMergerReducer) 
Features 
Others 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
asymmetric_flickr 
reverseasymmetricfeaturemerger_flickr 
(MergeFeaturesMapper, 
AsymmetricFeatureMergerReducer) 
$VIS_GRID_HOM 
E/rankprepout/tmp/ 
statsmerge/ 
reverseasymetric_f 
lickr 
STEP 4d: FEATURE MERGING (TWEETS) PIPELINE 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
entropy 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
probability 
unaryfeaturemerger1_tweets 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankinput 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
unary1_tweets 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
jointprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
jointuserprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
pmi 
$VIS_GRID_HOME/ 
rankprepout/spark/tweets/ 
week{WNo}/ 
conditionaluserprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
conditionalprobability 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
kldivergenceunary 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
tweets/week{WNo}/ 
cosinesimilarity 
unaryfeaturemerger2_tweets 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
unary2_tweets 
symmetricfeaturemerger_tweets 
(MergeFeaturesMapper, 
SymmetricFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
symmetric_tweets 
asymmetricfeaturemerger_tweets 
(MergeFeaturesMapper, 
AsymmetricFeatureMergerReducer) 
Features 
Others 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
asymmetric_tweets 
reverseasymmetricfeaturemerger_tweets 
(MergeFeaturesMapper, 
AsymmetricFeatureMergerReducer) 
$VIS_GRID_HOM 
E/rankprepout/tmp/ 
statsmerge/ 
reverseasymetric_t 
weets 
STEP 5a: FEATURE EXTRACTION AND MERGING (COMBINED FEATURES) PIPELINE 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
reverseasymetric_qsessions 
combinedfeaturem 
erger_qsessions 
(CombinedFeature 
MergerMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
combined_qsessions 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
reverseasymetric_qterms 
combinedfeaturem 
erger_qterms 
(CombinedFeature 
MergerMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
combined_qterms 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
reverseasymetric_flickr 
combinedfeaturem 
erger_flickr 
(CombinedFeature 
MergerMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
combined_flickr 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
reverseasymetric_tweets 
combinedfeaturem 
erger_tweets 
(CombinedFeature 
MergerMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
combined_tweets 
joinfeatures 
(JoinerMapper, 
JoinerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
joinfeatures3 
STEP 5b: FEATURE EXTRACTION AND MERGING (GRAPH) PIPELINE 
Features 
Others 
$VIS_GRID_HOME 
/rankprepout/tmp/ 
statsmerge/ 
joinfeatures3 
$VIS_GRID_HOME/ 
rankinput 
$VIS_GRID_HOME/ 
rankprepout/graph/ 
sharedconnect 
graph_sharedconnect_2 
(NullValueSillyMapper, 
CounterReducer) 
$VIS_GRID_HOME/ 
rankprepout/graph/ 
sharedconnect_2 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
joinfeature_g_1 
graph_popularity_rank_all 
(GRPopularityRankMapper, 
GRPopularityRankReducer) 
$VIS_GRID_HOME/ 
rankprepout/graph/ 
popularity_rank_all 
graph_popularity_rank_directed 
(GRPopularityRankMapper, 
GRPopularityRankReducer) 
$VIS_GRID_HOME/ 
rankprepout/graph/ 
popularity_rank_directed 
unaryfeaturemerger_entpopmov 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
joinfeature_pop1 
unaryfeaturemerger_entpopmov2 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME 
/rankprepout/tmp/ 
statsmerge/ 
joinfeature_pop2 
unaryfeaturemerger_entpopmov3 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
joinfeature_pop3 
unaryfeaturemerger_entpopmov4 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
joinfeature_pop4 
STEP 5c: FEATURE MERGING (POPULARITY) PIPELINE 
Features 
Others 
$VIS_GRID_HOME/ 
web_citation/ 
deep_hits 
$VIS_GRID_HOME/ 
rankinput 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
joinfeature_pop4 
$VIS_GRID_HOME/ 
web_citation/ 
total_hits 
WebCitationTotalHits 
Normalization 
(SillyMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
webcitation_totalhits 
WebCitationDeepHits 
Normalization 
(SillyMapper) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
webcitation_deephits 
asymmetricfeaturemerger_webcitation 
(MergeFeaturesMapper, 
AsymmetricFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
asymmetric_webcitation 
coverage 
(QueryCountMapper, 
QueryCountReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
coverage/week{WNo} 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
logs/week{WNo} 
joincov1 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
joinfeaturescov1 
joincov2 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
joinfeaturescov2 
joinwikipop1 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
/user/barla/Spark/ 
wikiResultCounts 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
joinfeatureswikipop1 
joinwikipop2 
(MergeFeaturesMapper, 
UnaryFeatureMergerReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/ 
joinfeatureswikipop2 
jointypes 
(EntityRelationTypeMapper, 
EntityRelationTypeReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/jointypes 
STEP 7: DATAPACK GENERATION PIPELINE 
(DumpFirstMapper, 
DumpFirstReducer) 
STEP 6: RANKING PIPELINE 
dumper1 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
statsmerge/jointypes 
MLR scoring 
(/homes/barla/gf_mlr/ 
mlr_rank) 
/homes/barla/gf_mlr/ 
gbrank.xml 
/homes/barla/gf_mlr/ 
header.tsv 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
ranking 
MLR scoring 
(SillyMapper, 
DisambiguationReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
ranking-disambiguated 
groupmax 
(SillyMapper, 
OutputMaxReducer) 
$VIS_GRID_HOME/ 
rankprepout/tmp/ 
ranking 
formatranking 
(RankingFormatterMapper, 
RankingFormatterReducer) 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
rankingfinal 
$VIS_GRID_HOME/ 
rankprepout/spark/ 
rankingfinal 
mergerank 
(GFFeedMapper, 
GFFeedReducer) 
$VIS_GRID_HOME/ 
gfrankout 
$VIS_GRID_HOME/ 
gfrankout/1/yalinda/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfrankout/2/yalinda/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfrankout/3/yalinda/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfrankout/4/yalinda/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfrankout/5/yalinda/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfrankout10/geo/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfstorageall/1/yalinda/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfstorageall/2/yalinda/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfstorageall/3/yalinda/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfstorageall/4/yalinda/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfstorageall/5/yalinda/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfstorageall/10/geo/ 
phyfacet/ranking/spark 
$VIS_GRID_HOME/ 
gfstorageall/1/yalinda/ 
phyfacet/data 
$VIS_GRID_HOME/ 
gfstorageall/2/yalinda/ 
phyfacet/data 
$VIS_GRID_HOME/ 
gfstorageall/3/yalinda/ 
phyfacet/data 
$VIS_GRID_HOME/ 
gfstorageall/4/yalinda/ 
phyfacet/data 
$VIS_GRID_HOME/ 
gfstorageall/5/yalinda/ 
phyfacet/data 
$VIS_GRID_HOME/ 
gfstorageall/10/geo/ 
phyfacet/data 
$VIS_GRID_HOME/ 
gfstorageall/ 
logicalobj/data 
$VIS_GRID_HOME/ 
platdumpTmp 
dumper2 
(DumpSecondMapper, 
DumpSecondReducer) 
$VIS_GRID_HOME/ 
platdump 
datapack 
(Mapper, 
YISFacetMergeReducer) 
$VIS_GRID_HOME/ 
datapack 
120
High-Level Architecture View 
Entity 
graph 
Data 
preprocessing 
Feature 
extraction 
Model 
learning 
Feature 
sources 
Editorial 
judgements 
Datapack 
Ranking 
model 
Ranking and 
disambiguation 
Entity 
data 
Features 
121
Spark Architecture 
Entity 
graph 
Data 
preprocessing 
Feature 
extraction 
Model 
learning 
Feature 
sources 
Editorial 
judgements 
Datapack 
Ranking 
model 
Ranking and 
disambiguation 
Entity 
data 
Features 
122
Entity 
graph 
Data Preprocessing 
Data 
preprocessing 
Feature 
extraction 
Model 
learning 
Feature 
sources 
Editorial 
judgements 
Datapack 
Ranking 
model 
Ranking and 
disambiguation 
Entity 
data 
Features 
123
Entity graph 
• 3.4 million entities, 160 million relations 
• Locations: Internet Locality, Wikipedia, Yahoo! Travel 
• Athletes, teams: Yahoo! Sports 
• People, characters, movies, TV shows, albums: Dbpedia 
• Example entities 
• Dbpedia Brad_Pitt Brad Pitt Movie_Actor 
• Dbpedia Brad_Pitt Brad Pitt Movie_Producer 
• Dbpedia Brad_Pitt Brad Pitt Person 
• Dbpedia Brad_Pitt Brad Pitt TV_Actor 
• Dbpedia Brad_Pitt_(boxer) Brad Pitt Person 
• Example relations 
• Dbpedia Dbpedia Brad_Pitt Angelina_Jolie Person_IsPartnerOf_Person 
• Dbpedia Dbpedia Brad_Pitt Angelina_Jolie MovieActor_CoCastsWith_MovieActor 
• Dbpedia Dbpedia Brad_Pitt Angelina_Jolie MovieProducer_ProducesMovieCastedBy_MovieActor 
124
Entity graph challenges 
• Coverage of the query volume 
– New entities and entity types 
– Additional inference 
– International data 
– Aliases, e.g. jlo, big apple, thomas cruise mapother iv 
• Freshness 
– People query for a movie long before it’s released 
• Irrelevant entity and relation types 
– E.g. voice actors who co-acted in a movie, cities in a continent 
• Data quality 
– Andy Lau has never acted in Iron Man 3 
125
Entity 
graph 
Feature Extraction 
Data 
preprocessing 
Feature 
extraction 
Model 
learning 
Feature 
sources 
Editorial 
judgements 
Datapack 
Ranking 
model 
Ranking and 
disambiguation 
Entity 
data 
Features 
126
Feature extraction from text 
• Text sources 
– Query terms 
– Query sessions 
– Flickr tags 
– Tweets 
• Common representation 
Input tweet: 
Brad Pitt married to Angelina Jolie in Las Vegas 
Output event: 
Brad Pitt + Angelina Jolie 
Brad Pitt + Las Vegas 
Angelina Jolie + Las Vegas 
127
Features 
• Unary 
– Popularity features from text: probability, entropy, 
wiki id popularity … 
– Graph features: PageRank on the entity graph, 
Wikipedia, web graph 
– Type features: entity type 
• Binary 
– Co-occurrence features from text: conditional 
probability, joint probability … 
– Graph features: common neighbors … 
– Type features: relation type 
128
Feature extraction challenges 
• Efficiency of text tagging 
– Hadoop Map/Reduce 
• More features are not always better 
– Can lead to over-fitting without sufficient training 
data 
129
Entity 
graph 
Data 
Model Learning 
preprocessing 
Feature 
extraction 
Model 
learning 
Feature 
sources 
Editorial 
judgements 
Datapack 
Ranking 
model 
Ranking and 
disambiguation 
Entity 
data 
Features 
130
Model Learning 
• Training data created by editors (five grades) 
400 Brandi adriana lima Brad Pitt person Embarassing 
1397 David H. andy garcia Brad Pitt person Mostly Related 
3037 Jennifer benicio del toro Brad Pitt person Somewhat Related 
4615 Sarah burn after reading Brad Pitt person Excellent 
9853 Jennifer fight club movie Brad Pitt person Perfect 
• Join between the editorial data and the feature file 
• Trained a regression model using GBDT 
– Gradient Boosted Decision Trees 
• 10-fold cross validation optimizing NDCG and tuning 
– number of trees 
– number of nodes per tree 
131
Impact of training data 
Number of training instances (judged relations) 
132
linear combination of the conditional probabilities gives 
best performance on the collected judgements using wqt = 2, 
= 0.5, and wf t = 1. However, the editorial data was not 
Model Learning challenges 
substantial enough to learn a ranking with GBDT. 
Click-through Rate versus Click over Expected Click 
From the image search query logs, we collect the user click 
• Editorial preferences not necessarily coincide with usage 
data that is related to the facets. This allows us to compute the 
click-through rate (CTR) on a facet for a given entity that is 
detected in a user query and for which the facets were shown 
– Users click a lot more on people than expected 
– Image bias? 
• Alternative: optimize for usage data 
the user. Let cl ickse,f be the number of clicks on a facet 
– Clicks turned into labels or preferences 
– Size of the data is not a concern 
– Gains are computed from normalized CTR/COEC 
entity f show in relation to entity e, and viewse,f the number 
times the facet f is shown to a user for a related entity e, 
then the probability of a click on a facet entity f for a given 
entity e can be modelled as ctr e,f : 
ctr e,f = 
cl ickse,f 
viewse,f 
– See van Zwol et al. Ranking Entity Facets Based on User Click 
Feedback. ICSC 2010: 192-199. 
(2) 
In Figure 3 the conditional click-through rate is shown for 
first ten positions. It shows theCTR per position for every 
page view where one of the facets is clicked, aggregated over 
P P 
p= 1 viewse,Zhang and Jones [3] refer to expected clicks, based on the denominator expected clicks given the positions C. Gradient Boosted Decision Trees 
Stochastic gradient boosted decision themost widely used learning algorithms today. Gradient tree boosting constructs regres-sion 
model, utilizing decision trees One advantage over other learners trees in general is that the feature are highly interpretable. GBDT different loss functions can be used research presented here we used our loss function. In related work, pairwise and ranking specific loss well at improving search relevance[shallow decision trees, trees in stochastic on a randomly selected subset of prone to over-fitting [14]. For the 196 
shown in the search engine 
the ground truth for creating 
test set used by the gradient 
Conditional Probabilities 
of the facets search expe-rience, 
function rank(e, f ) that is 
conditional probabilities extracted 
wqs⇥Pqs(f |e)+ wf t⇥Pf t (f , e) 
(1) 
e) arethe conditional prob-abilities 
theweights for the different 
qt), query session (qs) and 
editorial judgements collected for 
their facets we find that 
conditional probabilities gives 
judgements using wqt = 2, 
However, the editorial data was not 
ranking with GBDT. 
Click over Expected Click 
logs, we collect the user click 
This allows us to compute the 
all entities. Observe that the CTR declines when the position 
at which a facet is shown increases. 
We introduce a second click model, based on the notion 
of clicks over expected clicks (COEC). To allows us to deal 
with the so called position bias – where facets appearing in 
lower positions are less likely to be clicked even if they are 
relevant [2]. This phenomenon isoften observed inWeb search 
and we adopt the COEC model proposed by Chapelle and 
Zhang [11]. In that model, we estimate ctr p as the aggregated 
ctr – over all queries and sessions – in position p for all 
positions P. Let then cl ickse,f be the number of clicks on 
a facet entity f show in relation to entity e, and viewse,f p the 
number of times the facet f is shown to a user for a related 
entity e at position p. The probability of a click over expected 
click on a facet entity f for a given entity e can then be 
modelled as coece,f : 
coece,f = 
cl ickse,f 
P P 
p= 1 viewse,f p ⇥ ctr p 
(3) 
Zhang and Jones [3] refer to this method as clicks over 
expected clicks, based on the denominator that includes the 
expected clicks given the positions that the url appeared in. 
C. Gradient Boosted Decision Trees 
133 
Stochastic gradient boosted decision trees (GBDT) is one of 
themost widely used learning algorithms in machine learning
Entity 
graph 
Ranking and Disambiguation 
Data 
preprocessing 
Feature 
extraction 
Model 
learning 
Feature 
sources 
Editorial 
judgements 
Datapack 
Ranking 
model 
Ranking and 
disambiguation 
Entity 
data 
Features 
134
Ranking and Disambiguation 
• We apply the ranking function offline to the data 
• Disambiguation 
– How many times a given wiki id was retrieved for queries containing the entity name? 
Brad Pitt Brad_Pitt 21158 
Brad Pitt Brad_Pitt_(boxer) 247 
XXX XXX_(movie) 1775 
XXX XXX_(Asia_album) 89 
XXX XXX_(ZZ_Top_album) 87 
XXX XXX_(Danny_Brown_album) 67 
– PageRank for disambiguating locations (wiki ids are not available) 
• Expansion to query patterns 
– Entity name + context, e.g. brad pitt actor 
135
Ranking and Disambiguation 
challenges 
• Disambiguation cases that are too close to call 
– Fargo Fargo_(film) 3969 
– Fargo Fargo,_North_Dakota 4578 
• Disambiguation across Wikipedia and other 
sources 
136
Evaluation #2: Side-by-side 
testing 
• Comparing two systems 
– A/B comparison, e.g. current system under development and production system 
– Scale: A is better, B is better 
• Separate tests for relevance and image quality 
– Image quality can significantly influence user perceptions 
– Images can violate safe search rules 
• Classification of errors 
– Results: missing important results/contains irrelevant results, too few results, entities 
are not fresh, more/less diverse, should not have triggered 
– Images: bad photo choice, blurry, group shots, nude/racy etc. 
• Notes 
– Borderline, set one entities relate to the movie Psy but the query is most likely about 
Gangnam style 
– Blondie and Mickey Gilley are 70’s performers and do not belong on a list of 60’s 
musicians. 
– There is absolutely no relation between Finland and California. 
137
Evaluation #3: Bucket testing 
• Also called online evaluation 
– Comparing against baseline version of the system 
– Baseline does not change during the test 
• Small % of search traffic redirected to test system, another 
small % to the baseline system 
• Data collection over at least a week, looking for stat. 
significant differences that are also stable over time 
• Metrics in web search 
– Coverage and Click-through Rate (CTR) 
– Searches per browser-cookie (SPBC) 
– Other key metrics should not impacted negatively, e.g. 
Abandonment and retry rate, Daily Active Users (DAU), Revenue 
Per Search (RPS), etc. 
138
Coverage before and after the 
new system 
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 
Days 
Coverage 
Coverage before Spark 
Trend before Spark 
Coverage after Spark 
Trend after Spark 
Spark is deployed 
in production 
Before release: 
Flat, lower 
After release: 
Flat, higher 
139
Click-through rate (CTR) before 
and after the new system 
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 
Before release: 
Gradually 
degrading performance 
due to lack of fresh data 
After release: 
Learning effect: 
users are starting to 
use the tool again 
Days 
CTR 
CTR before Spark 
Trend before Spark 
CTR after Spark 
Trend after Spark 
Spark is deployed 
in production 
140
Summary 
• Spark 
– System for related entity recommendations 
• Knowledge base 
• Extraction of signals from query logs and other user-generated 
content 
• Machine learned ranking 
• Evaluation 
• Other applications 
– Recommendations on topic-entity pages 
141
Future work 
• New query types 
– Queries with multiple entities 
• adele skyfall 
– Question-answering on keyword queries 
• brad pitt movies 
• brad pitt movies 2010 
• Extending coverage 
– Spark now live in CA, UK, AU, NZ, TW, HK, ES 
• Even fresher data 
– Stream processing of query log data 
• Data quality improvements 
• Online ranking with post-retrieval features 
142
TIME AND SEARCH 
143
Web Search and time 
• Information freshness adds constraints/tensions in every 
layer of WSEs 
• Architecture 
• Crawling 
• Indexing 
• Distribution 
• Caching 
• Serving system 
• Modeling 
• Time-dependent user intent 
• UI (how to let the user take control) 
144
High-level Architecture of WSEs 
results 
Runtime system 
Parser/ 
Tokenizer 
Index 
terms 
Cache 
Query 
Engine 
queries 
Indexing pipeline 
WWW 
145
Adding the time dimension 
• Some solutions don’t scale up anymore 
• Review your architecture 
• Review your algorithms 
• Add more machines (~$$$) 
• Some solutions don’t apply anymore 
• Caching 
146
Evolution 
• 1999 
• Index updated ~once per month 
• Disk-based updates/indexing 
• 2001 
• In-memory indexes 
• Changes the whole-game! 
• 2007 
• Indexing time < 1 minute 
• Accept updates while serving 
• Now 
• Focused crawling, delayed transactions, etc. 
• Batch Updates -> Incremental processing 
147
Some landmarks 
• Reliable distributed storage 
• Some models/processes require millions of accesses 
• Massive parallelization 
• Map/Reduce – Hadoop 
• Storm/S4 – Streaming 
• Hadoop 2.0 - YARN! 
• Semi-structured storage systems 
• Asynchronous item updates 
148
What’s going on “right now”? 
149
Query temporal profiles 
• Modeling 
• Time-dependent user intent 
• Implicitly time-qualified search queries 
• SIGIR 
• Dream theater barcelona 
• Barcelona vs Madrid 
• …. 
• Interested in these tasks? 
• Joho et al. A survey of temporal Web experience, WWW 2013 
• Keep an eye on NTCIR next year! 
150
Caching for Real-Time Indexes 
• Queries are redundant (heavy-tail) and bursty 
• Caching search results saves up executing ~30/60% of the queries 
• Tens of machines do the work of 1000s 
• Dilemma: Freshness versus Computation 
• Extreme #1: do not cache at all – evaluate all queries 
• 100% fresh results, lots of redundant evaluations 
• Extreme #2: never invalidate the cache 
• A majority of stale results – results refreshed only 
due to cache replacement, no redundant work 
• Middle ground: invalidate periodically (TTL) 
• A time-to-live parameter is applied to each cached entry 
151 
Blanco et al. SIGIR 2010
CACHING FOR INCREMENTAL 
•Problem: 
•In fast crawling, cache not always up-to-date (stale) 
•Solution: 
• Cache Invalidator Predictor - looks into new documents and invalidates 
queries accordingly 
User 
Queries 
Runtime system 
Index pipeline 
Cache 
Query 
Engine 
Invalidator 
Synopsis 
Generator 
Parser/ Index 
Tokenizer 
Crawled 
Documents 
152 
• Using synopsis reduces the 
number of refreshes up to 30% 
compared to a time-to-live 
baseline 
INDEXES
Time Aware ER 
• In some cases the time dimension is available 
– News collections 
– Blog postings 
• News stories evolve over time 
– Entities appear/disappear 
– Analyze and exploit relevance evolution 
• An Entity Search system can exploit the past to find 
relevant entities 
– De Martini et al. TAER: Time Aware Entity Retrieval. CIKM 
2010, 
153
Ranking Entities Over Time 
Barack Obama 
Naoto Kan 
Barack Obama Yokohama 
Takao Sato 
Michiko Sato 
Kamakura 
Barack Obama 
United States 
154
TIME EXPLORER 
155
Time(ly) opportunities 
Can we create new user experiences based on a deeper analysis and 
exploration of the time dimension? 
• Goals: 
– Build an application that helps users to explore, 
interact and ultimately understand existing 
information about the past and the future. 
– Help the user cope with the information overload 
and eventually find/learn about what she’s looking 
for. 
156
Original Idea 
• R. Baeza-Yates, Searching the Future, MF/IR 2005 
– On December 1st 2003, on Google News, there were more than 100K 
references to 2004 and beyond. 
– E.g. 2034: 
• The ownership of Dolphin Square in London must revert to an insurance company. 
• Voyager 2 should run out of fuel. 
• Long-term care facilities may have to house 2.1 million people in the USA. 
• A human base in the moon would be in operation. 
157
Time Explorer 
• Public demo since August 2010 
• Winner of HCIR NYT Challenge 
• Goal: explore news through time and into 
the future 
• Using a customized Web crawl from news 
and blog feeds 
http://fbmya01.barcelonamedia.org:8080/future/ 
158 
Matthews et al. HCIR 2010
Time Explorer 
159
Time Explorer - Motivation 
 Time is important to search 
 Recency, particularly in news is highly related 
to relevancy 
 But, what about evolution over time? 
 How has a topic evolved over time? 
 How did the entities (people, place, etc) evolve with respect to the topic over 
time? 
 How will this topic continue to evolve over the future? 
 How does bias and sentiment in blogs and news change over time? 
 Google Trends, Yahoo! Clues, RecordedFuture 
… 
 Great research playground 
 Open source! 
160
Time Explorer 
161
Analysis Pipeline 
 Tokenization, Sentence Splitting, Part-of-speech 
tagging, chunking with OpenNLP 
 Entity extraction with SuperSense tagger 
 Time expressions extracted with TimeML 
 Explicit dates (August 23rd, 2008) 
 Relative dates (Next year, resolved with Pub Date) 
 Sentiment Analysis with LivingKnowledge 
 Ontology matching with Yago 
 Image Analysis – sentiment and face detection 
162
Indexing/Search 
• Lucene/Solr search platform to index and search 
– Sentence level 
– Document level 
• Facets for annotations (multiple fields for faster 
entity-type access) 
• Index publication date and content date –extracted 
dates if they exists or publication date 
• Solr Faceting allows aggregation over query entity 
ranking and for aggregating counts over time 
• Content date enables search into the future 
163
164
Timeline 
165
Timeline - Document 
166
Entities (Facets) 
167
Timeline – Entity Trend 
168
Timeline – Future 
169
Opinions 
170
Quotes 
171
Other challenges 
– Large scale processing 
• Distributed computing 
• Shift from batch (Hadoop) to online (Storm) 
– Efficient extraction/retrieval, algorithmic/data 
structures 
• Critical for interactive exploration 
– Connection with the user experience 
• Measures! User engagement? 
– Personalization 
– Integration with Knowledge Bases (Semantic Web) 
– Multilingual 
172

Contenu connexe

Tendances

Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Peter Mika
 
Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012Peter Mika
 
Making things findable
Making things findableMaking things findable
Making things findablePeter Mika
 
Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015Grace Hui Yang
 
Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at YahooPeter Mika
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialPeter Mika
 
Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsPeter Mika
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic SearchPaul Wlodarczyk
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the RisePeter Mika
 
An Introduction to Entities in Semantic Search
An Introduction to Entities in Semantic SearchAn Introduction to Entities in Semantic Search
An Introduction to Entities in Semantic SearchDavid Amerland
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Trey Grainger
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Trey Grainger
 
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationTrey Grainger
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Trey Grainger
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemTrey Grainger
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge GraphTrey Grainger
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Trey Grainger
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchThought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchTrey Grainger
 

Tendances (20)

Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
 
Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012
 
Making things findable
Making things findableMaking things findable
Making things findable
 
Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015
 
Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at Yahoo
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorial
 
Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistants
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic Search
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the Rise
 
An Introduction to Entities in Semantic Search
An Introduction to Entities in Semantic SearchAn Introduction to Entities in Semantic Search
An Introduction to Entities in Semantic Search
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
 
Semantic search
Semantic searchSemantic search
Semantic search
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)
 
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital Transformation
 
Searching Online
Searching OnlineSearching Online
Searching Online
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchThought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered Search
 

En vedette

Solr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW TechnologySolr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW TechnologyLucidworks
 
Greater Halifax Partnership 2010 Annual Report
Greater Halifax Partnership 2010 Annual ReportGreater Halifax Partnership 2010 Annual Report
Greater Halifax Partnership 2010 Annual ReportHalifax Partnership
 
The Fuelwood Market Chain of Kinshasa: Socio-economic and sustainability outc...
The Fuelwood Market Chain of Kinshasa: Socio-economic and sustainability outc...The Fuelwood Market Chain of Kinshasa: Socio-economic and sustainability outc...
The Fuelwood Market Chain of Kinshasa: Socio-economic and sustainability outc...Verina Ingram
 
Pastperfect
PastperfectPastperfect
PastperfectVicky
 
Destination pluto
Destination plutoDestination pluto
Destination plutoLisa Baird
 
Pattern seeking across pictures
Pattern seeking across picturesPattern seeking across pictures
Pattern seeking across picturesRachel Bear
 
A fine mess: Bricolaged forest governance in Cameroon
A fine mess: Bricolaged forest governance in CameroonA fine mess: Bricolaged forest governance in Cameroon
A fine mess: Bricolaged forest governance in CameroonVerina Ingram
 
Responsible Disclosure - For Dutch ISACA chapter
Responsible Disclosure - For Dutch ISACA chapterResponsible Disclosure - For Dutch ISACA chapter
Responsible Disclosure - For Dutch ISACA chapterFrank Breedijk
 
Mayonn Inc Bus Plan Presentation
Mayonn Inc Bus Plan PresentationMayonn Inc Bus Plan Presentation
Mayonn Inc Bus Plan Presentationmayonn
 
D:\Ben\G48 53011810075
D:\Ben\G48 53011810075D:\Ben\G48 53011810075
D:\Ben\G48 53011810075BenjamasS
 
Physical Science: Chapter 3 sec3
Physical Science: Chapter 3 sec3Physical Science: Chapter 3 sec3
Physical Science: Chapter 3 sec3mshenry
 
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法Takeshi Furusato
 
03 dsk dunia muzik tahun 1 bt
03 dsk dunia muzik tahun 1   bt03 dsk dunia muzik tahun 1   bt
03 dsk dunia muzik tahun 1 btNaga Cobra
 
Reported statements
Reported statementsReported statements
Reported statementsVicky
 

En vedette (20)

Solr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW TechnologySolr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW Technology
 
Greater Halifax Partnership 2010 Annual Report
Greater Halifax Partnership 2010 Annual ReportGreater Halifax Partnership 2010 Annual Report
Greater Halifax Partnership 2010 Annual Report
 
The Fuelwood Market Chain of Kinshasa: Socio-economic and sustainability outc...
The Fuelwood Market Chain of Kinshasa: Socio-economic and sustainability outc...The Fuelwood Market Chain of Kinshasa: Socio-economic and sustainability outc...
The Fuelwood Market Chain of Kinshasa: Socio-economic and sustainability outc...
 
Q4 evaluation
Q4 evaluationQ4 evaluation
Q4 evaluation
 
Pastperfect
PastperfectPastperfect
Pastperfect
 
Media evaluation
Media evaluationMedia evaluation
Media evaluation
 
Destination pluto
Destination plutoDestination pluto
Destination pluto
 
Pattern seeking across pictures
Pattern seeking across picturesPattern seeking across pictures
Pattern seeking across pictures
 
Gic2011 aula05-ingles
Gic2011 aula05-inglesGic2011 aula05-ingles
Gic2011 aula05-ingles
 
A fine mess: Bricolaged forest governance in Cameroon
A fine mess: Bricolaged forest governance in CameroonA fine mess: Bricolaged forest governance in Cameroon
A fine mess: Bricolaged forest governance in Cameroon
 
Gic2011 aula8-ingles
Gic2011 aula8-inglesGic2011 aula8-ingles
Gic2011 aula8-ingles
 
Responsible Disclosure - For Dutch ISACA chapter
Responsible Disclosure - For Dutch ISACA chapterResponsible Disclosure - For Dutch ISACA chapter
Responsible Disclosure - For Dutch ISACA chapter
 
Andy warhol
Andy warholAndy warhol
Andy warhol
 
Mayonn Inc Bus Plan Presentation
Mayonn Inc Bus Plan PresentationMayonn Inc Bus Plan Presentation
Mayonn Inc Bus Plan Presentation
 
D:\Ben\G48 53011810075
D:\Ben\G48 53011810075D:\Ben\G48 53011810075
D:\Ben\G48 53011810075
 
Physical Science: Chapter 3 sec3
Physical Science: Chapter 3 sec3Physical Science: Chapter 3 sec3
Physical Science: Chapter 3 sec3
 
Cloud webinar final
Cloud webinar finalCloud webinar final
Cloud webinar final
 
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
 
03 dsk dunia muzik tahun 1 bt
03 dsk dunia muzik tahun 1   bt03 dsk dunia muzik tahun 1   bt
03 dsk dunia muzik tahun 1 bt
 
Reported statements
Reported statementsReported statements
Reported statements
 

Similaire à Mining Web content for Enhanced Search

Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 
Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Peter Mika
 
(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”icwe2015
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialBarbara Starr
 
Brave new search world
Brave new search worldBrave new search world
Brave new search worldvoginip
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the WebPeter Mika
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
Питер Мика "Making the web searchable"
Питер Мика "Making the web searchable"Питер Мика "Making the web searchable"
Питер Мика "Making the web searchable"Yandex
 
Exploratory Search upon Semantically Described Web Data Sources: Service regi...
Exploratory Search upon Semantically Described Web Data Sources: Service regi...Exploratory Search upon Semantically Described Web Data Sources: Service regi...
Exploratory Search upon Semantically Described Web Data Sources: Service regi...Marco Brambilla
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notesAnandh Arumugakan
 
What happened to the Semantic Web?
What happened to the Semantic Web?What happened to the Semantic Web?
What happened to the Semantic Web?Peter Mika
 
Facets and Pivoting for Flexible and Usable Linked Data Exploration
Facets and Pivoting for Flexible and Usable Linked Data ExplorationFacets and Pivoting for Flexible and Usable Linked Data Exploration
Facets and Pivoting for Flexible and Usable Linked Data ExplorationRoberto García
 
Design for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.govDesign for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.govUXPA International
 
Design for Findability at the Library of Congress
Design for Findability at the Library of CongressDesign for Findability at the Library of Congress
Design for Findability at the Library of CongressJill MacNeice
 
Advanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU InvestigatorsAdvanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU InvestigatorsSloan Carne
 
Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Peter Mika
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologiesenterprisesearchmeetup
 
How to Future-proof Your Content by Sarah Beckley
How to Future-proof Your Content by Sarah BeckleyHow to Future-proof Your Content by Sarah Beckley
How to Future-proof Your Content by Sarah BeckleyContent Strategy Workshops
 

Similaire à Mining Web content for Enhanced Search (20)

Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015
 
(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorial
 
Brave new search world
Brave new search worldBrave new search world
Brave new search world
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the Web
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Text Mining : Experience
Text Mining : ExperienceText Mining : Experience
Text Mining : Experience
 
Питер Мика "Making the web searchable"
Питер Мика "Making the web searchable"Питер Мика "Making the web searchable"
Питер Мика "Making the web searchable"
 
Exploratory Search upon Semantically Described Web Data Sources: Service regi...
Exploratory Search upon Semantically Described Web Data Sources: Service regi...Exploratory Search upon Semantically Described Web Data Sources: Service regi...
Exploratory Search upon Semantically Described Web Data Sources: Service regi...
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
What happened to the Semantic Web?
What happened to the Semantic Web?What happened to the Semantic Web?
What happened to the Semantic Web?
 
Facets and Pivoting for Flexible and Usable Linked Data Exploration
Facets and Pivoting for Flexible and Usable Linked Data ExplorationFacets and Pivoting for Flexible and Usable Linked Data Exploration
Facets and Pivoting for Flexible and Usable Linked Data Exploration
 
Design for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.govDesign for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.gov
 
Design for Findability at the Library of Congress
Design for Findability at the Library of CongressDesign for Findability at the Library of Congress
Design for Findability at the Library of Congress
 
Advanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU InvestigatorsAdvanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU Investigators
 
Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologies
 
Riley-o.com
Riley-o.comRiley-o.com
Riley-o.com
 
How to Future-proof Your Content by Sarah Beckley
How to Future-proof Your Content by Sarah BeckleyHow to Future-proof Your Content by Sarah Beckley
How to Future-proof Your Content by Sarah Beckley
 

Plus de Roi Blanco

Entity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationEntity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationRoi Blanco
 
Keyword Search over RDF Graphs
Keyword Search over RDF GraphsKeyword Search over RDF Graphs
Keyword Search over RDF GraphsRoi Blanco
 
Extending BM25 with multiple query operators
Extending BM25 with multiple query operatorsExtending BM25 with multiple query operators
Extending BM25 with multiple query operatorsRoi Blanco
 
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Energy-Price-Driven Query Processing in Multi-center WebSearch EnginesEnergy-Price-Driven Query Processing in Multi-center WebSearch Engines
Energy-Price-Driven Query Processing in Multi-center Web Search EnginesRoi Blanco
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataRoi Blanco
 
Caching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesCaching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesRoi Blanco
 
Finding support sentences for entities
Finding support sentences for entitiesFinding support sentences for entities
Finding support sentences for entitiesRoi Blanco
 

Plus de Roi Blanco (7)

Entity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationEntity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance Minimization
 
Keyword Search over RDF Graphs
Keyword Search over RDF GraphsKeyword Search over RDF Graphs
Keyword Search over RDF Graphs
 
Extending BM25 with multiple query operators
Extending BM25 with multiple query operatorsExtending BM25 with multiple query operators
Extending BM25 with multiple query operators
 
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Energy-Price-Driven Query Processing in Multi-center WebSearch EnginesEnergy-Price-Driven Query Processing in Multi-center WebSearch Engines
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF data
 
Caching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesCaching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental Indices
 
Finding support sentences for entities
Finding support sentences for entitiesFinding support sentences for entities
Finding support sentences for entities
 

Dernier

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Dernier (20)

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

Mining Web content for Enhanced Search

  • 1. Mining Web content for Enhanced Search Roi Blanco Yahoo! Research
  • 2. Yahoo! Research Barcelona • Established January, 2006 • Led by Ricardo Baeza-Yates • Research areas • Web Mining • Social Media • Distributed Web retrieval • Semantic Search • NLP/IR 2
  • 3. Natural Language Retrieval • How to exploit the structure and meaning of natural language text to improve search • Current search engines perform only limited NLP (tokenization, stemming) • Automated tools exist for deeper analysis • Applications to diversity-aware search • Source, Location, Time, Language, Opinion, Ranking… • Search over semi-structured data, semantic search • Roll-out user experiences that use higher layers of the NLP stack 3
  • 4. Agenda • Intro to current status of Web Search • Semantic Search – Annotating documents – Entity linking – Entity search – Applications • Time-aware information access – Motivation – Applications 4
  • 6. Some Challenges • From user understanding to selecting the right hardware/software architecture – Modeling user behavior – Modeling user engagement/return • Providing answers, not links – Richer interfaces • Enriching document structure using semantic annotations – Indexing those annotations – Accessing them in real-time – What are the right annotations? The right level of granularity? Trade-offs? • Time-aware information access – Changes the game in almost every step of the pipeline 6
  • 7. 7
  • 8. 8
  • 9. 9
  • 10. Structured data - Web search Top-1 entity with structured data Related entities Structured data extracted from HTML 10
  • 11. New devices • Different interaction (e.g. voice) • Different capabilities (e.g. display) • More Information (geo-localization) • More personalized 11
  • 13. Semantic Search • What different kinds of search and applications beyond string matching or returning 10 blue links? • Can we have a better understanding of documents and queries? • New devices open new possibilities, new experiences • Is current technology in natural language understanding mature enough? 13
  • 14. Search is really fast, without necessarily being intelligent 14
  • 15. Why Semantic Search? Part I • Improvements in IR are harder and harder to come by – Machine learning using hundreds of features • Text-based features for matching • Graph-based features provide authority – Heavy investment in computational power, e.g. real-time indexing and instant search • Remaining challenges are not computational, but in modeling user cognition – Need a deeper understanding of the query, the content and/or the world at large – Could Watson explain why the answer is Toronto? 15
  • 16. Semantic Search: a definition • Semantic search is a retrieval paradigm that – Makes use of the structure of the data or explicit schemas to understand user intent and the meaning of content – Exploits this understanding at some part of the search process • Emerging field of research – Exploiting Semantic Annotations in Information Retrieval (2008- 2012) – Semantic Search (SemSearch) workshop series (2008-2011) – Entity-oriented search workshop (2010-2011) – Joint Intl. Workshop on Semantic and Entity-oriented Search (2012) – SIGIR 2012 tracks on Structured Data and Entities • Related fields: – XML retrieval, Keyword search in databases, NL retrieval 16
  • 17. What it’s like to be a machine? Roi Blanco 17
  • 18. What it’s like to be a machine? ✜Θ♬♬ţğ ✜Θ♬♬ţğ√∞§®ÇĤĪ✜★♬☐✓✓ ţğ★✜ ✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫ ≠=⅚©§★✓♪ΒΓΕ℠ ✖Γ♫⅜±⏎↵⏏☐ģğğğμλκσςτ ⏎⌥°¶§ΥΦΦΦ✗✕☐ 18
  • 19. Interactive search and task completion 19
  • 20. Why Semantic Search? Part II • The Semantic Web is here – Data • Large amounts of RDF data • Heterogeneous schemas • Diverse quality – End users • Not skilled in writing complex queries (e.g. SPARQL) • Not familiar with the data • Novel applications – Complementing document search • Rich Snippets, related entities, direct answers – Other novel search tasks 20
  • 21. Other novel applications • Aggregation of search results – e.g. price comparison across websites • Analysis and prediction – e.g. world temperature by 2020 • Semantic profiling – Ontology-based modeling of user interests • Semantic log analysis – Linking query and navigation logs to ontologies • Task completion – e.g. booking a vacation using a combination of services • Conversational search 21
  • 22. Common tasks in Semantic Search semantic search list search entity search SemSearch 2010/11 related entity finding list completion SemSearch 2011 TREC REF-LOD task TREC ELC task 22
  • 23. Is NLU that complex? ”A child of five would understand this. Send someone to fetch a child of five”. Groucho Marx 23
  • 24. Some examples… • Ambiguous searches – paris hilton • Multimedia search – paris hilton sexy • Imprecise or overly precise searches – jim hendler – pictures of strong adventures people • Searches for descriptions – 33 year old computer scientist living in barcelona – reliable digital camera under 300 dollars • Searches that require aggregation – height eiffel tower – harry potter movie review – world temperature 2020 24
  • 25. Language is Ambiguous The man saw the girl with the telescope 25
  • 26. Paraphrases • ‘This parrot is dead’ • ‘This parrot has kicked the bucket’ • ‘This parrot has passed away’ • ‘This parrot is no more' • 'He's expired and gone to meet his maker,’ • 'His metabolic processes are now history’ 26
  • 28. Semantics at every step of the IR process bla bla bla? bla bla bla The IR engine The Web Query interpretation q=“bla” * 3 Document processing bla bla bla bla bla bla Indexing Ranking θ(q,d) “bla” Result presentation 28
  • 29. Understanding Queries • Query logs are a big source of information & knowledge To rank better the results (what you click) To understand queries better Paris Paris Flights Paris Paris Hilton 29
  • 30. “Understand” Documents NLU is still an open issue 30
  • 31. NLP for IR • Full NLU is AI complete, not scalable to the web size (parsing the web is really hard). • BUT … what about other shallow NLP techniques? • Hypothesis/Requirements: • Linear extraction/parsing time • Error-prone output (e.g. 60-90%) • Highly redundant information • Explore new ways of browsing • Support your answers 31
  • 32. Usability We also fail at using the technology Sometimes 32
  • 33. Support your answers Errors happen: choose the right ones! • Humans need to “verify” unknown facts • Multiple sources of evidence • Common sense vs. Contradictions • are you sure? is this spam? Interesting! • Tolerance to errors greatly increases if users can verify things fast • Importance of snippets, image search • Often the context is as important as the fact • E.g. “S discovered the transistor in X” • There are different kinds of errors • Ridiculous result (decreases overall confidence in system) • Reasonably wrong result (makes us feel good) 33
  • 34. TOOLS FOR GENERATING ANNOTATIONS 34
  • 35. Entity? • Uniquely identifiable “thing” or “object” – “A thing with a distinct and independent existence” • Properties: – ID – Name(s) – Type(s) – Attributes (/Descriptions) – Relationships to other entities • An entity is only one kind of textual annotation! 35
  • 37. Named Entity Recognition • “Named Entity”, IE context of MUC-6 (R. Grishman & Sundheim’1996) • Recognize information units like names, including PERson, ORGanization, LOCation names, and numeric expressions including time, date, money [Yahoo!] employee [Roi Blanco] visits [Sapporo] . ORG PER LOC 37
  • 38. ML methods • Supervised • HMM • CRF • SVM • … • Unsupervised • Lexical Resources (e.gWordNet), lexical patterns and statistics from large non-annotated corpora. • “Semi-supervised” (or “weakly supervised”) • Mostly “Bootstrapping” from a set of entities 38
  • 39. Entity Linking Image taken from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM '08. 39
  • 40. NE Linking / NED • Linking mentions of Named Entities to a Knowledge Base (e.g. Wikipedia, Freebase …) • Name Variations • Shortened forms • Alternative Spellings • “New” Entities (not contained in the Knowledge Base) • As the Knowledge Base grows … everything looks like an Entity (a town named web!?) 40
  • 41. Entity Ambiguity • In Wikipedia Michael Jordan may also refer to: – Michael Jordan (mycologist) – Michael Jordan (footballer) – Michael Jordan (Irish politician) – Michael B. Jordan (born 1987), American actor – Michael I. Jordan (born 1957), American researcher – Michael H. Jordan (1936–2010), American executive – Michael-Hakim Jordan (born 1977), basketball player 41
  • 42. Traditional IE • Relations: assertions linking two or more concepts – actors-act in-movies – cities-capital of-countries • Facts: instantiations of relations – leonardo dicaprio-act in-inception – cairo-capital of-egypt • Attributes: facts capturing quantifiable properties – actors --> awards, birth date, height – movies --> producer, release date, budget 42
  • 43. ENTITY LINKING 43 http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
  • 44. Image taken from Mihalcea and Csomai (2007). Wikify!: linking documents to encyclopedic knowledge. In CIKM '07. 44
  • 45. 45
  • 52. Why do we need entity linking? • (Automatic) document enrichment – go-read-here – assistance for (Wikipedia) editors – inline (microformats, RDFa) 52
  • 53. Why do we need entity linking? • “Use as feature” – to improve • classification • retrieval • word sense disambiguation • semantic similarity • ... – dimensionality reduction (as compared to, e.g., term vectors) 53
  • 54. Why do we need entity linking? • Enable – semantic search – advanced UI/UX – ontology learning, KB population – ... 54
  • 55. A bit of history • Text classification • NER • WSD • NED – {person name, geo, movie name, ...} disambiguation – (Cross-document) co-reference resolution • Entity linking 55
  • 56. Entity linking? • NE normalization / canonicalization / sense disambiguation • DB record linkage / schema mapping • Knowledge base population • Entity linking – D2W – Wikification – Semantic linking 56
  • 57. Entity Linking: main problem • Linking free text to entities – Entities (typically) taken from a knowledge base • Wikipedia • Freebase • ... – Any piece of text • news documents • blog posts • tweets • queries • ... 57
  • 58. Typical steps 1. Determine “linkable” phrases – mention detection – MD 2. Rank/Select candidate entity links – link generation – LG – may include NILs (null values, i.e., no target in KB) 3. (Use “context” to disambiguate/filter/improve) – disambiguation – DA 58
  • 59. Preliminaries • Wikipedia • Wikipedia-based measures – commonness – relatedness – keyphraseness 59
  • 60. Wikipedia • Basic element: article (proper) • But also – redirect pages – disambiguation pages – category/template pages – admin pages • Hyperlinks – use “unique identifiers” (URLs) • [[United States]] or [[United States|American]] • [[United States (TV series)]] or [[United States (TV series)|TV show]] 60
  • 61. Wikipedia style guidelines • “the lead contains a quick summary of the topic's most important points, and each major subtopic is detailed in its own section of the article” – “The lead section (also known as the lead, introduction or intro) of a Wikipedia article is the section before the table of contents and the first heading. The lead serves as an introduction to the article and a summary of its most important aspects.” See http://en.wikipedia.org/wiki/Wikipedia:Summary_style 61
  • 62. Some statistics • WordNet – 80k entity definitions – 115k surface forms – 142k senses (entity - surface form combinations) • Wikipedia (only) – ~4M entity definitions – ~12M surface forms – ~24M senses 62
  • 63. Wikipedia-based measures • keyphraseness(w) [Mihalcea & Csomai 2007] Collection frequency term w as a link to another Wikipedia article Collection frequency term w 63
  • 64. Wikipedia-based measures • commonness(w,c) [Medelyan et al. 2008] Number of links with target c’ and anchor text w 64
  • 65. Commonness and keyphraseness Image taken from Li et al. (2013). TSDW: Two-stage word sense disambiguation using Wikipedia. In JASIST 2013. 65
  • 66. Wikipedia-based measures • relatedness(c, c’) [Milne & Witten 2008a] Image taken from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In AAAI WikiAI Workshop. 66
  • 67. Wikipedia-based measures • relatedness(c, c’) [Milne & Witten 2008a] Total number of Wikipedia articles Intersection of inlinks with target c and c’ Number of links with target c 67
  • 68. Recall the steps 1. mention detection – MD 2. link generation – LG 3. (disambiguation) – DA 68
  • 69. Wikify! [Mihalcea & Csomai 2007] • MD – tf.idf, Χ2, keyphraseness • LG 1.Overlap between definition (Wikipedia page) and context (paragraph) [Lesk 1986] 2.Naive Bayes [Mihalcea 2007] • context, POS, entity-specific terms 3.Voting between (1) and (2) 69
  • 70. Large-Scale Named Entity Disambiguation Based on Wikipedia Data [Cucerzan 2007] • Key intuition: leverage context links – '''Texas''' is a [[pop music]] band from [[Glasgow]], [[Scotland]], [[United Kingdom]]. They were founded by [[Johnny McElhone]] in [[1986 in music|1986]] and had their performing debut in [[March]] [[1988]] at ... • Prune the candidates, keep only: – appearances in the first paragraph of an article, and – reciprocal links 70
  • 71. Large-Scale Named Entity Disambiguation Based on Wikipedia Data [Cucerzan 2007] • MD – NER; rule-based; co-ref resolution • LG – Represent entities as vectors • context, categories – Same for all candidate entity links – Determine maximally coherent set 71
  • 72. Topic Indexing with Wikipedia [Medelyan et al. 2008] • MD – keyphraseness [Mihalcea & Csomai 2007] • LG • combination of average relatedness & commonness • LG/DA • Naive Bayes • TF.IDF, position, length, degree, weighted keyphraseness 72
  • 73. Learning to Link with Wikipedia [Milne & Witten 2008b] • Key idea: disambiguation informs detection – compare each possible sense with its relatedness to the context sense candidates – start with unambiguous senses – So, first LG, then base MD on these results 73
  • 74. Learning to Link with Wikipedia [Milne & Witten 2008b] Image taken from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM '08. 74
  • 75. Learning to Link with Wikipedia [Milne & Witten 2008b] • Filter non-informative, non-ambiguous candidates (e.g., “the”) – based on keyphraseness, i.e., link probability • Filter non-central candidates – based on average relatedness to all other context senses • Combine 75
  • 76. Learning to Link with Wikipedia [Milne & Witten 2008b] • MD • Machine learning • link probability, relatedness, confidence of LG, generality, frequency, location, spread • LG • Machine learning • keyphraseness, average relatedness, sum of average weights 76
  • 77. Image taken from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM '08. 77
  • 78. Local and Global Algorithms for Disambiguation to Wikipedia [Ratinov et al. 2011] • Explicit focus on global versus local algorithms – “Global,” i.e., disambiguation of the candidate graph – NP-hard • Optimization – reduce the search space to a “disambiguation context,” e.g., • all plausible disambiguations [Cucerzan 2007] • unambiguous surface forms [Milne & Witten 2008b] 78
  • 79. Local and Global Algorithms for Disambiguation to Wikipedia [Ratinov et al. 2011] • Main contribution, in steps 1.use “local” approach (e.g., commonness) to generate a disambiguation context 2.apply “global” machine learning approach • relatedness, PMI • {inlinks, outlinks} in various combinations (c and c’) • {avg, max} • Finally, apply another round of machine learning 79
  • 80. TAGME: On-the-fly Annotation of Short Text Fragments [Ferragina & Scaiella 2010] • MD – keyphraseness [Mihalcea & Csomai 2007] • LG • use “local” approach to generate a disambiguation context, similar to [Ratinov et al. 2011] • Heavy pruning • mentions; candidate links; coherence • Accessible at http://tagme.di.unipi.it 80
  • 81. A Graph-based Method for Entity Linking [Guo et al. 2011] • MD – rule-based; prefer longer links – generate a disambiguation context • LG • (weighted interpolation of) in- and outdegree in disambiguation context to select entity links • edges defined by wikilinks • Evaluation on TAC KBP 81
  • 82. Recap • Essential ingredients – MD • commonness • keyphraseness – LG • commonness • machine learning – DA • relatedness • machine learning 82
  • 83. SEARCH OVER ANNOTATED DOCUMENTS 83
  • 84. Annotated documents Barack Obama visited Tokyo this Monday as part of an extended Asian trip. He is expected to deliver a speech at the ASEAN conference next Tuesday 20 May 2009 28 May 2009 Barack Obama visited Tokyo this Monday as part of an extended Asian trip. He is expected to deliver a speech at the ASEAN conference next Tuesday 84
  • 85. 85
  • 86. How does it work? Monty Python Inverted Index (sentence/doc level) Forward Index (entity level) Flying Circus John Cleese Brian 86
  • 87. Efficient element retrieval • Goal – Given an ad-hoc query, return a list of documents and annotations ranked according to their relevance to the query • Simple Solution – For each document that matches the query, retrieve its annotations and return the ones with the highest counts • Problems – If there are many documents in the result set this will take too long - too many disk seeks, too much data to search through – What if counting isn’t the best method for ranking elements? • Solution – Special compressed data structures designed specifically for annotation retrieval 87
  • 88. Forward Index • Access metadata and document contents – Length, terms, annotations • Compressed (in memory) forward indexes – Gamma, Delta, Nibble, Zeta codes (power laws) • Retrieving and scoring annotations – Sort terms by frequency • Random access using an extra compressed pointer list (Elias-Fano) 88
  • 89. Parallel Indexes • Standard index contains only tokens • Parallel indices contain annotations on the tokens – the annotation indices must be aligned with main token index • For example: given the sentence “New York has great pizza” where New York has been annotated as a LOCATION – Token index has five entries (“new”, “york”, “has”, “great”, “pizza”) – The annotation index has five entries (“LOC”, “LOC”, “O”,”O”,”O”) Can optionally encode BIO format (e.g. LOC-B, LOC-I) • To search for the New York location entity, we search for: “token:New ^ entity:LOC token:York ^ entity:LOC” 89
  • 90. Parallel Indices (II) Doc #3: The last time Peter exercised was in the XXth century. Doc #5: Hope claims that in 1994 she run to Peter Town. Peter  D3:1, D5:9 Town  D5:10 Hope  D5:1 1994  D5:5 … Possible Queries: “Peter AND run” “Peter AND WNS:N_DATE” “(WSJ:CITY ^ *) AND run” “(WSJ:PERSON ^ Hope) AND run” WSJ:PERSON  D3:1, D5:1 WSJ:CITY  D5:9 WNS:V_DATE  D5:5 (Bracketing can also be dealt with) 90
  • 93. Entity Ranking • Given a topic, find relevant entities • Evaluated in TREC and INEX campaigns • Most well-known: people and expert search • Many other applications: dates, events, locations, companies, ... 93
  • 94. 94
  • 95. Example queries • Impressionist art Museums in Holland • Countries with the Euro currency • German car manufacturers • Artists related to Pablo Picasso • Actors who played Hamlet • English monarchs who married a French women • Many examples on http://www.ins.cwi.nl/projects/inex-xer/topics/ 95
  • 96. Entity Ranking • Topical query Q – Entity (result) type TX – A list of Entity instance Xs • Systems employ categories, structure, links: – Kaptein et al., CIKM10: Exploits Wikipedia, identifies entity types, anchor text index for entity search – Bron et al,CIKM10 Entity co‐occurrence, Entity type filtering, Context (relation type) • See http://ilps.science.uva.nl/trec-entity/resources/ • Futher: http://ejmeij.github.io/entity-linking-and-retrieval- tutorial/ 96
  • 97. Element Containment Graph [Zaragoza et al. CIKM 2007] • Given passage s in S, element e in E, there is a directed edge from s to e if passage s contains element e • Element containment graph C = S x E where Cij is strength of connection between si and ej At query time, compute Cq as a subset of the containment graph of the q passages matching the query Rank entities using HITS, PageRank, passage similarity, element similarity, etc 97
  • 98. 98 SEARCH OVER RDF (TRIPLES) DATA
  • 99. Resource Description Framework (RDF) • Each resource (thing, entity) is identified by a URI – Globally unique identifiers – Locators of information • Data is broken down into individual facts – Triples of (subject, predicate, object) • A set of triples (an RDF graph) is published together in an RDF document example:roi “Roi Blanco” type name foaf:Person RDF document
  • 100. Linking resources example:roi “Roi Blanco” name foaf:Person sameAs example:roi2 worksWith example:peter type email “pmika@yahoo-inc.com” type Roi’s homepage Yahoo Friend-of-a-Friend ontology
  • 101. Publishing RDF • Linked Data – Data published as RDF documents linked to other RDF documents – Community effort to re-publish large public datasets (e.g. Dbpedia, open government data) • RDFa – Data embedded inside HTML pages – Recommended for site owners by Yahoo!, Google, Facebook • SPARQL endpoints – Triple stores (RDF databases) that can be queried through the web
  • 102. The state of Linked Data • Rapidly growing community effort to (re)publish open datasets as Linked Data – In particular, scientific and government datasets – see linkeddata.org • Less commercial interest, real usage (Haas et al. SIGIR 2011)
  • 103. Glimmer: Architecture overview Doc 1. Download, uncompress, convert (if needed) 2. Sort quads by subject 3. Compute Minimal Perfect Hash (MPH) map map reduce reduce map reduce Index 3. Each mapper reads part of the collection 4. Each reducer builds an index for a subset of the vocabulary 5. Optionally, we also build an archive (forward-index) 5. The sub-indices are merged into a single index 6. Serving and Ranking https://github.com/yahoo/Glimmer 103
  • 104. Horizontal index structure • One field per position – one for object (token), one for predicates (property), optionally one for context • For each term, store the property on the same position in the property index – Positions are required even without phrase queries • Query engine needs to support fields and the alignment operator  Dictionary is number of unique terms + number of properties  Occurrences is number of tokens * 2 104
  • 105. Vertical index structure • One field (index) per property • Positions are not required • Query engine needs to support fields  Dictionary is number of unique terms  Occurrences is number of tokens ✗ Number of fields is a problem for merging, query performance • In experiments we index the N most common properties 105
  • 106. Efficiency improvements • r-vertical (reduced-vertical) index – One field per weight vs. one field per property – More efficient for keyword queries but loses the ability to restrict per field – Example: three weight levels • Pre-computation of alignments – Additional term-to-field index – Used to quickly determine which fields contain a term (in any document) 106
  • 107. Run-time efficiency • Measured average execution time (including ranking) – Using 150k queries that lead to a click on Wikipedia – Avg. length 2.2 tokens – Baseline is plain text indexing with BM25 • Results – Some cost for field-based retrieval compared to plain text indexing – AND is always faster than OR • Except in horizontal, where alignment time dominates – r-vertical significantly improves execution time in OR mode AND mode OR mode plain text 46 ms 80 ms horizontal 819 ms 847 ms vertical 97 ms 780 ms r-vertical 78 ms 152 ms 107
  • 108. BM25F Ranking BM25(F) uses a term-frequency (tf) that accounts for the decreasing marginal contribution of terms where vs is the weight of the field tfsi is the frequency of term i in field s Bs is the document length normalization factor: ls is the length of field s avls is the average length of s bs is a tunable parameter 108
  • 109. BM25F ranking (II) • Final term score is a combination of tf and idf where k1 is a tunable parameter wIDF is the inverse-document frequency: • Finally, the score of a document D is the sum of the scores of query terms q 109
  • 110. Effectiveness evaluation • Semantic Search Challenge 2010 – Data, queries, assessments available online • Billion Triples Challenge 2009 dataset • 92 entity queries from web search – Queries where the user is looking for a single entity – Sampled randomly from Microsoft and Yahoo! query logs • Assessed using Amazon’s Mechanical Turk – Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010 – Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR2011 110
  • 112. Effectiveness results • Individual features – Positive, stat. significant improvement from each feature – Even a manual classification of properties and domains helps • Combination – Positive stat. significant marginal improvement from each additional feature – Total improvement of 53% over the baseline – Different signals of relevance 112
  • 113. Conclusions • Indexing and ranking RDF data – Novel index structures – Ranking method based on BM25F • Future work – Ranking documents with metadata • e.g. microdata/RDFa – Exploiting more semantics • e.g. sameAs – Ranking triples for display – Question-answering 113
  • 114. Entity Ranking Conclusions • Entity ranking provides a richer user experience for search-triggered applications • Ranking entities can be done effectively/efficiently using simple frequency-based models • Explaining things is as important as retrieving them (Blanco and Zaragoza SIGIR 2010) – Support sentences: several features based on scores of sentences and entities – The role of the context boosts performance 114
  • 115. Related entity ranking in web search
  • 116. Spark: related entity recommendations in web search • A search assistance tool for exploration • Recommend related entities given the user’s current query – Cf. Entity Search at SemSearch, TREC Entity Track • Ranking explicit relations in a Knowledge Base – Cf. TREC Related Entity Finding in LOD (REF-LOD) task • A previous version of the system live since 2010 • van Zwol et al.: Faceted exploration of image search results. WWW 2010: 961- 970 – The current version (described here) at • Blanco et al: Entity recommendations in Web Search, ISWC 2013 116
  • 117. Motivation • Some users are short on time – Need for direct answers – Query expansion, question-answering, information boxes, rich results… • Other users have time at their hand – Long term interests such as sports, celebrities, movies and music – Long running tasks such as travel planning 117
  • 119. Spark in use 119
  • 120. How does it work? /user/torzecn/ Shared/Entity- Relationship_Graphs Yalinda feed parser (GFFeedMapper, GFFeedReducer) $VIS_GRID_HOME/ gfstorageall/{1,2,3}/ yalinda/phyfacet/data /projects/gridfaces/ feed/yahooomg/ Y! OMG feed parser (GFFeedMapper, GFFeedReducer) $VIS_GRID_HOME/ gfstorageall/6/yahooomg/ phyfacet/data /projects/gridfaces/ feed/geo Y! Geo feed parser (GFFeedMapper, GFFeedReducer) $VIS_GRID_HOME/ gfstorageall/10/geo/ phyfacet/data /projects/gridfaces/ feed/yahootv Y! TV feed parser (GFFeedMapper, GFFeedReducer) $VIS_GRID_HOME/ gfstorageall/5 /projects/gridfaces/ feed/editorialdata/ data/objects Editorial feed parser (GFFeedMapper, GFEditorialMergeReducer) $VIS_GRID_HOME/ gfstorageall/logicalobj/ dataArchive/ $VIS_GRID_HOME/ gfstorageall/ logicalfacet/data $VIS_GRID_HOME/ gfstorageall/logicalobj/ data /projects/gridfaces/ feed/editorialdata/ data/facets Activated Deactivated Empty/missing GFDumpMain-First (DumpFirstMapper, DumpFirstReducer) $VIS_GRID_HOME/ rankinputTmp $VIS_GRID_HOME/ rankinput STEP 1: FEED PARSING PIPELINE GFDumpMain-Second (DumpSecondMapper, DumpSecondReducer) STEP 2: PREPROCESSING PIPELINE (before feature extraction and ranking) $VIS_GRID_HOME/ rankinput CreateDictionary (DictionaryMapper, DistinctReducer) $VIS_GRID_HOME/ rankprepout/ dictionary /data/SDS/data/ search_US CreateIntermediate LogFormat (NullValueSillyMapp er, DistinctReducer) $VIS_GRID_HOME/ rankprepout/spark/ logs/week{WNo} CreateQuerySessions CommonModelFilter (FiltererMapper, FiltererReducer) $VIS_GRID_HOME/ rankprepout/tmp/ qsessionsfilterer CreateQuerySessions CommonModel (SillyMapper, QuerySessionsCreate SessionsReducer) $VIS_GRID_HOME/ rankprepout/tmp/ qsessions_cm CreateQueryTermsJoin DictionaryAndLogs (LeftJoinMapper, LeftJoinReducer) $VIS_GRID_HOME/ rankprepout/tmp/ qtermsjoindictionarylog CreateQueryTermsJoin QTermsAndLogs (SillyMapper, ImplodeReducer) $VIS_GRID_HOME/ rankprepout/tmp/ qterms_cm /projects/gridfaces/ flickr/feed CreateFlickrIntermediate LogFormat (FlickrReformatMapper) $VIS_GRID_HOME/ rankprepout/tmp/ flickrtagsintermediate CreateFlickrTagsCommon ModelFilter (FiltererMapper, FiltererReducer) $VIS_GRID_HOME/ rankprepout/tmp/ flickrtagsfilterer CreateFlickrTagsCommon ModelFilter (SillyMapper, ImplodeReducer) $VIS_GRID_HOME/ rankprepout/tmp/ flickr_cm General Query logs Flickr Twitter /projects/rtds/twitter/ firehose CreateTwitterIntermediate LogFormat (NullValueSillyMapper, DistinctReducer) $VIS_GRID_HOME/ rankprepout/spark/ twitter_logs/week{WNo} CreateTweetsCommon ModelFilter (FiltererMapper, FiltererReducer) $VIS_GRID_HOME/ rankprepout/tmp/ tweetsfilterer CreateTweetsCommon Model (SillyMapper, ImplodeReducer) $VIS_GRID_HOME/ rankprepout/tmp/ tweets_cm DistinctUsersTwitter (SillyMapper, DistinctReducer) $VIS_GRID_HOME/ rankprepout/tmp/tweets/ distinctusers DistinctUsersflickr (SillyMapper, DistinctReducer) $VIS_GRID_HOME/ rankprepout/tmp/flickr/ distinctusers DistinctUsersQSessions (SillyMapper, DistinctReducer) $VIS_GRID_HOME/ rankprepout/tmp/ qsessions/distinctusers DistinctUsersQTerms (SillyMapper, DistinctReducer) $VIS_GRID_HOME/ rankprepout/tmp/qterms/ distinctusers CountUsers (CounterMapper, CounterReducer) $VIS_GRID_HOME/ rankprepout/tmp/ distinctusers CountEvents (CounterMapper, CounterReducer) $VIS_GRID_HOME/ rankprepout/tmp/ countevents STEP 3a: FEATURE EXTRACTION (QUERY TERMS) PIPELINE $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ probability $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ qterms/queryfacet $VIS_GRID_HOME/ rankprepout/tmp/ countusers $VIS_GRID_HOME/ rankprepout/tmp/ qterms_cm EventProbabilityQTerms (ProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/tmp/ countevents ConditionalUserProbability 2_3-qt (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserproba bility1/qterms EventConditionalProbabilityQterms (ConditionalProbabilityMapper, ConditionalProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ conditionalprobability Extracted features Others EventJointProbabilityQterms (JointProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ jointprobability EventJointUserProbabilityQTerms (JointUserProbabilityMapper, JointUserProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ jointuserprobability EventConditionalUserProb abilityQterms (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ qterms/query EventConditionalUser Probability3_qterms (SillyMapper, ConditionalUserProba bilityReducer) $VIS_GRID_HOME/ rankprepout/spark/qterms/ week{WNo}/ conditionaluserprobability EventEntropyQTerms (EntropyMapper, EntropyReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ entropy EventPMI1QTerms (SillyMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ pmiqterms EventPMIQTerms (PMIMapper, PMIReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ pmi EventKLDivergenceUnary1QTerms (KLDivergenceUnaryJoinerMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ klunaryqterms EventKLDivergenceUnaryQTerms (SillyMapper, KLDivergenceUnaryReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ kldivergenceunary EventCosineSimilarityQTerms (PMIMapper, CosineSimilarityReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ cosinesimilarity STEP 3c: FEATURE EXTRACTION (FLICKR TAGS) PIPELINE $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ probability $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ flickr/queryfacet $VIS_GRID_HOME/ rankprepout/tmp/ countusers $VIS_GRID_HOME/ rankprepout/tmp/ flickr_cm EventProbabilityFlickr (ProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/tmp/ countevents ConditionalUserProbability 2_3-fl (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserproba bility1/flickr EventConditionalProbabilityFlickr (ConditionalProbabilityMapper, ConditionalProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ conditionalprobability Extracted features Others EventJointProbabilityFlickr (JointProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ jointprobability EventJointUserProbabilityFlickr (JointUserProbabilityMapper, JointUserProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ jointuserprobability EventConditionalUserProb abilityFlickr (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ flickr/query EventConditionalUser Probability3_flickr (SillyMapper, ConditionalUserProba bilityReducer) $VIS_GRID_HOME/ rankprepout/spark/flickr/ week{WNo}/ conditionaluserprobability EventEntropyFlickr (EntropyMapper, EntropyReducer) $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ entropy EventPMI1Flickr (SillyMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ pmiflickr EventPMIFlickr (PMIMapper, PMIReducer) $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ pmi EventKLDivergenceUnary1Flickr (KLDivergenceUnaryJoinerMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ klunaryflickr EventKLDivergenceUnaryFlickr (SillyMapper, KLDivergenceUnaryReducer) $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ kldivergenceunary EventCosineSimilarityFlickr (PMIMapper, CosineSimilarityReducer) $VIS_GRID_HOME/ rankprepout/spark/flickr/ week{WNo}/ cosinesimilarity STEP 3d: FEATURE EXTRACTION (TWEETS) PIPELINE $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ probability $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ tweets/queryfacet $VIS_GRID_HOME/ rankprepout/tmp/ countusers $VIS_GRID_HOME/ rankprepout/tmp/ tweets_cm EventProbabilityTweets (ProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/tmp/ countevents ConditionalUserProbability 2_3-tw (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserproba bility1/tweets EventConditionalProbabilityTweets (ConditionalProbabilityMapper, ConditionalProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ conditionalprobability Extracted features Others EventJointProbabilityTweets (JointProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ jointprobability EventJointUserProbabilityTweets (JointUserProbabilityMapper, JointUserProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ jointuserprobability EventConditionalUserProb abilityTweets (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ tweets/query EventConditionalUser Probability3_tweets (SillyMapper, ConditionalUserProba bilityReducer) $VIS_GRID_HOME/ rankprepout/spark/tweets/ week{WNo}/ conditionaluserprobability EventEntropyTweets (EntropyMapper, EntropyReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ entropy EventPMI1Tweets (SillyMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ pmitweets EventPMITweets (PMIMapper, PMIReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ pmi EventKLDivergenceUnary1Tweets (KLDivergenceUnaryJoinerMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ klunarytweets EventKLDivergenceUnary1Tweets (SillyMapper, KLDivergenceUnaryReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ kldivergenceunary EventCosineSimilarityTweets (PMIMapper, CosineSimilarityReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ cosinesimilarity graph_sharedconnect_1 (GRSharedConnectMapper, GRSharedConnectReducer) (MergeFeaturesMapper, SymmetricFeatureMergerReducer) STEP 4a: FEATURE MERGING (QUERY TERMS) PIPELINE $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ entropy $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ probability unaryfeaturemerger1_qterms (MergeFeaturesMapper, UnaryFeatureMergerReducer) join_graph $VIS_GRID_HOME/ rankinput $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary1_qterms $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ jointprobability $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ jointuserprobability $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ pmi $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ conditionaluserprobability $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ conditionalprobability $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ kldivergenceunary $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ cosinesimilarity unaryfeaturemerger2_qterms (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary2_qterms symmetricfeaturemerger_qterms (MergeFeaturesMapper, SymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ symmetric_qterms asymmetricfeaturemerger_qterms (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) Features Others $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ asymmetric_qterms reverseasymmetricfeaturemerger_qterms (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOM E/rankprepout/tmp/ statsmerge/ reverseasymetric_ qterms STEP 4b: FEATURE MERGING (QUERY SESSIONS) PIPELINE $VIS_GRID_HOME /rankprepout/spark/ qsessions/ week{WNo}/entropy $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/ probability unaryfeaturemerger1_qsessions (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankinput $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary1_qsessions $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/ jointprobability $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/ jointuserprobability $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/pmi $VIS_GRID_HOME/ rankprepout/spark/ qsessions/week{WNo}/ conditionaluserprobability $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/ conditionalprobability $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/ kldivergenceunary $VIS_GRID_HOME/ rankprepout/spark/ qsessions/week{WNo}/ cosinesimilarity unaryfeaturemerger2_qsessions (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary2_qsessions symmetricfeaturemerger_qsessions (MergeFeaturesMapper, SymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ symmetric_qsessions asymmetricfeaturemerger_qsessions (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) Features Others $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ asymmetric_qsessions reverseasymmetricfeaturemerger_qsessions (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOM E/rankprepout/tmp/ statsmerge/ reverseasymetric_ qsessions STEP 4c: FEATURE MERGING (FLICKR TAGS) PIPELINE $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ entropy $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ probability unaryfeaturemerger1_flickr (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankinput $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary1_flickr $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ jointprobability $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ jointuserprobability $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/pmi $VIS_GRID_HOME/ rankprepout/spark/flickr/ week{WNo}/ conditionaluserprobability $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ conditionalprobability $VIS_GRID_HOME/ rankprepout/spark/ flickr/week{WNo}/ kldivergenceunary $VIS_GRID_HOME/ rankprepout/spark/flickr/ week{WNo}/ cosinesimilarity unaryfeaturemerger2_flickr (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary2_flickr symmetricfeaturemerger_flickr (MergeFeaturesMapper, SymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ symmetric_flickr asymmetricfeaturemerger_flickr (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) Features Others $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ asymmetric_flickr reverseasymmetricfeaturemerger_flickr (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOM E/rankprepout/tmp/ statsmerge/ reverseasymetric_f lickr STEP 4d: FEATURE MERGING (TWEETS) PIPELINE $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ entropy $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ probability unaryfeaturemerger1_tweets (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankinput $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary1_tweets $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ jointprobability $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ jointuserprobability $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ pmi $VIS_GRID_HOME/ rankprepout/spark/tweets/ week{WNo}/ conditionaluserprobability $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ conditionalprobability $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ kldivergenceunary $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ cosinesimilarity unaryfeaturemerger2_tweets (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary2_tweets symmetricfeaturemerger_tweets (MergeFeaturesMapper, SymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ symmetric_tweets asymmetricfeaturemerger_tweets (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) Features Others $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ asymmetric_tweets reverseasymmetricfeaturemerger_tweets (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOM E/rankprepout/tmp/ statsmerge/ reverseasymetric_t weets STEP 5a: FEATURE EXTRACTION AND MERGING (COMBINED FEATURES) PIPELINE $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ reverseasymetric_qsessions combinedfeaturem erger_qsessions (CombinedFeature MergerMapper) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ combined_qsessions $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ reverseasymetric_qterms combinedfeaturem erger_qterms (CombinedFeature MergerMapper) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ combined_qterms $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ reverseasymetric_flickr combinedfeaturem erger_flickr (CombinedFeature MergerMapper) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ combined_flickr $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ reverseasymetric_tweets combinedfeaturem erger_tweets (CombinedFeature MergerMapper) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ combined_tweets joinfeatures (JoinerMapper, JoinerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeatures3 STEP 5b: FEATURE EXTRACTION AND MERGING (GRAPH) PIPELINE Features Others $VIS_GRID_HOME /rankprepout/tmp/ statsmerge/ joinfeatures3 $VIS_GRID_HOME/ rankinput $VIS_GRID_HOME/ rankprepout/graph/ sharedconnect graph_sharedconnect_2 (NullValueSillyMapper, CounterReducer) $VIS_GRID_HOME/ rankprepout/graph/ sharedconnect_2 $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeature_g_1 graph_popularity_rank_all (GRPopularityRankMapper, GRPopularityRankReducer) $VIS_GRID_HOME/ rankprepout/graph/ popularity_rank_all graph_popularity_rank_directed (GRPopularityRankMapper, GRPopularityRankReducer) $VIS_GRID_HOME/ rankprepout/graph/ popularity_rank_directed unaryfeaturemerger_entpopmov (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeature_pop1 unaryfeaturemerger_entpopmov2 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME /rankprepout/tmp/ statsmerge/ joinfeature_pop2 unaryfeaturemerger_entpopmov3 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeature_pop3 unaryfeaturemerger_entpopmov4 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeature_pop4 STEP 5c: FEATURE MERGING (POPULARITY) PIPELINE Features Others $VIS_GRID_HOME/ web_citation/ deep_hits $VIS_GRID_HOME/ rankinput $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeature_pop4 $VIS_GRID_HOME/ web_citation/ total_hits WebCitationTotalHits Normalization (SillyMapper) $VIS_GRID_HOME/ rankprepout/tmp/ webcitation_totalhits WebCitationDeepHits Normalization (SillyMapper) $VIS_GRID_HOME/ rankprepout/tmp/ webcitation_deephits asymmetricfeaturemerger_webcitation (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ asymmetric_webcitation coverage (QueryCountMapper, QueryCountReducer) $VIS_GRID_HOME/ rankprepout/spark/ coverage/week{WNo} $VIS_GRID_HOME/ rankprepout/spark/ logs/week{WNo} joincov1 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeaturescov1 joincov2 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeaturescov2 joinwikipop1 (MergeFeaturesMapper, UnaryFeatureMergerReducer) /user/barla/Spark/ wikiResultCounts $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeatureswikipop1 joinwikipop2 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeatureswikipop2 jointypes (EntityRelationTypeMapper, EntityRelationTypeReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/jointypes STEP 7: DATAPACK GENERATION PIPELINE (DumpFirstMapper, DumpFirstReducer) STEP 6: RANKING PIPELINE dumper1 $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/jointypes MLR scoring (/homes/barla/gf_mlr/ mlr_rank) /homes/barla/gf_mlr/ gbrank.xml /homes/barla/gf_mlr/ header.tsv $VIS_GRID_HOME/ rankprepout/spark/ ranking MLR scoring (SillyMapper, DisambiguationReducer) $VIS_GRID_HOME/ rankprepout/spark/ ranking-disambiguated groupmax (SillyMapper, OutputMaxReducer) $VIS_GRID_HOME/ rankprepout/tmp/ ranking formatranking (RankingFormatterMapper, RankingFormatterReducer) $VIS_GRID_HOME/ rankprepout/spark/ rankingfinal $VIS_GRID_HOME/ rankprepout/spark/ rankingfinal mergerank (GFFeedMapper, GFFeedReducer) $VIS_GRID_HOME/ gfrankout $VIS_GRID_HOME/ gfrankout/1/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfrankout/2/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfrankout/3/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfrankout/4/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfrankout/5/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfrankout10/geo/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/1/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/2/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/3/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/4/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/5/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/10/geo/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/1/yalinda/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/2/yalinda/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/3/yalinda/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/4/yalinda/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/5/yalinda/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/10/geo/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/ logicalobj/data $VIS_GRID_HOME/ platdumpTmp dumper2 (DumpSecondMapper, DumpSecondReducer) $VIS_GRID_HOME/ platdump datapack (Mapper, YISFacetMergeReducer) $VIS_GRID_HOME/ datapack 120
  • 121. High-Level Architecture View Entity graph Data preprocessing Feature extraction Model learning Feature sources Editorial judgements Datapack Ranking model Ranking and disambiguation Entity data Features 121
  • 122. Spark Architecture Entity graph Data preprocessing Feature extraction Model learning Feature sources Editorial judgements Datapack Ranking model Ranking and disambiguation Entity data Features 122
  • 123. Entity graph Data Preprocessing Data preprocessing Feature extraction Model learning Feature sources Editorial judgements Datapack Ranking model Ranking and disambiguation Entity data Features 123
  • 124. Entity graph • 3.4 million entities, 160 million relations • Locations: Internet Locality, Wikipedia, Yahoo! Travel • Athletes, teams: Yahoo! Sports • People, characters, movies, TV shows, albums: Dbpedia • Example entities • Dbpedia Brad_Pitt Brad Pitt Movie_Actor • Dbpedia Brad_Pitt Brad Pitt Movie_Producer • Dbpedia Brad_Pitt Brad Pitt Person • Dbpedia Brad_Pitt Brad Pitt TV_Actor • Dbpedia Brad_Pitt_(boxer) Brad Pitt Person • Example relations • Dbpedia Dbpedia Brad_Pitt Angelina_Jolie Person_IsPartnerOf_Person • Dbpedia Dbpedia Brad_Pitt Angelina_Jolie MovieActor_CoCastsWith_MovieActor • Dbpedia Dbpedia Brad_Pitt Angelina_Jolie MovieProducer_ProducesMovieCastedBy_MovieActor 124
  • 125. Entity graph challenges • Coverage of the query volume – New entities and entity types – Additional inference – International data – Aliases, e.g. jlo, big apple, thomas cruise mapother iv • Freshness – People query for a movie long before it’s released • Irrelevant entity and relation types – E.g. voice actors who co-acted in a movie, cities in a continent • Data quality – Andy Lau has never acted in Iron Man 3 125
  • 126. Entity graph Feature Extraction Data preprocessing Feature extraction Model learning Feature sources Editorial judgements Datapack Ranking model Ranking and disambiguation Entity data Features 126
  • 127. Feature extraction from text • Text sources – Query terms – Query sessions – Flickr tags – Tweets • Common representation Input tweet: Brad Pitt married to Angelina Jolie in Las Vegas Output event: Brad Pitt + Angelina Jolie Brad Pitt + Las Vegas Angelina Jolie + Las Vegas 127
  • 128. Features • Unary – Popularity features from text: probability, entropy, wiki id popularity … – Graph features: PageRank on the entity graph, Wikipedia, web graph – Type features: entity type • Binary – Co-occurrence features from text: conditional probability, joint probability … – Graph features: common neighbors … – Type features: relation type 128
  • 129. Feature extraction challenges • Efficiency of text tagging – Hadoop Map/Reduce • More features are not always better – Can lead to over-fitting without sufficient training data 129
  • 130. Entity graph Data Model Learning preprocessing Feature extraction Model learning Feature sources Editorial judgements Datapack Ranking model Ranking and disambiguation Entity data Features 130
  • 131. Model Learning • Training data created by editors (five grades) 400 Brandi adriana lima Brad Pitt person Embarassing 1397 David H. andy garcia Brad Pitt person Mostly Related 3037 Jennifer benicio del toro Brad Pitt person Somewhat Related 4615 Sarah burn after reading Brad Pitt person Excellent 9853 Jennifer fight club movie Brad Pitt person Perfect • Join between the editorial data and the feature file • Trained a regression model using GBDT – Gradient Boosted Decision Trees • 10-fold cross validation optimizing NDCG and tuning – number of trees – number of nodes per tree 131
  • 132. Impact of training data Number of training instances (judged relations) 132
  • 133. linear combination of the conditional probabilities gives best performance on the collected judgements using wqt = 2, = 0.5, and wf t = 1. However, the editorial data was not Model Learning challenges substantial enough to learn a ranking with GBDT. Click-through Rate versus Click over Expected Click From the image search query logs, we collect the user click • Editorial preferences not necessarily coincide with usage data that is related to the facets. This allows us to compute the click-through rate (CTR) on a facet for a given entity that is detected in a user query and for which the facets were shown – Users click a lot more on people than expected – Image bias? • Alternative: optimize for usage data the user. Let cl ickse,f be the number of clicks on a facet – Clicks turned into labels or preferences – Size of the data is not a concern – Gains are computed from normalized CTR/COEC entity f show in relation to entity e, and viewse,f the number times the facet f is shown to a user for a related entity e, then the probability of a click on a facet entity f for a given entity e can be modelled as ctr e,f : ctr e,f = cl ickse,f viewse,f – See van Zwol et al. Ranking Entity Facets Based on User Click Feedback. ICSC 2010: 192-199. (2) In Figure 3 the conditional click-through rate is shown for first ten positions. It shows theCTR per position for every page view where one of the facets is clicked, aggregated over P P p= 1 viewse,Zhang and Jones [3] refer to expected clicks, based on the denominator expected clicks given the positions C. Gradient Boosted Decision Trees Stochastic gradient boosted decision themost widely used learning algorithms today. Gradient tree boosting constructs regres-sion model, utilizing decision trees One advantage over other learners trees in general is that the feature are highly interpretable. GBDT different loss functions can be used research presented here we used our loss function. In related work, pairwise and ranking specific loss well at improving search relevance[shallow decision trees, trees in stochastic on a randomly selected subset of prone to over-fitting [14]. For the 196 shown in the search engine the ground truth for creating test set used by the gradient Conditional Probabilities of the facets search expe-rience, function rank(e, f ) that is conditional probabilities extracted wqs⇥Pqs(f |e)+ wf t⇥Pf t (f , e) (1) e) arethe conditional prob-abilities theweights for the different qt), query session (qs) and editorial judgements collected for their facets we find that conditional probabilities gives judgements using wqt = 2, However, the editorial data was not ranking with GBDT. Click over Expected Click logs, we collect the user click This allows us to compute the all entities. Observe that the CTR declines when the position at which a facet is shown increases. We introduce a second click model, based on the notion of clicks over expected clicks (COEC). To allows us to deal with the so called position bias – where facets appearing in lower positions are less likely to be clicked even if they are relevant [2]. This phenomenon isoften observed inWeb search and we adopt the COEC model proposed by Chapelle and Zhang [11]. In that model, we estimate ctr p as the aggregated ctr – over all queries and sessions – in position p for all positions P. Let then cl ickse,f be the number of clicks on a facet entity f show in relation to entity e, and viewse,f p the number of times the facet f is shown to a user for a related entity e at position p. The probability of a click over expected click on a facet entity f for a given entity e can then be modelled as coece,f : coece,f = cl ickse,f P P p= 1 viewse,f p ⇥ ctr p (3) Zhang and Jones [3] refer to this method as clicks over expected clicks, based on the denominator that includes the expected clicks given the positions that the url appeared in. C. Gradient Boosted Decision Trees 133 Stochastic gradient boosted decision trees (GBDT) is one of themost widely used learning algorithms in machine learning
  • 134. Entity graph Ranking and Disambiguation Data preprocessing Feature extraction Model learning Feature sources Editorial judgements Datapack Ranking model Ranking and disambiguation Entity data Features 134
  • 135. Ranking and Disambiguation • We apply the ranking function offline to the data • Disambiguation – How many times a given wiki id was retrieved for queries containing the entity name? Brad Pitt Brad_Pitt 21158 Brad Pitt Brad_Pitt_(boxer) 247 XXX XXX_(movie) 1775 XXX XXX_(Asia_album) 89 XXX XXX_(ZZ_Top_album) 87 XXX XXX_(Danny_Brown_album) 67 – PageRank for disambiguating locations (wiki ids are not available) • Expansion to query patterns – Entity name + context, e.g. brad pitt actor 135
  • 136. Ranking and Disambiguation challenges • Disambiguation cases that are too close to call – Fargo Fargo_(film) 3969 – Fargo Fargo,_North_Dakota 4578 • Disambiguation across Wikipedia and other sources 136
  • 137. Evaluation #2: Side-by-side testing • Comparing two systems – A/B comparison, e.g. current system under development and production system – Scale: A is better, B is better • Separate tests for relevance and image quality – Image quality can significantly influence user perceptions – Images can violate safe search rules • Classification of errors – Results: missing important results/contains irrelevant results, too few results, entities are not fresh, more/less diverse, should not have triggered – Images: bad photo choice, blurry, group shots, nude/racy etc. • Notes – Borderline, set one entities relate to the movie Psy but the query is most likely about Gangnam style – Blondie and Mickey Gilley are 70’s performers and do not belong on a list of 60’s musicians. – There is absolutely no relation between Finland and California. 137
  • 138. Evaluation #3: Bucket testing • Also called online evaluation – Comparing against baseline version of the system – Baseline does not change during the test • Small % of search traffic redirected to test system, another small % to the baseline system • Data collection over at least a week, looking for stat. significant differences that are also stable over time • Metrics in web search – Coverage and Click-through Rate (CTR) – Searches per browser-cookie (SPBC) – Other key metrics should not impacted negatively, e.g. Abandonment and retry rate, Daily Active Users (DAU), Revenue Per Search (RPS), etc. 138
  • 139. Coverage before and after the new system 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 Days Coverage Coverage before Spark Trend before Spark Coverage after Spark Trend after Spark Spark is deployed in production Before release: Flat, lower After release: Flat, higher 139
  • 140. Click-through rate (CTR) before and after the new system 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 Before release: Gradually degrading performance due to lack of fresh data After release: Learning effect: users are starting to use the tool again Days CTR CTR before Spark Trend before Spark CTR after Spark Trend after Spark Spark is deployed in production 140
  • 141. Summary • Spark – System for related entity recommendations • Knowledge base • Extraction of signals from query logs and other user-generated content • Machine learned ranking • Evaluation • Other applications – Recommendations on topic-entity pages 141
  • 142. Future work • New query types – Queries with multiple entities • adele skyfall – Question-answering on keyword queries • brad pitt movies • brad pitt movies 2010 • Extending coverage – Spark now live in CA, UK, AU, NZ, TW, HK, ES • Even fresher data – Stream processing of query log data • Data quality improvements • Online ranking with post-retrieval features 142
  • 144. Web Search and time • Information freshness adds constraints/tensions in every layer of WSEs • Architecture • Crawling • Indexing • Distribution • Caching • Serving system • Modeling • Time-dependent user intent • UI (how to let the user take control) 144
  • 145. High-level Architecture of WSEs results Runtime system Parser/ Tokenizer Index terms Cache Query Engine queries Indexing pipeline WWW 145
  • 146. Adding the time dimension • Some solutions don’t scale up anymore • Review your architecture • Review your algorithms • Add more machines (~$$$) • Some solutions don’t apply anymore • Caching 146
  • 147. Evolution • 1999 • Index updated ~once per month • Disk-based updates/indexing • 2001 • In-memory indexes • Changes the whole-game! • 2007 • Indexing time < 1 minute • Accept updates while serving • Now • Focused crawling, delayed transactions, etc. • Batch Updates -> Incremental processing 147
  • 148. Some landmarks • Reliable distributed storage • Some models/processes require millions of accesses • Massive parallelization • Map/Reduce – Hadoop • Storm/S4 – Streaming • Hadoop 2.0 - YARN! • Semi-structured storage systems • Asynchronous item updates 148
  • 149. What’s going on “right now”? 149
  • 150. Query temporal profiles • Modeling • Time-dependent user intent • Implicitly time-qualified search queries • SIGIR • Dream theater barcelona • Barcelona vs Madrid • …. • Interested in these tasks? • Joho et al. A survey of temporal Web experience, WWW 2013 • Keep an eye on NTCIR next year! 150
  • 151. Caching for Real-Time Indexes • Queries are redundant (heavy-tail) and bursty • Caching search results saves up executing ~30/60% of the queries • Tens of machines do the work of 1000s • Dilemma: Freshness versus Computation • Extreme #1: do not cache at all – evaluate all queries • 100% fresh results, lots of redundant evaluations • Extreme #2: never invalidate the cache • A majority of stale results – results refreshed only due to cache replacement, no redundant work • Middle ground: invalidate periodically (TTL) • A time-to-live parameter is applied to each cached entry 151 Blanco et al. SIGIR 2010
  • 152. CACHING FOR INCREMENTAL •Problem: •In fast crawling, cache not always up-to-date (stale) •Solution: • Cache Invalidator Predictor - looks into new documents and invalidates queries accordingly User Queries Runtime system Index pipeline Cache Query Engine Invalidator Synopsis Generator Parser/ Index Tokenizer Crawled Documents 152 • Using synopsis reduces the number of refreshes up to 30% compared to a time-to-live baseline INDEXES
  • 153. Time Aware ER • In some cases the time dimension is available – News collections – Blog postings • News stories evolve over time – Entities appear/disappear – Analyze and exploit relevance evolution • An Entity Search system can exploit the past to find relevant entities – De Martini et al. TAER: Time Aware Entity Retrieval. CIKM 2010, 153
  • 154. Ranking Entities Over Time Barack Obama Naoto Kan Barack Obama Yokohama Takao Sato Michiko Sato Kamakura Barack Obama United States 154
  • 156. Time(ly) opportunities Can we create new user experiences based on a deeper analysis and exploration of the time dimension? • Goals: – Build an application that helps users to explore, interact and ultimately understand existing information about the past and the future. – Help the user cope with the information overload and eventually find/learn about what she’s looking for. 156
  • 157. Original Idea • R. Baeza-Yates, Searching the Future, MF/IR 2005 – On December 1st 2003, on Google News, there were more than 100K references to 2004 and beyond. – E.g. 2034: • The ownership of Dolphin Square in London must revert to an insurance company. • Voyager 2 should run out of fuel. • Long-term care facilities may have to house 2.1 million people in the USA. • A human base in the moon would be in operation. 157
  • 158. Time Explorer • Public demo since August 2010 • Winner of HCIR NYT Challenge • Goal: explore news through time and into the future • Using a customized Web crawl from news and blog feeds http://fbmya01.barcelonamedia.org:8080/future/ 158 Matthews et al. HCIR 2010
  • 160. Time Explorer - Motivation  Time is important to search  Recency, particularly in news is highly related to relevancy  But, what about evolution over time?  How has a topic evolved over time?  How did the entities (people, place, etc) evolve with respect to the topic over time?  How will this topic continue to evolve over the future?  How does bias and sentiment in blogs and news change over time?  Google Trends, Yahoo! Clues, RecordedFuture …  Great research playground  Open source! 160
  • 162. Analysis Pipeline  Tokenization, Sentence Splitting, Part-of-speech tagging, chunking with OpenNLP  Entity extraction with SuperSense tagger  Time expressions extracted with TimeML  Explicit dates (August 23rd, 2008)  Relative dates (Next year, resolved with Pub Date)  Sentiment Analysis with LivingKnowledge  Ontology matching with Yago  Image Analysis – sentiment and face detection 162
  • 163. Indexing/Search • Lucene/Solr search platform to index and search – Sentence level – Document level • Facets for annotations (multiple fields for faster entity-type access) • Index publication date and content date –extracted dates if they exists or publication date • Solr Faceting allows aggregation over query entity ranking and for aggregating counts over time • Content date enables search into the future 163
  • 164. 164
  • 168. Timeline – Entity Trend 168
  • 172. Other challenges – Large scale processing • Distributed computing • Shift from batch (Hadoop) to online (Storm) – Efficient extraction/retrieval, algorithmic/data structures • Critical for interactive exploration – Connection with the user experience • Measures! User engagement? – Personalization – Integration with Knowledge Bases (Semantic Web) – Multilingual 172