Slides for the iDB summer school (Sapporo, Japan) http://db-event.jpn.org/idb2013/
Typically, Web mining approaches have focused on understanding and enhancing user search behavior: from query log analysis and click-through data, to employing the web graph structure for ranking, to detecting spam or duplicate web pages. Lately, there is a trend toward mining web content semantics and dynamics in order to enhance search capabilities, either by providing direct answers to users or by enabling advanced interfaces and capabilities. In this tutorial we look into different ways of mining textual information from Web archives, with a particular focus on how to extract and disambiguate entities, and how to put them to use in various search scenarios. Further, we discuss how web dynamics affect information access and how to exploit this in a search context.
2. Yahoo! Research Barcelona
• Established January, 2006
• Led by Ricardo Baeza-Yates
• Research areas
• Web Mining
• Social Media
• Distributed Web retrieval
• Semantic Search
• NLP/IR
2
3. Natural Language Retrieval
• How to exploit the structure and meaning of
natural language text to improve search
• Current search engines perform only limited NLP
(tokenization, stemming)
• Automated tools exist for deeper analysis
• Applications to diversity-aware search
• Source, Location, Time, Language, Opinion,
Ranking…
• Search over semi-structured data, semantic
search
• Roll-out user experiences that use higher layers
of the NLP stack
3
4. Agenda
• Intro to current status of Web Search
• Semantic Search
– Annotating documents
– Entity linking
– Entity search
– Applications
• Time-aware information access
– Motivation
– Applications
4
6. Some Challenges
• From user understanding to selecting the right
hardware/software architecture
– Modeling user behavior
– Modeling user engagement/return
• Providing answers, not links
– Richer interfaces
• Enriching document structure using semantic annotations
– Indexing those annotations
– Accessing them in real-time
– What are the right annotations? The right level of granularity?
Trade-offs?
• Time-aware information access
– Changes the game in almost every step of the pipeline
6
10. Structured data - Web search
• Top-1 entity with structured data
• Related entities
• Structured data extracted from HTML
10
11. New devices
• Different interaction (e.g. voice)
• Different capabilities (e.g. display)
• More Information (geo-localization)
• More personalized
11
13. Semantic Search
• What kinds of search and applications are
possible beyond string matching or
returning 10 blue links?
• Can we have a better understanding of
documents and queries?
• New devices open new possibilities, new
experiences
• Is current technology in natural language
understanding mature enough?
13
15. Why Semantic Search? Part I
• Improvements in IR are harder and harder to
come by
– Machine learning using hundreds of features
• Text-based features for matching
• Graph-based features provide authority
– Heavy investment in computational power, e.g. real-time
indexing and instant search
• Remaining challenges are not computational, but
in modeling user cognition
– Need a deeper understanding of the query, the
content and/or the world at large
– Could Watson explain why the answer is Toronto?
15
16. Semantic Search: a definition
• Semantic search is a retrieval paradigm that
– Makes use of the structure of the data or explicit schemas to
understand user intent and the meaning of content
– Exploits this understanding at some part of the search process
• Emerging field of research
– Exploiting Semantic Annotations in Information Retrieval (2008-
2012)
– Semantic Search (SemSearch) workshop series (2008-2011)
– Entity-oriented search workshop (2010-2011)
– Joint Intl. Workshop on Semantic and Entity-oriented Search (2012)
– SIGIR 2012 tracks on Structured Data and Entities
• Related fields:
– XML retrieval, Keyword search in databases, NL retrieval
16
20. Why Semantic Search? Part II
• The Semantic Web is here
– Data
• Large amounts of RDF data
• Heterogeneous schemas
• Diverse quality
– End users
• Not skilled in writing complex
queries (e.g. SPARQL)
• Not familiar with the data
• Novel applications
– Complementing document
search
• Rich Snippets, related entities,
direct answers
– Other novel search tasks
20
21. Other novel applications
• Aggregation of search results
– e.g. price comparison across websites
• Analysis and prediction
– e.g. world temperature by 2020
• Semantic profiling
– Ontology-based modeling of user interests
• Semantic log analysis
– Linking query and navigation logs to ontologies
• Task completion
– e.g. booking a vacation using a combination of services
• Conversational search
21
22. Common tasks in Semantic Search
semantic search
list search
entity search
SemSearch 2010/11
related entity finding
list completion
SemSearch 2011
TREC REF-LOD task TREC ELC task
22
23. Is NLU that complex?
”A child of five would understand
this. Send someone to fetch a child
of five”.
Groucho Marx
23
24. Some examples…
• Ambiguous searches
– paris hilton
• Multimedia search
– paris hilton sexy
• Imprecise or overly precise searches
– jim hendler
– pictures of strong adventures people
• Searches for descriptions
– 33 year old computer scientist living in barcelona
– reliable digital camera under 300 dollars
• Searches that require aggregation
– height eiffel tower
– harry potter movie review
– world temperature 2020
24
26. Paraphrases
• ‘This parrot is dead’
• ‘This parrot has kicked the bucket’
• ‘This parrot has passed away’
• ‘This parrot is no more'
• 'He's expired and gone to meet his maker,’
• 'His metabolic processes are now history’
26
28. Semantics at every step of the IR process
[Diagram: the IR engine between the user and the Web, with semantics applicable at query interpretation, document processing, indexing, ranking θ(q,d), and result presentation]
28
29. Understanding Queries
• Query logs are a big source of information &
knowledge
– To rank results better (what users click)
– To understand queries better
• Example sessions: “Paris” → “Paris Flights” vs. “Paris” → “Paris Hilton”
29
31. NLP for IR
• Full NLU is AI-complete and does not scale to
the size of the web (parsing the web is really hard).
• BUT … what about other shallow NLP
techniques?
• Hypothesis/Requirements:
• Linear extraction/parsing time
• Error-prone output (e.g. 60-90%)
• Highly redundant information
• Explore new ways of browsing
• Support your answers
31
33. Support your answers
Errors happen: choose the right ones!
• Humans need to “verify” unknown facts
• Multiple sources of evidence
• Common sense vs. Contradictions
• are you sure? is this spam? Interesting!
• Tolerance to errors greatly increases if users can verify
things fast
• Importance of snippets, image search
• Often the context is as important as the fact
• E.g. “S discovered the transistor in X”
• There are different kinds of errors
• Ridiculous result (decreases overall confidence in system)
• Reasonably wrong result (makes us feel good)
33
35. Entity?
• Uniquely identifiable “thing” or “object”
– “A thing with a distinct and independent existence”
• Properties:
– ID
– Name(s)
– Type(s)
– Attributes (/Descriptions)
– Relationships to other entities
• An entity is only one kind of textual annotation!
35
37. Named Entity Recognition
• “Named Entity”, IE context of MUC-6 (R.
Grishman & Sundheim’1996)
• Recognize information units like names, including
PERson, ORGanization, LOCation names, and
numeric expressions including time, date, money
[Yahoo!]/ORG employee [Roi Blanco]/PER visits [Sapporo]/LOC .
37
38. ML methods
• Supervised
• HMM
• CRF
• SVM
• …
• Unsupervised
• Lexical resources (e.g. WordNet), lexical patterns and
statistics from large non-annotated corpora.
• “Semi-supervised” (or “weakly supervised”)
• Mostly “Bootstrapping” from a set of entities
38
39. Entity Linking
Image taken from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM '08.
39
40. NE Linking / NED
• Linking mentions of Named Entities to a
Knowledge Base (e.g. Wikipedia, Freebase …)
• Name Variations
• Shortened forms
• Alternative Spellings
• “New” Entities (not contained in the
Knowledge Base)
• As the Knowledge Base grows … everything
looks like an Entity (a town named web!?)
40
41. Entity Ambiguity
• In Wikipedia Michael Jordan may also refer to:
– Michael Jordan (mycologist)
– Michael Jordan (footballer)
– Michael Jordan (Irish politician)
– Michael B. Jordan (born 1987), American actor
– Michael I. Jordan (born 1957), American researcher
– Michael H. Jordan (1936–2010), American executive
– Michael-Hakim Jordan (born 1977), basketball player
41
42. Traditional IE
• Relations: assertions linking two or more
concepts
– actors-act in-movies
– cities-capital of-countries
• Facts: instantiations of relations
– leonardo dicaprio-act in-inception
– cairo-capital of-egypt
• Attributes: facts capturing quantifiable properties
– actors --> awards, birth date, height
– movies --> producer, release date, budget
42
52. Why do we need entity linking?
• (Automatic) document enrichment
– go-read-here
– assistance for (Wikipedia) editors
– inline (microformats, RDFa)
52
53. Why do we need entity linking?
• “Use as feature”
– to improve
• classification
• retrieval
• word sense disambiguation
• semantic similarity
• ...
– dimensionality reduction (as compared to,
e.g., term vectors)
53
54. Why do we need entity linking?
• Enable
– semantic search
– advanced UI/UX
– ontology learning, KB population
– ...
54
55. A bit of history
• Text classification
• NER
• WSD
• NED
– {person name, geo, movie name, ...}
disambiguation
– (Cross-document) co-reference resolution
• Entity linking
55
56. Entity linking?
• NE normalization / canonicalization / sense
disambiguation
• DB record linkage / schema mapping
• Knowledge base population
• Entity linking
– D2W
– Wikification
– Semantic linking
56
57. Entity Linking: main problem
• Linking free text to entities
– Entities (typically) taken from a knowledge
base
• Wikipedia
• Freebase
• ...
– Any piece of text
• news documents
• blog posts
• tweets
• queries
• ...
57
58. Typical steps
1. Determine “linkable” phrases
– mention detection – MD
2. Rank/Select candidate entity links
– link generation – LG
– may include NILs (null values, i.e., no target in KB)
3. (Use “context” to
disambiguate/filter/improve)
– disambiguation – DA
58
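A minimal sketch of these three steps over a toy anchor dictionary (all entity names and prior counts below are hypothetical illustrations, not real statistics):

```python
# Toy anchor dictionary: surface form -> candidate entities with link counts.
ANCHOR_DICT = {
    "michael jordan": {"Michael_Jordan_(basketball)": 900,
                       "Michael_I._Jordan": 80},
    "berkeley": {"University_of_California,_Berkeley": 500,
                 "Berkeley,_California": 300},
}

def mention_detection(text):
    """MD: find phrases that occur in the anchor dictionary."""
    text = text.lower()
    return [m for m in ANCHOR_DICT if m in text]

def link_generation(mention):
    """LG: rank candidate entities by prior probability; NIL if no target."""
    candidates = ANCHOR_DICT.get(mention, {})
    if not candidates:
        return [("NIL", 0.0)]
    total = sum(candidates.values())
    return sorted(((e, n / total) for e, n in candidates.items()),
                  key=lambda pair: -pair[1])

def disambiguate(text):
    """DA: here simply keep the top-ranked candidate per mention."""
    return {m: link_generation(m)[0][0] for m in mention_detection(text)}
```

In a real system the DA step would use context (e.g. relatedness among candidates, as in the systems discussed later) rather than the prior alone.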
60. Wikipedia
• Basic element: article (proper)
• But also
– redirect pages
– disambiguation pages
– category/template pages
– admin pages
• Hyperlinks
– use “unique identifiers” (URLs)
• [[United States]] or [[United States|American]]
• [[United States (TV series)]] or
[[United States (TV series)|TV show]]
60
61. Wikipedia style guidelines
• “the lead contains a quick summary of the
topic's most important points, and each
major subtopic is detailed in its own section
of the article”
– “The lead section (also known as the lead,
introduction or intro) of a Wikipedia article is the
section before the table of contents and the first
heading. The lead serves as an introduction to the
article and a summary of its most important
aspects.”
See http://en.wikipedia.org/wiki/Wikipedia:Summary_style
61
62. Some statistics
• WordNet
– 80k entity definitions
– 115k surface forms
– 142k senses (entity - surface form combinations)
• Wikipedia (only)
– ~4M entity definitions
– ~12M surface forms
– ~24M senses
62
63. Wikipedia-based measures
• keyphraseness(w) [Mihalcea & Csomai 2007]
keyphraseness(w) = (collection frequency of term w as a link to another Wikipedia article) / (collection frequency of term w)
63
64. Wikipedia-based measures
• commonness(w,c) [Medelyan et al. 2008]
commonness(w,c) = (number of links with target c and anchor text w) / (number of links with anchor text w, over all targets c’)
64
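In code, over toy link statistics (the counts are hypothetical):

```python
# Toy link statistics: anchor text -> {target article: link count}.
ANCHOR_STATS = {
    "tree": {"Tree": 800, "Tree_(data_structure)": 200},
}

def commonness(w, c, stats=ANCHOR_STATS):
    """commonness(w, c): fraction of links with anchor text w that
    point at article c [Medelyan et al. 2008]."""
    targets = stats.get(w, {})
    total = sum(targets.values())
    return targets.get(c, 0) / total if total else 0.0
```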
65. Commonness and keyphraseness
Image taken from Li et al. (2013). TSDW: Two-stage word sense disambiguation using Wikipedia. In JASIST 2013.
65
66. Wikipedia-based measures
• relatedness(c, c’) [Milne & Witten 2008a]
Image taken from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In AAAI WikiAI Workshop.
66
67. Wikipedia-based measures
• relatedness(c, c’) [Milne & Witten 2008a]
relatedness(c, c’) = (log(max(|C|, |C’|)) - log(|C ∩ C’|)) / (log(|W|) - log(min(|C|, |C’|)))
where C and C’ are the sets of articles linking to c and c’ (their inlinks), and |W| is the total number of Wikipedia articles
67
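The measure can be sketched directly from inlink sets. It is a Normalized Google Distance style quantity; the sketch below returns it as a similarity (1 minus the distance, clipped to [0, 1]), which is one common convention:

```python
import math

def relatedness(inlinks_c, inlinks_c2, total_articles):
    """Milne-Witten relatedness computed from the inlink sets of two
    articles and the total number of Wikipedia articles."""
    common = len(inlinks_c & inlinks_c2)
    if common == 0:
        return 0.0
    big = max(len(inlinks_c), len(inlinks_c2))
    small = min(len(inlinks_c), len(inlinks_c2))
    distance = (math.log(big) - math.log(common)) / \
               (math.log(total_articles) - math.log(small))
    return max(0.0, 1.0 - distance)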
68. Recall the steps
1. mention detection – MD
2. link generation – LG
3. (disambiguation) – DA
68
69. Wikify!
[Mihalcea & Csomai 2007]
• MD
– tf.idf, χ², keyphraseness
• LG
1.Overlap between definition (Wikipedia page) and
context (paragraph) [Lesk 1986]
2.Naive Bayes [Mihalcea 2007]
• context, POS, entity-specific terms
3.Voting between (1) and (2)
69
70. Large-Scale Named Entity Disambiguation Based
on Wikipedia Data
[Cucerzan 2007]
• Key intuition: leverage context links
– '''Texas''' is a [[pop music]] band from [[Glasgow]],
[[Scotland]], [[United Kingdom]]. They were
founded by [[Johnny McElhone]] in [[1986 in
music|1986]] and had their performing debut in
[[March]] [[1988]] at ...
• Prune the candidates, keep only:
– appearances in the first paragraph of an article,
and
– reciprocal links
70
71. Large-Scale Named Entity Disambiguation Based
on Wikipedia Data
[Cucerzan 2007]
• MD
– NER; rule-based; co-ref resolution
• LG
– Represent entities as vectors
• context, categories
– Same for all candidate entity links
– Determine maximally coherent set
71
72. Topic Indexing with Wikipedia
[Medelyan et al. 2008]
• MD
– keyphraseness [Mihalcea & Csomai 2007]
• LG
• combination of average relatedness &
commonness
• LG/DA
• Naive Bayes
• TF.IDF, position, length, degree, weighted keyphraseness
72
73. Learning to Link with Wikipedia
[Milne & Witten 2008b]
• Key idea: disambiguation informs detection
– compare each possible sense with its relatedness
to the context sense candidates
– start with unambiguous senses
– So, first LG, then base MD on these results
73
74. Learning to Link with Wikipedia
[Milne & Witten 2008b]
Image taken from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM '08.
74
75. Learning to Link with Wikipedia
[Milne & Witten 2008b]
• Filter non-informative, non-ambiguous candidates
(e.g., “the”)
– based on keyphraseness, i.e., link probability
• Filter non-central candidates
– based on average relatedness to all other context senses
• Combine
75
76. Learning to Link with Wikipedia
[Milne & Witten 2008b]
• MD
• Machine learning
• link probability, relatedness, confidence of LG,
generality, frequency, location, spread
• LG
• Machine learning
• keyphraseness, average relatedness, sum of average
weights
76
77. Image taken from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM '08.
77
78. Local and Global Algorithms for
Disambiguation to Wikipedia
[Ratinov et al. 2011]
• Explicit focus on global versus local
algorithms
– “Global,” i.e., disambiguation of the candidate
graph
– NP-hard
• Optimization
– reduce the search space to a “disambiguation
context,” e.g.,
• all plausible disambiguations [Cucerzan 2007]
• unambiguous surface forms [Milne & Witten 2008b]
78
79. Local and Global Algorithms for
Disambiguation to Wikipedia
[Ratinov et al. 2011]
• Main contribution, in steps
1.use “local” approach (e.g., commonness) to
generate a disambiguation context
2.apply “global” machine learning approach
• relatedness, PMI
• {inlinks, outlinks} in various combinations (c and c’)
• {avg, max}
• Finally, apply another round of machine
learning
79
80. TAGME: On-the-fly Annotation of Short
Text Fragments
[Ferragina & Scaiella 2010]
• MD
– keyphraseness [Mihalcea & Csomai 2007]
• LG
• use “local” approach to generate a
disambiguation context, similar to [Ratinov et al.
2011]
• Heavy pruning
• mentions; candidate links; coherence
• Accessible at http://tagme.di.unipi.it
80
81. A Graph-based Method for Entity
Linking
[Guo et al. 2011]
• MD
– rule-based; prefer longer links
– generate a disambiguation context
• LG
• (weighted interpolation of) in- and outdegree in
disambiguation context to select entity links
• edges defined by wikilinks
• Evaluation on TAC KBP
81
84. Annotated documents
Barack Obama visited Tokyo this Monday as part of an extended Asian trip.
He is expected to deliver a speech at the ASEAN conference next Tuesday
[Diagram: the same text annotated with respect to two publication dates, 20 May 2009 and 28 May 2009; relative expressions such as “this Monday” and “next Tuesday” resolve to different dates]
84
86. How does it work?
[Diagram: example entities (Monty Python, Flying Circus, John Cleese, Brian) connected to a sentence/document-level inverted index and an entity-level forward index]
86
87. Efficient element retrieval
• Goal
– Given an ad-hoc query, return a list of documents and
annotations ranked according to their relevance to the query
• Simple Solution
– For each document that matches the query, retrieve its
annotations and return the ones with the highest counts
• Problems
– If there are many documents in the result set this will take too
long - too many disk seeks, too much data to search through
– What if counting isn’t the best method for ranking elements?
• Solution
– Special compressed data structures designed specifically for
annotation retrieval
87
88. Forward Index
• Access metadata and document contents
– Length, terms, annotations
• Compressed (in memory) forward indexes
– Gamma, Delta, Nibble, Zeta codes (power laws)
• Retrieving and scoring annotations
– Sort terms by frequency
• Random access using an extra compressed
pointer list (Elias-Fano)
88
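As an illustration of the codes mentioned above, Elias gamma coding writes a positive integer as a unary length prefix followed by its binary digits; small values, which dominate under power laws, get short codes. A minimal sketch:

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a bit string."""
    assert n >= 1
    binary = bin(n)[2:]                      # e.g. 5 -> "101"
    return "0" * (len(binary) - 1) + binary  # 5 -> "00101"

def gamma_decode(bits):
    """Decode a single gamma-coded integer from a bit string."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)
```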
89. Parallel Indexes
• Standard index contains only tokens
• Parallel indices contain annotations on the tokens – the
annotation indices must be aligned with main token index
• For example: given the sentence “New York has great
pizza” where New York has been annotated as a LOCATION
– Token index has five entries
(“new”, “york”, “has”, “great”, “pizza”)
– The annotation index has five entries
(“LOC”, “LOC”, “O”,”O”,”O”)
– Can optionally encode BIO format (e.g. LOC-B, LOC-I)
• To search for the New York location entity, we search for:
“token:New ^ entity:LOC token:York ^ entity:LOC”
89
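The aligned-fields idea can be sketched with position sets, where `aligned` plays the role of the `^` alignment operator (this in-memory layout is illustrative, not the actual index implementation; tokens are lowercased):

```python
# One aligned entry per token position, as on the slide.
tokens = ["new", "york", "has", "great", "pizza"]
annots = ["LOC", "LOC", "O", "O", "O"]

token_index, annot_index = {}, {}
for pos, (t, a) in enumerate(zip(tokens, annots)):
    token_index.setdefault(t, []).append(pos)
    annot_index.setdefault(a, []).append(pos)

def aligned(token, label):
    """Positions where `token` carries annotation `label` (the ^ operator)."""
    return set(token_index.get(token, [])) & set(annot_index.get(label, []))

def phrase_entity(words, label):
    """True if `words` occur consecutively and all carry `label`."""
    return any(all(start + i in aligned(w, label)
                   for i, w in enumerate(words))
               for start in aligned(words[0], label))
```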
90. Parallel Indices (II)
Doc #3: The last time Peter exercised was in the XXth century.
Doc #5: Hope claims that in 1994 she run to Peter Town.
Peter D3:1, D5:9
Town D5:10
Hope D5:1
1994 D5:5
…
Possible Queries:
“Peter AND run”
“Peter AND WNS:N_DATE”
“(WSJ:CITY ^ *) AND run”
“(WSJ:PERSON ^ Hope) AND run”
WSJ:PERSON D3:1, D5:1
WSJ:CITY D5:9
WNS:V_DATE D5:5
(Bracketing can also be dealt with)
90
93. Entity Ranking
• Given a topic, find relevant entities
• Evaluated in TREC and INEX campaigns
• Most well-known: people and expert
search
• Many other applications: dates, events,
locations, companies, ...
93
95. Example queries
• Impressionist art Museums in Holland
• Countries with the Euro currency
• German car manufacturers
• Artists related to Pablo Picasso
• Actors who played Hamlet
• English monarchs who married a French woman
• Many examples on
http://www.ins.cwi.nl/projects/inex-xer/topics/
95
96. Entity Ranking
• Topical query Q
– Entity (result) type TX
– A list of Entity instance Xs
• Systems employ categories, structure, links:
– Kaptein et al., CIKM10: exploits Wikipedia, identifies entity types, anchor-text
index for entity search
– Bron et al., CIKM10: entity co-occurrence, entity type filtering, context
(relation type)
• See http://ilps.science.uva.nl/trec-entity/resources/
• Further: http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
96
97. Element Containment Graph
[Zaragoza et al. CIKM 2007]
• Given passage s in S, element e in E, there is a
directed edge from s to e if passage s contains
element e
• Element containment graph C = S x E where Cij is
strength of connection between si and ej
• At query time, compute C_q as the subgraph of the containment
graph induced by the passages matching the query q
• Rank entities using HITS, PageRank, passage similarity,
element similarity, etc.
97
99. Resource Description Framework
(RDF)
• Each resource (thing, entity) is identified by a URI
– Globally unique identifiers
– Locators of information
• Data is broken down into individual facts
– Triples of (subject, predicate, object)
• A set of triples (an RDF graph) is published together in an
RDF document
[Diagram: an RDF document containing the triples (example:roi, type, foaf:Person) and (example:roi, name, “Roi Blanco”)]
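The triple model can be sketched in a few lines (prefixed names stand in for full URIs):

```python
# An RDF graph as a list of (subject, predicate, object) triples.
graph = [
    ("example:roi", "rdf:type",  "foaf:Person"),
    ("example:roi", "foaf:name", "Roi Blanco"),
]

def objects(triples, subject, predicate):
    """All objects of triples matching a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```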
100. Linking resources
[Diagram: example:roi (from Roi’s homepage) linked via sameAs to example:roi2 and via worksWith to example:peter (from Yahoo), who has type foaf:Person and email “pmika@yahoo-inc.com”; the types come from the Friend-of-a-Friend ontology]
101. Publishing RDF
• Linked Data
– Data published as RDF documents linked to other RDF
documents
– Community effort to re-publish large public datasets
(e.g. DBpedia, open government data)
• RDFa
– Data embedded inside HTML pages
– Recommended for site owners by Yahoo!, Google,
Facebook
• SPARQL endpoints
– Triple stores (RDF databases) that can be queried
through the web
102. The state of Linked Data
• Rapidly growing community effort to (re)publish open datasets as Linked
Data
– In particular, scientific and government datasets
– see linkeddata.org
• Less commercial interest, real usage (Haas et al. SIGIR 2011)
103. Glimmer: Architecture overview
1. Download, uncompress, convert (if needed)
2. Sort quads by subject
3. Compute a Minimal Perfect Hash (MPH) of the vocabulary
4. Map/Reduce indexing: each mapper reads part of the collection; each reducer builds an index for a subset of the vocabulary (optionally, an archive/forward index is built as well)
5. The sub-indices are merged into a single index
6. Serving and ranking
https://github.com/yahoo/Glimmer
103
104. Horizontal index structure
• One field per position
– one for object (token), one for predicates (property), optionally one for context
• For each term, store the property on the same position in the property
index
– Positions are required even without phrase queries
• Query engine needs to support fields and the alignment operator
Dictionary is number of unique terms + number of properties
Occurrences is number of tokens * 2
104
105. Vertical index structure
• One field (index) per property
• Positions are not required
• Query engine needs to support fields
Dictionary is number of unique terms
Occurrences is number of tokens
✗ Number of fields is a problem for merging, query performance
• In experiments we index the N most common properties
105
106. Efficiency improvements
• r-vertical (reduced-vertical) index
– One field per weight vs. one field per property
– More efficient for keyword queries but loses the ability to restrict per
field
– Example: three weight levels
• Pre-computation of alignments
– Additional term-to-field index
– Used to quickly determine which fields contain a term (in any
document)
106
107. Run-time efficiency
• Measured average execution time (including ranking)
– Using 150k queries that led to a click on Wikipedia
– Avg. length 2.2 tokens
– Baseline is plain text indexing with BM25
• Results
– Some cost for field-based retrieval compared to plain text indexing
– AND is always faster than OR
• Except in horizontal, where alignment time dominates
– r-vertical significantly improves execution time in OR mode
            AND mode   OR mode
plain text    46 ms      80 ms
horizontal   819 ms     847 ms
vertical      97 ms     780 ms
r-vertical    78 ms     152 ms
107
108. BM25F Ranking
BM25(F) uses a term frequency (tf) that accounts for the decreasing
marginal contribution of terms:
tf_i = Σ_s v_s · tf_{s,i} / B_s
where
v_s is the weight of field s
tf_{s,i} is the frequency of term i in field s
B_s is the document length normalization factor:
B_s = (1 - b_s) + b_s · (l_s / avl_s)
l_s is the length of field s
avl_s is the average length of s
b_s is a tunable parameter
108
109. BM25F ranking (II)
• Final term score is a combination of tf and idf:
score_i = (tf_i / (k1 + tf_i)) · w_i^IDF
where
k1 is a tunable parameter
w_i^IDF is the inverse document frequency of term i
• Finally, the score of a document D is the sum of
the scores of the query terms q:
score(D, Q) = Σ_{i ∈ Q} (tf_i / (k1 + tf_i)) · w_i^IDF
109
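Putting the two slides together, a compact sketch of the scoring function (the idf variant and all parameter values below are assumptions for illustration, not the tuned production setting):

```python
import math

def bm25f(query, doc, v, b, avl, df, n_docs, k1=1.2):
    """BM25F as on the slides: field-weighted, length-normalized tf,
    combined with a saturating function and idf.
    doc: {field: token list}; v, b, avl: per-field weight, b, avg length."""
    score = 0.0
    for term in query:
        tf = 0.0
        for field, tokens in doc.items():
            B = (1 - b[field]) + b[field] * len(tokens) / avl[field]
            tf += v[field] * tokens.count(term) / B
        idf = math.log(n_docs / (df.get(term, 0) + 1))  # simple idf variant
        score += tf / (k1 + tf) * idf
    return score
```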
110. Effectiveness evaluation
• Semantic Search Challenge 2010
– Data, queries, assessments available online
• Billion Triples Challenge 2009 dataset
• 92 entity queries from web search
– Queries where the user is looking for a single entity
– Sampled randomly from Microsoft and Yahoo! query logs
• Assessed using Amazon’s Mechanical Turk
– Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST
2010
– Blanco et al. Repeatable and Reliable Search System
Evaluation using Crowd-Sourcing, SIGIR2011
110
112. Effectiveness results
• Individual features
– Positive, stat. significant improvement from each feature
– Even a manual classification of properties and domains helps
• Combination
– Positive stat. significant marginal improvement from each additional feature
– Total improvement of 53% over the baseline
– Different signals of relevance
112
113. Conclusions
• Indexing and ranking RDF data
– Novel index structures
– Ranking method based on BM25F
• Future work
– Ranking documents with metadata
• e.g. microdata/RDFa
– Exploiting more semantics
• e.g. sameAs
– Ranking triples for display
– Question-answering
113
114. Entity Ranking Conclusions
• Entity ranking provides a richer user
experience for search-triggered
applications
• Ranking entities can be done
effectively/efficiently using simple
frequency-based models
• Explaining things is as important as
retrieving them (Blanco and Zaragoza SIGIR 2010)
– Support sentences: several features based on
scores of sentences and entities
– The role of the context boosts performance
114
116. Spark: related entity
recommendations in web search
• A search assistance tool for exploration
• Recommend related entities given the user’s current query
– Cf. Entity Search at SemSearch, TREC Entity Track
• Ranking explicit relations in a Knowledge Base
– Cf. TREC Related Entity Finding in LOD (REF-LOD) task
• A previous version of the system has been live since 2010
– van Zwol et al.: Faceted exploration of image search results. WWW 2010: 961-970
• The current version (described here) is presented in
– Blanco et al.: Entity recommendations in Web Search. ISWC 2013
116
117. Motivation
• Some users are short on time
– Need for direct answers
– Query expansion, question-answering,
information boxes, rich results…
• Other users have time at their hand
– Long term interests such as sports, celebrities,
movies and music
– Long running tasks such as travel planning
117
121. High-Level Architecture View
[Diagram: feature sources and the entity graph go through data preprocessing and feature extraction; model learning combines the features with editorial judgements to produce a ranking model; the ranking model and entity data form a datapack used for ranking and disambiguation]
121
122. Spark Architecture
[Same architecture diagram as on the previous slide]
122
123. Data Preprocessing
[Same architecture diagram, highlighting the data preprocessing stage]
123
125. Entity graph challenges
• Coverage of the query volume
– New entities and entity types
– Additional inference
– International data
– Aliases, e.g. jlo, big apple, thomas cruise mapother iv
• Freshness
– People query for a movie long before it’s released
• Irrelevant entity and relation types
– E.g. voice actors who co-acted in a movie, cities in a continent
• Data quality
– Andy Lau has never acted in Iron Man 3
125
126. Feature Extraction
[Same architecture diagram, highlighting the feature extraction stage]
126
127. Feature extraction from text
• Text sources
– Query terms
– Query sessions
– Flickr tags
– Tweets
• Common representation
Input tweet:
Brad Pitt married to Angelina Jolie in Las Vegas
Output event:
Brad Pitt + Angelina Jolie
Brad Pitt + Las Vegas
Angelina Jolie + Las Vegas
127
128. Features
• Unary
– Popularity features from text: probability, entropy,
wiki id popularity …
– Graph features: PageRank on the entity graph,
Wikipedia, web graph
– Type features: entity type
• Binary
– Co-occurrence features from text: conditional
probability, joint probability …
– Graph features: common neighbors …
– Type features: relation type
128
129. Feature extraction challenges
• Efficiency of text tagging
– Hadoop Map/Reduce
• More features are not always better
– Can lead to over-fitting without sufficient training
data
129
130. Model Learning
[Same architecture diagram, highlighting the model learning stage]
130
131. Model Learning
• Training data created by editors (five grades)
400   Brandi     adriana lima        Brad Pitt  person  Embarrassing
1397  David H.   andy garcia         Brad Pitt  person  Mostly Related
3037  Jennifer   benicio del toro    Brad Pitt  person  Somewhat Related
4615  Sarah      burn after reading  Brad Pitt  person  Excellent
9853  Jennifer   fight club movie    Brad Pitt  person  Perfect
• Join between the editorial data and the feature file
• Trained a regression model using GBDT
– Gradient Boosted Decision Trees
• 10-fold cross validation optimizing NDCG and tuning
– number of trees
– number of nodes per tree
131
132. Impact of training data
[Plot: performance as a function of the number of training instances (judged relations)]
132
133. linear combination of the conditional probabilities gives
best performance on the collected judgements using wqt = 2,
= 0.5, and wf t = 1. However, the editorial data was not
Model Learning challenges
substantial enough to learn a ranking with GBDT.
Click-through Rate versus Click over Expected Click
From the image search query logs, we collect the user click
• Editorial preferences not necessarily coincide with usage
data that is related to the facets. This allows us to compute the
click-through rate (CTR) on a facet for a given entity that is
detected in a user query and for which the facets were shown
– Users click a lot more on people than expected
– Image bias?
• Alternative: optimize for usage data
the user. Let cl ickse,f be the number of clicks on a facet
– Clicks turned into labels or preferences
– Size of the data is not a concern
– Gains are computed from normalized CTR/COEC
entity f show in relation to entity e, and viewse,f the number
times the facet f is shown to a user for a related entity e,
then the probability of a click on a facet entity f for a given
entity e can be modelled as ctr e,f :
ctr e,f =
cl ickse,f
viewse,f
– See van Zwol et al. Ranking Entity Facets Based on User Click
Feedback. ICSC 2010: 192-199.
(2)
In Figure 3 the conditional click-through rate is shown for
first ten positions. It shows theCTR per position for every
page view where one of the facets is clicked, aggregated over
P P
p= 1 viewse,Zhang and Jones [3] refer to expected clicks, based on the denominator expected clicks given the positions C. Gradient Boosted Decision Trees
Stochastic gradient boosted decision themost widely used learning algorithms today. Gradient tree boosting constructs regres-sion
model, utilizing decision trees One advantage over other learners trees in general is that the feature are highly interpretable. GBDT different loss functions can be used research presented here we used our loss function. In related work, pairwise and ranking specific loss well at improving search relevance[shallow decision trees, trees in stochastic on a randomly selected subset of prone to over-fitting [14]. For the 196
shown in the search engine
the ground truth for creating
test set used by the gradient
Conditional Probabilities
of the facets search expe-rience,
function rank(e, f ) that is
conditional probabilities extracted
wqs⇥Pqs(f |e)+ wf t⇥Pf t (f , e)
(1)
e) arethe conditional prob-abilities
theweights for the different
qt), query session (qs) and
editorial judgements collected for
their facets we find that
conditional probabilities gives
judgements using wqt = 2,
However, the editorial data was not
ranking with GBDT.
Click over Expected Click
logs, we collect the user click
This allows us to compute the
all entities. Observe that the CTR declines when the position
at which a facet is shown increases.
We introduce a second click model, based on the notion of clicks over expected clicks (COEC). This allows us to deal with the so-called position bias, where facets appearing in lower positions are less likely to be clicked even if they are relevant [2]. This phenomenon is often observed in Web search and we adopt the COEC model proposed by Chapelle and Zhang [11]. In that model, we estimate ctr_p as the aggregated ctr – over all queries and sessions – in position p for all positions P. Let then clicks_{e,f} be the number of clicks on a facet entity f shown in relation to entity e, and views_{e,f,p} the number of times the facet f is shown to a user for a related entity e at position p. The probability of a click over expected click on a facet entity f for a given entity e can then be modelled as coec_{e,f}:

coec_{e,f} = clicks_{e,f} / Σ_{p=1..P} (views_{e,f,p} × ctr_p)   (3)

Zhang and Jones [3] refer to this method as clicks over expected clicks, based on the denominator that includes the expected clicks given the positions that the URL appeared in.
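The two click models can be sketched directly from equations (2) and (3); a toy Python sketch with made-up counts (function names and data are illustrative, not the system's code):

```python
# Toy computation of the two click models: plain CTR (equation 2) and
# clicks over expected clicks, COEC (equation 3). All counts are made up.

def ctr(clicks, views):
    """Plain click-through rate: clicks / views."""
    return clicks / views if views else 0.0

def coec(clicks, views_per_pos, ctr_per_pos):
    """COEC for one (entity, facet) pair.

    views_per_pos[p]: times the facet was shown at position p.
    ctr_per_pos[p]: aggregate CTR observed at position p (all queries).
    """
    expected = sum(v * ctr_per_pos[p] for p, v in views_per_pos.items())
    return clicks / expected if expected else 0.0

# Aggregate per-position CTR, declining with position as in Figure 3.
ctr_per_pos = {1: 0.30, 2: 0.15, 3: 0.08}

# A facet shown mostly at the (harder) third position:
views_per_pos = {1: 10, 2: 10, 3: 80}
clicks = 12

print(ctr(clicks, sum(views_per_pos.values())))  # naive CTR: 0.12
print(coec(clicks, views_per_pos, ctr_per_pos))  # > 1: better than expected
```

A COEC above 1 means the facet attracted more clicks than its positions would predict, which is exactly the position-bias correction the model is after.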
C. Gradient Boosted Decision Trees

Stochastic gradient boosted decision trees (GBDT) is one of the most widely used learning algorithms in machine learning. Gradient tree boosting constructs a regression model utilizing decision trees; one advantage over other learners is that decision trees are highly interpretable, and different loss functions can be used. Shallow decision trees, trained in stochastic mode on a randomly selected subset of the data, are less prone to over-fitting [14].
133
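The excerpt does not ship code, but the stochastic flavour of GBDT – shallow trees fit to residuals on random subsamples, added with a small learning rate – can be illustrated with a self-contained toy implementation over regression stumps (all data and parameters synthetic):

```python
# Toy stochastic gradient boosting over regression stumps: depth-1 trees
# are fit to residuals on random subsamples of the training data.
import random

def fit_stump(xs, ys):
    """Best single-feature threshold split minimizing squared error."""
    best = None
    for j in range(len(xs[0])):
        for t in sorted(set(x[j] for x in xs)):
            left = [y for x, y in zip(xs, ys) if x[j] <= t]
            right = [y for x, y in zip(xs, ys) if x[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    _, j, t, lm, rm = best
    return lambda x, j=j, t=t, lm=lm, rm=rm: lm if x[j] <= t else rm

def gbdt_fit(xs, ys, rounds=50, lr=0.1, subsample=0.7, seed=0):
    rng = random.Random(seed)
    base = sum(ys) / len(ys)            # start from the mean target
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # Each stump sees only a random subset of the data (stochastic).
        idx = rng.sample(range(len(xs)), int(subsample * len(xs)))
        stump = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

# Toy training data: the target depends on feature 0 only.
xs = [[i / 20, (i * 7 % 20) / 20] for i in range(20)]
ys = [2 * x[0] for x in xs]
model = gbdt_fit(xs, ys)
```

With stumps the fit is coarse, but the training error drops well below the variance of the target; production ranking systems typically use deeper trees and ranking-specific losses.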
134. Ranking and Disambiguation
(Pipeline diagram: feature sources – entity graph, entity data, editorial judgements – feed data preprocessing and feature extraction; model learning produces a ranking model; ranking and disambiguation produce the datapack with features.)
134
135. Ranking and Disambiguation
• We apply the ranking function offline to the data
• Disambiguation
– How many times a given wiki id was retrieved for queries containing the entity name?
Brad Pitt Brad_Pitt 21158
Brad Pitt Brad_Pitt_(boxer) 247
XXX XXX_(movie) 1775
XXX XXX_(Asia_album) 89
XXX XXX_(ZZ_Top_album) 87
XXX XXX_(Danny_Brown_album) 67
– PageRank for disambiguating locations (wiki ids are not available)
• Expansion to query patterns
– Entity name + context, e.g. brad pitt actor
135
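The count-based disambiguation on the slide amounts to picking the wiki id with the largest retrieval count for the entity name; a toy sketch using the counts shown above (the data layout is hypothetical):

```python
# Toy disambiguation: for an entity name, pick the wiki id most often
# retrieved for queries containing that name. Counts come from the slide.
from collections import defaultdict

retrieval_counts = {
    ("Brad Pitt", "Brad_Pitt"): 21158,
    ("Brad Pitt", "Brad_Pitt_(boxer)"): 247,
    ("XXX", "XXX_(movie)"): 1775,
    ("XXX", "XXX_(Asia_album)"): 89,
    ("XXX", "XXX_(ZZ_Top_album)"): 87,
    ("XXX", "XXX_(Danny_Brown_album)"): 67,
}

by_name = defaultdict(dict)
for (name, wiki_id), count in retrieval_counts.items():
    by_name[name][wiki_id] = count

def disambiguate(name):
    """Return the dominant wiki id and its share of all retrievals."""
    counts = by_name[name]
    wiki_id = max(counts, key=counts.get)
    return wiki_id, counts[wiki_id] / sum(counts.values())

print(disambiguate("Brad Pitt"))  # the actor dominates
print(disambiguate("XXX"))        # the movie wins, less decisively
```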
136. Ranking and Disambiguation
challenges
• Disambiguation cases that are too close to call
– Fargo Fargo_(film) 3969
– Fargo Fargo,_North_Dakota 4578
• Disambiguation across Wikipedia and other
sources
136
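The "too close to call" cases can be detected with a simple relative-margin test; a sketch using the Fargo counts from the slide (the 20% margin is an assumed, illustrative threshold, not the system's actual rule):

```python
# Toy "too close to call" test: if the top two candidates are within a
# relative margin of each other, defer to other signals (e.g. PageRank)
# instead of auto-disambiguating. The 20% margin is an assumption.

def too_close(counts, margin=0.2):
    """True if the runner-up is within `margin` of the top candidate."""
    top, second = sorted(counts.values(), reverse=True)[:2]
    return (top - second) / top < margin

fargo = {"Fargo_(film)": 3969, "Fargo,_North_Dakota": 4578}
brad = {"Brad_Pitt": 21158, "Brad_Pitt_(boxer)": 247}

print(too_close(fargo))  # True: counts are close, defer
print(too_close(brad))   # False: safe to pick the actor
```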
137. Evaluation #2: Side-by-side
testing
• Comparing two systems
– A/B comparison, e.g. current system under development and production system
– Scale: A is better, B is better
• Separate tests for relevance and image quality
– Image quality can significantly influence user perceptions
– Images can violate safe search rules
• Classification of errors
– Results: missing important results/contains irrelevant results, too few results, entities
are not fresh, more/less diverse, should not have triggered
– Images: bad photo choice, blurry, group shots, nude/racy etc.
• Notes
– Borderline: set one's entities relate to the movie Psy, but the query is most likely about
Gangnam Style
– Blondie and Mickey Gilley are 70’s performers and do not belong on a list of 60’s
musicians.
– There is absolutely no relation between Finland and California.
137
138. Evaluation #3: Bucket testing
• Also called online evaluation
– Comparing against baseline version of the system
– Baseline does not change during the test
• Small % of search traffic redirected to test system, another
small % to the baseline system
• Data collection over at least a week, looking for statistically
significant differences that are also stable over time
• Metrics in web search
– Coverage and Click-through Rate (CTR)
– Searches per browser-cookie (SPBC)
– Other key metrics should not be impacted negatively, e.g.
Abandonment and retry rate, Daily Active Users (DAU), Revenue
Per Search (RPS), etc.
138
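Deciding whether a CTR difference between the test and baseline buckets is statistically significant is commonly done with a two-proportion z-test; a sketch with toy bucket counts (the slides do not specify which test was actually used):

```python
# Sketch of a significance check for a bucket test: compare the CTRs of
# the test and baseline buckets with a two-proportion z-test. Toy counts.
from math import sqrt, erf

def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """z statistic and two-sided p-value for CTR_a != CTR_b."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    p = (clicks_a + clicks_b) / (views_a + views_b)   # pooled rate
    se = sqrt(p * (1 - p) * (1 / views_a + 1 / views_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# One week of toy bucket data: test CTR 2.2% vs baseline CTR 2.0%.
z, p = two_proportion_z(2200, 100000, 2000, 100000)
print(round(z, 2), round(p, 4))
```

With 100k views per bucket this difference is significant at the usual 5% level; with much smaller buckets the same 0.2-point gap would not be.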
139. Coverage before and after the new system
(Chart: coverage vs. days, 0 to 80; series: coverage and trend before/after Spark; marker: "Spark is deployed in production". Before release: flat, lower. After release: flat, higher.)
139
140. Click-through rate (CTR) before and after the new system
(Chart: CTR vs. days, 0 to 80; series: CTR and trend before/after Spark; marker: "Spark is deployed in production". Before release: gradually degrading performance due to lack of fresh data. After release: learning effect, users are starting to use the tool again.)
140
141. Summary
• Spark
– System for related entity recommendations
• Knowledge base
• Extraction of signals from query logs and other user-generated
content
• Machine learned ranking
• Evaluation
• Other applications
– Recommendations on topic-entity pages
141
142. Future work
• New query types
– Queries with multiple entities
• adele skyfall
– Question-answering on keyword queries
• brad pitt movies
• brad pitt movies 2010
• Extending coverage
– Spark now live in CA, UK, AU, NZ, TW, HK, ES
• Even fresher data
– Stream processing of query log data
• Data quality improvements
• Online ranking with post-retrieval features
142
144. Web Search and time
• Information freshness adds constraints/tensions in every
layer of WSEs
• Architecture
• Crawling
• Indexing
• Distribution
• Caching
• Serving system
• Modeling
• Time-dependent user intent
• UI (how to let the user take control)
144
145. High-level Architecture of WSEs
(Diagram: the indexing pipeline crawls the WWW and feeds the parser/tokenizer, which produces terms for the index; at runtime, queries go through a cache to the query engine, which returns results.)
145
146. Adding the time dimension
• Some solutions don’t scale up anymore
• Review your architecture
• Review your algorithms
• Add more machines (~$$$)
• Some solutions don’t apply anymore
• Caching
146
147. Evolution
• 1999
• Index updated ~once per month
• Disk-based updates/indexing
• 2001
• In-memory indexes
• Changes the whole game!
• 2007
• Indexing time < 1 minute
• Accept updates while serving
• Now
• Focused crawling, delayed transactions, etc.
• Batch Updates -> Incremental processing
147
148. Some landmarks
• Reliable distributed storage
• Some models/processes require millions of accesses
• Massive parallelization
• Map/Reduce – Hadoop
• Storm/S4 – Streaming
• Hadoop 2.0 - YARN!
• Semi-structured storage systems
• Asynchronous item updates
148
150. Query temporal profiles
• Modeling
• Time-dependent user intent
• Implicitly time-qualified search queries
• SIGIR
• Dream theater barcelona
• Barcelona vs Madrid
• ….
• Interested in these tasks?
• Joho et al. A survey of temporal Web experience, WWW 2013
• Keep an eye on NTCIR next year!
150
151. Caching for Real-Time Indexes
• Queries are redundant (heavy-tail) and bursty
• Caching search results saves executing ~30-60% of the queries
• Tens of machines do the work of 1000s
• Dilemma: Freshness versus Computation
• Extreme #1: do not cache at all – evaluate all queries
• 100% fresh results, lots of redundant evaluations
• Extreme #2: never invalidate the cache
• A majority of stale results – results refreshed only
due to cache replacement, no redundant work
• Middle ground: invalidate periodically (TTL)
• A time-to-live parameter is applied to each cached entry
151
Blanco et al. SIGIR 2010
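The TTL "middle ground" above can be sketched as a small result cache where each entry expires after a fixed time-to-live (a toy illustration, not the system's cache):

```python
# Minimal TTL result cache: each cached entry is invalidated once it is
# older than a fixed time-to-live, trading freshness for computation.
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}              # query -> (results, timestamp)

    def get(self, query, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(query)
        if entry is None:
            return None              # miss: must evaluate the query
        results, stamp = entry
        if now - stamp > self.ttl:
            del self.store[query]
            return None              # expired: treat as stale, re-evaluate
        return results

    def put(self, query, results, now=None):
        now = time.time() if now is None else now
        self.store[query] = (results, now)

cache = TTLCache(ttl_seconds=300)
cache.put("sigir 2013", ["result1", "result2"], now=0)
print(cache.get("sigir 2013", now=100))   # fresh: served from cache
print(cache.get("sigir 2013", now=1000))  # past TTL: None, re-evaluate
```

A short TTL approaches extreme #1 (always fresh, lots of work); a long TTL approaches extreme #2 (mostly stale, no redundant work).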
152. Caching for Incremental Indexes
• Problem:
• With fast crawling, the cache is not always up-to-date (stale)
• Solution:
• Cache Invalidator Predictor: looks into new documents and invalidates cached
queries accordingly
(Diagram: crawled documents enter the index pipeline through the parser/tokenizer and a synopsis generator; the invalidator uses the synopses to invalidate entries in the cache that sits between user queries and the query engine in the runtime system.)
• Using synopses reduces the number of refreshes by up to 30% compared to a
time-to-live baseline
152
153. Time Aware ER
• In some cases the time dimension is available
– News collections
– Blog postings
• News stories evolve over time
– Entities appear/disappear
– Analyze and exploit relevance evolution
• An Entity Search system can exploit the past to find
relevant entities
– Demartini et al. TAER: Time-Aware Entity Retrieval. CIKM 2010.
153
154. Ranking Entities Over Time
(Example timeline: entities such as Barack Obama, Naoto Kan, Yokohama, Takao Sato, Michiko Sato, Kamakura, United States, ranked at different points in time)
154
156. Time(ly) opportunities
Can we create new user experiences based on a deeper analysis and
exploration of the time dimension?
• Goals:
– Build an application that helps users to explore,
interact and ultimately understand existing
information about the past and the future.
– Help the user cope with the information overload
and eventually find/learn about what she’s looking
for.
156
157. Original Idea
• R. Baeza-Yates, Searching the Future, MF/IR 2005
– On December 1st 2003, on Google News, there were more than 100K
references to 2004 and beyond.
– E.g. 2034:
• The ownership of Dolphin Square in London must revert to an insurance company.
• Voyager 2 should run out of fuel.
• Long-term care facilities may have to house 2.1 million people in the USA.
• A human base on the Moon would be in operation.
157
158. Time Explorer
• Public demo since August 2010
• Winner of HCIR NYT Challenge
• Goal: explore news through time and into
the future
• Using a customized Web crawl from news
and blog feeds
http://fbmya01.barcelonamedia.org:8080/future/
158
Matthews et al. HCIR 2010
160. Time Explorer - Motivation
• Time is important to search
– Recency, particularly in news, is highly related to relevance
• But what about evolution over time?
– How has a topic evolved over time?
– How did the entities (people, places, etc.) evolve with respect to the topic over time?
– How will this topic continue to evolve in the future?
– How does bias and sentiment in blogs and news change over time?
• Google Trends, Yahoo! Clues, RecordedFuture, …
• Great research playground
• Open source!
160
162. Analysis Pipeline
• Tokenization, sentence splitting, part-of-speech tagging, chunking with OpenNLP
• Entity extraction with SuperSense tagger
• Time expressions extracted with TimeML
– Explicit dates (August 23rd, 2008)
– Relative dates (next year, resolved with the publication date)
• Sentiment analysis with LivingKnowledge
• Ontology matching with YAGO
• Image analysis: sentiment and face detection
162
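Resolving a relative expression such as "next year" against the publication date can be sketched with a couple of hand-written rules (real pipelines rely on TimeML/TIMEX3 annotators; the rules and function name here are illustrative only):

```python
# Toy resolution of relative time expressions against the publication
# date; a real pipeline uses a TimeML/TIMEX3 annotator, this only shows
# the idea for two hand-written rules.
from datetime import date

def resolve(expression, pub_date):
    """Map a (simplified) relative expression to a concrete date."""
    expr = expression.lower().strip()
    if expr == "next year":
        return date(pub_date.year + 1, 1, 1)
    if expr == "last year":
        return date(pub_date.year - 1, 1, 1)
    raise ValueError("unsupported expression: " + expression)

pub = date(2008, 8, 23)              # the document's publication date
print(resolve("next year", pub))     # resolves to 2009
```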
163. Indexing/Search
• Lucene/Solr search platform to index and search
– Sentence level
– Document level
• Facets for annotations (multiple fields for faster entity-type access)
• Index publication date and content date: extracted dates if they exist, otherwise the publication date
• Solr faceting allows aggregation for query-time entity ranking and for aggregating counts over time
• Content date enables search into the future
163
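Aggregating counts over time with Solr faceting can be expressed with classic range-facet parameters; a sketch of such a request (the core name, field name and date range are assumptions for illustration, not Time Explorer's actual configuration):

```python
# Sketch of a Solr range-facet request that buckets matching documents
# by content date, one bucket per year. Core and field names assumed.
from urllib.parse import urlencode

params = {
    "q": "obama",                     # the user's query
    "rows": 10,
    "facet": "true",
    "facet.range": "content_date",    # extracted date, else publication date
    "facet.range.start": "2000-01-01T00:00:00Z",
    "facet.range.end": "2020-01-01T00:00:00Z",  # future dates included
    "facet.range.gap": "+1YEAR",
    "wt": "json",
}
url = "http://localhost:8983/solr/timeexplorer/select?" + urlencode(params)
print(url)
```

The per-bucket counts returned by Solr give the timeline directly; extending `facet.range.end` past the present is what makes search "into the future" possible.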
172. Other challenges
– Large scale processing
• Distributed computing
• Shift from batch (Hadoop) to online (Storm)
– Efficient extraction/retrieval, algorithmic/data
structures
• Critical for interactive exploration
– Connection with the user experience
• Measures! User engagement?
– Personalization
– Integration with Knowledge Bases (Semantic Web)
– Multilingual
172