1. Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content
chenwq
2014/04/16
Mounia Lalmas et al. (Yahoo! Labs, CIKM 2013 Best Paper)
4. Why/when do penguins wear sweaters?
Entity Search
Building an entity-driven serendipitous search system based on enriched entity networks extracted from Wikipedia and Yahoo! Answers
Serendipity
Finding something good or useful while not specifically looking for it
Serendipitous search systems provide relevant and interesting results
5. What is entity search?
How people become entities
6. What is entity search?
Entity extraction
Proximity measure between two entities
Entity ranking according to their proximity to a query entity
7. What is serendipity?
“making fortunate discoveries by accident”
M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: evaluating recommender systems by coverage and serendipity. RecSys 2010.
Serendipity = unexpectedness + relevance
“Expected” result baselines come from web search
Serendipity = interestingness + relevance
Interestingness of the result given the query
Personal interest in the result
P. Andre, J. Teevan, and S. T. Dumais. From x-rays to silly putty via uranus: Serendipity and its role in web search. SIGCHI 2009.
9. WHAT: What connections between entities do web community knowledge portals offer?
WHY: How do they contribute to an interesting, serendipitous browsing experience?
Why/when do penguins wear sweaters?
10. Why/when do penguins wear sweaters?
Yahoo! Answers: a community-driven question & answer portal
• 67M questions & 262M answers
• 2 years [2010/2011]
• English-language
• minimally curated: opinions, gossip, personal info; variety of points of view
Wikipedia: a community-driven encyclopedia
• 3,795,865 articles
• snapshot from end of December 2011
• English Wikipedia
• curated: high-quality knowledge; variety of niche topics
12. Entity & Relationship Extraction
Entity: any concept having a Wikipedia page
1. Identify surface forms [http],
2. resolve them to Wikipedia entities [Zhou],
3. rank the entities using an aboutness score [Paranjpe].
[http] https://www.otexts.org/node/832
[Zhou] Y. Zhou, L. Nie, O. Rouhani-Kalleh, et al. Resolving surface forms to Wikipedia topics. COLING 2010: 1335-1343.
[Paranjpe] D. Paranjpe. Learning document aboutness from implicit user feedback and document structure. CIKM 2009.
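Read as code, the three steps above form a small pipeline. The sketch below is hypothetical throughout: spot, resolve, and aboutness stand in for the cited components (surface-form spotting, Wikipedia resolution per Zhou et al., aboutness scoring per Paranjpe), and the 0.5 threshold is invented.

```python
# Hypothetical pipeline skeleton; spot/resolve/aboutness are stand-ins
# for the cited components, not real library calls.
def extract_entities(document_text, spot, resolve, aboutness, threshold=0.5):
    surface_forms = spot(document_text)            # 1. spot surface forms
    entities = set()
    for sf in surface_forms:                       # 2. resolve to Wikipedia
        entity = resolve(sf)
        if entity is not None:                     # unresolved forms are dropped
            entities.add(entity)
    # 3. score and rank by aboutness, keeping entities the document is "about"
    scored = [(e, aboutness(e, document_text)) for e in entities]
    kept = [(e, s) for e, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```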
Relationship: cosine similarity of tf-idf vectors, where each entity is represented by the concatenation of the documents in which it appears
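A minimal sketch of this edge weight, using scikit-learn rather than the authors' own implementation; the entity pseudo-documents below are toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# entity -> concatenation of all documents mentioning it (toy pseudo-documents)
pseudo_docs = {
    "Steve Jobs":    "apple founder iphone keynote apple computer",
    "Steve Wozniak": "apple founder engineer apple computer homebrew",
    "Penguin":       "antarctic bird sweater colony krill",
}

entities = list(pseudo_docs)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(pseudo_docs.values())
sim = cosine_similarity(tfidf)  # sim[i, j] = edge weight between entities i and j

for i in range(len(entities)):
    for j in range(i + 1, len(entities)):
        print(f"{entities[i]} -- {entities[j]}: {sim[i, j]:.3f}")
```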
15. Retrieval
Algorithm: lazy random walk with restart [Chung]
[Chung] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
16. Rank Aggregation
For a given query, combine the results from different search engines
Simple median-rank aggregation [Sculley]
Example input rankings:
A B C D E
C D E A B
C A D B E
[Sculley] D. Sculley. Rank aggregation for similar items. SDM 2007.
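A minimal sketch of median-rank aggregation over the three toy rankings on this slide: each item is scored by the median of its positions across the inputs, then re-sorted. It assumes every ranking covers the same items.

```python
from statistics import median

rankings = [
    ["A", "B", "C", "D", "E"],
    ["C", "D", "E", "A", "B"],
    ["C", "A", "D", "B", "E"],
]

items = set().union(*rankings)
# median of an item's 1-based positions across all input rankings
median_rank = {item: median(r.index(item) + 1 for r in rankings) for item in items}
aggregated = sorted(items, key=lambda item: median_rank[item])
print(aggregated)  # ['C', 'A', 'D', 'B', 'E']
```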
18. Retrieval

              Wikipedia   Yahoo! Answers   Combined
Precision@5     0.668         0.724          0.744
MAP             0.716         0.762          0.782

3 labels per query-result pair; annotator agreement (overlap): 85%
Average overlap in top-5 results between the two sources: 12%, e.g. for the same query entity:
Yahoo! Answers: Jon Rubinstein, Timothy Cook, Kane Kramer, Steve Wozniak, Jerry York
Wikipedia: System 7, PowerPC G4, SuperDrive, Power Macintosh, Power Computing Corp., Steve Jobs
19. WHAT: What connections between entities do web community knowledge portals offer?
WHY: How do they contribute to an interesting, serendipitous browsing experience?
Why/when do penguins wear sweaters?
21. Entity Networks with Metadata
Table 5: Serendipity across different runs
|relevant ∩ unexpected| / |unexpected|: the number of serendipitous results out of all unexpected results retrieved
|relevant ∩ unexpected| / |retrieved|: the number of serendipitous results out of all retrieved results
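Spelled out as code, with invented toy sets of result ids, the two ratios are:

```python
def serendipity_ratios(retrieved, relevant, unexpected):
    # serendipitous results are those that are both relevant and unexpected
    hits = relevant & unexpected
    of_unexpected = len(hits) / len(unexpected) if unexpected else 0.0
    of_retrieved = len(hits) / len(retrieved) if retrieved else 0.0
    return of_unexpected, of_retrieved

# toy example: 5 retrieved results, 3 of them unexpected, 2 of those relevant
retrieved  = {"r1", "r2", "r3", "r4", "r5"}
unexpected = {"r2", "r3", "r5"}
relevant   = {"r1", "r2", "r3"}
print(serendipity_ratios(retrieved, relevant, unexpected))  # (0.666..., 0.4)
```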
22. User-perceived Quality
1. Which result is more relevant to the query?
2. If someone is interested in the query, would they also be interested in these results?
3. Even if you are not interested in the query, are these results interesting to you personally?
4. Would you learn anything new about the query?
23. Entity Networks with Metadata
Table 6: Similarity (Kendall’s tau-b [Fagin]) between result sets and reference ranking

Question                                            Data   General   +Topic
Which result is more relevant to the query?         WP      0.162    0.194
                                                    YA      0.336    0.374
                                                    Comb    0.201    0.222
If someone is interested in the query, would        WP      0.162    0.176
they also be interested in the result?              YA      0.312    0.343
                                                    Comb    0.184    0.222
Even if you are not interested in the query, is     WP      0.139    0.144
the result interesting to you personally?           YA      0.324    0.359
                                                    Comb    0.168    0.198
Would you learn anything new about the query        WP      0.167    0.164
from this result?                                   YA      0.307    0.346
                                                    Comb    0.184    0.203

Topical category constraint: promotes results of the same topic as the query entity
Sentiment and readability constraints: hurt performance
[Fagin] R. Fagin, R. Kumar, M. Mahdian, et al. Comparing and aggregating rankings with ties. PODS 2004: 47-58.
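For reference, SciPy's kendalltau computes the tie-aware tau-b variant by default, so the table's measure can be sketched like this (the rank vectors below are invented):

```python
from scipy.stats import kendalltau

# ranks given to the same five results by a system and by the reference;
# equal ranks (ties) are allowed, which is why the tau-b variant is used
system_ranks    = [1, 2, 3, 4, 5]
reference_ranks = [2, 1, 3, 3, 5]

tau, p_value = kendalltau(system_ranks, reference_ranks)  # tau-b by default
print(f"tau-b = {tau:.3f}")
```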
Both datasets are user-generated content.
The content of each data source is represented as an entity network.
The challenges include:
extracting entities from the different datasets
building a meaningful similarity measure
Step 2: the resolution model is based on a rich set of both content-sensitive and content-independent features, derived from Wikipedia and various other data sources, including web behavioral data.
Each entity e is represented by the (order-insensitive) concatenation of all the documents in the corpus C where e appears.
The lexicon is extracted by tokenizing every document, removing stop words, and applying Porter’s stemming algorithm to the resulting tokens, as sketched below.
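A minimal sketch of this lexicon-building step with NLTK (assuming its punkt and stopwords resources are installed); the example sentence is invented:

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(document_text):
    # tokenize, drop stop words and non-alphabetic tokens, then Porter-stem
    tokens = word_tokenize(document_text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

print(preprocess("Why do penguins wear sweaters after oil spills?"))
# -> ['penguin', 'wear', 'sweater', 'oil', 'spill']
```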
The two graphs are almost fully connected: the largest connected component spans 92.5% of the nodes in YA and 95.78% in WP.
This is due to popular entities that appear ubiquitously in the two datasets.
Such entities represent very common concepts that are not particular to the subject of a document; they are removed from the entity networks, as they are unlikely to be relevant to the input entity.
The candidate entity space is then reduced by restricting to pairs of entities that co-occur in at least one document, as in the sketch below.
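A minimal sketch of that pruning step, with invented per-document entity sets:

```python
from itertools import combinations

# document id -> entities extracted from that document (toy data)
doc_entities = {
    "d1": {"Steve Jobs", "Steve Wozniak", "Apple Inc."},
    "d2": {"Steve Jobs", "Apple Inc."},
    "d3": {"Penguin", "Sweater"},
}

# keep only pairs of entities that co-occur in at least one document
candidate_pairs = set()
for entities in doc_entities.values():
    candidate_pairs.update(combinations(sorted(entities), 2))

print(sorted(candidate_pairs))
# ('Penguin', 'Steve Jobs') never appears: the two entities share no document
```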
Laziness parameter: eta = 0.9.
A restart probability of alpha = 0.15 gave worsening results, so the random walk is run with no jump.
Stop criterion, whichever comes first:
the F-norm (Frobenius norm) of the difference between two successive iterations falls below 10^-6, or
the maximum of 30 iterations is reached.
A sketch of the walk under these parameters follows below.
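A minimal sketch of the walk under these parameters. How eta enters the update is my reading of "lazy" (stay put with probability eta, otherwise follow an edge), and the toy graph is invented; only eta = 0.9, the 10^-6 tolerance, and the 30-iteration cap come from the notes above.

```python
import numpy as np

def lazy_random_walk(W, seed, eta=0.9, alpha=0.0, tol=1e-6, max_iter=30):
    """W: row-stochastic adjacency matrix; seed: index of the query entity;
    alpha: restart probability (0 here, since restarts worsened results)."""
    n = W.shape[0]
    restart = np.zeros(n)
    restart[seed] = 1.0
    p = restart.copy()
    for _ in range(max_iter):
        step = eta * p + (1.0 - eta) * (W.T @ p)   # lazy transition
        p_next = alpha * restart + (1.0 - alpha) * step
        if np.linalg.norm(p_next - p) < tol:       # change below 10^-6: stop
            return p_next
        p = p_next
    return p                                       # hit the 30-iteration cap

# toy 3-entity graph; the scores rank entities by proximity to the seed
W = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
print(lazy_random_walk(W, seed=0))
```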
The lazy random walk achieves a Precision@5 of 0.668 on WP and 0.724 on YA.
The combination of WP and YA does better still, with a Precision@5 of 0.744 and a MAP of 0.782.
Top: for each query, retrieve the 5 entities that occur most frequently in the top 5 search results provided by two major commercial search engines.
Top Nwq: same as the previous case, but excluding the Wikipedia page of the input entity (if present) from the set of results returned by the search engines. Performance improved, implying that entities from WP's entity network contribute to the serendipity of the search results.
Rel: return the top 5 entities in the related-query suggestions provided by the search engines.
Rel + Top: return the union of the sets of entity recommendations provided by Top and Rel.
The value in parentheses is almost always as high as the corresponding serendipity value, confirming that the methods proposed in the paper indeed retrieve a considerable fraction of results that are both unexpected and relevant.