State-of-the-art content sharing platforms often require users to assign tags to pieces of media in order to make them easily retrievable. Since this task is sometimes perceived as tedious or boring, annotations can be sparse. Commenting on the other hand is a frequently used means of expressing user opinion towards shared media items. We propose the use of time series analyses in order to infer potential tags and indexing terms for audio-visual content from user comments. In this way, we mitigate the vocabulary gap between queries and document descriptors. Additionally, we show how large-scale encyclopedias such as Wikipedia can aid the task of tag prediction by serving as surrogates for high-coverage natural language vocabulary lists. Our evaluation is conducted on a corpus of several million real-world user comments from the popular video sharing platform YouTube, and demonstrates significant improvements in retrieval performance.
This work together with Wen Li and Arjen P. de Vries has been accepted for full oral presentation at the 35th European Conference on Information Retrieval (ECIR) in Moscow, Russia. The full version of the article is available at: http://link.springer.com/chapter/10.1007/978-3-642-36973-5_4
Exploiting User Comments for Audio-visual Content Indexing and Retrieval (ECIR'13)
1. Exploiting User Comments for Audio-visual Content Indexing and Retrieval
Carsten Eickhoff, Wen Li and Arjen P. de Vries
March 25, 2013
Delft University of Technology
2. Overview
• Introduction and statistics
• Harnessing user comments for content indexing
• Dealing with noise
• Retrieval experiments
User Comments for Content Indexing and Retrieval 2
3. Example
4. Content Annotation
• Audio-visual content retrieval relies on textual metadata
• Author-provided titles and descriptions are often not enough
• Collaborative tagging can provide more information
5. Available Annotation Sources
• Tagging content is a tedious task
• To make it more interesting, tagging is sometimes integrated into games and reputation schemes
• Still, 58% of a 10,000-video sample from YouTube are annotated with fewer than 140 characters of text each
• At the same time, comment threads are massive…
6. Automatic term extraction
• "You will get kissed on the nearest possible Friday by the love of your life.Tomorrow will be the best day of your life.However,if you don't post this comment to at least 3 videos,you will die within 2 days.Now uv started reading dis dunt stop…"
• "omg i luv that stuff"
• "lol luv it luv"
• "Cute"
• "snoopy"
7. Types of Noise
1. Uninformative comments
omg i luv
that stuff
8. Types of Noise
1. Uninformative comments
2. Unrelated comments (incl. spam)
"You will get kissed on the nearest possible Friday by the love of your life.Tomorrow will be the best day of your life.However,if you don't post this comment to at least 3 videos,you will die within 2 days.Now uv started reading dis dunt stop…"
9. Types of Noise
1. Uninformative comments
2. Unrelated comments (incl. spam)
3. Misspellings and chat speak
"OMG YEAH LOL1!1!!! i luv that part u like robot chicken?"
10. Types of Noise
1. Uninformative comments
2. Unrelated comments (incl. spam)
3. Misspellings and chat speak
4. Foreign language utterances
"Snoopy est si mignon!!" ("Snoopy is so cute!!")
11. LM-based Keyword extraction
• Find those terms that have a locally higher likelihood of occurrence than globally in the collection
• Similar notion as tf/idf but within the LM framework
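A minimal sketch of this idea, as my own simplification rather than the paper's exact model: rank each term of a video's comment thread by the log-ratio of its local likelihood to its (smoothed) collection likelihood. The function name and the add-one smoothing are assumptions made for illustration.

```python
from collections import Counter
import math

def lm_keywords(local_tokens, collection_tokens, top_k=5):
    """Rank terms by the log-ratio of local vs. collection likelihood
    (a simplified stand-in for the paper's LM-based extraction)."""
    local = Counter(local_tokens)
    coll = Counter(collection_tokens)
    n_local = sum(local.values())
    n_coll = sum(coll.values())
    scores = {}
    for term, tf in local.items():
        p_local = tf / n_local
        # Add-one smoothing so terms unseen in the collection do not divide by zero.
        p_coll = (coll[term] + 1) / (n_coll + len(coll))
        scores[term] = math.log(p_local / p_coll)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Terms that are frequent everywhere ("lol", "omg") score near zero, while terms concentrated in this one thread ("snoopy") rise to the top, mirroring the tf-idf intuition mentioned above.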
12. Bursts
• Peaks in commenting activity may contain interesting information
13. Bursts
• Peaks in commenting activity may contain interesting information
[External]: Actor wins an award
14. Bursts
• Peaks in commenting activity may contain interesting information
[Internal]: Controversial comment
15. Generalized Burst Detection
• Kleinberg [1] measured bursts per term
• We need a more general representation of activity peaks
[1] Jon Kleinberg. Bursty and Hierarchical Structure in Streams, 2003
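As an illustration only, a naive threshold detector over per-interval comment counts, much simpler than Kleinberg's state automaton; the function name and the mean-plus-k-sigma rule are assumptions of this sketch, not the method of the paper.

```python
def detect_bursts(counts, k=2.0):
    """Flag time-bin indices whose activity exceeds the series mean by
    k standard deviations -- a simple threshold detector, not
    Kleinberg's two-state automaton."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    std = var ** 0.5
    return [i for i, c in enumerate(counts) if c > mean + k * std]
```

On a daily comment-count series, the flagged indices are the candidate burst windows whose content is then examined for the cause of the peak.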
16. Burst and Cause
• Capturing bursts seems to help
• But we also need to identify their cause
• A mixture of language models accounts for burst and pre-burst term likelihoods
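The burst/pre-burst contrast could be sketched as follows, as a simplified two-component comparison standing in for the paper's mixture of language models; the smoothing and naming here are assumptions:

```python
from collections import Counter
import math

def burst_terms(burst_tokens, pre_burst_tokens, top_k=3):
    """Score terms by how much more likely they are inside the burst
    window than before it, surfacing candidate causes of the peak."""
    burst = Counter(burst_tokens)
    pre = Counter(pre_burst_tokens)
    n_burst = sum(burst.values())
    n_pre = sum(pre.values())
    vocab = set(burst) | set(pre)
    # Add-one smoothing over the joint vocabulary keeps both
    # components comparable even for terms unseen in one window.
    scores = {
        t: math.log((burst[t] + 1) / (n_burst + len(vocab)))
           - math.log((pre[t] + 1) / (n_pre + len(vocab)))
        for t in burst
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Terms that spike only inside the burst window (e.g., "award" after the external event above) dominate the ranking, while the thread's background vocabulary cancels out.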
17. Vocabulary Regularization
• Currently: Discriminative terms are good
• As a result: Misspellings and non-English terms are recommended
• Wikipedia can help identify such cases:
Snoopy
18. Vocabulary Regularization
• Currently: Discriminative terms are good
• As a result: Misspellings and non-English terms are recommended
• Wikipedia can help identify such cases:
Yeah!!1 ("Wait, that's not a word…")
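A hypothetical sketch of this regularization step, assuming a precomputed set of lowercased Wikipedia article titles and redirect names serving as the high-coverage vocabulary list:

```python
def regularize(candidates, wiki_vocab):
    """Keep only candidate tags that appear in a Wikipedia-derived
    vocabulary, filtering out misspellings, chat speak and
    non-English terms that Wikipedia does not know."""
    return [t for t in candidates if t.lower() in wiki_vocab]
```

So "Snoopy" survives because Wikipedia has an article for it, while "Yeah!!1" and "luv" are dropped as out-of-vocabulary.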
19. Data Set
• 10,000 YouTube videos crawled in 2009/10
• 20 seed queries, following “related videos” link
• 4.7 M user comments
• On average 360 comments per video (σ = 984)
20. Retrieval experiments
• TREC-style retrieval experiment
• 40 manually constructed topics
• Pooled top 10 results evaluated via crowdsourcing
• BM25F models with fields per source (title, description, etc.)
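A minimal BM25F sketch of the field-based scoring; the field names, weights and statistics below are illustrative, not the paper's settings. Per-field term frequencies are length-normalised, weighted and summed into one pseudo-frequency before the usual BM25 saturation and IDF.

```python
import math

def bm25f_score(query_terms, doc_fields, field_weights, df, n_docs,
                avg_len, k1=1.2, b=0.75):
    """Score one document (a dict of field -> token list) against a
    query under a simple BM25F model with per-field weights."""
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            # Per-field length normalisation, then field weighting.
            norm = 1.0 - b + b * len(tokens) / avg_len[field]
            pseudo_tf += field_weights[field] * tf / norm
        idf = math.log((n_docs - df.get(term, 0) + 0.5) / (df.get(term, 0) + 0.5))
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
```

Treating each annotation source (title, description, extracted comment terms) as its own field lets the weights control how much the comment-derived terms contribute relative to the author-provided metadata.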
26. Experiments under Sparsity
• 58% of all video descriptions are shorter than 140 characters
• 50% of all titles are shorter than 35 characters
• We limit our corpus to videos with short titles and/or descriptions
• This affects 77% of all videos in our sample…
30. Conclusion
• User comments can enhance content annotation if we deal with
the domain-inherent noise appropriately
• Modeling commenting activity bursts, we can find informative
on-topic comments
• Through the use of Wikipedia, misspellings and foreign language
utterances can be reliably identified
31. Future Directions
• Additional regularization resources (e.g., Delicious, WordNet)
• New domains (e.g., social media streams linked to TV)
• Content-aware term extraction
• Cold start problem
• Cross-language ability
32. Thank You!