2. Overview
• Introduction
• Motivation
• System architecture
• Classification of relevant comments
• Ranking of relevant comments
• Results
• Conclusions
22.09.13 Bachelor's Thesis Session - July 2012 2
3. Introduction
• Text classification and ranking for comments on
YouTube videos
– First: classify whether each comment is relevant
to the given video
– Second: rank the relevant comments
• Focus on identifying relevant information
• Comments contain very few words: sometimes
fewer than 10, on average on the order of tens
• Relevance is evaluated with respect to the information
collected from other online sources about the video
22.09.13 CSCS 2013 – Bucharest, Romania 3
4. Existing research
• We have not been able to identify any previous
research in the direction of identifying relevant
comments
• YouTube research
– Identify the relevant features of community acceptance
(comments with many “likes”)
– Extract the sentiment orientation
– Differentiate between clean and noisy comments
• Other research
– Ranking Comments on the Social Web (uses Digg)
6. Motivation
• Most commented video
• “10 questions that every intelligent Christian
must answer”
• 1,429,425 comments on 30th May 2013 (early morning)
• How many of these comments are spam?
• Which ones would be most relevant to the
video?
7. Solution
• Ranking of the comments according to relevance
• Steps:
1. Automatically link video with other online sources
relevant to it
2. Filter comments to remove noisy comments
3. Rank the remaining comments according to
relevance computed using NLP techniques
• Our solution works for music videos
10. Preprocessing
• Comments retrieved with YouTube Data API
– Only used last 100 comments per video
• Filter comments not written in English using
JLangDetect
• Extracted the main topics for each comment
using Mallet => 5 topics per comment
• Expanded the topics with synonyms and
hypernyms from WordNet
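The topic expansion step can be sketched as follows. This is a minimal illustration in Python, not the implementation behind these slides: Mallet and WordNet are not called here, and the tiny `TOY_LEXICON` dictionary and the `expand_topics` helper are hypothetical stand-ins for WordNet lookups.

```python
# Hypothetical stand-in for WordNet: maps a word to (synonyms, hypernyms).
TOY_LEXICON = {
    "song": ({"tune"}, {"music"}),
    "singer": ({"vocalist"}, {"musician"}),
}


def expand_topics(topics, lexicon=TOY_LEXICON):
    """Return the topics plus any synonyms and hypernyms found in the lexicon."""
    expanded = set(topics)
    for topic in topics:
        # Words missing from the lexicon are kept but contribute no expansion.
        synonyms, hypernyms = lexicon.get(topic, (set(), set()))
        expanded |= synonyms | hypernyms
    return expanded


expanded = expand_topics(["song", "guitar"])
```

Expanding the topic set this way lets a comment match an article even when the two use different words for the same concept.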
11. Pre-classification of comments
• Objective: to reduce the number of comments considered for
ranking by identifying noise
• Classification with a neural network over a set of
simple linguistic features
• Multilayer perceptron implemented in Weka
• Features
– Number of non-ASCII characters
– Number of capital letters
– Number of newlines
– Number of digits
– Number of trivial and swear-words
– Number of words in comment
– Average word size
– Number of punctuation marks
– Common text spam count
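The features listed above could be computed along these lines. This is a rough Python sketch rather than the code used with Weka; the word list `TRIVIAL_OR_SWEAR_WORDS`, the phrase list `SPAM_PHRASES`, and the function name `extract_features` are assumptions, since the slides do not give the lists that were used.

```python
import string

# Hypothetical word and phrase lists; the real ones are not given in the slides.
TRIVIAL_OR_SWEAR_WORDS = {"lol", "omg", "damn"}
SPAM_PHRASES = ("check out my channel", "thumbs up", "subscribe")


def extract_features(comment):
    """Compute the nine surface features listed above for one comment."""
    words = comment.split()
    lowered = comment.lower()
    # Strip surrounding punctuation so "LOL!" still matches the word list.
    stripped = [w.strip(string.punctuation).lower() for w in words]
    return {
        "non_ascii": sum(1 for ch in comment if ord(ch) > 127),
        "capitals": sum(1 for ch in comment if ch.isupper()),
        "newlines": comment.count("\n"),
        "digits": sum(1 for ch in comment if ch.isdigit()),
        "trivial_or_swear": sum(1 for w in stripped if w in TRIVIAL_OR_SWEAR_WORDS),
        "words": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punctuation": sum(1 for ch in comment if ch in string.punctuation),
        "spam_phrases": sum(lowered.count(p) for p in SPAM_PHRASES),
    }


features = extract_features("LOL check out my channel!!\nStep 2")
```

The resulting dictionary corresponds to one feature vector, which would then be passed to the classifier.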
12. Pre-classification of comments
• Trained on a small corpus with 100 relevant
comments and 100 noisy comments
• Examples of noisy comments:
– "Step 1: Pause this video
Step 2: Google 'Rainymood'
Step 3: Click the first link
Step 4: Unpause this video
Step 5: Thumbs up this comment, enjoy and thank
me later"
– "Those 3,175 haters listen to 'Techno'."
– "IF YOU LIKE DIRTY DIANA SONG THE SINGER '' STEFANO GIORGINI '' DID A GREAT REMAKE
STEFANO IS A VERY GOOD SINGER SONGWRITER I THINK YOU WILL LIKE HIS VERSION JUST LOOK FOR
'' STEFANO GIORGINI '' DIRTY DIANA"
13. Pre-classification of comments
• Results of pre-classification stage
Type of Instances | No. Instances | %
Correctly Classified Instances | 174 | 87.46
Incorrectly Classified Instances | 26 | 12.54
Total Number of Instances | 200 | -
14. Relevance scoring stage
• Initial approach
• Extract topics from comments as previously
mentioned (Mallet + WordNet)
• Fetch Wikipedia articles for artist and song name
• Score computed from the number of occurrences
of the comment topics in the articles
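A minimal Python sketch of this occurrence-count scoring is shown below. It assumes whole-word, case-insensitive matching and a simple regular-expression tokenizer; neither detail is stated in the slides, and `occurrence_score` is a hypothetical name.

```python
import re
from collections import Counter


def occurrence_score(comment_topics, article_text):
    """Sum how often each distinct comment topic occurs as a word in the article."""
    # Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
    counts = Counter(re.findall(r"[a-z0-9']+", article_text.lower()))
    distinct_topics = {topic.lower() for topic in comment_topics}
    return sum(counts[topic] for topic in distinct_topics)


article = "The Beatles were an English rock band. The band formed in Liverpool."
score = occurrence_score(["band", "liverpool", "guitar"], article)
```

In this toy example "band" occurs twice, "liverpool" once, and "guitar" never, so the comment scores 3.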
15. Relevance scoring stage
• Second approach: topic-based scoring
• Similar to the previous one, but topics are also
extracted from the Wikipedia articles with
Mallet
• Scoring is done based on:
– Number of topics extracted from each comment
– Wikipedia topic matches for each comment
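One way to combine the two quantities above is the fraction of a comment's topics that also appear among the article topics. The Python sketch below uses that formula as an assumption; the slides do not state how the two counts were actually combined, and `topic_match_score` is a hypothetical name.

```python
def topic_match_score(comment_topics, article_topics):
    """Fraction of a comment's distinct topics that also occur in the article topics."""
    comment_set = {t.lower() for t in comment_topics}
    if not comment_set:
        # A comment with no extracted topics cannot match anything.
        return 0.0
    article_set = {t.lower() for t in article_topics}
    return len(comment_set & article_set) / len(comment_set)


score = topic_match_score(
    ["beatles", "peace", "love", "band", "cat"],
    ["Beatles", "band", "album", "love"],
)
```

Here three of the five comment topics match, giving a score of 0.6.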
16. Relevance scoring stage
• Third approach: multiple-source topic-based scoring
• Additional sources added to the Wikipedia articles
– Information from allmusic.com website on artists and
songs
– Information from song lyrics
• Topics matched between comments and Wikipedia +
Allmusic articles, plus exact match of lyrics
• Final relevance score is a weighted sum of the previous
factors
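The weighted sum can be sketched in Python as follows. The weights in `WEIGHTS` are hypothetical, since the slides do not give the values that were used, and `lyric_line_matches` is only one possible reading of "exact match of lyrics" (a lyric line occurring verbatim in the comment).

```python
# Hypothetical weights; the values actually used are not given in the slides.
WEIGHTS = {"wikipedia": 1.0, "allmusic": 1.0, "lyrics": 2.0}


def lyric_line_matches(comment, lyrics):
    """Count non-empty lyric lines that occur verbatim (ignoring case) in the comment."""
    text = comment.lower()
    lines = (line.strip().lower() for line in lyrics.splitlines())
    return sum(1 for line in lines if line and line in text)


def combined_score(wikipedia_matches, allmusic_matches, lyrics_matches, weights=WEIGHTS):
    """Weighted sum of the per-source match counts."""
    return (
        weights["wikipedia"] * wikipedia_matches
        + weights["allmusic"] * allmusic_matches
        + weights["lyrics"] * lyrics_matches
    )


lyrics = "All you need is love\nLove is all you need"
n_lyric = lyric_line_matches("all you need is love, always", lyrics)
score = combined_score(wikipedia_matches=3, allmusic_matches=1, lyrics_matches=n_lyric)
```

With these illustrative weights, one lyric match counts as much as two topic matches from either article source.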
17. Results
Comment | Relevance
maybe your friend should know that being english, have a picture in abbey road and "sing" all you need is love" won't make one direction a group like the beatles... | 662
my mom said she doesn't like the beatles and she said that john was only good to look at not to hear. my dad said, " haha so true!." i'm an orphan now. | 968
you shouldn't be listening to the beatles since these seem to turn your friends into enemies! beatles are all about peace! you are not getting their message! | 983
please read this ! hey i know u just wanna listen to the song but i still have to write this hoping someone will see it and that someone will care .i'm a young musician from croatia so this spam is my only chance to get noticed.please check out my channel and i promise u won't be sorry.i appreciate your time because music means everything to me, thank you! | 1309
i didn't mean fight other places. i meant focus on the hurt people in your own country first, then expand to the others. if people don't agree with peace that's an opinion. not a fact, and people often take offense to opinions. there isn't anything to take offense to, they say something that's all it is. they said it, don't put meaning to it. world peace - i meant the whole world having peace there | 1639
18. Results
• Difficult to assess the impact of the
relevance measure
• Interpreting the comments is subjective
– Need human annotators
• The order of the comments is completely
different from the order currently shown on
YouTube (correlation below 0.031 for the
first 100 comments)
• Method 1 is also not correlated with the other
two methods
• Methods 2 and 3 have a higher correlation: 0.124
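The slides do not say which correlation coefficient was used. As an illustration only, the Python sketch below computes Spearman's rank correlation between two orderings of the same comments, assuming no ties; the function name `spearman_rho` and the comment identifiers are hypothetical.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation of two orderings of the same distinct items."""
    n = len(ranking_a)
    if n < 2 or len(set(ranking_a)) != n or len(ranking_b) != n:
        raise ValueError("need two rankings of the same n >= 2 distinct items")
    if set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same items")
    position_b = {item: i for i, item in enumerate(ranking_b)}
    # Sum of squared rank differences between the two orderings.
    d_squared = sum((i - position_b[item]) ** 2 for i, item in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


method_2 = ["c1", "c2", "c3", "c4"]
method_3 = ["c2", "c1", "c3", "c4"]
rho = spearman_rho(method_2, method_3)
```

A value near 1 means the two methods order the comments almost identically, a value near 0 means the orderings are essentially unrelated, and -1 means one is the reverse of the other.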
19. Conclusions
• 2-stage method for ranking comments on YouTube
• The first stage removes noisy comments
• The second stage tries to link the comments with
information from other web pages relevant for the
video
• Relevance is computed based on topic-modeling with
Mallet
• Results are encouraging, but need to find a more
rigorous method of assessing them
• Results are better than the usual results provided by
YouTube; however, the processing time for each video
is not negligible