Slides accompanying the VLDB Journal 2010 paper -
Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, "Multimodal Social Intelligence in a Real-Time Dashboard System", to appear in a special issue of the VLDB Journal on "Data Management and Mining for Social Networks and Social Media"
1. http://www.almaden.ibm.com/cs/projects/iis/sound/
BBC SoundIndex
Pulse of the Online Music Populace
Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, Multimodal Social
Intelligence in a Real-Time Dashboard System to appear in a special issue of the VLDB Journal on
"Data Management and Mining for Social Networks and Social Media", 2010
2–7. The Vision
http://www.almaden.ibm.com/cs/projects/iis/sound/
What is 'really' hot?
- Netizens do not always buy their music, let alone buy it in a CD store.
- Traditional sales figures are a poor indicator of music popularity.
BBC: Are online music communities good proxies for popular music listings?
IBM: Well, let's build it and find out!
BBC SoundIndex - "A pioneering project to tap into the online buzz surrounding artists and songs, by leveraging several popular online sources"
"One chart for everyone" is so old!
8. "Multimodal Social Intelligence in a Real-Time Dashboard System", VLDB Journal 2010 Special Issue: Data Management and Mining for Social Networks and Social Media.
[Diagram: Artist/Track metadata; user metadata; unstructured and structured attention metadata]
Right data source, right crowd, timeliness of data..?
9. “Multimodal Social Intelligence in a Real-Time
Dashboard System”, VLDB Journal 2010 Special Issue:
Data Management and Mining for Social Networks and
Social Media.
Album/Track identification
Sentiment Identification
Spam and off-topic comments
UIMA Analytics Environment
right data source, right crowd, timeliness of data..?
10. “Multimodal Social Intelligence in a Real-Time
Dashboard System”, VLDB Journal 2010 Special Issue:
Data Management and Mining for Social Networks and
Social Media.
Extracted concepts into
explorable data structures
right data source, right crowd, timeliness of data..?
11–12. "Multimodal Social Intelligence in a Real-Time Dashboard System", VLDB Journal 2010 Special Issue: Data Management and Mining for Social Networks and Social Media.
What are 18 year olds in London listening to?
Validating crowd-sourced preferences
Right data source, right crowd, timeliness of data..?
13. Imagine doing this for a local business!
Pulse of the 'foodie' populace!
Where are 20-somethings going? Why?
14. SoundIndex Architecture
Fig. 2 SoundIndex architecture. Data sources are ingested and, if necessary, transformed into structured data using the MusicBrainz RDF and data miners. The resulting structured data is stored in a database and periodically extracted to update the front end.
16. UIMA Annotators
MySpace user comments (~ Twitter)
Named Entity Recognition
Sentiment Recognition
Spam Elimination
I heart your new song Umbrella..
madge..ur pics on celebration concert with jesus r awesome!
Challenges, intuitions, findings, results..
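The annotators run as a chain over each comment. A minimal sketch of that flow in Python; spot_entities, score_sentiment, and is_spam are hypothetical stand-ins for the real UIMA annotators named on this slide:

```python
# Minimal sketch of chaining the three annotators over a comment.
# spot_entities, score_sentiment, and is_spam are hypothetical stand-ins
# for the UIMA annotators (NER, sentiment, spam elimination).
from dataclasses import dataclass, field

@dataclass
class AnnotatedComment:
    text: str
    entities: list = field(default_factory=list)    # spotted album/track names
    sentiments: list = field(default_factory=list)  # (expression, orientation) pairs
    spam: bool = False

def annotate(text, spot_entities, score_sentiment, is_spam):
    """Run NER, sentiment recognition, and spam detection in order on one comment."""
    ann = AnnotatedComment(text)
    ann.entities = spot_entities(text)
    ann.sentiments = score_sentiment(text)
    ann.spam = is_spam(text, ann.entities, ann.sentiments)
    return ann
```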
17–18. Recognizing Named Entities
Cultural entities, informal text, context-poor utterances, restricted to the music sense..
Problem definition: semantic annotation of album/track names (using MusicBrainz) [at ISWC '09]
"Ohh these sour times... rock!"
19–21. NER: Spot and Disambiguate Paradigm
Candidate entity list (Michael Jackson albums): Got to Be There, Ben, Music and Me, Forever Michael, Off the Wall, Thriller, Bad, Dangerous
Generate candidate entities:
  "Thriller was my most fav MJ album"
  "this is bad news, ill miss you MJ"
Disambiguate spots/mentions in context
A disambiguation-intensive approach
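A minimal sketch of the "spot" half of the paradigm, using the album list above as an in-memory stand-in for the knowledge base; the disambiguation step is only a hook here (its SVM features appear later in the deck):

```python
import re

# Hypothetical scoped entity list (the Michael Jackson albums from the slide).
ALBUMS = ["Got to Be There", "Ben", "Music and Me", "Forever Michael",
          "Off the Wall", "Thriller", "Bad", "Dangerous"]

def spot_candidates(comment, entities=ALBUMS):
    """Naive spotter: flag every case-insensitive match of an entity name.

    Every match becomes a candidate 'spot' that a later disambiguation step
    must accept or reject in context."""
    spots = []
    for name in entities:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", comment, re.IGNORECASE):
            spots.append((name, m.start(), m.end()))
    return spots

# "Thriller" and "bad" are both spotted, but only disambiguation can tell
# that the second mention is not the album.
print(spot_candidates("Thriller was my most fav MJ album"))
print(spot_candidates("this is bad news, ill miss you MJ"))
```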
22–23. Challenge 1: Multiple Senses, Same Domain
60 songs with "Merry Christmas"
3600 songs with "Yesterday"
195 releases of "American Pie"; 31 artists covering "American Pie"
"Happy 25th! Loved your song Smile.."
24–25. Intuition: Scoped Graphs
"This new Merry Christmas tune.. SO GOOD!"
Which 'Merry Christmas'? 'So Good' is also a song!
Scoped relationship graphs using cues from the content, webpage title, URL..
Reduce the potential entity spot size -> generate candidate entities -> disambiguate in context
26. What Content Cues to Exploit?
"I heart your new album Habits"
"Happy 25th lilly, love ur song smile"
"Congrats on your rising star award"..
Cues imply restrictions: releases that are not new; artists who are at least 25; new careers..
Experimenting with restrictions: career length, album release, no. of albums, ..., specific artist
27. Gold Truth Dataset
1800+ spots in MySpace user comments from
artist pages
Keep your SMILE on!
good spot, bad spot, inconclusive spot?
4-way annotator agreements across spots
Madonna 90% agreement
Rihanna 84% agreement
Lily Allen 53% agreement
28–31. Sample Restrictions, Spot Precision
3 artists, 1800+ spots
From all of MusicBrainz (281,890 artists; 6,220,519 tracks) down to the tracks of one artist
Experimenting with restrictions: career length, album release, no. of albums, ..., specific artist
[Chart: precision of the spotter vs. fraction of the MusicBrainz taxonomy used, ranging from the entire taxonomy (very low precision) through progressively tighter restrictions on career length and album counts, down to a naive spotter trained on only one artist's songs]
Closely follows the distribution of random restrictions, conforming loosely to a Zipf distribution
Choosing which constraints to implement is simple - pick the easiest first
32–33. Scoped Entity Lists: Madonna's tracks
User comments are on MySpace artist pages
Restriction: artist name
Assumption: no other artist/work is mentioned
The naive spotter has the advantage of spotting all possible mentions (modulo spelling errors), but generates several false positives:
"this is bad news, ill miss you MJ"
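A tiny sketch of deriving such a scoped list, assuming an in-memory stand-in for a slice of MusicBrainz:

```python
# Hypothetical in-memory stand-in for a slice of MusicBrainz: (artist, work) rows.
CATALOG = [("Madonna", "Celebration"), ("Madonna", "Hung Up"),
           ("Michael Jackson", "Thriller"), ("Michael Jackson", "Bad")]

def scoped_entity_list(artist, catalog=CATALOG):
    """Restriction: artist name (the artist whose page the comments appear on)."""
    return [work for a, work in catalog if a == artist]

print(scoped_entity_list("Michael Jackson"))  # -> ['Thriller', 'Bad']
```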
34–35. Challenge 2: Disambiguating Spots
Got your new album Smile. Loved it!
Keep your SMILE on!
Separating valid and invalid mentions of music named entities.
36. Intuition: Using Natural Language Cues
Got your new album Smile. Loved it!
Syntactic features (Notation-S)
  POS tag of s                                                  s.POS +
  POS tag of one token before s                                 s.POSb
  POS tag of one token after s                                  s.POSa
  Typed dependency between s and sentiment word                 s.POS-TDsent *
  Typed dependency between s and domain-specific term           s.POS-TDdom *
  Boolean typed dependency between s and sentiment              s.B-TDsent *
  Boolean typed dependency between s and domain-specific term   s.B-TDdom *
Word-level features (Notation-W)
  Capitalization of spot s                                      s.allCaps +
  Capitalization of first letter of s                           s.firstCaps +
  s in quotes                                                   s.inQuotes +
Domain-specific features (Notation-D)
  Sentiment expression in the same sentence as s                s.Ssent
  Sentiment expression elsewhere in the comment                 s.Csent
  Domain-related term in the same sentence as s                 s.Sdom
  Domain-related term elsewhere in the comment                  s.Cdom
+ Refers to basic features; others are advanced features.
* These features apply only to one-word-long spots.
Table 6. Features used by the SVM learner: generic syntactic, spot-level, and domain features
37. Intuition: Using Natural Language Cues
Got your new album Smile. Loved it!
1. Sentiment expressions: a slang sentiment gazetteer mined using Urban Dictionary
2. Domain-specific terms: music, album, concert..
38. Intuition: Using Natural Language Cues
Got your new album Smile. Loved it!
We also captured the typed dependency paths (grammatical relations) via the s.POS-TDsent and s.POS-TDdom features. These were obtained between a spot and co-occurring sentiment and domain-specific words by the Stanford parser [12] (see the example in Table 7). We also encode a boolean value indicating whether a relation was found at all, using the s.B-TDsent and s.B-TDdom features. This allows us to accommodate parse errors given the informal and often non-grammatical English in this corpus.
Valid spot: "Got your new album Smile. Simply loved it!" Encoding: nsubj(loved-8, Smile-5), implying that Smile is the nominal subject of the expression loved.
Invalid spot: "Keep your smile on. You'll do great!" Encoding: no typed dependency between smile and great.
Table 7. Typed dependencies example
39. Binary Classification: valid music mention or not?
Keep ur SMILE on!
Got your new album Smile. Loved it!
SVM binary classifiers
Training set: 550 valid spots (positive examples, +1); 550 invalid spots (negative examples, -1)
Test set: 120 valid spots, 458 invalid spots
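A minimal sketch of this classification step, assuming scikit-learn and using hand-rolled stand-ins for the word-level and domain features from Table 6 (the POS and typed-dependency features would come from a parser and are omitted here):

```python
# Sketch of the spot classifier: word-level + domain features -> linear SVM.
# SENTIMENT_WORDS / DOMAIN_WORDS are illustrative stand-ins for the gazetteers
# described later in the deck.
from sklearn.svm import LinearSVC

SENTIMENT_WORDS = {"love", "loved", "awesome", "terrible"}
DOMAIN_WORDS = {"album", "song", "track", "concert", "music"}

def features(comment, spot):
    tokens = comment.lower().replace(".", " ").split()
    return [
        float(spot.isupper()),                             # s.allCaps
        float(spot[0].isupper()),                          # s.firstCaps
        float(f'"{spot}"' in comment),                     # s.inQuotes
        float(any(t in SENTIMENT_WORDS for t in tokens)),  # s.Csent
        float(any(t in DOMAIN_WORDS for t in tokens)),     # s.Cdom
    ]

# Toy training data: (comment, spotted string, label) with +1 = valid mention.
train = [
    ("Got your new album Smile. Loved it!", "Smile", 1),
    ("Keep your SMILE on!", "SMILE", -1),
]
clf = LinearSVC().fit([features(c, s) for c, s, _ in train], [y for _, _, y in train])
print(clf.predict([features("Loved the song Smile", "Smile")]))
```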
41–43. Efficacy of Features
Precision intensive: 42-91
Mid-range: 78-50
Recall intensive: 90-35 (identified 90% of valid spots, eliminated 35% of invalid spots)
- Feature combinations were the most stable and best performing
- Gazetteer-matched domain words and sentiment expressions proved to be useful
PR tradeoffs: choose feature combinations depending on the end application
44–46. How did we do overall?
[Chart: precision/recall (%) vs. classifier accuracy splits (valid-invalid), from the naive spotter across the split settings; series: precision for Lily Allen, precision for Rihanna, precision for Madonna, and recall]
Step 1: Spot with the naive spotter over a restricted knowledge base - Madonna's track spots: 23% precision
Step 2: Disambiguate using NL features (SVM classifier) - Madonna's track spots: ~60% precision at the 42-91 "all features" setting
47–48. Lessons Learned..
Using domain knowledge, especially for cultural entities, is non-trivial
Computationally intensive NLP is prohibitive
Two-stage approach: NL learners over dictionary-based naive spotters
- allows the more time-intensive NL analytics to run on less than the full set of input data
- gives control over the precision and recall of the final result
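A minimal sketch of that two-stage arrangement: the cheap dictionary spotter filters the stream, and the expensive NL analytics run only on comments that produced a spot (naive_spot and nl_disambiguate are placeholders):

```python
def two_stage(comments, naive_spot, nl_disambiguate):
    """Stage 1: cheap dictionary-based spotting over every comment.
    Stage 2: run the expensive NL disambiguation only on comments with a spot."""
    results = []
    for comment in comments:
        spots = naive_spot(comment)
        if spots:
            results.append((comment, nl_disambiguate(comment, spots)))
    return results
```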
49. Sentiment Expressions
NER allows us to track and
trend online mentions
Popularity assessment: sentiment
associated with the target entity
Exploratory Project: what kinds of
sentiments, distribution of positive and
negative expressions..
50. Observations, Challenges
Negations, sarcasm, and refutations are rare
Short sentences: OK to overlook target attachment
Target demographic: Teenagers!
Slang expressions: Wicked, tight, “the shit”,
tropical, bad..
What are common positive and negative
expressions this demographic uses?
Urbandictionary.com to the rescue!
51. Urbandictionary.com
Slang
Dictionary!
Related tags
bad appears with
terrible and good!
Glossary
52–54. Mining a Slang Sentiment Dictionary
Map frequently used sentiment expressions to their orientations
Seed the dictionary with positively and negatively oriented entries: good, terrible..
Step 1: For each seed, query UD and get its related tags
  Good -> awesome, sweet, fun, bad, rad..
  Terrible -> bad, horrible, awful, shit..
55–56. Mining a Slang Sentiment Dictionary
Good -> awesome, sweet, fun, bad, rad
Step 2: Calculate the semantic orientation of each related tag [Turney]
  SO(rad) = PMI_UD(rad, "good") - PMI_UD(rad, "terrible")
PMI over the Web vs. UD: SO_web(rock) = -0.112, SO_UD(rock) = 0.513
57. Mining a Slang Sentiment Dictionary
Step 3: Record the orientation; add the tag to the dictionary
  {good, rad}
  {terrible}
Continue for the other related tags and all new entries..
Mined dictionary: 300+ entries (manually verified for noise)
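A minimal sketch of this mining loop, assuming per-word occurrence and co-occurrence counts from Urban Dictionary are already available; fetch_related_tags and count are hypothetical stand-ins for the UD queries:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw (co-)occurrence counts."""
    if count_xy == 0 or count_x == 0 or count_y == 0:
        return float("-inf")
    return math.log2((count_xy * total) / (count_x * count_y))

def semantic_orientation(tag, count, total):
    """SO(tag) = PMI(tag, "good") - PMI(tag, "terrible")  [Turney-style]."""
    return (pmi(count((tag, "good")), count((tag,)), count(("good",)), total)
            - pmi(count((tag, "terrible")), count((tag,)), count(("terrible",)), total))

def mine_dictionary(seed_orientations, fetch_related_tags, count, total, threshold=0.0):
    """Expand seeds (e.g. {"good": "pos", "terrible": "neg"}) into a slang
    dictionary of {tag: "pos"/"neg"}; entries would still be verified manually."""
    mined = dict(seed_orientations)
    frontier = list(seed_orientations)
    while frontier:
        word = frontier.pop()
        for tag in fetch_related_tags(word):   # e.g. good -> awesome, sweet, rad, ...
            if tag in mined:
                continue
            so = semantic_orientation(tag, count, total)
            mined[tag] = "pos" if so > threshold else "neg"
            frontier.append(tag)                # continue for all new entries
    return mined
```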
58–60. Sentiment Expressions in Text
"Your new album is wicked"
Shallow NL parse: Your/PRP$ new/JJ album/NN is/VBZ wicked/JJ
Verbs, adjectives [Hatzivassiloglou 97]
Look up the word in the mined dictionary and record its orientation
Why don't we just spot using the dictionary!?
  fuuunn! (coverage issues)
  "super man is in town"; "i heard amy at tropical cafe yday.."
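A minimal sketch of that lookup step, assuming NLTK's default tokenizer and POS tagger are available (the MINED_DICT entries are illustrative):

```python
# Sketch: shallow parse, keep adjectives/verbs, look each up in the mined dictionary.
# Requires the NLTK tokenizer and tagger data (nltk.download for punkt and the
# averaged perceptron tagger).
import nltk

MINED_DICT = {"wicked": "pos", "tight": "pos", "terrible": "neg"}  # illustrative entries

def sentiment_spots(comment, dictionary=MINED_DICT):
    tagged = nltk.pos_tag(nltk.word_tokenize(comment.lower()))
    return [(word, dictionary[word])
            for word, tag in tagged
            if (tag.startswith("JJ") or tag.startswith("VB")) and word in dictionary]

print(sentiment_spots("Your new album is wicked"))  # -> [('wicked', 'pos')]
```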
61. Dictionary Coverage Issues
Resort to the corpus (transliterations)
Co-occurrence in the corpus (e.g. "tight" with top-scoring dictionary entries):
  "tight-awesome": 456, "tight-sweet": 136, "tight-hot": 429
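One way to read this fallback: an out-of-dictionary word can inherit an orientation from the dictionary entries it co-occurs with in the comment corpus. A small sketch under that assumption:

```python
from collections import Counter

def cooccurrence_vote(word, corpus, dictionary):
    """Count how often `word` shares a comment with each known dictionary entry,
    then vote on an orientation (a rough stand-in for the corpus-based fallback)."""
    votes = Counter()
    for comment in corpus:
        tokens = set(comment.lower().split())
        if word in tokens:
            for entry, orientation in dictionary.items():
                if entry in tokens:
                    votes[orientation] += 1
    return votes.most_common(1)[0][0] if votes else None

corpus = ["this show was tight and awesome", "tight set, so sweet", "tight and hot!!"]
print(cooccurrence_vote("tight", corpus, {"awesome": "pos", "sweet": "pos", "hot": "pos"}))
```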
62. Miscellaneous
Presence of entity improves confidence of
identified sentiment
Short sentences, high coherence
Associate sentiment with all entities
If no entity spotted, associate with artist (whose
page we are on)
“you are awesome!” on Madonna’s page
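A minimal sketch of those two association rules:

```python
def attach_sentiments(entities, sentiments, page_artist):
    """Associate every spotted sentiment with every spotted entity; if no entity
    was spotted, fall back to the artist whose page the comment appears on."""
    targets = entities or [page_artist]
    return [(target, sentiment) for target in targets for sentiment in sentiments]

# "you are awesome!" on Madonna's page -> [('Madonna', ('awesome', 'pos'))]
print(attach_sentiments([], [("awesome", "pos")], "Madonna"))
```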
63–65. Evaluations - Identifying Orientation (Dictionary + Mined)
Table 3 shows the accuracy of the annotator and illustrates the importance of using transliterations in such corpora.
Annotator Type                   Precision   Recall
Positive Sentiment               0.81        0.9
Negative Sentiment               0.5         1.0
PS excluding transliterations    0.84        0.67
Table 3. Transliteration accuracy impact
These results indicate that the syntax and semantics of sentiment expression in informal text are difficult to determine.
Negative sentiments: slang orientations out of context!
  "you were terrible last night at SNL but I still <3 you!"
Precision excluding corpus transliterations: incorrect NL parses, selectivity
67. Crucial Finding
Forget about the sentiment annotator! Just use entity mention volumes!
[Chart: sentiment expression frequencies in 60k comments over 26 weeks - negative expressions are rare (< 4%) compared to positive]
68. Spam, Off-topic, Promotions
Special type of spam: unrelated to artists’ work
Paul McCartney’s divorce; Rihanna’s Abuse; Madge
and Jesus
Self-Promotions
“check out my new cool sounding tracks..”
music domain, similar keywords, harder to tell apart
Standard Spam
“Buy cheap cellphones here..”
69. Observations (60k comments, 26 weeks)
SPAM: 80% have 0 sentiments
  CHECK US OUT!!! ADD US!!!
  PLZ ADD ME!
  IF YOU LIKE THESE GUYS ADD US!!!
NON-SPAM: 50% have at least 3 sentiments
  Your music is really bangin!
  You're a genius! Keep droppin bombs!
  u doin it up 4 real. i really love the album. keep doin wat u do best. u r so bad!
  hey just hittin you up showin love to one of chi-town's own. MADD LOVE.
Common 4-grams pulled out several 'spammy' phrases
Phrases -> spam, non-spam buckets
The spam annotator should be aware of other annotator results!
Fig. 8 Examples of sentiment in spam and non-spam comments
70. Spam Elimination
Aggregate function
Phrases indicative of spam (4-way annotator agreement)
Rules over previous annotator results:
  if a spam phrase, an artist/track name, and a positive sentiment were all spotted, the comment is probably not spam.
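A minimal sketch of that rule aggregation, assuming the earlier annotators have already produced their tags (SPAM_PHRASES is illustrative):

```python
SPAM_PHRASES = ["add me", "check us out", "add us"]  # illustrative 4-gram-derived phrases

def is_spam(comment, entities, sentiments, spam_phrases=SPAM_PHRASES):
    """Rule over previous annotator results: a spam phrase alone suggests spam,
    but a spotted artist/track plus a positive sentiment overrides it."""
    text = comment.lower()
    has_spam_phrase = any(phrase in text for phrase in spam_phrases)
    has_positive = any(orientation == "pos" for _, orientation in sentiments)
    if has_spam_phrase and entities and has_positive:
        return False          # probably genuine fan chatter, not spam
    return has_spam_phrase
```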
71. Performance
Annotator Type   Precision   Recall
Spam             0.76        0.8
Non-Spam         0.83        0.88
Table 4. Spam annotator performance
Directly proportional to previous annotator results: incorrect entity and sentiment spots
If the comment did not have a spam pattern and the first annotator spotted incorrect tracks, the spam annotator interpreted the comment to be related to music and classified it as non-spam.
Some self-promotions are just clever!
  "like umbrella, ull love this song..."
72. Do not Confuse Activity with Popularity!
Artist popularity ranking changed dramatically after excluding spam!
% Spam, 60k comments over 26 weeks:
  Gorillaz 54%      Placebo 39%
  Coldplay 42%      Amy Winehouse 38%
  Lily Allen 40%    Lady Sovereign 37%
  Keane 40%         Joss Stone 36%
Some artists had at least half as many spam as non-spam comments on their page; this level of noise could significantly impact the ordering of artists if it is not accounted for.
74. Hypercube, List Projection
Exploring dimensions of popularity
Data hypercube (DB2) built from:
  structured data in posts (user demographics: location, age, gender; timestamp)
  unstructured data (entity, sentiment, spam flag)
Intersection of dimensions => non-spam comment counts
75. Hypercube to One-Dimensional List
Each comment is annotated with a series of tags from unstructured and structured data; the resulting tuple is placed into a star schema in which the primary measure is a function

  M : (Age, Gender, Location, Time, Artist, ...) -> N    (1)

We can aggregate and analyze the hypercube using a variety of multi-dimensional data operations on it to derive what are essentially custom popular lists for particular musical topics. In addition to the traditional billboard "Top Artist" lists, we can slice and project (marginalize dimensions), e.g. "What is hot in New York for 19 year old males?" and "Who are the most popular artists in San Francisco?" These translate to:

  L1(X) = Σ_{T,...} M(A = 19, G = M, L = "New York City", X, ...)    (2)
  L2(X) = Σ_{T,A,G,...} M(A, G, L = "San Francisco", X, ...)          (3)

where X = name of the artist, T = timestamp, A = age of the commenter, G = gender, L = location. The result is used to sort the artists, tracks, etc. by relevance. Storing the data this way makes it easy to examine rankings over various time intervals, weight various dimensions differently, etc.
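A minimal sketch of the projection step over a flat table of annotated comments, with pandas standing in for the DB2 hypercube:

```python
import pandas as pd

# Each row = one non-spam comment, tagged by the annotators (illustrative data).
comments = pd.DataFrame([
    {"artist": "Soulja Boy", "age": 19, "gender": "M", "location": "New York City"},
    {"artist": "Rihanna",    "age": 19, "gender": "M", "location": "New York City"},
    {"artist": "Soulja Boy", "age": 23, "gender": "F", "location": "San Francisco"},
])

def top_list(df, **restrictions):
    """Slice on the restricted dimensions, marginalize the rest, and rank artists
    by non-spam comment count (the measure M summed over the free dimensions)."""
    for dim, value in restrictions.items():
        df = df[df[dim] == value]
    return df.groupby("artist").size().sort_values(ascending=False)

# L1: "What is hot in New York for 19 year old males?"
print(top_list(comments, age=19, gender="M", location="New York City"))
# L2: "Who are the most popular artists in San Francisco?"
print(top_list(comments, location="San Francisco"))
```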
77. The Word on the Street
Billboard's Top 50 Singles chart during the week of Sept 22-28, 2007
(Unique) Top 45 MySpace artist pages
Crawl, annotate, build the hypercube, project lists
Billboard.com        MySpace Analysis
Soulja Boy           T.I.
Kanye West           Soulja Boy
Timbaland            Fall Out Boy
Fergie               Rihanna
J. Holiday           Keyshia Cole
50 Cent              Avril Lavigne
Keyshia Cole         Timbaland
Nickelback           Pink
Pink                 50 Cent
Colbie Caillat       Alicia Keys
Table 8. Billboard's Top Artists vs. our generated list (showing Top 10)
78. The Word on the Street
* Top artists appear in both lists; several overlaps
* Artists with a long history/body of work vs. 'up and coming' artists
* Predictive power of MySpace - Billboard next week looked a lot like MySpace this week..
* Teenagers are big music influencers [MediaMark 2004]
79. Casual Preference Poll Results
"Which list more accurately reflects the artists that were more popular last week?"
75 participants
Overall: 2:1 preference for the MySpace list
Younger age groups (8-15 yrs): 6:1
Challenging traditional polling methods!
Table 7. Annotation statistics
  38% of total comments were spam
  61% of total comments had positive sentiments
  4% of total comments had negative sentiments
  35% of total comments had no identifiable sentiments
80. Lessons Learned, Miles to Go
Informal (teen-authored) text is not your average blog content..
Quality checking happens not at the level of day-to-day spot/comment precision, but at the system level:
- are we missing a source or artist? is the crawl behaving? adjudication techniques for multiple data sources..
Leveraging social intelligence for business intelligence is difficult - validation is key, with continuous fine-tuning alongside experts