CCS335 – Neural Networks and Deep Learning Laboratory: Lab Complete Record
Sentiments Improvement
1. Sentiment improvements
Proposed ideas:
Part I. Data preprocessing
Part II. PMI-IR approach
Team members:
Denys Astanin
Mykhailo Kozik
2. Data preprocessing
Raw data
Preprocessed data
Narrowing long words
Emoticons decoding
Spell correction
Abbreviations decoding
Tags detection
:'( → cry
@Alex nice photo #photoworld
goooood → good
lol → laughing out loud
I am shure that is realy exsellent plece → I am sure that is really excellent place
3. Narrowing long words
Using a regexp, narrow any run of more than 2 identical letters in a word down to exactly 2
goooooood → good (correct narrowing)
baaaaaad → baad (incorrect narrowing, but will be corrected with spell-checker)
This hotel so goooooooood! → This hotel so good! (NEUTRAL → POSITIVE)
This place not coooooool! → This place not cool! (NEUTRAL → NEGATIVE)
Try this regexp: http://regexr.com?30abm
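A minimal Python sketch of this narrowing step (the function name is just for illustration; the actual regexp is the one linked above):

```python
import re

# Any character repeated 3 or more times is collapsed to exactly 2 copies.
_LONG_RUN = re.compile(r"(.)\1{2,}")

def narrow(word: str) -> str:
    """'goooooood' -> 'good'; 'baaaaaad' -> 'baad' (fixed later by the spell-checker)."""
    return _LONG_RUN.sub(r"\1\1", word)
```

Legitimate double letters ('cool', 'good') are left untouched, since only runs of three or more match.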
4. Narrowing long words. Examples
dancing with the stars and two and a half men toniiiight
@BrunoMars you were AMAZINGGGGGG at the vma's need to see you!
RT @BriannaStull13: I hateeeeeee pandora ads....
It was sooo badddd
Woooooooooooow I Like that, very nice and big like
Thts cooool
i hack any thing but for moneyyyy
who know how hacked one add fb??? pleaseeee
5. Narrowing long words. Performance
              10K          100K          1M
Long words    83.13 msec   828.30 msec   8370.97 msec (~8 sec.)
Normal words  31.92 msec   275.34 msec   2763.77 msec (~3 sec.)
Mixed words*  35.23 msec   339.23 msec   3370.31 msec (~3 sec.)
* assuming that 1% of words are long words
6. Emoticons decoding
Using a map of smile meanings, convert each smile into the word it stands for
<3 → love
:( → sad
Look at her http://t.co/12345 <3 → Look at her http://t.co/12345 love (NEUTRAL → POSITIVE)
I will be out of work tomorrow :( → I will be out of work tomorrow sad (NEUTRAL → NEGATIVE)
List of emoticons: http://en.wikipedia.org/wiki/List_of_emoticons
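A sketch of the map-based decoding, assuming the naive one-pass-per-smile replacement measured on the next slide (the ":)" entry and its wording are illustrative; the other mappings are from the slides):

```python
# A small emoticon map; a full one would be built from the Wikipedia list above.
SMILES = {"<3": "love", ":(": "sad", ":)": "happy", ":'(": "cry"}

def decode_smiles(text: str) -> str:
    # Naive approach: one full pass over the text per smile in the map.
    for smile, word in SMILES.items():
        text = text.replace(smile, word)
    return text
```

With plain str.replace, overlapping smiles need careful ordering (longest first), and the cost grows with the size of the map.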
7. Emoticons decoding. Examples
Awww He is Too cute :) Thanks bae next weekend..
@LenovoDoTour I have missed these two days in Belgrade :(
Katie Holmes <3 #VMA
ahaha just to warn you!! ;)
it's amazing how Oracle can do so much! I'm loving it <3
please someone help me i need to finish this im out of time!! thank!! :D
Boa noite, viajantes! Menos um diazinho nessa semana =)
:-( don't have my Mcard number required to fill out form
8. Emoticons decoding. Performance
               10K           100K                    1M
1 smile list   45.03 msec    444.62 msec             4426.74 msec (~4 sec.)
5 smile list   189.87 msec   1304.10 msec (~1 sec.)  12355.37 msec (~12 sec.)
10 smile list  227.26 msec   2325.23 msec (~2 sec.)  26954.26 msec (~27 sec.)
Performance degrades as the smile list grows because each smile is replaced in a separate pass over the text. Better results can be achieved with state machines or regexps.
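The regexp variant suggested above can be sketched as a single compiled alternation, so the whole map is applied in one pass regardless of its size (map entries here are illustrative):

```python
import re

SMILES = {"<3": "love", ":(": "sad", ":-(": "sad", ":)": "happy"}

# One compiled alternation over all smiles, longest first so longer
# smiles take precedence at the same position.
_PATTERN = re.compile("|".join(
    re.escape(s) for s in sorted(SMILES, key=len, reverse=True)))

def decode_smiles_fast(text: str) -> str:
    # A single pass over the text, whatever the map size.
    return _PATTERN.sub(lambda m: SMILES[m.group(0)], text)
```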
9. Abbreviations decoding
Using a map of abbreviations, convert each abbreviation into the words it stands for
lol → laughing out loud
thx → thanks
Got it! lol → Got it! laughing out loud (NEUTRAL → POSITIVE)
I was DWI, haha → I was driving while intoxicated, haha (NEUTRAL → NEGATIVE)
List of abbreviations: http://www.smartdefine.org/internet_slang/abbreviations/r
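A token-based sketch for abbreviations (the "dwi" entry is illustrative). Unlike smiles, abbreviations are whole words, so word boundaries can be used:

```python
import re

ABBRS = {"lol": "laughing out loud", "thx": "thanks",
         "dwi": "driving while intoxicated"}

_WORD = re.compile(r"\b\w+\b")

def decode_abbrs(text: str) -> str:
    # Replace whole tokens only, case-insensitively, so 'lol'
    # inside 'lollipop' is left alone.
    return _WORD.sub(lambda m: ABBRS.get(m.group(0).lower(), m.group(0)), text)
```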
10. Abbreviation decoding. Examples
No offense though.. Lol
O lmao!
http://t.co/Evvh4hj ROFL
JFYI #blackcarpet
Nice code LOL
TNX you Rose! We appreciate it!
OMG, FML!
Wait me, i will be AFK
11. Emoticons and Abbreviations
Alternative approach
Abbreviations, acronyms, and slang words are already parsed as tokens
Parse smiles as tokens in FX as well
Then "Tune sentiments" can be applied to these tokens
12. Spell correction
Perform spell correction on data before sentiment calculation
I lov this hotel! → I love this hotel! (NEUTRAL → POSITIVE)
They have terryble servic → They have terrible service (NEUTRAL → NEGATIVE)
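The slides do not show the corrector itself; a common edit-distance-1 scheme in the style of Norvig's spell corrector (consistent with the O(C·n) candidate count quoted on the performance slide) looks roughly like this, with a toy word-frequency table standing in for real train data:

```python
from collections import Counter

# Toy word-frequency table; real counts would come from train data.
WORDS = Counter({"love": 100, "terrible": 30, "service": 50, "this": 200})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All candidates one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Keep known words as-is; otherwise pick the most frequent 1-edit candidate.
    if word in WORDS:
        return word
    candidates = [w for w in edits1(word) if w in WORDS]
    return max(candidates, key=WORDS.get) if candidates else word
```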
13. Spell correction. Examples
i hope @ladygaga will take some rest now becauce of...
But its still also hilarioouss
Shoukd i wast my money?
Business eviroment
It's impossibru!
I like dansing! <3
You can dowload the data from http://to.download/file
Coleguaues, lets keep it clean.
17. Spell correction. Accuracy
          Test data 1   Test data 2
1 edit    61.8%         67.2%
2 edits   71.2%         74.1%
Test data 1: Wikipedia – Common misspelled words (~4k)
http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines
Test data 2: Birkbeck spelling error corpus (270)
http://www.ota.ox.ac.uk/headers/0643.xml
18. Spell correction. Performance
          10K                          100K                        1M
1 edit    11350.52 msec (~11 sec.)     117261.12 msec (~2 min.)    1252882.23 msec (~20 min.)
2 edits   4300631.29 msec (~70 min.)   not run                     not run
Due to the quadratic complexity, running the larger 2-edit tests makes no sense.
Spell-check complexity per word:
Edit distance 1: O(C·n)
Edit distance 2: O(C²·n²)
* n – length of the word
** C ≈ 50
19. Spell correction. Improvements
Performance
Memoize correction (Best → O(1))
Give the user the ability to perform spell-correction
Improve train data
Coverage & Accuracy
Use more edits candidates
Use common misspelling rules
Use weights for edit operations
Take part of speech into account
Take context into account
Improve train data
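The first performance idea above, memoizing corrections, is essentially a cache keyed by the raw word. A sketch with functools.lru_cache (slow_correct is a hypothetical stand-in for the real edit-distance corrector):

```python
from functools import lru_cache

def slow_correct(word: str) -> str:
    # Hypothetical expensive correction: strip trailing 'z' characters.
    return word.rstrip("z") or word

@lru_cache(maxsize=None)
def correct_cached(word: str) -> str:
    # First call per word pays the full cost; repeats are O(1) cache hits,
    # which helps because tweets repeat the same misspellings constantly.
    return slow_correct(word)
```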
20. Tags detection
Process source-specific information (Twitter) differently
● Hashtags (#music): use the word splitter
● Usernames (@LadyGaga): just ignore them
I say to @love hello! → I say to - hello! (POSITIVE → NEUTRAL)
I mean that i #hatetwitter → I mean that i hate twitter (NEUTRAL → NEGATIVE)
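A sketch of the two rules above; split_hashtag is a stand-in for the statistical word splitter described on the next slides, and replacing usernames with "-" follows the example:

```python
import re

def split_hashtag(body: str) -> str:
    # Stand-in for the statistical word splitter; returns the body as-is.
    return body

def process_tags(text: str) -> str:
    text = re.sub(r"@\w+", "-", text)  # usernames: just ignore them
    return re.sub(r"#(\w+)", lambda m: split_hashtag(m.group(1)), text)
```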
21. Tags detection. Examples
@INevaTrustEm ok :) we need to make a date for this
Watching @danieltosh #toofunny
#lovetolaugh
#sick
Avatar, #wasteofmoney
#soft #thissucks
#happytweet
RT @BriannaStull13: what do you mean?
22. Tags detection. Words splitting
Dynamic programming
Statistical approach due to ambiguity
#orcore → [orc_ore], [or_core]
#expertsexchange → [expert_sex_change], [experts_exchange]
Train data
Dictionary (default Linux word list, ~100K words)
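The dynamic-programming splitter with a unigram model can be sketched as follows (the tiny frequency table is purely illustrative; real counts would come from the train data):

```python
from functools import lru_cache

# Tiny illustrative unigram counts; real ones come from train data.
FREQ = {"expert": 50, "experts": 20, "exchange": 40, "sex": 30, "change": 60,
        "hate": 40, "twitter": 80, "or": 90, "core": 30, "orc": 5, "ore": 5}
TOTAL = sum(FREQ.values())

def score(word):
    # Unigram probability for known words, a tiny floor for unknown ones.
    return FREQ[word] / TOTAL if word in FREQ else 1e-9

def split_tag(tag):
    """Best segmentation by product of unigram scores, via DP over suffixes."""
    @lru_cache(maxsize=None)
    def best(s):
        if not s:
            return (1.0, [])
        options = []
        for i in range(1, len(s) + 1):
            head, tail = s[:i], s[i:]
            p, words = best(tail)
            options.append((score(head) * p, [head] + words))
        return max(options)
    return best(tag)[1]
```

With these counts the statistics resolve the ambiguity toward [experts_exchange] rather than [expert_sex_change], illustrating why the train data matters.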
24. Tags detection. Performance
           100                       400                       800
Time       4019.73 msec (~4 sec.)    6429.19 msec (~6 sec.)    7897.23 msec (~8 sec.)
Accuracy   83.00%                    86.25%                    84.88%
Main problems:
● The train set often fails to resolve the ambiguity
● Dictionary filtering rejects many correct candidates
#rapnotamusic → [ra_p_not_a_music]
25. Words splitting. Improvements
Performance
Memoize splitting
Prefix tree approach
Viterbi algorithm (http://en.wikipedia.org/wiki/Viterbi_algorithm)
Improve train data
Accuracy
Use famous names, geographic locations, slang, abbreviations,
acronyms,...
Big dictionary
Improve train data (twitter-specific)
26. Preprocessing performance
Input conditions:
Data: 2.4K (incorrect) of 15.8K (total) from Omniture15K.xls file (15%)
Emoticons size: 14 most common smiles
Abbreviations size: 8 most common abbrs
Spell-correction distance: 1
Train data: big.txt
Dictionary: linux-words.txt
Results:
Sentence count: 2412
Preprocessing time: 29214.88 msec (~29 sec.)
Number of corrected sentences: 368
Percent of corrected to incorrect data: 15.28%
Percent of corrected to total data: 2.33%