Sentiment improvements 
Proposed ideas: 
Part I. Data preprocessing 
Part II. PMI-IR approach 
Team members: 
Denys Astanin 
Mykhailo Kozik
Data preprocessing 
Raw data → Preprocessed data 
● Narrowing long words: goooood → good 
● Emoticons decoding: :'( → cry 
● Spell correction: I am shure that is realy exsellent plece → I am sure that is really excellent place 
● Abbreviations decoding: lol → laughing out loud 
● Tags detection: @Alex nice photo #photoworld
Narrowing long words 
Use a regexp to narrow any run of more than 2 repeated letters in a word down to just 2 
goooooood → good (correct narrowing) 
baaaaaad → baad (incorrect narrowing, but will be corrected by the spell-checker) 
This hotel so goooooooood! (NEUTRAL) → This hotel so good! (POSITIVE) 
This place not coooooool! (NEUTRAL) → This place not cool! (NEGATIVE) 
Try this regexp: http://regexr.com?30abm
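The narrowing rule can be sketched in a few lines. This is a minimal Python sketch, not the regexp behind the link above (the original work was done in Clojure); the name `narrow_long_words` and the restriction to ASCII letters are assumptions made for illustration.

```python
import re

# Assumption: only ASCII letters are narrowed, so "!!!" or "..." stay untouched.
_REPEATED_LETTER = re.compile(r"([A-Za-z])\1{2,}")

def narrow_long_words(text):
    """Collapse every run of 3 or more identical letters down to exactly 2."""
    return _REPEATED_LETTER.sub(r"\1\1", text)

print(narrow_long_words("This hotel so goooooooood!"))  # This hotel so good!
print(narrow_long_words("baaaaaad"))                    # baad
```

The backreference is case-sensitive, so "AMAZINGGGGGG" becomes "AMAZINGG" and is left for the spell-checker, as the slide describes for "baad".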
Narrowing long words. Examples 
dancing with the stars and two and a half men toniiiight 
@BrunoMars you were AMAZINGGGGGG at the vma's need to see you! 
RT @BriannaStull13: I hateeeeeee pandora ads.... 
It was sooo badddd 
Woooooooooooow I Like that, very nice and big like 
Thts cooool 
i hack any thing but for moneyyyy 
who know how hacked one add fb??? pleaseeee
Narrowing long words. Performance 
                10K           100K           1M 
Long words      83.13 msec    828.30 msec    8370.97 msec (~8 sec.) 
Normal words    31.92 msec    275.34 msec    2763.77 msec (~3 sec.) 
Mixed words*    35.23 msec    339.23 msec    3370.31 msec (~3 sec.) 
* assume that 1% of words are long words
Emoticons decoding 
Use a map of smile meanings to convert each smile to the word it stands for 
<3 → love 
:( → sad 
Look at her http://t.co/12345 <3 (NEUTRAL) → Look at her http://t.co/12345 love (POSITIVE) 
I will be out of work tomorrow :( (NEUTRAL) → I will be out of work tomorrow sad (NEGATIVE) 
List of emoticons: http://en.wikipedia.org/wiki/List_of_emoticons
Emoticons decoding. Examples 
Awww He is Too cute :) Thanks bae next weekend.. 
@LenovoDoTour I have missed these two days in Belgrade :( 
Katie Holmes <3 #VMA 
ahaha just to warn you!! ;) 
it's amazing how Oracle can do so much! I'm loving it <3 
please someone help me i need to finish this im out of time!! thank!! :D 
Boa noite, viajantes! Menos um diazinho nessa semana =) 
:-( don't have my Mcard number required to fill out form
Emoticons decoding. Performance 
                 10K            100K                      1M 
1 smile list     45.03 msec     444.62 msec               4426.74 msec (~4 sec.) 
5 smile list     189.87 msec    1304.10 msec (~1 sec.)    12355.37 msec (~12 sec.) 
10 smile list    227.26 msec    2325.23 msec (~2 sec.)    26954.26 msec (~27 sec.) 
Performance degrades as the smile list grows because of the method used to perform 
replacements. Better results can be achieved using state machines or regexps
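One way to follow the regexp suggestion is to compile the whole smile map into a single alternation, so each text is scanned once regardless of how many smiles are listed. This is a hedged Python sketch rather than the authors' Clojure code; `build_decoder` is a hypothetical helper, and the map contains only the three smiles that appear in these slides.

```python
import re

# Only mappings shown in the slides; a real list would be much longer.
EMOTICONS = {"<3": "love", ":(": "sad", ":'(": "cry"}

def build_decoder(mapping):
    """Return a function that replaces every key of `mapping` in one pass."""
    # Longest keys first, so a longer smile wins over a shorter one that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)

decode_emoticons = build_decoder(EMOTICONS)
print(decode_emoticons("I will be out of work tomorrow :("))
```

Whether this is actually faster than repeated string replacement on the data used in the table would need to be measured; the sketch only shows the shape of the approach.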
Abbreviations decoding 
Use a map of abbreviations to convert each abbreviation to the words it stands for 
lol → laughing out loud 
thx → thanks 
Got it! lol (NEUTRAL) → Got it! laughing out loud (POSITIVE) 
I was DWI, haha (NEUTRAL) → I was driving while intoxicated, haha (NEGATIVE) 
List of abbreviations: http://www.smartdefine.org/internet_slang/abbreviations/r
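A map-based decoder along these lines could look as follows. It is a minimal Python sketch under stated assumptions: lookup is done per alphabetic token and is case-insensitive, and the map holds only the three abbreviations shown above. The name `decode_abbreviations` is hypothetical.

```python
import re

# Only abbreviations shown in the slides; keys are stored in lower case.
ABBREVIATIONS = {
    "lol": "laughing out loud",
    "thx": "thanks",
    "dwi": "driving while intoxicated",
}

_WORD = re.compile(r"[A-Za-z]+")

def decode_abbreviations(text, mapping=ABBREVIATIONS):
    """Replace each alphabetic token found in `mapping`, ignoring case."""
    return _WORD.sub(lambda m: mapping.get(m.group(0).lower(), m.group(0)), text)

print(decode_abbreviations("I was DWI, haha"))
```

Because matching is per token, an abbreviation embedded in a longer word is left alone, but one that appears as a separate piece of a URL would still be expanded; URLs would need to be protected first.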
Abbreviation decoding. Examples 
No offense though.. Lol 
O lmao! 
http://t.co/Evvh4hj ROFL 
JFYI #blackcarpet 
Nice code LOL 
TNX you Rose! We appreciate it! 
OMG, FML! 
Wait me, i will be AFK
Emoticons and Abbreviations 
● Alternative approach 
  ● Abbreviations, acronyms, slang words are already parsed as tokens 
  ● Parse smiles as tokens in FX as well 
  ● Now we can use "Tune sentiments" on these tokens
Spell correction 
Perform spell correction on the data before sentiment calculation 
I lov this hotel! (NEUTRAL) → I love this hotel! (POSITIVE) 
They have terryble servic (NEUTRAL) → They have terrible service (NEGATIVE)
Spell correction. Examples 
i hope @ladygaga will take some rest now becauce of... 
But its still also hilarioouss 
Shoukd i wast my money? 
Business eviroment 
It's impossibru! 
I like dansing! <3 
You can dowload the data from http://to.download/file 
Coleguaues, lets keep it clean.
Spell correction. Edit distance 
● Edit types: 
  ● Deletion: beauetiful → beautiful 
  ● Insertion: speling → spelling 
  ● Substitution: performanse → performance 
  ● Swapping: yaer → year 
● Examples 
unsucesful → unsuccesful → unsuccessful (2 edits) 
wardoub → wardroub → wardrobu → wardrobe (3 edits)
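The edit counts above can be checked with a small distance function. The sketch below is a Python illustration using the optimal string alignment variant of Damerau-Levenshtein distance, which counts the four edit types listed on this slide; the slides do not say which variant the authors used, so that choice is an assumption.

```python
def edit_distance(a, b):
    """Number of deletions, insertions, substitutions and adjacent swaps
    needed to turn `a` into `b` (optimal string alignment distance)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap
    return d[-1][-1]

print(edit_distance("unsucesful", "unsuccessful"))  # 2
print(edit_distance("wardoub", "wardrobe"))         # 3
```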
Spell correction. Algorithm 
● Peter Norvig's spelling corrector 
  ● Bayes rule approach 
  ● Train data 
  ● Simple implementation 
  ● High performance 
  ● Low accuracy 
More theory: http://norvig.com/spell-correct.html 
Train data: http://norvig.com/big.txt
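For orientation, here is a condensed Python sketch in the spirit of Norvig's corrector: count words in the train data, generate candidates within one or two edits, and pick the known candidate with the highest count. It is not the authors' Clojure implementation, and the one-sentence train text is a toy stand-in for big.txt, used only so the example is self-contained.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def train(text):
    """Word frequencies of the train text (lower-cased ASCII words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def edits1(word):
    """All strings one deletion, swap, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)

def known(candidates, counts):
    return {w for w in candidates if w in counts}

def correct(word, counts):
    """Prefer the word itself, then 1-edit, then 2-edit candidates."""
    candidates = (known([word], counts)
                  or known(edits1(word), counts)
                  or known((e2 for e1 in edits1(word) for e2 in edits1(e1)), counts)
                  or {word})
    return max(candidates, key=lambda w: counts[w])

# Toy train data (assumption); the slides train on big.txt.
COUNTS = train("I love this hotel they have terrible service")
print(correct("terryble", COUNTS))
```

The two-edit step is evaluated only when no one-edit candidate is known, since it is by far the most expensive part.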
Spell correction. Coverage 
Edit1 + Edit2 covers 98%!!!
Spell correction. Accuracy 
           Test data 1    Test data 2 
1 edit     61.8%          67.2% 
2 edits    71.2%          74.1% 
Test data 1: Wikipedia – Common misspelled words (~4k) 
http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines 
Test data 2: Birkbeck spelling error corpus (270) 
http://www.ota.ox.ac.uk/headers/0643.xml
Spell correction. Performance 
           10K                           100K                        1M 
1 edit     11350.52 msec (~11 sec.)      117261.12 msec (~2 min.)    1252882.23 msec (~20 min.) 
2 edits    4300631.29 msec (~70 min.)    Due to quadratic complexity these tests make no sense 
Spell-check complexity for word: 
Edit distance 1: O(C·n) 
Edit distance 2: O(C²·n²) 
* n – length of word 
** C ~= 50
Spell correction. Improvements 
● Performance 
  ● Memoize correction (Best → O(1)) 
  ● Let the user decide whether to perform spell correction 
  ● Improve train data 
● Coverage & Accuracy 
  ● Use more edit candidates 
  ● Use common misspelling rules 
  ● Use weights for edit operations 
  ● Hit part of speech 
  ● Hit context 
  ● Improve train data
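The memoization idea can be illustrated with Python's standard `functools.lru_cache`: once a word has been corrected, repeated occurrences cost a cache lookup instead of a full candidate search. This is only a sketch; `slow_correct` is a hypothetical stand-in for a real corrector, and the `calls` list exists purely to show that the second lookup never reaches it.

```python
from functools import lru_cache

calls = []  # records every time the underlying corrector really runs

def slow_correct(word):
    """Hypothetical stand-in for an expensive spell corrector."""
    calls.append(word)
    return {"lov": "love"}.get(word, word)

def memoized(correct_fn, maxsize=None):
    """Wrap a corrector so each distinct word is corrected only once."""
    return lru_cache(maxsize=maxsize)(correct_fn)

fast_correct = memoized(slow_correct)
fast_correct("lov")
fast_correct("lov")   # served from the cache
print(len(calls))     # 1
```

An unbounded cache trades memory for speed; passing a `maxsize` keeps memory bounded at the cost of occasional recomputation.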
Tags detection 
Process source-specific information (Twitter) differently 
● Hashtag (#music): use a word splitter 
● Username (@LadyGaga): just ignore it 
I say to @love hello! (POSITIVE) → I say to - hello! (NEUTRAL) 
I mean that i #hatetwitter (NEUTRAL) → I mean that i hate twitter (NEGATIVE)
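The two rules can be sketched with a pair of regexps. In this Python illustration, usernames are replaced with "-" as in the example above, and hashtags are handed to a pluggable splitter. `process_tags` and its `split_hashtag` parameter are hypothetical names; by default the hashtag text is passed through unsplit.

```python
import re

_USERNAME = re.compile(r"@\w+")
_HASHTAG = re.compile(r"#(\w+)")

def process_tags(text, split_hashtag=lambda tag: tag):
    """Neutralise @usernames and expand #hashtags via `split_hashtag`."""
    text = _USERNAME.sub("-", text)
    return _HASHTAG.sub(lambda m: split_hashtag(m.group(1)), text)

print(process_tags("I say to @love hello!"))  # I say to - hello!

# With a (hypothetical) splitter that knows this one hashtag:
toy_splitter = lambda tag: {"hatetwitter": "hate twitter"}.get(tag, tag)
print(process_tags("I mean that i #hatetwitter", toy_splitter))
```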
Tags detection. Examples 
@INevaTrustEm ok :) we need to make a date for this 
Watching @danieltosh #toofunny 
#lovetolaugh 
#sick 
Avatar, #wasteofmoney 
#soft #thissucks 
#happytweet 
RT @BriannaStull13: what do you mean?
Tags detection. Words splitting 
● Dynamic programming 
● Statistical approach due to ambiguity 
#orcore → [orc_ore], [or_core] 
#expertsexchange → [expert_sex_change], [experts_exchange] 
● Train data 
● Dictionary (default linux ~100K words)
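A statistical splitter of this kind can be sketched as a memoized dynamic program over split positions, scoring each candidate segmentation by the sum of unigram log-probabilities. The Python code below is an illustration only: the word counts are made-up toy values (the slides use real train data plus a dictionary), and the unknown-word penalty is an assumption.

```python
import math
from functools import lru_cache

# Toy unigram counts, invented for this example.
COUNTS = {"hate": 50, "twitter": 40, "expert": 25, "experts": 20,
          "exchange": 30, "sex": 10, "change": 35}
TOTAL = sum(COUNTS.values())

def word_logprob(word):
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Assumed penalty: unknown words get rapidly less likely with length.
    return -math.log(TOTAL) - len(word) * math.log(10)

def split_words(text):
    """Most probable segmentation of `text` under the unigram model."""
    @lru_cache(maxsize=None)
    def best(i):
        # Best (score, words) for the suffix text[i:].
        if i == len(text):
            return (0.0, ())
        options = []
        for j in range(i + 1, len(text) + 1):
            score, rest = best(j)
            options.append((word_logprob(text[i:j]) + score, (text[i:j],) + rest))
        return max(options)
    return list(best(0)[1])

print(split_words("hatetwitter"))
print(split_words("expertsexchange"))
```

With these toy counts the two-word reading of "expertsexchange" outscores the three-word one, because every extra word multiplies in another probability below 1. With different counts the outcome could flip, which is the ambiguity the slide points at.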
Tags detection. Twitter hashtags 
Twitter hashtags (~800) crawled from: 
http://hashtags.org/ 
http://kingnetforums.weebly.com/twitter-hashtags-lists.html 
http://edudemic.com/2011/10/twitter-hashtag-dictionary/ 
http://nicolehumphrey.net/60-favorite-twitter-hashtags-for-writers-clickable-list/ 
http://www.dailywritingtips.com/40-twitter-hashtags-for-writers/ 
http://greeneconomypost.com/green-twitter-hashtag-17290.htm
Tags detection. Performance 
            100                       400                       800 
Time        4019.73 msec (~4 sec.)    6429.19 msec (~6 sec.)    7897.23 msec (~8 sec.) 
Accuracy    83.00%                    86.25%                    84.88% 
Main problems: 
● The train set often fails to resolve the ambiguity problem 
● Dictionary hits filter out a lot of correct candidates 
#rapnotamusic → [ra_p_not_a_music]
Words splitting. Improvements 
● Performance 
  ● Memoize splitting 
  ● Prefix tree approach 
  ● Viterbi algorithm (http://en.wikipedia.org/wiki/Viterbi_algorithm) 
  ● Improve train data 
● Accuracy 
  ● Use famous names, geographic locations, slang, abbreviations, acronyms,... 
  ● Big dictionary 
  ● Improve train data (twitter-specific)
Preprocessing performance 
Input conditions: 
Data: 2.4K (incorrect) of 15.8K (total) from Omniture15K.xls file (15%) 
Emoticons size: 14 most common smiles 
Abbreviations size: 8 most common abbrs 
Spell-correction distance: 1 
Train data: big.txt 
Dictionary: linux-words.txt 
Results: 
Sentence count: 2412 
Preprocessing time: 29214.88 msec (~29 sec.) 
Number of corrected sentences: 368 
Percent of corrected to incorrect data: 15.28% 
Percent of corrected to total data: 2.33%
Data preprocessing. Future. 
● Sentence breaker
Environment 
● Hardware 
  ● CPU: 2 x Intel Pentium Dual T2370 @ 1.73GHz 
  ● RAM: 2.0 GB 
● Software 
  ● OS: Ubuntu 11.04 
  ● Kernel: Linux 2.6.38-13-generic 
  ● IDE: Emacs 23.2.1 
  ● Programming: Clojure 1.3

Sentiments Improvement

  • 1. Sentiment improvements Proposed ideas: Part I. Data preprocessing Part II. PMI-IR approach Team members: Denys Astanin Mykhailo Kozik
  • 2. Data preprocessing Raw data Preprocessed data Narrowing Long words Emoticons Decoding Spell Correction Abbreviations Decoding Tags Detection :'( → cry @Alex nice photo #photoworld goooood → good lol → laughing out loud I am shure that is realy exsellent plece | I am sure that is really excellent place
  • 3. Narrowing long words Using regexp narrow more than 2 duplicate letters in word to just 2 goooooood → good (correct narrowing) baaaaaad → baad (incorrect narrowing, but will be corrected with spell-checker) This hotel so goooooooood! This hotel so good! NEUTRAL POSITIVE This place not coooooool! This place not cool! NEUTRAL NEGATIVE Try this regexp: http://regexr.com?30abm
  • 4. Narrowing long words. Examples dancing with the stars and two and a half men toniiiight @BrunoMars you were AMAZINGGGGGG at the vma's need to see you! RT @BriannaStull13: I hateeeeeee pandora ads.... It was sooo badddd Woooooooooooow I Like that, very nice and big like Thts cooool i hack any thing but for moneyyyy who know how hacked one add fb??? pleaseeee
  • 5. Narrowing long words. Performance 10K 100K 1M Long words 83.13 msec 828.30 msec 8370.97 msec ~8 sec. Normal words 31.92 msec 275.34 msec 2763.77 msec ~3 sec. Mixed words* 35.23 msec 339.23 msec 3370.31 msec ~3 sec. * assume that 1% of words are long words
  • 6. Emoticons decoding Using map of smile meanings convert smile to word that it means <3 → love :( → sad Look at her http://t.co/12345 <3 Look at her http://t.co/12345 love NEUTRAL POSITIVE I will be out of work tomorrow :( I will be out of work tomorrow sad NEUTRAL NEGATIVE List of emoticons: http://en.wikipedia.org/wiki/List_of_emoticons
  • 7. Emoticons decoding. Examples Awww He is Too cute :) Thanks bae next weekend.. @LenovoDoTour I have missed these two days in Belgrade :( Katie Holmes <3 #VMA ahaha just to warn you!! ;) it's amazing how Oracle can do so much! I'm loving it <3 please someone help me i need to finish this im out of time!! thank!! :D Boa noite, viajantes! Menos um diazinho nessa semana =) :-( don't have my Mcard number required to fill out form
  • 8. Emoticons decoding. Performance. Timings for input sizes of 10K / 100K / 1M: list of 1 emoticon 45.03 / 444.62 / 4426.74 msec (~4 sec.); list of 5 emoticons 189.87 / 1304.10 (~1 sec.) / 12355.37 msec (~12 sec.); list of 10 emoticons 227.26 / 2325.23 (~2 sec.) / 26954.26 msec (~27 sec.). Performance degrades as the emoticon list grows because of the method used to perform the replacements. Better results can be achieved with state machines or regexps.
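One way to read the regexp suggestion is to compile the whole emoticon list into a single alternation, so the text is scanned once regardless of list size. This is a hedged Python sketch, not a measured improvement; `make_emoticon_decoder` and the sample map are hypothetical.

```python
import re

def make_emoticon_decoder(mapping):
    """Build a decoder that replaces all emoticons in a single regexp pass."""
    # Escape each emoticon and try longer ones first within the alternation.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def decode(text):
        return pattern.sub(lambda m: mapping[m.group(0)], text)

    return decode

decode = make_emoticon_decoder({"<3": "love", ":(": "sad", ":'(": "cry"})
print(decode("Katie Holmes <3 #VMA"))
```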
  • 9. Abbreviations decoding. Using a map of abbreviations, convert each abbreviation to the phrase it stands for: lol → laughing out loud; thx → thanks. Effect on sentiment: "Got it! lol" (NEUTRAL) → "Got it! laughing out loud" (POSITIVE); "I was DWI, haha" (NEUTRAL) → "I was driving while intoxicated, haha" (NEGATIVE). List of abbreviations: http://www.smartdefine.org/internet_slang/abbreviations/r
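Unlike emoticons, abbreviations should only match whole words, otherwise "lol" would also fire inside "lollipop". A minimal Python sketch under that assumption follows; the map contents come from the slide, while the function name and the case-insensitive lookup are illustrative choices.

```python
import re

# Hypothetical map built from the pairs shown on the slide.
ABBREVIATIONS = {
    "lol": "laughing out loud",
    "thx": "thanks",
    "dwi": "driving while intoxicated",
}

_WORD = re.compile(r"[A-Za-z]+")

def decode_abbreviations(text, mapping=ABBREVIATIONS):
    """Expand whole-word abbreviations, ignoring case; other words pass through."""
    return _WORD.sub(lambda m: mapping.get(m.group(0).lower(), m.group(0)), text)

print(decode_abbreviations("I was DWI, haha"))
```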
  • 10. Abbreviation decoding. Examples No offense though.. Lol O lmao! http://t.co/Evvh4hj ROFL JFYI #blackcarpet Nice code LOL TNX you Rose! We appreciate it! OMG, FML! Wait me, i will be AFK
  • 11. Emoticons and Abbreviations. Alternative approach: abbreviations, acronyms, and slang words are already parsed as tokens; parse emoticons as tokens in FX as well; "Tune sentiments" can then be used on these tokens.
  • 12. Spell correction. Perform spell correction on the data before sentiment calculation: "I lov this hotel!" (NEUTRAL) → "I love this hotel!" (POSITIVE); "They have terryble servic" (NEUTRAL) → "They have terrible service" (NEGATIVE).
  • 13. Spell correction. Examples: i hope @ladygaga will take some rest now becauce of... But its still also hilarioouss Shoukd i wast my money? Business eviroment It's impossibru! I like dansing! <3 You can dowload the data from http://to.download/file Coleguaues, lets keep it clean.
  • 14. Spell correction. Edit distance. Edit types: deletion (beauetiful → beautiful); insertion (speling → spelling); substitution (performanse → performance); swapping of adjacent letters (yaer → year). Examples: unsucesful → unsuccesful → unsuccessful (2 edits); wardoub → wardroub → wardrobu → wardrobe (3 edits).
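The four edit types correspond to the "optimal string alignment" variant of edit distance (Levenshtein plus adjacent swaps). The Python sketch below computes it with a standard dynamic-programming table; it is an illustration of the metric, not part of the original corrector.

```python
def edit_distance(a, b):
    """Edits (delete, insert, substitute, adjacent swap) needed to turn a into b."""
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete from a
                          d[i][j - 1] + 1,         # insert into a
                          d[i - 1][j - 1] + cost)  # substitute or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap adjacent letters
    return d[len(a)][len(b)]

print(edit_distance("unsucesful", "unsuccessful"))
print(edit_distance("wardoub", "wardrobe"))
```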
  • 15. Spell correction. Algorithm: Peter Norvig's spelling corrector. Bayes rule approach; uses train data; simple implementation; high performance; low accuracy. More theory: http://norvig.com/spell-correct.html Train data: http://norvig.com/big.txt
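A compact sketch in the spirit of Norvig's corrector: count word frequencies in training text, generate all strings one edit away, and pick the most frequent known candidate, falling back to two edits. It is written in Python for illustration; the tiny `COUNTS` corpus is a made-up stand-in for big.txt, and details may differ from both Norvig's article and the authors' implementation.

```python
import re
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def train(text):
    """Word-frequency model: how often each lowercase word occurs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def edits1(word):
    """All strings one deletion, swap, substitution, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)

def correct(word, counts):
    """Prefer the word itself, then known words at 1 edit, then at 2 edits."""
    if word in counts:
        return word
    known1 = [w for w in edits1(word) if w in counts]
    if known1:
        return max(known1, key=counts.get)
    known2 = [w2 for w1 in edits1(word) for w2 in edits1(w1) if w2 in counts]
    if known2:
        return max(known2, key=counts.get)
    return word  # nothing known nearby: leave unchanged

# Hypothetical miniature training text standing in for big.txt.
COUNTS = train("i love this hotel they have terrible service the service was terrible")
print(correct("lov", COUNTS), correct("terryble", COUNTS), correct("servic", COUNTS))
```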
  • 16. Spell correction. Coverage: candidates at edit distance 1 plus edit distance 2 together cover 98%.
  • 17. Spell correction. Accuracy. 1 edit: 61.8% (test data 1), 67.2% (test data 2). 2 edits: 71.2% (test data 1), 74.1% (test data 2). Test data 1: Wikipedia, common misspelled words (~4k) http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines Test data 2: Birkbeck spelling error corpus (270) http://www.ota.ox.ac.uk/headers/0643.xml
  • 18. Spell correction. Performance. Timings for input sizes of 10K / 100K / 1M: 1 edit 11350.52 msec (~11 sec.) / 117261.12 msec (~2 min.) / 1252882.23 msec (~20 min.); 2 edits 4300631.29 msec (~70 min.) at 10K, and the larger tests make no sense because of the quadratic complexity. Spell-check complexity per word: edit distance 1: O(C·n); edit distance 2: O(C²·n²), where n is the length of the word and C ~= 50.
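The O(C·n) figure can be made concrete by counting the candidates generated for one word. Assuming a 26-letter alphabet (an assumption; the slide only gives C ~= 50), a word of length n yields n deletions, n-1 adjacent swaps, 26n substitutions, and 26(n+1) insertions before de-duplication, i.e. 54n + 25. The Python below is a back-of-the-envelope sketch with hypothetical function names.

```python
def edit1_candidates(n, alphabet_size=26):
    """Candidate strings one edit away from a word of length n (with duplicates)."""
    deletions = n
    swaps = n - 1
    substitutions = alphabet_size * n
    insertions = alphabet_size * (n + 1)
    return deletions + swaps + substitutions + insertions

def edit2_estimate(n, alphabet_size=26):
    """Rough size of the two-edit candidate set: one-edit count squared."""
    return edit1_candidates(n, alphabet_size) ** 2

for n in (4, 8, 12):
    print(n, edit1_candidates(n), edit2_estimate(n))
```

For an 8-letter word this gives 457 one-edit candidates and roughly 209 thousand two-edit candidates, which is consistent with the jump from seconds to more than an hour in the timings.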
  • 19. Spell correction. Improvements. Performance: memoize corrections (best case O(1)); let the user choose whether to perform spell correction; improve train data. Coverage & accuracy: use more edit candidates; use common misspelling rules; use weights for edit operations; take part of speech into account; take context into account; improve train data.
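Memoizing corrections means each distinct misspelling is corrected once and then served from a cache. A minimal Python sketch using the standard library's `functools.lru_cache` follows; `slow_correct` is a hypothetical stand-in for the real corrector, and the call counter exists only to show the cache working.

```python
from functools import lru_cache

CALLS = {"slow_correct": 0}  # instrumentation for the demo only

def slow_correct(word):
    """Hypothetical stand-in for an expensive spelling corrector."""
    CALLS["slow_correct"] += 1
    return {"lov": "love", "servic": "service"}.get(word, word)

@lru_cache(maxsize=None)
def cached_correct(word):
    """Memoized wrapper: repeated words cost a single cache lookup."""
    return slow_correct(word)

words = ["lov", "servic", "lov", "lov", "hotel", "servic"]
print([cached_correct(w) for w in words], CALLS["slow_correct"])
```

Since the same typos tend to recur in social media text, the cache hit rate can be expected to be high, although the slides give no measurement of this.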
  • 20. Tags detection. Process source-specific information (Twitter) differently: hashtags (#music) go through a word splitter; usernames (@LadyGaga) are simply ignored. Effect on sentiment: "I say to @love hello!" (POSITIVE) → "I say to - hello!" (NEUTRAL); "I mean that i #hatetwitter" (NEUTRAL) → "I mean that i hate twitter" (NEGATIVE).
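Detection of both tag kinds can be sketched with two regexps: usernames are replaced by a neutral placeholder (the slide's example uses "-"), and hashtag bodies are handed to a splitter. In this Python illustration the `split` callback is a hypothetical hook for the word splitter and defaults to returning the tag unchanged.

```python
import re

_USERNAME = re.compile(r"@\w+")
_HASHTAG = re.compile(r"#(\w+)")

def process_tags(text, split=lambda tag: tag):
    """Neutralize @usernames and expand #hashtags with the given splitter."""
    text = _USERNAME.sub("-", text)
    return _HASHTAG.sub(lambda m: split(m.group(1)), text)

# Hypothetical toy splitter, for demonstration only.
toy_split = {"hatetwitter": "hate twitter"}.get
print(process_tags("I say to @love hello!"))
print(process_tags("I mean that i #hatetwitter", split=lambda t: toy_split(t, t)))
```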
  • 21. Tags detection. Examples @INevaTrustEm ok :) we need to make a date for this Watching @danieltosh #toofunny #lovetolaugh #sick Avatar, #wasteofmoney #soft #thissucks #happytweet RT @BriannaStull13: what do you mean?
  • 22. Tags detection. Words splitting. Dynamic programming; a statistical approach is needed because of ambiguity: #orcore → [orc_ore], [or_core]; #expertsexchange → [expert_sex_change], [experts_exchange]. Train data. Dictionary (default Linux word list, ~100K words).
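A dynamic-programming splitter can be sketched as follows: for every prefix of the hashtag, keep the best-scoring split into dictionary words, where the score is the sum of log unigram probabilities. This Python version is an assumption-laden illustration; the `COUNTS` table is invented toy data standing in for the train data and dictionary, and the scoring may differ from what the authors used.

```python
import math

# Invented toy frequencies; a real system would derive them from train data.
COUNTS = {"experts": 5, "exchange": 5, "expert": 3, "sex": 1, "change": 4,
          "or": 8, "core": 2, "orc": 1, "ore": 1, "hate": 4, "twitter": 6}
TOTAL = sum(COUNTS.values())

def split_hashtag(tag, counts=COUNTS, total=TOTAL):
    """Most probable split of tag into known words, or [tag] if none exists."""
    # best[i] = (log-probability, words) for the best split of tag[:i].
    best = [(0.0, [])] + [(float("-inf"), None)] * len(tag)
    for end in range(1, len(tag) + 1):
        for start in range(end):
            prev_score, prev_words = best[start]
            word = tag[start:end]
            if prev_words is None or word not in counts:
                continue
            score = prev_score + math.log(counts[word] / total)
            if score > best[end][0]:
                best[end] = (score, prev_words + [word])
    words = best[len(tag)][1]
    return words if words is not None else [tag]

print(split_hashtag("expertsexchange"))
print(split_hashtag("orcore"))
```

With these toy counts the frequent words win, so the ambiguous tags resolve to `experts exchange` and `or core`; different train data could flip either result, which is the ambiguity problem the slide describes.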
  • 23. Tags detection. Twitter hashtags. About 800 Twitter hashtags were crawled from: http://hashtags.org/ http://kingnetforums.weebly.com/twitter-hashtags-lists.html http://edudemic.com/2011/10/twitter-hashtag-dictionary/ http://nicolehumphrey.net/60-favorite-twitter-hashtags-for-writers-clickable-list/ http://www.dailywritingtips.com/40-twitter-hashtags-for-writers/ http://greeneconomypost.com/green-twitter-hashtag-17290.htm
  • 24. Tags detection. Performance. Results for input sizes of 100 / 400 / 800: time 4019.73 msec (~4 sec.) / 6429.19 msec (~6 sec.) / 7897.23 msec (~8 sec.); accuracy 83.00% / 86.25% / 84.88%. Main problems: the train set seldom resolves the ambiguity problem; dictionary hits filter out many correct candidates. Example of a wrong split: #rapnotamusic → [ra_p_not_a_music].
  • 25. Words splitting. Improvements. Performance: memoize splitting; prefix tree approach; Viterbi algorithm (http://en.wikipedia.org/wiki/Viterbi_algorithm); improve train data. Accuracy: use famous names, geographic locations, slang, abbreviations, acronyms, ...; big dictionary; improve train data (Twitter-specific).
  • 26. Preprocessing performance. Input conditions: data: 2.4K (incorrect) of 15.8K (total) from the Omniture15K.xls file (15%); emoticons: 14 most common; abbreviations: 8 most common; spell-correction distance: 1; train data: big.txt; dictionary: linux-words.txt. Results: sentence count: 2412; preprocessing time: 29214.88 msec (~29 sec.); number of corrected sentences: 368; corrected as a percentage of incorrect data: 15.28%; corrected as a percentage of total data: 2.33%.
  • 27. Data preprocessing. Future: sentence breaker.
  • 28. Environment. Hardware: CPU: 2 x Intel Pentium Dual T2370 @ 1.73GHz; RAM: 2.0 GB. Software: OS: Ubuntu 11.04; kernel: Linux 2.6.38-13-generic; IDE: Emacs 23.2.1; programming: Clojure 1.3.