SlideShare une entreprise Scribd logo
1  sur  33
Télécharger pour lire hors ligne
#like or #fail
How Can Computers Tell the Difference?
Mark Cieliebak
27.3.2014
Institute of Applied Information Systems (InIT)
ZHAW
Zurich University of Applied Sciences
- 10'000 students, 230 professors (FTE)
InIT
Institute of Applied Information Systems
- more than 30 professors
- Labs on Cloud Computing, Data Science etc.
2
[11]
27.3.2014 Mark Cieliebak
About Me
27.3.2014 Mark Cieliebak 3
Mark Cieliebak
Institute of Applied Information Technology (InIT)
ZHAW, Winterthur
Email: ciel@zhaw.ch, Website: www.zhaw.ch/~ciel
Text
Analytics
Open
Data
Automated
Test
Generation
Research
Interests
Social
Network
Analysis
What is "Text Analytics"?
Synonyms: Text Mining, Natural Language Processing (NLP)
4Picture Sources: see slide "References"
[6]
[5]
[3]
[2]
[1]
[4]
27.3.2014 Mark Cieliebak
Sample Application: Social Media
Monitoring
Text Analytics Components:
• Find relevant documents
• Hot topic Analysis
• sentiment analysis
27.3.2014 Mark Cieliebak 5
[7]
Sentiment Analysis for Product Reviews
"… WiFi Analytics is a free Android app that I find very
handy when it comes to troubleshooting and monitoring
a home network. " [8]
27.3.2014 Mark Cieliebak 6
Flavours of Sentiment Analysis
• Document Based
• Sentence Based
• Target-Specific
• Rating Prediction
27.3.2014 Mark Cieliebak 7
Simple Sentiment Analysis
Idea: Count number of positive and negative words
"This analysis is good[+1]." +1 (pos)
"I find it beautiful[+1] and good[+1]." +2 (pos)
"It looks terrible[-1]." -1 (neg)
"This car has a blue color." 0 (neu)
POSITIVE:
good
love
nice
...
NEUTRAL:
hello
see
I
…
NEGATIVE:
bad
hate
ugly
...
Use Sentiment-Dictionary:
27.3.2014 Mark Cieliebak 8
Improvements
Sample Text
• "This analysis is good."
"This analysis is excellent."
• "The car is really very
expensive."
• "This car has an appealing
design and comfortable
seats, but it is expensive."
Solution
• Fine-Grained Dictionaries:
"This analysis is good [+2]. "
"This analysis is excellent [+3].
• Detect Booster Words:
"The car is really very
expensive[-1 -1 -2] ."
• New Category "Mixed":
"This car has an appealing[+1]
design and comfortable[+1]
seats, but it is expensive[-1]. "
 mix(+2/-1)
27.3.2014 Mark Cieliebak 9
Improvements 2
Sample Text
• This analysis is not good.
• The car is appealing and I do
not find it expensive.
• I do not find the car expensive
and it is appealing.
Solution
• Invert all Scores:
"This analysis is not[*-1]
good[+3]."
• Invert only score of words
occuring after the negation:
"The car is appealing[+3] and I
do not[*-1] find it expensive[-2]"
 Need to “understand” the
sentence
 Need linguistic analysis!
27.3.2014 Mark Cieliebak 10
Linguistic Analysis
-> RULE: Invert scores of words being in the same phrases as negation.
“I do not find the car expensive[+2]
and it is appealing[+3].” → +5 (pos)
Sentence
Sentence Conj. Sentence
Noun
Phrase
Verb Phrase
Verb Adverb Verb Noun
Phrase
Adj. Noun
Phrase
Verb Phrase
Det. Det Noun Det. Verb Participle
I do not find the car expensive and it is appealing
27.3.2014 Mark Cieliebak 11
Technology Stack
for Sentiment Analysis on Web Documents using Linguistic Analysis
Linguistic Framework
• Gate: text analysis framework with experimentation GUI; includes sentence splitter, tokenizer, chunker, pos-tagger, etc; plugin-
mechanism for new libraries; very widely used. http://gate.ac.uk/
• OpenNLP: Java-based, easy to use; includes tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, etc;
by Apache Foundation. http://opennlp.apache.org/
Web Boilerplate Extraction
• Boilerpipe: removes ads, menus, HTML-tags etc. http://boilerpipe-web.appspot.com/
Language Analysis
• Nutch language identifier plugin: 14 languages preimplemented, „easy“ to add more. http://wiki.apache.org/nutch/LanguageIdentifierPlugin
Sentence Splitting
• OpenNLP: very easy to use, in Java. http://opennlp.apache.org/
• Punkt. In Python http://nltk.org/api/nltk.tokenize.html
Part-of-Speech Tagger
• TreeTagger: language-independent, several languages implemented;. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
• TweetNLP: POS-tagger just for Twitter. http://www.ark.cs.cmu.edu/TweetNLP/
Chunker
• TreeTagger: supports English, German, French. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
Dictionaries
• WordNet: large lexical database of English. http://wordnet.princeton.edu/
• SentiWordNet: English dictionalry with sentiment annotations. http://sentiwordnet.isti.cnr.it/
27.3.2014 Mark Cieliebak 12
Main Approaches to Sentiment Analysis
Rule-Based Corpus-Based
27.3.2014 Mark Cieliebak 13
Predicted
Label
[9]
[10]
Corpus-Based Sentiment Analysis
27.3.2014 Mark Cieliebak 14
Predicted
Label
[10]
Corpus-Based Sentiment Analysis
Annotated Corpus
Sentence Polarity
This analysis is good. Pos
It looks awful. Neg
This car has a blue color. Neu
This car has an appealing design, comfortable seats, but it is expensive. Mix
This car has a very appealing design, comfortable seats, but it is really expensive. Mix
This analysis is not good. Neg
This car has an appealing design, comfortable seats and it is not expensive. Mix
This movie was like a horror event. Neg
This car is appealing and is not expensive. Mix
... ...
27.3.2014 Mark Cieliebak 15
Sample Features for Tweets
• Word ngrams: presence or absence of contiguous sequences of 1, 2, 3,
and 4 tokens; noncontiguous ngrams
• POS: the number of occurrences of each part-of-speech tag
• Sentiment Lexica: each word annotated with tonality score (-1..0..+1)
• Negation: the number of negated contexts
• Punctuation: the number of contiguous sequences of exclamation marks,
question marks, and both exclamation and question marks
• Emoticons: presence or absence, last token is a positive or negative
emoticon;
• Hashtags: the number of hashtags;
• Elongated words: the number of words with one character repeated (e.g.
‘soooo’)
from: Mohammad et al., SemEval 2013
27.3.2014 Mark Cieliebak 16
How good are Sentiment Analysis Tools?
27.3.2014 Mark Cieliebak 17
Evaluation
joint work with Oliver Dürr and Fatih Uzdilli [15]
• 7 Public Text Corpora
– Single Statements
– Different Media Types
• Tweet, News, Review,
Speech Transcript
– Total: 28653 Texts
• 9 Commercial APIs
– Stand-alone
– Free for this evaluation
– Arbitrary Text
NEGATIVEPOSITIVE
OTHER
( neutral / mixed )
1827.3.2014 Mark Cieliebak
Sentiment Analysis Tools
AlchemyAPI www.alchemyapi.com
Lymbix www.lymbix.com
ML Analyzer www.mashape.com/mlanalyzer/ml-analyzer
Repustate www.repustate.com
Semantria www.semantria.com
Sentigem www.sentigem.com
Skyttle www.skyttle.com
Textalytics core.textalytics.com
Text-processing www.text-processing.com
27.3.2014 Mark Cieliebak 19
Tool Accuracy
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Accuracy
Best Tool per Corpus
Worst Tool per Corpus
20
61%
40%
Avg.
27.3.2014 Mark Cieliebak
Tool Accuracy
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Accuracy
Best Tool per Corpus
Worst Tool per Corpus
Overall Best Tool
21
61%
40%
59%
Avg.
27.3.2014 Mark Cieliebak
Summary
"They all suck…and we suck, too."
CEO of a Sentiment
Analysis Company
27.3.2014 Mark Cieliebak 22
State of the Art in Academia
SemEval 2013: Competition on Semantic Analysis
• Task 2A: Contextual Polarity Disambiguation
• Task 2B: Message Polarity Classification
Corpus: 15'000 tweets from 2012 on various topics
annotated via Mechanical Turk
27.3.2014 Mark Cieliebak 23
SemEval 2013 Task 2B
Results
• 36 Submissions
• Average F1 = 54%
• Only 9 teams above 50%, all others below!
• Best Team: F1 = 69%
• For comparison: best commercial tool has F1 = 62%
27.3.2014 Mark Cieliebak 24
Can a Meta-Classifier do better?
1st Approach: Majority Classifier
 Sentiment with most votes chosen
Illustration:
2527.3.2014 Mark Cieliebak
api1 api2 api3 api4 api5 api6 api7 Majority
Text 1 + + - o - + o +
Text 2 - + + - - - - -
Text 3 - o + + + + - +
Text n o o + o - o o o
Tool Accuracy
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Accuracy
Best Tool per Corpus
Worst Tool per Corpus
2627.3.2014 Mark Cieliebak
Majority Classifier Close to Best Tool
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Accuracy
Best Tool per Corpus
Worst Tool per Corpus
Majority Classifier
2727.3.2014 Mark Cieliebak
2nd Approach: Random-Forest
2827.3.2014 Mark Cieliebak
api1 api2 api3 api9 annotation
Text 1 + - + o +
Text 2 - + o + -
Text 3 - o - + -
Text 4 + o + - +
Text 5 + o + o o
Text 6 + o o - o
Text 7 + - + o unknown
Text 8 + + o - unknown
Text 9 o - + o unknown
Random
Forest
Classifier
+
+
o
Train
Train
Train
Train
Predict
Predict
Predict
Train
Train
Before Random Forest
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Accuracy
Best Tool per Corpus
Worst Tool per Corpus
Majority Classifier
2927.3.2014 Mark Cieliebak
Random Forest beats Best Single Tool
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Accuracy
Best Tool per Corpus
Worst Tool per Corpus
Majority Classifier
Random Forest Classifier
3027.3.2014 Mark Cieliebak
Challenges in Sentiment Analysis
• "Salaries for software engineers are extremely high."
 Context Dependencies
• “What a great car, it stopped working the second day.”
 Sarcastic Statements
• "#YouCantDateMe if u still sag ur pants super hard...dat shit is
played the fuck out!!! "
 Informal Language
• Huge Effort for New Languages
27.3.2014 Mark Cieliebak
Talk in Short!
1. Sentiment analysis is an
important challenge
2. Commercial tools classify 6
out of 10 docs wrong
3. Academic systems are
not really better
4. Random forest outperform
even the best tools
27.3.2014 Mark Cieliebak 32
[14]
[12]
[13]
References
[1] http://www.teleportmyjob.com/blog/wp-content/uploads/2013/05/find_job.jpg
[2] http://screenshots.de.sftcdn.net/de/scrn/3340000/3340669/swiftkey-23-630x535.png
[3] https://lh4.ggpht.com/Fs4HLaMCgmY6O7BWGKZB8LdOM8GmiRAbKUcdJysNp0k74xmPMTSGVHigwvaLy_Tw=w300
[4] http://researcher.watson.ibm.com/researcher/files/us-bansal/watson2.jpg
[5] http://www.betadaily.com/wp-content/uploads/2011/02/BusinessPlanExecutiveSummary.jpg
[6] http://everycook.org/cms/en/hardware-en/pictures-en#
[7] http://cdn.fulltraffic.net/images/blog/social-media-monitoring-process.png
[8] http://www.pcmag.com/article2/0,2817,2426416,00.asp
[9] http://i1059.photobucket.com/albums/t424/shmi11y/UoS%20LingSite/tompushedthecar.png
[10] http://bigsnarf.files.wordpress.com/2013/04/supervised.png
[11] http://www.alumni.sml.zhaw.ch/portals/0/Volkartgeb%C3%A4ude%20gek%C3%BCrzt%202.jpg
[12] http://www.bitspeed.com/wp-content/uploads/2011/09/Sign-multi-cast.jpg
[13] http://wwwdelivery.superstock.com/WI/223/1829/PreviewComp/SuperStock_1829-47184.jpg
[14] http://www.odditycentral.com/art/arborsculpture-the-art-of-turning-young-trees-into-living-works-of-art.html
[15] Cieliebak, Dürr, Uzdilli, ESSEM 2013
27.3.2014 Mark Cieliebak 33

Contenu connexe

Similaire à #like or #fail - How Can Computers Tell the Difference?

Case Study: We're Watching You: How and Why Researchers Study Open Source And...
Case Study: We're Watching You: How and Why Researchers Study Open Source And...Case Study: We're Watching You: How and Why Researchers Study Open Source And...
Case Study: We're Watching You: How and Why Researchers Study Open Source And...All Things Open
 
Ordering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataOrdering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataAndy Stretton
 
SentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and TweetsSentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and Tweets🧑‍💻 Manuel Coppotelli
 
AppSec Pipelines and Event based Security
AppSec Pipelines and Event based SecurityAppSec Pipelines and Event based Security
AppSec Pipelines and Event based SecurityMatt Tesauro
 
Topic Modeling for Learning Analytics Researchers LAK15 Tutorial
Topic Modeling for Learning Analytics Researchers LAK15 TutorialTopic Modeling for Learning Analytics Researchers LAK15 Tutorial
Topic Modeling for Learning Analytics Researchers LAK15 TutorialVitomir Kovanovic
 
OSMC 2022 | Open Source: Open Choice – A DevOps Guide for OSS Adoption by Hil...
OSMC 2022 | Open Source: Open Choice – A DevOps Guide for OSS Adoption by Hil...OSMC 2022 | Open Source: Open Choice – A DevOps Guide for OSS Adoption by Hil...
OSMC 2022 | Open Source: Open Choice – A DevOps Guide for OSS Adoption by Hil...NETWAYS
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Dev Sahu
 
Visual Data Collection - Mike Morgan - REcon 18
Visual Data Collection - Mike Morgan - REcon 18Visual Data Collection - Mike Morgan - REcon 18
Visual Data Collection - Mike Morgan - REcon 18UX INXS
 
Building Powerful and Intelligent Applications with Azure Machine Learning
Building Powerful and Intelligent Applications with Azure Machine LearningBuilding Powerful and Intelligent Applications with Azure Machine Learning
Building Powerful and Intelligent Applications with Azure Machine LearningDavid Walker, CSM,CSD,MCP,MCAD,MCSD,MVP
 
Sentiment Analysis and Classification of Tweets using Data Mining
Sentiment Analysis and Classification of Tweets using Data MiningSentiment Analysis and Classification of Tweets using Data Mining
Sentiment Analysis and Classification of Tweets using Data MiningIRJET Journal
 
Twitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised ApproachTwitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised ApproachIRJET Journal
 
Opinion Driven Decision Support System
Opinion Driven Decision Support SystemOpinion Driven Decision Support System
Opinion Driven Decision Support SystemKavita Ganesan
 
Linked Open Data about Springer Nature conferences. The story so far
Linked Open Data about Springer Nature conferences. The story so farLinked Open Data about Springer Nature conferences. The story so far
Linked Open Data about Springer Nature conferences. The story so farAliaksandr Birukou
 
Building Powerful and Intelligent Applications with Azure Machine Learning
Building Powerful and Intelligent Applications with Azure Machine LearningBuilding Powerful and Intelligent Applications with Azure Machine Learning
Building Powerful and Intelligent Applications with Azure Machine LearningDavid Walker, CSM,CSD,MCP,MCAD,MCSD,MVP
 
ds 1.pptx
ds 1.pptxds 1.pptx
ds 1.pptxvaru9
 
DeepSearch_Project_Report
DeepSearch_Project_ReportDeepSearch_Project_Report
DeepSearch_Project_ReportUrjit Patel
 
Empirical Model of Supervised Learning Approach for Opinion Mining
Empirical Model of Supervised Learning Approach for Opinion MiningEmpirical Model of Supervised Learning Approach for Opinion Mining
Empirical Model of Supervised Learning Approach for Opinion MiningIRJET Journal
 

Similaire à #like or #fail - How Can Computers Tell the Difference? (20)

Raising the Bar
Raising the BarRaising the Bar
Raising the Bar
 
Case Study: We're Watching You: How and Why Researchers Study Open Source And...
Case Study: We're Watching You: How and Why Researchers Study Open Source And...Case Study: We're Watching You: How and Why Researchers Study Open Source And...
Case Study: We're Watching You: How and Why Researchers Study Open Source And...
 
Ordering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataOrdering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect data
 
SentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and TweetsSentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and Tweets
 
AppSec Pipelines and Event based Security
AppSec Pipelines and Event based SecurityAppSec Pipelines and Event based Security
AppSec Pipelines and Event based Security
 
Topic Modeling for Learning Analytics Researchers LAK15 Tutorial
Topic Modeling for Learning Analytics Researchers LAK15 TutorialTopic Modeling for Learning Analytics Researchers LAK15 Tutorial
Topic Modeling for Learning Analytics Researchers LAK15 Tutorial
 
OSMC 2022 | Open Source: Open Choice – A DevOps Guide for OSS Adoption by Hil...
OSMC 2022 | Open Source: Open Choice – A DevOps Guide for OSS Adoption by Hil...OSMC 2022 | Open Source: Open Choice – A DevOps Guide for OSS Adoption by Hil...
OSMC 2022 | Open Source: Open Choice – A DevOps Guide for OSS Adoption by Hil...
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier
 
Visual Data Collection - Mike Morgan - REcon 18
Visual Data Collection - Mike Morgan - REcon 18Visual Data Collection - Mike Morgan - REcon 18
Visual Data Collection - Mike Morgan - REcon 18
 
Building Powerful and Intelligent Applications with Azure Machine Learning
Building Powerful and Intelligent Applications with Azure Machine LearningBuilding Powerful and Intelligent Applications with Azure Machine Learning
Building Powerful and Intelligent Applications with Azure Machine Learning
 
Sentiment Analysis and Classification of Tweets using Data Mining
Sentiment Analysis and Classification of Tweets using Data MiningSentiment Analysis and Classification of Tweets using Data Mining
Sentiment Analysis and Classification of Tweets using Data Mining
 
Twitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised ApproachTwitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised Approach
 
Opinion Driven Decision Support System
Opinion Driven Decision Support SystemOpinion Driven Decision Support System
Opinion Driven Decision Support System
 
Linked Open Data about Springer Nature conferences. The story so far
Linked Open Data about Springer Nature conferences. The story so farLinked Open Data about Springer Nature conferences. The story so far
Linked Open Data about Springer Nature conferences. The story so far
 
Msr2021 tutorial-di penta
Msr2021 tutorial-di pentaMsr2021 tutorial-di penta
Msr2021 tutorial-di penta
 
INTERNSHIP PPT.pptx
INTERNSHIP PPT.pptxINTERNSHIP PPT.pptx
INTERNSHIP PPT.pptx
 
Building Powerful and Intelligent Applications with Azure Machine Learning
Building Powerful and Intelligent Applications with Azure Machine LearningBuilding Powerful and Intelligent Applications with Azure Machine Learning
Building Powerful and Intelligent Applications with Azure Machine Learning
 
ds 1.pptx
ds 1.pptxds 1.pptx
ds 1.pptx
 
DeepSearch_Project_Report
DeepSearch_Project_ReportDeepSearch_Project_Report
DeepSearch_Project_Report
 
Empirical Model of Supervised Learning Approach for Opinion Mining
Empirical Model of Supervised Learning Approach for Opinion MiningEmpirical Model of Supervised Learning Approach for Opinion Mining
Empirical Model of Supervised Learning Approach for Opinion Mining
 

Dernier

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Dernier (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

#like or #fail - How Can Computers Tell the Difference?

  • 1. #like or #fail How Can Computers Tell the Difference? Mark Cieliebak 27.3.2014
  • 2. Institute of Applied Information Systems (InIT) ZHAW Zurich University of Applied Sciences - 10'000 students, 230 professors (FTE) InIT Institute of Applied Information Systems - more than 30 professors - Labs on Cloud Computing, Data Science etc. 2 [11] 27.3.2014 Mark Cieliebak
  • 3. About Me 27.3.2014 Mark Cieliebak 3 Mark Cieliebak Institute of Applied Information Technology (InIT) ZHAW, Winterthur Email: ciel@zhaw.ch, Website: www.zhaw.ch/~ciel Text Analytics Open Data Automated Test Generation Research Interests Social Network Analysis
  • 4. What is "Text Analytics"? Synonyms: Text Mining, Natural Language Processing (NLP) 4Picture Sources: see slide "References" [6] [5] [3] [2] [1] [4] 27.3.2014 Mark Cieliebak
  • 5. Sample Application: Social Media Monitoring Text Analytics Components: • Find relevant documents • Hot topic Analysis • sentiment analysis 27.3.2014 Mark Cieliebak 5 [7]
  • 6. Sentiment Analysis for Product Reviews "… WiFi Analytics is a free Android app that I find very handy when it comes to troubleshooting and monitoring a home network. " [8] 27.3.2014 Mark Cieliebak 6
  • 7. Flavours of Sentiment Analysis • Document Based • Sentence Based • Target-Specific • Rating Prediction 27.3.2014 Mark Cieliebak 7
  • 8. Simple Sentiment Analysis Idea: Count number of positive and negative words "This analysis is good[+1]." +1 (pos) "I find it beautiful[+1] and good[+1]." +2 (pos) "It looks terrible[-1]." -1 (neg) "This car has a blue color." 0 (neu) POSITIVE: good love nice ... NEUTRAL: hello see I … NEGATIVE: bad hate ugly ... Use Sentiment-Dictionary: 27.3.2014 Mark Cieliebak 8
  • 9. Improvements Sample Text • "This analysis is good." "This analysis is excellent." • "The car is really very expensive." • "This car has an appealing design and comfortable seats, but it is expensive." Solution • Fine-Grained Dictionaries: "This analysis is good [+2]. " "This analysis is excellent [+3]. • Detect Booster Words: "The car is really very expensive[-1 -1 -2] ." • New Category "Mixed": "This car has an appealing[+1] design and comfortable[+1] seats, but it is expensive[-1]. "  mix(+2/-1) 27.3.2014 Mark Cieliebak 9
  • 10. Improvements 2 Sample Text • This analysis is not good. • The car is appealing and I do not find it expensive. • I do not find the car expensive and it is appealing. Solution • Invert all Scores: "This analysis is not[*-1] good[+3]." • Invert only score of words occuring after the negation: "The car is appealing[+3] and I do not[*-1] find it expensive[-2]"  Need to “understand” the sentence  Need linguistic analysis! 27.3.2014 Mark Cieliebak 10
  • 11. Linguistic Analysis -> RULE: Invert scores of words being in the same phrases as negation. “I do not find the car expensive[+2] and it is appealing[+3].” → +5 (pos) Sentence Sentence Conj. Sentence Noun Phrase Verb Phrase Verb Adverb Verb Noun Phrase Adj. Noun Phrase Verb Phrase Det. Det Noun Det. Verb Participle I do not find the car expensive and it is appealing 27.3.2014 Mark Cieliebak 11
  • 12. Technology Stack for Sentiment Analysis on Web Documents using Linguistic Analysis Linguistic Framework • Gate: text analysis framework with experimentation GUI; includes sentence splitter, tokenizer, chunker, pos-tagger, etc; plugin- mechanism for new libraries; very widely used. http://gate.ac.uk/ • OpenNLP: Java-based, easy to use; includes tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, etc; by Apache Foundation. http://opennlp.apache.org/ Web Boilerplate Extraction • Boilerpipe: removes ads, menus, HTML-tags etc. http://boilerpipe-web.appspot.com/ Language Analysis • Nutch language identifier plugin: 14 languages preimplemented, „easy“ to add more. http://wiki.apache.org/nutch/LanguageIdentifierPlugin Sentence Splitting • OpenNLP: very easy to use, in Java. http://opennlp.apache.org/ • Punkt. In Python http://nltk.org/api/nltk.tokenize.html Part-of-Speech Tagger • TreeTagger: language-independent, several languages implemented;. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ • TweetNLP: POS-tagger just for Twitter. http://www.ark.cs.cmu.edu/TweetNLP/ Chunker • TreeTagger: supports English, German, French. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ Dictionaries • WordNet: large lexical database of English. http://wordnet.princeton.edu/ • SentiWordNet: English dictionalry with sentiment annotations. http://sentiwordnet.isti.cnr.it/ 27.3.2014 Mark Cieliebak 12
  • 13. Main Approaches to Sentiment Analysis Rule-Based Corpus-Based 27.3.2014 Mark Cieliebak 13 Predicted Label [9] [10]
  • 14. Corpus-Based Sentiment Analysis 27.3.2014 Mark Cieliebak 14 Predicted Label [10]
  • 15. Corpus-Based Sentiment Analysis Annotated Corpus Sentence Polarity This analysis is good. Pos It looks awful. Neg This car has a blue color. Neu This car has an appealing design, comfortable seats, but it is expensive. Mix This car has a very appealing design, comfortable seats, but it is really expensive. Mix This analysis is not good. Neg This car has an appealing design, comfortable seats and it is not expensive. Mix This movie was like a horror event. Neg This car is appealing and is not expensive. Mix ... ... 27.3.2014 Mark Cieliebak 15
  • 16. Sample Features for Tweets • Word ngrams: presence or absence of contiguous sequences of 1, 2, 3, and 4 tokens; noncontiguous ngrams • POS: the number of occurrences of each part-of-speech tag • Sentiment Lexica: each word annotated with tonality score (-1..0..+1) • Negation: the number of negated contexts • Punctuation: the number of contiguous sequences of exclamation marks, question marks, and both exclamation and question marks • Emoticons: presence or absence, last token is a positive or negative emoticon; • Hashtags: the number of hashtags; • Elongated words: the number of words with one character repeated (e.g. ‘soooo’) from: Mohammad et al., SemEval 2013 27.3.2014 Mark Cieliebak 16
  • 17. How good are Sentiment Analysis Tools? 27.3.2014 Mark Cieliebak 17
  • 18. Evaluation joint work with Oliver Dürr and Fatih Uzdilli [15] • 7 Public Text Corpora – Single Statements – Different Media Types • Tweet, News, Review, Speech Transcript – Total: 28653 Texts • 9 Commercial APIs – Stand-alone – Free for this evaluation – Arbitrary Text NEGATIVEPOSITIVE OTHER ( neutral / mixed ) 1827.3.2014 Mark Cieliebak
  • 19. Sentiment Analysis Tools AlchemyAPI www.alchemyapi.com Lymbix www.lymbix.com ML Analyzer www.mashape.com/mlanalyzer/ml-analyzer Repustate www.repustate.com Semantria www.semantria.com Sentigem www.sentigem.com Skyttle www.skyttle.com Textalytics core.textalytics.com Text-processing www.text-processing.com 27.3.2014 Mark Cieliebak 19
  • 20. Tool Accuracy 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Accuracy Best Tool per Corpus Worst Tool per Corpus 20 61% 40% Avg. 27.3.2014 Mark Cieliebak
  • 21. Tool Accuracy 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Accuracy Best Tool per Corpus Worst Tool per Corpus Overall Best Tool 21 61% 40% 59% Avg. 27.3.2014 Mark Cieliebak
  • 22. Summary "They all suck…and we suck, too." CEO of a Sentiment Analysis Company 27.3.2014 Mark Cieliebak 22
  • 23. State of the Art in Academia SemEval 2013: Competition on Semantic Analysis • Task 2A: Contextual Polarity Disambiguation • Task 2B: Message Polarity Classification Corpus: 15'000 tweets from 2012 on various topics annotated via Mechanical Turk 27.3.2014 Mark Cieliebak 23
  • 24. SemEval 2013 Task 2B Results • 36 Submissions • Average F1 = 54% • Only 9 teams above 50%, all others below! • Best Team: F1 = 69% • For comparison: best commercial tool has F1 = 62% 27.3.2014 Mark Cieliebak 24
  • 25. Can a Meta-Classifier do better? 1st Approach: Majority Classifier  Sentiment with most votes chosen Illustration: 2527.3.2014 Mark Cieliebak api1 api2 api3 api4 api5 api6 api7 Majority Text 1 + + - o - + o + Text 2 - + + - - - - - Text 3 - o + + + + - + Text n o o + o - o o o
  • 26. Tool Accuracy 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Accuracy Best Tool per Corpus Worst Tool per Corpus 2627.3.2014 Mark Cieliebak
  • 27. Majority Classifier Close to Best Tool 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Accuracy Best Tool per Corpus Worst Tool per Corpus Majority Classifier 2727.3.2014 Mark Cieliebak
  • 28. 2nd Approach: Random-Forest 2827.3.2014 Mark Cieliebak api1 api2 api3 api9 annotation Text 1 + - + o + Text 2 - + o + - Text 3 - o - + - Text 4 + o + - + Text 5 + o + o o Text 6 + o o - o Text 7 + - + o unknown Text 8 + + o - unknown Text 9 o - + o unknown Random Forest Classifier + + o Train Train Train Train Predict Predict Predict Train Train
  • 29. Before Random Forest 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Accuracy Best Tool per Corpus Worst Tool per Corpus Majority Classifier 2927.3.2014 Mark Cieliebak
  • 30. Random Forest beats Best Single Tool 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Accuracy Best Tool per Corpus Worst Tool per Corpus Majority Classifier Random Forest Classifier 3027.3.2014 Mark Cieliebak
  • 31. Challenges in Sentiment Analysis • "Salaries for software engineers are extremely high."  Context Dependencies • “What a great car, it stopped working the second day.”  Sarcastic Statements • "#YouCantDateMe if u still sag ur pants super hard...dat shit is played the fuck out!!! "  Informal Language • Huge Effort for New Languages 27.3.2014 Mark Cieliebak
  • 32. Talk in Short! 1. Sentiment analysis is an important challenge 2. Commercial tools classify 6 out of 10 docs wrong 3. Academic systems are not really better 4. Random forest outperform even the best tools 27.3.2014 Mark Cieliebak 32 [14] [12] [13]
  • 33. References [1] http://www.teleportmyjob.com/blog/wp-content/uploads/2013/05/find_job.jpg [2] http://screenshots.de.sftcdn.net/de/scrn/3340000/3340669/swiftkey-23-630x535.png [3] https://lh4.ggpht.com/Fs4HLaMCgmY6O7BWGKZB8LdOM8GmiRAbKUcdJysNp0k74xmPMTSGVHigwvaLy_Tw=w300 [4] http://researcher.watson.ibm.com/researcher/files/us-bansal/watson2.jpg [5] http://www.betadaily.com/wp-content/uploads/2011/02/BusinessPlanExecutiveSummary.jpg [6] http://everycook.org/cms/en/hardware-en/pictures-en# [7] http://cdn.fulltraffic.net/images/blog/social-media-monitoring-process.png [8] http://www.pcmag.com/article2/0,2817,2426416,00.asp [9] http://i1059.photobucket.com/albums/t424/shmi11y/UoS%20LingSite/tompushedthecar.png [10] http://bigsnarf.files.wordpress.com/2013/04/supervised.png [11] http://www.alumni.sml.zhaw.ch/portals/0/Volkartgeb%C3%A4ude%20gek%C3%BCrzt%202.jpg [12] http://www.bitspeed.com/wp-content/uploads/2011/09/Sign-multi-cast.jpg [13] http://wwwdelivery.superstock.com/WI/223/1829/PreviewComp/SuperStock_1829-47184.jpg [14] http://www.odditycentral.com/art/arborsculpture-the-art-of-turning-young-trees-into-living-works-of-art.html [15] Cieliebak, Dürr, Uzdilli, ESSEM 2013 27.3.2014 Mark Cieliebak 33