Language-Independent Twitter Sentiment Analysis
Sascha Narr, Michael Hülfenhaus, Sahin Albayrak


Sascha Narr
Competence Center Information Retrieval & Machine Learning


KDML 2012, LWA, Dortmund, Germany
Overview



►1. Sentiment analysis on social media
►2. Creation of a multilingual evaluation dataset of tweets
►3. A language-independent sentiment labeling heuristic for semi-supervised learning
►4. Experiments on the multilingual dataset




           18. September 2012   Language-Independent Twitter Sentiment Analysis   2
1. Sentiment Analysis on Social Media


►   Why Sentiment Analysis?
       ▪ Large numbers of people’s opinions and sentiments about products and events are invaluable:
       ▪ Market research, product feedback and more
       ▪ Sentiment analysis makes it possible to collect such data automatically

►   Why Twitter?
       ▪ 400 million tweets posted each day [1]
       ▪ Shorter text lengths encourage people to “just write” what they think
       ▪ Tweets are often informal and contain lots of opinions


                      [1]: http://news.cnet.com/8301-1023_3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/

1. Methods for Sentiment Classification

► Sentiment classification goals:
      ▪ Subjectivity: “Does the tweet contain an opinion?”
      ▪ Polarity: “Is the expressed opinion positive or negative?”
► Classifiers used:
      ▪ Naive Bayes, Maximum Entropy, Support Vector Machines
► Features used:
      ▪ n-grams, WordNet semantics, part-of-speech information

►   Tweet texts have unique properties:
       ▪ Informal, contain slang, emoticons, misspellings



1. Multilingual Sentiment Analysis

► Fewer than 40% of tweets are in English [1]
► Natural language processing methods are often designed specifically for one language

►   Increase coverage of sentiment analysis by using a
    language-independent approach:
       ▪ No extra effort for additional languages
       ▪ Is the approach really effective for all languages?



                                  [1] http://semiocast.com/publications/2011_11_24_Arabic_highest_growth_on_Twitter


2. Creation of a Multilingual Evaluation Dataset


►   We created a hand-annotated sentiment evaluation
    dataset of over 12,000 tweets
       ▪ 4 languages: English, German, French, Portuguese
► Used the Amazon Mechanical Turk platform for annotation
► Each tweet was annotated by 3 different workers:
       ▪ Labels: “positive”, “neutral”, “negative”
       ▪ Added validation tweets to help ensure the quality of the annotations




2. Our Multilingual Evaluation Dataset

►   Observed a low inter-annotator agreement in our dataset
       ▪ Sentiment classification is a hard task, even for humans
       ▪ Tweets that humans disagree on are also harder for a classifier
►   The dataset is publicly available for research purposes
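
The slides do not say which agreement measure was used. As one common option when every tweet has the same number of annotators, Fleiss' kappa can be computed as in the sketch below. The function name `fleiss_kappa` and the input layout (one list of labels per tweet) are assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item label lists.

    Every item must have the same number of raters (3 in this dataset).
    Undefined (division by zero) if every rater used one single label throughout.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    observed = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        observed += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    observed /= n_items
    # Agreement expected by chance, given the overall label frequencies.
    expected = sum(
        (total / (n_items * n_raters)) ** 2 for total in category_totals.values()
    )
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement, and values near 0 mean agreement no better than chance.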




              Table 1: Tweet counts for the complete annotated dataset




3. A Language-Independent Heuristic

► To train a sentiment classifier, a large amount of labeled
  training data is needed
      ▪ Can be obtained without human effort using a previously proposed heuristic
► The heuristic uses emoticons in tweets as noisy labels




►   Heuristic: If a tweet contains only positive emoticons, label its
    whole text as positive (and vice versa for negative).

►   Examples of emoticons we used:
           ▪ Positive:       :) :-) =) ;) :] :D ^-^ ^_^
           ▪ Negative:       :( :-( :(( -.- >:-( D: :/
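
A minimal sketch of the labeling heuristic above. The emoticon sets contain only the examples listed on this slide (the authors' full lists are not shown). Matching on whitespace-separated tokens and the helper names `label_tweet` and `strip_emoticons` are assumptions made for illustration. Removing the emoticons from the training text is a common practice with this kind of noisy labeling, not something the slide states.

```python
# Emoticon examples copied from the slide; the complete lists may be longer.
POSITIVE = {":)", ":-)", "=)", ";)", ":]", ":D", "^-^", "^_^"}
NEGATIVE = {":(", ":-(", ":((", "-.-", ">:-(", "D:", ":/"}


def label_tweet(text):
    """Return 'positive', 'negative', or None (tweet not usable as training data).

    Assumes emoticons appear as whitespace-separated tokens (a simplification).
    """
    tokens = text.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # no emoticons at all, or both polarities present


def strip_emoticons(text):
    """Drop emoticon tokens so a classifier cannot simply memorise the label (assumption)."""
    return " ".join(t for t in text.split() if t not in POSITIVE | NEGATIVE)
```

Matching whole tokens means that the `:/` inside a URL such as `http://example.com` is not mistaken for a negative emoticon.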


3. Heuristic for Semi-Supervised Learning

► Heuristic can be applied to almost any language, since
  emoticons are used extensively on Twitter
► The number of tweets with emoticons differs among languages
     ▪ Caused by many factors, such as language-specific ways of expressing sentiment or different proportions of “formal” tweets




            Table 2: Number of tweets containing emoticons for each language




4. Experiments – Sentiment Classification

►   Data:
       ▪ Training: from ~800M random tweets of mixed languages:
           ▪ Filter for languages: English, German, French, Portuguese
           ▪ Use the emoticon heuristic to select and label training data
       ▪ Evaluation: 12,597 hand-annotated tweets (4 languages)

►   Setup:
        ▪ Classification: sentiment polarity only
        ▪ Classifier: Naive Bayes
        ▪ Features: 1-grams alone, and 1-grams combined with 2-grams
        ▪ Trained 4 classifiers for en, de, fr, pt
                  1 classifier for combined en+de+fr+pt
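
The setup above can be sketched with a small multinomial Naive Bayes over 1-gram or 1,2-gram features. This is an illustrative pure-Python version, not the authors' implementation: whitespace tokenisation, lower-casing, add-one smoothing and the names `ngrams` and `NaiveBayes` are all assumptions.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=1):
    """Lower-cased whitespace tokens as 1-grams, plus 2-grams when max_n >= 2."""
    tokens = text.lower().split()
    feats = list(tokens)
    if max_n >= 2:
        feats += [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, max_n=1):
        self.max_n = max_n
        self.class_counts = Counter()               # training tweets per label
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for f in ngrams(text, self.max_n):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for f in ngrams(text, self.max_n):
                if f in self.vocab:  # ignore features never seen in training
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Under these assumptions, the per-language classifiers would be four instances, each fitted on one language's emoticon-labeled tweets, and the combined classifier one instance fitted on all four sets together.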


4. Experiments: Evaluation Dataset

► 2 variations of our evaluation set for the experiments:
      ▪ agree-3: tweets for which all 3 annotators agreed on the sentiment
      ▪ agree-2: tweets for which at least 2 annotators agreed on the sentiment
► Baseline: always guess “positive” (more pos. tweets than neg.)
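
The two evaluation sets and the baseline can be sketched as follows. The input layout (a list of `(text, [label, label, label])` pairs) and the function names are hypothetical. Dropping tweets whose agreed label is “neutral” is an assumption based on the polarity-only setup, not something this slide states.

```python
from collections import Counter


def majority_label(annotations, min_agree):
    """Return the label chosen by at least `min_agree` annotators, else None."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= min_agree else None


def build_eval_set(tweets, min_agree):
    """Keep polar tweets whose label reaches the agreement threshold.

    min_agree=3 gives an agree-3 style set, min_agree=2 an agree-2 style set.
    """
    kept = []
    for text, annotations in tweets:
        label = majority_label(annotations, min_agree)
        if label in ("positive", "negative"):  # assumption: neutral tweets are dropped
            kept.append((text, label))
    return kept


def baseline_accuracy(eval_set):
    """Accuracy of always guessing 'positive'."""
    return sum(1 for _, label in eval_set if label == "positive") / len(eval_set)
```

With three annotators, at most one label can reach two votes, so the majority label is unambiguous whenever it exists.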




               Table 3: Tweet counts for the evaluation datasets



4. Results – English Classifier

► Best results: English classifier using 1-grams, on the agree-3 set
      ▪ 81.3% accuracy (trained on 500k tweets)
► Performance on the agree-2 set is consistently lower than on agree-3

[Figure: results for the English (en) classifier]




4. Results – All Languages
[Figure panels: results for en, de, fr, pt]




4. Evaluation – All Languages Compared
► Strong differences between languages
► Differences do not correlate with the number of emoticons in each language
► The emoticon heuristic is a better fit for some languages; this may depend on how sentiment is typically expressed in a language
► Example (Portuguese): “muito engraçado kkkkkkkk” (“very funny hahahaha”)

[Figure panels: en, de, fr, pt]

Table 3: Tweet counts containing emoticons for each language



4. Evaluation – Multi-language Classifier
► Tested on the combined 4-language evaluation set
► Highest performance: 71.5% accuracy
      ▪ Slightly less than using 4 individual classifiers (73.9% accuracy)
► The usefulness of a combined classifier can outweigh the performance degradation

[Figure: results for the combined en+de+fr+pt classifier]




Conclusions

►   We presented and evaluated a language-independent
    sentiment classification approach on 4 languages
        ▪ A language-independent classifier can be trained given only raw tweets, using a noisy-label heuristic
        ▪ Good performance across languages, though it varies by language
        ▪ Classifiers need a very large number of tweets for training
        ▪ Mixed-language classifiers are viable

►   Future work:
        ▪ Currently we only classify sentiment polarity
        ▪ Classifying subjectivity in tweets is important, but finding a good heuristic to label “neutral” tweets is a challenge

Language-Independent Twitter Sentiment Analysis




         Thanks for your attention!

                            Questions?



Contact


Sascha Narr, Dipl.-Inform.
Competence Center Information Retrieval & Machine Learning
sascha.narr@dai-labor.de
Phone +49 (0) 30 / 314 – 74 138
Fax +49 (0) 30 / 314 – 74 003

DAI-Labor
Technische Universität Berlin
Fakultät IV – Elektrotechnik & Informatik
Sekretariat TEL 14
Ernst Reuter Platz 7
10587 Berlin
www.dai-labor.de



Language-Independent Twitter Sentiment Analysis

  • 1. Language-Independent Twitter Sentiment Analysis Sascha Narr, Michael Hülfenhaus, Sahin Albayrak Sascha Narr Competence Center Information Retrieval & Machine Learning KDML 2012, LWA, Dortmund, Germany
  • 2. Overview ►1. Sentiment analysis on social media ►2. Creation of a multilingual evaluation dataset of tweets ►3. A language-independent sentiment labeling heuristic for semi-supervised learning ►4. Experiments on the multilingual dataset 18. September 2012 Language-Independent Twitter Sentiment Analysis 2
  • 3. Overview ►1. Sentiment analysis on social media ►2. Creation of a multilingual evaluation dataset of tweets ►3. A language-independent sentiment labeling heuristic for semi-supervised learning ►4. Experiments on the multilingual dataset 18. September 2012 Language-Independent Twitter Sentiment Analysis 3
  • 4. 1. Sentiment Analysis on Social Media ► Why Sentiment Analysis?  People’s opinions and sentiments about products and events in large numbers are invaluable:  Market research, product feedback and more  Sentiment Analysis allows to automatically collect such data ► Why Twitter?  400 Million tweets posted each day[1]  Shorter text lengths encourage people to “just write” what they think  Tweets are often informal and contain lots of opinions [1]: http://news.cnet.com/8301-1023 3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/ 18. September 2012 Language-Independent Twitter Sentiment Analysis 4
  • 5. 1. Methods for Sentiment Classification ► Sentiment classification goals:  Subjectivity: “Does the tweet contain an opinion?”  Polarity: “Is the expressed opinion positive or negative?” ► Classifiers used:  Naive Bayes, Maximum Entropy, Support Vector Machines ► Features used:  n-grams, WordNet semantics, part-of-speech information ► Tweet texts have unique properties:  Informal, contain slang, emoticons, misspellings 18. September 2012 Language-Independent Twitter Sentiment Analysis 5
  • 6. 1. Multilingual Sentiment Analysis ►Less than 40% of tweets are English [1] ►Natural language processing methods are often designed specifically for one language ► Increase coverage of sentiment analysis by using a language-independent approach: No extra effort for additional languages Is the approach really effective for all languages? [1] http://semiocast.com/publications/2011_11_24_Arabic_highest_growth_on_Twitter 18. September 2012 Language-Independent Twitter Sentiment Analysis 6
  • 7. Overview ►1. Sentiment analysis on social media ►2. Creation of a multilingual evaluation dataset of tweets ►3. A language-independent sentiment labeling heuristic for semi-supervised learning ►4. Experiments on the multilingual dataset 18. September 2012 Language-Independent Twitter Sentiment Analysis 7
  • 8. 2. Creation of a Multilingual Evaluation Dataset ► We created a hand-annotated sentiment evaluation dataset of over 12000 tweets  4 languages: English, German, French, Portuguese ►Used the Amazon Mechanical Turk platform for annotation ►Each tweet was annotated by 3 different workers:  Labels: “positive”, “neutral”, “negative”  Added validation tweets to try to ensure the quality of the annotations 18. September 2012 Language-Independent Twitter Sentiment Analysis 8
  • 9. 2. Our Multilingual Evaluation Dataset ► We observed low inter-annotator agreement in our dataset: sentiment classification is a hard task, even for humans, and tweets that humans disagree on are harder to classify as well. ► The dataset is publicly available for research purposes. Table 1: Tweet counts for the complete annotated dataset
  • 11. 3. A Language-Independent Heuristic ► To train a sentiment classifier, a large amount of labeled training data is needed. It can be obtained without human effort using a previously proposed heuristic. ► The heuristic uses emoticons in tweets as noisy labels. ► Heuristic: if a tweet contains only positive emoticons, label its whole text as positive (and vice versa for negative). ► Examples of emoticons we used: Positive: :) :-) =) ;) :] :D ^-^ ^_^ Negative: :( :-( :(( -.- >:-( D: :/
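The labeling rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the emoticon sets are only the examples listed on the slide, and the function name `emoticon_label` and the whitespace tokenization are assumptions made for the sketch.

```python
from typing import Optional

# Example emoticons from the slide; the authors' full lists may be longer.
POSITIVE = {":)", ":-)", "=)", ";)", ":]", ":D", "^-^", "^_^"}
NEGATIVE = {":(", ":-(", ":((", "-.-", ">:-(", "D:", ":/"}


def emoticon_label(tweet: str) -> Optional[str]:
    """Return a noisy sentiment label for a tweet, or None if it cannot be labeled.

    Assumption: emoticons are matched as whole whitespace-separated tokens,
    so the ":/" inside "http://..." is not mistaken for a negative emoticon.
    """
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    # No emoticons, or a mix of both polarities: leave the tweet unlabeled.
    return None
```

Tweets for which the function returns `None` would simply be left out of the training data. In practice one would probably also strip the matched emoticons from the text before training, so that the classifier does not just learn the labels themselves; the slides do not say whether this was done.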
  • 12. 3. Heuristic for Semi-Supervised Learning ► The heuristic can be applied to almost any language, since emoticons are used extensively on Twitter. ► The number of tweets with emoticons differs among languages. This is caused by many factors, such as language-specific ways of expressing sentiment or different proportions of “formal” tweets. Table 2: Number of tweets containing emoticons for each language
  • 14. 4. Experiments – Sentiment Classification ► Data: Training: starting from ~800M random tweets of mixed languages, filter for the languages English, German, French and Portuguese, then use the emoticon heuristic to select and label the training data. Evaluation: 12597 hand-annotated tweets (4 languages). ► Setup: Classification: sentiment polarity only. Classifier: Naive Bayes. Features: 1-grams alone, and 1-grams combined with 2-grams. We trained 4 single-language classifiers (en, de, fr, pt) and 1 classifier for the combined en+de+fr+pt data.
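To make the setup concrete, here is a small sketch of a Naive Bayes polarity classifier over 1-gram or 1,2-gram features. The slides do not state which Naive Bayes variant, smoothing or tokenization was used, so the multinomial model, add-one smoothing, lowercasing and whitespace tokenization below are all assumptions, as are the names `ngrams` and `NaiveBayes`.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=1):
    """Return all n-grams of the text for n = 1..max_n (whitespace tokens)."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def __init__(self, max_n=1):
        self.max_n = max_n                         # 1 = 1-grams, 2 = 1,2-grams
        self.class_counts = Counter()              # tweets seen per label
        self.feature_counts = defaultdict(Counter) # n-gram counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feat in ngrams(text, self.max_n):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feat in ngrams(text, self.max_n):
                if feat in self.vocab:             # ignore n-grams never seen in training
                    score += math.log((self.feature_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Under these assumptions, a single-language classifier would be fitted on the emoticon-labeled tweets of one language, and the combined classifier on the tweets of all four languages together; `max_n=1` corresponds to the 1-gram setting and `max_n=2` to the 1,2-gram setting.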
  • 15. 4. Experiments: Evaluation Dataset ► 2 variations of our evaluation set were used for the experiments: agree-3, the tweets for which all 3 annotators agreed on a sentiment, and agree-2, the tweets for which at least 2 annotators agreed. ► Baseline: always guess “positive” (there are more positive tweets than negative ones). Table 3: Tweet counts for the evaluation datasets
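The two evaluation sets and the baseline can be illustrated as follows. This is a sketch with hypothetical helper names (`majority_label`, `build_eval_set`, `baseline_accuracy`); it assumes each tweet comes with exactly three annotator labels and, since the experiments classify polarity only, that the baseline is computed over positive and negative tweets. The slides do not spell out either detail.

```python
from collections import Counter


def majority_label(annotations, min_agree):
    """Return the most frequent label if at least min_agree annotators chose it."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= min_agree else None


def build_eval_set(annotated, min_agree):
    """Keep (text, label) pairs whose annotations reach the agreement level.

    min_agree=3 gives the agree-3 set, min_agree=2 the agree-2 set.
    """
    result = []
    for text, annotations in annotated:
        label = majority_label(annotations, min_agree)
        if label is not None:
            result.append((text, label))
    return result


def baseline_accuracy(eval_set):
    """Accuracy of always guessing "positive" on the polar tweets of the set."""
    polar = [label for _, label in eval_set if label in ("positive", "negative")]
    return sum(label == "positive" for label in polar) / len(polar)
```

By construction every agree-3 tweet is also in agree-2, so agree-2 is the larger set and additionally contains the tweets the annotators partly disagreed on.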
  • 16. 4. Results – English Classifier ► Best results: English classifier using 1-grams, on the agree-3 set: 81.3% accuracy (trained on 500k tweets). ► Performance on the agree-2 set is consistently lower than on the agree-3 set. [Figure: results for en]
  • 17. 4. Results – All Languages [Figure: results for en, de, fr, pt]
  • 18. 4. Evaluation – All Languages Compared ► Strong differences between languages. ► The differences do not correlate with the number of emoticons in each language. ► The emoticon heuristic is a better fit for some languages; this may depend on how sentiment is typically expressed in each language. ► Example: “muito engraçado kkkkkkkk” (Portuguese for “very funny”, where “kkkkkkkk” represents laughter rather than an emoticon). [Figure: results compared for en, de, fr, pt] Table 3: Tweet counts containing emoticons for each language
  • 19. 4. Evaluation – Multi-language Classifier ► Tested on the combined 4-language evaluation set. ► Highest performance: 71.5% accuracy, slightly less than using 4 individual classifiers (73.9% accuracy). ► The usefulness of a combined classifier can outweigh the performance degradation. [Figure: results for the combined en+de+fr+pt classifier]
  • 20. Conclusions ► We presented a language-independent sentiment classification approach and evaluated it on 4 languages. A language-independent classifier can be trained given only raw tweets, using a noisy label heuristic. Performance is good across languages but varies for each. Classifiers need a very large number of tweets for training. Mixed-language classifiers are viable. ► Future work: currently we only classify sentiment polarity. Classifying subjectivity in tweets is important, but finding a good heuristic to label “neutral” tweets is a challenge.
  • 21. Language-Independent Twitter Sentiment Analysis. Thanks for your attention! Questions?
  • 22. Contact: Sascha Narr, Dipl.-Inform., Competence Center Information Retrieval & Machine Learning, sascha.narr@dai-labor.de, Phone +49 (0) 30 / 314 – 74 138, Fax +49 (0) 30 / 314 – 74 003, www.dai-labor.de. Address: DAI-Labor, Technische Universität Berlin, Fakultät IV – Elektrotechnik & Informatik, Sekretariat TEL 14, Ernst Reuter Platz 7, 10587 Berlin