SlideShare une entreprise Scribd logo
1  sur  23
Télécharger pour lire hors ligne
On Stopwords, Filtering and Data Sparsity for
Sentiment Analysis of Twitter
Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani
Knowledge Media Institute, The Open University,
Milton Keynes, United Kingdom
The 9th edition of the Language Resources and Evaluation
Conference, Reykjavik, Iceland
• Sentiment Analysis
• Twitter
• Stopwords Removal Methods
• Comparative Study
• Conclusion
Outline
“Sentiment analysis is the task of identifying
positive and negative opinions, emotions and
evaluations in text”
3
The main dish was
delicious
It is a Syrian dish
The main dish was
salty and horrible
Opinion OpinionFact
Sentiment Analysis
On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter
On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter
Stopwords Removal
Stopwords Removal in Twitter Sentiment Analysis
- Kouloumpis et al. 2011
- Pak & Paroubek, 2010
- Asiaee et al., 2012
- Bollen et al., 2011
- Bifet and Frank, 2010
- Speriosu et al., 2011
- Zhang & Yuan, 2013
- Gokulakrishnan et al 2012
- Saif et al., 2012
- Hu et al., 2013
- Camara et al., 2013
Removing
Stopwords
is USEFUL
NOYES
• Precompiled
• Very popular
• Outdated
• Domain-
Independent
Classic Stopword Lists
• Unsupervised Methods
– Term Frequency
– Term-based Random Sampling
• Supervised
– Term Entropy Measures
– Maximum Likelihood Estimation
Automatic Stopwords Generation Methods
Stopwords Removal
for Twitter Sentiment
Analysis
Stopword Analysis Set-Up (1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
OMD
HCR
STS
SemEval
WAB
GASP
OMD HCR STS SemEval WAB GASP
Negative 688 957 1402 1590 2580 5235
Positive 393 397 632 3781 2915 1050
Datasets
Stopword Analysis Set-Up (2)
Stopwords Removal Methods
1. The Baseline Method
– (non removal of stopwords)
1. The Classic Method
– This method is based on removing stopwords
obtained from pre-compiled lists
– Van Stoplist
Stopword Analysis Set-Up (3)
Stopwords Removal Methods
3. Methods based on Zipf’s
Law
- TF-High Method
Removing most frequent
- TF1 Method
Removing singleton words (i.e.,
words that occur once in tweets)
- IDF Method
Removing words with low inverse
document frequency (IDF)
Stopword Analysis Set-Up (4)
Stopwords Removal Methods
4. Term-based Random Sampling (TBRS)
5. The Mutual Information Method (MI)
Stopword Analysis Set-Up (5)
Twitter Sentiment Classifiers
– Two Supervised Classifiers:
• Maximum Entropy (MaxEnt)
• Naïve Bayes (NB)
– Measure the performance in Accuracy and F1
measure
– 10 fold cross validation
Experimental Results
Assess the impact of removing
stopwords by observing fluctuations on:
- Classification Performance
- Feature space
- Data Sparsity
Experimental Results (1)
1. Classification Performance
70
75
80
85
90
95
OMD HCR STS-Gold SemEval WAB GASP
Accuracy(%)
MaxEnt NB
60
65
70
75
80
85
90
OMD HCR STS-Gold SemEval WAB GASP
F1(%)
MaxEnt NB
The baseline classification performance in Accuracy and F-measure
of MaxEnt and NB classifiers across all datasets
Accuracy F-Measure
Experimental Results (2)
1. Classification Performance
60
65
70
75
80
85
90
Baseline Classic TF1 TF-High IDF TBRS MI
Accuracy(%)
MaxEnt NB
50
55
60
65
70
75
80
85
Baseline Classic TF1 TF-High IDF TBRS MIF1(%)
MaxEnt NB
Accuracy F-Measure
Average Accuracy and F-measure of MaxEnt and NB classifiers using
different stoplists
Experimental Results (3)
2. Feature Space
0.00
5.50
65.24
0.82
11.22
6.06
19.34
Baseline Classic TF1 TF-High IDF TBRS MI
Reduction rate on the feature space of
the various stoplists
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
OMD HCR STS-Gold SemEval WAB GASP
TF=1 TF>1
The number of singleton words to the number
non singleton words in all datasets
Experimental Results (4)
3. Data Sparsity
0.98800
0.99000
0.99200
0.99400
0.99600
0.99800
1.00000
Baseline Classic TF1 TF-High IDF TBRS MI
SparsityDegree
OMD HCR STS-Gold SemEval WAB GASP
Stoplist impact on the sparsity degree of all datasets
The Ideal Stoplist (1)
• The ideal stopword removal method is the
one which:
– Helps maintaining a high classification
performance,
– Leads to shrinking the classifier’s feature space
– Reduces the data sparseness
– Has low runtime and storage complexity
– Has minimal human supervision
The Ideal Stoplist (2)
Average accuracy, F1, reduction rate on feature space and data sparsity of the six stoplist
methods. Positive sparsity values refer to an increase in the sparsity degree while negative
values refer to a decrease in the sparsity degree.
Overall Analysis Results
Conclusion
• We studied how six different stopword removal methods
affect the sentiment polarity classification on Twitter.
• The use of pre-compiled (classic) Stoplist has a negative
impact on the classification performance.
• TF1 stopword removal method is the one that obtains the
best trade-off:
– Reducing the feature space by nearly 65%,
– Decreasing the data sparsity degree up to 0.37%, and
– Maintaining a high classification performance.

Contenu connexe

Tendances

Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Heuristic Search Techniques Unit -II.ppt
Heuristic Search Techniques Unit -II.pptHeuristic Search Techniques Unit -II.ppt
Heuristic Search Techniques Unit -II.pptkarthikaparthasarath
 
Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...
Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...
Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...Nirav Raje
 
Transport layer interface
Transport layer interface Transport layer interface
Transport layer interface CEC Landran
 
From ELIZA to Alexa and Beyond
From ELIZA to Alexa and BeyondFrom ELIZA to Alexa and Beyond
From ELIZA to Alexa and BeyondCharmi Chokshi
 
Test suite minimization
Test suite minimizationTest suite minimization
Test suite minimizationGaurav Saxena
 
Naive bayesian classification
Naive bayesian classificationNaive bayesian classification
Naive bayesian classificationDr-Dipali Meher
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingYasir Khan
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingRishikese MR
 
Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16Dhilsath Fathima
 
Introduction to prolog
Introduction to prologIntroduction to prolog
Introduction to prologHarry Potter
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLPRupak Roy
 

Tendances (20)

NLP_KASHK:Smoothing N-gram Models
NLP_KASHK:Smoothing N-gram ModelsNLP_KASHK:Smoothing N-gram Models
NLP_KASHK:Smoothing N-gram Models
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Heuristic Search Techniques Unit -II.ppt
Heuristic Search Techniques Unit -II.pptHeuristic Search Techniques Unit -II.ppt
Heuristic Search Techniques Unit -II.ppt
 
Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...
Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...
Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...
 
Transport layer interface
Transport layer interface Transport layer interface
Transport layer interface
 
Scaling and Normalization
Scaling and NormalizationScaling and Normalization
Scaling and Normalization
 
From ELIZA to Alexa and Beyond
From ELIZA to Alexa and BeyondFrom ELIZA to Alexa and Beyond
From ELIZA to Alexa and Beyond
 
NLP_KASHK:POS Tagging
NLP_KASHK:POS TaggingNLP_KASHK:POS Tagging
NLP_KASHK:POS Tagging
 
Test suite minimization
Test suite minimizationTest suite minimization
Test suite minimization
 
Naive bayesian classification
Naive bayesian classificationNaive bayesian classification
Naive bayesian classification
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Sudoku Solver
Sudoku SolverSudoku Solver
Sudoku Solver
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Data mining
Data miningData mining
Data mining
 
Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16
 
Introduction to prolog
Introduction to prologIntroduction to prolog
Introduction to prolog
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 

En vedette

Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Dev Sahu
 
Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment AnalysisJaganadh Gopinadhan
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in TwitterAyushi Dalmia
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 
Intrusion Detection with Neural Networks
Intrusion Detection with Neural NetworksIntrusion Detection with Neural Networks
Intrusion Detection with Neural Networksantoniomorancardenas
 
A comparison of Lexicon-based approaches for Sentiment Analysis of microblog ...
A comparison of Lexicon-based approaches for Sentiment Analysis of microblog ...A comparison of Lexicon-based approaches for Sentiment Analysis of microblog ...
A comparison of Lexicon-based approaches for Sentiment Analysis of microblog ...Cataldo Musto
 
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing SystemsShuyo Nakatani
 
Introduction to Language and Linguistics 006: Syntax & Semantics (the interface)
Introduction to Language and Linguistics 006: Syntax & Semantics (the interface)Introduction to Language and Linguistics 006: Syntax & Semantics (the interface)
Introduction to Language and Linguistics 006: Syntax & Semantics (the interface)Meagan Louie
 
Supervised Learning Based Approach to Aspect Based Sentiment Analysis
Supervised Learning Based Approach to Aspect Based Sentiment AnalysisSupervised Learning Based Approach to Aspect Based Sentiment Analysis
Supervised Learning Based Approach to Aspect Based Sentiment AnalysisTharindu Kumara
 
Social media & sentiment analysis splunk conf2012
Social media & sentiment analysis   splunk conf2012Social media & sentiment analysis   splunk conf2012
Social media & sentiment analysis splunk conf2012Michael Wilde
 
Short Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-GramShort Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-GramShuyo Nakatani
 
Sentiment analysis-by-nltk
Sentiment analysis-by-nltkSentiment analysis-by-nltk
Sentiment analysis-by-nltkWei-Ting Kuo
 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayesDhwaj Raj
 
Tweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTrilok Sharma
 

En vedette (20)

Semantic Patterns for Sentiment Analysis of Twitter
Semantic Patterns for Sentiment Analysis of TwitterSemantic Patterns for Sentiment Analysis of Twitter
Semantic Patterns for Sentiment Analysis of Twitter
 
Adapting Sentiment Lexicons using Contextual Semantics for Sentiment Analysis...
Adapting Sentiment Lexicons using Contextual Semantics for Sentiment Analysis...Adapting Sentiment Lexicons using Contextual Semantics for Sentiment Analysis...
Adapting Sentiment Lexicons using Contextual Semantics for Sentiment Analysis...
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier
 
Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment Analysis
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in Twitter
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
SentiCircles for Contextual and Conceptual Semantic Sentiment Analysis of Twi...
SentiCircles for Contextual and Conceptual Semantic Sentiment Analysis of Twi...SentiCircles for Contextual and Conceptual Semantic Sentiment Analysis of Twi...
SentiCircles for Contextual and Conceptual Semantic Sentiment Analysis of Twi...
 
Intrusion Detection with Neural Networks
Intrusion Detection with Neural NetworksIntrusion Detection with Neural Networks
Intrusion Detection with Neural Networks
 
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase...
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase...Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase...
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase...
 
A comparison of Lexicon-based approaches for Sentiment Analysis of microblog ...
A comparison of Lexicon-based approaches for Sentiment Analysis of microblog ...A comparison of Lexicon-based approaches for Sentiment Analysis of microblog ...
A comparison of Lexicon-based approaches for Sentiment Analysis of microblog ...
 
Alleviating Data Sparsity for Twitter Sentiment Analysis
Alleviating Data Sparsity for Twitter Sentiment AnalysisAlleviating Data Sparsity for Twitter Sentiment Analysis
Alleviating Data Sparsity for Twitter Sentiment Analysis
 
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems
 
Introduction to Language and Linguistics 006: Syntax & Semantics (the interface)
Introduction to Language and Linguistics 006: Syntax & Semantics (the interface)Introduction to Language and Linguistics 006: Syntax & Semantics (the interface)
Introduction to Language and Linguistics 006: Syntax & Semantics (the interface)
 
Supervised Learning Based Approach to Aspect Based Sentiment Analysis
Supervised Learning Based Approach to Aspect Based Sentiment AnalysisSupervised Learning Based Approach to Aspect Based Sentiment Analysis
Supervised Learning Based Approach to Aspect Based Sentiment Analysis
 
Social media & sentiment analysis splunk conf2012
Social media & sentiment analysis   splunk conf2012Social media & sentiment analysis   splunk conf2012
Social media & sentiment analysis splunk conf2012
 
Short Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-GramShort Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-Gram
 
Sentiment analysis-by-nltk
Sentiment analysis-by-nltkSentiment analysis-by-nltk
Sentiment analysis-by-nltk
 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayes
 
Tweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVM
 
NLP
NLPNLP
NLP
 

Similaire à On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter

Improving rapid access to reports of RCTs from EMBASE: innovative methods to...
Improving rapid access to reports of RCTs from EMBASE: innovative  methods to...Improving rapid access to reports of RCTs from EMBASE: innovative  methods to...
Improving rapid access to reports of RCTs from EMBASE: innovative methods to...York Health Economics Consortium (YHEC)
 
Multimedia Geocoding: The RECOD 2014 Approach
Multimedia Geocoding: The RECOD 2014 ApproachMultimedia Geocoding: The RECOD 2014 Approach
Multimedia Geocoding: The RECOD 2014 Approachmultimediaeval
 
High Throughput Screening for ARV Drugs in HIV Prevention Studies
High Throughput Screening for ARV Drugs in HIV Prevention StudiesHigh Throughput Screening for ARV Drugs in HIV Prevention Studies
High Throughput Screening for ARV Drugs in HIV Prevention StudiesHopkinsCFAR
 
Information retrieval in systematic reviews: a case study of the crime preven...
Information retrieval in systematic reviews: a case study of the crime preven...Information retrieval in systematic reviews: a case study of the crime preven...
Information retrieval in systematic reviews: a case study of the crime preven...Lisa Tompson
 
Esophageal Speech Recognition using Artificial Neural Network (ANN)
Esophageal Speech Recognition using Artificial Neural Network (ANN)Esophageal Speech Recognition using Artificial Neural Network (ANN)
Esophageal Speech Recognition using Artificial Neural Network (ANN)Saibur Rahman
 
RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)
RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)
RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)r-kor
 
Statistics for linguistics
Statistics for linguisticsStatistics for linguistics
Statistics for linguisticsaiaioo
 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisUniversity of California, Davis
 
Survey Method in Research
Survey Method in ResearchSurvey Method in Research
Survey Method in ResearchJasmin Cruz
 
44publicspkeaking06
44publicspkeaking0644publicspkeaking06
44publicspkeaking06emailtuanh
 
POPULATION AND SAMPLING.pptx
POPULATION AND SAMPLING.pptxPOPULATION AND SAMPLING.pptx
POPULATION AND SAMPLING.pptxMartMantilla1
 
Marketing Research Project on T test
Marketing Research Project on T test Marketing Research Project on T test
Marketing Research Project on T test Meghna Baid
 
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...Weiyang Tong
 
Classifying Non-Referential It for Question Answer Pairs
Classifying Non-Referential It for Question Answer PairsClassifying Non-Referential It for Question Answer Pairs
Classifying Non-Referential It for Question Answer PairsJinho Choi
 
The painful removal of tiling artefacts in ToF-SIMS data
The painful removal of tiling artefacts in ToF-SIMS dataThe painful removal of tiling artefacts in ToF-SIMS data
The painful removal of tiling artefacts in ToF-SIMS dataCSIRO
 

Similaire à On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter (20)

Tabu search
Tabu searchTabu search
Tabu search
 
Improving rapid access to reports of RCTs from EMBASE: innovative methods to...
Improving rapid access to reports of RCTs from EMBASE: innovative  methods to...Improving rapid access to reports of RCTs from EMBASE: innovative  methods to...
Improving rapid access to reports of RCTs from EMBASE: innovative methods to...
 
Multimedia Geocoding: The RECOD 2014 Approach
Multimedia Geocoding: The RECOD 2014 ApproachMultimedia Geocoding: The RECOD 2014 Approach
Multimedia Geocoding: The RECOD 2014 Approach
 
High Throughput Screening for ARV Drugs in HIV Prevention Studies
High Throughput Screening for ARV Drugs in HIV Prevention StudiesHigh Throughput Screening for ARV Drugs in HIV Prevention Studies
High Throughput Screening for ARV Drugs in HIV Prevention Studies
 
Information retrieval in systematic reviews: a case study of the crime preven...
Information retrieval in systematic reviews: a case study of the crime preven...Information retrieval in systematic reviews: a case study of the crime preven...
Information retrieval in systematic reviews: a case study of the crime preven...
 
Esophageal Speech Recognition using Artificial Neural Network (ANN)
Esophageal Speech Recognition using Artificial Neural Network (ANN)Esophageal Speech Recognition using Artificial Neural Network (ANN)
Esophageal Speech Recognition using Artificial Neural Network (ANN)
 
RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)
RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)
RUCK 2017 김성환 R 패키지 메타주성분분석(MetaPCA)
 
Pesticide Residue Analysis Webinar Series: Tips and Tricks for the Whole Work...
Pesticide Residue Analysis Webinar Series: Tips and Tricks for the Whole Work...Pesticide Residue Analysis Webinar Series: Tips and Tricks for the Whole Work...
Pesticide Residue Analysis Webinar Series: Tips and Tricks for the Whole Work...
 
Statistics for linguistics
Statistics for linguisticsStatistics for linguistics
Statistics for linguistics
 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
 
Bioanalytical ppt
Bioanalytical pptBioanalytical ppt
Bioanalytical ppt
 
Survey Method in Research
Survey Method in ResearchSurvey Method in Research
Survey Method in Research
 
Data analytics, a (short) tour
Data analytics, a (short) tourData analytics, a (short) tour
Data analytics, a (short) tour
 
44publicspkeaking06
44publicspkeaking0644publicspkeaking06
44publicspkeaking06
 
POPULATION AND SAMPLING.pptx
POPULATION AND SAMPLING.pptxPOPULATION AND SAMPLING.pptx
POPULATION AND SAMPLING.pptx
 
Marketing Research Project on T test
Marketing Research Project on T test Marketing Research Project on T test
Marketing Research Project on T test
 
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...
 
Classifying Non-Referential It for Question Answer Pairs
Classifying Non-Referential It for Question Answer PairsClassifying Non-Referential It for Question Answer Pairs
Classifying Non-Referential It for Question Answer Pairs
 
The painful removal of tiling artefacts in ToF-SIMS data
The painful removal of tiling artefacts in ToF-SIMS dataThe painful removal of tiling artefacts in ToF-SIMS data
The painful removal of tiling artefacts in ToF-SIMS data
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 

Dernier

Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...Sérgio Sacani
 
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdfPests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdfPirithiRaju
 
PSP3 employability assessment form .docx
PSP3 employability assessment form .docxPSP3 employability assessment form .docx
PSP3 employability assessment form .docxmarwaahmad357
 
Physics Serway Jewett 6th edition for Scientists and Engineers
Physics Serway Jewett 6th edition for Scientists and EngineersPhysics Serway Jewett 6th edition for Scientists and Engineers
Physics Serway Jewett 6th edition for Scientists and EngineersAndreaLucarelli
 
Pests of ragi_Identification, Binomics_Dr.UPR
Pests of ragi_Identification, Binomics_Dr.UPRPests of ragi_Identification, Binomics_Dr.UPR
Pests of ragi_Identification, Binomics_Dr.UPRPirithiRaju
 
Main Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearMain Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearmarwaahmad357
 
geometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsgeometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsHassan Jolany
 
Pests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdf
Pests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdfPests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdf
Pests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdfPirithiRaju
 
KeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data scienceKeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data scienceLayne Sadler
 
M.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery SystemsM.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery SystemsSumathi Arumugam
 
World Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabWorld Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabkiyorndlab
 
Genetic Engineering in bacteria for resistance.pptx
Genetic Engineering in bacteria for resistance.pptxGenetic Engineering in bacteria for resistance.pptx
Genetic Engineering in bacteria for resistance.pptxaishnasrivastava
 
Krishi Vigyan Kendras - कृषि विज्ञान केंद्र
Krishi Vigyan Kendras - कृषि विज्ञान केंद्रKrishi Vigyan Kendras - कृषि विज्ञान केंद्र
Krishi Vigyan Kendras - कृषि विज्ञान केंद्रKrashi Coaching
 
Exploration Method’s in Archaeological Studies & Research
Exploration Method’s in Archaeological Studies & ResearchExploration Method’s in Archaeological Studies & Research
Exploration Method’s in Archaeological Studies & ResearchPrachya Adhyayan
 
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...marwaahmad357
 
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdfSUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdfsantiagojoderickdoma
 
soft skills question paper set for bba ca
soft skills question paper set for bba casoft skills question paper set for bba ca
soft skills question paper set for bba caohsadfeeling
 
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky WayShiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky WaySérgio Sacani
 
TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)chatterjeesoumili50
 
Basic Concepts in Pharmacology in molecular .pptx
Basic Concepts in Pharmacology in molecular  .pptxBasic Concepts in Pharmacology in molecular  .pptx
Basic Concepts in Pharmacology in molecular .pptxVijayaKumarR28
 

Dernier (20)

Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
 
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdfPests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
 
PSP3 employability assessment form .docx
PSP3 employability assessment form .docxPSP3 employability assessment form .docx
PSP3 employability assessment form .docx
 
Physics Serway Jewett 6th edition for Scientists and Engineers
Physics Serway Jewett 6th edition for Scientists and EngineersPhysics Serway Jewett 6th edition for Scientists and Engineers
Physics Serway Jewett 6th edition for Scientists and Engineers
 
Pests of ragi_Identification, Binomics_Dr.UPR
Pests of ragi_Identification, Binomics_Dr.UPRPests of ragi_Identification, Binomics_Dr.UPR
Pests of ragi_Identification, Binomics_Dr.UPR
 
Main Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearMain Exam Applied biochemistry final year
Main Exam Applied biochemistry final year
 
geometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsgeometric quantization on coadjoint orbits
geometric quantization on coadjoint orbits
 
Pests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdf
Pests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdfPests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdf
Pests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdf
 
KeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data scienceKeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data science
 
M.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery SystemsM.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery Systems
 
World Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabWorld Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlab
 
Genetic Engineering in bacteria for resistance.pptx
Genetic Engineering in bacteria for resistance.pptxGenetic Engineering in bacteria for resistance.pptx
Genetic Engineering in bacteria for resistance.pptx
 
Krishi Vigyan Kendras - कृषि विज्ञान केंद्र
Krishi Vigyan Kendras - कृषि विज्ञान केंद्रKrishi Vigyan Kendras - कृषि विज्ञान केंद्र
Krishi Vigyan Kendras - कृषि विज्ञान केंद्र
 
Exploration Method’s in Archaeological Studies & Research
Exploration Method’s in Archaeological Studies & ResearchExploration Method’s in Archaeological Studies & Research
Exploration Method’s in Archaeological Studies & Research
 
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
 
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdfSUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
 
soft skills question paper set for bba ca
soft skills question paper set for bba casoft skills question paper set for bba ca
soft skills question paper set for bba ca
 
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky WayShiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
 
TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)
 
Basic Concepts in Pharmacology in molecular .pptx
Basic Concepts in Pharmacology in molecular  .pptxBasic Concepts in Pharmacology in molecular  .pptx
Basic Concepts in Pharmacology in molecular .pptx
 

On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter

  • 1. On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom The 9th edition of the Language Resources and Evaluation Conference, Reykjavik, Iceland
  • 2. • Sentiment Analysis • Twitter • Stopwords Removal Methods • Comparative Study • Conclusion Outline
  • 3. “Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text” 3 The main dish was delicious It is a Syrian dish The main dish was salty and horrible Opinion OpinionFact Sentiment Analysis
  • 7. Stopwords Removal in Twitter Sentiment Analysis - Kouloumpis et al. 2011 - Pak & Paroubek, 2010 - Asiaee et al., 2012 - Bollen et al., 2011 - Bifet and Frank, 2010 - Speriosu et al., 2011 - Zhang & Yuan, 2013 - Gokulakrishnan et al 2012 - Saif et al., 2012 - Hu et al., 2013 - Camara et al., 2013 Removing Stopwords is USEFUL NOYES
  • 8. • Precompiled • Very popular • Outdated • Domain- Independent Classic Stopword Lists
  • 9. • Unsupervised Methods – Term Frequency – Term-based Random Sampling • Supervised – Term Entropy Measures – Maximum Likelihood Estimation Automatic Stopwords Generation Methods
  • 10. Stopwords Removal for Twitter Sentiment Analysis
  • 11. Stopword Analysis Set-Up (1) 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% OMD HCR STS SemEval WAB GASP OMD HCR STS SemEval WAB GASP Negative 688 957 1402 1590 2580 5235 Positive 393 397 632 3781 2915 1050 Datasets
  • 12. Stopword Analysis Set-Up (2) Stopwords Removal Methods 1. The Baseline Method – (non removal of stopwords) 1. The Classic Method – This method is based on removing stopwords obtained from pre-compiled lists – Van Stoplist
  • 13. Stopword Analysis Set-Up (3) Stopwords Removal Methods 3. Methods based on Zipf’s Law - TF-High Method Removing most frequent - TF1 Method Removing singleton words (i.e., words that occur once in tweets) - IDF Method Removing words with low inverse document frequency (IDF)
  • 14. Stopword Analysis Set-Up (4) Stopwords Removal Methods 4. Term-based Random Sampling (TBRS) 5. The Mutual Information Method (MI)
  • 15. Stopword Analysis Set-Up (5) Twitter Sentiment Classifiers – Two Supervised Classifiers: • Maximum Entropy (MaxEnt) • Naïve Bayes (NB) – Measure the performance in Accuracy and F1 measure – 10 fold cross validation
  • 16. Experimental Results Assess the impact of removing stopwords by observing fluctuations on: - Classification Performance - Feature space - Data Sparsity
  • 17. Experimental Results (1) 1. Classification Performance 70 75 80 85 90 95 OMD HCR STS-Gold SemEval WAB GASP Accuracy(%) MaxEnt NB 60 65 70 75 80 85 90 OMD HCR STS-Gold SemEval WAB GASP F1(%) MaxEnt NB The baseline classification performance in Accuracy and F-measure of MaxEnt and NB classifiers across all datasets Accuracy F-Measure
  • 18. Experimental Results (2) 1. Classification Performance 60 65 70 75 80 85 90 Baseline Classic TF1 TF-High IDF TBRS MI Accuracy(%) MaxEnt NB 50 55 60 65 70 75 80 85 Baseline Classic TF1 TF-High IDF TBRS MIF1(%) MaxEnt NB Accuracy F-Measure Average Accuracy and F-measure of MaxEnt and NB classifiers using different stoplists
  • 19. Experimental Results (3) 2. Feature Space 0.00 5.50 65.24 0.82 11.22 6.06 19.34 Baseline Classic TF1 TF-High IDF TBRS MI Reduction rate on the feature space of the various stoplists 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% OMD HCR STS-Gold SemEval WAB GASP TF=1 TF>1 The number of singleton words to the number non singleton words in all datasets
  • 20. Experimental Results (4) 3. Data Sparsity 0.98800 0.99000 0.99200 0.99400 0.99600 0.99800 1.00000 Baseline Classic TF1 TF-High IDF TBRS MI SparsityDegree OMD HCR STS-Gold SemEval WAB GASP Stoplist impact on the sparsity degree of all datasets
  • 21. The Ideal Stoplist (1) • The ideal stopword removal method is the one which: – Helps maintaining a high classification performance, – Leads to shrinking the classifier’s feature space – Reduces the data sparseness – Has low runtime and storage complexity – Has minimal human supervision
  • 22. The Ideal Stoplist (2) Average accuracy, F1, reduction rate on feature space and data sparsity of the six stoplist methods. Positive sparsity values refer to an increase in the sparsity degree while negative values refer to a decrease in the sparsity degree. Overall Analysis Results
  • 23. Conclusion • We studied how six different stopword removal methods affect the sentiment polarity classification on Twitter. • The use of pre-compiled (classic) Stoplist has a negative impact on the classification performance. • TF1 stopword removal method is the one that obtains the best trade-off: – Reducing the feature space by nearly 65%, – Decreasing the data sparsity degree up to 0.37%, and – Maintaining a high classification performance.

Notes de l'éditeur

  1. Hi everyone, My name is Hassan Saif, a PhD student KMI in UK. Today I’m gonna present our work on Evaluation datasets for Twitter Sentiment Analysis.. Surveying the pre-existed datasets and proposing a new dataset. the STS-Gold.
  2. I’m gonna start with some basic definitions about the sentiment analysis task on Twitter. then talking about the motivation behind our study. Next I will give a quick overview about the existed evaluation datasets and preseting our new dataset the STS-Gold. Afterwards I’m gonna talk talk about the our results obtained from a comparative study we conducted on the these datasets. Our study is three main parts: in the first part I’m gonna give an overview of some of the most widely used evaluation datasets for Twitter sentiment analysis, pointing out their limitation In the second part I’m gonna present our new gold standard dataset which overcome some of the limitations of the pre-existed datasets The third part is about a comparative study we performed on the all the datasets in terms of 4 different aspects.
  3. Early work on Sentiment analysis focused mainly on extracting sentiment from conventional text such as movie reviews, blogs, news articles and open forums Textual content in these type of media sources is linguistically rich, consists of well structured and formal sentences, and discusses specific topic or domain (e.g., movie reviews)
  4. However, with the emergent of social media networks and microblogging platforms, especially Twitter, research interests shifted to analyzing and extracting sentiment from theses new sources. Nevertheless, One of the key challenges that Twitter sentiment analysis methods have to confront is the noisy nature of Twitter generated data. Twitter allows only for 140 characters in each post, which influences the use of abbreviations, irregular expressions and infrequent words. This phenomena increases the level of data sparsity, affecting the performance of Twitter sentiment classifiers
  5. A well known method to reduce the noise of textual data is the removal of stopwords. This method is based on the idea that discarding non-discriminative words reduces the feature space of the classifiers and helps them to produce more accurate results
  6. This pre-processing method, widely used in the literature of document classification and retrieval, has been applied to Twitter in the context of sentiment analysis obtaining contradictory results. While some works support their removal (RED Box), others claim that stopwords in- deed carry sentiment information and removing them harms the performance of Twitter sentiment classifiers
  7. In addition, most of the works that have applied stopword removal for Twitter sentiment classification use pre-compiled stopwords lists, such as the Van stoplist However, these stoplists have been criticized for: being outdated (a phenomena that may affect specially Twitter data, where new information and terms are continuously emerging) (ii) for not accounting for the specificities of the domain under analysis since non-discriminative words in some domain or corpus may have discriminative power in different domain.
  8. Aiming to solve these limitations several approaches have emerged in the areas of document retrieval and classification that aim to dynamically build stopword lists from the corpus under analysis. These approaches measure the discriminative power of terms by using different methods including Unsupervised methods such as those based on the terms’ frequencies or Supervised methods such as term entropy measures and Maximum Likelihood Estimation
  9. In our work, we studied the effect of different stopword removal methods for polarity classification of tweets and whether removing stopwords affects the performance of Twitter sentiment classifiers.
  10. To this end, we use six Twitter datasets obtained from the literature of Twitter sentiment classification. As can be noted, these datasets have different size and different number of positive and negative tweets.
  11. The Baseline method for this analysis is the non removal of stopwords. We also assess the influence of six different stopword removal methods using six stopwords removal methods including: The Classic Method: which is based on removing stopwords obtained from pre-compiled lists. In our analysis we use the classic Van stoplist
  12. In addition to the classic Stoplist, we use three stopword generation methods inspired by Zipf’s law including: removing most frequent words (TF-High) and removing words that occur once, i.e. singleton words (TF1). We also consider re- moving words with low inverse document frequency (IDF).
  13. 4- Term-based Random Sampling : This method works by iterating over separate chunks of data ran- domly selected. It then ranks terms in each chunk based on their informativeness values using the Kullback-Leibler divergence measure 5- The Mutual Information Method: The mutual information method (MI) is a supervised method that works by computing the mutual information between a given term and a document class (e.g., positive, negative), providing an indication of how much in- formation the term can tell about a given class. Low mutual information suggests that the term has low discrimination power and hence it should be easily removed.
  14. To assess the effect of stopwords in sentiment classification we use two of the most popular supervised classifiers used in the literature of sentiment analysis, Maximum Entropy (MaxEnt) and Naive Bayes (NB) from Mallet. We report the performance of both classifiers in accuracy and aver- age F-measure using a 10-fold cross validation. Also, note that we use unigram features to train both classifiers in our experiments.
  15. We assess the impact of removing stopwords by observing fluctuations (increases and decreases) on three different as- pects of the sentiment classification task: the classification performance, measured in terms of accuracy and F-masure, the size of the classifier’s feature space and the level of data sparsity. Our baseline for comparison is not removing stopwords.
  16. The first aspect that we study is how removing stopwords affects the classification performance This figure shows the baseline classification performance in accuracy (a) and F-measure (b) for the MaxEnt and NB classifiers across all the datasets. As we can see, when no stopwords are removed, the MaxEnt classifier always outperforms the NB classifier in accuracy and F1 measure on all datasets.
  17. This figure shows the average performances in accuracy and F-measure obtained from the MaxEnt and NB classifiers by using the six stopword removal methods on all datasets - Here we notice a significant loss in accuracy and in F-measure is encountered when using the IDF stoplist, while the highest performance is always obtained when using the MI stoplist. Also, using the classic stoplist gives lower performance than the baseline with an average loss of 1.04% and 1.24% in accuracy and F-measure respectively On the contrary, removing singleton words (the TF1 stoplist) improves the accuracy by 1.15% and F-measure by 2.65% compared to the classic stoplist. We also notice that the TF1 stoplist gives slightly lower accuracy and F-measure than the MI stoplist respectively. Nonetheless, generating TF1 stoplists is much simpler than generating the MI ones in the sense that the former, as opposed to the latter, does not required any labelled data. Finally, it seems that NB is more sensitive to removing stopwords than MaxEnt. NB faces more dramatic changes in accuracy than MaxEnt across the different stoplists.
  18. The second aspect we study is the average reduction rate on the classifier’s feature space caused by each of the studied stopword removal methods - As we can see, Removing singleton words reduces the feature space substantially by 65.24%. MI comes next with a reduction rate of 19.34%. On the other hand, removing the most frequent words (TF-High) has no actual effect on the feature space. All other stoplists reduces the number of features by less than 12%. - From the figure on the right, we can observe that singleton words constitute two-thirds of the vocabulary size of all datasets. In other words, the ratio of singleton words to non singleton words is two to one for all datasets. This two-to-one ratio explains the large reduction rate in the feature space when removing singleton words.
  19. The third aspect we study is the reduction on the data sparseness caused by our 6 stopwords removal methods on all datasets. Previous work on Sentiment analysis showed that Twitter Twitter data are sparser than other types of data (e.g., movie review data) due to the large number of infrequent words present within tweets. Therefore, an important effect of a stoplist for Twitter sentiment analysis is to help in reducing the sparsity degree of the data. Our analysis showed that our Twitter datasets are very sparse indeed, where the average sparsity degree of the baseline is 0.997. Compared to the baseline, using the TF1 method lowers the sparsity degree on all datasets by 0.37% on average. On the other hand, the effect of the TBRS stoplists is barely noticeable Also All other stopword removal methods in- crease the sparsity effect with different degrees, including the classic, TF-High, IDF and MI.
  20. After our evaluation, the question remains – What is the best or the ideal stopword method for Sentiment analysis on Twitter? Broadly speaking, the ideal stopword removal method is the one which helps maintaining a high classification performance, leads to shrinking the classifier’s feature space and effectively reducing the data sparseness. Moreover, since Twitter operates in streaming fashion (i.e., millions of tweets are generated, sent and discarded instantly), the ideal stoplist method is required to have low runtime and storage complexity and to cope with the continuous shift in the sentiment class distribution in tweets. Lastly and most importantly, the human supervision factor (e.g., threshold setup, data annotation, manual validation, etc.) in the method’s workflow should be minimal.
  21. - This table shows the average performances of the evaluated stoplist methods in terms of the sentiment classification accuracy and F-measure, reduction on the feature space and the data sparseness, and the type of the human supervision required. - According to these results, the MI and the TF1 methods show very competitive performances comparing to other methods; the MI method comes first in accuracy and F1 measure while the TF1 method outperform all other methods in the amount of reduction on feature space and data sparseness. looking at the human supervision factor, the TF1 method seems a simpler and more effective choice than the MI method. Firstly, because the notion behind TF1 is rather simple - “stopwords are those which occur once in tweets”, and hence, the computational complexity of generating TF1 stoplists is generally low. Secondly, the TF1 method is fully unsupervised while the MI method needs two major human supervisions including: (i) deciding on the size of the generated stoplists, which is usually done empirically and (ii) manually annotating tweet messages with their sentiment class label in order to calculate the informativeness values of terms as described in Equation 2.