6. Sentiment Analysis Testing
Data
• 100,000 movie reviews from IMDB
• Training set of 12,500 positive reviews (7-10 stars)
• Training set of 12,500 negative reviews (<5 stars)
• 30 or fewer reviews per movie
Methods
• Bag of words (sklearn TF-IDF)
• Word2Vec (Gensim)
• Doc2Vec (Gensim)
• Pattern
• Indico (by sentence)
• Indico (by document)
• Indico (High Quality Sentiment)
7. Speed
• How long does it take to train?
◦ Preparing text
◦ Training the machine learning model
◦ Some models are pre-trained
• How quickly does it analyze?
◦ Preparing text
◦ Running text through the trained model
8. How well does it do?
• Accuracy
◦ Did we correctly label positive sentiment as positive?
◦ Did we correctly label negative sentiment as negative?
◦ Better for an even class distribution
• F1 = 2 × (Precision × Recall) / (Precision + Recall)
◦ Precision = percent of reviews we called positive that were actually positive
◦ Recall = percent of reviews that were actually positive that we called positive
◦ Better for an uneven class distribution
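As a worked example (the counts here are made up for illustration), precision, recall, and F1 can be computed directly from the prediction counts:

```python
# Made-up counts: of 100 reviews we labeled positive, 80 were actually
# positive, and there were 100 truly positive reviews in total.
true_positives = 80
false_positives = 20   # labeled positive, actually negative
false_negatives = 20   # labeled negative, actually positive

precision = true_positives / (true_positives + false_positives)  # 0.8
recall = true_positives / (true_positives + false_negatives)     # 0.8

# F1 is the harmonic mean of precision and recall, not the arithmetic mean;
# it punishes a model that is strong on one measure but weak on the other.
f1 = 2 * precision * recall / (precision + recall)               # 0.8

print(precision, recall, f1)
```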
9. Bag of Words (sklearn TF-IDF)
• Simple algorithm
• Fast to train (10 minutes)
• Fast to apply
• 85.3% accuracy
• 85.3% F1
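The deck doesn't show the code; a minimal sklearn sketch of the TF-IDF approach might look like the following (the choice of logistic regression as the classifier, and the toy training texts, are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; the real experiment used 25,000 labeled IMDB reviews.
train_texts = ["a wonderful, moving film", "great acting and a great story",
               "boring and badly written", "a terrible waste of time"]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# TF-IDF turns each review into a sparse weighted bag-of-words vector;
# a linear classifier is then trained on those vectors.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = model.predict(["what a wonderful story"])
print(predictions)
```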
10. Word2Vec (Gensim)
• More complex algorithm
• Computationally intensive
• Better results with larger training sets, multiple epochs
• Slow to train (2 hours)
• Slow to apply
• 81.9% accuracy
• 82.2% F1
11. Doc2Vec (Gensim)
• More complex algorithm
• Computationally intensive
• Better results with larger training sets, multiple epochs
• Slow to train (4 hours)
• Slow to apply
• 82.8% accuracy
• 82.8% F1
• Distributed bag of words (DBOW) variant
• (other Doc2Vec variants achieved accuracy rates between 70% and 82%)
12. Pattern (built-in)
• Simple algorithm
• Part of the Pattern module
• No training required
• Fast to apply
• 76.4% accuracy
• 76.9% F1
• (lowest scores)
13. Indico (by sentence)
• More complex algorithm
• API calls to proprietary system
• No training required
• Fast to apply
• 89.1% accuracy
• 88.9% F1
14. Indico (by document)
• Simple algorithm
• API calls to proprietary system
• No training required
• Fast to apply
• 90.1% accuracy
• 90.0% F1
15. Indico (High Quality Sentiment)
• Simple algorithm
• API calls to proprietary system
• No training required
• Slow to apply
• 93.2% accuracy
• 93.2% F1
• (highest scores)
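The three Indico variants are all hosted-API calls; a sketch using the `indicoio` client library as it existed at the time (the API key is a placeholder, the import is deferred so the sketch runs without credentials, and the 0.5 labeling threshold is an assumption):

```python
def label(score, threshold=0.5):
    """Map a 0-to-1 sentiment score to a positive/negative label."""
    return "positive" if score >= threshold else "negative"

def analyze(texts, high_quality=False):
    """Score a batch of reviews via the hosted Indico sentiment models.

    sentiment_hq is the slower, more accurate 'High Quality Sentiment' model.
    """
    import indicoio  # deferred: requires the indicoio package and an API key
    indicoio.config.api_key = "YOUR_API_KEY"  # placeholder
    scorer = indicoio.sentiment_hq if high_quality else indicoio.sentiment
    return [label(s) for s in scorer(texts)]

print(label(0.8), label(0.2))
```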
16. Comparison of Sentiment Prediction
[Bar chart: Accuracy and F1 scores, on a 0-to-1 scale, for Indico Sentiment HQ, Indico by document, Doc2Vec, Pattern, Indico by sentence, Word2Vec, and Bag of Words]
18. Customer Segmentation Testing
Data
• 129,809 movie reviews from IMDB
• 3,323 different movies
• 510 different combinations of genre
• 40 or fewer reviews per movie
Methodology
• Transform each review into an Indico document vector
• Test success of different document criteria
• Test success of different models
• Optimize the best model
19. What Is Random Chance?
• 510 different genre combinations
• Heavily weighted to negative
• F1 random chance < 20%
Genres: Animated, Action, Comedy, Drama, Family, Fantasy, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western
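A random-chance baseline like this can be estimated empirically; a sketch with sklearn's `DummyClassifier` (the toy labels below stand in for the real genre data, which the deck describes as many-class and heavily imbalanced):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Toy stand-in for the genre labels: many classes, heavily imbalanced.
labels = rng.choice(20, size=2000, p=[0.3] + [0.7 / 19] * 19)
features = np.zeros((2000, 1))  # features are irrelevant to a random baseline

# "stratified" guesses randomly in proportion to the class frequencies.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(features, labels)
preds = baseline.predict(features)

# The weighted F1 of random guessing is the bar any real model must beat.
baseline_f1 = f1_score(labels, preds, average="weighted")
print(round(baseline_f1, 3))
```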
20. More Reviews or More Words?
Number of Reviews | Word Length | F1 Score
129,809 | All reviews | .54
72,691 | 200+ words | .56
19,914 | 500+ words | .57
Conclusion: Longer reviews work better
21. Optimizing Models
Model | F1 Score
Tuned Random Forest | .57
Initial Logistic Regression, Initial Linear SVC | .62
Tuned Logistic Regression, Tuned Linear SVC | .63
Initial Gradient Boost | .63
Tuned Gradient Boost | .67
Conclusion: Choosing the right model matters more than tuning
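The comparison above can be sketched by scoring several untuned models on the same data with the same metric (synthetic data stands in for the real review vectors, and default hyperparameters stand in for the deck's "initial" models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the Indico document vectors and genre labels.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVC": LinearSVC(max_iter=5000),
    "Gradient Boost": GradientBoostingClassifier(random_state=0),
}

# Cross-validated weighted F1 for each model, before any tuning.
scores = {name: cross_val_score(m, X, y, cv=3, scoring="f1_weighted").mean()
          for name, m in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```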
22. More From Customer Segmentation
[Diagram: each review (Review 1, Review 2, Review 3) is transformed into features (Feature 1, Feature 2, Feature 3) used to predict the movie's Genre]
23. Some Customer Segments Overlap
[Venn diagram of the Animated and Family genres. Both: 1,297 reviews; Animated: 395 reviews; Family: 1,090 reviews]
25. Machine Learning From Movie Reviews
• See the complete set of word clouds at:
◦ Github/JenniferDunne
• Contact:
◦ Jennifer.dunne.co@gmail.com
◦ Linkedin/jenniferdunneco