6. Sentiment Analysis Testing
Data
• 100,000 movie reviews from IMDB
• Training set of 12,500 positive reviews (7-10 stars)
• Training set of 12,500 negative reviews (<5 stars)
• 30 or fewer reviews per movie
Methods
• Bag of words (sklearn TF-IDF)
• Word2Vec (Gensim)
• Doc2Vec (Gensim)
• Pattern
• Indico (by sentence)
• Indico (by document)
• Indico (High Quality Sentiment)
7. Speed
• How long does it take to train?
◦ Preparing text
◦ Training the machine learning model
◦ Some models are pre-trained
• How quickly does it analyze?
◦ Preparing text
◦ Running text through the trained model
8. How well does it do?
• Accuracy
◦ Did we correctly label positive sentiment as positive?
◦ Did we correctly label negative sentiment as negative?
◦ Better for an even class distribution
• F1 = 2 × (Precision × Recall) / (Precision + Recall)
◦ Precision = percent of reviews we called positive that were actually positive
◦ Recall = percent of reviews that were actually positive that we called positive
◦ Better for an uneven class distribution
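As a worked example (the counts here are made up for illustration), precision, recall, and F1 can be computed directly from the prediction counts:

```python
# Made-up counts: of 100 reviews we labeled positive, 80 were actually
# positive, and there were 100 truly positive reviews in total.
true_positives = 80
false_positives = 20   # labeled positive, actually negative
false_negatives = 20   # labeled negative, actually positive

precision = true_positives / (true_positives + false_positives)  # 0.8
recall = true_positives / (true_positives + false_negatives)     # 0.8

# F1 is the harmonic mean of precision and recall, not the arithmetic mean;
# it punishes a model that is strong on one measure but weak on the other.
f1 = 2 * precision * recall / (precision + recall)               # 0.8

print(precision, recall, f1)
```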
9. Bag of Words (sklearn TF-IDF)
• Simple algorithm
• Fast to train (10 minutes)
• Fast to apply
• 85.3% accuracy
• 85.3% F1
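The deck doesn't show the code; a minimal sklearn sketch of the TF-IDF approach might look like the following (the choice of logistic regression as the classifier, and the toy training texts, are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; the real experiment used 25,000 labeled IMDB reviews.
train_texts = ["a wonderful, moving film", "great acting and a great story",
               "boring and badly written", "a terrible waste of time"]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# TF-IDF turns each review into a sparse weighted bag-of-words vector;
# a linear classifier is then trained on those vectors.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = model.predict(["what a wonderful story"])
print(predictions)
```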
10. Word2Vec (Gensim)
• More complex algorithm
• Computationally intensive
• Better results with larger training sets, multiple epochs
• Slow to train (2 hours)
• Slow to apply
• 81.9% accuracy
• 82.2% F1
11. Doc2Vec (Gensim)
• More complex algorithm
• Computationally intensive
• Better results with larger training sets, multiple epochs
• Slow to train (4 hours)
• Slow to apply
• 82.8% accuracy
• 82.8% F1
• Distributed bag of words (DBOW) variant
• (other Doc2Vec variants achieved accuracy rates between 70% and 82%)
12. Pattern (built-in)
• Simple algorithm
• Part of the Pattern module
• No training required
• Fast to apply
• 76.4% accuracy
• 76.9% F1
• (lowest scores)
13. Indico (by sentence)
• More complex algorithm
• API calls to proprietary system
• No training required
• Fast to apply
• 89.1% accuracy
• 88.9% F1
14. Indico (by document)
• Simple algorithm
• API calls to proprietary system
• No training required
• Fast to apply
• 90.1% accuracy
• 90.0% F1
15. Indico (High Quality Sentiment)
• Simple algorithm
• API calls to proprietary system
• No training required
• Slow to apply
• 93.2% accuracy
• 93.2% F1
• (highest scores)
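The three Indico variants are all hosted-API calls; a sketch using the `indicoio` client library as it existed at the time (the API key is a placeholder, the import is deferred so the sketch runs without credentials, and the 0.5 labeling threshold is an assumption):

```python
def label(score, threshold=0.5):
    """Map a 0-to-1 sentiment score to a positive/negative label."""
    return "positive" if score >= threshold else "negative"

def analyze(texts, high_quality=False):
    """Score a batch of reviews via the hosted Indico sentiment models.

    sentiment_hq is the slower, more accurate 'High Quality Sentiment' model.
    """
    import indicoio  # deferred: requires the indicoio package and an API key
    indicoio.config.api_key = "YOUR_API_KEY"  # placeholder
    scorer = indicoio.sentiment_hq if high_quality else indicoio.sentiment
    return [label(s) for s in scorer(texts)]

print(label(0.8), label(0.2))
```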
16. Comparison of Sentiment Prediction
[Bar chart: Accuracy and F1 scores, on a 0-to-1 scale, for Indico Sentiment HQ, Indico by document, Doc2Vec, Pattern, Indico by sentence, Word2Vec, and Bag of Words]
18. Customer Segmentation Testing
Data
• 129,809 movie reviews from IMDB
• 3,323 different movies
• 510 different combinations of genre
• 40 or fewer reviews per movie
Methodology
• Transform each review into an Indico document vector
• Test success of different document criteria
• Test success of different models
• Optimize the best model
19. What Is Random Chance?
• 510 different genre combinations
• Heavily weighted to negative
• F1 random chance < 20%
Genres: Animated, Action, Comedy, Drama, Family, Fantasy, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western
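A random-chance baseline like this can be estimated empirically; a sketch with sklearn's `DummyClassifier` (the toy labels below stand in for the real genre data, which the deck describes as many-class and heavily imbalanced):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Toy stand-in for the genre labels: many classes, heavily imbalanced.
labels = rng.choice(20, size=2000, p=[0.3] + [0.7 / 19] * 19)
features = np.zeros((2000, 1))  # features are irrelevant to a random baseline

# "stratified" guesses randomly in proportion to the class frequencies.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(features, labels)
preds = baseline.predict(features)

# The weighted F1 of random guessing is the bar any real model must beat.
baseline_f1 = f1_score(labels, preds, average="weighted")
print(round(baseline_f1, 3))
```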
20. More Reviews or More Words?
Number of Reviews | Word Length | F1 Score
129,809 | All reviews | .54
72,691 | 200+ words | .56
19,914 | 500+ words | .57
Conclusion: Longer reviews work better
21. Optimizing Models
Model | F1 Score
Tuned Random Forest | .57
Initial Logistic Regression, Initial Linear SVC | .62
Tuned Logistic Regression, Tuned Linear SVC | .63
Initial Gradient Boost | .63
Tuned Gradient Boost | .67
Conclusion: Choosing the right model matters more than tuning
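The comparison above can be sketched by scoring several untuned models on the same data with the same metric (synthetic data stands in for the real review vectors, and default hyperparameters stand in for the deck's "initial" models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the Indico document vectors and genre labels.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVC": LinearSVC(max_iter=5000),
    "Gradient Boost": GradientBoostingClassifier(random_state=0),
}

# Cross-validated weighted F1 for each model, before any tuning.
scores = {name: cross_val_score(m, X, y, cv=3, scoring="f1_weighted").mean()
          for name, m in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```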
22. More From Customer Segmentation
[Diagram: each review (Review 1, Review 2, Review 3) is transformed into features (Feature 1, Feature 2, Feature 3) used to predict the movie's Genre]
23. Some Customer Segments Overlap
[Venn diagram of the Animated and Family genres. Both: 1,297 reviews; Animated: 395 reviews; Family: 1,090 reviews]
25. Machine Learning From Movie Reviews
• See the complete set of word clouds at:
◦ Github/JenniferDunne
• Contact:
◦ Jennifer.dunne.co@gmail.com
◦ Linkedin/jenniferdunneco