1. Sentiment Analysis in Machine Learning
Jennifer D. Davis, Ph.D.
Association for Computing Machinery, Austin Chapter
Sub-group on Knowledge Discovery and Data Mining
June 2, 2015
3. What is sentiment analysis?
A machine learning technique that classifies
comments and phrases based on what is called a
‘corpus’: a collection of annotated texts in which
words are assigned numerical weights
Defined as:
“Sentiment analysis (opinion mining) refers to the
use of natural language processing, text analysis and
computational linguistics to identify and extract
subjective information in source materials.”
— Wikipedia
4. Sentiment Analysis: Not your Mother’s Twitter Feed!
Sentiment Analysis can be used to:
Understand the intent behind language in an
unbiased manner
Business areas that frequently use Sentiment
Analysis:
Retail
Entertainment
Healthcare
Any customer-centered organization
Respond to customer complaints with better
solutions, acting as a sort of virtual call center (e.g. Amelia)
5. Retail
Introduce new products more successfully by
understanding culture & social media
Understand and respond to customer needs using
internal data sources such as customer reviews or
feedback
Develop new products based on customer wants and
needs as expressed in reviews, online, and on social media
6. Entertainment
Create interest or excitement about movies by
understanding the market segment
Target movie advertising or recommender systems
based on social commentary and collaborative
filtering
Target advertising by gender, population, or
cultural affinity.
7. Healthcare and Medical Treatment
Healthcare:
Learn about patient wellness –
Potentially detect depression from journal entries
Assist with patient adherence to treatment
Learn about patient satisfaction and what is working
Gather outcomes measures associated with patient
satisfaction
This is a hot area of research and several academic
institutions are investing in research related to
patient outcomes and sentiment analysis.
8. What are the overall steps for sentiment analysis?
Gather unstructured data from your own sources, web sources, databases
(healthcare.gov surprisingly has some) and competitions like Kaggle.
Parse out unnecessary punctuation and “stop” words or phrases; perform
other pre-processing as needed or appropriate.
Transform the words or phrases into a numerical representation such as a
vector.
Choose an appropriate classification algorithm. For example, Random Forest
has a high accuracy rate, but isn’t always computationally efficient. We
discussed several other methods previously.
Apply your algorithm to a training set and, if enough data is available,
cross-validate. Tune the algorithm using appropriate parameters matched to
features, but avoid over-fitting.
Apply the algorithm to test data (the fun part).
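The pre-processing and vectorization steps above can be sketched in a few lines of plain Python. This is a minimal illustration only; the stop-word list and the two toy reviews are invented for the example, and in practice you would use library tokenizers and vectorizers (e.g. from NLTK or scikit-learn):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "to"}  # toy list

def preprocess(text):
    """Lower-case, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_vector(tokens, vocabulary):
    """Bag-of-words: count occurrences of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

reviews = ["The movie was great!", "The plot was a dull mess."]
tokenized = [preprocess(r) for r in reviews]
vocabulary = sorted(set(w for toks in tokenized for w in toks))
vectors = [to_vector(toks, vocabulary) for toks in tokenized]
print(vocabulary)  # ['dull', 'great', 'mess', 'movie', 'plot']
print(vectors)     # [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1]]
```

Each review becomes a fixed-length numerical vector, which is exactly the form a classification algorithm expects as input.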
9. What techniques can we use?
Many are under development by machine-learning-
focused corporations and in academic linguistics
laboratories
Often an ensemble of algorithms works best and is most
accurate
Text data is often unstructured data. You will spend a
portion of time cleaning and organizing data. Not fun,
but necessary.
Today we will very briefly give a high-level overview of
three methods: (i) Bayesian probability classification, (ii)
Word2vec and (iii) recursive neural networks
10. Bayesian Probability and classification method
Naïve Bayes classification uses probability formulas
based on the assumption that all features
are independent
For most cases this is surprisingly accurate,
typically yielding 70–80% accuracy
You can read more about this in the textbook for
this course, “Building Machine Learning Systems
with Python”
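The independence assumption can be seen in a from-scratch sketch: the class score is simply the class prior multiplied by each word's per-class probability, as if the words had nothing to do with each other. The four training sentences below are invented toy data; real work would use a library implementation such as scikit-learn's `MultinomialNB`:

```python
import math
from collections import Counter

# Toy labelled corpus (invented for illustration)
train = [
    ("great fun great acting", "pos"),
    ("wonderful story", "pos"),
    ("dull boring plot", "neg"),
    ("boring waste", "neg"),
]

# Count word frequencies per class
class_words = {"pos": Counter(), "neg": Counter()}
class_docs = Counter()
for text, label in train:
    class_docs[label] += 1
    class_words[label].update(text.split())

vocab = set(w for c in class_words.values() for w in c)

def log_prob(text, label):
    """log P(label) + sum of log P(word | label), Laplace-smoothed.
    Words are treated as independent given the class (the 'naive' part)."""
    total = sum(class_words[label].values())
    lp = math.log(class_docs[label] / len(train))
    for w in text.split():
        lp += math.log((class_words[label][w] + 1) / (total + len(vocab)))
    return lp

def classify(text):
    return max(("pos", "neg"), key=lambda lab: log_prob(text, lab))

print(classify("great story"))   # pos
print(classify("boring plot"))   # neg
```

Working in log space avoids numeric underflow, and the +1 (Laplace) smoothing keeps unseen words from zeroing out a class.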
11. Word2vec “deep” learning method
This method relies upon a continuous “bag of words” (or
skip-gram) representation learned from semi-structured data
Many tools are available in the scikit-learn and NLTK Python
libraries (we will show some in our Jupyter (IPython)
notebook)
Invented by Google engineers, who describe it as a “tool [that
provides] an efficient implementation of the continuous bag-of-
words and skip-gram architectures for computing vector
representations of words”
In other words (pun intended), words are assigned a vector of
numbers representing their meaning and importance
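What you do with such vectors is compare them: words with similar meanings end up with similar vectors, which cosine similarity makes measurable. The three-dimensional vectors below are hand-made toys purely for illustration; real word2vec vectors are learned from a large corpus (e.g. with the gensim library) and have hundreds of dimensions:

```python
import math

# Hand-made toy vectors (invented for illustration only)
vectors = {
    "good":     [0.9, 0.1, 0.0],
    "great":    [0.8, 0.2, 0.1],
    "terrible": [-0.7, 0.1, 0.2],
}

def cosine(u, v):
    """Cosine similarity: the standard way to compare word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(vectors["good"], vectors["great"]))     # close to 1: similar
print(cosine(vectors["good"], vectors["terrible"]))  # negative: dissimilar
```

For sentiment analysis, per-word vectors are typically averaged (or otherwise combined) into a document vector that is then fed to a classifier.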
12. Recursive neural network method
The best (and most convenient to use) library is Stanford
University’s Natural Language Processing library.
The method uses a recursive algorithm that distinguishes
between phrases based upon the order of words & phrases
For example “this movie has humor that could not be denied”
would be graded as positive whereas “this movie did not have
any humor whatsoever” would be graded as negative based
on order and choice of words & phrases.
The Stanford NLP Group can be found at nlp.stanford.edu; their live
demonstration is available at nlp.stanford.edu/sentiment
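The order-sensitivity can be illustrated with a toy recursion. This is emphatically not Stanford's actual model (which learns its composition function as a neural tensor network over parse trees); it only shows the core idea that sentiment is composed bottom-up over nested phrases, with negation flipping the sign of the phrase it modifies. The lexicon scores and the tree structures are invented for the example:

```python
# Word-level sentiment scores (invented toy lexicon)
LEXICON = {"humor": 1, "denied": -1}

def score(node):
    """Recursively score a phrase tree.
    A node is either a word or a (left, right) pair of sub-phrases.
    'not' flips the sign of its sibling phrase -- so order matters."""
    if isinstance(node, str):
        return LEXICON.get(node, 0)
    left, right = node
    if left == "not":
        return -score(right)
    return score(left) + score(right)

# "humor that could not be denied": 'not' flips the negative 'denied'
positive = ("humor", ("not", "denied"))
# "did not have any humor": 'not' flips the positive 'humor'
negative = ("did", ("not", (("have", "any"), "humor")))

print(score(positive))  # 2: graded positive
print(score(negative))  # -1: graded negative
```

The same words produce opposite overall grades depending on where the negation sits in the phrase structure, which is exactly what a flat bag-of-words model cannot capture.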
13. So which do I choose?
It depends upon the complexity of data you are
analyzing
It depends upon the accuracy you desire versus
scalability (always a balancing act)
It depends on your time frame and how you will
integrate the knowledge derived from using
sentiment analysis
Out of the box solutions can work, but sometimes
you will need to build your own
14. So now we can give it a try!
A Jupyter Notebook has been created and can be accessed via
my Github account at:
https://github.com/jddavis-100/Statistics-and-Machine-Learning/
Data is available at:
Kaggle.com by joining the Kaggle Competition
The test set was designed by me, and I can provide it to you or
Omar.
Gather your own data from a number of APIs or web-
crawlers, such as:
Rotten Tomatoes API
Twitter API
Web-scraping tools such as Scrapy (Python tool available at
scrapy.org)