This document discusses sentiment analysis of hotel reviews using data mining techniques. It describes building a sentiment lexicon by expanding an existing lexicon using rule-based mining and establishing word relations. A Bayesian network is used to analyze sentiment polarity by looking up sentiments in the expanded lexicon. The system achieved 57% precision and 63% recall in validating sentiment labels. Expanding the lexicon to capture domain and context sensitivities was advocated to improve coverage.
2. Content
1. Background
2. Formulating the problem
3. Data Mining Process
4. Techniques
5. Analysis
3. What is Data Mining?
• Extraction of meaningful / useful / interesting patterns from a large volume of data sources
• In this project, the source will be a large volume of web hotel review data
• Data mining is one of the top ten emerging technologies (MIT’s Technology Review, 2004)
4. What is Data Mining?
• Process of exploration and analysis
• By automatic / semi-automatic means
• With little or no human interaction
• To discover meaningful patterns and rules
MASTERING DATA MINING BY BERRY AND LINOFF, 2000
5. User’s Opinions in Hotel
• Increase in social media and web users
• Increase in valuable opinion-oriented data on hotels due to web expansion
• Identify potential hotels to stay at by looking at the aspects
• Overall Sentiments on
hotel are greatly
sought on the web for
6. What can Data Mining do?
• Identify best prospects (ASPECTS), and retain customers
• Predict what ASPECTS customers like and promote accordingly
• Learn parameters influencing trends in sales and margins
• Identification of opinions for customers
7. What are the problems?
• Exponential growth of users’ opinions
• Limitations of human analysis
• Accuracy of human analysis
Machines can be trained
to take over human
analysis with advanced
computer technology
and it is done with LOW
8. Some Limitations of machines
• Unable to read like a human
• No emotions
• Cannot detect sarcasm
• Expression of sentiments in different topics and domains
• Polarity analysis
• Facts vs. opinion
9. Some machine limitation examples
• “The service is as good as none”. Negation not obvious to machine
• “Swimming pool is big enough to swim with comfort”, “There is a big crowd at the counter complaining”. Polarity might change with context.
11. Machine Learning
• A tool for data mining and
intelligent decision support
• Application of computer
algorithms that improve
automatically through
experience
MASTERING DATA MINING BY BERRY AND LINOFF, 2000
12. Types of Machine learning
• Supervised Learning
• A training set is provided (data with correct answers) which is used to mine for known patterns
• Unsupervised Learning
• Data are provided with no prior knowledge of the hidden patterns that they contain.
13. Supervised Learning techniques
• Rule Mining and Rule learning
• Bayesian Networks
• Support Vector Machine
14. Project Objective
• Prediction of sentence polarity
• Classification of polarity for
sentiment lexicon
• Detection of relations
15. Pre-requisite
• Large data set
• Relevant Prior
Knowledge to domain,
in our case the hotel
domain
• E.g. rating
• Sentiment lexicon for
sentiment analysis
• Data selection for
reliability and
standards
17. Cleaning the “Dirty” Data (60% of effort)
• Frequent problem: Data inconsistencies
• Duplicate data
• Spelling errors != Trim from data
• Foreign accents and characters
• Singular / plural conversion
• Punctuation removal / replacement
• Noise and incomplete data
• Naming convention misused,
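A few of the cleaning steps listed above can be sketched in Python. This is a minimal illustration, not the project's actual pipeline: the function names `clean_review`, `singular`, and `dedupe`, and the naive plural-stripping rule, are assumptions made for the example.

```python
import re
import string
import unicodedata


def clean_review(text: str) -> str:
    """Normalise one review: strip accents, punctuation, and extra spaces."""
    # Decompose accented characters and drop the combining marks (e.g. é -> e).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace punctuation with spaces so that "good,clean" stays two words.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Lower-case and collapse runs of whitespace.
    return re.sub(r"\s+", " ", text.lower()).strip()


def singular(word: str) -> str:
    """Very naive plural-to-singular conversion (assumes regular English plurals)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def dedupe(reviews):
    """Drop reviews whose cleaned text was already seen, keeping first-seen order."""
    seen = set()
    unique = []
    for review in reviews:
        key = clean_review(review)
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Spelling correction and inconsistency checks (for example a positive title on a negative review) are left out because they need resources the slides do not describe.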
18. Data Preprocessing (Laundering)
• Part of Speech Tagging (POS) using Brill Tagger
• Polarity tagging using
19. Findings
• Part of Speech Tagging (POS) using Brill Tagger - NO PROBLEM
- 95% accuracy in POS tagging of words after data cleaning
20. Findings
• Polarity tagging using sentiment lexicon – BIG PROBLEM
- 40% of sentiment words not found in sentiment lexicon
- 10% of sentiment words with a positive or negative polarity are found in the neutral section of sentiment lexicon
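The lexicon lookup behind these findings can be sketched as follows. The tiny lexicon, the label names, and the functions `tag_polarity` and `coverage_report` are hypothetical stand-ins; the project's actual lexicon is not reproduced here.

```python
from collections import Counter

# Hypothetical miniature lexicon: word -> "positive" | "negative" | "neutral".
LEXICON = {
    "clean": "positive",
    "friendly": "positive",
    "dirty": "negative",
    "rude": "negative",
    "big": "neutral",  # domain-dependent words often sit in the neutral section
}


def tag_polarity(words, lexicon=LEXICON):
    """Return (word, label) pairs; words missing from the lexicon get 'unknown'."""
    return [(w, lexicon.get(w, "unknown")) for w in words]


def coverage_report(words, lexicon=LEXICON):
    """Share of each label among the candidate sentiment words."""
    counts = Counter(label for _, label in tag_polarity(words, lexicon))
    total = len(words)
    return {label: n / total for label, n in counts.items()} if total else {}
```

The `unknown` and `neutral` shares in the report correspond to the two problem categories named on this slide.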
21. Problems
• Sentiment lexicon not comprehensive enough for the machine learning technique adopted
• Domain-dependent sentiment words are found in the neutral section of the sentiment lexicon
• The polarity of domain-dependent sentiment words can also change within a single domain
22. Solution
• Classify the polarity of
unlabeled sentiment word
using rule based mining
• Classify domain dependent
sentiment words
• Establish word relations
between labeled and unlabeled
sentiment words
23. Data Processing
• Rule based mining using conjunction and punctuation
Polarity Assignment Rules
Polarity   Rule
Same       Adj – AND/OR – Adj
Opposite   Neg-Adj – AND/OR – Adj / Adj – AND/OR – Neg-Adj
Same       Neg-Adj – AND/OR – Neg-Adj
Opposite   Adj – BUT/NOR – Adj
Same       Neg-Adj – BUT/NOR – Adj / Adj – BUT/NOR – Neg-Adj
Opposite   Neg-Adj – BUT/NOR – Neg-Adj
Same       Adj , Adj
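The rules above follow a simple pattern: AND, OR, and a comma link adjectives of the same polarity, BUT and NOR link adjectives of opposite polarity, and each negation flips the relation. The sketch below encodes that reading; the function names `relation` and `infer_polarity` are assumptions for illustration, not the project's code.

```python
SAME_CONNECTIVES = {"and", "or", ","}
OPPOSITE_CONNECTIVES = {"but", "nor"}


def relation(connective: str, left_negated: bool = False,
             right_negated: bool = False) -> str:
    """Return 'same' or 'opposite' for two adjectives joined by a connective."""
    connective = connective.lower()
    if connective in SAME_CONNECTIVES:
        same = True
    elif connective in OPPOSITE_CONNECTIVES:
        same = False
    else:
        raise ValueError(f"unsupported connective: {connective!r}")
    # Each negation flips the relation, so two negations cancel out.
    if left_negated != right_negated:
        same = not same
    return "same" if same else "opposite"


def infer_polarity(known_polarity: str, connective: str,
                   left_negated: bool = False,
                   right_negated: bool = False) -> str:
    """Polarity of the unlabelled adjective, given the labelled one it is joined to."""
    flip = {"positive": "negative", "negative": "positive"}
    if relation(connective, left_negated, right_negated) == "same":
        return known_polarity
    return flip[known_polarity]
```

For example, if "clean" is known to be positive, then in "clean but cramped" the call `infer_polarity("positive", "but")` assigns "cramped" a negative polarity.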
26. Analysis
• Using the expanded sentiment lexicon, we analyze sentiment polarity by doing a sentiment lookup using a Bayesian Network
27. Bayesian
• To determine polarity of sentiments
P(X | Y) = P(X) P(Y | X) / P(Y)
• Probability that a sentiment is positive or negative, given its contents
• Assumption: There is no link between words
• P(sentiment | sentence) =
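The "no link between words" assumption is the one a naive Bayes classifier makes, so a small naive Bayes sketch can make Bayes' rule concrete. This is a simplification and not necessarily the Bayesian network the project used; the training data format, the function names, and the add-one smoothing are assumptions.

```python
import math
from collections import Counter, defaultdict


def train(labelled_sentences):
    """labelled_sentences: iterable of (list_of_words, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labelled_sentences:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(words, model):
    """Return the label maximising P(label) * product of P(word | label)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        # Work in log space to avoid underflow. P(sentence) is identical for
        # every label, so it can be dropped from the comparison.
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because words are treated as independent given the label, the sentence likelihood factorises into per-word terms, which is what makes the lookup cheap.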
28. Validation
• Precision = N (agree & found) / N (found)
• High precision means most of the sentiment words found by the system are correct
• Recall = N (agree & found) / N (agree)
• High recall means most of
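The two validation measures can be computed directly from sets. In this sketch `found` holds the items the system labelled and `agree` holds the items that match the human labels; both names mirror the formulas on this slide, and the function name is an assumption.

```python
def precision_recall(found: set, agree: set):
    """Return (precision, recall) using the slide's definitions."""
    hits = len(found & agree)  # N(agree & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(agree) if agree else 0.0
    return precision, recall
```

With the hypothetical sets `found = {"a", "b", "c", "d"}` and `agree = {"a", "b", "e"}`, two items are in both, so precision is 2/4 and recall is 2/3.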
29. Validation Results
• Out of the 350 aspect-unlabelled sentiment word pairs, only 194 are found by the methods. Thus, the precision is about 57%.
• The recall is also not very high; only 126 words are correctly labelled by the system, which is about 63%.
30. Discussion
• The results will improve if more rules are applied, such as the inclusion of more adverbs such as “excessively” as negation words.
• There might not be enough data for the system to work on. There are only 350 aspect-unlabelled sentiment word pairs for the application to work with.
• This, however requires more
31. Conclusion
• Comprehensive Sentiment Lexicon is a simple yet effective solution to sentiment analysis as it does not require prior training
• Current sentiment lexicon does not capture the domain and context sensitivities of sentiment expressions
32. Conclusion
• This leads to poor coverage
• Thus, expanding the general sentiment lexicon to capture domain and context sensitivities of sentiment expressions is advocated
What can data mining do in a hotel domain, in other words, learn the market
It is impossible for humans to read every single opinion, and humans are biased towards reading certain opinions. Machines allow fast access to vast amounts of data and allow computationally intensive algorithms and statistical methods.
There are many fields of data mining; in this project we will focus on these 4
Growing data volume , limitation of humans and low cost to human
The goal of unsupervised learning is to discover these patterns. Semi-supervised learning: knowledge is known and applied from one data collection in order to mine, classify, analyze, and interpret a related data collection.
Some of the problems to be solved by data mining: prediction of sentence polarity, classification of polarity for the sentiment lexicon, and detection of relations.
Data inconsistencies: for example, the title says good but the review says bad
Assigning a label to every word in the text to allow machine to do something with it
POS tagging goes wrong for some words, such as “heart”, that have double tagging
For example, in the domain of handheld devices, the word “large” can express positivity for screen size but negativity for phone size.
After establishing relations, we have a graph of nodes (sentiments / aspects). Determine the probability that a node is positive or negative given its surrounding nodes. Start with a high-frequency unlabelled sentiment word-aspect pair; then, based on the aspect and its labelled sentiment pair, determine the polarity for the unlabelled word. This process iterates until all unlabelled words have been assigned a polarity.
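The iteration described in this note can be sketched as a simple propagation over the relation graph. Here a majority vote among already-labelled neighbours stands in for the probabilistic step, which the slides do not spell out; the function name `propagate` and the input shapes are assumptions.

```python
def propagate(labels, edges, frequency):
    """
    labels:    dict word -> "positive" / "negative" for already-labelled words
    edges:     list of (word_a, word_b, "same" | "opposite") relations
    frequency: dict word -> count, used to visit frequent words first
    Returns a new dict with as many unlabelled words resolved as possible.
    """
    flip = {"positive": "negative", "negative": "positive"}
    labels = dict(labels)  # work on a copy
    words = {w for a, b, _ in edges for w in (a, b)}
    changed = True
    while changed:
        changed = False
        pending = sorted(words - set(labels), key=lambda w: -frequency.get(w, 0))
        for word in pending:
            votes = {"positive": 0, "negative": 0}
            for a, b, rel in edges:
                if word not in (a, b):
                    continue
                other = b if word == a else a
                if other in labels:
                    vote = labels[other] if rel == "same" else flip[labels[other]]
                    votes[vote] += 1
            # Only commit when the labelled neighbours give a clear majority.
            if votes["positive"] != votes["negative"]:
                labels[word] = max(votes, key=votes.get)
                changed = True
    return labels
```

Words with no labelled neighbours, or with tied votes, stay unlabelled, so the loop stops once a full pass assigns nothing new.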
A comprehensive sentiment lexicon can provide a simple yet effective solution to sentiment analysis, because it is general and does not require prior training. Therefore, attention and effort have been paid to the construction of such lexicons. However, a significant challenge to this approach is that the polarity of many words is domain and context dependent. For example, ‘long’ is positive in ‘long battery life’ and negative in ‘long shutter lag.’ Current sentiment lexicons do not capture such domain and context sensitivities of sentiment expressions. They either exclude such domain and context dependent sentiment expressions or tag them with an overall polarity tendency based on statistics gathered from a certain corpus such as the world wide web accessed via the internet. While excluding such expressions leads to poor coverage, simply tagging them with a polarity tendency leads to poor precision.