Fypca4

User’s
Opinions
in Hotel
TEY JUN HONG
U095074X
National University Of Singapore

Content
1. Background
2. Formulating the
problem
3. Data Mining Process
4. Techniques
5. Analysis


What is Data
Mining?
• Extraction of meaningful /
useful / Interesting patterns
from a large volume of data
sources
• In this project, the source will
be large volume of WEB HOTEL
REVIEWS data
• Data mining is one of the top
ten emerging technologies
MIT’s TECHNOLOGY REVIEW 2004

What is Data
Mining?
• Process of exploration and
analysis
• By automatic / semi-automatic
means
• With little or no human
interaction
• To discover meaningful
patterns and rules
MASTERING DATA MINING BY BERRY AND LINOFF, 2000

User’s Opinions in Hotel
• Increase in social
media and web users
• Increase in valuable
opinion-oriented data
on hotels due to web
expansion
• Identify potential hotels
to stay at by looking at
their aspects
• Overall Sentiments on
hotel are greatly
sought on the web for

What can Data Mining do?
• Identify best prospects
(ASPECTS), and retain
customers
• Predict what ASPECTS
customers like and
promote accordingly
• Learn parameters
influencing trends in
sales and margins
• Identification of
opinions for customers

What are the problems?
• Exponential growth of
users’ opinions
• Limitations of human
analysis
• Accuracy of human
analysis

Machines can be trained
to take over human
analysis with advanced
computer technology
and it is done with LOW

Some Limitations of machines
• Unable to read like a
human
• No emotions
• Cannot detect
sarcasm
• Expression of
sentiments differs across
topics and domains
• Polarity analysis
• Facts Vs Opinion

Some machine limitation examples
• “The service is as
good as none”.
Negation not obvious
to machine

• “Swimming pool is big
enough to swim with
comfort” , “There is a
big crowd at the
counter complaining”.
Polarity might change
with context.

Machine
Learning
• A tool for data mining and
intelligent decision support
• Application of computer
algorithms that improve
automatically through
experience

MASTERING DATA MINING BY BERRY AND LINOFF, 2000

Types of Machine learning
• Supervised Learning
• A training set is
provided (data with
correct answers)
which is used to mine
for known patterns
• Unsupervised Learning
• Data are provided
with no prior
knowledge of the
hidden patterns that
they contain.

Supervised Learning techniques
• Rule Mining and Rule
learning
• Bayesian Networks
• Support Vector
Machine

Project
Objective
• Prediction of sentence polarity
• Classification of polarity for
sentiment lexicon
• Detection of relations

Pre-requisite
• Large data set
• Relevant Prior
Knowledge to domain,
in our case the hotel
domain
• E.g. ratings
• Sentiment lexicon for
sentiment analysis
• Data selection for
reliability and
standards

Cleaning the “Dirty” Data (60% of effort)
• Frequent problem: Data
inconsistencies
• Duplicate data
• Spelling Errors != Trim from
data
• Foreign accent and characters
• Singular / Plural conversion
• Punctuation removal /
replacement
• Noise and incomplete data
• Naming convention misused,

Data Preprocessing (Laundering)
• Part of Speech Tagging (POS)
using Brill Tagger

• Polarity tagging using

Findings
• Part of Speech Tagging (POS)
using Brill Tagger - NO
PROBLEM
- 95% accuracy for POS tagging of
words after data cleaning

Findings
• Polarity tagging using
sentiment lexicon – BIG
PROBLEM
- 40% of sentiment words are not
found in the sentiment lexicon
- 10% of sentiment words with a
positive or negative polarity
are found in the neutral section
of the sentiment lexicon

Problems
• Sentiment lexicon is not
comprehensive enough for the
machine learning technique
adopted
• Sentiment words whose polarity
is domain dependent are
found in the neutral section of
the sentiment lexicon
• Polarity of sentiment words
can also change within the
domain even though they are
domain dependent

Solution
• Classify the polarity of
unlabeled sentiment words
using rule based mining
• Classify domain dependent
sentiment words
• Establish word relations
between labeled and unlabeled
sentiment words

Data Processing
• Rule based mining using
conjunction and punctuation

Polarity Assignment Rules

Polarity   Pattern
Same       Adj – AND/OR – Adj
Opposite   Neg-Adj – AND/OR – Adj  /  Adj – AND/OR – Neg-Adj
Same       Neg-Adj – AND/OR – Neg-Adj
Opposite   Adj – BUT/NOR – Adj
Same       Neg-Adj – BUT/NOR – Adj  /  Adj – BUT/NOR – Neg-Adj
Opposite   Neg-Adj – BUT/NOR – Neg-Adj
Same       Adj , Adj
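
The polarity assignment rules above reduce to counting polarity flips: BUT/NOR contributes one flip, and each negated adjective contributes one more. An even number of flips gives the same polarity and an odd number gives the opposite polarity. The sketch below illustrates this under stated assumptions: the function name `infer_polarity` and the +1 / -1 encoding are hypothetical, and negation around a comma is handled by the same parity logic even though the slide only lists the plain "Adj , Adj" case.

```python
SAME_CONNECTORS = {"and", "or", ","}
OPPOSITE_CONNECTORS = {"but", "nor"}


def infer_polarity(known_polarity, connector,
                   known_negated=False, unknown_negated=False):
    """Infer the polarity (+1 or -1) of an unlabeled adjective.

    known_polarity: +1 or -1, polarity of the labeled adjective in the pair.
    connector: the conjunction or comma joining the two adjectives.
    known_negated / unknown_negated: whether each adjective is negated.
    Returns None when the connector is not covered by the rules.
    """
    connector = connector.lower()
    if connector in SAME_CONNECTORS:
        flips = 0
    elif connector in OPPOSITE_CONNECTORS:
        flips = 1
    else:
        return None
    # Each negation reverses the relation once more.
    flips += int(known_negated) + int(unknown_negated)
    return known_polarity if flips % 2 == 0 else -known_polarity
```

For a phrase such as "clean but noisy", with "clean" labeled positive, `infer_polarity(+1, "but")` assigns "noisy" a negative polarity.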

Data Processing
• Relation Network – Aspect –
Sentiment word pair

Analysis
• Using the expanded sentiment
lexicon, we analyze sentiment
polarity by doing a sentiment
lookup using a Bayesian Network

Bayesian
• To determine polarity of
sentiments

P(X | Y) = P(X) P(Y | X) / P(Y)

• Probability that a sentiment is
positive or negative, given its
contents
• Assumption: There is no link
between words
• P(sentiment | sentence) =
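
Under the "no link between words" assumption, the simplest Bayesian model is a naive Bayes classifier, which scores each polarity as log P(polarity) plus the sum of log P(word | polarity) over the words of the sentence. The sketch below is an illustration of that technique, not the project's implementation: the function names, the add-one smoothing, and the toy training data format are all assumptions.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_sentences):
    """Count classes and per-class words from (list_of_words, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for words, label in labeled_sentences:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(words, model):
    """Return the label with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_sentences = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_sentences in class_counts.items():
        # Prior: log P(label).
        score = math.log(n_sentences / total_sentences)
        total_words = sum(word_counts[label].values())
        for word in words:
            # Likelihood: log P(word | label), with add-one smoothing so
            # unseen words do not drive the probability to zero.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The denominator P(sentence) is the same for every polarity, so it can be dropped when only the most probable polarity is needed.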

Validation
• Precision = N (agree & found) /
N (found)
• High precision means most of
the sentiment words found by
the system are correct
• Recall = N (agree & found) / N
(agree)
• High recall means most of

Validation Results
• Out of the 350
aspect-unlabelled sentiment
word pairs,
• Only 194 are found by the
methods. Thus, the precision is
about 57%.
• The recall is also not very high;
only 126 words are correctly
labelled by the system, which is
about 63%.

Discussion
• The results will improve if more
rules are applied, such as the
inclusion of more adverbs such
as “excessively” as negation
words.
• The dataset might not be large
enough for the system to work
on. There are only 350 aspect-
unlabelled sentiment word
pairs for the application to
work with.
• This, however requires more

Conclusion
• A comprehensive sentiment
lexicon is a simple yet
effective solution to sentiment
analysis as it does not require
prior training
• Current sentiment lexicon does
not capture such domain and
context sensitivities of
sentiment expressions

Conclusion
• This leads to poor coverage
• Thus, expanding the general
sentiment lexicon to capture
domain and context
sensitivities of sentiment
expressions is advocated
