+
    TEXT MINING-BASED FORMATION OF
    DICTIONARIES EXPRESSING OPINIONS
    IN NATURAL LANGUAGES

    František Dařena, Jan Žižka
    Department of Informatics, Faculty of Business and Economics
    Mendel University in Brno, Czech Republic
+
    Introduction

     • Many companies collect opinions expressed by their customers.
     • These opinions can hide valuable knowledge.
     • Discovering this knowledge manually can sometimes be a very
       demanding task because
        • the opinion database can be very large,
        • the customers can use different languages,
        • people may judge the opinions subjectively,
        • sometimes additional resources (like lists of positive
          and negative words) are needed.
+
    Objective

    To automatically extract words significant for positive and negative
    customers' opinions and to form dictionaries of positive and negative
    words, including the strength of their positivity and negativity.
+
    Data description

     • The processed data included reviews by hotel clients, collected
       from publicly available sources.
     • The reviews were labeled as positive or negative.
     • Review characteristics:
        • more than 5,000,000 reviews,
        • written in more than 25 natural languages,
        • written only by real customers, based on real experience,
        • written relatively carefully, but still containing errors that
          are typical for natural languages.
+
    Review examples

     • Positive
        • The breakfast and the very clean rooms stood out as the best
          features of this hotel.
        • Clean and moden, the great loation near station. Friendly
          reception!
        • The rooms are new. The breakfast is also great. We had a really
          nice stay.
        • Good location - very quiet and good breakfast.

     • Negative
        • High price charged for internet access which actual cost now
          is extreamly low.
        • water in the shower did not flow away
        • The room was noisy and the room temperature was higher
          than normal.
        • The air conditioning wasn't working
+
    Data preparation

     • Data collection, cleaning (removing tags and non-letter
       characters), and converting to upper-case.
     • Transforming into the Bag-of-Words representation, with term
       frequencies (TF) used as attribute values.
     • Removing words with a global frequency below 2 (the MinTF = 2
       threshold, i.e., words occurring only once in the whole corpus).
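
    The preparation steps above can be sketched in a few lines of Python.
    This is an illustrative reconstruction, not the authors' code: the
    function names are ours, and the letter filter assumes ASCII for
    brevity (the real corpus spans more than 25 languages).

```python
import re
from collections import Counter

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)       # remove tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remove non-letter characters (ASCII only here)
    return text.upper()                        # convert to upper-case

def bag_of_words(reviews, min_tf=2):
    """Term-frequency vectors over a MinTF-filtered vocabulary."""
    docs = [Counter(clean(r).split()) for r in reviews]
    global_tf = Counter()
    for d in docs:
        global_tf.update(d)
    vocab = {w for w, f in global_tf.items() if f >= min_tf}   # drop words below MinTF
    return [{w: f for w, f in d.items() if w in vocab} for d in docs]

docs = bag_of_words(["The rooms are new.", "The breakfast is great.", "The rooms are clean."])
print(docs)   # e.g. [{'THE': 1, 'ROOMS': 1, 'ARE': 1}, {'THE': 1}, ...]
```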
+
    Data characteristics

    [Chart: Number of unique words for different languages (MinTF = 1)]
+
    Data characteristics

    [Chart: Number of unique words for different languages – in total, in
    the negative class, in the positive class, and in both classes
    (MinTF = 2)]
+
    Finding the significant words

     • Significant words were discovered as the relevant attributes used
       by a classification algorithm – a decision tree generated by the C5
       algorithm (by R. Quinlan), which is based on entropy minimization.
     • The goal was not to achieve the best classification accuracy (it
       was around 90%) but to find the relevant attributes that contribute
       to assigning a text to a given class.
     • The significant words appeared in the nodes of the decision tree.
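
    As an illustration of this step: C5 is distributed as a standalone
    tool, so the sketch below uses scikit-learn's CART decision tree with
    the entropy criterion as a rough stand-in, trained on toy data, and
    reads the significant words out of the tree's internal nodes. All
    names and data here are ours, not the authors'.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy labeled reviews standing in for the multi-million-review corpus.
reviews = ["FRIENDLY STAFF AND VERY CLEAN ROOMS",
           "GREAT BREAKFAST AND FRIENDLY RECEPTION",
           "NOISY ROOM AND RUDE STAFF",
           "DIRTY BATHROOM AND NOISY STREET"]
labels = ["POS", "POS", "NEG", "NEG"]

# Bag-of-words with raw term frequencies as attribute values.
vec = CountVectorizer(lowercase=False)
X = vec.fit_transform(reviews)

# Entropy-based tree (CART here; the original work used Quinlan's C5).
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, labels)

# Significant words are those tested in the tree's internal nodes.
features = vec.get_feature_names_out()
significant = {features[i] for i in tree.tree_.feature if i >= 0}
print(significant)   # e.g. {'FRIENDLY'} or {'NOISY'}, depending on the split
```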
+
    Representing the decision tree using rules

     • The branches of a decision tree can be converted into rules.
     • Examples:

       f(word1) > 0 AND f(word2) = 0 AND f(word3) = 0 : NEG[N1; I1]
       f(word4) = 0 AND f(word5) > 0 AND f(word6) > 0 : NEG[N2; I2]
       f(word1) = 0 AND f(word6) > 0 : NEG[N3; I3]

       Nx – the number of times the rule was used
       Ix – the number of times the rule was used incorrectly

     • When a word appears in a rule as f(word) > 0, it contributes to the
       classification into the given class and is thus relevant for that
       class (see the sketch below for one way to perform the conversion).
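
    Reusing `tree` and `vec` from the previous sketch, the branches can be
    walked recursively to print rules of exactly this form. Reading N from
    a leaf's sample count and I from its class distribution is our
    interpretation of Nx and Ix, not a detail given in the slides.

```python
import numpy as np

def branch_rules(clf, feature_names):
    """Convert each root-to-leaf path of a fitted tree into a rule."""
    t = clf.tree_
    rules = []

    def walk(node, conds):
        if t.children_left[node] == -1:                  # leaf node
            dist = t.value[node].ravel()
            dist = dist / dist.sum()                     # class fractions
            n = int(t.n_node_samples[node])              # N: times the rule was used
            i = int(round(n * (1.0 - dist.max())))       # I: times it was used incorrectly
            cls = clf.classes_[np.argmax(dist)]
            rules.append((" AND ".join(conds), cls, n, i))
            return
        name = feature_names[t.feature[node]]
        # With TF attributes the split threshold is typically 0.5,
        # so left = word absent, right = word present.
        walk(t.children_left[node],  conds + [f"f({name}) = 0"])
        walk(t.children_right[node], conds + [f"f({name}) > 0"])

    walk(0, [])
    return rules

for cond, cls, n, i in branch_rules(tree, vec.get_feature_names_out()):
    print(f"{cond} : {cls}[{n}; {i}]")
```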
+
    One word in multiple paths/rules

    The same word (e.g., “friendly”) can appear in several paths of the
    decision tree and can therefore contribute to classification into
    both classes.
+
    Strength of word sentiment

     • The more often a word appears as relevant in rules that correctly
       assign the negative (positive) class to a text, the more negative
       (positive) the word is. However, it is necessary to consider not
       only the absolute frequency but also the relative accuracy.
     • For example, a word W1 is used 10 times for a correct and 0 times
       for an incorrect classification into the negative class, and a word
       W2 is used 30 times for a correct and 20 times for an incorrect
       classification into the negative class (50 times in total). The
       question is which of these two words is ‘more negative’: W1 was
       used fewer times but with 100% correctness, while W2 was used five
       times more often but with only 60% correctness.
+
    Sentiment strength weight

                  N_C     ln(N_C + N_N)
           w_w = ----- × ---------------
                  N_N       ln(N_max)

     N_C – the number of times the word was used for a correct classification
     N_N – the number of times the word was used for an incorrect classification
     N_max – the maximum number of uses over all words

     The weight balances the frequency with which a word was used for
     classification against the correctness of that classification. The
     calculated weight then determines the importance of a word in
     relation to a given category (the positive or negative class) –
     higher values mean greater relevance.
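
    A minimal sketch of this weight, applied to the W1/W2 example from the
    previous slide. Interpreting N_C as correct uses and N_N as incorrect
    uses, and guarding against division by zero with max(n_incorrect, 1)
    (needed for W1), are our assumptions, not part of the original deck.

```python
import math

def word_weight(n_correct, n_incorrect, n_max):
    """w_w = (N_C / N_N) * (ln(N_C + N_N) / ln(N_max)) -- see the slide above."""
    ratio = n_correct / max(n_incorrect, 1)    # assumed guard against division by zero
    frequency = math.log(n_correct + n_incorrect) / math.log(n_max)
    return ratio * frequency

N_MAX = 50                         # assumed maximum use count in this toy setting
print(word_weight(10, 0, N_MAX))   # W1: used 10x, always correctly  -> ~5.89
print(word_weight(30, 20, N_MAX))  # W2: used 50x, 60% correctly     -> 1.5
```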
+
    Results

    [Three slides of result tables and charts follow in the original deck;
    the images are not reproduced in this transcript.]
+
    Conclusions

     • A procedure for applying computers, machine learning, and natural
       language processing to automatically find significant words was
       presented.
     • From the total number of words (80,000–200,000), only about 200–300
       were identified as significant.
     • The procedure worked well for many languages.
     • Future research will focus on generating typical short phrases
       instead of only individual words.
     • The procedure might be used in marketing research or marketing
       intelligence, for filtering reviews, generating lists of keywords,
       etc.
