+
    Grouping Customer Opinions Written in
    Natural Language Using Unsupervised
    Machine Learning

    František Dařena, Jan Žižka, Karel Burda
    Department of Informatics, Faculty of Business and Economics
    Mendel University in Brno, Czech Republic
+
    Introduction


     Many   companies collect opinions expressed by
      their customers
     These opinions can hide valuable knowledge
     Discovering such knowledge manually can
      be a very demanding task because:
      the opinion database can be very large,
      the customers can use different languages,
      the people can handle the opinions subjectively,
      sometimes additional resources (like lists of positive
       and negative words) might be needed.
+
    Introduction


     Our previous research focused on analyzing
     what was significant for including a certain
     opinion in one of the categories, such as
     satisfied or dissatisfied customers
     However, this requires the reviews to be
     separated into classes sharing a common
     opinion/sentiment
+
    Introduction

     Clustering, as the most common form of
     unsupervised learning, enables automatic
     grouping of unlabeled documents into subsets
     called clusters
     In the previous research, we analyzed how well
     a computer can separate the classes
     expressing a certain opinion, and tried to find a
     clustering algorithm with its best set of
     parameters: similarity and clustering-criterion
     functions, word representation, and the role of
     stemming for the given specific data
+
    Objective


     The clustering process is naturally not errorless:
     some reviews labeled as positive appear
     in a cluster containing mostly negative
     reviews, and vice versa
     The objective was to analyse why certain
     reviews were assigned “wrongly” to a group
     containing mostly reviews from a different
     class in order to improve the results of
     classification and prediction
+
    Data description

     Processed   data included reviews of hotel clients
      collected from publicly available sources
     The reviews were labeled as positive and
      negative
      Review characteristics:
      more  than 5,000,000 reviews
      written in more than 25 natural languages
      written only by real customers, based on their
       experience
      written relatively carefully but still containing errors that
       are typical for natural languages
+
    Properties of data used for
    experiments
     The subset (marked as written in English) used in
     our experiments contained almost two million
     opinions

            Review category         Positive       Negative
            Number of reviews       1,190,949      741,092
            Maximal review length   391 words      396 words
            Average review length   21.67 words    25.73 words
            Variance                403.34 words   618.47 words
+
    Review examples

       Positive
           The breakfast and the very clean rooms stood out as the best
            features of this hotel.
           Clean and moden, the great loation near station. Friendly
            reception!
           The rooms are new. The breakfast is also great. We had a really
            nice stay.
           Good location - very quiet and good breakfast.

       Negative
           High price charged for internet access which actual cost now is
            extreamly low.
           water in the shower did not flow away
           The room was noisy and the room temperature was higher than
            normal.
           The air conditioning wasn't working
+
    Data preparation

     Data collection, cleaning (removing tags, non-letter
      characters), converting to upper-case
     Removing stopwords and words shorter than 3
      characters
     Spell checking, diacritics removal, etc. were not carried
      out
     Creating three smaller subsets containing positive and
      negative reviews with the following proportions:
      about 1,000 positive and 1,000 negative (small)
      about 50,000 positive and 50,000 negative (medium)
      about 250,000 positive and 250,000 negative (large)
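The preparation steps above can be sketched in Python. This is a minimal illustration only: the stopword list and the cleaning regexes here are assumptions, not the ones used in the study.

```python
import re

# Illustrative stopword list; the study's actual list is not given in the slides.
STOPWORDS = {"the", "and", "was", "a", "in", "of", "to"}

def preprocess(review):
    """Clean a raw review as the slides describe: remove tags and
    non-letter characters, convert to upper-case, then drop stopwords
    and words shorter than 3 characters."""
    text = re.sub(r"<[^>]+>", " ", review)        # remove tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)      # remove non-letter characters
    words = text.upper().split()                  # convert to upper-case
    return [w for w in words if len(w) >= 3 and w.lower() not in STOPWORDS]

print(preprocess("The <b>room</b> was noisy!"))   # → ['ROOM', 'NOISY']
```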
+
    Experimental steps

       Transformation of the data into the vector representation (bag-
        of-words model, tf-idf weighting schema)

       Clustering with Cluto* with the following parameters:
           similarity function – cosine similarity,
           clustering method – k-means (Cluto’s variation),
           criterion function optimized during the clustering process – H2

       Weighted entropy of the results varied from about 0.58 to 0.60
        (e.g., for the small set of reviews, the entropy was 0.587 and
        accuracy 0.859)
    *   Free software providing different clustering methods working with several
        clustering criterion functions and similarity measures, suitable for operating on
        very large datasets.
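Cluto itself is a standalone tool, so as an illustration only, the pipeline above (bag-of-words, tf-idf weighting, cosine similarity, k-means with k=2) can be sketched in plain Python. The toy reviews and the choice of initial seeds are assumptions made for the demo; this is not the study's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Bag-of-words with a tf-idf weighting schema (sparse dict vectors)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))       # document frequency
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vecs

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def two_means(vecs, seeds=(0, 1), iters=10):
    """k-means with k=2 and cosine similarity (a simplified stand-in for
    Cluto's k-means variation; `seeds` pick the initial centroids)."""
    cents = [dict(vecs[seeds[0]]), dict(vecs[seeds[1]])]
    assign = [0] * len(vecs)
    for _ in range(iters):
        assign = [0 if cosine(v, cents[0]) >= cosine(v, cents[1]) else 1
                  for v in vecs]
        for k in (0, 1):                                # recompute centroids
            members = [v for v, a in zip(vecs, assign) if a == k]
            if members:
                cent = {}
                for v in members:
                    for w, x in v.items():
                        cent[w] = cent.get(w, 0.0) + x / len(members)
                cents[k] = cent
    return assign

# Toy reviews (already preprocessed into word lists), two per sentiment
docs = [["clean", "room", "great"], ["great", "breakfast"],
        ["noisy", "room", "dirty"], ["noisy", "dirty"]]
print(two_means(tfidf_vectors(docs), seeds=(0, 2)))     # → [0, 0, 1, 1]
```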
+
    Graphical representation of the
    results of clustering




         False Positive (FP)       False Negative (FN)

         True Positive (TP)        True Negative (TN)

         Clustered Positive (CP)   Clustered Negative (CN)
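A small sketch of how the six sets can be derived from each review's true label and its cluster assignment. The encodings 'pos'/'neg' and 'CP'/'CN' are assumptions of this illustration, not from the slides.

```python
def partition(labels, clusters):
    """Derive the six sets from each review's true label ('pos'/'neg')
    and its cluster ('CP'/'CN'); returns lists of review indices.
    FP = negative review in the mostly-positive cluster,
    FN = positive review in the mostly-negative cluster."""
    sets = {"TP": [], "FP": [], "TN": [], "FN": [], "CP": [], "CN": []}
    for i, (lab, clu) in enumerate(zip(labels, clusters)):
        sets[clu].append(i)                       # whole-cluster sets CP / CN
        if clu == "CP":
            sets["TP" if lab == "pos" else "FP"].append(i)
        else:
            sets["TN" if lab == "neg" else "FN"].append(i)
    return sets

print(partition(["pos", "neg", "neg", "pos"], ["CP", "CP", "CN", "CN"]))
```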
+
    Analysis of incorrectly clustered
    reviews
     When a review rPi, originally labeled as positive, is
     “wrongly” assigned to a cluster with mostly negative
     reviews (CN), we can assume that the properties of this
     review are more “similar” to the properties of the other
     reviews in CN, i.e., the words of rPi and their combinations
     are more similar to the words contained in the dictionary
     of CN

     The similarity was related to the frequency of words of rPi
     in the subsets of the clustering solution (FN is compared
     to TN, TP, CP, and FP is compared to TP, TN, CN)
+
    Analysis of incorrectly clustered
    reviews
     We introduce the importance iX(wi) of a word wi in a
     given set X:

                    iX(wi) = NX(wi) / NX

    where NX(wi) is the frequency of word wi in set X and
    NX is the number of dictionary words in X
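The definition can be sketched directly in Python. As a simplifying assumption, NX is taken here as the set's total word count, which is consistent with the slides' Example 1 (3 occurrences / 3,678 words ≈ 0.0008).

```python
from collections import Counter

def importance(words):
    """Importance i_X(w) = N_X(w) / N_X over the word list of a set X,
    with N_X taken as the total word count of the set (an assumption)."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(importance(["good", "good", "bad", "room"])["good"])  # → 0.5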
+
    Analysis of incorrectly clustered
    reviews
     The     importance of a word in one set should be similar
        to the importance of the same word in the most similar
        set, i.e., importance of words in FN and TN should be
        more similar than, e.g., importance of words in FN and
        TP

     The lowest value among |iFP(wi) − iTP(wi)|,
      |iFP(wi) − iTN(wi)|, and |iFP(wi) − iCN(wi)|
      corresponds to the highest importance similarity
      with TP, TN, or CN

       The same comparisons between FN and TN, TP, and
        CP were carried out
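The comparison rule amounts to picking the candidate set with the smallest absolute importance difference. The function below is an illustrative assumption; the values follow the slides' Example 1 (word "excellent" in the FN dictionary).

```python
def closest_set(word, imp_src, candidates):
    """Return the name of the candidate set whose importance of `word`
    is closest to its importance in the source set, i.e. the set
    minimizing |i_src(w) - i_X(w)|."""
    i = imp_src.get(word, 0.0)
    return min(candidates,
               key=lambda name: abs(i - candidates[name].get(word, 0.0)))

# Values from the slides' Example 1
imp_fn = {"excellent": 0.0008}
cands = {"TN": {"excellent": 0.0007},
         "TP": {"excellent": 0.007},
         "CP": {"excellent": 0.006}}
print(closest_set("excellent", imp_fn, cands))  # → TN
```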
+ Importance of words from dictionary of False
  Positive set compared to the other sets
+ Importance of words from dictionary of False
  Negative set compared to the other sets
+
    Results of the analysis

 The    words with higher frequencies included mostly the words
    that could be considered positive (e.g., location, excellent, or
    friendly) and negative (e.g., small, noise, or poor) in terms of
    their contribution to the assignment of reviews to a “correct”
    category

 These    words, important from the correct-classification
    viewpoint, often have the most similar importance in a different
    set than one would expect; e.g., some words in reviews from
    FN bearing a strong positive sentiment had their importance
    most similar to their importance in TN, not in TP or CP
+
    Example 1 – small data set

       A strongly positive word excellent was used 3 times in the
        FN (290 positive reviews, 3,678 words)  iFN = 0.0008

       This importance was the most similar to the
        importance of the same word in TN (iTN = 0.0007), not
        in TP (iTP = 0.007) or CP (iCP = 0.006)

       The review “Excellent bed making. Very good restaurant but
        an English language menu would be advantageous to
        non-german speaking visitors.”, containing the strongly
        positive word excellent, was categorized incorrectly
+
    Example 2 – small data set

     A positive word good (with weaker positivity than
      excellent) had the importance iFN = 0.0114

     This importance was most similar to the
      importance of the same word in CP (iCP = 0.0146),
      not in TP (iTP = 0.016) or TN (iTN = 0.0021)

     Nevertheless, some reviews containing this positive
      word were assigned to a group with mostly negative
      reviews.
+
    Results of the analysis

 Both    examples demonstrate that other document
    properties, i.e., the presence of the other words together
    with their importance, are significant. This is demonstrated
    in the table with importance similarities of words of an
    obviously positive review containing the strongly positive
    word “good” twice, which was assigned incorrectly to CN.
+
    Results of the analysis – importance
    vs. frequency
       The analysis of the importance of words from the dictionary of
        FN showed that about 60% of words had their importance similar
        to their importance in TN

       However, the frequency of each of these words (number of
        occurrences in all reviews) was relatively low (many of them
        appeared just once)

       These words with highly similar importance also often did not
        bear any sentiment, such as the words discounted, happening,
        or attitude
+
    Conclusions

       The study aimed at finding the actual reason for
        assigning some documents to a “wrong” class

       The critical information is provided by certain significant words
        included in individual reviews

       Words that the previous research found significant for opinion
        polarity did not act as misleading information, unlike words
        that were far less significant or quite insignificant

       Specific words (or their combinations) can be filtered out as
        noise, improving the cluster generation
