+
    TEXT MINING-BASED FORMATION OF
    DICTIONARIES EXPRESSING OPINIONS
    IN NATURAL LANGUAGES

    František Dařena, Jan Žižka
    Department of Informatics, Faculty of Business and Economics
    Mendel University in Brno, Czech Republic
+
    Introduction

     • Many companies collect opinions expressed by their customers.
     • These opinions can hide valuable knowledge.
     • Discovering this knowledge manually can sometimes be a very
       demanding task because
        • the opinion database can be very large,
        • the customers can use different languages,
        • people may judge the opinions subjectively,
        • sometimes additional resources (like lists of positive
          and negative words) are needed.
+
    Objective

    To automatically extract words significant for positive and negative
    customers' opinions and to form dictionaries of positive and negative
    words, including the strength of their positivity and negativity.
+
    Data description

     • The processed data included reviews by hotel clients, collected
       from publicly available sources.
     • The reviews were labeled as positive or negative.
     • Review characteristics:
        • more than 5,000,000 reviews,
        • written in more than 25 natural languages,
        • written only by real customers, based on real experience,
        • written relatively carefully, but still containing errors that
          are typical for natural languages.
+
    Review examples

     • Positive
        • The breakfast and the very clean rooms stood out as the best
          features of this hotel.
        • Clean and moden, the great loation near station. Friendly
          reception!
        • The rooms are new. The breakfast is also great. We had a really
          nice stay.
        • Good location - very quiet and good breakfast.

     • Negative
        • High price charged for internet access which actual cost now
          is extreamly low.
        • water in the shower did not flow away
        • The room was noisy and the room temperature was higher
          than normal.
        • The air conditioning wasn't working
+
    Data preparation

     • Data collection, cleaning (removing tags and non-letter
       characters), and converting to upper-case.
     • Transforming into the Bag-of-Words representation, with term
       frequencies (TF) used as attribute values.
     • Removing words with a global frequency below 2 (the MinTF = 2
       threshold, i.e., words occurring only once in the whole corpus).
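
    The preparation steps above can be sketched in a few lines of Python.
    This is an illustrative reconstruction, not the authors' code: the
    function names are ours, and the letter filter assumes ASCII for
    brevity (the real corpus spans more than 25 languages).

```python
import re
from collections import Counter

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)       # remove tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remove non-letter characters (ASCII only here)
    return text.upper()                        # convert to upper-case

def bag_of_words(reviews, min_tf=2):
    """Term-frequency vectors over a MinTF-filtered vocabulary."""
    docs = [Counter(clean(r).split()) for r in reviews]
    global_tf = Counter()
    for d in docs:
        global_tf.update(d)
    vocab = {w for w, f in global_tf.items() if f >= min_tf}   # drop words below MinTF
    return [{w: f for w, f in d.items() if w in vocab} for d in docs]

docs = bag_of_words(["The rooms are new.", "The breakfast is great.", "The rooms are clean."])
print(docs)   # e.g. [{'THE': 1, 'ROOMS': 1, 'ARE': 1}, {'THE': 1}, ...]
```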
+
    Data characteristics

    [Chart: Number of unique words for different languages (MinTF = 1)]
+
    Data characteristics

    [Chart: Number of unique words for different languages – in total, in
    the negative class, in the positive class, and in both classes
    (MinTF = 2)]
+
    Finding the significant words

     • Significant words were discovered as the relevant attributes used
       by a classification algorithm – a decision tree generated by the C5
       algorithm (by R. Quinlan), which is based on entropy minimization.
     • The goal was not to achieve the best classification accuracy (it
       was around 90%) but to find the relevant attributes that contribute
       to assigning a text to a given class.
     • The significant words appeared in the nodes of the decision tree.
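
    As an illustration of this step: C5 is distributed as a standalone
    tool, so the sketch below uses scikit-learn's CART decision tree with
    the entropy criterion as a rough stand-in, trained on toy data, and
    reads the significant words out of the tree's internal nodes. All
    names and data here are ours, not the authors'.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy labeled reviews standing in for the multi-million-review corpus.
reviews = ["FRIENDLY STAFF AND VERY CLEAN ROOMS",
           "GREAT BREAKFAST AND FRIENDLY RECEPTION",
           "NOISY ROOM AND RUDE STAFF",
           "DIRTY BATHROOM AND NOISY STREET"]
labels = ["POS", "POS", "NEG", "NEG"]

# Bag-of-words with raw term frequencies as attribute values.
vec = CountVectorizer(lowercase=False)
X = vec.fit_transform(reviews)

# Entropy-based tree (CART here; the original work used Quinlan's C5).
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, labels)

# Significant words are those tested in the tree's internal nodes.
features = vec.get_feature_names_out()
significant = {features[i] for i in tree.tree_.feature if i >= 0}
print(significant)   # e.g. {'FRIENDLY'} or {'NOISY'}, depending on the split
```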
+
    Representing the decision tree using rules

     • The branches of a decision tree can be converted into rules.
     • Examples:

       f(word1) > 0 AND f(word2) = 0 AND f(word3) = 0 : NEG[N1; I1]
       f(word4) = 0 AND f(word5) > 0 AND f(word6) > 0 : NEG[N2; I2]
       f(word1) = 0 AND f(word6) > 0 : NEG[N3; I3]

       Nx – the number of times the rule was used
       Ix – the number of times the rule was used incorrectly

     • When a word appears in a rule as f(word) > 0, it contributes to the
       classification into the given class and is thus relevant for that
       class (see the sketch below for one way to perform the conversion).
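
    Reusing `tree` and `vec` from the previous sketch, the branches can be
    walked recursively to print rules of exactly this form. Reading N from
    a leaf's sample count and I from its class distribution is our
    interpretation of Nx and Ix, not a detail given in the slides.

```python
import numpy as np

def branch_rules(clf, feature_names):
    """Convert each root-to-leaf path of a fitted tree into a rule."""
    t = clf.tree_
    rules = []

    def walk(node, conds):
        if t.children_left[node] == -1:                  # leaf node
            dist = t.value[node].ravel()
            dist = dist / dist.sum()                     # class fractions
            n = int(t.n_node_samples[node])              # N: times the rule was used
            i = int(round(n * (1.0 - dist.max())))       # I: times it was used incorrectly
            cls = clf.classes_[np.argmax(dist)]
            rules.append((" AND ".join(conds), cls, n, i))
            return
        name = feature_names[t.feature[node]]
        # With TF attributes the split threshold is typically 0.5,
        # so left = word absent, right = word present.
        walk(t.children_left[node],  conds + [f"f({name}) = 0"])
        walk(t.children_right[node], conds + [f"f({name}) > 0"])

    walk(0, [])
    return rules

for cond, cls, n, i in branch_rules(tree, vec.get_feature_names_out()):
    print(f"{cond} : {cls}[{n}; {i}]")
```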
+
    One word in multiple paths/rules

    The same word (e.g., “friendly”) can appear in several paths of the
    decision tree and can therefore contribute to classification into
    both classes.
+
    Strength of word sentiment

     • The more often a word appears as relevant in rules that correctly
       assign the negative (positive) class to a text, the more negative
       (positive) the word is. However, it is necessary to consider not
       only the absolute frequency but also the relative accuracy.
     • For example, a word W1 is used 10 times for a correct and 0 times
       for an incorrect classification into the negative class, and a word
       W2 is used 30 times for a correct and 20 times for an incorrect
       classification into the negative class (50 times in total). The
       question is which of these two words is ‘more negative’: W1 was
       used fewer times but with 100% correctness, while W2 was used five
       times more often but with only 60% correctness.
+
    Sentiment strength weight

                  N_C     ln(N_C + N_N)
           w_w = ----- × ---------------
                  N_N       ln(N_max)

     N_C – the number of times the word was used for a correct classification
     N_N – the number of times the word was used for an incorrect classification
     N_max – the maximum number of uses over all words

     The weight balances the frequency with which a word was used for
     classification against the correctness of that classification. The
     calculated weight then determines the importance of a word in
     relation to a given category (the positive or negative class) –
     higher values mean greater relevance.
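
    A minimal sketch of this weight, applied to the W1/W2 example from the
    previous slide. Interpreting N_C as correct uses and N_N as incorrect
    uses, and guarding against division by zero with max(n_incorrect, 1)
    (needed for W1), are our assumptions, not part of the original deck.

```python
import math

def word_weight(n_correct, n_incorrect, n_max):
    """w_w = (N_C / N_N) * (ln(N_C + N_N) / ln(N_max)) -- see the slide above."""
    ratio = n_correct / max(n_incorrect, 1)    # assumed guard against division by zero
    frequency = math.log(n_correct + n_incorrect) / math.log(n_max)
    return ratio * frequency

N_MAX = 50                         # assumed maximum use count in this toy setting
print(word_weight(10, 0, N_MAX))   # W1: used 10x, always correctly  -> ~5.89
print(word_weight(30, 20, N_MAX))  # W2: used 50x, 60% correctly     -> 1.5
```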
+
    Results

    [Three slides of result tables and charts follow in the original deck;
    the images are not reproduced in this transcript.]
+
    Conclusions

     • A procedure for applying computers, machine learning, and natural
       language processing to automatically find significant words was
       presented.
     • From the total number of words (80,000–200,000), only about 200–300
       were identified as significant.
     • The procedure worked well for many languages.
     • Future research will focus on generating typical short phrases
       instead of only individual words.
     • The procedure might be used in marketing research or marketing
       intelligence, for filtering reviews, generating lists of keywords,
       etc.
