                Seminar on Text Mining by Examples

By: Hadi Mohammadzadeh
Institute of Applied Information Processing
University of Ulm – 27 Jan. 2010

            Hadi Mohammadzadeh          Text Mining by Examples   Pages       1


Outline

– New Terminologies
– WordNet - A Large Lexical Database of English
– Reuters-21578 … as a Text Collection
– CMU Text Learning Group Data Archives
– Text Mine Software - Web based algorithms
– Text Mine Software - Command based algorithms
– Useful Web Sites





                          Part One



New Terminologies
     Word and Meaning Relationships







                          Understanding Text
                Hyponym and Hypernym
• In linguistics, a hyponym is a word or phrase whose
  semantic range is included within that of another word, its
  hypernym. For example, scarlet and crimson are both
  hyponyms of red (their hypernym), which is, in turn, a
  hyponym of colour.
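
The relation can be sketched as a small lookup table. The Python snippet below is only an illustration: it hard-codes the colour example from this slide, and the names HYPERNYM, hypernym_chain and is_hyponym are assumptions made for the example, not part of WordNet.

```python
# Toy hypernym table built from the slide's example (not real WordNet data).
HYPERNYM = {
    "scarlet": "red",
    "crimson": "red",
    "red": "colour",
}


def hypernym_chain(word):
    """Return the increasingly general terms found above `word`."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def is_hyponym(word, candidate):
    """True if `candidate` appears anywhere above `word` in the table."""
    return candidate in hypernym_chain(word)


print(hypernym_chain("scarlet"))        # ['red', 'colour']
print(is_hyponym("crimson", "colour"))  # True
```

Because the lookup follows the chain upwards, the relation is treated as transitive: scarlet counts as a hyponym of colour as well as of red.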








                        Understanding Text
                            Meronym

• Meronymy is a semantic relation used in linguistics.
  A meronym denotes a constituent part of, or a
  member of, something. That is,
   – X is a meronym of Y if Xs are parts of Y(s), or
   – X is a meronym of Y if Xs are members of Y(s).
• For example, 'finger' is a meronym of 'hand' because
  a finger is part of a hand. Similarly 'wheel' is a
  meronym of 'automobile'.







                          Understanding Text
                              Holonym

• Holonymy defines the relationship between a term denoting
  the whole and a term denoting a part of the whole. That is,

   – 'X' is a holonym of 'Y' if Ys are parts of Xs, or
   – 'X' is a holonym of 'Y' if Ys are members of Xs.

• For example, 'tree' is a holonym of 'bark', of 'trunk',
  and of 'limb'.
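
Meronymy and holonymy are inverse relations, so one can be derived from the other. The following Python sketch stores the part/whole pairs mentioned on the slides and inverts them; the data structure and the function name build_holonym_index are assumptions for illustration only.

```python
from collections import defaultdict

# (part, whole) pairs taken from the slides' examples.
MERONYM_OF = [
    ("finger", "hand"),
    ("wheel", "automobile"),
    ("bark", "tree"),
    ("trunk", "tree"),
    ("limb", "tree"),
]


def build_holonym_index(pairs):
    """Invert (part, whole) pairs into a whole -> sorted parts mapping."""
    index = defaultdict(list)
    for part, whole in pairs:
        index[whole].append(part)
    return {whole: sorted(parts) for whole, parts in index.items()}


HOLONYM_INDEX = build_holonym_index(MERONYM_OF)
print(HOLONYM_INDEX["tree"])  # ['bark', 'limb', 'trunk']
```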





                          Part Two



               WordNet
        A Large Lexical Database of English








                               WordNet
• WordNet® is a large lexical database of English, developed
  under the direction of George A. Miller.

• Development of WordNet began in 1985, and it is
  widely used in tools that manage text.

• WordNet is more than just a dictionary and thesaurus; it includes
  all kinds of relationships between words. WordNet version 2.0
  contains roughly 150,000 content words.








                       WordNet (cont.)


• Nouns, verbs, adjectives and adverbs are grouped into
  sets of cognitive synonyms (synsets), each expressing a
  distinct concept.


• WordNet is also freely and publicly available for
  download.

• WordNet's structure makes it a useful tool for
  computational linguistics and natural language
  processing.
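
The synset idea can be illustrated with a toy lexicon in Python. The entries below are invented for the example and are not copied from WordNet; the names SYNSETS, senses and synonyms are likewise assumptions. Each synset groups words that express one concept, so a word that appears in several synsets has several senses.

```python
# A toy lexicon: each synset is (part of speech, set of synonymous words).
# The entries are invented for illustration and are not copied from WordNet.
SYNSETS = [
    ("noun", {"car", "automobile", "auto"}),
    ("noun", {"bank", "riverbank"}),
    ("noun", {"bank", "depository"}),
    ("verb", {"bank", "deposit"}),
    ("adjective", {"still", "motionless"}),
    ("adverb", {"still", "yet"}),
]


def senses(word):
    """Return the synsets that contain `word`, one per sense."""
    return [(pos, members) for pos, members in SYNSETS if word in members]


def synonyms(word):
    """Collect every other word sharing at least one synset with `word`."""
    found = set()
    for _, members in senses(word):
        found |= members
    found.discard(word)
    return sorted(found)


print(len(senses("bank")))  # 3
print(synonyms("car"))      # ['auto', 'automobile']
```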




                     Understanding Text – Polysemy
          Number of Senses in WordNet
• A word can have more than one meaning, and the intended meaning is
  not always obvious from the sentence.
• In WordNet a word has an average of 1.4 senses.

              Average Number of Senses by Word Class
      Word Class                        Average Number of Senses
        Verb                                               2.1
      Adjective                                           1.45
       Adverb                                             1.25
        Noun                                              1.24




              Understanding Text – Polysemy
   Number of Senses in WordNet

Words with the Highest Number of Senses from
                  WordNet
 Word                   Number of Senses

  Break                                            74
   Cut                                             73
  Run                                              57
  Play                                             52
  Make                                             51






                    Understanding Text – Polysemy
           Number of POS in WordNet
• Some words also have more than one part of speech (POS). For
  example, 'still' has five different parts of speech.

         Word                                    Number of POS
            Out                                             5
          Round                                             5
           Still                                            5
          Down                                              5
           Over                                             4





           Word Classifications in WordNet

• Words can be classified into word classes or POS.
• We refer to nouns, verbs, adjectives, and adverbs as content words.
• Conjunctions, determiners, pronouns, and prepositions are called
  function words.

                Frequencies of Word Classes from WordNet
    Type          Number (share)          Type            Number (share)
    Noun          114,400 (75%)           Preposition     133 (0.08%)
    Adjective     21,438 (14%)            Pronoun         118 (0.07%)
    Verb          11,341 (7.4%)           Conjunction     89 (0.05%)
    Adverb        4,662 (3%)              Determiner      14 (0.009%)
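
The shares in the table above can be recomputed from the raw counts. The short Python check below uses only the numbers printed in the table; the variable names are assumptions. Under this calculation the four content-word classes sum to 151,841 entries, and the percentages on the slide look truncated rather than rounded (for example, verbs come out at about 7.45%).

```python
# Counts copied from the table above.
COUNTS = {
    "Noun": 114400,
    "Adjective": 21438,
    "Verb": 11341,
    "Adverb": 4662,
    "Preposition": 133,
    "Pronoun": 118,
    "Conjunction": 89,
    "Determiner": 14,
}
CONTENT_CLASSES = ("Noun", "Adjective", "Verb", "Adverb")

total = sum(COUNTS.values())
content_total = sum(COUNTS[name] for name in CONTENT_CLASSES)
shares = {name: 100 * n / total for name, n in COUNTS.items()}

for name, share in shares.items():
    print(f"{name:<12} {COUNTS[name]:>8,} {share:8.3f}%")
print("Total words:  ", total)
print("Content words:", content_total)
```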






                                  WordNet
                Website and Developed Program
•   WordNet Website



•   WordNet Developed Program





                         Part Three



      Reuters-21578
                as a Text Collection








                          Reuters-21578
                                    History
• The documents in the Reuters-21578 collection
  appeared on the Reuters newswire in 1987.

• Reuters-21578 is a test collection for evaluating
  automatic text categorization techniques. It is a
  classic benchmark for text categorization algorithms.

• The Reuters-21578 collection is distributed in 22 files.
  Each of the first 21 files contains 1000 documents,
  while the last contains 578 documents.
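
The file layout explains the collection's name: 21 x 1000 + 578 = 21578. The Python lines below verify the arithmetic. The file names follow the pattern commonly seen in the distribution (reut2-000.sgm to reut2-021.sgm), which is stated here as an assumption because the slide does not list them.

```python
FULL_FILES = 21            # files holding 1000 documents each
DOCS_PER_FULL_FILE = 1000
DOCS_IN_LAST_FILE = 578

# File names are an assumption based on the usual distribution layout.
file_sizes = {f"reut2-{i:03d}.sgm": DOCS_PER_FULL_FILE for i in range(FULL_FILES)}
file_sizes[f"reut2-{FULL_FILES:03d}.sgm"] = DOCS_IN_LAST_FILE

total_docs = sum(file_sizes.values())
print(len(file_sizes), total_docs)  # 22 21578
```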





                       Reuters-21578

• Distribution 1.0 was released on 26 September 1997 by
  David D. Lewis, AT&T Labs - Research.

• The data was originally collected and labeled
  by Carnegie Group, Inc. and Reuters, Ltd. in
  the course of developing the CONSTRUE text
  categorization system.



                            Part Four


CMU Text Learning Group
     Data Archives
                  as a Text Collection








                   CMU Text Learning Group
                        Data Archives

• This data set is a collection of 20,000 messages, collected
  from 20 different netnews newsgroups. One thousand
  messages from each of the twenty newsgroups were chosen at
  random and partitioned by newsgroup name.

• Link

• Sample Message

• Experiment Results

• Prof. Cho, Sam Houston State University






                         CMU Text Learning Group
                              Data Archives
1.    alt.atheism
2.    talk.politics.guns
3.    talk.politics.mideast
4.    talk.politics.misc
5.    talk.religion.misc
6.    soc.religion.christian
7.    comp.sys.ibm.pc.hardware
8.    comp.graphics
9.    comp.os.ms-windows.misc
10.   comp.sys.mac.hardware
11.   comp.windows.x
12.   rec.autos
13.   rec.motorcycles
14.   rec.sport.baseball
15.   rec.sport.hockey
16.   sci.crypt
17.   sci.electronics
18.   sci.space
19.   sci.med
20.   misc.forsale
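
The twenty group names above can be grouped by their top-level hierarchy, which gives a quick view of how the topics are distributed. This Python sketch uses only the names listed on the slide and the stated figure of one thousand messages per group; the variable names are assumptions.

```python
from collections import Counter

NEWSGROUPS = [
    "alt.atheism", "talk.politics.guns", "talk.politics.mideast",
    "talk.politics.misc", "talk.religion.misc", "soc.religion.christian",
    "comp.sys.ibm.pc.hardware", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.mac.hardware", "comp.windows.x", "rec.autos",
    "rec.motorcycles", "rec.sport.baseball", "rec.sport.hockey",
    "sci.crypt", "sci.electronics", "sci.space", "sci.med", "misc.forsale",
]
MESSAGES_PER_GROUP = 1000

# Count the groups under each top-level hierarchy (comp, rec, sci, ...).
by_hierarchy = Counter(name.split(".")[0] for name in NEWSGROUPS)

print(dict(by_hierarchy))
print(len(NEWSGROUPS) * MESSAGES_PER_GROUP)  # 20000
```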




                          Part Five



Text Mine Software
              Web based algorithms








                 Text Mine Application




•   The three scripts in the first row handle:
    1.   The creation of text statistics
         •     Number of word types
         •     Letter frequencies
         •     Word frequencies
    2.   Entity extraction
    3.   Finding the POS tags for words
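
The three text statistics listed above can be computed in a few lines. The Python sketch below is not the Text Mine script itself; it is a minimal illustration, and the function name text_statistics and the sample sentence are assumptions.

```python
import re
from collections import Counter


def text_statistics(text):
    """Return word-type count, letter frequencies and word frequencies."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    letters = [ch for ch in lowered if ch.isalpha()]
    return {
        "word_types": len(set(words)),
        "letter_freq": Counter(letters),
        "word_freq": Counter(words),
    }


stats = text_statistics("The cat sat on the mat. The mat was flat.")
print(stats["word_types"])                # 7
print(stats["word_freq"].most_common(2))  # [('the', 3), ('mat', 2)]
print(stats["letter_freq"]["t"])          # 8
```

A word type is a distinct word form, so the ten running words of the sample collapse into seven types.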




           Text Mine Application




•   As input, use a text file such as the Help file, or
    type text into the text box.




                           Part Six



Text Mine Software
          Command based algorithms








                          Zeroth Program
                                         Tokens


• Name of Program: tokens.pl
• Input : sample.
• Output : After running this program, it will generate a text file with
  the following name

                 tokens.txt
• Aim : Generating Tokens
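
Token generation can be approximated with a single regular expression. The Python sketch below is an illustration only and does not reproduce tokens.pl; the pattern and the name tokenize are assumptions. It recognises numbers with an optional decimal part, words with internal apostrophes or hyphens, and single punctuation marks.

```python
import re

# Alternatives, tried in order:
#   1. a number with an optional decimal part, e.g. 2.0
#   2. a word, optionally with internal apostrophes or hyphens, e.g. isn't
#   3. any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text):
    """Return the list of tokens found in `text`, in order."""
    return TOKEN_RE.findall(text)


print(tokenize("WordNet 2.0 isn't just a dictionary."))
# ['WordNet', '2.0', "isn't", 'just', 'a', 'dictionary', '.']
```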








                        First Program
                            Part of Speech Tagger


• Name of Program: pos-test.pl
• Input : Inside Perl File.
• Output : After running this program, it will
  generate a text file with the following name

             pos_test_results.txt
• Aim : Part of Speech Tagger
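
To show the idea behind part-of-speech tagging, here is a deliberately small Python sketch that looks words up in a hand-made lexicon and falls back to suffix rules for unknown words. It is not the method used by pos-test.pl; the lexicon, the tag names and the functions guess_tag and tag are all assumptions for illustration.

```python
# Tiny hand-made lexicon; a real tagger needs a far larger one or a trained model.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "park": "NOUN",
    "runs": "VERB", "in": "PREP",
    "quickly": "ADV", "still": "ADV",
}


def guess_tag(word):
    """Fallback for unknown words based on common English suffixes."""
    if word.endswith("ly"):
        return "ADV"
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    return "NOUN"


def tag(tokens):
    """Pair every token with a tag from the lexicon or the fallback rules."""
    return [(tok, LEXICON.get(tok.lower()) or guess_tag(tok.lower())) for tok in tokens]


print(tag("The dog runs slowly in the park".split()))
```

A lookup like this assigns one fixed tag per word, so it cannot handle words such as 'still' that have several parts of speech; resolving those needs context.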






                        Second Program
                               Entity Extraction


• To generate named entities with associated
  types, we need dictionaries for categories
  such as
  – Person, place, organization, number, currency, dimension, time,
    technical time, or miscellaneous.
  – For example, co_abbrev.dat contains a list of about 900
    abbreviations, and the co_places table is a list of about 3000 of the
    world’s larger cities.
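
Dictionary-based entity extraction can be sketched as a longest-match lookup. In the Python example below, GAZETTEER is a hypothetical miniature stand-in for tables such as co_places; its entries, the MAX_WORDS limit and the function extract_entities are assumptions and do not reflect the actual Text Mine data or code.

```python
# Hypothetical miniature dictionary standing in for the tables named on the slide.
GAZETTEER = {
    "ulm": "place",
    "new york": "place",
    "reuters": "organization",
    "george a. miller": "person",
}
MAX_WORDS = 3        # longest dictionary entry, counted in words
TRAILING = ",;:!?"   # punctuation stripped from the end of a candidate


def extract_entities(text):
    """Greedy longest-match lookup of word n-grams in the gazetteer."""
    words = text.split()
    entities, i = [], 0
    while i < len(words):
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n]).strip(TRAILING)
            if phrase.lower() in GAZETTEER:
                entities.append((phrase, GAZETTEER[phrase.lower()]))
                i += n
                break
        else:
            i += 1  # no entry starts at this word
    return entities


print(extract_entities("George A. Miller never wrote for Reuters in New York, I think"))
# [('George A. Miller', 'person'), ('Reuters', 'organization'), ('New York', 'place')]
```

Trying the longest candidate first keeps 'New York' from being missed in favour of a shorter, unrelated match.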








                         Second Program
                                Entity Extraction


•   Name of Program: test-ent.pl
•   Input : Inside Perl File.
•   Output : After running this program, it will generate a
    text file with the following name
                test_ent_results.txt
•   Aim : Entity Extraction







                            Third Program
                   Disambiguate words with multiple senses


• Name of Program: sense.pl
• Input : Inside Perl File.
• Output : After running this program, it will
  generate a text file with the following name

             sense.txt
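
One common way to disambiguate a word is to compare the sentence with a short definition (gloss) of each sense and pick the sense with the largest word overlap, a simplified form of the Lesk approach. Whether sense.pl works this way is not stated on the slide, so the Python sketch below is an assumption throughout, including the invented glosses and the names SENSES and disambiguate.

```python
# Invented glosses for two senses of "bank"; a real system could use WordNet glosses.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts money deposits and makes loans",
        "river side": "the sloping land beside a river or lake",
    }
}
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "at", "on", "by"}


def content_words(text):
    """Lower-cased words of `text` without stopwords or trailing punctuation."""
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS


def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    overlaps = {
        label: len(context & content_words(gloss))
        for label, gloss in SENSES[word].items()
    }
    return max(overlaps, key=overlaps.get)


print(disambiguate("bank", "She sat by the river on the grassy bank"))     # river side
print(disambiguate("bank", "He deposits money at the bank every Friday"))  # financial institution
```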






                    Fourth Program
                     Random Text Generator

• Name of Program: tgen.pl
• Input : Inside Perl File.
• Output : After running this program, it will
  generate a text file with the following name

             tgen.txt
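
Random text is often generated with a word-level Markov chain: record which words follow each word in a sample text, then walk those transitions at random. The slide does not say how tgen.pl works, so the Python sketch below is an assumed technique; the corpus and the names build_bigram_model and generate are made up for the example.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model


def generate(model, start, length, seed=0):
    """Random walk over the model; stops early at a word with no successor."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    out = [start]
    while len(out) < length and model.get(out[-1]):
        out.append(rng.choice(model[out[-1]]))
    return " ".join(out)


corpus = "the cat sat on the mat and the dog sat on the cat"
model = build_bigram_model(corpus)
print(generate(model, "the", 8))
```

Every adjacent word pair in the output also occurs in the sample text, which is why the result tends to look locally plausible even though it is random.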






                      Fifth Program
                     Splitting of text into sentences


• Name of Program: tsplit.pl
• Input : Inside Perl File.
• Output : After running this program, it will
  generate a text file with the following name

             tsplit.txt
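
Sentence splitting is harder than splitting on every period, because abbreviations also end in periods. The Python sketch below shows one simple rule-based approach and is not taken from tsplit.pl; the abbreviation list and the function split_sentences are assumptions.

```python
import re

# Abbreviations that end with a period but do not end a sentence (illustrative list).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "inc.", "ltd.", "jan."}


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and an uppercase letter."""
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    sentences = []
    for piece in pieces:
        # Glue the piece back on if the previous one ended in an abbreviation.
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


print(split_sentences("Prof. Cho ran the test. It worked! Did Reuters, Ltd. agree?"))
# ['Prof. Cho ran the test.', 'It worked!', 'Did Reuters, Ltd. agree?']
```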






                            Sixth program
                                      Clustering

• Name of Program: cluster.pl

• Input Data: a collection of 55 Reuters documents from three topics
    – Cocoa, 15 documents
    – Sugar, 22 documents
    – Coffee, 18 documents
    The input file is included in cluster.pl.

• Input Parameters : A similarity threshold, a linking parameter, and
  an indexing parameter.

• Output :
  It returns a list of clusters and a similarity matrix, written to Cluster.txt.

• Method : This program is based on a genetic algorithm.
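
The role of the similarity threshold and the similarity matrix can be illustrated with a much simpler method than the genetic algorithm named above. The Python sketch below swaps in threshold-based single-link clustering over cosine similarity; it does not reproduce cluster.pl, it ignores the linking and indexing parameters, and the five short documents are invented rather than taken from Reuters.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold):
    """Single-link clustering: documents share a cluster if a chain of
    pairwise similarities >= threshold connects them."""
    vectors = [Counter(d.lower().split()) for d in docs]
    n = len(docs)
    matrix = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    labels = list(range(n))  # each document starts in its own cluster
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= threshold and labels[i] != labels[j]:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    return list(clusters.values()), matrix


# Invented one-line documents, used only to exercise the code.
docs = [
    "cocoa prices rise in ghana",
    "ghana cocoa exports rise",
    "sugar output falls in cuba",
    "cuba sugar harvest falls",
    "coffee prices steady in brazil",
]
groups, sim = cluster(docs, threshold=0.5)
print(groups)  # [[0, 1], [2, 3], [4]]
```

Lowering the threshold merges more documents into the same cluster; raising it leaves more documents on their own.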



                         Part Seven



     Useful Web Sites








                      Talk to Ditto
• http://www.convo.co.uk/x02/?








                   How does it work?
• Bayesian Classification is used to teach Ditto
  the donkey the basics of the English language
• When Ditto receives a message, he evaluates it
  for niceness or nastiness, then responds
  emotionally on a scale of –100 to +100
• Ditto was trained using 5525 examples
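
A Bayesian classifier of this kind can be sketched with a naive Bayes model over words. How Ditto is actually implemented is not described here, so everything in the Python example below is an assumption: the four training sentences are invented, and the mapping from the class probability to the –100 to +100 scale is just one plausible choice.

```python
import math
from collections import Counter

# Invented training examples; the slide says Ditto was trained on 5525 real ones.
TRAINING = [
    ("you are lovely and kind", "nice"),
    ("what a nice friendly donkey", "nice"),
    ("you are stupid and ugly", "nasty"),
    ("i hate this horrible donkey", "nasty"),
]


def train(examples):
    """Count words per class and documents per class."""
    counts = {"nice": Counter(), "nasty": Counter()}
    docs = Counter()
    for text, label in examples:
        counts[label].update(text.split())
        docs[label] += 1
    vocab = set(counts["nice"]) | set(counts["nasty"])
    return counts, docs, vocab


def score(message, model):
    """Return a mood score between -100 (nasty) and +100 (nice)."""
    counts, docs, vocab = model
    log_prob = {}
    for label in counts:
        total = sum(counts[label].values())
        lp = math.log(docs[label] / sum(docs.values()))  # class prior
        for word in message.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing avoids zero probabilities.
                lp += math.log((counts[label][word] + 1) / (total + len(vocab)))
        log_prob[label] = lp
    # Turn the two log scores into P(nice | message), then rescale to [-100, 100].
    p_nice = 1 / (1 + math.exp(log_prob["nasty"] - log_prob["nice"]))
    return round(200 * p_nice - 100)


model = train(TRAINING)
print(score("you are kind", model))      # 33
print(score("you are horrible", model))  # -33
```

A message made up only of unseen words scores 0, because both classes then remain equally likely.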








                    Dragon Toolkit

• Dragon Toolkit








                                  Disp
• http://www.ltg.ed.ac.uk/disp/resources/








                              References

• Books
  –   Introduction to Information Retrieval (2008)
  –   Managing Gigabytes (1999)
  –   The Text Mining Handbook
  –   Text Mining Application Programming
  –   Web Data Mining



