SlideShare a Scribd company logo
1 of 50
Download to read offline
Natural Language Toolkit [NLTK]
                                                  Prakash B Pimpale
                                                  pbpimpale@gmail.com

                                  @
    FOSS(From the Open Source Shelf) 
  An open source softwares seminar series

                         (CC) KBCS CDAC MUMBAI
Let's start
➢   What is it about?
    ➢   NLTK – installing and using it for some 'basic NLP' 
        and Text Processing tasks
    ➢   Some NLP and Some python 
➢   How will we proceed ?
    ➢   NLTK introduction
    ➢   NLP introduction
    ➢   NLTK installation
    ➢   Playing with text using NLTK 
    ➢   Conclusion          (CC) KBCS CDAC MUMBAI
NLTK

➢   Open source python modules and linguistic 
    data for Natural Langauge Processing 
    application 
➢   Dveloped by group of people, project leaders: 
    Steven Bird, Edward Loper, Ewan Klein 
➢   Python: simple, free and open source, portable, 
    object oriented, interpreted 


                     (CC) KBCS CDAC MUMBAI
NLP
➢   Natural Language Processing: field of computer science 
    and linguistics woks out interactions between computers 
    and human (natural) languages.
➢   some brand ambassadors: machine translation, 
    automatic summarization, information extraction, 
    transliteration, question answering, opinion mining  
➢   Basic tasks: stemming, POS tagging, chunking, parsing, 
    etc.
➢   Involves simple frequency count to undertanding and 
    generating complex text 

                        (CC) KBCS CDAC MUMBAI
Terms/Steps in NLP

➢   Tokenization: getting words and punctuations 
    out from text


➢   Stemming: getting the (inflectional) root of word; 
    plays, playing, played : play  




                     (CC) KBCS CDAC MUMBAI
cont..

POS(Part of Speech) tagging: 
     Ram  NNP  
     killed  VBD 
     Ravana NNP 


POS tags: NNP­ proper noun
                 VBD  verb, past tense
(Penn tree bank tagset )
                   (CC) KBCS CDAC MUMBAI
cont..

➢   Chunking: groups the similar POS tags (liberal 
    def)




                    (CC) KBCS CDAC MUMBAI
cont..
Parsing: Identifying gramatical structures in a 
sentence:  Mary saw a dog
 *(S 
    (NP Mary)
    **(VP 
       (V saw)
       ***(NP 
           (Det a)
           (N dog)   )***   )**   )*
                           (CC) KBCS CDAC MUMBAI
Applications

➢   Machine Translation
    MaTra, Google translate
➢   Transliteration 
    Xlit, Google Indic
➢   Text classification
    Spam filters
    many more.......!
                        (CC) KBCS CDAC MUMBAI
Challenges
➢   Word sense disambiguation
     Bank(financial/river)
     Plant(industrial/natural)
➢   Name entity recognition
    CDAC is in Mumbai
➢   Clause Boundary detection
    Ram who was the king of Ayodhya killed 
    Ravana 
    and many more.....!
                    (CC) KBCS CDAC MUMBAI
Installing NLTK
    Windows
➢   Install Python: Version 2.4.*, 2.5.*, or 2.6.* and 
    not 3.0
    available http://www.nltk.org/download 
    [great instructions!]
➢   Install PyYAML:
➢   Install Numpy
➢   Install Matplotlib
➢   Install NLTK         (CC) KBCS CDAC MUMBAI
cont..

    Linux/Unix
➢   Most likely python will be there (if not­> 
    package manager)
➢   Get python­nympy and python­scipy packages 
    (scintific library for multidimentional array and 
    linear algebra), use package manager to install 
    or 
    'sudo apt­get install python­numpy python­
    scipy'
                      (CC) KBCS CDAC MUMBAI
cont..

    NLTK Source Installation
➢   Download NLTK source 
    ( http://nltk.googlecode.com/files/nltk­2.0b8.zip)
➢   Unzip it
➢   Go to the new unziped folder
➢   Just do it! 
    'sudo python setup.py install'
    Done :­) !
                     (CC) KBCS CDAC MUMBAI
Installing Data

➢   We need data to experiment with
    It's too simple (same for all platform)...
      $python            #(IDLE) enter to python prompt 
      >>> import nltk
      >>> nltk.download()
    select any one option that you like/need



                        (CC) KBCS CDAC MUMBAI
cont..




         (CC) KBCS CDAC MUMBAI
cont..

 Test installation
 ( set path if needed: as 
 export NLTK_DATA = '/home/....../nltk_data')
       >>> from nltk.corpus import brown
       >>> brown.words()[0:20]
        ['The', 'Fulton', 'County', 'Grand', 'Jury']
                     All set!


                          (CC) KBCS CDAC MUMBAI
Playing with text

 Feeling python 
 $python
 >>>print 'Hello world'
 >>>2+3+4
     
 Okay !


                   (CC) KBCS CDAC MUMBAI
cont..

 Data from book:
   >>>from nltk.book import *
 see what is it?
    >>>text1
 Search the word:
   >>> text1.concordance("monstrous")
 Find similar words 
   >>> text1.similar("monstrous")
                   (CC) KBCS CDAC MUMBAI
cont..
Find common context
  >>>text2.common_contexts(["monstrous", "very"])
Find position of the word in entire text
  >>>text4.dispersion_plot(["citizens", "democracy", 
  "freedom", "duties", "America"])
Counting vocab
   >>>len(text4)  #gives tokens(words and 
  punctuations)


                    (CC) KBCS CDAC MUMBAI
cont..
 Vocabulary(unique words) used by author
   >>>len(set(text1))
 See sorted vocabulary(upper case first!)
   >>>sorted(set(text1))
 Measuring richness of text(average repetition)
   >>>len(text1)/len(set(text1))
 (before division import for floating point division
 from __future__ import division)
                    (CC) KBCS CDAC MUMBAI
cont..

 Counting occurances 
      >>>text5.count("lol")
 % of the text taken
      >>> 100 * text5.count('a') / len(text5)
 let's write function for it in python 
      >>>def  prcntg_oftext(word,text):
                   return100*.count(wrd)/len(txt)
          
                        (CC) KBCS CDAC MUMBAI
cont..
 Writing our own text
   >>>textown =['a','text','is','a','list','of','words']
 adding texts(lists, can contain anything)
   >>>sent = ['pqr']
   >>>sent2 = ['xyz']
   >>>sent+sent2
 append to text
   >>>sent.append('abc')

                        (CC) KBCS CDAC MUMBAI
cont..

 Indexing text
   >>> text4[173]
 or 
   >>>text4.index('awaken')
 slicing
 text1[3:4], text[:5], text[10:] try it! (notice ending 
 index one less than specified)

                    (CC) KBCS CDAC MUMBAI
cont..

 Strings in python
   >>>name = 'astring'
   >>>name[0]
 what's more
   >>>name*2
   >>>'SPACE '.join(['let us','join'])




                     (CC) KBCS CDAC MUMBAI
cont..

 Statistics to text
 Frequency Distribution
    >>> fdist1 = FreqDist(text1)
    >>> fdist1
    >>> vocabulary1 = fdist1.keys()
    >>>vocabulary1[:25]
 let's plot it...
    >>>fdist1.plot(50, cumulative=True)
                      (CC) KBCS CDAC MUMBAI
cont..

 *what the text1 is about?
 *are all the words meaningful?
 Let's find long words i.e.
 Words from text1 with length more than  10
 in python
    >>>longw=[word for word in text1 if len(word)>10]
 let's see it... 
    >>>sorted(longw)
                    (CC) KBCS CDAC MUMBAI
cont..

 *most long words have small freq and are not 
 major topic of text many times
 *words with long length and certain freq may 
 help to identify text topic
   >>>sorted([w for w in set(text5) if len(w) > 7 and 
   fdist5[w] > 7])




                    (CC) KBCS CDAC MUMBAI
cont..
 Bigrams and collocations:
 *Bigrams
   >>>sent=['near','river', 'bank']
   >>>bigrams(sent)
 basic to many NLP apps; transliteration, 
 translation
 *collocations: frequent bigrams of rare words
   >>>text4.collocations()
 basic to some WSD approaches
                     (CC) KBCS CDAC MUMBAI
cont..
 FreDist functions:




                  (CC) KBCS CDAC MUMBAI
cont..
 Python Frequntly needed 'word' functions




                 (CC) KBCS CDAC MUMBAI
cont

 Bigger probs:
 WSD, Anaphora resolution, Machine translation
 try and laugh:
   >>>babelize_shell()
   >>>german
   >>>run 
 or try it here: 
 http://translationparty.com/#6917644
                  (CC) KBCS CDAC MUMBAI
cont..
  Playing with text corpora
  Available corporas: Gutenberg, Web and chat 
  text, Brown, Reuters, inaugural address, etc.
    >>> from nltk.corpus import inaugural
    >>> inaugural.fileids()
    >>>fileid[:4] for fileid in 
    inaugural.fileids()]


                   (CC) KBCS CDAC MUMBAI
cont..

   >>> cfd = nltk.ConditionalFreqDist(
               (target, file[:4])
               for fileid in inaugural.fileids()
               for w in inaugural.words(fileid)
               for target in ['america', 'citizen']
               if w.lower().startswith(target))
   >>> cfd.plot()
   Conditional Freq distribution of words 'america' and 
   'citizen' with different years
                         (CC) KBCS CDAC MUMBAI
cont..
 Generating text (applying bigrams)
   >>> def generate_model(cfdist, word, num=15):
       for i in range(num):
           print word,
           word = cfdist[word].max()
   ­­­
   >>>text = nltk.corpus.genesis.words('english­kjv.txt')
   >>>bigrams = nltk.bigrams(text)
   >>>cfd = nltk.ConditionalFreqDist(bigrams)
    >>print cfd['living']
    >>generate_model(cfd, 'living')
                   (CC) KBCS CDAC MUMBAI
cont..
  WordList corpora
           >>>def unusual_words(text):
                text_vocab = set(w.lower() for w in text if  
              w.isalpha())
               english_vocab = set(w.lower() for w in 
              nltk.corpus.words.words())
               unusual = text_vocab.difference(english_vocab)
               return sorted(unusual)
  ­­
           >>>unusual_words(nltk.corpus.gutenberg.words('aust
            en­sense.txt'))
  Are the unusual or misspelled words.
                      (CC) KBCS CDAC MUMBAI
cont..
 Stop words:
   >>> from nltk.corpus import stopwords
   >>> stopwords.words('english')
   ­­ Percentage of content
   >>> def content_fraction(text):
          stopwords = nltk.corpus.stopwords.words('english')
          content = [w for w in text if w.lower() not in stopwords]
          return len(content) / len(text)
   >>> content_fraction(nltk.corpus.reuters.words())
   is the useful fraction of useful words! 
                          (CC) KBCS CDAC MUMBAI
cont..

 WordNet: structured, semanticaly orinted 
 dictionary
   >>> from nltk.corpus import wordnet as wn
 get synset/collection of synonyms
   >>> wn.synsets('motorcar')
 now get synonyms(lemmas)
   >>> wn.synset('car.n.01').lemma_names


                  (CC) KBCS CDAC MUMBAI
cont..

 What does car.n.01 actually mean?
   >>> wn.synset('car.n.01').definition
 An example:
   >>> wn.synset('car.n.01').examples
 Also
   >>>wn.lemma('supply.n.02.supply').antonyms()
 Useful data in many applications like WSD, 
 understanding text, question answering, etc. 
                   (CC) KBCS CDAC MUMBAI
cont..

 Text from web:
   >>> from urllib import urlopen
   >>> url = 
   "http://www.gutenberg.org/files/2554/2554.txt"
   >>> raw = urlopen(url).read()
   >>> type(raw)
   >>> len(raw)
 is in characters...!

                   (CC) KBCS CDAC MUMBAI
cont..
 Tokenization:
   >>> tokens = nltk.word_tokenize(raw)
   >>> type(tokens)
   >>> len(tokens)
   >>> tokens[:15]
 convert to Text
   >>> text = nltk.Text(tokens)
   >>> type(text)
 Now you can do all above things with this!
                     (CC) KBCS CDAC MUMBAI
cont..
What about HTML?
  >>> url = 
  "http://news.bbc.co.uk/2/hi/health/2284783.stm"
  >>> html = urlopen(url).read()
  >>> html[:60]
It's HTML; Clean the tags...
  >>> raw = nltk.clean_html(html)
  >>> tokens = nltk.word_tokenize(raw)
  >>> tokens[:60]
  >>> text = nltk.Text(tokens)   
                     (CC) KBCS CDAC MUMBAI
cont..
 Try it yourself 
   Processing RSS feeds using universal feed parser (
   http://feedparser.org/)


 Opening text from local machine:
    >>> f = open('document.txt')
    >>> raw = f.read()
 (make sure you are in current directory)
 then tokenize and bring to Text and play with it
                    (CC) KBCS CDAC MUMBAI
cont..
 Capture user input:
   >>> s = raw_input("Enter some text: ")
   >>> print "You typed", len(nltk.word_tokenize(s)), 
   "words."
 Writing O/p to files
   >>> output_file = open('output.txt', 'w')
   >>> words = set(nltk.corpus.genesis.words('english­
   kjv.txt'))
   >>> for word in sorted(words):     
   output_file.write(word + "n")
                      (CC) KBCS CDAC MUMBAI
cont..
 Stemming : NLTK has inbuilt stemmers 
   >>>text =['apples','called','indians','applying']
   >>> porter = nltk.PorterStemmer()
   >>> lancaster = nltk.LancasterStemmer()
   >>>[porter.stem(p) for p in text]
   >>>[lancaster.stem(p) for p in text]
 Lemmatization: WordNet lemmetizer, removes 
 affixes only if the new word exist in dictionary
   >>> wnl = nltk.WordNetLemmatizer()
   >>> [wnl.lemmatize(t) for t in text]
                   (CC) KBCS CDAC MUMBAI
cont..

 Using a POS tagger:
   >>> text = nltk.word_tokenize("And now for 
   something completely different")


 Query the POS tag documentation as:
   >>>nltk.help.upenn_tagset('NNP')
 Parsing with NLTK
             an example  
                     (CC) KBCS CDAC MUMBAI
cont..
 Classifying with NLTK:
 Problem: Gender identification from name
 Feature last letter 
             >>> def gender_features(word):
                        return {'last_letter': word[­1]}
 Get data
   >>> from nltk.corpus import names
   >>> import random



                       (CC) KBCS CDAC MUMBAI
cont..
    >>> names = ([(name, 'male') for name in 
    names.words('male.txt')] +
              [(name, 'female') for name in 
    names.words('female.txt')])
    >>> random.shuffle(names)
 Get Feature set
     >>> featuresets = [(gender_features(n), g) for (n,g) in 
    names]
    >>> train_set, test_set = featuresets[500:], featuresets[:500]
 Train
    >>> classifier = nltk.NaiveBayesClassifier.train(train_set)
 Test                    (CC) KBCS CDAC MUMBAI

    >>> classifier.classify(gender_features('Neo'))
Conclusion

 NLTK is rich resource for starting with Natural 
 Language Processing and it's being open 
 source and based on python adds feathers to 
 it's cap!




                  (CC) KBCS CDAC MUMBAI
References



➢   http://www.nltk.org/
➢   http://www.python.org/
➢   http://wikipedia.org/ 
     (all accessed latest on 19th March 2010)
➢   Natural Language Processing with Python, Steven Bird, Ewan Klein, and 
    Edward Loper, O'REILLY


                              (CC) KBCS CDAC MUMBAI
                 Thank you 



          #More on NLP @ http://matra2.blogspot.com/
           
           #Slides were  prepared for 'supporting' the talk and are distributed as it is under CC 3.0
                               (CC) KBCS CDAC MUMBAI

More Related Content

What's hot

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Yasir Khan
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)
Kuppusamy P
 

What's hot (20)

NLP
NLPNLP
NLP
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
 
NLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easyNLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easy
 
NLTK in 20 minutes
NLTK in 20 minutesNLTK in 20 minutes
NLTK in 20 minutes
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Text Classification
Text ClassificationText Classification
Text Classification
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Tutorial on word2vec
Tutorial on word2vecTutorial on word2vec
Tutorial on word2vec
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Text mining
Text miningText mining
Text mining
 
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
 
Word embedding
Word embedding Word embedding
Word embedding
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 

Viewers also liked

Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
Jaganadh Gopinadhan
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
Fasihul Kabir
 
Information retrieval using yelp dataset
Information retrieval using yelp datasetInformation retrieval using yelp dataset
Information retrieval using yelp dataset
Maiyaporn Phanich
 
TheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkeyTheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkey
Morgan Elizabeth Romano
 
Semantic Twitter Analyzing Tweets For Real Time Event Notification
Semantic Twitter Analyzing Tweets For Real Time Event NotificationSemantic Twitter Analyzing Tweets For Real Time Event Notification
Semantic Twitter Analyzing Tweets For Real Time Event Notification
okazaki117
 
semantic social network analysis
semantic social network analysissemantic social network analysis
semantic social network analysis
guillaume ereteo
 

Viewers also liked (20)

NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Python
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHug
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Information retrieval using yelp dataset
Information retrieval using yelp datasetInformation retrieval using yelp dataset
Information retrieval using yelp dataset
 
Nd4 j slides.pptx
Nd4 j slides.pptxNd4 j slides.pptx
Nd4 j slides.pptx
 
TheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkeyTheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkey
 
Corpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKCorpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTK
 
Semantic Twitter Analyzing Tweets For Real Time Event Notification
Semantic Twitter Analyzing Tweets For Real Time Event NotificationSemantic Twitter Analyzing Tweets For Real Time Event Notification
Semantic Twitter Analyzing Tweets For Real Time Event Notification
 
semantic social network analysis
semantic social network analysissemantic social network analysis
semantic social network analysis
 
IEEE 2016-2017 SOFTWARE TITLE
IEEE 2016-2017  SOFTWARE TITLE IEEE 2016-2017  SOFTWARE TITLE
IEEE 2016-2017 SOFTWARE TITLE
 
Nltk
NltkNltk
Nltk
 
Ijcai ip-2015 cyberbullying-final
Ijcai ip-2015 cyberbullying-finalIjcai ip-2015 cyberbullying-final
Ijcai ip-2015 cyberbullying-final
 
Nltk - Boston Text Analytics
Nltk - Boston Text AnalyticsNltk - Boston Text Analytics
Nltk - Boston Text Analytics
 
Data mining on yelp dataset
Data mining on yelp datasetData mining on yelp dataset
Data mining on yelp dataset
 
Aisb cyberbullying
Aisb cyberbullyingAisb cyberbullying
Aisb cyberbullying
 
Python an-intro - odp
Python an-intro - odpPython an-intro - odp
Python an-intro - odp
 

Similar to Natural Language Toolkit (NLTK), Basics

Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)
Rebecca Bilbro
 
Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)
Vitaly Baum
 
Spoofax: ontwikkeling van domeinspecifieke talen in Eclipse
Spoofax: ontwikkeling van domeinspecifieke talen in EclipseSpoofax: ontwikkeling van domeinspecifieke talen in Eclipse
Spoofax: ontwikkeling van domeinspecifieke talen in Eclipse
Devnology
 
The Art Of Readable Code
The Art Of Readable CodeThe Art Of Readable Code
The Art Of Readable Code
Baidu, Inc.
 

Similar to Natural Language Toolkit (NLTK), Basics (20)

NLTK introduction
NLTK introductionNLTK introduction
NLTK introduction
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Modern C++
Modern C++Modern C++
Modern C++
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)
 
Compiler Construction | Lecture 13 | Code Generation
Compiler Construction | Lecture 13 | Code GenerationCompiler Construction | Lecture 13 | Code Generation
Compiler Construction | Lecture 13 | Code Generation
 
Parm
ParmParm
Parm
 
Parm
ParmParm
Parm
 
Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)
 
Return of c++
Return of c++Return of c++
Return of c++
 
NativeBoost
NativeBoostNativeBoost
NativeBoost
 
Static PIE, How and Why - Metasploit's new POSIX payload: Mettle
Static PIE, How and Why - Metasploit's new POSIX payload: MettleStatic PIE, How and Why - Metasploit's new POSIX payload: Mettle
Static PIE, How and Why - Metasploit's new POSIX payload: Mettle
 
Spoofax: ontwikkeling van domeinspecifieke talen in Eclipse
Spoofax: ontwikkeling van domeinspecifieke talen in EclipseSpoofax: ontwikkeling van domeinspecifieke talen in Eclipse
Spoofax: ontwikkeling van domeinspecifieke talen in Eclipse
 
The Art Of Readable Code
The Art Of Readable CodeThe Art Of Readable Code
The Art Of Readable Code
 
Introduction to-linux
Introduction to-linuxIntroduction to-linux
Introduction to-linux
 
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun..."ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
 
The Ring programming language version 1.5.4 book - Part 180 of 185
The Ring programming language version 1.5.4 book - Part 180 of 185The Ring programming language version 1.5.4 book - Part 180 of 185
The Ring programming language version 1.5.4 book - Part 180 of 185
 
The Ring programming language version 1.5.2 book - Part 176 of 181
The Ring programming language version 1.5.2 book - Part 176 of 181The Ring programming language version 1.5.2 book - Part 176 of 181
The Ring programming language version 1.5.2 book - Part 176 of 181
 
cs262_intro_slides.pptx
cs262_intro_slides.pptxcs262_intro_slides.pptx
cs262_intro_slides.pptx
 
Presentation of Python, Django, DockerStack
Presentation of Python, Django, DockerStackPresentation of Python, Django, DockerStack
Presentation of Python, Django, DockerStack
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 

More from Prakash Pimpale

More from Prakash Pimpale (6)

Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for Classification
 
Patent Search
Patent SearchPatent Search
Patent Search
 
Technology Entrepreneurship Venture Lab 2012 - Trunk monkey
Technology Entrepreneurship Venture Lab 2012 - Trunk monkeyTechnology Entrepreneurship Venture Lab 2012 - Trunk monkey
Technology Entrepreneurship Venture Lab 2012 - Trunk monkey
 
Entrepreneurship, Entrepreneurs and Startups
Entrepreneurship, Entrepreneurs and Startups Entrepreneurship, Entrepreneurs and Startups
Entrepreneurship, Entrepreneurs and Startups
 
Collaboration tools in education
Collaboration tools in educationCollaboration tools in education
Collaboration tools in education
 
Genetic Algorithms Made Easy
Genetic Algorithms Made EasyGenetic Algorithms Made Easy
Genetic Algorithms Made Easy
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Natural Language Toolkit (NLTK), Basics

  • 1. Natural Language Toolkit [NLTK]                                           Prakash B Pimpale                                                 pbpimpale@gmail.com @ FOSS(From the Open Source Shelf)  An open source softwares seminar series (CC) KBCS CDAC MUMBAI
  • 2. Let's start ➢ What is it about? ➢ NLTK – installing and using it for some 'basic NLP'  and Text Processing tasks ➢ Some NLP and Some python  ➢ How will we proceed ? ➢ NLTK introduction ➢ NLP introduction ➢ NLTK installation ➢ Playing with text using NLTK  ➢ Conclusion (CC) KBCS CDAC MUMBAI
  • 3. NLTK ➢ Open source python modules and linguistic  data for Natural Langauge Processing  application  ➢ Dveloped by group of people, project leaders:  Steven Bird, Edward Loper, Ewan Klein  ➢ Python: simple, free and open source, portable,  object oriented, interpreted  (CC) KBCS CDAC MUMBAI
  • 4. NLP ➢ Natural Language Processing: field of computer science  and linguistics woks out interactions between computers  and human (natural) languages. ➢ some brand ambassadors: machine translation,  automatic summarization, information extraction,  transliteration, question answering, opinion mining   ➢ Basic tasks: stemming, POS tagging, chunking, parsing,  etc. ➢ Involves simple frequency count to undertanding and  generating complex text  (CC) KBCS CDAC MUMBAI
  • 5. Terms/Steps in NLP ➢ Tokenization: getting words and punctuations  out from text ➢ Stemming: getting the (inflectional) root of word;  plays, playing, played : play   (CC) KBCS CDAC MUMBAI
  • 7. cont.. ➢ Chunking: groups the similar POS tags (liberal  def) (CC) KBCS CDAC MUMBAI
  • 9. Applications ➢ Machine Translation MaTra, Google translate ➢ Transliteration  Xlit, Google Indic ➢ Text classification Spam filters many more.......! (CC) KBCS CDAC MUMBAI
  • 10. Challenges ➢ Word sense disambiguation  Bank(financial/river)  Plant(industrial/natural) ➢ Name entity recognition CDAC is in Mumbai ➢ Clause Boundary detection Ram who was the king of Ayodhya killed  Ravana  and many more.....! (CC) KBCS CDAC MUMBAI
  • 11. Installing NLTK Windows ➢ Install Python: Version 2.4.*, 2.5.*, or 2.6.* and  not 3.0 available http://www.nltk.org/download  [great instructions!] ➢ Install PyYAML: ➢ Install Numpy ➢ Install Matplotlib ➢ Install NLTK (CC) KBCS CDAC MUMBAI
  • 12. cont.. Linux/Unix ➢ Most likely python will be there (if not­>  package manager) ➢ Get python­nympy and python­scipy packages  (scintific library for multidimentional array and  linear algebra), use package manager to install  or  'sudo apt­get install python­numpy python­ scipy' (CC) KBCS CDAC MUMBAI
  • 13. cont.. NLTK Source Installation ➢ Download NLTK source  ( http://nltk.googlecode.com/files/nltk­2.0b8.zip) ➢ Unzip it ➢ Go to the new unziped folder ➢ Just do it!  'sudo python setup.py install' Done :­) ! (CC) KBCS CDAC MUMBAI
  • 14. Installing Data ➢ We need data to experiment with It's too simple (same for all platform)... $python            #(IDLE) enter to python prompt  >>> import nltk >>> nltk.download() select any one option that you like/need (CC) KBCS CDAC MUMBAI
  • 15. cont.. (CC) KBCS CDAC MUMBAI
  • 16. cont.. Test installation ( set path if needed: as  export NLTK_DATA = '/home/....../nltk_data')    >>> from nltk.corpus import brown    >>> brown.words()[0:20]     ['The', 'Fulton', 'County', 'Grand', 'Jury']                     All set! (CC) KBCS CDAC MUMBAI
  • 17. Playing with text Feeling python  $python >>>print 'Hello world' >>>2+3+4      Okay ! (CC) KBCS CDAC MUMBAI
  • 18. cont.. Data from book: >>>from nltk.book import * see what is it?  >>>text1 Search the word: >>> text1.concordance("monstrous") Find similar words  >>> text1.similar("monstrous") (CC) KBCS CDAC MUMBAI
  • 19. cont.. Find common context >>>text2.common_contexts(["monstrous", "very"]) Find position of the word in entire text >>>text4.dispersion_plot(["citizens", "democracy",  "freedom", "duties", "America"]) Counting vocab  >>>len(text4)  #gives tokens(words and  punctuations) (CC) KBCS CDAC MUMBAI
  • 20. cont.. Vocabulary(unique words) used by author >>>len(set(text1)) See sorted vocabulary(upper case first!) >>>sorted(set(text1)) Measuring richness of text(average repetition) >>>len(text1)/len(set(text1)) (before division import for floating point division from __future__ import division) (CC) KBCS CDAC MUMBAI
  • 21. cont.. Counting occurances  >>>text5.count("lol") % of the text taken >>> 100 * text5.count('a') / len(text5) let's write function for it in python  >>>def  prcntg_oftext(word,text):              return100*.count(wrd)/len(txt)              (CC) KBCS CDAC MUMBAI
  • 22. cont.. Writing our own text >>>textown =['a','text','is','a','list','of','words'] adding texts(lists, can contain anything) >>>sent = ['pqr'] >>>sent2 = ['xyz'] >>>sent+sent2 append to text >>>sent.append('abc') (CC) KBCS CDAC MUMBAI
  • 23. cont.. Indexing text >>> text4[173] or  >>>text4.index('awaken') slicing text1[3:4], text[:5], text[10:] try it! (notice ending  index one less than specified) (CC) KBCS CDAC MUMBAI
  • 24. cont.. Strings in python >>>name = 'astring' >>>name[0] what's more >>>name*2 >>>'SPACE '.join(['let us','join']) (CC) KBCS CDAC MUMBAI
  • 25. cont.. Statistics to text Frequency Distribution >>> fdist1 = FreqDist(text1) >>> fdist1 >>> vocabulary1 = fdist1.keys() >>>vocabulary1[:25] let's plot it... >>>fdist1.plot(50, cumulative=True) (CC) KBCS CDAC MUMBAI
  • 26. cont.. *what the text1 is about? *are all the words meaningful? Let's find long words i.e. Words from text1 with length more than  10 in python >>>longw=[word for word in text1 if len(word)>10] let's see it...  >>>sorted(longw) (CC) KBCS CDAC MUMBAI
  • 27. cont.. *most long words have small freq and are not  major topic of text many times *words with long length and certain freq may  help to identify text topic >>>sorted([w for w in set(text5) if len(w) > 7 and  fdist5[w] > 7]) (CC) KBCS CDAC MUMBAI
  • 28. cont.. Bigrams and collocations: *Bigrams >>>sent=['near','river', 'bank'] >>>bigrams(sent) basic to many NLP apps; transliteration,  translation *collocations: frequent bigrams of rare words >>>text4.collocations() basic to some WSD approaches (CC) KBCS CDAC MUMBAI
  • 29. cont.. FreDist functions: (CC) KBCS CDAC MUMBAI
  • 31. cont Bigger probs: WSD, Anaphora resolution, Machine translation try and laugh: >>>babelize_shell() >>>german >>>run  or try it here:  http://translationparty.com/#6917644 (CC) KBCS CDAC MUMBAI
  • 32. cont.. Playing with text corpora Available corporas: Gutenberg, Web and chat  text, Brown, Reuters, inaugural address, etc. >>> from nltk.corpus import inaugural >>> inaugural.fileids() >>>fileid[:4] for fileid in  inaugural.fileids()] (CC) KBCS CDAC MUMBAI
  • 33. cont.. >>> cfd = nltk.ConditionalFreqDist(             (target, file[:4])             for fileid in inaugural.fileids()             for w in inaugural.words(fileid)             for target in ['america', 'citizen']             if w.lower().startswith(target)) >>> cfd.plot() Conditional Freq distribution of words 'america' and  'citizen' with different years (CC) KBCS CDAC MUMBAI
  • 34. cont.. Generating text (applying bigrams) >>> def generate_model(cfdist, word, num=15):     for i in range(num):         print word,         word = cfdist[word].max() ­­­ >>>text = nltk.corpus.genesis.words('english­kjv.txt') >>>bigrams = nltk.bigrams(text) >>>cfd = nltk.ConditionalFreqDist(bigrams)  >>print cfd['living']  >>generate_model(cfd, 'living') (CC) KBCS CDAC MUMBAI
  • 35. cont.. WordList corpora >>>def unusual_words(text):      text_vocab = set(w.lower() for w in text if   w.isalpha())     english_vocab = set(w.lower() for w in  nltk.corpus.words.words())     unusual = text_vocab.difference(english_vocab)     return sorted(unusual) ­­ >>>unusual_words(nltk.corpus.gutenberg.words('aust en­sense.txt')) Are the unusual or misspelled words. (CC) KBCS CDAC MUMBAI
  • 36. cont.. Stop words: >>> from nltk.corpus import stopwords >>> stopwords.words('english') ­­ Percentage of content >>> def content_fraction(text):        stopwords = nltk.corpus.stopwords.words('english')        content = [w for w in text if w.lower() not in stopwords]        return len(content) / len(text) >>> content_fraction(nltk.corpus.reuters.words()) is the useful fraction of useful words!  (CC) KBCS CDAC MUMBAI
  • 37. cont.. WordNet: structured, semanticaly orinted  dictionary >>> from nltk.corpus import wordnet as wn get synset/collection of synonyms >>> wn.synsets('motorcar') now get synonyms(lemmas) >>> wn.synset('car.n.01').lemma_names (CC) KBCS CDAC MUMBAI
  • 38. cont.. What does car.n.01 actually mean? >>> wn.synset('car.n.01').definition An example: >>> wn.synset('car.n.01').examples Also >>>wn.lemma('supply.n.02.supply').antonyms() Useful data in many applications like WSD,  understanding text, question answering, etc.  (CC) KBCS CDAC MUMBAI
  • 39. cont.. Text from web: >>> from urllib import urlopen >>> url =  "http://www.gutenberg.org/files/2554/2554.txt" >>> raw = urlopen(url).read() >>> type(raw) >>> len(raw) is in characters...! (CC) KBCS CDAC MUMBAI
  • 40. cont.. Tokenization: >>> tokens = nltk.word_tokenize(raw) >>> type(tokens) >>> len(tokens) >>> tokens[:15] convert to Text >>> text = nltk.Text(tokens) >>> type(text) Now you can do all above things with this! (CC) KBCS CDAC MUMBAI
  • 41. cont.. What about HTML? >>> url =  "http://news.bbc.co.uk/2/hi/health/2284783.stm" >>> html = urlopen(url).read() >>> html[:60] It's HTML; Clean the tags... >>> raw = nltk.clean_html(html) >>> tokens = nltk.word_tokenize(raw) >>> tokens[:60] >>> text = nltk.Text(tokens)    (CC) KBCS CDAC MUMBAI
  • 42. cont.. Try it yourself  Processing RSS feeds using universal feed parser ( http://feedparser.org/) Opening text from local machine:  >>> f = open('document.txt')  >>> raw = f.read() (make sure you are in current directory) then tokenize and bring to Text and play with it (CC) KBCS CDAC MUMBAI
  • 43. cont.. Capture user input: >>> s = raw_input("Enter some text: ") >>> print "You typed", len(nltk.word_tokenize(s)),  "words." Writing O/p to files >>> output_file = open('output.txt', 'w') >>> words = set(nltk.corpus.genesis.words('english­ kjv.txt')) >>> for word in sorted(words):      output_file.write(word + "n") (CC) KBCS CDAC MUMBAI
  • 44. cont.. Stemming : NLTK has inbuilt stemmers  >>>text =['apples','called','indians','applying'] >>> porter = nltk.PorterStemmer() >>> lancaster = nltk.LancasterStemmer() >>>[porter.stem(p) for p in text] >>>[lancaster.stem(p) for p in text] Lemmatization: WordNet lemmetizer, removes  affixes only if the new word exist in dictionary >>> wnl = nltk.WordNetLemmatizer() >>> [wnl.lemmatize(t) for t in text] (CC) KBCS CDAC MUMBAI
  • 45. cont.. Using a POS tagger: >>> text = nltk.word_tokenize("And now for  something completely different") Query the POS tag documentation as: >>>nltk.help.upenn_tagset('NNP') Parsing with NLTK           an example   (CC) KBCS CDAC MUMBAI
  • 46. cont.. Classifying with NLTK: Problem: Gender identification from name Feature last letter  >>> def gender_features(word):            return {'last_letter': word[­1]} Get data >>> from nltk.corpus import names >>> import random (CC) KBCS CDAC MUMBAI
  • 47. cont.. >>> names = ([(name, 'male') for name in  names.words('male.txt')] +           [(name, 'female') for name in  names.words('female.txt')]) >>> random.shuffle(names) Get Feature set  >>> featuresets = [(gender_features(n), g) for (n,g) in  names] >>> train_set, test_set = featuresets[500:], featuresets[:500] Train >>> classifier = nltk.NaiveBayesClassifier.train(train_set) Test (CC) KBCS CDAC MUMBAI >>> classifier.classify(gender_features('Neo'))
  • 48. Conclusion NLTK is rich resource for starting with Natural  Language Processing and it's being open  source and based on python adds feathers to  it's cap! (CC) KBCS CDAC MUMBAI
  • 49. References ➢ http://www.nltk.org/ ➢ http://www.python.org/ ➢ http://wikipedia.org/   (all accessed latest on 19th March 2010) ➢ Natural Language Processing with Python, Steven Bird, Ewan Klein, and  Edward Loper, O'REILLY (CC) KBCS CDAC MUMBAI
  • 50.                  Thank you      #More on NLP @ http://matra2.blogspot.com/                #Slides were  prepared for 'supporting' the talk and are distributed as it is under CC 3.0 (CC) KBCS CDAC MUMBAI