Natural Language Processing and Python

Natural Language Processing
+ Python
by Ann C. Tan-Pohlmann

February 22, 2014

Outline
• NLP Basics
• NLTK
– Text Processing

• Gensim (really, really short )
– Text Classification

2

Natural Language Processing
• computer science, artificial intelligence, and
linguistics
• human–computer interaction
• natural language understanding
• natural language generation
- Wikipedia

3

Star Trek's Universal Translator

http://www.youtube.com/watch?v=EaeSKU
V2zp0

NLP Basics
• Morphology
– study of word formation
– how word forms vary in a sentence

• Syntax
– branch of grammar
– how words are arranged in a sentence to show
connections of meaning

• Semantics
– study of meaning of words, phrases and sentences
6

NLTK: Getting Started
• Natural Language Took Kit
– for symbolic and statistical NLP
– teaching tool, study tool and as a platform for prototyping

• Python 2.7 is a prerequisite
>>> import nltk
>>> nltk.download()

7

Some NLTK methods
•
•
•
•
•

Frequency Distribution

text.similar(str)
concordance(str)
len(text)
len(set(text))
lexical_diversity

•
•
•
•
•

– len(text)/
len(set(text))

fd = FreqDist(text)
fd.inc(str)
fd[str]
fd.N()
fd.max()

• text.collocations()
- sequence of words that occur
together often

MORPHOLOGY > Syntax > Semantics

8

Frequency Distribution
•
•
•
•
•

fd = FreqDist(text)
fd.inc(str) – increment count
fd[str] – returns the number of occurrence for sample str
fd.N() – total number of samples
fd.max() – sample with the greatest count

9

Corpus
• large collection of raw or categorized text on
one or more domain
• Examples: Gutenberg, Brown, Reuters, Web &
Chat Txt
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', '
humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> adventure_text = brown.words(categories='adventure')

10

Corpora in Other Languages
>>> from nltk.corpus import udhr
>>> languages = nltk.corpus.udhr.fileids()
>>> languages.index('Filipino_Tagalog-Latin1')
>>> tagalog = nltk.corpus.udhr.raw('Filipino_Tagalog-Latin1')
>>> tagalog_words = nltk.corpus.udhr.words('Filipino_Tagalog-Latin1')
>>> tagalog_tokens = nltk.word_tokenize(tagalog)
>>> tagalog_text = nltk.Text(tagalog_tokens)
>>> fd = FreqDist(tagalog_text)
>>> for sample in fd:
... print sample

11

Using Corpus from Palito
Corpus
– large collection of raw or categorized text
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_dir = '/Users/ann/Downloads'
>>> tagalog = PlaintextCorpusReader(corpus_dir,
'Tagalog_Literary_Text.txt')
>>> raw = tagalog.raw()
>>> sentences = tagalog.sents()
>>> words = tagalog.words()
>>> tokens = nltk.word_tokenize(raw)
>>> tagalog_text = nltk.Text(tokens)
12

Spoken Dialog Systems


13

Tokenization
Tokenization
– breaking up of string into words and punctuations

>>> tokens = nltk.word_tokenize(raw)
>>> tagalog_tokens = nltk.Text(tokens)
>>> tagalog_tokens = set(sample.lower() for sample in tagalog_tokens)


14

Stemming
Stemming
– normalize words into its base form, result may not be the 'root' word
>>> def stem(word):
... for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...
if word.endswith(suffix):
...
return word[:-len(suffix)]
... return word
...
>>> stem('reading')
'read'
>>> stem('moment')
'mo'


15

Lemmatization
Lemmatization
– uses vocabulary list and morphological analysis (uses POS of a word)
>>> def stem(word):
... for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...
if word.endswith(suffix) and word[:-len(suffix)] in brown.words():
...
return word[:-len(suffix)]
... return word
...
>>> stem('reading')
'read'
>>> stem('moment')
'moment'


16

NLTK Stemmers & Lemmatizer
• Porter Stemmer and Lancaster Stemmer
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(w) for w in brown.words()[:100]]

• Word Net Lemmatizer
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(w) for w in brown.words()[:100]]

• Comparison
>>> [wnl.lemmatize(w) for w in ['investigation', 'women']]
>>> [porter.stem(w) for w in ['investigation', 'women']]
>>> [lancaster.stem(w) for w in ['investigation', 'women']]


17

Using Regular Expression
Operator
.
^abc
abc$
[abc]
[A-Z0-9]
ed|ing|s
*
+
?
{n}
{n,}
{,n}
{m,n}
a(b|c)+

Behavior
Wildcard, matches any character
Matches some pattern abc at the start of a string
Matches some pattern abc at the end of a string
Matches one of a set of characters
Matches one of a range of characters
Matches one of the specified strings (disjunction)
Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
One or more of previous item, e.g. a+, [a-z]+
Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
Exactly n repeats where n is a non-negative integer
At least n repeats
No more than n repeats
At least m and no more than n repeats
Parentheses that indicate the scope of the operators


18

Using Regular Expression
>>> import re
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'reading')
[('read', 'ing')]
>>> def stem(word):
... regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
... stem, suffix = re.findall(regexp, word)[0]
... return stem
...
>>> stem('reading')
'read'
>>> stem('moment')
'moment'


19

Spoken Dialog Systems

Morphology > SYNTAX > Semantics

20

Lexical Resources
• collection of words with association information (annotation)
• Ex: stopwords – high-frequency words with little lexical
content
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
>>> stopwords.words('german')


21

Part-of-Speech (POS) Tagging
• the process of labeling and classifying words
to a particular part of speech based on its
definition and context


22

NLTKs POS Tag Sets* – 1/2
Tag
ADJ
ADV
CNJ
DET
EX
FW
MOD
N
NP

Meaning
adjective
adverb
conjunction
determiner
existential
foreign word
modal verb
noun
proper noun

Examples
new, good, high, special, big, local
really, already, still, early, now
and, or, but, if, while, although
the, a, some, most, every, no
there, there's
dolce, ersatz, esprit, quo, maitre
will, can, would, may, must, should
year, home, costs, time, education
Alison, Africa, April, Washington

*simplified

23

NLTKs POS Tag Sets* – 2/2
Tag
NUM
PRO
P
TO
UH
V
VD
VG
VN
WH

Meaning
number
pronoun
preposition
the word to
interjection
verb
past tense
present participle
past participle
wh determiner

Examples
twenty-four, fourth, 1991, 14:24
he, their, her, its, my, I, us
on, of, at, with, by, into, under
to
ah, bang, ha, whee, hmpf, oops
is, has, get, do, make, see, run
said, took, told, made, asked
making, going, playing, working
given, taken, begun, sung
who, which, when, what, where, how

*simplified

24

NLTK POS Tagger (Brown)
>>> nltk.pos_tag(brown.words()[:30])
[('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'),
('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'),
('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'JJ'), ('recent', 'JJ'),
('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBN'), ('``', '``'), ('no',
'DT'), ('evidence', 'NN'), ("''", "''"), ('that', 'WDT'), ('any', 'DT'),
('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.'), ('The',
'DT'), ('jury', 'NN'), ('further', 'RB'), ('said', 'VBD'), ('in', 'IN')]
>>> brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]


25

NLTK POS Tagger (German)
>>> german = nltk.corpus.europarl_raw.german
>>> nltk.pos_tag(german.words()[:30])
[(u'Wiederaufnahme', 'NNP'), (u'der', 'NN'), (u'Sitzungsperiode', 'NNP'),
(u'Ich', 'NNP'), (u'erklxe4re', 'NNP'), (u'die', 'VB'), (u'am', 'NN'), (u'Freita
g', 'NNP'), (u',', ','), (u'dem', 'NN'), (u'17.', 'CD'), (u'Dezember', 'NNP'), (u'
unterbrochene', 'NN'), (u'Sitzungsperiode', 'NNP'), (u'des', 'VBZ'), (u'Eur
opxe4ischen', 'JJ'), (u'Parlaments', 'NNS'), (u'fxfcr', 'JJ'), (u'wiederaufg
enommen', 'NNS'), (u',', ','), (u'wxfcnsche', 'NNP'), (u'Ihnen', 'NNP'), (u'
nochmals', 'NNS'), (u'alles', 'VBZ'), (u'Gute', 'NNP'), (u'zum', 'NN'), (u'Ja
hreswechsel', 'NNP'), (u'und', 'NN'), (u'hoffe', 'NN'), (u',', ',')]

xe4 = ä xfc = ü
!!! DOES NOT WORK FOR GERMAN


26

NLTK POS Dictionary
>>> pos = nltk.defaultdict(lambda:'N')
>>> pos['eat']
'N'
>>> pos.items()
[('eat', 'N')]
>>> for (word, tag) in brown.tagged_words(simplify_tags=True):
... if word in pos:
...
if isinstance(pos[word], str):
...
new_list = [pos[word]]
...
pos[word] = new_list
...
if tag not in pos[word]:
...
pos[word].append(tag)
... else:
...
pos[word] = [tag]
...
>>> pos['eat']
['N', 'V']

27

What else can you do with NLTK?
• Other Taggers
– Unigram Tagging
• nltk.UnigramTagger()
• train tagger using tagged sentence data

– N-gram Tagging

• Text classification using machine learning
techniques
– decision trees
– naïve Bayes classification (supervised)
– Markov Models
Morphology > SYNTAX > SEMANTICS

28

Gensim
• Tool that extracts semantic structure of
documents, by examining word statistical cooccurrence patterns within a corpus of
training documents.
• Algorithms:
1. Latent Semantic Analysis (LSA)
2. Latent Dirichlet Allocation (LDA) or Random
Projections
Morphology > Syntax > SEMANTICS

29

Gensim
• Features
– memory independent
– wrappers/converters for several data formats

• Vector
– representation of the document as an array of features or
question-answer pair
1.
2.
3.

(word occurrence, count)
(paragraph, count)
(font, count)

• Model
– transformation from one vector to another
– learned from a training corpus without supervision
Morphology > Syntax > SEMANTICS

30

Wiki document classification

http://radimrehurek.com/gensim/wiki.html

31

Other NLP tools for Python
• TextBlob
– part-of-speech tagging, noun phrase extraction,
sentiment analysis, classification, translation
– https://pypi.python.org/pypi/textblob

• Pattern
– part-of-speech taggers, n-gram search, sentiment
analysis, WordNet, machine learning
– http://www.clips.ua.ac.be/pattern
32

Star Trek technology that became a reality

http://www.youtube.com/watch?v=sRZxwR
IH9RI

Installation Guides
• NLTK
– http://www.nltk.org/install.html
– http://www.nltk.org/data.html

• Gensim
– http://radimrehurek.com/gensim/install.html

• Palito
– http://ccs.dlsu.edu.ph:8086/Palito/find_project.js
p
34

Using iPython
• http://ipython.org/install.html
>>> documents = ["Human machine interface for lab abc computer applications",
>>>
"A survey of user opinion of computer system response time",
>>>
"The EPS user interface management system",
>>>
"System and human system engineering testing of EPS",
>>>
"Relation of user perceived response time to error measurement",
>>>
"The generation of random binary unordered trees",
>>>
"The intersection graph of paths in trees",
>>>
"Graph minors IV Widths of trees and well quasi ordering",
>>>
"Graph minors A survey"]

35

References
• Natural Language Processing with Python By
Steven Bird, Ewan Klein, Edward Loper
• http://www.nltk.org/book/
• http://radimrehurek.com/gensim/tutorial.htm
l

36

Thank You!
• For questions and comments:
- ann at auberonsolutions dot com

37

Natural Language Processing and Python

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Natural Language Processing and Python

Similaire à Natural Language Processing and Python (20)

Dernier

Dernier (20)

Natural Language Processing and Python