Introduction to Natural Language Processing:
- Fiction vs Reality
- Complexities of NLP
- NLP with Python: NLTK, Gensim, TextBlob
(stopword removal, part-of-speech tagging, TFxIDF, text categorization, sentiment analysis)
- What's next
2. Who am I?
Alyona Medelyan, aka @zelandiya
▪ In Natural Language Processing since 2000
▪ PhD in NLP & Machine Learning from Waikato
▪ Author of the open source keyword extraction algorithm Maui
▪ Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia”
▪ Past: Chief Research Officer at Pingar
▪ Now: Founder of Entopix, NLP consultancy & software development
3. Agenda
State of NLP
Recap on fiction vs reality: Are we there yet?
NLP Complexities
Why is understanding language so complex?
NLP using Python
NLTK, Gensim, TextBlob & Co
Building NLP applications
A little bit of data science
Other NLP areas
And what’s coming next
9. Two girls use Google Translate to call a real Indian restaurant and order in Hindi…
How did it go? www.youtube.com/watch?v=wxDRburxwz8
10. The LCARS (or simply library computer) … used sophisticated
artificial intelligence routines to understand and execute vocal natural
language commands (From Memory Alpha Wiki)
16. Word segmentation complexities
▪ 广大发展中国家一致支持这个目标,并提出了各自的期望细节。
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
19. What can we do with text?
[Diagram: text goes in; out come sentiment, keywords, tags, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, …]
21. How to get to the core words?
Remove Stopwords with NLTK
even the acting in transcendence is solid , with the dreamy
depp turning in a typically strong performance
i think that transcendence has a pretty solid acting, with the
dreamy depp turning in a strong performance as he usually does
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> words = ['the', 'acting', 'in', 'transcendence', 'is',
...          'solid', 'with', 'the', 'dreamy', 'depp']
>>> print([word for word in words if word not in stop])
['acting', 'transcendence', 'solid', 'dreamy', 'depp']
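The two phrasings of the review above say the same thing with different filler words. As a rough sketch of why stopword removal helps, the snippet below filters both sentences and intersects what is left. The small `STOP` set and the `core_words` helper are hypothetical stand-ins for `stopwords.words('english')`, used here only so the example runs without downloading NLTK data.

```python
import re

# Hypothetical mini stopword set; a stand-in for nltk's stopwords.words('english')
STOP = {"i", "that", "a", "the", "in", "is", "with", "as", "he", "has", "does"}

def core_words(sentence, stop=STOP):
    """Lowercase the sentence, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in stop]

first = ("even the acting in transcendence is solid , with the dreamy "
         "depp turning in a typically strong performance")
second = ("i think that transcendence has a pretty solid acting, with the "
          "dreamy depp turning in a strong performance as he usually does")

# Words that survive filtering in both phrasings
shared = set(core_words(first)) & set(core_words(second))
print(sorted(shared))
```

Once the function words are gone, both phrasings share the same core: the film, the actor, and the evaluative words.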
22. Getting closer to the meaning:
Part of Speech tagging with NLTK
Flying planes can be dangerous
✓
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> nltk.pos_tag(word_tokenize("Flying planes can be dangerous"))
[('Flying', 'VBG'), ('planes', 'NNS'), ('can', 'MD'),
('be', 'VB'), ('dangerous', 'JJ')]
23. Keyword scoring:
TFxIDF
TF: relative frequency of a term t in a document d
IDF: the inverse proportion of documents d in collection D mentioning term t
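The two factors above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the base-2 logarithm in `idf`, the function names, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tf(term, doc):
    """Relative frequency of `term` in one tokenized document."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log2(number of documents / number of documents mentioning `term`)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log2(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    """TFxIDF score of `term` in `doc`, relative to the collection `docs`."""
    return tf(term, doc) * idf(term, docs)

# Toy collection (hypothetical), already tokenized
docs = [
    ["a", "solid", "film", "with", "a", "strong", "cast"],
    ["a", "violent", "film"],
    ["a", "comedy", "film", "about", "a", "film"],
    ["a", "quiet", "comedy"],
]

for term in ["a", "film", "comedy"]:
    print(term, tfidf(term, docs[2], docs))
```

In the third toy document, "film" occurs twice and "comedy" once, yet "comedy" scores higher because fewer documents mention it, and "a" scores zero because every document contains it.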
24. TFxIDF with Gensim
from nltk.corpus import movie_reviews
from gensim import corpora, models

# Collect each review as a list of tokens
texts = []
for fileid in movie_reviews.fileids():
    texts.append(movie_reviews.words(fileid))

# Map tokens to integer ids, then turn each review into a bag of words
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
25. TFxIDF with Gensim (Results)
for word in ['film', 'movie', 'comedy',
             'violence', 'jolie']:
    my_id = dictionary.token2id.get(word)
    print(word, '\t', tfidf.idfs[my_id])
film 0.190174003903
movie 0.364013496254
comedy 1.98564470702
violence 3.2108967825
jolie 6.96578428466
26. Where does this text belong?
Text Categorization with NLTK
TVNZ: “Obama and Hangover star trade insults in interview”
Entertainment or Politics?
>>> train_set = [(document_features(d), c) for (d, c) in categorized_documents]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> doc_features = document_features(new_document)
>>> category = classifier.classify(doc_features)
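The slide uses `document_features` without defining it. One plausible shape for it, sketched below under assumptions, is a function that maps a tokenized document to a dictionary of boolean "contains(word)" features. The `WORD_FEATURES` vocabulary is hypothetical and chosen only for illustration; in practice it might be the most frequent words of the training corpus.

```python
# Hypothetical vocabulary, hand-picked for this example
WORD_FEATURES = ["obama", "senate", "election", "star", "movie", "interview"]

def document_features(document):
    """Map a list of tokens to a dict of boolean 'contains(word)' features."""
    words = set(w.lower() for w in document)
    return {"contains({})".format(w): (w in words) for w in WORD_FEATURES}

headline = "Obama and Hangover star trade insults in interview".split()
features = document_features(headline)
for name, value in features.items():
    print(name, value)
```

Dictionaries of this kind, paired with a category label, are the input that `nltk.NaiveBayesClassifier.train` expects. The example headline switches on features from both a political and an entertainment vocabulary, which is what makes it hard to categorize.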
27. Sentiment analysis with TextBlob
>>> from textblob import TextBlob
>>> blob = TextBlob("I love this library")
>>> blob.sentiment
Sentiment(polarity=0.5, subjectivity=0.6)
# transcendence: a list of paths to the review files
for review in transcendence:
    blob = TextBlob(open(review).read())
    print(review, blob.sentiment.polarity)
../data/transcendence_1star.txt 0.0170799124247
../data/transcendence_5star.txt 0.0874591503268
../data/transcendence_8star.txt 0.256845238095
../data/transcendence_10star.txt 0.304310344828
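The point of the numbers above is that polarity rises with the star rating. A quick way to check that, sketched here with the scores copied from the slide, is to sort the files by rating and test whether the polarities increase. The `stars` helper and the file-name pattern it parses are assumptions based on the paths shown.

```python
import re

# Polarity scores as printed on the slide
scores = {
    "../data/transcendence_1star.txt": 0.0170799124247,
    "../data/transcendence_5star.txt": 0.0874591503268,
    "../data/transcendence_8star.txt": 0.256845238095,
    "../data/transcendence_10star.txt": 0.304310344828,
}

def stars(path):
    """Extract the star rating from a name like 'transcendence_8star.txt'."""
    return int(re.search(r"_(\d+)star", path).group(1))

# Order reviews by rating, then check that polarity grows along with it
ranked = sorted(scores, key=stars)
polarities = [scores[path] for path in ranked]
increasing = all(a < b for a, b in zip(polarities, polarities[1:]))
print(increasing)
```

Four reviews are far too few to draw conclusions from, but the same check scales to a larger set of rated reviews.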
29. Keyword extraction in 3h:
Understanding a movie review
…four of the biggest directors in hollywood : quentin
tarantino , robert rodriguez , … were all directing one big film
with a big and popular cast ...the second room ( jennifer
beals ) was better , but lacking in plot ... the bumbling and
mumbling bellboy … ruins every joke in the film …
bellboy
jennifer beals
four rooms
beals
rooms
tarantino
madonna
antonio banderas
valeria golino
github.com/zelandiya/KiwiPyCon-NLP-tutorial
30. Keyword extraction on 2000 movie reviews:
What makes a successful movie?
Negative            Positive
van damme           star wars
zeta – jones        disney
smith               war
batman              de niro
de palma            jackie
eddie murphy        alien
killer              jackie chan
tommy lee jones     private ryan
wild west           truman show
mars                ben stiller
murphy              cameron
ship                science fiction
space               cameron diaz
brothers            fiction
de bont             jack
...                 ...
31. How can NLP help a beer drinker?
Sweaty Horse Blanket: Processing the Natural Language of Beer
by Ben Fields
vimeo.com/96809735
34. Filling the gaps in machine understanding
… Jack Ruby, who killed J.F.Kennedy's assassin Lee Harvey Oswald. …
/m/0d3k14
/m/044sb
/m/0d3k14
Freebase
36. Conclusions:
Understanding human language with Python
Are we there yet?
NLTK: nltk.org
radimrehurek.com/gensim
textblob.readthedocs.org
scikit-learn.org/stable
deeplearning.net/software/theano
@zelandiya #nlproc
Editor's notes
Let’s start with fiction. Here we have Knight Rider’s car, which not only communicates with David Hasselhoff’s character, but is also great at dry humour.
I don’t know about humour, but when I was at Google I/O this year, I enjoyed the demo of Android Auto, a system that will soon be integrated into most cars. You can ask by voice for things like opening times, recommendations and of course directions.
Back to fiction again: Those who have read The Hitchhiker’s Guide to the Galaxy will still remember the Babel fish, a creature that you insert into your ear for instant translation.
Reality’s answer to that is probably Word Lens, an app that translates short bits of text offline while keeping the font rendering. We showed it off at a party in Germany last month, and one person said “it’s magic!”
Those who’ve seen Star Trek will remember the library computer everybody talks to.
Well, Google actually is pretty much already there. Let’s try it out.
And finally, computers and love. In this recent movie, this guy forms a relationship with an operating system. She is the perfect girlfriend, and she has Scarlett Johansson’s voice.
A guy named Joshua tried to do this with Siri. Unfortunately, she told him that her “Licensing agreement does not cover marriage”.
I would like to finish this talk with a funny story from another event I got a chance to attend several years ago.
It was the unconference Foo Camp in the US, and I was brave enough to hold a session.
I called it “NLP: Are we there yet?” What I meant by that was “Are NLP algorithms good enough to be used in commercial-grade applications?”
I was waiting in the room, and the first person who came in was a guy who, as I remembered, had introduced himself as the security guide at Burning Man, a festival in Nevada. I asked him if he was after the NLP that stands for “Natural Language Processing”, …
When he left, I was terrified. What happens now?
The next person who walked in asked “Is this Natural Language Processing?” I said, yes, it is, phew! The room filled up quickly, and I was star struck when Andrew Ng, the Director of the Stanford AI Lab, walked in.
The discussion was lively and we concluded that NLP is there! Particularly the data-driven algorithms that use machine learning, such as “deep learning”. I talked today about NLTK, Gensim and TextBlob, but I encourage you to also look into scikit-learn and Theano, libraries that do exactly that.