2. NLP is a set of techniques for interpreting human languages
and extracting meaning from them, which can then be used
by other humans (translators), machines or software systems
for achieving greater objectives.
Why do we need this?
Not everyone speaks the same language (MT)
Not all the languages are written the same way (MT)
We would like to automate tasks which are tedious (IE)
We desire more accuracy and fewer human errors (Automation)
3. 1. Spam Detection
2. Spelling correction
3. Parts of Speech Tagging
4. Named Entity Recognition
1. Coreference Resolution
2. Information Extraction
3. Sentiment Analysis
4. Machine Translation
1. Answering Questions
2. Summarization
3. Paraphrasing
4. Dialog
4. Ambiguity is pervasive
Headline: Republicans Grill IRS Chief Over Lost Emails
Meanings:
Republicans harshly question the chief about the emails
Republicans cook the chief using email as the fuel
New ways of writing
Twitter hashtags
All Capitals
Abused notations (U for You, FB for Facebook, @ for At)
New words (Retweet, Unfriend, etc.)
Emoticons (in many varieties)
5. We learn, we remember and we conquer
What tools do we use?
We require the knowledge about the language
We require the knowledge about the world
We need a way to combine knowledge sources
How do we do this?
Probabilistic models built upon language data for inferring language
properties
P(“fragrant” -> “rose”) is high
P(“awful” -> “love”) is low
6. Most of what we do in NLP is “Text Processing”: normalizing the text in one
way or another.
Word Tokenization, Text Search, Sentence Segmentation, Pattern
Recognition, Disambiguating Words etc.
Text processing is important. Two important tools are
Regular Expressions (http://regexpal.com)
Word Tokenization (http://sentiment.christopherpotts.net/tokenizing.html)
how many words are there in the text
what is the size of the vocabulary and so on..
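A minimal sketch of word tokenization with a regular expression, counting the words and the vocabulary size. The sample text and the token pattern are assumptions made for illustration; real tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Hypothetical sample text.
text = "The weather is hot. The weather was cold."
tokens = tokenize(text)
vocabulary = set(tokens)

print(len(tokens))      # how many words are there in the text
print(len(vocabulary))  # size of the vocabulary (distinct words)
```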
8. Sentence completion
How are _____?
The weather is ______.
Phrase Rearrangement
little had Mary a lamb
happy everyone during is holidays
How can we solve these problems?
9. Sentence completion – Probability(upcoming word)
Phrase rearrangement – Probability(occurrence of a sentence)
For both of these, we need a mechanism called a “language model”
In simple words, a language model is a black box which has prior knowledge
about the language(s); for any question we consult the LM for an
appropriate answer
Formally – A language model is a model which computes either P(upcoming
word) or P(occurrence of a sentence)
10. What the model returns depends on what kind of answer you want from it.
An answer to the question:
P(upcoming word) - a list of (word, probability) pairs, or only the word with the
highest probability (depending on the complexity of the algorithm)
P(phrase rearrangement) - a list of (sentence, probability) pairs, or only the
sentence with the highest probability (depending on the complexity of the
algorithm)
How do we get these probabilities?
11. Goal: Calculating the probability of a sequence of words
P("The tiger is a fierce animal") =
P(The) * P(tiger | The) * P(is | The tiger) * P(a | The tiger is)
* P(fierce | The tiger is a) * P(animal | The tiger is a fierce)
This is the joint probability of the sequence of words by the
chain rule of probability.
P(animal | The tiger is a fierce) =
Count(The tiger is a fierce animal) / Count(The tiger is a fierce)
12. It is usually sufficient to assume that
P(animal | the tiger is a fierce) ≈
P(animal | fierce) or P(animal | a fierce)
Hence, P(the tiger is a fierce animal) ≈ P(the) * P(tiger | the) *
P(is | the tiger) * P(a | tiger is) * P(fierce | is a) * P(animal | a fierce)
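A minimal sketch of the one-previous-word version of this approximation, P(animal | fierce), estimated from counts. The three-sentence corpus and the "<s>" start marker are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus; "<s>" marks the start of each sentence.
corpus = [
    "<s> the tiger is a fierce animal",
    "<s> the tiger is a big cat",
    "<s> the cat is a small animal",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """Estimate P(word | prev) as Count(prev word) / Count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Multiply the conditional probabilities of each word given the previous one."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob("the tiger is a fierce animal"))
```

A sentence containing a word pair that never occurs in the corpus gets probability zero under this estimate.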
13. Vocabulary: Cold, You, They, How, Are, The, Weather, Hot, Is, Was
Strategy 1: Assign a probability to each word depending on the
number of times it occurred in the corpus
How are Cold?
How are You?
How are They?
How are How?
No context is used, so every word in the vocabulary is a valid completion
and the candidates are ranked only by how frequent each word is.
This model is called the unigram model
14. Strategy 2: Assign probabilities based on some context. Look
at the one previous word.
is cold , is hot, is gone
are gone, are you, are they
was cold, was hot
they are, the weather, you are
Complete the sentence : The weather is ________
Three possible options are
The weather is hot
The weather is cold
The weather is gone
This model is called the bigram model
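A minimal sketch of bigram-based completion, treating the word pairs listed on this slide as the only observed data. The function name `complete` is hypothetical.

```python
# Bigrams listed on the slide, treated here as the observed data.
bigrams = [
    ("is", "cold"), ("is", "hot"), ("is", "gone"),
    ("are", "gone"), ("are", "you"), ("are", "they"),
    ("was", "cold"), ("was", "hot"),
    ("they", "are"), ("the", "weather"), ("you", "are"),
]

# Map each word to the words that were seen right after it.
followers = {}
for prev, word in bigrams:
    followers.setdefault(prev, []).append(word)

def complete(prefix):
    """Return every word seen after the last word of the prefix."""
    last_word = prefix.lower().split()[-1]
    return followers.get(last_word, [])

print(complete("The weather is"))  # ['cold', 'hot', 'gone']
```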
15. Strategy 3: Start looking at more than one previous word
weather is cold , tea is hot,
they are gone, how are you, where are they
water was cold, food was hot
Complete the sentence : The weather is ________
The possible option is : The weather is cold.
Complete the sentence : How are ________?
The possible option is : How are you?
This model is called the trigram model
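A minimal sketch of the same completion with two words of context, using the trigrams listed on this slide as the only observed data. The function name `complete_trigram` is hypothetical.

```python
# Trigrams listed on the slide, treated here as the observed data.
trigrams = [
    ("weather", "is", "cold"), ("tea", "is", "hot"),
    ("they", "are", "gone"), ("how", "are", "you"), ("where", "are", "they"),
    ("water", "was", "cold"), ("food", "was", "hot"),
]

# Map each two-word context to the words seen right after it.
continuations = {}
for first, second, third in trigrams:
    continuations.setdefault((first, second), []).append(third)

def complete_trigram(prefix):
    """Return every word seen after the last two words of the prefix."""
    context = tuple(prefix.lower().split()[-2:])
    return continuations.get(context, [])

print(complete_trigram("The weather is"))  # ['cold']
print(complete_trigram("How are"))         # ['you']
```

The longer context narrows three bigram options down to one, at the cost of needing more data to see each context at all.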
16. The computer I installed in the chemistry laboratory _______
Bigrams may result in :
The computer I installed in the chemistry laboratory equipment
Trigrams may result in :
The computer I installed in the chemistry laboratory apparatus
Long-Distance Dependencies
Huge corpus with common n-grams.
“The computer I installed in the chemistry laboratory crashed”
17. A 2000-word vocabulary has 2000*2000 = 4,000,000 possible bigrams
A document with 16000 tokens produces around 5000 unique bigrams
This means only 5000/4,000,000 = 0.125% of the possible bigrams are seen; the
remaining 99.875% are never seen.
The model will assign zero probability to all the unseen bigrams and hence, if we
calculate the probability of a sentence with a new bigram, our model will
return zero.
To avoid this we use a technique called smoothing. The simplest of all is
Laplace’s Smoothing, also called Add One Smoothing
18. Laplace’s Smoothing: Pretend that you saw every bigram one more time than
you actually did. This solves the problem of unseen bigrams or n-grams.
Applying Laplace’s smoothing we calculate the probability as below:
Without Smoothing: P(wi | wi-1) = c(wi-1 wi) / c(wi-1)
Add One Smoothing: P(wi | wi-1) = (c(wi-1 wi) + 1) / (c(wi-1) + |V|)
Why add |V| to the denominator?
Each of the |V| unique words could follow wi-1, and each of those bigrams gets
one extra count, so the total count for wi-1 in the denominator
increases by |V|
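A minimal sketch comparing the unsmoothed and add-one estimates on a tiny corpus. The token sequence is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus used only for illustration.
tokens = "the weather is hot the weather was cold".split()
vocabulary = sorted(set(tokens))
V = len(vocabulary)

# Count each word as a context (every token except the last) and each bigram.
context_counts = Counter(tokens[:-1])
bigram_counts = Counter(zip(tokens, tokens[1:]))

def unsmoothed_prob(prev, word):
    """c(prev word) / c(prev); zero for any unseen bigram."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def add_one_prob(prev, word):
    """(c(prev word) + 1) / (c(prev) + |V|); never zero."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)

print(unsmoothed_prob("weather", "cold"))  # 0.0
print(add_one_prob("weather", "cold"))     # 0.125
```

With |V| in the denominator, the smoothed probabilities of all words following a given context still sum to one.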
19. Corpora Available Online
SRI Language Model ToolKit : http://www.speech.sri.com/projects/srilm/
Google Books Ngram Viewer: https://books.google.com/ngrams
Google Ngram Corpus: http://googleresearch.blogspot.in/2006/08/all-our-n-gram-are-belong-to-you.html
21. Typing is a manual process and not everyone types correctly. Spelling
mistakes occur frequently. There are two spelling tasks in such
scenarios
Error Detection
Error Correction
Auto Correct
hte -> the
Suggest one correction
theate -> theatre
Suggest a list of corrections
leding -> leading, lending
22. Non Word Errors
opportnity -> opportunity
graffe -> giraffe
Real Word Errors
Typographical Errors
In dog we trust -> god
Cognitive Errors (Homophones)
good buy -> bye
withdraw cache -> cash
23. Non Word Errors – very easy to correct
Detection
Have a word list from the dictionary of the language
Check the word in the dictionary, if the word is not present, the word is an
error.
Correction
Find candidate words which are similar to the error
Choose the best candidate based on any algorithm or model discussed next.
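A minimal sketch of the detection step: any word missing from the word list is flagged as an error. The five-word dictionary is an assumption made for illustration; a real system would load a full word list.

```python
# Hypothetical miniature word list; a real system would load a full dictionary.
dictionary = {"the", "giraffe", "had", "an", "opportunity"}

def find_non_words(text):
    """Return the words of the text that are not present in the dictionary."""
    return [word for word in text.lower().split() if word not in dictionary]

print(find_non_words("The graffe had an opportnity"))  # ['graffe', 'opportnity']
```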
24. Real Word Errors – tough to fix
Detection
Dictionary-based detection is not possible: the error is itself a real word,
so testing the word against the given dictionary finds nothing.
Correction
Find candidate words which are similar (pronunciation, spelling) to a word in
the sentence
Do this for all the words in the sentence
Choose the best candidate based on any algorithm or model discussed next.
25. N M E -> N A M E (INSERT)
H T E -> T H E (TRANSPOSE)
A C R E S S -> A C R E S (DELETE)
26. The noisy channel is a probabilistic model of how an intended word gets
distorted into the observed (noisy) word under real-world conditions.
We run many guesses through the channel, and the guess whose output best
matches the noisy word is taken as the correct word.
27. What are we trying to find?
For an observation x (the noisy word), we are looking for the word w (from the
vocabulary V) which has the highest probability given x under this noisy
channel.
ŵ = argmax over w ϵ V of P(w|x)
= argmax over w ϵ V of P(x|w) * P(w) / P(x) (Bayes Rule)
= argmax over w ϵ V of P(x|w) * P(w)
P(x) is constant for all the w in the vocabulary, because x is the observation,
so it can be dropped from the maximization.
P(x|w) is called the channel model and P(w) is called the language model.
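A minimal sketch of the argmax above: each candidate is scored by channel model times language model, and the best one is returned. The function name and all probabilities below are hypothetical values chosen only for illustration.

```python
def best_correction(candidates, channel_prob, prior_prob):
    """Return the candidate w that maximizes P(x|w) * P(w)."""
    return max(candidates, key=lambda w: channel_prob[w] * prior_prob[w])

# Hypothetical probabilities for the observation x = "hte".
channel_prob = {"the": 0.001, "hate": 0.0001}  # P(x | w), the channel model
prior_prob = {"the": 0.05, "hate": 0.0002}     # P(w), the language model

print(best_correction(["the", "hate"], channel_prob, prior_prob))  # the
```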
28. x = acress
Possible candidates:
Error (x)   Correction (w)   Correct Letter   Error Letter   Error Type
acress      actress          t                -              Deletion
acress      cress            -                a              Insertion
acress      caress           ca               ac             Transposition
acress      access           c                r              Substitution
acress      across           o                e              Substitution
acress      acres            -                s              Insertion
acress      acres            -                s              Insertion
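A minimal sketch of how such candidates can be generated: produce every string one edit away from the noisy word and keep those found in the vocabulary. The eight-word vocabulary is an assumption made for illustration. Note that the edit applied here goes from the error back to the correction, so a Deletion error is undone by an insert and an Insertion error by a delete.

```python
import string

def single_edits(word):
    """Return every string one delete, transpose, substitute, or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    substitutes = [left + c + right[1:]
                   for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

# Hypothetical miniature vocabulary.
vocabulary = {"actress", "cress", "caress", "access", "across", "acres", "the", "name"}

def candidates(word):
    """Return the vocabulary words that are one edit away from the given word."""
    return sorted(single_edits(word) & vocabulary)

print(candidates("acress"))
# ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```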
29. Words taken from the Corpus of Contemporary English (400,000,000 words)
Clearly the probability of ‘across’ is the highest, so a unigram language
model alone will suggest the correction ‘across’
Word Frequency P(word)
actress 9448 0.00002362
cress 220 0.00000055
caress 686 0.00000171
access 35310 0.00008827
across 105559 0.00026389
acres 12874 0.00003218
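A minimal sketch reproducing the P(word) column from the frequencies in the table, assuming the corpus size of 400,000,000 words stated above.

```python
CORPUS_SIZE = 400_000_000

# Frequencies copied from the table above.
frequencies = {
    "actress": 9448, "cress": 220, "caress": 686,
    "access": 35310, "across": 105559, "acres": 12874,
}

# Unigram probability of a word is its frequency divided by the corpus size.
unigram_prob = {word: count / CORPUS_SIZE for word, count in frequencies.items()}
best = max(unigram_prob, key=unigram_prob.get)

print(best)  # across
```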
30. We just talked about the probability of the words (the language model),
which is one factor contributing to the correction.
The other factor, the channel model, still needs to be consulted
for a better correction.
The channel model consults a tool called a Confusion Matrix to find
the likelihood of a type of error:
Types of Error: Insertion, Deletion, Transposition & Substitution
Each confusion matrix gives the likelihood of a given type of error.
32. Words taken from the Corpus of Contemporary English (400,000,000 words)
The channel model P(x|word) is multiplied by the language model P(word);
the candidate with the highest result is the suggested correction.
Word      Correct Letter   Error Letter   x|w     P(x|word)   P(word)      10^9 * P(x|word) * P(word)
actress   t                -              c|ct    0.0001170   0.00002362   2.7
cress     -                a              a|#     0.0000014   0.00000055   0.00078
caress    ca               ac             ac|ca   0.0000016   0.00000171   0.0028
access    c                r              r|c     0.0000002   0.00008827   0.019
across    o                e              e|o     0.0000093   0.00026389   2.8
acres     -                s              es|e    0.0000321   0.00003218   1.0
acres     -                s              ss|s    0.0000342   0.00003218   1.0
33. Wikipedia - Common English Spelling mistakes :
https://en.wikipedia.org/wiki/Commonly_misspelled_English_words
Birkbeck Spelling Error Corpus : http://ota.ox.ac.uk/headers/0643.xml
Peter Norvig's list of errors: http://norvig.com/ngrams/spell-errors.txt
36. Spam like attributes
Undisclosed Recipients
Prize!!
No name, lucky draw
Suspicious URL
37. Classification is the process of assigning given documents to classes. For example:
Is an e-mail SPAM or NOT?
A product review is POSITIVE or NEGATIVE?
And so on…
We need a classification model, which works on the given inputs and produces
the desired output.
Inputs:
a document d
a fixed set of classes C = {c1, c2, c3, …., cn}
Desired Output:
a predicted class c ϵ C such that the document belongs to that class c.
38. Hand Coded Rules
Based on combinations of words, e.g. a blacklisted sender, or words like
Viagra, dollars, impress a girl.
Supervised Machine Learning
Naïve Bayes
Support Vector Machines
Logistic Regression
KNN (K Nearest Neighbor)
39. What is a Sentiment?
It is an attitude: affectively colored beliefs or dispositions towards objects and
persons, such as liking, loving, hating, valuing, desiring
What is the task of Sentiment Analysis?
Detecting the holder of the attitude, the target of the attitude and the type of the
attitude.
Types of Analysis
Simplest is to assign a binary value to a sentence or document
Slightly more complex is to rate on a scale of 1-10
Toughest is to detect the target and source
40. 1. The fragrance is just awesome, I love it
2. Keeps you going all day long, this is the best perfume
3. Seriously do you call it a perfume? It’s awful.
4. Thanks for this wonderful fragrance in the classy bottle. Great!!
5. I feel like being cheated after buying this piece of crap
6. If you are reading this because it is your darling fragrance, please wear it
at home exclusively, and tape the windows shut.
Attitude – awesome, best, classy, awful, crap, cheated, thanks
Target – perfume, bottle, fragrance
Holder – purchaser, user
We are basically extracting the opinion.
41. 1. We prefer to watch movies after reading the reviews
2. We prefer to buy products after reading the reviews
3. We prefer to invest in stocks after understanding the market sentiments
4. We want a profitable asset
5. We don’t want to get cheated with odd surprises
6. We believe that we can predict the future (election result, market outcome)
All of the above can, more or less, be addressed using the tool called ‘Sentiment
Analysis’.
42. Input : Test documents
Tokenize the test documents
Feature Extraction (either all the tokens or a group of relevant tokens)
Classification using any of the classifiers
Naïve Bayes Classifiers
Maximum Entropy Classifiers
Support Vector Machine Classifier
43. Our goal:
For a document d and a set of classes C, we need to calculate the probability of
each class given the document.
P(c|d) = P(d|c) * P(c) / P(d)
The class which maximizes P(c|d) is the class to which the document belongs.
P(d|c) – conditional probability of the document given the class
P(c) – prior probability of the class
44. What is the practical meaning of prior P(c) and likelihood P(d|c) ?
P(c) means the probability of occurrence of the class, i.e. how often the class
occurs in the corpus.
P(d|c) means the probability of the occurrence of a set of features (of a
certain length) given a class, i.e. P(x1, x2, x3, … xn | c), the joint
probability of all the features given a class.
Calculating this joint probability requires an enormous number of parameters,
possibly of the order |x|^n, for each class.
That would require an enormous number of training samples, which are mostly not
available.
This looks complicated, so we must try out simplifications!
45. Simplifying Assumptions
Position of the word in the document doesn’t matter – Bag of Words
Feature probabilities given a class are independent - Conditional Independence
This simplifies our model to
P(x1, x2, x3, … xn | c) = P(x1|c) * P(x2|c) * P(x3|c) * … * P(xn|c)
And hence, since P(d) is the same for every class, the probability of a class given a document is proportional to
P(c|d) ∝ P(c) * P(x1|c) * P(x2|c) * P(x3|c) * … * P(xn|c)
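A minimal sketch of a classifier built on this product. The four training documents are hypothetical. Two choices below are assumptions not stated on the slide: add-one smoothing for the word probabilities, and summing logarithms instead of multiplying probabilities to avoid numeric underflow.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training documents.
training = [
    ("positive", "the fragrance is awesome i love it"),
    ("positive", "this is the best perfume"),
    ("negative", "it is awful"),
    ("negative", "i feel cheated"),
]

class_counts = Counter()
word_counts = defaultdict(Counter)
vocabulary = set()
for label, text in training:
    words = text.split()
    class_counts[label] += 1
    word_counts[label].update(words)
    vocabulary.update(words)

def log_score(label, text):
    """Return log P(c) + sum of log P(x|c), using add-one smoothing (an assumption)."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        if word not in vocabulary:
            continue  # ignore words never seen in training
        smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
        score += math.log(smoothed)
    return score

def classify(text):
    """Return the class with the highest score for the document."""
    return max(class_counts, key=lambda label: log_score(label, text))

print(classify("i love this perfume"))  # positive
print(classify("it is awful"))          # negative
```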
48. Tokenization Issues
Data is available online in HTML, XML and various other mark up languages
Twitter names, hashtags etc. pollute the data
Phone numbers, short forms, new words, emoticons etc.
Extracting Features
Handling negations
I didn’t like this movie vs I really liked this movie
Which words to use?
Choosing the words (I, this, movie etc do not belong to the set of words which
contribute to attitude)
49. Lexicons list the words which matter for sentiment. It's better to train our models on these
lexicons instead of the complete list of words in the training documents.
Here are a few links for the lexicons:
The General Inquirer :
http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
http://www.wjh.harvard.edu/~inquirer/homecat.htm
LIWC(Linguistic Inquiry and Word Count)
http://www.liwc.net
Negative emotions (bad, weird, hate, problem, crap)
Positive Emotions ( love, wonderful, magnificent, lovely)
Bing Liu's page on Opinion Mining
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
50. Not every classifier we build is a state-of-the-art system. We must refine and
fine-tune it after building.
What are the parameters to judge a classifier?
Contingency Matrix
What is the accuracy of the system?
Accuracy = (TP + TN) / (TP + TN + FP + FN)
51. Task : Build a classifier to identify food items on web pages
Most of the tokens on a web page won’t be the name of a food item. Let us say
that there are 1000 words and only 10 words are names of food items.
Let us consider that our classifier is a bogus one and it always returns false
for each word it encounters.
This means the accuracy of our system = (TP + TN) / (TP + TN + FP + FN) = 990/1000 = 99%
Hence, our 99% accurate system is not able to do what we needed, i.e.
detecting food items.
52. We definitely need a better parameter to judge our model. So we define two
parameters, PRECISION and RECALL
Precision is the percentage of selected items that are correct.
Recall is the percentage of correct items that the system was able to select.
PRECISION = TP / (TP + FP) = 0/0 = UNDEFINED
RECALL = TP / (TP + FN) = 0/10 = 0
So, these parameters judge our classifier more fairly. Here, the recall is
zero and the precision is undefined, because the classifier selects nothing.
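A minimal sketch of the three measures applied to the bogus classifier from the food item example (1000 words, 10 food items, nothing selected). Returning `None` for an undefined ratio is a choice made for this illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all decisions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of selected items that are correct; None when nothing was selected."""
    return tp / (tp + fp) if (tp + fp) else None

def recall(tp, fn):
    """Fraction of correct items that were selected; None when there are none."""
    return tp / (tp + fn) if (tp + fn) else None

# The bogus classifier: it never selects anything.
tp, tn, fp, fn = 0, 990, 0, 10

print(accuracy(tp, tn, fp, fn))  # 0.99
print(precision(tp, fp))         # None
print(recall(tp, fn))            # 0.0
```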
53. The figures on the last slide didn’t give much insight into the roles of these two
parameters.
Slightly better classifier - capable of selecting the true food items
Precision = 10/(10 + 20) = 33%
Recall = 10/(10+10) = 50%
Hence these parameters fairly judge the classifier, so a combination of these
two measures can be a fair evaluation criterion. This combination is called the F
measure.
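A minimal sketch of the F measure for the precision and recall figures on this slide. The weighting parameter `beta` and its default of 1 (the balanced F1, the harmonic mean of precision and recall) are assumptions; the slide does not say which variant is meant.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives the balanced F1."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_squared = beta * beta
    return (1 + beta_squared) * precision * recall / (beta_squared * precision + recall)

precision = 10 / (10 + 20)  # 33%
recall = 10 / (10 + 10)     # 50%

print(round(f_measure(precision, recall), 2))  # 0.4
```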
54. Discriminative Language models
Maximum Entropy
Support Vector Machines
Other Advanced Applications
Named Entity Recognition
POS Tagging
Machine Translation & Probabilistic Parsing
CFGs
Language Grammars
Areas of research
Information Retrieval (Query Based and Generic)
Question Answering
Summarization