A developer's perspective to NLP
- Vanya Seth & Dharmendra Prasad

Natural Language Processing (NLP) is a set of techniques for interpreting
human languages and extracting meaning from them, so that the result can be
used by other humans (translators), machines or software systems to achieve
larger objectives.
Why do we need this?
• Not everyone speaks the same language (Machine Translation, MT)
• Not all languages are written the same way (MT)
• We would like to automate tedious tasks (Information Extraction, IE)
• We want more accuracy and fewer human errors (Automation)

1. Spam Detection
2. Spelling correction
3. Parts of Speech Tagging
4. Named Entity Recognition
1. Coreference Resolution
2. Information Extraction
3. Sentiment Analysis
4. Machine Translation
1. Answering Questions
2. Summarization
3. Paraphrasing
4. Dialog

• Ambiguity is pervasive
Headline: Republicans Grill IRS Chief Over Lost Emails
Meanings:
  • Republicans harshly question the chief about the emails
  • Republicans cook the chief using email as the fuel
• New ways of writing
  • Twitter hashtags
  • All capitals
  • Abused notations (U for you, FB for Facebook, @ for at)
  • New words (retweet, unfriend, etc.)
  • Emoticons (smileys and many others)

We learn, we remember and we conquer
• What tools do we use?
  • We require knowledge about the language
  • We require knowledge about the world
  • We need a way to combine knowledge sources
• How do we do this?
  • Probabilistic models built upon language data for inferring language
    properties
    • P(“fragrant” -> “rose”) is high
    • P(“awful” -> “love”) is low

What we mostly do in NLP is “Text Processing”: normalizing the text in one
way or another.
Typical tasks: word tokenization, text search, sentence segmentation, pattern
recognition, word disambiguation, etc.
Important text processing tools:
Regular Expressions (http://regexpal.com)
Word Tokenization (http://sentiment.christopherpotts.net/tokenizing.html)
Tokenization lets us answer questions such as how many words there are in
the text, what the size of the vocabulary is, and so on.
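A minimal sketch of word tokenization with a regular expression, counting tokens and vocabulary size. The pattern and the sample sentence are assumptions chosen for illustration; real tokenizers handle many more cases (URLs, hashtags, emoticons).

```python
import re
from collections import Counter

# Assumed pattern: runs of letters/digits, optionally followed by an
# apostrophe part such as the "'t" in "wasn't".
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def tokenize(text):
    """Lower-case the text and return the list of word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

text = "The weather is hot. The weather was cold, wasn't it?"
tokens = tokenize(text)
counts = Counter(tokens)

print("number of tokens:", len(tokens))   # how many words are in the text
print("vocabulary size:", len(counts))    # how many distinct words
```

Lower-casing before matching is itself a normalization decision: it merges "The" and "the" into one vocabulary entry.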

Sentence completion
How are _____?
The weather is ______.
Phrase Rearrangement
little had Mary a lamb
happy everyone during is holidays
How can we solve these problems?

Sentence completion – Probability(upcoming word)
Phrase rearrangement – Probability(occurrence of a sentence)
For both of these we need a piece of machinery called a “language model” (LM).
In simple words: a language model is a black box that has prior knowledge
about the language(s); for any question, we consult the LM for an
appropriate answer.
Formally: a language model is a model that computes either P(upcoming
word) or P(occurrence of a sentence).

What the model returns depends on the kind of answer you want from it:
P(upcoming word): a list of (word, probability) pairs, or just the word with
the highest probability (or any word, depending on the complexity of the
algorithm)
P(phrase rearrangement): a list of (sentence, probability) pairs, or just the
sentence with the highest probability (or any sentence, depending on the
complexity of the algorithm)
How do we get these probabilities?

Goal: calculate the probability of a sequence of words
P("The tiger is a fierce animal") =
P(The) * P(tiger | The) * P(is | The tiger) * P(a | The tiger is)
* P(fierce | The tiger is a) * P(animal | The tiger is a fierce)
This is the joint probability of the sequence of words, expanded by the
chain rule of probability.
Each factor can be estimated from corpus counts:
P(animal | The tiger is a fierce) =
Count(The tiger is a fierce animal) / Count(The tiger is a fierce)

Vocabulary: Cold, You, They, How, Are, The, Weather, Hot, Is, Was
Strategy 1: Assign a probability to each word depending on the
number of times it occurred in the corpus
• How are Cold?
• How are You?
• How are They?
• How are How?
Because the preceding words are ignored, the model cannot prefer
"How are You?" over "How are How?"; the choice depends only on how
frequent each word is.
This model is called the unigram model

Strategy 2: Assign probabilities based on some context. Look
at the one previous word.
Bigrams seen in the corpus:
• is cold, is hot, is gone
• are gone, are you, are they
• was cold, was hot
• they are, the weather, you are
Complete the sentence: The weather is ________
Three possible options are
• The weather is hot
• The weather is cold
• The weather is gone
This model is called the bigram model
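The bigram strategy above can be sketched as follows. The tiny corpus is an assumption made up for illustration, loosely following the bigrams listed on the slide; it only shows how next-word probabilities fall out of bigram counts.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, curr in zip(words, words[1:]):
            following[prev][curr] += 1
    return following

def next_word_probabilities(following, prev):
    """Return P(word | prev) for every word seen after prev."""
    counts = following[prev.lower()]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical toy corpus.
corpus = [
    "the weather is hot",
    "the weather is cold",
    "the tea is hot",
    "he is gone",
    "how are you",
    "they are gone",
]
model = train_bigrams(corpus)

# Candidates for "The weather is ____", using only the previous word "is".
print(next_word_probabilities(model, "is"))
```

Note that "gone" is offered as a completion even though "the weather is gone" is odd: the bigram model sees only "is", not "weather".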

Strategy 3: Start looking at more than one previous word
• weather is cold, tea is hot
• they are gone, how are you, where are they
• water was cold, food was hot
Complete the sentence: The weather is ________
Possible option: The weather is cold.
Complete the sentence: How are ________?
Possible option: How are you?
This model is called the trigram model

The computer I installed in the chemistry laboratory _______
Bigrams may result in :
The computer I installed in the chemistry laboratory equipment
Trigrams may result in :
The computer I installed in the chemistry laboratory apparatus
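Long-distance dependencies: the missing word depends on "computer", which
lies outside the window of a bigram or trigram.
A partial remedy is a huge corpus in which longer n-grams are common.
Expected completion:
“The computer I installed in the chemistry laboratory crashed”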

A 2000-word vocabulary has 2000 * 2000 = 4,000,000 possible bigrams.
A document with 16,000 tokens produces around 5000 unique bigrams.
This means only 5000/4,000,000 = 0.125% of the possible bigrams are seen; the
remaining 99.875% are never seen.
The model assigns zero probability to every unseen bigram, so if we
calculate the probability of a sentence containing a new bigram, the model
returns zero.
To avoid this we use a technique called smoothing. The simplest is Laplace
smoothing, also called add-one smoothing.

Laplace smoothing: pretend that you saw every bigram one more time than you
actually did. This solves the problem of unseen bigrams (or n-grams).
With Laplace smoothing the probability is calculated as below:
Without smoothing: P(wi | wi-1) = c(wi-1 wi) / c(wi-1)
Add-one smoothing: P(wi | wi-1) = (c(wi-1 wi) + 1) / (c(wi-1) + |V|)
Here c(wi-1 wi) is the number of times the bigram "wi-1 wi" occurs and
c(wi-1) is the number of times the word wi-1 occurs.
Why add |V| to the denominator?
Any of the |V| unique words in the vocabulary can follow wi-1, and each of
those |V| bigrams gets one extra count, so the denominator increases by |V|.
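A minimal sketch of the two formulas above, assuming a whitespace-tokenized toy text invented for illustration. It shows an unseen bigram moving from probability zero to a small non-zero value.

```python
from collections import Counter

def bigram_probability(tokens, prev, word, smoothing=True):
    """P(word | prev) from counts, with optional add-one smoothing."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocabulary_size = len(unigram_counts)      # |V|
    pair_count = bigram_counts[(prev, word)]   # c(prev word)
    prev_count = unigram_counts[prev]          # c(prev)
    if smoothing:
        return (pair_count + 1) / (prev_count + vocabulary_size)
    return pair_count / prev_count if prev_count else 0.0

# Hypothetical toy text: 8 tokens, vocabulary of 5 words.
tokens = "the weather is hot the weather is cold".split()

print(bigram_probability(tokens, "is", "the", smoothing=False))  # unseen bigram
print(bigram_probability(tokens, "is", "the"))                   # after add-one
print(bigram_probability(tokens, "is", "hot"))                   # seen bigram
```

Smoothing takes probability mass away from seen bigrams: P(hot | is) drops from 1/2 to 2/7 so that unseen continuations can share the rest.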

Online Available Corpus
SRI Language Model ToolKit : http://www.speech.sri.com/projects/srilm/
Google Books Ngram Viewer: https://books.google.com/ngrams
Google Ngram Corpus: http://googleresearch.blogspot.in/2006/08/all-our-n-gram-are-belong-to-you.html

Typing is a manual process and not everyone types correctly, so spelling
mistakes occur frequently. There are two spelling tasks in such
scenarios:
• Error Detection
• Error Correction
  • Auto correct
    • hte -> the
  • Suggest one correction
    • theate -> theatre
  • Suggest a list of corrections
    • leding -> leading, lending

• Non-Word Errors
  • opportnity -> opportunity
  • graffe -> giraffe
• Real-Word Errors
  • Typographical Errors
    • In dog we trust -> god
  • Cognitive Errors (Homophones)
    • good buy -> bye
    • withdraw cache -> cash

Non-Word Errors: comparatively easy to correct
• Detection
  • Have a word list from a dictionary of the language
  • Look the word up in the dictionary; if it is not present, the word is an
    error.
• Correction
  • Find candidate words which are similar to the error
  • Choose the best candidate based on an algorithm or model such as the one
    discussed next.

Real-Word Errors: tough to fix
• Detection
  • A dictionary lookup cannot detect the error, because the erroneous word
    is itself a real word and passes the dictionary test.
• Correction
  • For every word in the sentence, find candidate words which are similar
    to it (in pronunciation or spelling)
  • Choose the best candidate based on an algorithm or model such as the one
    discussed next.

Edit operations that turn the typed word into the intended word:
N M E -> N A M E (INSERT)
H T E -> T H E (TRANSPOSE)
A C R E S S -> A C R E S (DELETE)
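Candidate generation with these edit operations can be sketched as below. This is an illustrative sketch, not necessarily the presenters' method: it generates every string one edit away (insert, delete, transpose, plus substitute) and keeps those found in a word list. The small `dictionary` set is a hypothetical stand-in for a real word list.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings that are one edit operation away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    substitutes = [left + ch + right[1:]
                   for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + substitutes + inserts)

def candidates(word, dictionary):
    """Dictionary words reachable from word with a single edit."""
    return sorted(edits1(word) & dictionary)

# Hypothetical miniature dictionary.
dictionary = {"name", "the", "acres", "actress", "across",
              "access", "caress", "cress"}

for typo in ("nme", "hte", "acress"):
    print(typo, "->", candidates(typo, dictionary))
```

Generating edits and intersecting with the dictionary is cheaper than comparing the typo against every dictionary word, as long as only one or two edits are allowed.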

The noisy channel is a probabilistic model of how real-world conditions
distort an intended word into the word we observe.
We run many candidate words through the channel, and the candidate whose
output best matches the noisy word is taken as the correct word.

What are we trying to find?
For an observation x (the noisy word), we are looking for the word w from the
vocabulary V that is most probable given x under the noisy channel.
ŵ = argmax_{w ϵ V} P(w|x)
  = argmax_{w ϵ V} P(x|w) * P(w) / P(x)    (Bayes' rule)
  = argmax_{w ϵ V} P(x|w) * P(w)
P(x) is the same for every w in the vocabulary, because x is the observation,
so it can be dropped from the maximization.
P(x|w) is called the channel model and P(w) is called the language model.

x = acress
Possible candidates:
Error (x)  Correction (w)  Correct letter  Error letter  Error type
acress     actress         t               -             Deletion
acress     cress           -               a             Insertion
acress     caress          ca              ac            Transposition
acress     access          c               r             Substitution
acress     across          o               e             Substitution
acress     acres           -               s             Insertion
acress     acres           -               s             Insertion

Word counts taken from the Corpus of Contemporary English (400,000,000 words):
Word     Frequency  P(word)
actress  9448       0.00002362
cress    220        0.00000055
caress   686        0.00000171
access   35310      0.00008827
across   105559     0.00026389
acres    12874      0.00003218
The probability of ‘across’ is clearly the highest, so a unigram language
model on its own will suggest the correction ‘across’.
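The P(word) column is simply frequency divided by corpus size. A short sketch using the counts from the table above (the variable and function names are illustrative):

```python
CORPUS_SIZE = 400_000_000  # corpus size quoted on the slide

# Frequencies copied from the table above.
frequencies = {
    "actress": 9448,
    "cress": 220,
    "caress": 686,
    "access": 35310,
    "across": 105559,
    "acres": 12874,
}

def unigram_probabilities(freqs, corpus_size):
    """Maximum-likelihood unigram estimate: count / corpus size."""
    return {word: count / corpus_size for word, count in freqs.items()}

probabilities = unigram_probabilities(frequencies, CORPUS_SIZE)
best = max(probabilities, key=probabilities.get)

for word, p in probabilities.items():
    print(f"{word:8s} {p:.8f}")
print("unigram suggestion:", best)
```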

We just talked about the probability of the words (the language model),
which is one factor contributing to the correction.
The other factor, the channel model, still needs to be consulted
for a better correction.
The channel model consults a tool called a confusion matrix to find
the likelihood of a type of error.
Types of error: insertion, deletion, transposition and substitution
There is one confusion matrix per error type; each gives the likelihood of
that type of error for specific letters.

Combining the channel model P(x|w) with the language model P(w) for x = acress:
Word     Correct letter  Error letter  x|w    P(x|w)     P(w)        P(x|w) * P(w) (x 10^-9)
actress  t               -             c|ct   0.0001170  0.00002362  2.7
cress    -               a             a|#    0.0000014  0.00000055  0.00078
caress   ca              ac            ac|ca  0.0000016  0.00000171  0.0028
access   c               r             r|c    0.0000002  0.00008827  0.019
across   o               e             e|o    0.0000093  0.00026389  2.8
acres    -               s             es|e   0.0000321  0.00003218  1.0
acres    -               s             ss|s   0.0000342  0.00003218  1.0
With the channel model included, ‘across’ (2.8) and ‘actress’ (2.7) become
close competitors.
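The noisy channel score for each candidate is the product P(x|w) * P(w). The sketch below recomputes it from the two probability columns as printed in the table. Because those printed figures are rounded, the recomputed products will not necessarily match the Result column digit for digit.

```python
# Rows copied from the table above: (candidate, P(x|w), P(w)).
ROWS = [
    ("actress", 0.0001170, 0.00002362),
    ("cress",   0.0000014, 0.00000055),
    ("caress",  0.0000016, 0.00000171),
    ("access",  0.0000002, 0.00008827),
    ("across",  0.0000093, 0.00026389),
    ("acres",   0.0000321, 0.00003218),  # es|e
    ("acres",   0.0000342, 0.00003218),  # ss|s
]

def score_candidates(rows):
    """Return (candidate, P(x|w) * P(w)) pairs, highest score first."""
    scored = [(word, channel * prior) for word, channel, prior in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for word, score in score_candidates(ROWS):
    # Scale by 10^9 so the numbers are comparable to the Result column.
    print(f"{word:8s} {score * 1e9:.5f} x 10^-9")
```

A real channel model would look the P(x|w) values up in confusion matrices instead of hard-coding them.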

Wikipedia - Common English Spelling mistakes :
https://en.wikipedia.org/wiki/Commonly_misspelled_English_words
Birkbeck Spelling Error Corpus : http://ota.ox.ac.uk/headers/0643.xml
Peter Norvig List of Error: http://norvig.com/ngrams/spell-errors.txt

Spam-like attributes
• Undisclosed recipients
• Prize!!
• No name, lucky draw
• Suspicious URL

Text classification is the process of assigning given documents to classes.
For example:
• Is an e-mail SPAM or NOT?
• Is a product review POSITIVE or NEGATIVE?
And so on…
We need a classification model that works on the given inputs and produces
the desired output.
Inputs:
• a document d
• a fixed set of classes C = {c1, c2, c3, …, cn}
Desired output:
a predicted class c ϵ C, the class to which the document belongs.

Hand Coded Rules
Based on combinations of words or other features, e.g. a blacklisted sender,
or words and phrases like "Viagra", "dollars", "impress a girl".
Supervised Machine Learning
Naïve Bayes
Support Vector Machines
Logistic Regression
KNN (K Nearest Neighbor)
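The hand-coded rules approach listed above can be sketched in a few lines. The sender address below is a hypothetical placeholder, and the phrase list just reuses the examples from the slide; a real rule set would be much larger and harder to maintain, which is one motivation for the supervised methods.

```python
# Hypothetical blacklist; example.com addresses are placeholders.
BLACKLISTED_SENDERS = {"offers@example.com"}

# Trigger phrases taken from the examples on the slide.
SPAM_PHRASES = ("viagra", "dollars", "impress a girl")

def is_spam(sender, body):
    """Flag a message if the sender is blacklisted or a trigger phrase occurs."""
    if sender.lower() in BLACKLISTED_SENDERS:
        return True
    text = body.lower()
    return any(phrase in text for phrase in SPAM_PHRASES)

print(is_spam("offers@example.com", "Hello"))
print(is_spam("friend@example.com", "Win a million dollars now"))
print(is_spam("friend@example.com", "Lunch tomorrow?"))
```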

What is a sentiment?
It is an attitude: an affectively colored belief or disposition towards
objects or persons, such as liking, loving, hating, valuing or desiring.
What is the task of sentiment analysis?
Detecting the holder of the attitude, the target of the attitude and the
type of the attitude.
Types of analysis
• Simplest is to assign a binary value to a sentence or document
• Slightly more complex is to rate it on a scale of 1-10
• Toughest is to detect the target and the source

1. The fragrance is just awesome, I love it
2. Keeps you going all day long, this is the best perfume
3. Seriously do you call it a perfume? It’s awful.
4. Thanks for this wonderful fragrance in the classy bottle. Great!!
5. I feel like being cheated after buying this piece of crap
6. If you are reading this because it is your darling fragrance, please wear it
at home exclusively, and tape the windows shut.
Attitude – awesome, best, classy, awful, crap, cheated, thanks
Target – perfume, bottle, fragrance,
Holder – purchaser, user
We are basically extracting the opinion.

1. We prefer to watch movies after reading the reviews
2. We prefer to buy products after reading the reviews
3. We prefer to invest in stocks after understanding the market sentiments
4. We want a profitable asset
5. We don’t want to get cheated with odd surprises
6. We believe that we can predict future (election result, market outcome)
All of the above can, more or less, be addressed using the tool called
‘Sentiment Analysis’.

Input: test documents
• Tokenize the test documents
• Feature extraction (either all the tokens or a group of relevant tokens)
• Classification using any of the classifiers
  • Naïve Bayes classifier
  • Maximum Entropy classifier
  • Support Vector Machine classifier

Our goal:
For a document d and a set of classes C, we need to calculate the probability
of each class given the document.
P(c|d) = P(d|c) * P(c) / P(d)
The class which maximizes P(c|d) is the class to which the document belongs.
• P(d|c): conditional probability (likelihood) of the document given the class
• P(c): prior probability of the class

What is the practical meaning of the prior P(c) and the likelihood P(d|c)?
P(c) is the probability of occurrence of the class, i.e. how often the class
occurs in the corpus.
P(d|c) is the probability of the occurrence of some set of features (of a
certain length) given a class, i.e. P(x1, x2, x3, …, xn | c): the joint
probability of all the features given a class.
Calculating this joint probability requires an enormous number of parameters,
possibly of the order of |x|^n for each class.
That would require an enormous number of training samples, which are mostly
not available.
This looks complicated, so we must try some simplifications.

Simplifying Assumptions
• Position of the word in the document doesn’t matter (Bag of Words)
• Feature probabilities given a class are independent (Conditional Independence)
This simplifies our model to
P(x1, x2, x3, …, xn | c) = P(x1|c) * P(x2|c) * P(x3|c) * … * P(xn|c)
And hence, since P(d) is the same for every class, the probability of a class
given a document is proportional to
P(c|d) ∝ P(c) * P(x1|c) * P(x2|c) * P(x3|c) * … * P(xn|c)
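A minimal Naïve Bayes sketch under the two assumptions above. It is illustrative only: the training sentences are shortened versions of the perfume reviews in this deck with labels assigned by hand, and it adds two things the slide does not mention, namely add-one smoothing for unseen word/class pairs and summing log probabilities instead of multiplying, to avoid numeric underflow.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for word in text.lower().split():   # bag of words: order ignored
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def classify(text, class_counts, word_counts, vocabulary):
    """Return the class c maximizing log P(c) + sum of log P(x|c)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)           # prior P(c)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue                                   # skip unknown words
            count = word_counts[label][word]
            # Add-one smoothed likelihood P(x|c) (an added assumption).
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hand-labelled toy training data (labels are assumptions).
training = [
    ("the fragrance is just awesome i love it", "positive"),
    ("this is the best perfume", "positive"),
    ("wonderful fragrance in the classy bottle great", "positive"),
    ("do you call it a perfume it is awful", "negative"),
    ("i feel cheated after buying this piece of crap", "negative"),
]
model = train_naive_bayes(training)

print(classify("awesome perfume i love it", *model))
print(classify("awful crap", *model))
```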

Tokenization Issues
• Data is available online in HTML, XML and various other markup languages
• Twitter names, hashtags, etc. pollute the data
• Phone numbers, short forms, new words, emoticons, etc.
Extracting Features
• Handling negations
  • "I didn’t like this movie" vs "I really liked this movie"
• Which words to use?
  • Choosing the words (I, this, movie, etc. do not belong to the set of words
    which contribute to attitude)
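One common way to handle negation, shown here as an assumption since the slide does not name a technique, is to prefix every token between a negation word and the next punctuation mark with NOT_. This makes "like" and "NOT_like" different features for the classifier. The negation word list below is illustrative.

```python
import re

NEGATION_WORDS = {"not", "no", "never"}   # illustrative, not exhaustive
PUNCTUATION = set(".,!?;")

def mark_negation(text):
    """Prefix tokens that follow a negation with NOT_ until punctuation."""
    tokens = re.findall(r"[a-z']+|[.,!?;]", text.lower())
    marked, negated = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negated = False               # negation scope ends here
            continue
        marked.append("NOT_" + token if negated else token)
        if token in NEGATION_WORDS or token.endswith("n't"):
            negated = True                # start negating the next tokens
    return marked

print(mark_negation("I didn't like this movie, but I liked the music"))
print(mark_negation("I really liked this movie"))
```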

Sentiment lexicons list the words which matter for sentiment. It’s better to
train our models on the words in these lexicons instead of the complete list
of words in the training documents.
Here are a few links for the lexicons:
• The General Inquirer:
  • http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
  • http://www.wjh.harvard.edu/~inquirer/homecat.htm
• LIWC (Linguistic Inquiry and Word Count)
  • http://www.liwc.net
  • Negative emotions (bad, weird, hate, problem, crap)
  • Positive emotions (love, wonderful, magnificent, lovely)
• Bing Liu's page on Opinion Mining
  • https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Not every classifier we build is a state-of-the-art system. We must refine
and fine-tune it after building.
What are the parameters to judge a classifier?
Contingency matrix: counts of true positives (TP), true negatives (TN),
false positives (FP) and false negatives (FN)
What is the accuracy of the system?
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Task: build a classifier to identify food items on web pages
Most of the tokens on a web page won’t be the name of a food item. Let us say
that there are 1000 words and only 10 of them are names of food items.
Suppose our classifier is a bogus one that always returns false for each
word it encounters. Then TP = 0, TN = 990, FP = 0 and FN = 10.
This means the accuracy of our system
= (TP + TN) / (TP + TN + FP + FN) = 990/1000 = 99%
Hence our 99% accurate system is not able to do what we needed, i.e.
detect food items.

We definitely need better parameters to judge our model, so we define two:
PRECISION and RECALL.
Precision is the percentage of selected items that are correct.
Recall is the percentage of correct items that the system was able to select.
PRECISION = TP / (TP + FP) = 0/0 = UNDEFINED
RECALL = TP / (TP + FN) = 0/10 = 0
These parameters judge our classifier more fairly: the recall is zero and
the precision is undefined, because nothing was selected at all.

The figures on the last slide didn’t give much insight into the roles of
these two parameters.
Consider a slightly better classifier, capable of selecting some true food
items, with TP = 10, FP = 20 and FN = 10:
Precision = 10/(10 + 20) = 33%
Recall = 10/(10 + 10) = 50%
These parameters judge the classifier fairly, so a combination of the two
measures can be a fair evaluation criterion. This combination is called the
F measure.
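The metrics above can be sketched as small functions. The slide does not give a formula for the F measure; the balanced F1 variant (the harmonic mean of precision and recall) is assumed here. Undefined ratios such as 0/0 are returned as None.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all decisions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of selected items that are correct; None if nothing selected."""
    return tp / (tp + fp) if (tp + fp) else None

def recall(tp, fn):
    """Fraction of correct items that were selected; None if there are none."""
    return tp / (tp + fn) if (tp + fn) else None

def f1_measure(p, r):
    """Balanced F measure (assumed variant): harmonic mean of p and r."""
    if p is None or r is None or (p + r) == 0:
        return None
    return 2 * p * r / (p + r)

# Bogus classifier: always answers "not a food item".
print(accuracy(tp=0, tn=990, fp=0, fn=10), precision(0, 0), recall(0, 10))

# Slightly better classifier from the slide.
p, r = precision(10, 20), recall(10, 10)
print(round(p, 2), r, f1_measure(p, r))
```

The harmonic mean stays low unless both precision and recall are reasonably high, which is why it is preferred over a plain average.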

Discriminative Language Models
• Maximum Entropy
• Support Vector Machines
Other Advanced Applications
• Named Entity Recognition
• POS Tagging
• Machine Translation & Probabilistic Parsing
• CFGs
• Language Grammars
Areas of Research
• Information Retrieval (Query Based and Generic)
• Question & Answering
• Summarization

Dharmendra Prasad
admin@techieme.in
