NLP: a developer’s perspective
- Vanya Seth & Dharmendra Prasad
NLP is a set of techniques for interpreting human languages and extracting meaning from them. That meaning can then be used by other humans (translators), by machines, or by software systems to achieve larger objectives.
Why do we need this?
• Not everyone speaks the same language (MT)
• Not all languages are written the same way (MT)
• We would like to automate tasks which are tedious (IE)
• We desire more accuracy and fewer human errors! (Automation)
1. Spam Detection
2. Spelling correction
3. Parts of Speech Tagging
4. Named Entity Recognition
1. Coreference Resolution
2. Information Extraction
3. Sentiment Analysis
4. Machine Translation
1. Answering Questions
2. Summarization
3. Paraphrasing
4. Dialog
• Ambiguity is pervasive
Headline: Republicans Grill IRS Chief Over Lost Emails
Meanings:
  • Republicans harshly question the chief about the emails
  • Republicans cook the chief using email as the fuel
• New ways of writing
  • Twitter hashtags
  • All Capitals
  • Abused notations (U for You, FB for Facebook, @ for At)
  • New words (Retweet, Unfriend, etc.)
  • Emoticons (e.g. :-) and :-( and many others)
We learn, we remember and we conquer.
• What tools do we use?
  • Knowledge about the language
  • Knowledge about the world
  • A way to combine knowledge sources
• How do we do this?
  • Probabilistic models built upon language data for inferring language properties
  • P(“fragrant” -> “rose”) is high
  • P(“awful” -> “love”) is low
What we mostly do in NLP is “text processing”: normalizing the text in one way or the other.
Word Tokenization, Text Search, Sentence Segmentation, Pattern Recognition, Disambiguating Words etc…
Text processing is important; two handy tools are:
Regular Expressions (http://regexpal.com)
Word Tokenization (http://sentiment.christopherpotts.net/tokenizing.html)
Typical questions: how many words are there in the text? What is the size of the vocabulary? And so on…
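A minimal sketch of such counting (Python, with a deliberately naive regex; real tokenizers such as the Potts tokenizer linked above also handle hashtags, emoticons and URLs):

```python
import re
from collections import Counter

text = "The weather is cold. How are you? The weather was hot."

# Naive word tokenization: lowercase, then grab runs of letters/apostrophes.
tokens = re.findall(r"[a-z']+", text.lower())
word_counts = Counter(tokens)

print("number of tokens:", len(tokens))       # how many words in the text
print("vocabulary size:", len(word_counts))   # how many distinct words
print("most common:", word_counts.most_common(3))
```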
The probabilistic approach
Sentence completion
How are _____?
The weather is ______.
Phrase Rearrangement
little had Mary a lamb
happy everyone during is holidays
How can we solve these problems?
Sentence completion – Probability(upcoming word)
Phrase rearrangement – Probability(occurrence of a sentence)
For both of these, we need machinery called a “language model”.
In simple words, a language model is a black box with prior knowledge about the language(s); for any question, we consult the LM for an appropriate answer.
Formally – A language model is a model which computes either P(upcoming
word) or P(occurrence of a sentence)
This depends on what kind of answer you want from the model.
An answer to the question:
P(upcoming word) – a list of (word, probability) pairs, or the word with the highest probability (or any word, depending on the complexity of the algorithm)
P(occurrence of a sentence) – a list of (sentence, probability) pairs, or the sentence with the highest probability (or any sentence, depending on the complexity of the algorithm)
How do we get these probabilities?
Goal: calculating the probability of a sequence of words.
P(The tiger is a fierce animal) =
P(The) * P(tiger | The) * P(is | The tiger) * P(a | The tiger is) * P(fierce | The tiger is a) * P(animal | The tiger is a fierce)
This is the joint probability of the sequence of words, by the chain rule of probability.
Each factor can be estimated by counting, e.g.
P(animal | The tiger is a fierce) = Count(The tiger is a fierce animal) / Count(The tiger is a fierce)
In practice it is usually sufficient to assume that
P(animal | the tiger is a fierce) ≈ P(animal | fierce) or P(animal | a fierce)
(a Markov assumption: only the last one or two words matter).
Hence, with the two-word approximation,
P(the tiger is a fierce animal) ≈ P(the) * P(tiger | the) * P(is | the tiger) * P(a | tiger is) * P(fierce | is a) * P(animal | a fierce)
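A toy sketch of this idea in Python (corpus and sentence invented for illustration): estimate bigram probabilities by counting, then score a sentence as a product of conditional probabilities.

```python
from collections import Counter

corpus = "the tiger is a fierce animal . the lion is a fierce cat .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    # Maximum-likelihood estimate: count(prev w) / count(prev).
    # Unseen bigrams get probability 0: the zero problem that smoothing,
    # discussed below, fixes.
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(sentence):
    words = sentence.split()
    prob = 1.0
    for prev, w in zip(words, words[1:]):
        prob *= p_bigram(w, prev)
    return prob  # for simplicity skips P(first word); real LMs use <s> markers

print(sentence_prob("the tiger is a fierce animal"))  # 0.25 on this toy corpus
```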
Vocabulary: Cold, You, They, How, Are, The, Weather, Hot, Is, Was
Strategy 1: Assign each word a probability based on the number of times it occurs in the corpus.
• How are Cold?
• How are You?
• How are They?
• How are How?
All these sentences are equally likely.
This model is called the unigram model.
Strategy 2: Assign probabilities based on some context. Look at one previous word.
• is cold, is hot, is gone
• are gone, are you, are they
• was cold, was hot
• they are, the weather, you are
Complete the sentence: The weather is ________
Three possible options are:
• The weather is hot
• The weather is cold
• The weather is gone
This model is called the bigram model.
Strategy 3: Start looking at more than one previous word.
• weather is cold, tea is hot
• they are gone, how are you, where are they
• water was cold, food was hot
Complete the sentence: The weather is ________
A possible option is: The weather is cold.
Complete the sentence: How are ________?
A possible option is: How are you?
This model is called the trigram model.
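A sketch of sentence completion with such a model (Python; the toy corpus is invented for illustration): predict the word that most often follows the last two words.

```python
from collections import Counter, defaultdict

corpus = "the weather is cold . the tea is hot . how are you ?".split()

# Count trigrams: (w1, w2) -> Counter of the words that follow.
following = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    following[(w1, w2)][w3] += 1

def complete(w1, w2):
    candidates = following[(w1, w2)]
    if not candidates:
        return None  # context never seen; a real system would back off to bigrams
    return candidates.most_common(1)[0][0]

print(complete("weather", "is"))  # -> 'cold'
print(complete("how", "are"))     # -> 'you'
```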
The computer I installed in the chemistry laboratory _______
Bigrams may result in :
The computer I installed in the chemistry laboratory equipment
Trigrams may result in :
The computer I installed in the chemistry laboratory apparatus
Long-Distance Dependencies: even a huge corpus of common n-grams cannot help a short context window capture a dependency like the one between “computer” and “crashed” in
“The computer I installed in the chemistry laboratory crashed”
A 2000-word vocabulary has 2000 * 2000 = 4,000,000 possible bigrams.
A document with 16000 tokens produces around 5000 unique bigrams.
This means only 5000/4,000,000, or 0.125%, of the possible bigrams are seen; the remaining 99.875% never occur.
The model therefore assigns zero probability to all the unseen bigrams; if we calculate the probability of a sentence containing a new bigram, it returns zero.
To avoid this we use a technique called smoothing. The simplest of all is Laplace’s Smoothing, or Add-One Smoothing.
Laplace’s Smoothing: pretend that you saw everything one more time. This solves the problem of unseen bigrams (or n-grams in general).
Applying Laplace’s smoothing, we calculate the probability as below:
Without smoothing: P(wi | wi-1) = count(wi-1 wi) / count(wi-1)
Add-one smoothing: P(wi | wi-1) = (count(wi-1 wi) + 1) / (count(wi-1) + |V|)
Why add |V| to the denominator?
Seeing each word one more time means seeing every unique word one more time, and the number of unique words is |V|, so the denominator increases by |V|.
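A minimal sketch of add-one smoothing over bigram counts (Python; the toy corpus is invented here):

```python
from collections import Counter

corpus = "the weather is cold the weather is hot".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size |V|

def p_add_one(w, prev):
    # (count(prev w) + 1) / (count(prev) + |V|)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add_one("cold", "is"))  # seen bigram: (1 + 1) / (2 + 5)
print(p_add_one("gone", "is"))  # unseen bigram: small but non-zero, (0 + 1) / (2 + 5)
```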
Corpora and tools available online:
SRI Language Model ToolKit : http://www.speech.sri.com/projects/srilm/
Google Books Ngram Viewer: https://books.google.com/ngrams
Google Ngram Corpus: http://googleresearch.blogspot.in/2006/08/all-our-n-
gram-are-belong-to-you.html
Typing is a manual process and not everyone types correctly; spelling mistakes occur frequently. There are two spelling tasks in such scenarios:
• Error Detection
• Error Correction
  • Auto-correct: hte -> the
  • Suggest one correction: theate -> theatre
  • Suggest a list of corrections: leding -> leading, lending
• Non-Word Errors
  • opportnity -> opportunity
  • graffe -> giraffe
• Real-Word Errors
  • Typographical errors: In dog we trust (dog -> god)
  • Cognitive errors (homophones):
    • good buy -> good bye
    • withdraw cache -> withdraw cash
Non-word errors – very easy to correct
• Detection
  • Have a word list from a dictionary of the language
  • Look each word up in the dictionary; if the word is not present, it is an error.
• Correction
  • Find candidate words which are similar to the error
  • Choose the best candidate based on any algorithm or model discussed next.
Real-word errors – tough to fix
• Detection
  • Straightforward detection is not possible: the error is itself a real word, so checking it against the dictionary proves nothing.
• Correction
  • Find candidate words which are similar (in pronunciation or spelling) to each word in the sentence
  • Do this for all the words in the sentence
  • Choose the best candidate based on any algorithm or model discussed next.
INSERT:     N M E -> N A M E
TRANSPOSE:  H T E -> T H E
DELETE:     A C R E S S -> A C R E S
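These edit operations suggest a simple way to generate correction candidates. A sketch in Python (in the spirit of Peter Norvig’s well-known spelling corrector, whose error list is linked later in these slides): enumerate every string one edit away from the misspelling, then keep only those found in a dictionary.

```python
import string

def edits1(word):
    """All strings one insert, delete, transpose or substitute away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

# A tiny stand-in dictionary; a real system would load a full word list.
dictionary = {"the", "actress", "across", "acres", "access", "cress", "caress"}

def candidates(word):
    return edits1(word) & dictionary

print(candidates("hte"))     # {'the'}
print(candidates("acress"))  # all six candidates from the table below
```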
The noisy channel is a probabilistic model which represents real-world conditions.
We run many guesses through the channel, and the one which best matches the noisy word is our correct word.
What are we trying to find?
For an observation x (the noisy word), we are looking for the word w (from the vocabulary) which maximizes the probability of the word given this noisy channel:
ŵ = argmax P(w|x)                over w ϵ V
  = argmax P(x|w) * P(w) / P(x)  (Bayes’ Rule)
  = argmax P(x|w) * P(w)
P(x) is constant for all w in the vocabulary, because x is the observation, so it can be dropped.
P(x|w) is called the channel model and P(w) is called the language model.
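A sketch of this decision rule in Python. All numbers below are invented placeholders; a real channel model would be read off the confusion matrices described next, and P(w) would come from a corpus-based language model.

```python
# Invented placeholder probabilities for the observation x = "hte".
P_w = {"the": 0.059, "thee": 0.00001}           # language model P(w)
P_x_given_w = {"the": 0.0004, "thee": 0.00002}  # channel model P(x|w)

def best_correction(candidates):
    # w_hat = argmax over w of P(x|w) * P(w); P(x) is constant and dropped.
    return max(candidates, key=lambda w: P_x_given_w[w] * P_w[w])

print(best_correction(["the", "thee"]))  # -> 'the'
```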
x = acress
Possible candidates:

Error (x) | Correction (w) | Correct Letter | Error Letter | Error Type
acress    | actress        | t              | -            | Deletion
acress    | cress          | -              | a            | Insertion
acress    | caress         | ca             | ac           | Transposition
acress    | access         | c              | r            | Substitution
acress    | across         | o              | e            | Substitution
acress    | acres          | -              | s            | Insertion
acress    | acres          | -              | s            | Insertion
Word frequencies taken from the Corpus of Contemporary American English (400,000,000 words).
The probability of ‘across’ is clearly the highest, so in a unigram language model our model will suggest the correction ‘across’.

Word    | Frequency | P(word)
actress | 9448      | 0.00002362
cress   | 220       | 0.00000055
caress  | 686       | 0.00000171
access  | 35310     | 0.00008827
across  | 105559    | 0.00026389
acres   | 12874     | 0.00003218
So far we have only used the probability of the words (the language model), which is one factor contributing to the correction.
The other factor, the channel model, still needs to be consulted for a better correction.
The channel model consults a tool called a confusion matrix to find the likelihood of each type of error.
Types of error: Insertion, Deletion, Transposition & Substitution.
Each confusion matrix gives the probability of a given type of error.
Word frequencies taken from the Corpus of Contemporary American English (400,000,000 words).
Combining both factors, ‘across’ (2.8) narrowly beats ‘actress’ (2.7), so the model suggests the correction ‘across’.

Word    | Correct Letter | Error Letter | x|w   | P(x|word) | P(word)    | Result (x 10^-9)
actress | t              | -            | c|ct  | 0.0001170 | 0.00002362 | 2.7
cress   | -              | a            | a|#   | 0.0000014 | 0.00000055 | 0.00078
caress  | ca             | ac           | ac|ca | 0.0000016 | 0.00000171 | 0.0028
access  | c              | r            | r|c   | 0.0000002 | 0.00008827 | 0.019
across  | o              | e            | e|o   | 0.0000093 | 0.00026389 | 2.8
acres   | -              | s            | es|e  | 0.0000321 | 0.00003218 | 1.0
acres   | -              | s            | ss|s  | 0.0000342 | 0.00003218 | 1.0
Wikipedia – Commonly misspelled English words:
https://en.wikipedia.org/wiki/Commonly_misspelled_English_words
Birkbeck Spelling Error Corpus: http://ota.ox.ac.uk/headers/0643.xml
Peter Norvig’s list of errors: http://norvig.com/ngrams/spell-errors.txt
Spam-like attributes:
• Undisclosed recipients
• Prize!!
• No name, lucky draw
• Suspicious URL
Text classification is the process of assigning given documents to various classes, e.g.:
• Is an e-mail SPAM or NOT?
• Is a product review POSITIVE or NEGATIVE?
And so on…
We need a classification model which works on the given inputs and produces the desired output.
Inputs:
• a document d
• a fixed set of classes C = {c1, c2, c3, …, cn}
Desired output:
• a predicted class c ϵ C such that the document belongs to that class c
Hand-coded rules
Based on combinations of words, e.g. a blacklisted sender, or words like Viagra, dollars, impress a girl.
Supervised machine learning
• Naïve Bayes
• Support Vector Machines
• Logistic Regression
• KNN (K Nearest Neighbors)
What is a sentiment?
It is an attitude: affectively colored beliefs or dispositions towards objects and persons (liking, loving, hating, valuing, desiring).
What is the task of sentiment analysis?
Detecting the attitude, the holder of the attitude, the target of the attitude, and the type of the attitude.
Types of analysis:
• Simplest: assign a binary value to a sentence or document
• Slightly more complex: rate on a scale of 1-10
• Toughest: detect the target and the source
1. The fragrance is just awesome, I love it
2. Keeps you going all day long, this is the best perfume
3. Seriously do you call it a perfume? It’s awful.
4. Thanks for this wonderful fragrance in the classy bottle. Great!!
5. I feel like being cheated after buying this piece of crap
6. If you are reading this because it is your darling fragrance, please wear it
at home exclusively, and tape the windows shut.
Attitude – awesome, best, classy, awful, crap, cheated, thanks
Target – perfume, bottle, fragrance
Holder – purchaser, user
We are basically extracting the opinion.
1. We prefer to watch movies after reading the reviews
2. We prefer to buy products after reading the reviews
3. We prefer to invest in stocks after understanding the market sentiment
4. We want a profitable asset
5. We don’t want to get cheated with odd surprises
6. We believe that we can predict the future (election results, market outcomes)
All of the above can, more or less, be addressed using the tool called ‘Sentiment Analysis’.
Input: test documents
• Tokenize the test documents
• Feature extraction (either all the tokens or a group of relevant tokens)
• Classification using any of the classifiers:
  • Naïve Bayes classifiers
  • Maximum Entropy classifiers
  • Support Vector Machine classifiers
Our goal:
For a document d and a set of classes C, we need to calculate the probability of each class given the document:
P(c|d) = P(d|c) * P(c) / P(d)
The class which maximizes P(c|d) is the class to which the document belongs.
• P(d|c) – conditional probability of the document given the class
• P(c) – prior probability of the class
What is the practical meaning of the prior P(c) and the likelihood P(d|c)?
P(c) is the probability of occurrence of the class, i.e. how often the class occurs in the corpus.
P(d|c) is the probability of the occurrence of some set of features (of a certain length) given a class, i.e. P(x1, x2, x3, …, xi | c): the joint probability of all the features given a class.
Calculating this joint probability requires an enormous number of parameters, maybe of the order |X|^n, and that many for each class. That would require an enormous amount of training data, which is mostly not available.
This looks complicated; we must try some simplifications!
Simplifying assumptions:
• The position of a word in the document doesn’t matter – Bag of Words
• Feature probabilities given a class are independent – Conditional Independence
This simplifies our model to
P(x1, x2, x3, …, xi | c) = P(x1|c) * P(x2|c) * P(x3|c) * … * P(xi|c)
and hence the score of a class given a document reduces to
P(c|d) ∝ P(c) * P(x1|c) * P(x2|c) * P(x3|c) * … * P(xn|c)
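A compact sketch of this classifier (Python; the tiny training set is invented for illustration), using the add-one-smoothed estimate P(w|c) = (count(w,c) + 1) / (count(c) + |V|) that the next slides apply:

```python
from collections import Counter, defaultdict
import math

# Invented toy training set of (document, class) pairs.
train = [
    ("delhi mumbai india", "India"),
    ("mumbai chennai hyderabad", "India"),
    ("lahore islamabad pakistan", "Pakistan"),
]

class_words = defaultdict(list)
for doc, c in train:
    class_words[c].extend(doc.split())

classes = list(class_words)
V = len({w for words in class_words.values() for w in words})
priors = {c: sum(1 for _, k in train if k == c) / len(train) for c in classes}
counts = {c: Counter(class_words[c]) for c in classes}

def log_score(doc, c):
    # log P(c) + sum over words of log P(w|c), with add-one smoothing.
    score = math.log(priors[c])
    denom = len(class_words[c]) + V
    for w in doc.split():
        score += math.log((counts[c][w] + 1) / denom)
    return score

def classify(doc):
    return max(classes, key=lambda c: log_score(doc, c))

print(classify("lahore hyderabad chennai islamabad"))  # -> 'Pakistan'
```

Working in log space avoids numerical underflow when multiplying many small probabilities.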
Calculating the priors: P(India) = 3/5 P(Pakistan) = 2/5
Calculating the conditional probabilities :
P(Delhi | India) = 0.15789473 P(Lahore | Pakistan) = 0.25
P(India | India) = 0.15789473 P(Islamabad | Pakistan) = 0.1875
P(Mumbai | India) = 0.2631579 P(Pakistan | Pakistan) = 0.1875
P(Chennai| India) = 0.10526316 P(Hyderabad | Pakistan) = 0.125
P(Hyderabad | India) = 0.15789473 P(India | Pakistan) = 0.0625
P(Lahore | India) = 0.05263158 P(Mumbai | Pakistan) = 0.0625
P(Islamabad | India) = 0.05263158 P(Chennai | Pakistan) = 0.0625
Vocabulary size: 8
P(w|c) = (count(w,c) + 1) / (count(c) + |V|)
Priors: P(India) = 3/5, P(Pakistan) = 2/5
Conditional probabilities:
P(Lahore | India) = 0.05263158    P(Lahore | Pakistan) = 0.25
P(Hyderabad | India) = 0.15789473 P(Hyderabad | Pakistan) = 0.125
P(Chennai | India) = 0.10526316   P(Chennai | Pakistan) = 0.0625
P(Islamabad | India) = 0.05263158 P(Islamabad | Pakistan) = 0.1875
Test doc: Lahore Hyderabad Chennai Islamabad
P(India | test doc) ∝ P(India) * P(Lahore | India) * P(Hyderabad | India) * P(Chennai | India) * P(Islamabad | India)
= 0.6 * 0.0526 * 0.1578 * 0.1052 * 0.0526 = 0.0000275
P(Pakistan | test doc) ∝ P(Pakistan) * P(Lahore | Pakistan) * P(Hyderabad | Pakistan) * P(Chennai | Pakistan) * P(Islamabad | Pakistan)
= 0.4 * 0.25 * 0.125 * 0.0625 * 0.1875 = 0.0001464
The Pakistan score is higher, so the test document is classified as Pakistan.
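The same computation in a few lines of Python, using the probabilities given above:

```python
priors = {"India": 0.6, "Pakistan": 0.4}
cond = {
    "India":    {"Lahore": 0.0526, "Hyderabad": 0.1578,
                 "Chennai": 0.1052, "Islamabad": 0.0526},
    "Pakistan": {"Lahore": 0.25, "Hyderabad": 0.125,
                 "Chennai": 0.0625, "Islamabad": 0.1875},
}

test_doc = ["Lahore", "Hyderabad", "Chennai", "Islamabad"]

for c in priors:
    score = priors[c]
    for w in test_doc:
        score *= cond[c][w]
    print(c, score)  # India ~ 0.0000275, Pakistan ~ 0.0001464 -> Pakistan wins
```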
Tokenization issues:
• Data is available online in HTML, XML and various other markup languages
• Twitter names, hashtags etc. pollute the data
• Phone numbers, short forms, new words, emoticons, etc.
Extracting features:
• Handling negations: “I didn’t like this movie” vs “I really liked this movie” (see the sketch after this list)
• Which words to use? Words like I, this, movie do not belong to the set of words which contribute to attitude.
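One common way to handle negation (a standard trick from the sentiment literature, not prescribed by these slides) is to mark every token between a negation word and the next punctuation, so that a plain “like” and a negated “like” become different features. A minimal sketch:

```python
import re

NEGATIONS = {"not", "no", "never", "didn't", "don't", "isn't", "wasn't"}

def mark_negation(text):
    """Append _NEG to tokens following a negation word, up to the next punctuation."""
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if tok in {".", ",", "!", "?", ";"}:
            negating = False
            out.append(tok)
        elif negating:
            out.append(tok + "_NEG")
        else:
            out.append(tok)
            if tok in NEGATIONS:
                negating = True
    return out

print(mark_negation("I didn't like this movie."))
# ['i', "didn't", 'like_NEG', 'this_NEG', 'movie_NEG', '.']
```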
Lexicons are the lists of words which matter for sentiment. It is better to train our models on these lexicons instead of the complete list of words in the training documents.
Here are a few links to lexicons:
• The General Inquirer:
  • http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
  • http://www.wjh.harvard.edu/~inquirer/homecat.htm
• LIWC (Linguistic Inquiry and Word Count):
  • http://www.liwc.net
  • Negative emotions (bad, weird, hate, problem, crap)
  • Positive emotions (love, wonderful, magnificent, lovely)
• Bing Liu's page on Opinion Mining:
  • https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Not every classifier we build is a state-of-the-art system; we must refine and fine-tune it after building.
What are the parameters for judging a classifier?
The contingency matrix.
What is the accuracy of the system?
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Task: build a classifier to identify food items on web pages.
Most of the tokens on a web page won’t be the name of a food item. Let us say there are 1000 words and only 10 of them are names of food items.
Suppose our classifier is a bogus one which always returns false for every word it encounters.
Then the accuracy of our system = (TP + TN) / (TP + TN + FP + FN) = 990/1000 = 99%.
Hence, our 99% accurate system is not able to do what we needed, i.e. detect food items.
We definitely need better parameters to judge our model, so we define two: PRECISION and RECALL.
Precision is the percentage of selected items that are correct.
Recall is the percentage of correct items that the system was able to select.
PRECISION = TP / (TP + FP) = 0/0 = UNDEFINED
RECALL = TP / (TP + FN) = 0/10 = 0
These parameters judge our classifier rather more fairly: here the recall is zero and the precision is undefined.
The figures on the last slide didn’t give much insight into the roles of these two parameters.
Consider a slightly better classifier, capable of selecting some of the true food items:
Precision = 10 / (10 + 20) = 33%
Recall = 10 / (10 + 10) = 50%
These parameters judge the classifier fairly, so a combination of the two makes a fair evaluation criterion. This combination is called the F measure, e.g. F1 = 2 * Precision * Recall / (Precision + Recall).
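A small sketch (Python) computing these metrics from raw counts, using the numbers from the two classifiers above:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else float("nan")  # undefined if nothing selected
    recall = tp / (tp + fn) if tp + fn else float("nan")
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Bogus classifier: selects nothing (TP=0, FP=0, FN=10).
print(precision_recall_f1(0, 0, 10))    # precision undefined (nan), recall 0.0
# Slightly better classifier: TP=10, FP=20, FN=10.
print(precision_recall_f1(10, 20, 10))  # (0.333..., 0.5, 0.4)
```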
Discriminative language models:
• Maximum Entropy
• Support Vector Machines
Other advanced applications:
• Named Entity Recognition
• POS Tagging
• Machine Translation & Probabilistic Parsing
• CFGs
• Language Grammars
Areas of research:
• Information Retrieval (query-based and generic)
• Question & Answering
• Summarization
Dharmendra Prasad
admin@techieme.in