Topics in Computational Linguistics
Week 5: N-grams and Language Models
Shu-Kai Hsieh
Lab of Ontologies, Language Processing and e-Humanities
GIL, National Taiwan University
March 28, 2014
1 N-grams model
  • Evaluation
  • Smoothing Techniques
2 Web-scaled N-grams
3 Related Topics
4 The Entropy of Natural Languages
5 Lab
Language models
• Statistical/probabilistic language models aim to compute
  • either the probability of a sentence or sequence of words,
    P(S) = P(w_1, w_2, w_3, ..., w_n), or
  • the probability of an upcoming word,
    P(w_n | w_1, w_2, w_3, ..., w_{n-1})
  (which will turn out to be closely related to computing the probability of a sequence of words).
• The N-gram model is one of the most important tools in speech and language processing.
• It has varied applications: spell checking, MT, speech recognition, QA, etc.
Simple n-gram model
• Let's start with calculating P(S), say,
  P(S) = P(學, 語言, 很, 有趣) ("learning languages is fun")
Review of Joint and Conditional Probability
• Recall that the conditional probability of X given Y, P(X|Y), is defined in terms of the probability of Y, P(Y), and the joint probability of X and Y, P(X, Y):

  P(X|Y) = P(X, Y) / P(Y)
Review of the Chain Rule of Probability
Conversely, the joint probability P(X, Y) can be expressed in terms of the conditional probability P(X|Y):

  P(X, Y) = P(X|Y) P(Y)

which leads to the chain rule:

  P(X_1, X_2, X_3, ..., X_n)
    = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) ... P(X_n|X_1, ..., X_{n-1})
    = P(X_1) ∏_{i=2}^{n} P(X_i|X_1, ..., X_{i-1})
The Chain Rule applied to the joint probability of words in a sentence
By the chain rule of probability,

  P(S) = P(w_1^n) = P(w_1) P(w_2|w_1) P(w_3|w_1^2) ... P(w_n|w_1^{n-1})
       = ∏_{k=1}^{n} P(w_k|w_1^{k-1})
       = P(學) · P(語言|學) · P(很|學 語言) · P(有趣|學 語言 很)
How to Estimate these Probabilities?
• Maximum Likelihood Estimation (MLE): simply count occurrences in a corpus and normalize the counts so that they lie between 0 and 1. (There are of course more sophisticated algorithms.)¹

Count and divide:

  P(嗎 | 學 語言 很 有趣) = Count(學 語言 很 有趣 嗎) / Count(學 語言 很 有趣)

¹ MLE is sometimes called relative frequency estimation.
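A minimal sketch of this count-and-divide idea in Python, shown with bigrams over a tiny made-up corpus (the corpus and its segmentation are illustrative assumptions, not data from the slides):

    from collections import Counter

    corpus = [["學", "語言", "很", "有趣", "嗎"],
              ["學", "語言", "很", "有趣"],
              ["學", "語言", "很", "難"]]

    # Count unigrams and bigrams over the toy corpus
    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

    def p_mle(w, prev):
        """MLE estimate: P(w | prev) = Count(prev, w) / Count(prev)."""
        return bigrams[(prev, w)] / unigrams[prev]

    print(p_mle("有趣", "很"))  # 2/3 in this toy corpus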
Markov Assumption: Don't look too far into the past
Simplified idea: instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words:

  P(嗎 | 學 語言 很 有趣) ≈ P(嗎 | 有趣), or
  P(嗎 | 學 語言 很 有趣) ≈ P(嗎 | 很 有趣)
In other words
• Bi-gram model: approximate the probability of a word given all the previous words, P(w_n|w_1^{n-1}), by the conditional probability given only the preceding word, P(w_n|w_{n-1}). This generalizes to the N-gram case as

  P(w_n|w_1^{n-1}) ≈ P(w_n|w_{n-N+1}^{n-1})

• Tri-gram: (your turn)
• We can extend to trigrams, 4-grams, 5-grams, knowing that in general this is an insufficient model of language, because language has long-distance dependencies:
  我 在 一 個 非常 奇特 的 機緣巧合 之下 學 梵文
  ("Under a very peculiar coincidence of circumstances, I learned Sanskrit.")
In other words
• So given the bigram assumption for the probability of an individual word, we can compute the probability of the entire sentence as

  P(S) = P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k|w_{k-1})

• Recall the MLE equations (4.13)-(4.14) in the JM book.
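A sketch of the same idea applied to a whole sentence under the bigram assumption, padding with hypothetical <s>/</s> boundary symbols (all names and data are illustrative):

    from collections import Counter
    from math import prod

    corpus = [["<s>", "學", "語言", "很", "有趣", "</s>"],
              ["<s>", "學", "語言", "很", "難", "</s>"]]

    unigrams = Counter(w for s in corpus for w in s)
    bigrams = Counter(pair for s in corpus for pair in zip(s, s[1:]))

    def sentence_prob(words):
        """P(S) ≈ product of P(w_k | w_{k-1}), with boundary padding."""
        padded = ["<s>"] + words + ["</s>"]
        return prod(bigrams[(a, b)] / unigrams[a]
                    for a, b in zip(padded, padded[1:]))

    print(sentence_prob(["學", "語言", "很", "有趣"]))  # 0.5 on this toy corpus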
Example: Language Modeling of Alice.txt
Exercise
• Walk through the example of the Berkeley Restaurant Project sentences (JM pp. 90-91).

BTW, we usually do everything in log space to avoid underflow (also, adding is faster than multiplying):

  log(p1 · p2 · p3) = log p1 + log p2 + log p3
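A quick illustration (the values are made up) of why log space matters:

    import math

    probs = [0.001] * 300                     # a long sentence of low-prob words
    print(math.prod(probs))                   # underflows to 0.0
    print(sum(math.log(p) for p in probs))    # fine: about -2072.3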
Google n-gram and Google Suggestion
Generating the Wall Street Journal vs. Generating Shakespeare
• The quadrigram text looks like Shakespeare because it is Shakespeare.
• The N-gram model is very sensitive to the training corpus: an overfitting issue.
• N-grams only work well for word prediction if the test corpus looks like the training corpus; in real life, it often doesn't.
• We need to train more robust models that generalize, e.g., handle the zeros issue: things that never occur in the training set but do occur in the test set.
Evaluation
Evaluating n-gram models
How good is our model? How can we make it better (more robust)?
• N-gram language models are evaluated by separating the corpus into a training set and a test set, training the model on the training set, and evaluating it on the test set. An evaluation metric tells us how well our model does on the test set.
• Extrinsic (in vivo) evaluation.
• Intrinsic evaluation: the perplexity (2^H) of the language model on a test set is used to compare language models.
Evaluating the N-gram Model
But the model relies heavily on the corpus it was trained on, and thus often overfits!

Example
• Given a vocabulary of 20,000 types, the potential number of bigrams is 20,000² = 400,000,000, and with trigrams it amounts to the astronomical figure of 20,000³. No corpus yet has the size to cover all the corresponding word combinations.
• MLE gives no hint on how to estimate the probabilities of unseen n-grams.
• Here we use smoothing (or discounting) techniques to estimate the probabilities of unseen n-grams, presumably because a distribution without zeros is smoother than one with zeros.
Perplexity
• The best language model is the one that best predicts an unseen test set (i.e., gives it the highest probability).
• Perplexity is defined as the inverse probability of the test set, normalized by the number of words.
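Concretely, following the standard J&M definition, for a test set W = w_1 w_2 ... w_N:

  PP(W) = P(w_1 w_2 ... w_N)^(-1/N)

so minimizing perplexity is the same as maximizing the test-set probability. A small sketch with made-up per-word probabilities:

    import math

    word_probs = [0.2, 0.1, 0.25, 0.05]        # illustrative values only
    log2_p = sum(math.log2(p) for p in word_probs)
    pp = 2 ** (-log2_p / len(word_probs))
    print(pp)                                  # ≈ 7.95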
Smoothing Techniques
The intuition of smoothing (from Dan Klein)
Smoothing Techniques
Smoothing n-gram probabilities
• Sparse data: the corpus is not big enough to cover all the bigrams with realistic estimates.
• Smoothing algorithms provide a better way of estimating the probabilities of n-grams than Maximum Likelihood Estimation.
Smoothing Techniques
• Laplace Smoothing (a.k.a. add-one method)
• Interpolation
• Backoff
• Good-Turing Estimation (Discounting)
• Kneser-Ney Smoothing
Laplace Smoothing
• Pretend we saw each word one more time than we actually did.
• Re-estimate the counts by just adding one to all the counts!
• Read the BeRP examples (JM pp. 99-100).
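A minimal sketch, reusing the toy bigram/unigram Counter objects from the earlier sketch; V is the vocabulary size:

    def p_laplace(w, prev, bigrams, unigrams):
        """Add-one estimate: P(w | prev) = (C(prev, w) + 1) / (C(prev) + V)."""
        V = len(unigrams)  # vocabulary size
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)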
Laplace Smoothing: Comparing with Raw Bigram Counts
Laplace Smoothing: It's a blunt estimation
• Too much probability mass is moved to all the zeros.
• The guest upstages the host (喧賓奪主): to accommodate the huge number of zeros, the count for "Chinese food" can shrink by a factor of 10!
(Katz) Backoff and Interpolation
Intuition
Sometimes it helps to use less context: condition on less context for contexts you haven't learned much about.
• Backoff and interpolation are two further strategies that utilize n-grams of variable length.
• Backoff: use the trigram if you have good evidence; otherwise the bigram; otherwise the unigram.
• Interpolation: mix unigram, bigram, and trigram.
Katz Back-off
• The idea is to use the frequency of the longest available n-gram, and if that n-gram is unavailable, to back off to the (n-1)-gram, then to the (n-2)-gram, and so on.
• If n = 3, we first try trigrams, then bigrams, and finally unigrams.
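For the bigram case, the standard formulation (following JM) is

  P_katz(w_n | w_{n-1}) = P*(w_n | w_{n-1})     if C(w_{n-1} w_n) > 0
                        = α(w_{n-1}) · P(w_n)   otherwise

where P* and α are explained on the next slide.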
P* and α?
• P*: the discounted probability, rather than the MLE probability, obtained with a method such as Good-Turing.
• α: the normalizing factor.
Smoothing Techniques
Linear Interpolation 線性插值
將高階模型和低階模型作線性組合
• Simple interpolation
• Lambdas conditional on context
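For trigrams, simple interpolation (the first variant above) is

  P̂(w_n | w_{n-2} w_{n-1}) = λ_1 P(w_n | w_{n-2} w_{n-1}) + λ_2 P(w_n | w_{n-1}) + λ_3 P(w_n), with λ_1 + λ_2 + λ_3 = 1

The λs are typically set to maximize the likelihood of held-out data; in the second variant, they are conditioned on the context.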
Advanced Discounting Techniques
Intuition
Use the count of things you've seen once to help estimate the count of things you've never seen.
• Good-Turing
• Witten-Bell
• Kneser-Ney
Good-Turing Smoothing: Notations
• A word or N-gram (or any event) that occurs once is called a singleton, or a hapax legomenon.
• N_c: the number of things we've seen c times, i.e., the frequency of frequency c.

Example (in terms of bigrams)
N_0 is the number of bigrams with count 0, N_1 the number of bigrams with count 1 (singletons), etc.
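A minimal sketch of the frequency-of-frequency table and the Good-Turing adjusted count c* = (c+1) N_{c+1} / N_c (toy counts; real implementations also smooth the N_c values themselves):

    from collections import Counter

    def good_turing(ngram_counts):
        N = Counter(ngram_counts.values())   # N_c: how many n-grams occur c times
        def c_star(c):
            return (c + 1) * N[c + 1] / N[c] if N[c] else 0.0
        return N, c_star

    counts = Counter({("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1, ("d", "e"): 2})
    N, c_star = good_turing(counts)
    print(N[1], c_star(1))   # 3 singletons; c*(1) = 2 * N_2 / N_1 ≈ 0.67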
Good-Turing Smoothing: Intuition
See [2], pp. 101-102.
Good-Turing Smoothing: Answer
Other advanced Smoothing Techniques
How to deal with huge web-scaled n-grams
How might one build a language model (n-gram model) that scales to very large amounts of training data?
• Naive pruning: only store N-grams with count ≥ some threshold, and remove singletons of higher-order n-grams.
• Entropy-based pruning
Smoothing for Web-scaled N-grams
"Standard backoff" uses variations of context-dependent backoff, where the p's are pre-computed and stored probabilities, and the λ's are back-off weights.
Smoothing for Web-scaled N-grams
"Stupid backoff" [1] applies no discounting and instead directly uses relative frequencies (S is used instead of P to emphasize that these are scores rather than probabilities).
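A sketch of the recursion from Brants et al. [1], with their recommended back-off factor λ = 0.4; it assumes a single Counter mapping word tuples of every order to counts, and N = the total token count:

    def stupid_backoff(word, context, counts, N, lam=0.4):
        """S(word | context), where context is a tuple of preceding words."""
        if not context:
            return counts[(word,)] / N          # unigram relative frequency
        full = context + (word,)
        if counts[full] > 0:
            return counts[full] / counts[context]
        return lam * stupid_backoff(word, context[1:], counts, N, lam)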
LM Tools and n-gram Resources
• CMU Statistical Language Modeling Toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit.html
• SRILM: http://www.speech.sri.com/projects/srilm/
• Google Web1T 5-gram: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
• Google Books N-grams
• Chinese Web 5-gram: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06
Quick demo of CMU-LM
Google Books Ngrams
From Corpus-based to Google-based Linguistics
Enhancing Linguistic Search with the Google Books Ngram Viewer
From Corpus-based to Google-based Linguistics
Syntactic N-grams are coming out too!
http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
Exercise
The Google Web 1T 5-Gram Database — SQLite Index & Web Interface
Applications
What can next-word prediction (based on probabilistic language models) do today?
(source: fandywang, 2012)
You’d definitely like to try this
An Automatic CS Paper Generator
http://pdos.csail.mit.edu/scigen/
Collocations
• Collocations are recurrent combinations of words.

Example
• Simple collocations are fixed n-grams, such as "the Wall Street".
• Collocations with predicative relations involve morpho-syntactic variation, such as the one linking make and decision: to make a decision, decisions to be made, made an important decision, etc.
Collocations
• Statistically, collocates are events that co-occur more often than by chance.
• Measures used to quantify the strength of word preference include Mutual Information, the t-score, and the likelihood ratio.
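For reference, (pointwise) mutual information compares a pair's joint probability with what independence would predict:

  MI(x, y) = log_2 [ P(x, y) / (P(x) P(y)) ] ≈ log_2 [ N · C(x, y) / (C(x) C(y)) ]

for a corpus of N tokens; high values indicate that x and y co-occur far more often than chance would predict.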
Lab
• ngramr (R package) for the Google Books Ngram data
• Python NLTK [see the extra IPython notebook]

Example
For newbies in Python: https://www.coursera.org/course/interactivepython
For a quick start (develop and host Python from your browser): https://www.pythonanywhere.com/
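A possible warm-up for the lab; a sketch assuming NLTK and its Gutenberg corpus are available (carroll-alice.txt is the Alice text modeled earlier):

    import nltk
    from collections import Counter

    nltk.download("gutenberg", quiet=True)
    words = [w.lower() for w in nltk.corpus.gutenberg.words("carroll-alice.txt")]

    unigrams = Counter(words)
    bigrams = Counter(nltk.bigrams(words))

    # MLE bigram distribution over continuations of "the"
    the_next = {w2: c / unigrams["the"]
                for (w1, w2), c in bigrams.items() if w1 == "the"}
    print(sorted(the_next.items(), key=lambda kv: -kv[1])[:5])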
Homework.week5
80%: Exercise 4.3 in the JM book, p. 122.
20%: Preview chapter 5 [2].
Homework.week6
20%: Read the Academia Sinica Balanced Corpus manual (http://app.sinica.edu.tw/kiwi/mkiwi/98-04.pdf) and preview chapter 6.
80%: Implement a language model of the 服貿 (Cross-Strait Service Trade Agreement) debate (data will be provided), and use it to build an automatic PRO/CON text generator.
References
[1] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
[2] Dan Jurafsky and James H. Martin. Speech & Language Processing. Pearson Education India, 2000.