Natural Language Processing
Unit 2 – Tagging Problems and HMM
Anantharaman Narayana Iyer
narayana dot Anantharaman at gmail dot com
5th Sep 2014
References
• A tutorial on hidden Markov models and selected
applications in speech recognition, L Rabiner (cited
by over 19395 papers!)
• Fundamentals of speech recognition by L Rabiner,
et al, Pearson
• Discrete-Time processing of Speech Signals John R
Deller et al, IEEE Press Classic release
• Week 2 Language Modeling Problem, Parameter
Estimation in Language Models lectures of
Coursera course on Natural Language Processing
by Michael Collins
• Youtube videos:
https://www.youtube.com/watch?v=TPRoLreU9lA
Objective of today’s and the next class
• Introduce, using simple examples, the theory of HMM
• Mention the applications of HMM
• Forward, Backward algorithm, Viterbi algorithm
• Introduce the tagging problem and describe its relevance to NLP
• Describe the solutions to the tagging problems using HMM
• Prerequisites for this class:
• Mathematics for Natural Language Processing (Covered in Unit 1)
• Markov processes
• HMM intuition (Covered yesterday)
Motivation: Extending Markov Processes
• Suppose we are using Markov Processes to perform spelling error
detection/correction, given the input as a typed word
• In this example, each input symbol is unambiguous and can be mapped directly to the underlying alphabet
• What happens when the input is not a typed word but is a handwritten word?
• In this case, the input is a sequence of pen strokes, curves, angles, lines etc. The input itself
is stochastic in nature and hence mapping this to an underlying set of symbols is not
straightforward
• Hence for such inputs, Markov processes fall short as they make two simplifying
assumptions that don’t hold any longer:
• Markov model assumes that the input is a sequence of symbols
• Input symbols directly correspond to the states of the system
• These assumptions are too restrictive for many real-world problems and hence the model needs to be extended
Hidden Markov Models
• Hidden Markov Models provide a more powerful framework as they allow the
states to be separated from the input without requiring a direct mapping.
• In HMM, the states are hidden and abstract while the inputs are observables
• E.g., in handwriting recognition, the input that corresponds to “A” can be written in many ways, while its representation in the computer is unique and deterministic.
• Thus, the input space may be discrete or continuous variables that are observed
while the states are discrete, abstract and hidden
• The observation is a probabilistic function of the state. The resulting model is a
doubly embedded stochastic process.
• The underlying stochastic process is not directly observable but can be observed
only through another set of stochastic processes that produce the sequence of
observations
Hidden Markov Model: Formalization
• HMM is a stochastic finite automaton specified by a 5-tuple:
HMM = (N, M, A, B, π) where:
N = Number of states (hidden). In general all states are fully connected (Ergodic Model). However, other
topologies are also found useful, for example in speech applications
M = Number of distinct observation symbols per state, where individual symbols are from the set: V = {v1, v2, …, vM}
A = State transition matrix whose elements aij give the probability of transitioning to state j from state i. That is: aij = P(qt+1 = j | qt = i), 1 ≤ 𝑖, 𝑗 ≤ 𝑁
B = Emission probability distribution of the form {bi(k)} where bi(k) is the probability of emitting the observable vk when the HMM is in state i. This is of the form: bi(k) = P(xt = vk | qt = i), 1 ≤ 𝑖 ≤ 𝑁, 1 ≤ 𝑘 ≤ 𝑀
π = probability of initial state distribution of the form {πi} where πi = P(q1 = i)
• The above can be represented by a compact notation:
𝜆 = (𝐴, 𝐵, 𝜋)
Trellis Diagram
• An HMM can be graphically depicted by a trellis diagram
[Trellis diagram: hidden states Y1, Y2, …, Yn shown in sequence, each emitting the corresponding observation X1, X2, …, Xn]
Another example: Coin tossing
[Figure: coin-tossing HMMs. Fig (a): two states labelled 1 and 2 with transition probabilities P(H) and 1 − P(H). Fig (b): two fully connected hidden states with transition probabilities a11, a12, a21, a22. Fig (c): three fully connected hidden states with transition probabilities aij.]
Emission probabilities for the 3-state model (Fig (c)):
        State 1   State 2   State 3
P(H)    0.5       0.75      0.25
P(T)    0.5       0.25      0.75
Exercise
• Consider the 3 state coin tossing model as shown in the previous slide and
suppose the state transition probabilities have a uniform distribution of 1/3 for
each aij.
Suppose you observe a sequence: O = (H H H H T H T T T T)
What state sequence is most likely? What is the probability of the observation sequence and
this most likely state sequence?
• Solution:
As all state transitions are equally probable, the most likely sequence is the one
where the probability of each individual observation is maximum. For H, the
most likely state is 2 and for T it is 3. Hence the state sequence that is most
likely is: q = (2 2 2 2 3 2 3 3 3 3)
P(O, q | 𝜆) = (0.75)^10 × (1/3)^10
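As a quick numerical check, this value can be evaluated directly (a minimal sketch; the interpretation of the two factors is the one given in the solution above):

# Each of the 10 observations is emitted with probability 0.75 in its most
# likely state, and each of the 10 state choices has probability 1/3.
p = (0.75 ** 10) * ((1.0 / 3.0) ** 10)
print(p)   # roughly 9.5e-07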
HMM as a Generator
• One can view the HMM as a generator that emits observations according to the emission distribution of the current state, with state transitions occurring according to the transition probabilities (see the sampling sketch below)
• In reality the HMM doesn’t “generate” the observable.
• But we imagine the HMM as generating the observation sequence; for problems where we must determine the hidden state sequence, the state sequence with the maximum probability of generating the observed sequence is taken as the output of the HMM.
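The generator view can be made concrete with a small sampling routine. This is only a sketch: the function name and the A, B, pi dictionaries are my assumptions, shaped like the JSON representation on the next slide.

import random

def sample_sequence(A, B, pi, n):
    """Generate n observations by running the HMM as a generator."""
    def draw(dist):
        # Draw one key from a {outcome: probability} dictionary.
        r, total = random.random(), 0.0
        for outcome, p in dist.items():
            total += p
            if r <= total:
                return outcome
        return outcome   # guard against floating-point rounding
    states, observations = [], []
    state = draw(pi)                         # initial state from pi
    for _ in range(n):
        states.append(state)
        observations.append(draw(B[state]))  # emit a symbol as per B
        state = draw(A[state])               # move to the next state as per A
    return states, observations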
Representing the model 𝜆 = (𝐴, 𝐵, 𝜋) – JSON example
{
"hmm":
{
"A": {
"Coin 1" : {"Coin 1": 0.1, "Coin 2": 0.3, "Coin 3": 0.6},
"Coin 2" : {"Coin 1": 0.8, "Coin 2": 0.13, "Coin 3": 0.07},
"Coin 3" : {"Coin 1": 0.33, "Coin 2": 0.33, "Coin 3": 0.34}
},
"B": {
"Coin 1" : {"Heads": 0.15, "Tails": 0.85},
"Coin 2" : {"Heads": 0.5, "Tails": 0.5},
"Coin 3" : {"Heads": 0.5, "Tails": 0.5}
},
"pi": {"Coin 1": 0.33, "Coin 2": 0.33, "Coin 3": 0.34}
}
}
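The implementations on the following slides assume an object with attributes states, A, B and pi (and methods such as forward defined on it). A minimal class that builds that shape from the JSON above might look like this; it is a sketch, not the author's actual class, and the file name coins.json used in later examples is an assumption.

import json

class HMM:
    def __init__(self, model):
        # model may be a path to a JSON file like the one above, or an
        # already-built {"A": ..., "B": ..., "pi": ...} dictionary.
        if isinstance(model, str):
            with open(model) as f:
                model = json.load(f)["hmm"]
        self.A = model["A"]                # state transition probabilities
        self.B = model["B"]                # emission probabilities
        self.pi = model["pi"]              # initial state distribution
        self.states = list(self.A.keys())  # hidden state names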
Applying HMM – the three basic problems
1. Given the observation sequence X and the model 𝜆 = (A, B, 𝜋), how
do we efficiently compute P(X| 𝜆), the probability of an observation
sequence given the model?
• Evaluation Problem
2. Given the observation sequence X and the model 𝜆 = (A, B, 𝜋), how
do we determine the state sequence Y that best explains the
observation?
• Decoding Problem
3. How do we adjust the model parameters 𝜆 = (A, B, 𝜋) to maximize
P(X| 𝜆), the probability of an observation sequence given the
model?
• Training Problem
Dynamic Programming Characteristics
• Dynamic Programming is an algorithm design technique that can
improve the performance significantly compared to brute force
methods
• Define the problem so that the solution can be obtained by solving sub-problems and reusing the partial solutions obtained while solving those sub-problems.
• Overlapping sub-problems: A recursive solution to the main problem contains many repeating sub-problems (see the memoization sketch below)
• Optimal substructure: Solutions to the sub-problems can be combined to provide an optimal solution to the larger problem
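As a quick illustration outside the HMM setting (an example of mine, not from the slides), memoized Fibonacci shows both properties: the naive recursion repeats sub-problems, and caching the partial solutions yields the answer to the larger problem efficiently.

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each fib(k) is an overlapping sub-problem; lru_cache stores the partial
    # solutions so every sub-problem is solved only once.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))   # 12586269025, computed in linear rather than exponential time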
Forward Algorithm - Intuition
• Our goal is to determine the probability of a
sequence of observations (X1, X2, …, Xn) given 𝜆
• In the forward algorithm approach, we divide the sequence X into sub-sequences, compute their probabilities, and store them in a table for later use.
• The probability of a larger sequence is obtained by
combining the probabilities of these smaller
sequences.
• Specifically, we compute the joint probability of a
sub-sequence starting from time t = 1 where the
sub-sequence ends on a state y. We compute:
P(X1:t, Yt | 𝜆)
• We then compute P(X1:n | 𝜆) by marginalizing Y
[Trellis fragment: at times t and t+1 the hidden states Yt and Yt+1 each range over the values y1, y2, y3 and emit the corresponding observations Xt and Xt+1]
Forward Algorithm
• Goal: Compute P(Yk, X1:k) assuming the model parameters to be known
• Approach: Since we know the emission and transition probabilities, we factorize the joint distribution P(Yk, X1:k) in terms of the known parameters and solve. To implement this efficiently we use dynamic programming, where the large problem is solved by solving overlapping sub-problems and combining their solutions. To do this we set up a recursion.
We can write: X1:k = X1, X2, …, Xk−1, Xk
From the sum rule we know: P(X = xi) = Σj P(X = xi, Y = yj)
P(Yk, X1:k) = Σ over yk−1 = 1…m of P(Yk, Yk−1, X1:k)
P(Yk, X1:k) = Σ over yk−1 of P(X1:k−1, Yk−1, Yk, Xk)
From the product rule the above factorizes to:
Σ over yk−1 of P(X1:k−1) P(Yk−1 | X1:k−1) P(Yk | Yk−1, X1:k−1) P(Xk | Yk, Yk−1, X1:k−1)
= Σ over yk−1 of P(X1:k−1) P(Yk−1 | X1:k−1) P(Yk | Yk−1) P(Xk | Yk), using the Markov and conditional independence assumptions
Noting that P(X1:k−1) P(Yk−1 | X1:k−1) = P(X1:k−1, Yk−1) = αk−1(Yk−1), we can write:
αk(Yk) = Σ over yk−1 of P(Yk | Yk−1) P(Xk | Yk) αk−1(Yk−1)
Initialization: α1(Y1) = P(Y1, X1) = P(Y1) P(X1 | Y1)
We can now compute the different α values
Forward Algorithm Summary
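Restating the result of the derivation above (consistent with the implementation on the next slide):
Initialization: α1(y) = P(Y1 = y) P(X1 | Y1 = y)
Recursion: αk(y) = Σ over yk−1 of P(Yk = y | Yk−1 = yk−1) P(Xk | Yk = y) αk−1(yk−1)
Termination: P(X1:n | 𝜆) = Σ over yn of αn(yn)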
Forward Algorithm: Implementation
def forward(self, obs):
    self.fwd = [{}]
    # Initialize base cases (t == 0)
    for y in self.states:
        self.fwd[0][y] = self.pi[y] * self.B[y][obs[0]]
    # Recursion (t > 0): sum over all states at the previous time step
    for t in range(1, len(obs)):
        self.fwd.append({})
        for y in self.states:
            self.fwd[t][y] = sum((self.fwd[t-1][y0] * self.A[y0][y] * self.B[y][obs[t]]) for y0 in self.states)
    # Termination: marginalize over the states at the final time step
    prob = sum((self.fwd[len(obs) - 1][s]) for s in self.states)
    return prob
Demo – Let’s toss some coins!
• And see how probable getting some 10 consecutive heads is!
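A possible way to run this demo, assuming the HMM class sketched earlier (with the forward method defined on it) and the JSON model saved as coins.json:

hmm = HMM("coins.json")
obs = ["Heads"] * 10
print(hmm.forward(obs))   # P(10 consecutive heads | the coin model)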
Backward Algorithm - Intuition
• Our goal is to determine the probability of the remaining observations given the state reached at time k: P(Xk+1, Xk+2, …, Xn | Yk, 𝜆)
• Given that the HMM has seen k observations and
ended up in a state Yk = y, compute the probability of
the remaining part: Xk+1, Xk+2, …, Xn
• We form the sub-sequences starting from the last
observation Xn and proceed backward to the first.
• Specifically, we compute the conditional probability
of a sub-sequence starting from k+1 and ending in n,
where the state at k is given.
• We can compute P(X1:n | 𝜆) by marginalizing over Y. The probability of an observation sequence computed by the backward algorithm will be equal to that computed with the forward algorithm.
[Trellis fragment: at times t and t+1 the hidden states Yt and Yt+1 each range over the values y1, y2, y3 and emit the corresponding observations Xt and Xt+1; the backward pass works from the last observation towards the first]
Backward Algorithm - Derivation
• Goal: Compute P(Xk+1:n|Yk) assuming the model parameters to be
known
• P(Xk+1:n | Yk) = Σ over yk+1 = 1…m of P(Xk+1:n, Yk+1 | Yk)
• To be explained on the board as typing a lot of equations is a pain!
• Basically we need to do the following:
• Express P(Xk+1:n|Yk) in a form where we can combine it with another random
variable and marginalize to get the required probability. We do this because our
strategy is to find a recurrence.
• Factorize the joint distribution obtained as above using chain rule
• Apply Markov assumptions and conditional independence property in order to
bring the factors to a form where the factors are known from our model
• Determine the base cases and termination conditions (the resulting recurrence is sketched below)
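Carrying out these steps yields the following recurrence (stated here in the same notation as the forward derivation, consistent with the implementation on the next slide):
Define βk(Yk) = P(Xk+1:n | Yk). Then:
βk(Yk) = Σ over yk+1 of P(Yk+1 | Yk) P(Xk+1 | Yk+1) βk+1(Yk+1)
Base case: βn(Yn) = 1 for every state
Termination: P(X1:n | 𝜆) = Σ over y1 of P(Y1 = y1) P(X1 | Y1 = y1) β1(y1)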
Backward Algorithm: Summary
Backward Algorithm Implementation
def backward(self, obs):
    self.bwk = [{} for t in range(len(obs))]
    T = len(obs)
    # Base case: beta at the last time step is 1 for every state
    for y in self.states:
        self.bwk[T-1][y] = 1
    # Recursion: work backward from T-2 down to 0
    for t in reversed(range(T-1)):
        for y in self.states:
            self.bwk[t][y] = sum((self.bwk[t+1][y1] * self.A[y][y1] * self.B[y1][obs[t+1]]) for y1 in self.states)
    # Termination: fold in the initial distribution and the first emission
    prob = sum((self.pi[y] * self.B[y][obs[0]] * self.bwk[0][y]) for y in self.states)
    return prob
Demo – Let’s run the backward algorithm!
• And see if it computes the same value of probability for the given
observation as computed by the forward algorithm!
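A possible check, again assuming the HMM class and coins.json from the earlier sketches, with both methods defined on it:

hmm = HMM("coins.json")
obs = ["Heads"] * 10
# Both algorithms marginalize the same joint distribution, so the two values
# should agree up to floating-point rounding.
print(hmm.forward(obs), hmm.backward(obs))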
Viterbi Algorithm - Intuition
• Our goal is to determine the most probable state
sequence for a given sequence of observations (X1, X2, …,
Xn) given 𝜆
• This is a decoding process where we discover the hidden
state sequence looking at the observations
• Specifically, we need: argmaxY P(Y1:t | X1:t, 𝜆). This is equivalent to finding argmaxY P(X1:t, Y1:t | 𝜆), since P(X1:t) does not depend on Y
• In the forward algorithm approach, we computed the
probabilities along each path that led to the given state
and summed the probabilities to get the probability of
reaching that state regardless of the path taken.
• In Viterbi we are interested in only a specific path that
maximizes the probability of reaching the required state.
Each state along this path (the one that yields max
probability) forms the sequence of hidden states that we
are interested in.
[Trellis fragment: at times t and t+1 the hidden states Yt and Yt+1 each range over the values y1, y2, y3 and emit the corresponding observations Xt and Xt+1]
Viterbi Algorithm
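The Viterbi recursion parallels the forward recursion with the sum replaced by a max (stated here in the same notation, consistent with the implementation on the next slide):
δ1(Y1) = P(Y1) P(X1 | Y1)
δk(Yk) = max over yk−1 of δk−1(yk−1) P(Yk | Yk−1) P(Xk | Yk)
A back-pointer to the maximizing yk−1 is stored at every step, so the most likely state sequence can be read off from the best final state.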
Viterbi Implementation
def viterbi(self, obs):
    vit = [{}]
    path = {}
    # Base cases (t == 0)
    for y in self.states:
        vit[0][y] = self.pi[y] * self.B[y][obs[0]]
        path[y] = [y]
    # Recursion: for each state, keep only the best path that ends in it
    for t in range(1, len(obs)):
        vit.append({})
        newpath = {}
        for y in self.states:
            (prob, state) = max((vit[t-1][y0] * self.A[y0][y] * self.B[y][obs[t]], y0) for y0 in self.states)
            vit[t][y] = prob
            newpath[y] = path[state] + [y]
        path = newpath
    # Termination: pick the best final state and return its stored path
    (prob, state) = max((vit[len(obs) - 1][y], y) for y in self.states)
    return (prob, path[state])
Demo – You can’t hide from my algorithm!
• If we were to guess, would we have concluded the same sequence?
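A possible run, assuming the same HMM class and coins.json model as before; here the observation sequence from the earlier exercise is decoded under the JSON model:

hmm = HMM("coins.json")
obs = ["Heads"] * 4 + ["Tails"] + ["Heads"] + ["Tails"] * 4   # H H H H T H T T T T
prob, states = hmm.viterbi(obs)
print(prob, states)   # probability of the best path and the decoded coin sequence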
Forward Backward Algorithm
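One common use of the forward and backward quantities together is to compute the posterior probability of being in each state at each time step. A minimal sketch, assuming the forward and backward methods shown earlier, which store their tables in self.fwd and self.bwk:

def posterior(self, obs):
    # gamma[t][y] = P(Y_t = y | X_1:n) = fwd[t][y] * bwk[t][y] / P(X_1:n)
    prob = self.forward(obs)
    self.backward(obs)
    return [{y: self.fwd[t][y] * self.bwk[t][y] / prob for y in self.states}
            for t in range(len(obs))]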
Tagging Problems
• The purpose of tagging is to facilitate further analysis
• Information extraction – LinkedIn resume extraction example
• Tags are the metadata that allow the given text to be processed further
• E.g. POS tagging annotates each word in a text with a part-of-speech tag. This helps in finding noun/verb phrases and can be used for parsing
• E.g. NER
Applications of taggers: POS, NER
Examples of Tagging Problem
• Microsoft will seek to draw more people to its internet-based services
with two new mid-range smartphones it unveiled, including one
designed to help people take better selfies.
• [('microsoft', 'NN'), ('will', 'MD'), ('seek', 'VB'), ('to', 'TO'), ('draw', 'VB'),
('more', 'JJR'), ('people', 'NNS'), ('to', 'TO'), ('its', 'PRP$'), ('internet-
based', 'JJ'), ('services', 'NNS'), ('with', 'IN'), ('two', 'CD'), ('new', 'JJ'),
('mid-range', 'JJ'), ('smartphones', 'NNS'), ('it', 'PRP'), ('unveiled',
'VBD'), (',', ','), ('including', 'VBG'), ('one', 'CD'), ('designed', 'VBN'),
('to', 'TO'), ('help', 'VB'), ('people', 'NNS'), ('take', 'VB'), ('better', 'JJR'),
('selfies', 'NNS'), ('.', '.')]
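Tag sequences like the one above can be produced with an off-the-shelf tagger. For instance, a sketch using NLTK (assuming the nltk package and its tokenizer/tagger data are installed) may produce a similar tagging:

import nltk

sentence = ("Microsoft will seek to draw more people to its internet-based "
            "services with two new mid-range smartphones it unveiled, "
            "including one designed to help people take better selfies.")
tokens = nltk.word_tokenize(sentence.lower())
print(nltk.pos_tag(tokens))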
Modelling Tagging Problems using HMM
• We have an input sequence, that is a sequence of word tokens: x = x1, x2,…, xn
• We need to determine the tag sequence for this: y = y1, y2…yn
• We can model this using HMM where we treat the input tokens to be the
observable sequence and the tags to be the hidden states.
• Our goal is to determine the hidden state sequence: y = y1, y2…yn given the input
observable sequence x
• HMM will model the joint distribution P(X, Y), where X is the input sequence and Y represents the tag sequence
• The most likely tag sequence for x is: y* = arg maxy1…yn P(x1…xn, y1…yn), which can be computed with the Viterbi algorithm described earlier
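A toy end-to-end sketch of such a tagger, reusing the HMM class and viterbi method from the earlier slides; all numbers and tag/word choices here are illustrative assumptions, whereas in practice the transition, emission and initial distributions would be estimated from a tagged corpus:

# Tags are the hidden states, words are the observations.
model = {
    "pi": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
    "A":  {"DT": {"DT": 0.05, "NN": 0.90, "VB": 0.05},
           "NN": {"DT": 0.10, "NN": 0.30, "VB": 0.60},
           "VB": {"DT": 0.50, "NN": 0.30, "VB": 0.20}},
    "B":  {"DT": {"the": 0.90, "dog": 0.05, "barks": 0.05},
           "NN": {"the": 0.05, "dog": 0.85, "barks": 0.10},
           "VB": {"the": 0.05, "dog": 0.05, "barks": 0.90}},
}
hmm = HMM(model)
prob, tags = hmm.viterbi(["the", "dog", "barks"])
print(tags)   # expected: ['DT', 'NN', 'VB']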