3. Markov Model
• Given 3 weather states:
– {S1, S2, S3} = {rain, cloudy, sunny}
• State transition probabilities:
           Rain   Cloudy  Sunny
  Rain     0.4    0.3     0.3
  Cloudy   0.2    0.6     0.2
  Sunny    0.1    0.1     0.8
• What is the probability that the next 7 days
  will be {sun, sun, rain, rain, sun, cloud,
  sun}?
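The question above reduces to a product of one-step transition probabilities. A minimal sketch (the assumption that "today" is sunny is mine, not stated on the slide):

```python
# States of the weather Markov chain from the slide.
RAIN, CLOUDY, SUNNY = 0, 1, 2

# Transition matrix A: A[i][j] = P(tomorrow = j | today = i)
A = [
    [0.4, 0.3, 0.3],  # from rain
    [0.2, 0.6, 0.2],  # from cloudy
    [0.1, 0.1, 0.8],  # from sunny
]

def sequence_prob(today, days, A):
    """Probability of observing `days` given the current state `today`:
    a product of one-step transition probabilities."""
    p, prev = 1.0, today
    for s in days:
        p *= A[prev][s]
        prev = s
    return p

week = [SUNNY, SUNNY, RAIN, RAIN, SUNNY, CLOUDY, SUNNY]
print(sequence_prob(SUNNY, week, A))  # 0.8*0.8*0.1*0.4*0.3*0.1*0.2 = 1.536e-4
```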
4. Hidden Markov Model
• The states
  – cannot be observed directly: they are hidden!
  – but they can be observed indirectly
• Example
  – North Pole or Equator (model), Hot/Cold (state),
    1/2/3 ice creams (observation)
5. Hidden Markov Model
• The observation is a probabilistic function of
  the state, which is not directly observable
  [Figure: hidden states generating the observations]
6. HMM Elements
• N, the number of states in the model
• M, the number of distinct observation
symbols
• A, the state transition probability distribution
• B, the observation symbol probability
distribution in states
• π, the initial state distribution
• λ = (A, B, π): the complete model
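These elements can be written down concretely for the ice-cream example; the probability values below are illustrative placeholders, not numbers from the slides:

```python
# Hypothetical λ = (A, B, π) for the Hot/Cold ice-cream HMM.
states  = ["Hot", "Cold"]   # N = 2 hidden states
symbols = [1, 2, 3]         # M = 3 observation symbols (# of ice creams)

A  = [[0.7, 0.3],             # A[i][j] = P(next state j | current state i)
      [0.4, 0.6]]
B  = [[0.2, 0.4, 0.4],        # B[i][k] = P(observing symbols[k] | state i)
      [0.5, 0.4, 0.1]]
pi = [0.8, 0.2]               # initial state distribution

# Every row of A and B, and π itself, must be a probability distribution.
for row in A + B + [pi]:
    assert abs(sum(row) - 1.0) < 1e-9
```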
15. Forward Algorithm
• Initialization:
  α_1(i) = π_i b_i(O_1),   1 ≤ i ≤ N
• Induction:
  α_{t+1}(j) = [ ∑_{i=1}^{N} α_t(i) a_{ij} ] b_j(O_{t+1}),   1 ≤ t ≤ T−1,  1 ≤ j ≤ N
• Termination:
  P(O | λ) = ∑_{i=1}^{N} α_T(i)
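The three steps above can be sketched directly; the Hot/Cold model numbers here are illustrative, not from the slides:

```python
# Forward algorithm for P(O | λ), using a hypothetical two-state model.
A  = [[0.7, 0.3], [0.4, 0.6]]              # transitions (Hot, Cold)
B  = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]    # emissions for symbols 1/2/3
pi = [0.8, 0.2]

def forward(obs, A, B, pi):
    """Return P(O | λ) and the α trellis; obs holds symbol indices."""
    N = len(pi)
    # Initialization: α_1(i) = π_i b_i(O_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    # Induction: α_{t+1}(j) = [Σ_i α_t(i) a_ij] b_j(O_{t+1})
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    # Termination: P(O | λ) = Σ_i α_T(i)
    return sum(alpha[-1]), alpha

p, _ = forward([2, 0, 2], A, B, pi)        # observing 3, 1, 3 ice creams
print(p)
```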
16. Backward Algorithm
• Forward Algorithm
  α_t(i) = P(O_1, O_2, ..., O_t, q_t = S_i | λ)
• Backward Algorithm
  – given state S_i at time t, the probability of the
    backward partial observation sequence O_{t+1}, O_{t+2}, ..., O_T
  β_t(i) = P(O_{t+1}, O_{t+2}, ..., O_T | q_t = S_i, λ)
17. Backward Algorithm
• Initialization
  β_T(i) = 1,   1 ≤ i ≤ N
• Induction
  β_t(i) = ∑_{j=1}^{N} a_{ij} b_j(O_{t+1}) β_{t+1}(j),   t = T−1, T−2, ..., 1,  1 ≤ i ≤ N
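A sketch of the backward pass, again with illustrative model numbers; as a sanity check, P(O | λ) can also be read off the first β column and must match the forward result:

```python
# Backward algorithm, using the same hypothetical two-state model.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
pi = [0.8, 0.2]

def backward(obs, A, B):
    """Return the β trellis: β_t(i) = P(O_{t+1..T} | q_t = S_i, λ)."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N]                     # initialization: β_T(i) = 1
    for t in range(T - 2, -1, -1):         # induction, t = T-1, ..., 1
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j]
                            for j in range(N)) for i in range(N)])
    return beta

obs = [2, 0, 2]
beta = backward(obs, A, B)
# P(O | λ) = Σ_i π_i b_i(O_1) β_1(i), same value the forward pass gives.
p = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
print(p)
```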
18. Backward Algorithm
[Figure: three-state trellis over times t = 1, 2, 3; each state S1, S2, S3 can
emit observation R1 or R2; the case O_T = R1 is highlighted]

  β_{T−1}(1) = ∑_{j=1}^{N} a_{1j} b_j(O_T) β_T(j)
             = a_{11} b_1(O_T) + a_{12} b_2(O_T) + a_{13} b_3(O_T)

  (using β_T(j) = 1)
20. Solution 2
• e.g., choose the states q_t that are individually
  most likely
  – γ_t(i): the probability of being in state S_i at
    time t, given the observation sequence O,
    and the model λ

  γ_t(i) = P(q_t = S_i | O, λ) = α_t(i) β_t(i) / P(O | λ)
         = α_t(i) β_t(i) / ∑_{i=1}^{N} α_t(i) β_t(i)

  q_t = argmax_{1≤i≤N} γ_t(i),   1 ≤ t ≤ T
21. Viterbi algorithm
• The most widely used criterion is to find
  the “single best state sequence”:
  maximize P(Q | O, λ), which is equivalent to
  maximizing P(Q, O | λ)
• A formal technique exists, based on
  dynamic programming methods, and is
  called the Viterbi algorithm
22. Viterbi algorithm
• To find the single best state sequence, Q =
{q1, q2, …, qT}, for the given observation
sequence O = {O1, O2, …, OT}
• δ_t(i): the best score (highest prob.) along a
  single path, at time t, which accounts for the
  first t observations and ends in state S_i

  δ_t(i) = max_{q_1, q_2, ..., q_{t−1}} P(q_1 q_2 ... q_t = S_i, O_1 O_2 ... O_t | λ)
23. Viterbi algorithm
• Initialization – δ_1(i)
  – When t = 1 there is no earlier path into a
    state
  – so we use the probability of being in each
    state at t = 1 and emitting the first
    observation O_1
  δ_1(i) = π_i b_i(O_1),   1 ≤ i ≤ N
  ψ_1(i) = 0
24. Viterbi algorithm
• Calculating δ_t(i) when t > 1
  – δ_t(X): the probability of the most probable
    path to state X at time t
  – this path to X must pass through one
    of the states A, B or C at time t−1
  Most probable path to X via A:  δ_{t−1}(A) a_{AX} b_X(O_t)
25. Viterbi algorithm
• Recursion
  δ_t(j) = max_{1≤i≤N} [ δ_{t−1}(i) a_{ij} ] b_j(O_t),   2 ≤ t ≤ T,  1 ≤ j ≤ N
  ψ_t(j) = argmax_{1≤i≤N} [ δ_{t−1}(i) a_{ij} ],   2 ≤ t ≤ T,  1 ≤ j ≤ N
• Termination
  P* = max_{1≤i≤N} δ_T(i)
  q_T* = argmax_{1≤i≤N} δ_T(i)
26. Viterbi algorithm
• Path (state sequence) backtracking
  q_t* = ψ_{t+1}(q_{t+1}*),   t = T−1, T−2, ..., 1
  q_{T−1}* = ψ_T(q_T*) = argmax_{1≤i≤N} δ_{T−1}(i) a_{i q_T*}
  ...
  q_1* = ψ_2(q_2*)
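The recursion, termination and backtracking steps fit in one short sketch; the Hot/Cold model numbers are illustrative, not from the slides:

```python
# Viterbi algorithm: single best state sequence for an observation sequence.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
pi = [0.8, 0.2]

def viterbi(obs, A, B, pi):
    """Return (P*, best state sequence) for the symbol indices in obs."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]   # δ_1(i)
    psi = []                                           # backpointers, t = 2..T
    for o in obs[1:]:
        new_delta, back = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(best)                          # ψ_t(j)
            new_delta.append(delta[best] * A[best][j] * B[j][o])
        delta = new_delta
        psi.append(back)
    # Termination: q_T* = argmax_i δ_T(i); then backtrack through ψ.
    q = [max(range(N), key=lambda i: delta[i])]
    for back in reversed(psi):
        q.insert(0, back[q[0]])                        # q_t* = ψ_{t+1}(q_{t+1}*)
    return max(delta), q

p_star, q_star = viterbi([2, 0, 2], A, B, pi)
print(p_star, q_star)
```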
27. Solution 3
• Which model λ = (A, B, π) is most likely to
  have produced the observed sequence?
  i.e., which model maximizes P(O | λ)?
• There is no known analytic solution. We
  can choose λ = (A, B, π) such that P(O | λ)
  is locally maximized using an iterative
  procedure
28. Baum-Welch Method
• Define ξt(i, j) = P(qt=Si , qt+1=Sj|O, λ)
– The probability of being in state Si at time t,
and state Sj at time t+1
  ξ_t(i, j) = α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)
            = α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j)
              / ∑_{i=1}^{N} ∑_{j=1}^{N} α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j)
29. Baum-Welch Method
• γt(i) : the probability of being in state Si at time
t, given the observation sequence O, and the
model λ
  γ_t(i) = α_t(i) β_t(i) / P(O | λ)
         = α_t(i) β_t(i) / ∑_{i=1}^{N} α_t(i) β_t(i)

• Relate γ_t(i) to ξ_t(i, j):
  γ_t(i) = ∑_{j=1}^{N} ξ_t(i, j)
30. Baum-Welch Method
• The expected number of transitions out of state S_i:
  ∑_{t=1}^{T−1} γ_t(i) = expected number of transitions from S_i
• Similarly, the expected number of transitions
  from state S_i to state S_j:
  ∑_{t=1}^{T−1} ξ_t(i, j) = expected number of transitions from S_i to S_j
31. Baum-Welch Method
• Re-estimation formulas for π, A and B

  π̄_i = γ_1(i)

  ā_{ij} = ∑_{t=1}^{T−1} ξ_t(i, j) / ∑_{t=1}^{T−1} γ_t(i)
         = expected number of transitions from state S_i to S_j
           / expected number of transitions from state S_i

  b̄_j(k) = ∑_{t=1, s.t. O_t = v_k}^{T} γ_t(j) / ∑_{t=1}^{T} γ_t(j)
          = expected number of times in state S_j observing symbol v_k
            / expected number of times in state S_j
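One re-estimation pass puts all three formulas together with the forward and backward trellises; the model numbers below are illustrative placeholders:

```python
# One Baum-Welch re-estimation step for a hypothetical two-state model.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
pi = [0.8, 0.2]

def baum_welch_step(obs, A, B, pi):
    """Return the re-estimated (A, B, π) after one EM pass over obs."""
    N, M, T = len(pi), len(B[0]), len(obs)
    # Forward and backward trellises
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    beta = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j]
                            for j in range(N)) for i in range(N)])
    p_obs = sum(alpha[-1])
    # γ_t(i) and ξ_t(i, j)
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # Re-estimation formulas for π, A and B
    pi_new = gamma[0][:]
    A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    B_new = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return A_new, B_new, pi_new
```

Each re-estimated row is still a probability distribution, so the updated λ̄ can be fed straight back into the next iteration.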
32. Baum-Welch Method
• P(O|λ̄) > P(O|λ), where λ̄ is the re-estimated
  model (unless λ is already a critical point)
• By iteratively using λ̄ in place of λ and
  repeating the re-estimation, we can improve
  P(O|λ) until some limiting point is reached