A Revealing Introduction to
  Hidden Markov Models

         Mark Stamp




             HMM              1
Hidden Markov Models
 What    is a hidden Markov model (HMM)?
 A  machine learning technique and…
  A discrete hill climb technique

  Two for the price of one!

 Where    are HMMs used?
  Speech  recognition, malware
   detection, IDS, etc., etc., etc.
 Why   is it useful?
  Easy   to apply and efficient algorithms

                        HMM                   2
Markov Chain
 Markov    chain:
   “memoryless  random process”
   Transitions depend only on current state
    (Markov chain of order 1) and transition
    probability matrix
 Example?
   See   next slide…



                        HMM                    3
Markov Chain
 Suppose we’re interested in
  average annual temperature
    Only consider Hot and Cold
 From recorded history, we
  obtain probabilities in
  diagram to the right
    [State diagram: Hot and Cold states, with
     H→H = 0.7, H→C = 0.3, C→H = 0.4, C→C = 0.6]

                        HMM                     4
Markov Chain
 Transition probability
  matrix

            H    C
       H  0.7  0.3
   A = C  0.4  0.6

 Matrix is denoted as A
 Note, A is “row stochastic”
    Each row sums to 1

                       HMM                     5
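
Below is a minimal Python sketch (my addition, not from the slides) of this two-state chain, using the transition probabilities in the diagram: it checks that A is row stochastic and simulates a short run of Hot/Cold years.

import random

states = ["H", "C"]
A = {"H": {"H": 0.7, "C": 0.3},   # row for Hot
     "C": {"H": 0.4, "C": 0.6}}   # row for Cold

# Row stochastic: every row of A sums to 1
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in A.values())

def simulate(start, steps):
    """Walk the Markov chain for the given number of steps."""
    x, path = start, [start]
    for _ in range(steps):
        r, acc = random.random(), 0.0
        for s in states:
            acc += A[x][s]
            if r < acc:
                x = s
                break
        path.append(x)
    return path

print(simulate("H", 10))   # e.g. ['H', 'H', 'C', 'C', ...]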
Markov Chain
 Can also include
  begin, end states
 Begin state
  matrix is π
    In this example, π = (0.6, 0.4),
     i.e., begin → H with prob. 0.6
     and begin → C with prob. 0.4
 Note that π is also
  row stochastic
    [State diagram: begin → H (0.6), begin → C (0.4),
     H/C transitions as before, and an end state]

                           HMM                               6
Hidden Markov Model
 HMM     includes a Markov chain
   But   the Markov process is “hidden”
 Cannot   observe the Markov process
   Instead,   we observe something related (by
    probabilities) to hidden states
   It’s as if there is a “curtain” between
    Markov chain and observations
 Example    on next few slides…

                        HMM                       7
HMM Example
 Consider H/C temperature example
 Suppose we want to know H or C annual
  temperature in distant past
   Before thermometers (or humans) invented
   We just want to decide between H and C

 We assume transition between Hot and
 Cold years is same as today
   So,   the A matrix is known

                        HMM                    8
HMM Example
 Temp in past determined by Markov process
 But, we cannot observe temperature in past
 We find that tree ring size is related to
  temperature
       Look at historical data to see the connection
   We consider 3 tree ring sizes
       Small, Medium, Large (S, M, L, respectively)
   Measure tree ring sizes and recorded
    temperatures to determine relationship

                            HMM                         9
HMM Example
 We find that tree ring sizes and
  temperature are related by


 This   is known as the B matrix:


 Note   that B is also row stochastic

                      HMM                10
HMM Example
 Can we now find H/C temps in past?
 We cannot measure (observe) temps
 But we can measure tree ring sizes…
 …and tree ring sizes related to temps
   By   the B matrix
 We ought to be able to say something
 about average annual temperature

                        HMM               11
HMM Notation
 A lot of notation is required
    Notation may be the most difficult part
      T = length of observation sequence
      N = number of states in the model
      M = number of observation symbols
      Q = {q0,q1,…,qN-1} = states of the Markov process
      V = {0,1,…,M-1} = set of possible observations
      A = state transition probabilities
      B = observation probability matrix
      π = initial state distribution
      O = (O0,O1,…,OT-1) = observation sequence

                       HMM                     12
HMM Notation
 To simplify notation, observations are
  taken from the set {0,1,…,M-1}
 That is, each Ot is an element of V = {0,1,…,M-1}
 The matrix A = {aij} is N x N, where
    aij = P(state qj at t+1 | state qi at t)
 The matrix B = {bj(k)} is N x M, where
    bj(k) = P(observation k at t | state qj at t)

                      HMM                    13
HMM Example
 Consider our temperature example…
 What are the observations?
       V = {0,1,2}, which corresponds to S,M,L
   What are states of Markov process?
       Q = {H,C}
   What are A,B,π, and T?
      A, B, π on previous slides
     T is number of tree rings measured

   What are N and M?
       N = 2 and M = 3

                            HMM                   14
Generic HMM
 Generic view of HMM

    X0 → X1 → X2 → … → XT-1      (hidden states, driven by A)
     |    |    |           |
    O0   O1   O2    …    OT-1     (observations, related to states by B)

 HMM defined by A, B, and π
 We denote HMM “model” as λ = (A,B,π)
                    HMM                  15
HMM Example
   Suppose that we observe tree ring sizes
     For 4 year period of interest: S,M,S,L
      Then O = (0, 1, 0, 2)
   Most likely (hidden) state sequence?
       We want most likely X = (x0, x1, x2, x3)
 Let πx0 be prob. of starting in state x0
 Note bx0(O0) is prob. of initial observation
       And ax0,x1 is prob. of transition x0 to x1
   And so on…

                              HMM                    16
HMM Example
 Bottom  line?
 We can compute P(X) for any X
 For X = (x0, x1, x2, x3) we have
   P(X) = πx0 bx0(O0) ax0,x1 bx1(O1) ax1,x2 bx2(O2) ax2,x3 bx3(O3)

 Suppose   we observe (0,1,0,2), then what
  is probability of, say, HHCC?
 Plug into formula above to find


                    HMM                   17
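
A small Python sketch of this computation follows. A and π are taken from the earlier diagram; the B values below are illustrative placeholders only (the actual B matrix appears as an image in the slides), so the printed number is not the value computed on the slide.

pi = {"H": 0.6, "C": 0.4}
A  = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B  = {"H": [0.2, 0.3, 0.5],   # P(S), P(M), P(L) given Hot  -- placeholder values
      "C": [0.6, 0.3, 0.1]}   # P(S), P(M), P(L) given Cold -- placeholder values

def prob_of(X, O):
    """pi_x0 * b_x0(O0) * product over t of a_{x(t-1),x(t)} * b_x(t)(Ot)."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(X)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    return p

print(prob_of("HHCC", [0, 1, 0, 2]))   # score of hidden sequence HHCC given S,M,S,L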
HMM Example
 Do same for all
  4-state
  sequences
 We find…
 The winner is?
   CCCH

 Not so fast my
 friend…

                    HMM   18
HMM Example
 The path CCCH scores the highest
 In dynamic programming (DP), we find
  highest scoring path
 But, HMM maximizes expected number
 of correct states
   Sometimes  called “EM algorithm”
   For “Expectation Maximization”

 How   does HMM work in this example?
                     HMM                 19
HMM Example
 For   first position…
   Sum probabilities for all paths that have H
    in 1st position, compare to sum of probs for
   paths with C in 1st position --- biggest wins
 Repeat   for each position and we find




                       HMM                         20
HMM Example



 So, HMM solution gives us CHCH
 While DP solution is CCCH
 Which solution is better?
 Neither!!!
       They use different definitions of “best”

                            HMM                    21
HMM Paradox?
 HMM  maximizes expected number of
 correct states
   Whereas    DP chooses “best” overall path
 Possiblefor HMM to choose a “path”
 that is impossible
   Could   be a transition probability of 0
 Cannot  get impossible path with DP
 Is this a flaw with HMM?
   No,   it’s a feature…
                            HMM                 22
HMM Model
 An  HMM is defined by the three
  matrices, A, B, and π
 Note that M and N are implied, since
  they are the dimensions of the matrices
 So, we denote HMM “model” as
λ = (A,B,π)



                   HMM                  23
The Three Problems
 HMMs used to solve 3 problems
 Problem 1: Given a model λ = (A,B,π) and
  observation sequence O, find P(O|λ)
       That is, we can score an observation sequence
        to see how well it fits a given model
   Problem 2: Given λ = (A,B,π) and O, find an
    optimal state sequence
       Uncover hidden part (like previous example)
   Problem 3: Given O, N, and M, find the
    model λ that maximizes probability of O
       That is, train a model to fit observations

                            HMM                         24
HMMs in Practice
 Typically, HMMs are used as follows:
 Given an observation sequence…
   Assume    a (hidden) Markov process exists
 Train   a model based on observations
   Problem    3 (find N by trial and error)
 Then, given a sequence of
 observations, score it versus the model
   Problem  1: high score implies it’s similar to
    training data, low score implies it’s not

                         HMM                         25
HMMs in Practice
 Previous slide gives sense in which HMM
 is a “machine learning” technique
   To train model, we do not need to specify
    anything except the parameter N
   And “best” N found by trial and error

 That   is, we don’t have to think too much
   Just train HMM and then use it
   Best of all, efficient algorithms for HMMs



                      HMM                        26
The Three Solutions
   We give detailed solutions to the three
    problems
       Note: We must provide efficient solutions
   Recall the three problems:
     Problem 1: Score an observation sequence
      versus a given model
     Problem 2: Given a model, “uncover” hidden part

     Problem 3: Given an observation sequence, train
      a model

                           HMM                      27
Solution 1
   Score observations versus a given model
       Given model λ = (A,B,π) and observation
        sequence O=(O0,O1,…,OT-1), find P(O|λ)
 Denote hidden states as
  X = (x0, x1, . . . , xT-1)
 Then from definition of B,
P(O|X,λ)=bx0(O0) bx1(O1) … bxT-1(OT-1)
 And from definition of A and π,
P(X|λ)=πx0 ax0,x1 ax1,x2 … axT-2,xT-1

                           HMM                    28
Solution 1
 Elementary conditional probability fact:
P(O,X|λ) = P(O|X,λ) P(X|λ)
 Sum over all possible state sequences X,
     P(O|λ) = Σ P(O,X|λ) = Σ P(O|X,λ) P(X|λ)
            = Σ πx0 bx0(O0) ax0,x1 bx1(O1) … axT-2,xT-1 bxT-1(OT-1)
 This “works” but is way too costly
 Requires about 2T·N^T multiplications
       Why?
   There better be a better way…

                             HMM                       29
Forward Algorithm
   Instead of brute force: forward algorithm
       Or “alpha pass”
 For t = 0,1,…,T-1 and i=0,1,…,N-1, let
αt(i) = P(O0,O1,…,Ot,xt=qi|λ)
 Probability of partial observation sequence
  up to time t, and Markov process in state qi at step t
       What the?
   Can be computed recursively, efficiently

                          HMM                  30
Forward Algorithm
 Let α0(i) = πibi(O0) for i = 0,1,…,N-1
 For t = 1,2,…,T-1 and i=0,1,…,N-1, let

αt(i) =   (Σαt-1(j)aji)bi(Ot)
       Where the sum is from j = 0 to N-1
 From definition of αt(i) we see
P(O|λ) = ΣαT-1(i)
       Where the sum is from i = 0 to N-1
    Note this requires only N^2·T multiplications
                                HMM               31
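
A direct Python sketch of this alpha pass (my own rendering, not the pseudocode shown later in the deck): A and B are row-stochastic lists of lists, pi is a list, and O is a list of observation indices in {0,…,M-1}.

def forward(A, B, pi, O):
    N, T = len(A), len(O)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                  # alpha_0(i) = pi_i * b_i(O_0)
        alpha[0][i] = pi[i] * B[i][O[0]]
    for t in range(1, T):               # alpha_t(i) = (sum_j alpha_{t-1}(j) a_ji) * b_i(O_t)
        for i in range(N):
            alpha[t][i] = sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
    return alpha

# P(O|lambda) is the sum of the last row: sum(forward(A, B, pi, O)[-1])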
Solution 2
   Given a model, find “most likely” hidden
    states: Given λ = (A,B,π) and O, find an
    optimal state sequence
     Recall that optimal means “maximize expected
      number of correct states”
     In contrast, DP finds best scoring path

   For temp/tree ring example, solved this
       But hopelessly inefficient approach
   A better way: backward algorithm
       Or “beta pass”

                           HMM                       32
Backward Algorithm
 For t = 0,1,…,T-1 and i=0,1,…,N-1, let
βt(i) = P(Ot+1,Ot+2,…,OT-1|xt=qi,λ)
 Probability of partial observation sequence
  from t+1 to the end, given Markov process in state qi at step t
 Analogous to the forward algorithm
 As with forward algorithm, this can be
  computed recursively and efficiently


                      HMM                    33
Backward Algorithm
 Let βT-1(i) = 1 for i = 0,1,…,N-1
 For t = T-2,T-3,…,0 and i=0,1,…,N-1, let
βt(i) = Σaijbj(Ot+1)βt+1(j)
   Where   the sum is from j = 0 to N-1




                       HMM                    34
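
The matching beta pass as a sketch in the same style (note the recursion runs all the way down to t = 0):

def backward(A, B, O):
    N, T = len(A), len(O)
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):                  # beta_{T-1}(i) = 1
        beta[T-1][i] = 1.0
    for t in range(T - 2, -1, -1):      # beta_t(i) = sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
    return beta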
Solution 2
  For t = 0,1,…,T-1 and i=0,1,…,N-1 define
γt(i) = P(xt=qi|O,λ)
       Most likely state at t is qi that maximizes γt(i)
   Note that γt(i) = αt(i)βt(i)/P(O|λ)
       And recall P(O|λ) = ΣαT-1(i)
   The bottom line?
     Forward algorithm solves Problem 1
     Forward/backward algorithms solve Problem 2



                             HMM                            35
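
Putting the two passes together for Problem 2, a sketch that assumes alpha and beta are the full T x N tables produced by the forward and backward sketches above:

def most_likely_states(alpha, beta):
    T, N = len(alpha), len(alpha[0])
    prob_O = sum(alpha[T-1])                                # P(O | lambda)
    path = []
    for t in range(T):
        gamma = [alpha[t][i] * beta[t][i] / prob_O for i in range(N)]
        path.append(max(range(N), key=lambda i: gamma[i]))  # argmax_i gamma_t(i)
    return path                                             # one state index per time step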
Solution 3
 Train a model: Given O, N, and M, find λ
  that maximizes probability of O
 Here, we iteratively adjust λ = (A,B,π)
  to better fit the given observations O
    The sizes of the matrices are fixed (N and M)
   But elements of matrices can change

 It   is amazing that this works!
   And   even more amazing that it’s efficient

                        HMM                       36
Solution 3
 For t=0,1,…,T-2 and i,j in {0,1,…,N-1},
  define “di-gammas” as
γt(i,j) = P(xt=qi, xt+1=qj|O,λ)
 Note γt(i,j) is prob of being in state qi at
  time t and transiting to state qj at t+1
 Then γt(i,j) = αt(i)aijbj(Ot+1)βt+1(j)/P(O|λ)
 And γt(i) = Σγt(i,j)
   Where   sum is from j = 0 to N – 1

                       HMM                        37
Model Re-estimation
  Given di-gammas and gammas…
 For i = 0,1,…,N-1 let πi = γ0(i)
 For i = 0,1,…,N-1 and j = 0,1,…,N-1
aij = Σγt(i,j)/Σγt(i)
       Where both sums are from t = 0 to T-2
  For j = 0,1,…,N-1 and k = 0,1,…,M-1
 bj(k) = Σ{t: Ot = k} γt(j) / Σt γt(j)
        Both sums are from t = 0 to T-2, but only those t
         for which Ot = k are counted in the numerator
   Why does this work?

                            HMM                       38
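
A sketch of one re-estimation pass built from these formulas, again with unscaled alpha/beta tables (so it will underflow on long sequences; scaling comes later in the deck). The gamma at the final step, which the slide does not spell out, is taken as alpha_{T-1}(i)/P(O|lambda).

def reestimate(A, B, pi, O, alpha, beta):
    N, M, T = len(A), len(B[0]), len(O)
    prob_O = sum(alpha[T-1])
    gamma = [[0.0] * N for _ in range(T)]
    digam = [[[0.0] * N for _ in range(N)] for _ in range(T - 1)]
    for t in range(T - 1):
        for i in range(N):
            for j in range(N):
                digam[t][i][j] = (alpha[t][i] * A[i][j] * B[j][O[t+1]]
                                  * beta[t+1][j] / prob_O)
            gamma[t][i] = sum(digam[t][i])
    for i in range(N):                  # gamma at t = T-1 (not shown on the slide)
        gamma[T-1][i] = alpha[T-1][i] / prob_O
    new_pi = [gamma[0][i] for i in range(N)]
    new_A = [[sum(digam[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T - 1) if O[t] == k) /
              sum(gamma[t][j] for t in range(T - 1)) for k in range(M)]
             for j in range(N)]
    return new_A, new_B, new_pi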
Solution 3
 To   summarize…
1.   Initialize λ = (A,B,π)
2.   Compute αt(i), βt(i), γt(i,j), γt(i)
3.   Re-estimate the model λ = (A,B,π)
4.   If P(O|λ) increases, goto 2




                       HMM                  39
Solution 3
 Some fine points…
 Model initialization
     If we have a good guess for λ = (A,B,π) then we
      can use it for initialization
      If not, let πi ≈ 1/N, ai,j ≈ 1/N, bj(k) ≈ 1/M

     Subject to row stochastic conditions

     But, do not initialize to uniform values

   Stopping conditions
     Stop after some number of iterations and/or…
     Stop if increase in P(O|λ) is “small”

                          HMM                       40
HMM as Discrete Hill Climb
 Algorithm on previous slides shows that
  HMM is a “discrete hill climb”
 HMM consists of discrete parameters
   Specifically,   the elements of the matrices
 And the re-estimation process improves
  the model by modifying parameters
   So, process “climbs” toward improved model
   This happens in a high-dimensional space


                         HMM                       41
Dynamic Programming
 Brief detour…
 For λ = (A,B,π) as above, it’s easy to
  define a dynamic program (DP)
 Executive summary:
   DP is forward algorithm, with “sum”
    replaced by “max”
 Precise   details on next few slides


                      HMM                  42
Dynamic Programming
 Let δ0(i) = πi bi(O0) for i=0,1,…,N-1
 For t=1,2,…,T-1 and i=0,1,…,N-1 compute
δt(i) = max (δt-1(j)aji)bi(Ot)
       Where the max is over j in {0,1,…,N-1}
 Note that at each t, the DP computes best
  path for each state, up to that point
 So, probability of best path is max δT-1(j)
 This max gives the best probability
       Not the best path, for that, see next slide


                            HMM                       43
Dynamic Programming
   To determine optimal path
     While computing deltas, keep track of pointers
      to previous state
     When finished, construct optimal path by
       tracing back pointers
 For example, consider the temperature example: recall
  that we observe (0,1,0,2)
 Probabilities for path of length 1:


   These are the only “paths” of length 1
                         HMM                       44
Dynamic Programming
   Probabilities for each path of length 2




 Best path of length 2 ending with H is CH
 Best path of length 2 ending with C is CC



                       HMM                    45
Dynamic Program
 Continuing, we compute best path ending
  at H and C at each step
 And save pointers --- why?




                   HMM                 46
Dynamic Program




 Best   final score is .002822
   And,   thanks to pointers, best path is CCCH
 But   what about underflow?
  A    serious problem in bigger cases

                        HMM                    47
Underflow Resistant DP
 Common    trick to prevent underflow
    Instead of multiplying probabilities…
   …we add logarithms of probabilities

 Why    does this work?
   Because log(xy) = log x + log y
   Adding logs does not tend to 0

 Note   that we must avoid 0 probabilities


                      HMM                     48
Underflow Resistant DP
 Underflow resistant DP algorithm:
 Let δ0(i) = log(πi bi(O0)) for i=0,1,…,N-1
 For t=1,2,…,T-1 and i=0,1,…,N-1 compute

δt(i) = max (δt-1(j) + log(aji) + log(bi(Ot)))
       Where the max is over j in {0,1,…,N-1}
 And score of best path is max δT-1(j)
 As before, must also keep track of paths



                            HMM                  49
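
A sketch of this log-space DP with backpointers, assuming no zero entries in A, B, or pi (as the previous slide notes, zero probabilities must be avoided before taking logs):

import math

def best_path(A, B, pi, O):
    N, T = len(A), len(O)
    delta = [[0.0] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    for i in range(N):
        delta[0][i] = math.log(pi[i]) + math.log(B[i][O[0]])
    for t in range(1, T):
        for i in range(N):
            scores = [delta[t-1][j] + math.log(A[j][i]) + math.log(B[i][O[t]])
                      for j in range(N)]
            back[t][i] = max(range(N), key=lambda j: scores[j])
            delta[t][i] = scores[back[t][i]]
    path = [max(range(N), key=lambda i: delta[T-1][i])]   # best final state
    for t in range(T - 1, 0, -1):                         # trace the saved pointers
        path.append(back[t][path[-1]])
    return list(reversed(path))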
HMM Scaling
 Trickier to prevent underflow in HMM
 We consider solution 3
   Since   it includes solutions 1 and 2
 Recall   for t = 1,2,…,T-1, i=0,1,…,N-1,
αt(i) = (Σαt-1(j)aj,i)bi(Ot)
 The idea is to normalize alphas so that
  they sum to 1
   Algorithm   on next slide

                         HMM                 50
HMM Scaling

 Given   αt(i) = (Σαt-1(j)aj,i)bi(Ot)
 Let a0(i) = α0(i) for i=0,1,…,N-1
 Let c0 = 1/Σa0(j)
 For i = 0,1,…,N-1, let a0(i) = c0a0(i)
 This takes care of t = 0 case
 Algorithm continued on next slide…



                          HMM              51
HMM Scaling
 For  t = 1,2,…,T-1 do the following:
   For i = 0,1,…,N-1,
at(i) =   (Σat-1(j)aj,i)bi(Ot)
 Let ct = 1/Σat(j)
 For i = 0,1,…,N-1 let at(i) = ctat(i)




                           HMM            52
HMM Scaling
 Easy   to show at(i) = c0c1…ct αt(i)   (♯)
   Simple   proof by induction
 So,  c0c1…ct is scaling factor at step t
 Also, easy to show that
at(i) = αt(i)/Σαt(j)
 Which implies ΣaT-1(i) = 1           (♯♯)



                        HMM                    53
HMM Scaling
 By combining (♯) and (♯♯), we have
1 = ΣaT-1(i) = c0c1…cT-1 ΣαT-1(i)
                = c0c1…cT-1 P(O|λ)
 Therefore, P(O|λ) = 1 / c0c1…cT-1
 To avoid underflow, we compute
log P(O|λ) = -Σlog(cj)
   Where   sum is from j = 0 to T-1


                       HMM             54
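
A sketch of the scaled alpha pass described on the last few slides; it returns the scaled alphas, the scale factors c_t, and log P(O|lambda) = -sum of log(c_t).

import math

def forward_scaled(A, B, pi, O):
    N, T = len(A), len(O)
    alpha = [[0.0] * N for _ in range(T)]
    c = [0.0] * T
    for i in range(N):                  # t = 0 case
        alpha[0][i] = pi[i] * B[i][O[0]]
    c[0] = 1.0 / sum(alpha[0])
    alpha[0] = [c[0] * a for a in alpha[0]]
    for t in range(1, T):
        for i in range(N):
            alpha[t][i] = sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
        c[t] = 1.0 / sum(alpha[t])      # normalize so the scaled alphas sum to 1
        alpha[t] = [c[t] * a for a in alpha[t]]
    log_prob = -sum(math.log(ct) for ct in c)   # log P(O | lambda)
    return alpha, c, log_prob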
HMM Scaling
 Similarly, scale betas as ctβt(i)
 For re-estimation,
        Compute γt(i,j) and γt(i) using original formulas, but
        with scaled alphas and betas
 This gives us new values for λ = (A,B,π)
 “Easy exercise” to show re-estimate is
  exact when scaled alphas and betas used
 Also, P(O|λ) cancels from formula
       Use log P(O|λ) = -Σlog(cj) to decide if iterate
        improves
                             HMM                          55
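
And a matching sketch of the scaled beta pass, reusing the same scale factors c_t computed during the alpha pass. With these scaled tables, the earlier reestimate sketch can be used unchanged, since the scale factors cancel (the scaled alphas at T-1 sum to 1).

def backward_scaled(A, B, O, c):
    N, T = len(A), len(O)
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):
        beta[T-1][i] = c[T-1] * 1.0     # beta_{T-1}(i) = 1, scaled
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = c[t] * sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j]
                                    for j in range(N))
    return beta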
All Together Now
 Complete pseudo code for Solution 3
 Given: (O0,O1,…,OT-1) and N and M
 Initialize: λ = (A,B,π)
     A is NxN, B is NxM and π is 1xN
      πi ≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M, each matrix row
      stochastic, but not uniform
   Initialize:
     maxIters = max number of re-estimation steps
     iters = 0
     oldLogProb = -∞



                           HMM                           56
Forward Algorithm
 Forward    algorithm
   With   scaling




                         HMM   57
Backward Algorithm
 Backward algorithm
 or “beta pass”
   With   scaling
 Note: same scaling
 factor as alphas




                       HMM      58
Gammas
 Using scaled
  alphas and betas
 So formulas
  unchanged




                     HMM   59
Re-Estimation
 Again, using
  scaled gammas
 So formulas
  unchanged




                  HMM   60
Stopping Criteria
 Check that
 probability
 increases
    In practice, want
     logProb > oldLogProb + ε
 And don’t
 exceed max
 iterations

                        HMM   61
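
The overall loop on the last few slides can then be sketched as below, assuming helper functions shaped like the earlier sketches (forward_scaled, backward_scaled, reestimate):

def train(A, B, pi, O, max_iters=100, eps=1e-4):
    old_log_prob = float("-inf")
    for _ in range(max_iters):
        alpha, c, log_prob = forward_scaled(A, B, pi, O)
        beta = backward_scaled(A, B, O, c)
        A, B, pi = reestimate(A, B, pi, O, alpha, beta)
        if log_prob <= old_log_prob + eps:   # stop when the improvement is "small"
            break
        old_log_prob = log_prob
    return A, B, pi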
English Text Example
 Suppose   Martian arrives on earth
   Sees written English text
   Wants to learn something about it

   Martians know about HMMs

 So, strip out all non-letters, make all
 letters lower-case
   27 symbols (letters, plus word-space)
   Train HMM on long sequence of symbols


                     HMM                    62
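
A sketch of that preprocessing, plus a near-uniform (but deliberately not exactly uniform) initialization as recommended earlier; the small perturbation scheme here is just one arbitrary choice.

import random
import string

LETTERS = string.ascii_lowercase + " "      # 27 symbols: a..z plus word-space

def to_observations(text):
    """Strip non-letters, lower-case, and map each symbol to an index 0..26."""
    cleaned = "".join(ch for ch in text.lower() if ch in LETTERS)
    return [LETTERS.index(ch) for ch in cleaned]

def near_uniform_row(n):
    """Approximately uniform, slightly perturbed, still summing to 1."""
    row = [1.0 + random.uniform(-0.02, 0.02) for _ in range(n)]
    s = sum(row)
    return [x / s for x in row]

N, M = 2, 27
A = [near_uniform_row(N) for _ in range(N)]
B = [near_uniform_row(M) for _ in range(N)]
pi = near_uniform_row(N)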
English Text
 For   first training case, initialize:
  N  = 2 and M = 27
    Elements of A and π are about ½ each

   Elements of B are each about 1/27

 We use 50,000 symbols for training
 After 1st iteration: log P(O|λ) ≈ -165097
 After 100th iteration: log P(O|λ) ≈ -137305



                        HMM                63
English Text
 Matrices A and π converge:


 What does this tell us?
   Started in hidden state 1 (not state 0)
   And we know transition probabilities
    between hidden states
 Nothing   too interesting here
   We   don’t care about hidden states
                       HMM                    64
English Text
 What  about B
  matrix?
 This is much more
  interesting…
   Why???




                   HMM   65
A Security Application
   Suppose we want to detect metamorphic
    computer viruses
     Such viruses vary their internal structure
     But function of malware stays same
     If sufficiently variable, standard signature
      detection will fail
   Can we use HMM for detection?
     What to use as observation sequence?
     Is there really a “hidden” Markov process?
     What about N, M, and T?
     How many Os needed for training, scoring?


                          HMM                        66
HMM for Metamorphic Detection
 Split set of “family” viruses into 2 subsets
 Extract opcodes from each virus
 Append opcodes from subset 1 to make one
  long sequence
     Train HMM on opcode sequence (problem 3)
     Obtain a model λ = (A,B,π)

   Set threshold: score opcodes from files in
    subset 2 and “normal” files (problem 1)
      Can you set a threshold that separates the sets?
     If so, may have a viable detection method



                         HMM                          67
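
A sketch of the scoring side, assuming a trained model and a forward_scaled helper like the one above. Normalizing the log score by sequence length is one common choice (not stated on the slide) so that opcode sequences of different lengths are comparable.

def score_sequence(model, O):
    A, B, pi = model
    _, _, log_prob = forward_scaled(A, B, pi, O)
    return log_prob / len(O)            # log-likelihood per observation

def classify(model, O, threshold):
    """Above the threshold: looks like the virus family; below: looks normal."""
    return "family" if score_sequence(model, O) > threshold else "normal"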
HMM for Metamorphic Detection
   Virus
    detection
    results from
    recent paper
       Note the
        separation
   This is good!




                     HMM        68
HMM Generalizations
   Here, assumed Markov process of order 1
       Current state depends only on previous state
        and transition matrix A
   Can use higher order Markov process
     Current state depends on n previous states
     Higher order vs size of N ? “Depth” vs “width”

 Can have A and B matrices depend on t
 HMM often combined with other
  techniques (e.g., neural nets)
                           HMM                         69
Generalizations
 In some cases, a limitation of HMM is
 that position information is not used
   In many applications this is OK/desirable
   In some apps, this is a serious problem

 Bioinformatics   applications
   DNA  sequencing, protein alignment, etc.
   Sequence alignment is crucial

   They use “profile HMMs” instead of HMMs


                      HMM                       70
References
 A revealing introduction to hidden
  Markov models, by M. Stamp
   http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf
 A tutorial on hidden Markov models and
  selected applications in speech
  recognition, by L.R. Rabiner
   http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
                    HMM                   71
References
 Hunting for metamorphic engines,
  W. Wong and M. Stamp
   Journal in Computer Virology, Vol. 2, No. 3,
   December 2006, pp. 211-229
 Hunting for undetectable metamorphic
  viruses, D. Lin and M. Stamp
   Journal in Computer Virology, Vol. 7, No. 3,
   August 2011, pp. 201-214


                      HMM                      72
