
Learning algorithms

Rabiner [48] lists three basic problems for HMMs:

  1. Given the observation sequence $ \boldsymbol{X}$ and the model $ \boldsymbol {\theta }$, how can the probability $ p(\boldsymbol{X}\vert \boldsymbol{\theta})$ be computed efficiently?
  2. Given the observation sequence $ \boldsymbol{X}$ and the model $ \boldsymbol {\theta }$, how can the corresponding optimal state sequence $ \boldsymbol {M}$ be estimated?
  3. Given the observation sequence $ \boldsymbol{X}$, how can the model parameters $ \boldsymbol {\theta }$ be optimised to better describe the observations?

Problem 1 is not a learning problem but rather one that arises in evaluating the model. Its difficulty comes from the fact that a naive solution requires summing over all possible state sequences. This can be avoided by calculating the probabilities recursively with respect to the length of the observation sequence. This recursion forms one half of the forward-backward algorithm and is very typical of HMM calculations. The algorithm uses an auxiliary variable $ \alpha_i(t)$, defined as

$\displaystyle \alpha_i(t) = p(x_{1:t}, M_t = i \vert \boldsymbol{\theta}).$ (4.8)

Here we have used the notation $ x_{1:t} = ( x(1), \ldots, x(t) )$.

The algorithm to evaluate $ p(\boldsymbol{X}\vert \boldsymbol{\theta})$ proceeds as follows (a code sketch is given after the steps):

  1. Initialisation:

    $\displaystyle \alpha_j(1) = b_j(x(1)) \pi_j$    

  2. Iteration:

    $\displaystyle \alpha_j(t+1) = \left[ \sum_{i=1}^N \alpha_i(t) a_{ij} \right] b_j(x(t+1))$    

  3. Termination:

    $\displaystyle p(\boldsymbol{X}\vert \boldsymbol{\theta}) = \sum_{i=1}^N \alpha_i(T).$    
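
To make the recursion concrete, below is a minimal NumPy sketch of the forward pass for the case of discrete observations. The names (pi, A, B, forward) and the assumption that observations are integer symbol indices are choices made for this illustration only; no scaling is used, so the sketch will underflow on long sequences.

import numpy as np

def forward(X, pi, A, B):
    """Forward pass: alpha[t, j] = p(x_{1:t+1}, M_{t+1} = j | theta)
    in 0-based indexing.  Assumed (illustrative) shapes: X is a length-T
    integer array, pi is (N,), A is (N, N) with A[i, j] = a_ij, and
    B is (N, K) with B[j, k] = b_j(k)."""
    T, N = len(X), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = B[:, X[0]] * pi                        # 1. initialisation
    for t in range(T - 1):                            # 2. iteration
        alpha[t + 1] = (alpha[t] @ A) * B[:, X[t + 1]]
    return alpha

# 3. termination: p(X | theta) = sum_i alpha_i(T)
# p_X = forward(X, pi, A, B)[-1].sum()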

The solution of Problem 2 is much more difficult. Rabiner takes a classical statistical view and interprets it as a maximum likelihood estimation problem for the state sequence, whose solution is given by the Viterbi algorithm. A Bayesian solution to the problem can be found with a simple modification of the solution of the next problem, which essentially uses the posterior of the hidden states.
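
Since the Viterbi algorithm is only named above, the following sketch (in the same illustrative NumPy setting as the forward pass) shows the standard textbook recursion; working in the log domain avoids numerical underflow:

def viterbi(X, pi, A, B):
    """Most likely state sequence, arg max over M of p(M | X, theta)."""
    T, N = len(X), len(pi)
    logd = np.zeros((T, N))                 # best log-probability so far
    back = np.zeros((T, N), dtype=int)      # backpointers
    logd[0] = np.log(pi) + np.log(B[:, X[0]])
    for t in range(1, T):
        scores = logd[t - 1][:, None] + np.log(A)
        back[t] = scores.argmax(axis=0)
        logd[t] = scores.max(axis=0) + np.log(B[:, X[t]])
    path = np.zeros(T, dtype=int)           # trace the optimal path back
    path[-1] = logd[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path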

Rabiner's statement of Problem 3 is more precise than ours: it asks for the maximum likelihood solution, optimising $ p(\boldsymbol{X}\vert \boldsymbol{\theta})$ with respect to $ \boldsymbol {\theta }$. This problem is solved by the Baum-Welch algorithm, an application of the EM algorithm to HMMs.

The Baum-Welch algorithm uses the complete forward-backward procedure, with a backward pass to compute $ \beta_i(t) = p(x_{t:T} \vert M_t = i, \boldsymbol{\theta})$. This can be done very much like the forward pass (a code sketch follows the steps):

  1. Initialisation:

    $\displaystyle \beta_i(T) = b_i(x(T))$    

  2. Iteration:

    $\displaystyle \beta_i(t) = b_i(x(t)) \left[ \sum_{j=1}^N a_{ij} \beta_j(t+1) \right].$    
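
A matching sketch of the backward pass, under the same illustrative assumptions as the forward-pass sketch above; note that it follows the convention used here, in which $ \beta_i(t)$ includes the emission term $ b_i(x(t))$:

def backward(X, A, B):
    """Backward pass: beta[t, i] = p(X[t:] | state i at time t, theta),
    i.e. the convention above that includes the emission at time t."""
    T, N = len(X), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = B[:, X[-1]]                  # 1. initialisation
    for t in range(T - 2, -1, -1):          # 2. iteration, backwards in time
        beta[t] = B[:, X[t]] * (A @ beta[t + 1])
    return beta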

The M-step of the Baum-Welch algorithm can be expressed in terms of $ \eta_{ij}(t)$, the posterior probability of a transition from state $ i$ to state $ j$ at time step $ t$, given $ \boldsymbol{X}$ and $ \boldsymbol {\theta }$. This probability is easily calculated from the forward and backward variables $ \alpha$ and $ \beta$:

$\displaystyle \eta_{ij}(t) = p(M_t = i, M_{t+1} = j \vert \boldsymbol{X}, \boldsymbol{\theta}) = \frac{1}{Z_n} \alpha_i(t) a_{ij} \beta_j(t+1)$ (4.9)

where $ Z_n$ is a normalising constant such that $ \sum_{i,j=1}^N \eta_{ij}(t) = 1$. Then the M-step for $ a_{ij}$ is

$\displaystyle a_{ij} = \frac{1}{Z'_n} \sum_{t=1}^{T-1} \eta_{ij}(t)$ (4.10)

where $ Z'_n$ is another normalising constant that makes the transition probabilities out of each state sum to one. There are similar update formulas for the other parameters. The algorithm can also easily be extended to take into account possible priors on the different variables [39].
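
Putting the pieces together, here is a sketch of the transition-matrix re-estimation of Eqs. (4.9) and (4.10), reusing the forward and backward functions sketched above (again unscaled, for illustration only):

def reestimate_transitions(X, pi, A, B):
    """One Baum-Welch M-step for the transition matrix A."""
    alpha, beta = forward(X, pi, A, B), backward(X, A, B)
    # Eq. (4.9): eta[t, i, j] proportional to alpha_i(t) a_ij beta_j(t+1)
    eta = alpha[:-1, :, None] * A[None, :, :] * beta[1:, None, :]
    eta /= eta.sum(axis=(1, 2), keepdims=True)        # normalisation Z_n
    # Eq. (4.10): a_ij proportional to the sum of eta_ij(t) over t
    A_new = eta.sum(axis=0)
    return A_new / A_new.sum(axis=1, keepdims=True)   # normalisation Z'_n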


