Hidden Markov Model

If we assume that the latent variables form a Markov chain, we obtain a graphical structure known as a state space model.

If the latent variables are discrete, then we obtain the hidden Markov model, or HMM.

Note that the observed variables in an HMM may be discrete or continuous, and a variety of different conditional distributions can be used to model them.

Transition and Emission Probabilities

We allow the probability distribution of $z_n$ to depend on the state of the previous latent variable $z_{n-1}$ through a conditional distribution $p(z_n\vert z_{n-1})$.

Because the latent variables are $K$-dimensional binary variables (1-of-$K$ coding), this conditional distribution corresponds to a table of numbers that we denote by $\mathbf{A}$, the elements of which are known as transition probabilities.

Transition Probabilities

The transition probabilities are defined as follows (a small numerical example is given after the diagrams below):

\[\begin{aligned} p(\boldsymbol{z}_n\vert \boldsymbol{z}_{n-1}, \mathbf{A}) &= \prod_{j=1}^K \prod_{k=1}^K A_{jk}^{z_{n-1,j}\, z_{nk}} \\ p(\boldsymbol{z}_1\vert \boldsymbol{\pi}) &= \prod_{k=1}^K \pi_k^{z_{1k}} \end{aligned}\]
  • where $\pi_k \equiv p(z_{1k} = 1)$ and $\sum_k \pi_k = 1$

  • The state transition diagram of an HMM is shown below

    Transition Diagram

  • The lattice (trellis) diagram of an HMM is shown below

    Lattice or Trellis Diagram
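As a concrete numerical illustration of the quantities above, the short sketch below builds a hypothetical transition matrix $\mathbf{A}$ and initial distribution $\boldsymbol{\pi}$ with NumPy; the values are made up, and $p(z_n = k \vert z_{n-1} = j)$ is simply the entry A[j, k].

```python
import numpy as np

# Hypothetical 3-state HMM: each row of A sums to 1, as does pi.
A = np.array([[0.7, 0.2, 0.1],   # transitions out of state 0
              [0.1, 0.8, 0.1],   # transitions out of state 1
              [0.2, 0.3, 0.5]])  # transitions out of state 2
pi = np.array([0.6, 0.3, 0.1])   # initial state distribution p(z_1)

assert np.allclose(A.sum(axis=1), 1.0) and np.isclose(pi.sum(), 1.0)

# p(z_n = k | z_{n-1} = j) is A[j, k]; e.g. the probability of moving
# from state 0 to state 2 is:
print(A[0, 2])  # 0.1
```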

Emission Probabilities

The emission probabilities are defined as:

\[\begin{aligned} p(\boldsymbol{x}_n\vert \boldsymbol{z}_{n}, \boldsymbol{w}) &= \prod_{k=1}^K p(\boldsymbol{x}_n\vert \boldsymbol{w}_k)^{z_{nk}} \end{aligned}\]
  • $\boldsymbol{w} = \{\boldsymbol{w}_1, \dots, \boldsymbol{w}_K\}$ is the set of parameters governing the emission distributions, with one parameter vector $\boldsymbol{w}_k$ for each latent state (a Gaussian example is sketched below)
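For example, if the emission densities are chosen to be Gaussian (one of the options noted above), then $\boldsymbol{w}_k = (\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ and $p(\boldsymbol{x}_n\vert \boldsymbol{w}_k) = \mathcal{N}(\boldsymbol{x}_n\vert \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. A minimal sketch follows; the parameter values and the helper name emission_probs are purely illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical Gaussian emission parameters w_k = (mu_k, Sigma_k) for K = 3 states in 2-D.
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigmas = np.stack([np.eye(2)] * 3)

def emission_probs(x):
    """Return the K-vector [p(x | w_1), ..., p(x | w_K)]."""
    return np.array([multivariate_normal.pdf(x, mean=mu, cov=S)
                     for mu, S in zip(mus, Sigmas)])

print(emission_probs(np.array([0.5, -0.2])))
```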

Homogeneous Model

We shall focus attention on homogeneous models, for which all of the conditional distributions governing the latent variables share the same parameters $\mathbf{A}$, and similarly all of the emission distributions share the same parameters $\boldsymbol{w}$ (the extension to more general cases is straightforward).

HMM and Mixture Model

Note that a mixture model for an i.i.d. data set corresponds to the special case in which the parameters $A_{jk}$ are the same for all values of $j$, so that the conditional distribution $p(\boldsymbol{z}_n\vert \boldsymbol{z}_{n-1})$ is independent of $\boldsymbol{z}_{n-1}$.

Under the Homogeneous Model assumption, the joint probability distribution over both latent and observed variables is then given by

\[\begin{aligned} p(\mathbf{X}, \mathbf{Z} \vert \boldsymbol{\theta}) = p(\boldsymbol{z}_1\vert \boldsymbol{\pi}) \left[ \prod_{n=2}^N p(\boldsymbol{z}_n\vert \boldsymbol{z}_{n-1}, \mathbf{A}) \right] \prod_{m=1}^N p(\boldsymbol{x}_m\vert \boldsymbol{z}_m, \boldsymbol{w}) \end{aligned}\]
  • where $\boldsymbol{\theta} = \{\boldsymbol{\pi}, \mathbf{A}, \boldsymbol{w}\}$ denotes the set of model parameters (a sketch that evaluates the log of this joint distribution is given below)
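As a sketch of how this factorization is used in practice, the function below evaluates $\ln p(\mathbf{X}, \mathbf{Z}\vert \boldsymbol{\theta})$ for one particular state path, leaving the form of the emission density open; the argument names and the log_emission callback are assumptions of this sketch, not notation from the text.

```python
import numpy as np

def log_joint(X, Z, pi, A, log_emission):
    """ln p(X, Z | theta) for a single state path.

    X            : sequence of observations x_1..x_N
    Z            : integer state indices z_1..z_N (0-based)
    pi, A        : initial distribution and transition matrix
    log_emission : callable (x, k) -> ln p(x | w_k)
    """
    ll = np.log(pi[Z[0]]) + log_emission(X[0], Z[0])   # n = 1 term
    for n in range(1, len(X)):
        ll += np.log(A[Z[n - 1], Z[n]])                # transition term
        ll += log_emission(X[n], Z[n])                 # emission term
    return ll
```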

Most of our discussion of the hidden Markov model will be independent of the particular choice of the emission probabilities.

Indeed, the model is tractable for a wide range of emission distributions including discrete tables, Gaussians, and mixtures of Gaussians. It is also possible to exploit discriminative models such as neural networks.

These can be used to model the emission density $p(x\vert z)$ directly, or to provide a representation for $p(z\vert x)$ that can be converted into the required emission density $p(x\vert z) $ using Bayes’ theorem.

We can also gain a better understanding of the hidden Markov model by considering it from a generative point of view, just as with ancestral sampling for a directed graphical model.
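To make the generative view concrete, here is a minimal ancestral-sampling sketch: sample $\boldsymbol{z}_1$ from $\boldsymbol{\pi}$, emit $\boldsymbol{x}_1$, then repeatedly sample $\boldsymbol{z}_n$ from the row of $\mathbf{A}$ selected by $\boldsymbol{z}_{n-1}$ and emit $\boldsymbol{x}_n$. The isotropic Gaussian emissions and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(N, pi, A, mus, sigma=1.0):
    """Ancestral sampling of (z_1..z_N, x_1..x_N) with Gaussian emissions
    x_n ~ N(mu_{z_n}, sigma^2 I) (a modelling assumption of this sketch)."""
    K, D = mus.shape
    z = np.empty(N, dtype=int)
    x = np.empty((N, D))
    z[0] = rng.choice(K, p=pi)                 # z_1 ~ pi
    x[0] = rng.normal(mus[z[0]], sigma)
    for n in range(1, N):
        z[n] = rng.choice(K, p=A[z[n - 1]])    # z_n ~ row of A chosen by z_{n-1}
        x[n] = rng.normal(mus[z[n]], sigma)    # emit from the current state
    return z, x

# Example run with made-up parameters:
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
z, x = sample_hmm(10, pi, A, mus)
print(z)
```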

Left-to-Right Model

A left-to-right model is obtained by setting the elements $A_{jk}$ of $\mathbf{A}$ to zero whenever $j > k$, so that transitions can never move to a state with a lower index.

  • The transition diagram and lattice (trellis) diagram of the left-to-right model are shown below

    Transition Diagram

    Lattice or Trellis Diagram

Inference and Learning of HMM

Given a data set $\mathbf{X} = \{\boldsymbol{x}_1, \dots, \boldsymbol{x}_N\}$, we wish to determine the parameters of the HMM by maximum likelihood.

Likelihood Function

The likelihood function is obtained from the joint distribution by marginalizing over the latent variables:

\[\begin{aligned} p(\mathbf{X} \vert \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X} ,\mathbf{Z} \vert \boldsymbol{\theta}) \end{aligned}\]

EM for HMM

E-step

\[\begin{aligned} \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{old}) &= \sum_{\mathbf{Z}} p(\mathbf{Z}\vert \mathbf{X}, \boldsymbol{\theta}^{old}) \ln p(\mathbf{X}, \mathbf{Z} \vert \boldsymbol{\theta}) \\ &= \sum_{k=1}^K \mathbb{E}[z_{1k}]\ln \pi_k + \sum_{n=2}^N\sum_{j=1}^K\sum_{k=1}^K \mathbb{E}[z_{n-1,j}\, z_{nk}]\ln A_{jk} \\ &\quad + \sum_{n=1}^N \sum_{k=1}^K \mathbb{E}[z_{nk}] \ln p(\boldsymbol{x}_n \vert \boldsymbol{w}_k) \end{aligned}\]
  • To simplify notation, we define $\gamma(\boldsymbol{z}_n) = p(\boldsymbol{z}_n\vert \mathbf{X}, \boldsymbol{\theta}^{old})$ and $\xi(\boldsymbol{z}_{n-1}, \boldsymbol{z}_{n}) = p(\boldsymbol{z}_{n-1}, \boldsymbol{z}_{n}\vert \mathbf{X}, \boldsymbol{\theta}^{old})$, and write:

    \[\begin{aligned} \gamma_{nk} &= \mathbb{E}[z_{nk}] = \sum_{\boldsymbol{z}_n} \gamma(\boldsymbol{z}_n) z_{nk} \\ \xi_{z_{n-1,j}, z_{nk}} &= \mathbb{E}[z_{n-1,j}\, z_{nk}] = \sum_{\boldsymbol{z}_{n-1}, \boldsymbol{z}_n} \xi(\boldsymbol{z}_{n-1}, \boldsymbol{z}_{n})z_{n-1,j}z_{nk} \end{aligned}\]

M-Step

\[\begin{aligned} \boldsymbol{\theta}^{new} = \arg \max_{\boldsymbol{\theta}} \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{old}) \end{aligned}\]
  • Maximizing with respect to $\boldsymbol{\pi}$ and $\mathbf{A}$, we obtain (a code sketch of these updates follows this equation):

    \[\begin{aligned} \pi_k &= \frac {\gamma_{1k}}{\sum_{j=1}^K \gamma_{1j}} \\ A_{jk} &= \frac {\sum_{n=2}^N \xi_{z_{n-1,j}z_{nk}}}{\sum_{l=1}^K\sum_{n=2}^N\xi_{z_{n-1,j}z_{nl}}} \end{aligned}\]
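A sketch of these two updates, assuming gamma is stored as an $N \times K$ array of $\gamma_{nk}$ and xi as an $(N-1) \times K \times K$ array of the pairwise expectations; the array layout is an assumption of the sketch, and the emission-parameter update is omitted because it depends on the chosen emission family.

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """M-step updates for pi and A.

    gamma : (N, K) array, gamma[n, k] = E[z_{nk}]
    xi    : (N-1, K, K) array; xi[n, j, k] is the pairwise expectation for the
            transition from time step n to n+1 (0-based indexing).
    """
    pi_new = gamma[0] / gamma[0].sum()
    counts = xi.sum(axis=0)                              # sum over n = 2..N
    A_new = counts / counts.sum(axis=1, keepdims=True)   # normalize each row j over k
    return pi_new, A_new
```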

The EM algorithm must be initialized by choosing starting values for $\boldsymbol{\pi}$ and $\mathbf{A}$, which should of course respect the summation constraints associated with their probabilistic interpretation.

To maximize $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{old})$ with respect to $\boldsymbol{w}_k$, we notice that only the final term depends on $\boldsymbol{w}_k$, and furthermore this term has exactly the same form as the data-dependent term in the corresponding function for a standard mixture distribution for i.i.d. data.

If the parameters $\boldsymbol{w}_k$ are independent for the different components, then this term decouples into a sum of terms, one for each value of $k$, each of which can be maximized independently.

The EM algorithm requires initial values for the parameters of the emission distribution. One way to set these is to treat the data initially as i.i.d., fit the emission density by maximum likelihood, and then use the resulting values to initialize the parameters for EM.

The forward-backward algorithm

Next we seek an efficient procedure for evaluating the quantities $\gamma_{nk}$ and $\xi_{z_{n-1,j}, z_{nk}}$, corresponding to the E step of the EM algorithm.

The graph for the hidden Markov model is a tree, and so we know that the posterior distribution of the latent variables can be obtained efficiently using a two-stage message passing algorithm.

In the particular context of the hidden Markov model, this is known as the forward-backward algorithm or the Baum-Welch algorithm.

There are in fact several variants of the basic algorithm, all of which lead to the exact marginals, according to the precise form of the messages that are propagated along the chain. We shall focus on the most widely used of these, known as the alpha-beta algorithm.

Evaluate $\gamma_{nk}$

Recall that for a discrete multinomial random variable the expected value of one of its components is just the probability of that component having the value 1.

Thus we are interested in finding the posterior distribution $p(\boldsymbol{z}_n\vert \mathbf{X})$.

This represents a vector of length $K$ whose entries correspond to the expected values of $z_{nk}$.

\[\begin{aligned} \gamma(\boldsymbol{z}_n) = p(\boldsymbol{z}_n\vert \mathbf{X}) = \frac {p(\mathbf{X}\vert \boldsymbol{z}_n) p(\boldsymbol{z}_n)}{p(\mathbf{X})} \end{aligned}\]
  • Note that the denominator $p(\mathbf{X})$ is implicitly conditioned on the parameters $\boldsymbol{\theta}^{old}$ of the HMM and hence represents the likelihood function.

  • Using the conditional independence properties of the HMM:

    \[\begin{aligned} \gamma(\boldsymbol{z}_n) &= \frac {p(\boldsymbol{x}_1,...,\boldsymbol{x}_n, \boldsymbol{z}_n)p(\boldsymbol{x}_{n+1},...,\boldsymbol{x}_N \vert \boldsymbol{z}_n)}{p(\mathbf{X})} = \frac {\alpha(\boldsymbol{z}_n)\beta(\boldsymbol{z}_n)}{p(\mathbf{X})} \\ \\ \alpha(\boldsymbol{z}_n) &= p(\boldsymbol{x}_1,...,\boldsymbol{x}_n, \boldsymbol{z}_n) \\ \beta(\boldsymbol{z}_n) &= p(\boldsymbol{x}_{n+1},...,\boldsymbol{x}_N \vert \boldsymbol{z}_n) \end{aligned}\]
  • recursion formula for $\alpha(\boldsymbol{z}_n)$

    \[\begin{aligned} \alpha(\boldsymbol{z}_n) &= p(\boldsymbol{x}_1,...,\boldsymbol{x}_n, \boldsymbol{z}_n) \\ &= p(\boldsymbol{x}_1,...,\boldsymbol{x}_n\vert \boldsymbol{z}_n)p(\boldsymbol{z}_n) \\&= p(\boldsymbol{x}_n\vert \boldsymbol{z}_n)p(\boldsymbol{x}_1,...,\boldsymbol{x}_{n-1}\vert \boldsymbol{z}_n) p(\boldsymbol{z}_n) \\ &= p(\boldsymbol{x}_n\vert \boldsymbol{z}_n)p(\boldsymbol{x}_1,...,\boldsymbol{x}_{n-1}, \boldsymbol{z}_n) \\ &= p(\boldsymbol{x}_n\vert \boldsymbol{z}_n)\sum_{\boldsymbol{z}_{n-1}} p(\boldsymbol{x}_1,...,\boldsymbol{x}_{n-1}, \boldsymbol{z}_{n-1} ,\boldsymbol{z}_n) \\ &= p(\boldsymbol{x}_n\vert \boldsymbol{z}_n)\sum_{\boldsymbol{z}_{n-1}} p(\boldsymbol{x}_1,...,\boldsymbol{x}_{n-1}, \boldsymbol{z}_{n-1})p(\boldsymbol{z}_n\vert \boldsymbol{z}_{n-1}) \\ &= p(\boldsymbol{x}_n\vert \boldsymbol{z}_n)\sum_{\boldsymbol{z}_{n-1}} \alpha(\boldsymbol{z}_{n-1})p(\boldsymbol{z}_n\vert \boldsymbol{z}_{n-1}) \\ \\ \alpha(\boldsymbol{z}_1) &= p(\boldsymbol{x}_1 \vert \boldsymbol{z}_1)p(\boldsymbol{z}_1) = \prod_{k=1}^K (\pi_k p(\boldsymbol{x}_1\vert \boldsymbol{w}_k))^{z_{1k}} \end{aligned}\]
    • forward message passing: $1 \rightarrow n$
  • recursion formula for $\beta(\boldsymbol{z}_n)$

    \[\begin{aligned} \beta(\boldsymbol{z}_n) &= p(\boldsymbol{x}_{n+1},...,\boldsymbol{x}_N \vert \boldsymbol{z}_n) \\ &= \sum_{\boldsymbol{z}_{n+1}} p(\boldsymbol{x}_{n+2},...,\boldsymbol{x}_N \vert \boldsymbol{z}_{n+1}) p(\boldsymbol{x}_{n+1}\vert \boldsymbol{z}_{n+1}) p(\boldsymbol{z}_{n+1}\vert \boldsymbol{z}_{n}) \\ &= \sum_{\boldsymbol{z}_{n+1}} \beta(\boldsymbol{z}_{n+1}) p(\boldsymbol{x}_{n+1}\vert \boldsymbol{z}_{n+1}) p(\boldsymbol{z}_{n+1}\vert \boldsymbol{z}_{n}) \end{aligned}\]
    • backward message passing: $N \rightarrow n$

    • here we also need a starting condition for the recursion, which can be obtained by setting $n=N$; this gives $\beta(\boldsymbol{z}_N) = 1$

    • Note that the quantity $p(\mathbf{X})$ represents the likelihood function, whose value we typically wish to monitor during the EM optimization, and so it is useful to be able to evaluate it (an unscaled implementation sketch follows this list).

      \[\begin{aligned} p(\mathbf{X}) &= \sum_{\boldsymbol{z}_n} \alpha(\boldsymbol{z}_n)\beta(\boldsymbol{z}_n)\\ &= \sum_{\boldsymbol{z}_N} \alpha(\boldsymbol{z}_N) \end{aligned}\]
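A direct, unscaled transcription of the $\alpha$ and $\beta$ recursions above. It assumes the emission densities have been pre-evaluated into an $N \times K$ matrix B with B[n, k] $= p(\boldsymbol{x}_n\vert \boldsymbol{w}_k)$; as discussed in the Scaling Factors section below, this version underflows for long chains and is shown only to mirror the equations.

```python
import numpy as np

def forward_backward_unscaled(pi, A, B):
    """Unscaled alpha-beta recursions; B[n, k] = p(x_n | w_k)."""
    N, K = B.shape
    alpha = np.zeros((N, K))
    beta = np.ones((N, K))                       # beta(z_N) = 1
    alpha[0] = pi * B[0]                         # alpha(z_1) = p(x_1|z_1) p(z_1)
    for n in range(1, N):                        # forward pass: 1 -> N
        alpha[n] = B[n] * (alpha[n - 1] @ A)
    for n in range(N - 2, -1, -1):               # backward pass: N -> 1
        beta[n] = A @ (B[n + 1] * beta[n + 1])
    likelihood = alpha[-1].sum()                 # p(X) = sum_{z_N} alpha(z_N)
    return alpha, beta, likelihood
```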

Evaluate $\xi_{z_{n-1,j}, z_{nk}}$

\[\begin{aligned}\xi(\boldsymbol{z}_{n-1}, \boldsymbol{z}_{n}) = p(\boldsymbol{z}_{n-1}, \boldsymbol{z}_{n} \vert \mathbf{X})\end{aligned}\]
  • The quantity $\xi(\boldsymbol{z}_{n-1}, \boldsymbol{z}_{n})$ gives a value for each of the $K \times K$ settings of $(\boldsymbol{z}_{n-1}, \boldsymbol{z}_{n})$.
\[\begin{aligned} \xi(\boldsymbol{z}_{n-1}, \boldsymbol{z}_{n}) &= p(\boldsymbol{z}_{n-1}, \boldsymbol{z}_{n} \vert \mathbf{X}) \\ &= \frac {p(\mathbf{X}\vert \boldsymbol{z}_{n-1}, \boldsymbol{z}_{n})p(\boldsymbol{z}_{n-1}, \boldsymbol{z}_{n})}{p(\mathbf{X})} \\ &= \frac {p(\boldsymbol{x}_{`n+1`}, ...,\boldsymbol{x}_{n-1} \vert \boldsymbol{z}_{n-1})p(\boldsymbol{z}_{n-1})p(\boldsymbol{x}_n\vert \boldsymbol{z}_n) p(\boldsymbol{x}_{n+1}, ...,\boldsymbol{x}_{N}\vert \boldsymbol{z}_n)p(\boldsymbol{z}_{n}\vert \boldsymbol{z}_{n-1})}{p(\mathbf{X})} \\ &= \frac {\alpha(\boldsymbol{z}_{n-1}) p(\boldsymbol{x}_n\vert \boldsymbol{z}_n) p(\boldsymbol{z}_{n}\vert \boldsymbol{z}_{n-1}) \beta(\boldsymbol{z}_{n})}{p(\mathbf{X})} \end{aligned}\]

Predictive Distribution

The predictive distribution of the next observation $\boldsymbol{x}_{N+1}$, given the observed data $\mathbf{X}$, is obtained by summing over both $\boldsymbol{z}_{N+1}$ and $\boldsymbol{z}_N$:

\[\begin{aligned} p(\boldsymbol{x}_{N+1}\vert \mathbf{X}) &= \sum_{\boldsymbol{z}_{N+1}} p(\boldsymbol{x}_{N+1}, \boldsymbol{z}_{N+1} \vert \mathbf{X}) \\ &= \sum_{\boldsymbol{z}_{N+1}} p(\boldsymbol{x}_{N+1} \vert \boldsymbol{z}_{N+1}) \sum_{\boldsymbol{z}_N} p(\boldsymbol{z}_{N+1}, \boldsymbol{z}_N \vert \mathbf{X}) \\ &= \sum_{\boldsymbol{z}_{N+1}} p(\boldsymbol{x}_{N+1} \vert \boldsymbol{z}_{N+1}) \sum_{\boldsymbol{z}_N} p(\boldsymbol{z}_{N+1}\vert \boldsymbol{z}_N) p(\boldsymbol{z}_N \vert \mathbf{X}) \\ &= \sum_{\boldsymbol{z}_{N+1}} p(\boldsymbol{x}_{N+1} \vert \boldsymbol{z}_{N+1}) \sum_{\boldsymbol{z}_N} p(\boldsymbol{z}_{N+1}\vert \boldsymbol{z}_N) \frac {p(\boldsymbol{z}_N, \mathbf{X}) }{p(\mathbf{X})} \\ &= \frac {1}{p(\mathbf{X})} \sum_{\boldsymbol{z}_{N+1}} p(\boldsymbol{x}_{N+1} \vert \boldsymbol{z}_{N+1}) \sum_{\boldsymbol{z}_N} p(\boldsymbol{z}_{N+1}\vert \boldsymbol{z}_N) \alpha(\boldsymbol{z}_N) \end{aligned}\]
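Read as code, the last line above says: propagate $\alpha(\boldsymbol{z}_N)$ one step through $\mathbf{A}$, mix the emission densities with the resulting weights, and divide by $p(\mathbf{X})$. A sketch under the same conventions as the earlier snippets; emission_pdf(x, k) returning $p(\boldsymbol{x}\vert \boldsymbol{w}_k)$ is an assumed helper.

```python
import numpy as np

def predictive_density(x_new, alpha_N, A, emission_pdf):
    """p(x_{N+1} | X) from the final (unscaled) forward message alpha_N = alpha(z_N)."""
    pX = alpha_N.sum()                      # p(X)
    w = (alpha_N @ A) / pX                  # K-vector of p(z_{N+1} = k | X)
    return sum(w[k] * emission_pdf(x_new, k) for k in range(len(w)))
```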

Scaling Factors

There is an important issue that must be addressed before we can make use of the forward-backward algorithm in practice.

The $\alpha$ values are formed from products of transition and emission probabilities. Because these probabilities are often significantly less than unity, as we work our way forward along the chain, the values of $\alpha(\boldsymbol{z}_{n})$ can go to zero exponentially quickly.

In the case of i.i.d. data, we implicitly circumvented this problem with the evaluation of likelihood functions by taking logarithms.

Unfortunately, this will not help here because we are forming sums of products of small numbers (we are in fact implicitly summing over all possible paths through the lattice diagram).

We therefore work with rescaled versions of $\alpha(\boldsymbol{z}_{n})$ and $\beta(\boldsymbol{z}_{n})$:

  • the rescaled quantities are defined so that their values remain of order unity

    \[\begin{aligned} \widehat{\alpha}(\boldsymbol{z}_n) = p(\boldsymbol{z}_n \vert \boldsymbol{x}_1,...,\boldsymbol{x}_n) =\frac {\alpha(\boldsymbol{z}_n)}{p(\boldsymbol{x}_1,...,\boldsymbol{x}_n)} \end{aligned}\]
  • In order to relate the scaled and original alpha variables, we introduce scaling factors defined by conditional distributions over the observed variables

    \[\begin{aligned} c_n = p(\boldsymbol{x}_n\vert \boldsymbol{x}_1,...,\boldsymbol{x}_{n-1}) \end{aligned}\]
    • It follows that:
    \[\begin{aligned} p( \boldsymbol{x}_1,...,\boldsymbol{x}_{n}) &=\prod_{m=1}^n c_m \\ \alpha(\boldsymbol{z}_n) &= \widehat{\alpha}(\boldsymbol{z}_n)\prod_{m=1}^n c_m \\ \\ c_n \widehat{\alpha}(\boldsymbol{z}_n) &= p(\boldsymbol{x}_n\vert \boldsymbol{z}_n)\sum_{\boldsymbol{z}_{n-1}} \widehat{\alpha}(\boldsymbol{z}_{n-1})p(\boldsymbol{z}_n\vert \boldsymbol{z}_{n-1}) \end{aligned}\]
    • Note: we also have to evaluate and store $c_n$, but it is easily done because it is the coefficient that normalizes the right-hand side to give $\widehat{\alpha}(\boldsymbol{z}_{n})$
  • Similarly, for $\beta(\boldsymbol{z}_n)$:

    \[\begin{aligned} \beta(\boldsymbol{z}_n) &= \widehat{\beta}(\boldsymbol{z}_n) \prod_{m=n+1}^N c_m \\ \widehat{\beta}(\boldsymbol{z}_n) &= \frac {p(\boldsymbol{x}_{n+1}, ..., \boldsymbol{x}_{N}\vert \boldsymbol{z}_{n})}{p(\boldsymbol{x}_{n+1},...\boldsymbol{x}_{N}\vert \boldsymbol{x}_{1},...,\boldsymbol{x}_{n})} \\ c_{n+1}\widehat{\beta}(\boldsymbol{z}_{n}) &= \sum_{\boldsymbol{z}_{n+1}} \widehat{\beta}(\boldsymbol{z}_{n+1}) p(\boldsymbol{x}_{n+1}\vert \boldsymbol{z}_{n+1}) p(\boldsymbol{z}_{n+1}\vert \boldsymbol{z}_{n}) \end{aligned}\]
  • the likelihood $p(\mathbf{X})$ can then be evaluated as:

    \[\begin{aligned} p(\mathbf{X}) = \prod_{n=1}^N c_n \end{aligned}\]
  • finally, $\gamma(\boldsymbol{z}_n)$ and $\xi(\boldsymbol{z}_{n-1}, \boldsymbol{z}_n)$ are obtained from the rescaled quantities as follows (a code sketch of the full scaled pass is given after this list):

    \[\begin{aligned} \gamma(\boldsymbol{z}_n) &= \widehat{\alpha}(\boldsymbol{z}_n) \widehat{\beta}(\boldsymbol{z}_{n}) \\ \xi(\boldsymbol{z}_{n-1}, \boldsymbol{z}_n) &= \frac{1}{c_n}\, \widehat{\alpha}(\boldsymbol{z}_{n-1})p(\boldsymbol{x}_n\vert \boldsymbol{z}_n)p(\boldsymbol{z}_n\vert \boldsymbol{z}_{n-1}) \widehat{\beta}(\boldsymbol{z}_{n}) \end{aligned}\]
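Putting the scaled recursions together, here is a minimal sketch of a scaled forward-backward pass that returns $\gamma$, $\xi$ and $\ln p(\mathbf{X}) = \sum_n \ln c_n$; as before, B[n, k] $= p(\boldsymbol{x}_n\vert \boldsymbol{w}_k)$ is an assumed input format.

```python
import numpy as np

def forward_backward_scaled(pi, A, B):
    """Scaled alpha-beta pass; returns gamma (N, K), xi (N-1, K, K) and ln p(X)."""
    N, K = B.shape
    alpha_hat = np.zeros((N, K))
    beta_hat = np.ones((N, K))                   # beta_hat(z_N) = 1
    c = np.zeros(N)

    a = pi * B[0]                                # c_1 * alpha_hat(z_1)
    c[0] = a.sum()
    alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = B[n] * (alpha_hat[n - 1] @ A)        # c_n * alpha_hat(z_n)
        c[n] = a.sum()                           # c_n = p(x_n | x_1..x_{n-1})
        alpha_hat[n] = a / c[n]

    for n in range(N - 2, -1, -1):
        beta_hat[n] = (A @ (B[n + 1] * beta_hat[n + 1])) / c[n + 1]

    gamma = alpha_hat * beta_hat                 # gamma(z_n) = alpha_hat * beta_hat
    xi = np.zeros((N - 1, K, K))
    for n in range(1, N):                        # xi for the pair (z_{n-1}, z_n)
        xi[n - 1] = (alpha_hat[n - 1][:, None] * A
                     * (B[n] * beta_hat[n])[None, :]) / c[n]
    return gamma, xi, np.log(c).sum()            # ln p(X) = sum_n ln c_n
```

Iterating this E-step routine with the M-step updates sketched earlier gives a bare-bones Baum-Welch loop.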

The Viterbi Algorithm

The Viterbi algorithm finds the most probable sequence of latent states for a given observation sequence by applying the max-sum algorithm to the HMM lattice. Writing $\omega(\boldsymbol{z}_n)$ for the maximum of the log joint probability over all paths ending in state $\boldsymbol{z}_n$, the recursion is

\[\begin{aligned} \omega(\boldsymbol{z}_1) &= \ln p(\boldsymbol{z}_1) + \ln p(\boldsymbol{x}_1\vert \boldsymbol{z}_1) \\ \omega(\boldsymbol{z}_{n+1}) &= \ln p(\boldsymbol{x}_{n+1}\vert \boldsymbol{z}_{n+1}) + \max_{\boldsymbol{z}_n} \left\{ \ln p(\boldsymbol{z}_{n+1}\vert \boldsymbol{z}_n) + \omega(\boldsymbol{z}_n) \right\} \end{aligned}\]

and the most probable path is recovered by backtracking through the states that achieved each maximum.
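A minimal log-domain sketch of this recursion with backtracking; as in the earlier snippets, B[n, k] $= p(\boldsymbol{x}_n\vert \boldsymbol{w}_k)$ is an assumed input format.

```python
import numpy as np

def viterbi(pi, A, B):
    """Most probable state path via max-sum (Viterbi); B[n, k] = p(x_n | w_k)."""
    N, K = B.shape
    logA = np.log(A)
    omega = np.log(pi) + np.log(B[0])               # omega(z_1)
    back = np.zeros((N, K), dtype=int)              # argmax bookkeeping for backtracking
    for n in range(1, N):
        scores = omega[:, None] + logA              # scores[j, k] = omega_j + ln A_{jk}
        back[n] = scores.argmax(axis=0)             # best previous state for each k
        omega = np.log(B[n]) + scores.max(axis=0)   # omega(z_n)
    path = np.zeros(N, dtype=int)
    path[-1] = omega.argmax()                       # most probable final state
    for n in range(N - 1, 0, -1):
        path[n - 1] = back[n, path[n]]              # backtrack
    return path, omega.max()                        # path and max of ln p(X, Z)
```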
This post is licensed under CC BY 4.0 by the author.