T-61.5020 Statistical Natural Language Processing
Answers 10 -- Speech recognition and language model evaluation
Version 1.1

1.
Again, we will use the Viterbi algorithm to find the most probable state sequence of a Hidden Markov Model. There are three differences from the weather model presented in the earlier exercise: the emissions now occur on the state transitions, the model has some null transitions, and the final state is fixed.

a)
Let's initialize the grid such that the initial state is $ S_1$. We will collect only non-zero probability values.
$\displaystyle \delta_0(1)$ $\displaystyle =$ $\displaystyle 1$  

The first observation

The initial state can lead only to the second or the fourth state, so let's calculate those probabilities:

$\displaystyle \delta_1(2)$ $\displaystyle =$ $\displaystyle a_{12} b_{12}(o_1) = 0.5 \cdot 10^{-1} = 5 \cdot 10^{-2}$  
$\displaystyle \psi_1(2)$ $\displaystyle =$ $\displaystyle 1$  
$\displaystyle \delta_1(4)$ $\displaystyle =$ $\displaystyle a_{14} b_{14}(o_1) = 0.5 \cdot 10^{-3} = 5 \cdot 10^{-4}$  
$\displaystyle \psi_1(4)$ $\displaystyle =$ $\displaystyle 1$  

The second observation

From the second state we can go to the third state, and from the fourth state to the fifth state, so there are no choices to be made in those steps.

$\displaystyle \delta_2(3)$ $\displaystyle =$ $\displaystyle \delta_1(2) a_{23} b_{23}(o_2) = 5 \cdot 10^{-2} \cdot 1.0 \cdot 10^{-1} = 5 \cdot 10^{-3}$  
$\displaystyle \psi_2(3)$ $\displaystyle =$ $\displaystyle 2$  
$\displaystyle \delta_2(5)$ $\displaystyle =$ $\displaystyle \delta_1(4) a_{45} b_{45}(o_2) = 5 \cdot 10^{-4} \cdot 1.0 \cdot 10^{-4} = 5 \cdot 10^{-8}$  
$\displaystyle \psi_2(5)$ $\displaystyle =$ $\displaystyle 4$  

However, we should notice that the states $ S_3$ and $ S_5$ can lead to the initial state with a null transition. Thus after the second observation we can also reach $ S_1$:
$\displaystyle \delta_2(1)$ $\displaystyle =$ $\displaystyle \max( \delta_2(3) a_{31} \,,\, \delta_2(5) a_{51} )$  
  $\displaystyle =$ $\displaystyle \max( 5 \cdot 10^{-3} \cdot 0.9 \,,\, 5 \cdot 10^{-8} \cdot 1.0 )$  
  $\displaystyle =$ $\displaystyle 4.5 \cdot 10^{-3}$  
$\displaystyle \psi_2(1)$ $\displaystyle =$ $\displaystyle 3$  

The third observation

Now the possible transitions are from $ S_1$ to $ S_2$ or $ S_4$, and from $ S_3$ to $ S_4$.

$\displaystyle \delta_3(2)$ $\displaystyle =$ $\displaystyle \delta_2(1) a_{12} b_{12}(o_3) = 4.5 \cdot 10^{-3} \cdot 0.5 \cdot 10^{-3} = 2.25 \cdot 10^{-6}$  
$\displaystyle \psi_3(2)$ $\displaystyle =$ $\displaystyle 1$  
$\displaystyle \delta_3(4)$ $\displaystyle =$ $\displaystyle \max( \delta_2(1) a_{14} b_{14}(o_3) \,,\, \delta_2(3) a_{34} b_{34}(o_3) )$  
  $\displaystyle =$ $\displaystyle \max( 4.5 \cdot 10^{-3} \cdot 0.5 \cdot 10^{-2} \,,\, 5 \cdot 10^{-3} \cdot 0.1 \cdot 10^{-1} )$  
  $\displaystyle =$ $\displaystyle 5 \cdot 10^{-5}$  
$\displaystyle \psi_3(4)$ $\displaystyle =$ $\displaystyle 3$  

The fourth observation

Again, from $ S_2$ we can go only to $ S_3$ and from $ S_4$ to $ S_5$.

$\displaystyle \delta_4(3)$ $\displaystyle =$ $\displaystyle \delta_3(2) a_{23} b_{23}(o_4) = 2.25 \cdot 10^{-6} \cdot 1.0 \cdot 10^{-3} = 2.25 \cdot 10^{-9}$  
$\displaystyle \psi_4(3)$ $\displaystyle =$ $\displaystyle 2$  
$\displaystyle \delta_4(5)$ $\displaystyle =$ $\displaystyle \delta_3(4) a_{45} b_{45}(o_4) = 5 \cdot 10^{-5} \cdot 1.0 \cdot 10^{-1} = 5 \cdot 10^{-6}$  
$\displaystyle \psi_4(5)$ $\displaystyle =$ $\displaystyle 4$  

Final state

In the end we should arrive at the final state $ S_1$ via a null transition:

$\displaystyle \delta_4(1)$ $\displaystyle =$ $\displaystyle \max( \delta_4(3) a_{31} \,,\, \delta_4(5) a_{51} )$  
  $\displaystyle =$ $\displaystyle \max( 2.25 \cdot 10^{-9} \cdot 0.9 \,,\, 5 \cdot 10^{-6} \cdot 1.0 )$  
  $\displaystyle =$ $\displaystyle 5 \cdot 10^{-6}$  
$\displaystyle \psi_4(1)$ $\displaystyle =$ $\displaystyle 5$  

The calculated grid is shown in Figure 1. By following the arrows from the end to the beginning, we obtain the most probable sequence $ S_1 \to S_2 \to S_3 \to S_4 \to S_5 \to S_1$. This corresponds to the word ``jaon''.

Figure 1: The Viterbi grid after the calculations.
\begin{figure}\centering\epsfig{file=viterbi.eps,width=0.75\textwidth}
\end{figure}
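For concreteness, here is a minimal Python sketch of the same calculation. The transition and emission probabilities are the ones appearing in the steps above; the dictionary layout and the treatment of missing emission values as zeros are only one possible choice, not part of the original solution. The sketch reproduces $ \delta_4(1) = 5 \cdot 10^{-6}$ and the backtracked state sequence.

\begin{verbatim}
# transition probabilities a_ij read from the calculations above
a = {(1, 2): 0.5, (1, 4): 0.5, (2, 3): 1.0, (3, 4): 0.1,
     (4, 5): 1.0, (3, 1): 0.9, (5, 1): 1.0}
null = {(3, 1), (5, 1)}                      # null transitions back to S1

# emission probabilities b_ij(o_t) for the four observations o_1..o_4;
# values not used above are treated as zero (left out of the table)
b = {1: {(1, 2): 1e-1, (1, 4): 1e-3},
     2: {(2, 3): 1e-1, (4, 5): 1e-4},
     3: {(1, 2): 1e-3, (1, 4): 1e-2, (3, 4): 1e-1},
     4: {(2, 3): 1e-3, (4, 5): 1e-1}}

delta = {1: 1.0}                             # start in S1 with probability 1
psi = []                                     # back-pointers, one dict per step
for t in (1, 2, 3, 4):
    new, back = {}, {}
    for (i, j), aij in a.items():            # emitting transitions use o_t
        if (i, j) in null or i not in delta or (i, j) not in b[t]:
            continue
        p = delta[i] * aij * b[t][(i, j)]
        if p > new.get(j, 0.0):
            new[j], back[j] = p, i
    for (i, j), aij in a.items():            # null transitions emit nothing
        if (i, j) in null and i in new:
            p = new[i] * aij
            if p > new.get(j, 0.0):
                new[j], back[j] = p, i
    delta = new
    psi.append(back)

print(delta[1])                              # 5e-06 (up to float rounding)

state, path = 1, [1]                         # backtrack from the final S1
for back in reversed(psi):
    if (back.get(state), state) in null:     # undo a null hop within the step
        state = back[state]
        path.append(state)
    state = back[state]                      # undo the emitting transition
    path.append(state)
print(path[::-1])                            # [1, 2, 3, 4, 5, 1] -> "jaon"
\end{verbatim}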

b)
In this case we must take into account the probabilities given by the language model. The probability values $ \delta$ are now conditioned on the choice of the word $ w_j$: $ \delta_t(i,w_j)$. The probability of the word is added to the calculation at the point where the word is selected. When we arrive at the initial state again, the selections determine which of the bigram probabilities is used. After that, they can be forgotten, as the language model does not use longer contexts.
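Written out, the recursion inside a word is the same as in part a), and when a new word $ w$ is started from $ S_1$, the most probable previous word $ w'$ is chosen (at the first observation the unigram probability $ P(w)$ is used instead, and the null transitions to $ S_1$ are handled as before):

$\displaystyle \delta_t(j, w)$ $\displaystyle =$ $\displaystyle \max_i \, \delta_{t-1}(i, w) \, a_{ij} \, b_{ij}(o_t)$  
$\displaystyle \delta_t(j, w)$ $\displaystyle =$ $\displaystyle \max_{w'} \left( P(w \vert w') \, \delta_{t-1}(1, w') \right) a_{1j} \, b_{1j}(o_t) \qquad \textrm{(new word started from } S_1\textrm{)}$  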

Let's initialize the grid as before. We do not select the word yet.

$\displaystyle \delta_0(1, \textrm{\_})$ $\displaystyle =$ $\displaystyle 1$  

The first observation

The initial state leads to $ S_2$ and $ S_4$. The second state can start either the word ``ja'' or ``jaon'', so both must be taken into account.

$\displaystyle \delta_1(2, \textrm{ja})$ $\displaystyle =$ $\displaystyle P(\textrm{ja}) a_{12} b_{12}(o_1) = 10^{-2} \cdot 0.5 \cdot 10^{-1} = 5 \cdot 10^{-4}$  
$\displaystyle \psi_1(2, \textrm{ja})$ $\displaystyle =$ $\displaystyle 1$  
$\displaystyle \delta_1(2, \textrm{jaon})$ $\displaystyle =$ $\displaystyle P(\textrm{jaon}) a_{12} b_{12}(o_1) = 10^{-5} \cdot 0.5 \cdot 10^{-1} = 5 \cdot 10^{-7}$  
$\displaystyle \psi_1(2, \textrm{jaon})$ $\displaystyle =$ $\displaystyle 1$  
$\displaystyle \delta_1(4, \textrm{on})$ $\displaystyle =$ $\displaystyle P(\textrm{on}) a_{14} b_{14}(o_1) = 10^{-2} \cdot 0.5 \cdot 10^{-3} = 5 \cdot 10^{-6}$  
$\displaystyle \psi_1(4, \textrm{on})$ $\displaystyle =$ $\displaystyle 1$  

The second observation

The second state leads only to the third state and the fourth state to the fifth state. In addition, the first state can be reached with a null transition. This is of course possible only for the words that end at this point.

$\displaystyle \delta_2(3, \textrm{ja})$ $\displaystyle =$ $\displaystyle \delta_1(2, \textrm{ja}) a_{23} b_{23}(o_2) = 5 \cdot 10^{-4} \cdot 1.0 \cdot 10^{-1} = 5 \cdot 10^{-5}$  
$\displaystyle \psi_2(3, \textrm{ja})$ $\displaystyle =$ $\displaystyle 2$  
$\displaystyle \delta_2(3, \textrm{jaon})$ $\displaystyle =$ $\displaystyle \delta_1(2, \textrm{jaon}) a_{23} b_{23}(o_2) = 5 \cdot 10^{-7} \cdot 1.0 \cdot 10^{-1} = 5 \cdot 10^{-8}$  
$\displaystyle \psi_2(3, \textrm{jaon})$ $\displaystyle =$ $\displaystyle 2$  
$\displaystyle \delta_2(5, \textrm{on})$ $\displaystyle =$ $\displaystyle \delta_1(4, \textrm{on}) a_{45} b_{45}(o_2) = 5 \cdot 10^{-6} \cdot 1.0 \cdot 10^{-4} = 5 \cdot 10^{-10}$  
$\displaystyle \psi_2(5, \textrm{on})$ $\displaystyle =$ $\displaystyle 4$  
       
$\displaystyle \delta_2(1, \textrm{ja})$ $\displaystyle =$ $\displaystyle \delta_2(3, \textrm{ja}) a_{31} = 5 \cdot 10^{-5} \cdot 0.9 = 4.5 \cdot 10^{-5}$  
$\displaystyle \psi_2(1, \textrm{ja})$ $\displaystyle =$ $\displaystyle 3$  
$\displaystyle \delta_2(1, \textrm{on})$ $\displaystyle =$ $\displaystyle \delta_2(5, \textrm{on}) a_{51} = 5 \cdot 10^{-10} \cdot 1.0 = 5 \cdot 10^{-10}$  
$\displaystyle \psi_2(1, \textrm{on})$ $\displaystyle =$ $\displaystyle 5$  

The third observation

Possible transitions are from $ S_1$ to $ S_2$ or $ S_4$, and from $ S_3$ to $ S_4$. The transitions from $ S_1$ start new words, so the probabilities from the language model are taken into account. In addition, as we had two possible words in state $ S_1$, we can now select the more probable one.

$\displaystyle \delta_3(2, \textrm{ja})$ $\displaystyle =$ $\displaystyle \max( P(\textrm{ja} \vert \textrm{ja}) \delta_2(1, \textrm{ja}) \,,\, P(\textrm{ja} \vert \textrm{on}) \delta_2(1, \textrm{on}) ) \cdot a_{12} b_{12}(o_3)$  
  $\displaystyle =$ $\displaystyle \max( 10^{-4} \cdot 4.5 \cdot 10^{-5} \,,\, 10^{-2} \cdot 5 \cdot 10^{-10} ) \cdot 0.5 \cdot 10^{-1}$  
  $\displaystyle =$ $\displaystyle 2.25 \cdot 10^{-10}$  
$\displaystyle \psi_3(2, \textrm{ja})$ $\displaystyle =$ $\displaystyle 1$  
$\displaystyle \delta_3(4, \textrm{on})$ $\displaystyle =$ $\displaystyle \max( P(\textrm{on} \vert \textrm{ja}) \delta_2(1, \textrm{ja}) \,,\, P(\textrm{on} \vert \textrm{on}) \delta_2(1, \textrm{on}) ) \cdot a_{14} b_{14}(o_3)$  
  $\displaystyle =$ $\displaystyle \max( 10^{-2} \cdot 4.5 \cdot 10^{-5} \,,\, 10^{-4} \cdot 5 \cdot 10^{-10} ) \cdot 0.5 \cdot 10^{-2}$  
  $\displaystyle =$ $\displaystyle 2.25 \cdot 10^{-9}$  
$\displaystyle \psi_3(4, \textrm{on})$ $\displaystyle =$ $\displaystyle 1$  
$\displaystyle \delta_3(4, \textrm{jaon})$ $\displaystyle =$ $\displaystyle \delta_2(3, \textrm{jaon}) a_{34} b_{34}(o_3)$  
  $\displaystyle =$ $\displaystyle 5 \cdot 10^{-8} \cdot 0.1 \cdot 10^{-1} = 5 \cdot 10^{-10}$  
$\displaystyle \psi_3(4, \textrm{jaon})$ $\displaystyle =$ $\displaystyle 3$  

The fourth observation

From the second state we can go only to the third state, and from the fourth state only to the fifth. The first state can again be reached with a null transition.

$\displaystyle \delta_4(3, \textrm{ja})$ $\displaystyle =$ $\displaystyle \delta_3(2, \textrm{ja}) a_{23} b_{23}(o_4) = 2.25 \cdot 10^{-10} \cdot 1.0 \cdot 10^{-3} = 2.25 \cdot 10^{-13}$  
$\displaystyle \psi_4(3, \textrm{ja})$ $\displaystyle =$ $\displaystyle 2$  
$\displaystyle \delta_4(5, \textrm{on})$ $\displaystyle =$ $\displaystyle \delta_3(4, \textrm{on}) a_{45} b_{45}(o_4) = 2.25 \cdot 10^{-9} \cdot 1.0 \cdot 10^{-1} = 2.25 \cdot 10^{-10}$  
$\displaystyle \psi_4(5, \textrm{on})$ $\displaystyle =$ $\displaystyle 4$  
$\displaystyle \delta_4(5, \textrm{jaon})$ $\displaystyle =$ $\displaystyle \delta_3(4, \textrm{jaon}) a_{45} b_{45}(o_4) = 5 \cdot 10^{-10} \cdot 1.0 \cdot 10^{-1} = 5 \cdot 10^{-11}$  
$\displaystyle \psi_4(5, \textrm{jaon})$ $\displaystyle =$ $\displaystyle 4$  
       
$\displaystyle \delta_4(1, \textrm{ja})$ $\displaystyle =$ $\displaystyle \delta_4(3, \textrm{ja}) a_{31} = 0.9 \cdot 2.25 \cdot 10^{-13} = 2.025 \cdot 10^{-13}$  
$\displaystyle \psi_4(1, \textrm{ja})$ $\displaystyle =$ $\displaystyle 3$  
$\displaystyle \delta_4(1, \textrm{on})$ $\displaystyle =$ $\displaystyle \delta_4(5, \textrm{on}) a_{51} = 1.0 \cdot 2.25 \cdot 10^{-10} = 2.25 \cdot 10^{-10}$  
$\displaystyle \psi_4(1, \textrm{on})$ $\displaystyle =$ $\displaystyle 5$  
$\displaystyle \delta_4(1, \textrm{jaon})$ $\displaystyle =$ $\displaystyle \delta_4(5, \textrm{jaon}) a_{51} = 1.0 \cdot 5 \cdot 10^{-11} = 5 \cdot 10^{-11}$  
$\displaystyle \psi_4(1, \textrm{jaon})$ $\displaystyle =$ $\displaystyle 5$  

The grid after the final step is in Figure 2. The different word choices are drawn with different arrows. The most probable of the three paths that have led to the final state is $ \delta_4(1, \textrm{on})$. When we follow the arrows backwards in time, we get the most probable sequence $ S_1 \to S_2 \to S_3 \to S_1 \to S_4 \to S_5 \to S_1$. This corresponds to the two-word sequence ``ja on''.

Figure 2: The grid after reaching the final state.
\begin{figure}\centering\epsfig{file=viterbilm.eps,width=0.8\textwidth}
\end{figure}
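As a small sanity check, the scores of the two competing hypotheses can also be multiplied out directly from the probabilities used above. The following Python lines sketch that calculation; all numbers are the ones given in the solution.

\begin{verbatim}
# acoustic score of the single-word path S1->S2->S3->S4->S5->S1 ("jaon")
ac_jaon = 0.5*1e-1 * 1.0*1e-1 * 0.1*1e-1 * 1.0*1e-1 * 1.0
# acoustic scores of the path S1->S2->S3->S1->S4->S5->S1 ("ja" + "on")
ac_ja = 0.5*1e-1 * 1.0*1e-1 * 0.9       # "ja" for o1, o2, ending in null
ac_on = 0.5*1e-2 * 1.0*1e-1 * 1.0       # "on" for o3, o4

P_jaon, P_ja, P_on_given_ja = 1e-5, 1e-2, 1e-2

print(P_jaon * ac_jaon)                          # 5e-11    = delta_4(1, "jaon")
print(P_ja * ac_ja * P_on_given_ja * ac_on)      # 2.25e-10 = delta_4(1, "on")
\end{verbatim}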

2.
The models built with the units of segmentation B have about three times as many unit types as the models built from segmentation A. The tokens of A are smaller on average, and thus the evaluation data contains more of them. Because of this, the token-wise cross-entropies cannot be compared directly. For example, if the text were segmented into individual letters, the tokens would be quite easy to predict on average, even though the likelihood of the whole data would not be particularly high.

Instead of a direct comparison, we can first normalize the entropies so that they are per word. The cross-entropy of test data $ D$ can be calculated as

$\displaystyle H_M(D) = -\frac{1}{n} \sum_{i=1}^n \log P_M(D_i) = -\frac{1}{n} \log P_M(D).$ (1)

If we divide the negative log-likelihood of the data, $ -\log P_M(D)$, by the number of words in the data, $ W_D$, instead of the number of tokens, $ n$, we get the normalized, word-based entropy:

$\displaystyle H_M^{W}(D) = -\frac{1}{W_D} \log P_M(D).$ (2)

As we know $ H_M(D)$, $ n$ and $ W_D$, we can calculate the normalized entropy as follows:

$\displaystyle H_M^{W}(D) = \frac{n}{W_D} H_M(D).$ (3)

Let's convert the given entropies to word-based estimates:

$\displaystyle H_{A1}^{W}(D)$ $\displaystyle =$ $\displaystyle \frac{344\,960}{100\,000} \cdot 4.54 = 15.66$  
$\displaystyle H_{A2}^{W}(D)$ $\displaystyle =$ $\displaystyle \frac{344\,960}{100\,000} \cdot 4.39 = 15.14$  
$\displaystyle H_{A3}^{W}(D)$ $\displaystyle =$ $\displaystyle \frac{344\,960}{100\,000} \cdot 4.31 = 14.87$  
$\displaystyle H_{B1}^{W}(D)$ $\displaystyle =$ $\displaystyle \frac{301\,271}{100\,000} \cdot 5.19 = 15.64$  
$\displaystyle H_{B2}^{W}(D)$ $\displaystyle =$ $\displaystyle \frac{301\,271}{100\,000} \cdot 5.02 = 15.12$  
$\displaystyle H_{B3}^{W}(D)$ $\displaystyle =$ $\displaystyle \frac{301\,271}{100\,000} \cdot 4.93 = 14.85$  
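The same conversion can be written as a few lines of Python; this is only a minimal sketch using the token counts and per-token entropies given above.

\begin{verbatim}
# word-normalized cross-entropies as in Eq. (3)
n_tokens = {"A": 344960, "B": 301271}        # tokens in the evaluation data
words = 100000                               # words in the evaluation data
H_token = {"A1": 4.54, "A2": 4.39, "A3": 4.31,
           "B1": 5.19, "B2": 5.02, "B3": 4.93}

for model, h in H_token.items():
    seg = model[0]                           # segmentation A or B
    print(model, round(n_tokens[seg] / words * h, 2))
\end{verbatim}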

It seems that the entropies with segmentation B are somewhat better for all model sizes. However, as the differences are small and the B models are larger, the exact model sizes must be taken into account. The comparison is easy if we plot the results in size-entropy coordinates; see Figure 3.

Figure 3: Normalized cross-entropies.
\begin{figure}\centering\epsfig{file=entropy.eps,width=0.5\textwidth}
\end{figure}

The line that connects the points of segmentation A is nearer to the lower left corner than the line connecting the points of B, which means better accuracy for models of the same size.

Next we will take a look at the recognition results. The error rates have been calculated per word, so there is no need for normalization. The word error rates (WER) are plotted against model sizes in Figure 4. We see that the results are mixed for the small and large models: segmentation A works better for the small models, but B seems to outperform it once the size grows over 900 000 n-grams.

Figure 4: Word error rates.
\begin{figure}\centering\epsfig{file=wer.eps,width=0.5\textwidth}
\end{figure}

It seems quite clear that the models based on segmentation A are better than those based on B if the model size is small. For larger models, the results are very close. In addition, the performance is not known for models smaller than half a million or larger than one million n-grams. To get more reliable results, we would need more measurement points and a test of the statistical significance of the differences (e.g. the Wilcoxon signed-rank test).
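As an illustration, such a significance test could be run for instance as follows. This is only a hypothetical sketch: the error counts below are made-up placeholders, and in practice per-utterance error counts from the actual test set would be needed.

\begin{verbatim}
from scipy.stats import wilcoxon

# word errors per test utterance (placeholder values, not real results)
errors_A = [3, 1, 0, 2, 4, 1, 0, 2]
errors_B = [2, 1, 1, 1, 3, 2, 0, 1]

stat, p_value = wilcoxon(errors_A, errors_B)
print(p_value)   # a small p-value would indicate a significant difference
\end{verbatim}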



svirpioj[a]cis.hut.fi