T-61.5020 Statistical Natural Language Processing
Answers 2 -- Entropy and perplexity
Version 1.0

Let's use the definition of the entropy,

$\displaystyle H(X)=\sum_{x\in X}p(x)\log_2\frac{1}{p(x)},$    

with given values:
$\displaystyle H(X)$ $\displaystyle =$ $\displaystyle \frac{3}{32}\log_2\frac{32}{3}+ \frac{3}{16}\log_2\frac{16}{3} + \frac{7}{32}\log_2\frac{32}{7} + \frac{1}{8}\log_2 8 + \frac{1}{8}\log_2 8 +\frac{1}{4}\log_2 4$  
  $\displaystyle =$ $\displaystyle 2.50 ~\textrm{bits}$  
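As a quick numerical check, the definition can be evaluated directly. This sketch assumes the six word probabilities are the margins of the joint distribution in part (b):

```python
import math

# Word probabilities (margins of the joint distribution in part (b)).
p = [3/32, 3/16, 7/32, 1/8, 1/8, 1/4]
assert abs(sum(p) - 1) < 1e-12

# H(X) = sum_x p(x) log2(1/p(x))
H = sum(pi * math.log2(1 / pi) for pi in p)
print(f"{H:.2f}")  # 2.50
```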

For the solution we need the probability $ P(S=s)$ (that a random noun is $ s$). It can be read from the row margins (the rightmost column) of the given table. In addition we need the probability

$\displaystyle P(V=v \vert S=s) = \frac{P(S=s,V=v)}{P(S=s)}.$    

The entropy of the source, given that the previous word was a noun, is

$\displaystyle H(X_i\vert X_{i-1}\in S)=\sum_{s\in S}P(S=s)\,H(V\vert S=s),\qquad S=\{\textrm{'kissa','tuuli','kiipeilijä'}\}.$    

For this we need to calculate the conditional entropy $ H(V\vert S=s)$. For the word 'kissa',

$\displaystyle H(V\vert S=\textrm{'kissa'})$ $\displaystyle =$ $\displaystyle \sum_{v} p(v\vert S=\textrm{'kissa'}) \log_2 \frac{1}{p(v\vert S=\textrm{'kissa'})}$  
  $\displaystyle =$ $\displaystyle \sum_{v} \frac{p(S=\textrm{'kissa'},V=v)}{P(S=\textrm{'kissa'})} \log_2 \frac{P(S=\textrm{'kissa'})}{p(S=\textrm{'kissa'},V=v)}$  
  $\displaystyle =$ $\displaystyle \frac 18 \cdot\frac {16}{3} \log_2 \Big(8 \cdot\frac{3}{16}\Big) + \frac 1{16} \cdot\frac {16}{3} \log_2\Big( 16 \cdot\frac{3}{16}\Big)$  
  $\displaystyle =$ $\displaystyle \frac23 \log_2 \frac 32 + \frac13 \log_2 3.$  

Substituting the conditional entropies and the probabilities of the words in $ S$, we get
$\displaystyle H(X_i\vert X_{i-1}\in S)$ $\displaystyle =$ $\displaystyle \frac{3}{16} (\frac 23 \log_2 \frac 32 + \frac 13 \log_2 3) +
\frac 38 (\frac 16 \log_2 6 + \frac 46 \log_2 \frac 64 +\frac16 \log_2 6)$  
    $\displaystyle +\frac 7{16} (\frac 17\log_2 7 + \frac67 \log_2 \frac76)$  
  $\displaystyle =$ $\displaystyle 0.90 ~\textrm{bits}.$  
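The conditional entropy can be reproduced from the joint table. The joint probabilities below are read off the arithmetic in the derivation above; the zero entries and the column order are assumptions of this sketch:

```python
import math

# Joint probabilities P(S=s, V=v); zeros mark pairs that never occur.
joint = {
    'kissa':      [1/8,  0,   1/16],
    'tuuli':      [1/16, 1/4, 1/16],
    'kiipeilija': [1/16, 0,   3/8],
}

H = 0.0
for s, row in joint.items():
    p_s = sum(row)                    # marginal P(S=s)
    for p_sv in row:
        if p_sv > 0:
            p_v_given_s = p_sv / p_s  # P(V=v | S=s)
            H += p_s * p_v_given_s * math.log2(1 / p_v_given_s)

print(f"{H:.2f}")  # 0.90
```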

What is the probability that a random word is ``kissa''? As the two categories $ S$ and $ V$ are equally probable, the result is

$\displaystyle P(x=\textrm{'kissa'})=P(x\in S)\, P(S=\textrm{'kissa'}) = 0.5\cdot \frac{3}{16}=\frac 3{32}.$      

We note that the distribution in (a) is actually a marginal distribution for the joint distribution in (b).

To conclude: when we know the behavior of the source more accurately, the produced words are less surprising and can be coded with fewer bits (0.90 bits < 2.50 bits).

In the sentences of the described language, the first word is always a noun and the second word is always a verb. The noun does not depend on the previous words, and the verb depends only on the previous noun.

Let's denote the probabilities of the language as $ P(S,V)$, and the probabilities given by the model as $ P_M(S,V)$. We want to calculate the expected coding length of a sentence when it is coded with the model:

$\displaystyle E(-\log P_M(S,V)) = - \sum_{s \in S, v \in V} P(S=s,V=v) \log P_M(S=s,V=v).$      

This measure is called cross-entropy.

For the model, the noun and the verb of a sentence are independent, so $ P_M(S=s,V=v) = P_M(S=s) P_M(V=v)$. By using that and expanding the sum first over the nouns and then over the verbs we get:

    $\displaystyle E(-\log P_M(S,V))$  
  $\displaystyle =$ $\displaystyle - \sum_{s \in S, v \in V}
P(S = s, V = v) \log P_M(S = s, V = v)$  
  $\displaystyle =$ $\displaystyle - \sum_{s \in S} \sum_{v \in V}
P(S = s) P(V = v \vert S = s) \log (P_M(s) P_M(v))$  
  $\displaystyle =$ $\displaystyle - P(S = \textrm{kissa}) \sum_{v \in V} P(V = v \vert S = \textrm{kissa}) \log (P_M(\textrm{kissa}) P_M(v))$  
    $\displaystyle - P(S = \textrm{tuuli}) \sum_{v \in V} P(V = v \vert S = \textrm{tuuli}) \log (P_M(\textrm{tuuli}) P_M(v))$  
    $\displaystyle - P(S = \textrm{kiipeilijä}) \sum_{v \in V} P(V = v \vert S = \textrm{kiipeilijä}) \log (P_M(\textrm{kiipeilijä}) P_M(v))$  
  $\displaystyle =$ $\displaystyle - \frac{3}{16} \cdot \big[\frac{1}{8}\frac{16}{3} \log(\frac{3}{32}\cdot\frac{1}{8}) + \frac{1}{16}\frac{16}{3} \log(\frac{3}{32}\cdot\frac{1}{4}) \big]$  
    $\displaystyle - \frac{3}{8} \cdot \big[\frac{1}{16}\frac{8}{3} \log(\frac{3}{16}\cdot\frac{1}{8}) + \frac{1}{4}\frac{8}{3} \log(\frac{3}{16}\cdot\frac{1}{8}) + \frac{1}{16}\frac{8}{3} \log(\frac{3}{16}\cdot\frac{1}{4}) \big]$  
    $\displaystyle - \frac{7}{16} \cdot \big[\frac{1}{16}\frac{16}{7} \log(\frac{7}{32}\cdot\frac{1}{8}) + \frac{3}{8}\frac{16}{7} \log(\frac{7}{32}\cdot\frac{1}{4}) \big]$  
  $\displaystyle =$ $\displaystyle 5.01$  

The average coding length (or cross-entropy) for a sentence is thus 5.01 bits.

Each sentence includes two words, so the average coding length for one word is 2.50 bits. The result equals what was calculated in part (a). This is because the expectation of the coding length is taken over the same distribution in both cases.
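The cross-entropy sum can be checked numerically. The verb labels v1–v3 are hypothetical placeholders, and the joint and model probabilities are the values appearing in the derivation above:

```python
import math

# True joint P(S,V) and the model's independent marginals P_M(S), P_M(V).
joint = {
    ('kissa', 'v1'): 1/8,  ('kissa', 'v3'): 1/16,
    ('tuuli', 'v1'): 1/16, ('tuuli', 'v2'): 1/4, ('tuuli', 'v3'): 1/16,
    ('kiipeilija', 'v2'): 1/16, ('kiipeilija', 'v3'): 3/8,
}
p_m_s = {'kissa': 3/32, 'tuuli': 3/16, 'kiipeilija': 7/32}
p_m_v = {'v1': 1/8, 'v2': 1/8, 'v3': 1/4}

# E[-log2 P_M(S,V)] = -sum_{s,v} P(s,v) log2(P_M(s) P_M(v))
H_cross = -sum(p * math.log2(p_m_s[s] * p_m_v[v])
               for (s, v), p in joint.items())
print(f"{H_cross:.2f}")      # cross-entropy per sentence: 5.01
print(f"{H_cross / 2:.2f}")  # per word: 2.50
```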

Each of the 30 elementary events has a probability of $ \frac{1}{30}$. Substituting these into the definition of entropy:
$\displaystyle H(X)$ $\displaystyle =$ $\displaystyle \sum_{x\in X}p(x)\log_2\frac{1}{p(x)}$  
  $\displaystyle =$ $\displaystyle \sum_{i=1}^{30}\frac{1}{30}\log_2(30)$  
  $\displaystyle =$ $\displaystyle \log_2(30) \approx 4.91 ~ \textrm{bits}$  

To generate a one-letter word, the source must generate two symbols: one non-boundary symbol followed by the word boundary. The probability of any particular such word is

$\displaystyle P(s=t_1)=\frac{1}{30}\cdot \frac{1}{30}$    

and there are 29 words of this kind.

Correspondingly, the probability of a particular two-letter word is

$\displaystyle P(s=t_1t_2)=\frac{1}{30} \cdot \frac{1}{30} \cdot \frac{1}{30},$    

there are $ 29^2$ words of this kind, and so on.

Let's calculate the entropy:

$\displaystyle H(X)$ $\displaystyle =$ $\displaystyle \sum_{x\in X}p(x)\log_2\frac{1}{p(x)}$  
  $\displaystyle =$ $\displaystyle 29 \times (\frac{1}{30})^2 \log_2(30^2) + 29^2 \times (\frac{1}{30})^3 \log_2(30^3) + 29^3 \times (\frac{1}{30})^4 \log_2(30^4) + \dots$  
  $\displaystyle =$ $\displaystyle \frac{1}{29} \left( \left(\frac{29}{30}\right)^2 \cdot 2 \cdot \log_2(30) + \left(\frac{29}{30}\right)^3 \cdot 3 \cdot \log_2(30) + \left(\frac{29}{30}\right)^4 \cdot 4 \cdot \log_2(30) + \dots \right)$  
  $\displaystyle =$ $\displaystyle \frac{\log_2(30)}{29} \left( -\frac{29}{30} + \sum_{i=0}^\infty i\cdot \left( \frac{29}{30} \right)^i \right)$  

The sum term has a well-known closed form. Let's derive it quickly:

$\displaystyle \sum_{i=0}^\infty i q^i = q + 2q^2 + 3q^3 + 4q^4 + \dots$ (1)

Multiply both sides by $ q$.

$\displaystyle q\sum_{i=0}^\infty i q^i = q^2 + 2q^3 + 3q^4 + 4q^5 + \dots$ (2)

Subtract equation 2 from equation 1.
$\displaystyle (1-q)\sum_{i=0}^\infty i q^i$ $\displaystyle =$ $\displaystyle q + q^2 + q^3 + q^4 + \dots$ (3)
$\displaystyle \sum_{i=0}^\infty i q^i$ $\displaystyle =$ $\displaystyle \frac{q + q^2 + q^3 + q^4 + \dots}{1-q}$ (4)

Multiply equation 4 by $ q$.

$\displaystyle q\sum_{i=0}^\infty i q^i = \frac{q^2 + q^3 + q^4 + q^5 + \dots}{1-q}$ (5)

Subtract equation 5 from equation 4 to obtain the solution:
$\displaystyle (1-q)\sum_{i=0}^\infty i q^i$ $\displaystyle =$ $\displaystyle \frac{q}{1-q}$ (6)
$\displaystyle \sum_{i=0}^\infty i q^i$ $\displaystyle =$ $\displaystyle \frac{q}{(1-q)^2}$ (7)

(For these manipulations to be valid, the series must converge, which requires $ \vert q\vert<1$.)
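The closed form can be verified numerically for a few values of $ q$, including the $ q=\frac{29}{30}$ relevant to this problem:

```python
# Numerical check of  sum_{i>=0} i q^i = q / (1-q)^2  for |q| < 1.
def series(q, terms=10000):
    return sum(i * q**i for i in range(terms))

for q in (0.1, 0.5, 29/30):
    closed = q / (1 - q)**2
    assert abs(series(q) - closed) < 1e-6, q
print("identity holds")
```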

Applying this to the original problem with $ q=\frac{29}{30}$, we obtain the result of
$\displaystyle H(X)$ $\displaystyle =$ $\displaystyle \frac {\log_2(30)}{29}\left(-\frac{29}{30} + \frac{29/30}{(1-29/30)^2}\right)$  
  $\displaystyle =$ $\displaystyle \log_2(30)\left(30-\frac{1}{30}\right)$  
  $\displaystyle =$ $\displaystyle 147 ~\textrm{bits}$  

At first glance this may seem confusing: shouldn't the result be the same as in part (a)? A quick verification shows that it is indeed reasonable: the expected word length is 29 letters, so the entropy per symbol (including the word boundary) is approximately $ 147/(29+1)=4.90$ bits.

There is also another reason why the results are not exactly the same: the first source may generate two successive word boundaries, while the second source cannot. Consequently, the second source has a slightly lower entropy.

Let's denote the perplexities given by the models by $ Perp_1$, $ Perp_2$ and $ Perp_3$.
    $\displaystyle Perp_1(\textrm{'kissa'},\textrm{'menee'},\textrm{'puuhun'})$  
    $\displaystyle =P_1(\textrm{word$_1$='kissa'},\textrm{word$_2$='menee'},\textrm{word$_3$='puuhun'})^{-\frac13}$  
    $\displaystyle =\left(P_1(\textrm{word='kissa'})\,P_1(\textrm{word='menee'})\,P_1(\textrm{word='puuhun'})\right)^{-\frac13}$  
    $\displaystyle =(0.1\cdot 0.1 \cdot 0.1)^{-\frac 13} = 10$  

The model always chooses one of ten different words with equal probabilities, so this is exactly what we should get.
    $\displaystyle Perp_2(\textrm{'kissa'},\textrm{'menee'},\textrm{'puuhun'})$  
  $\displaystyle =$ $\displaystyle P_2(\textrm{word$_1$=subject},\textrm{word$_2$=verb},\textrm{word$_3$=object})^{-\frac13}$  
  $\displaystyle =$ $\displaystyle \left(P_2(\textrm{subject})\,P_2(\textrm{verb})\,P_2(\textrm{object})\right)^{-\frac13}$  
  $\displaystyle =$ $\displaystyle (0.33\cdot 0.33 \cdot 0.33)^{-\frac 13} = 3$  

The model always selects one out of three options, so this result also seems reasonable.
    $\displaystyle Perp_3(\textrm{'kissa'},\textrm{'menee'},\textrm{'puuhun'})$  
  $\displaystyle =$ $\displaystyle P_3(\textrm{word$_1$='kissa'},\textrm{word$_2$='menee'},\textrm{word$_3$='puuhun'})^{-\frac13}$  
  $\displaystyle =$ $\displaystyle (
P_3(\textrm{word='kissa'\vert word=first})$  
    $\displaystyle \cdot P_3(\textrm{word='menee' \vert previous\_word = 'kissa'})$  
    $\displaystyle \cdot P_3(\textrm{word='puuhun' \vert previous\_word $=$\ 'menee'}) )^{-\frac13}$  
  $\displaystyle =$ $\displaystyle (0.25\cdot 0.33 \cdot 0.33)^{-\frac 13} = 3.32$  

The last model chooses from 3.32 words on average.
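All three results can be reproduced with a small helper: perplexity is the inverse geometric mean of the word probabilities. Exact thirds are used here, so model 3 comes out as 3.30 rather than the 3.32 obtained above from the rounded value 0.33:

```python
# Perplexity of a sentence: inverse geometric mean of word probabilities.
def perplexity(probs):
    prod = 1.0
    for p in probs:
        prod *= p
    return prod ** (-1 / len(probs))

print(f"{perplexity([0.1, 0.1, 0.1]):.2f}")   # model 1: 10.00
print(f"{perplexity([1/3, 1/3, 1/3]):.2f}")   # model 2: 3.00
print(f"{perplexity([0.25, 1/3, 1/3]):.2f}")  # model 3: 3.30
```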

Models 1 and 3 are comparable, as they operate on the same set of symbols. Of these two, model 3 seems to be much better.

Model 2 is not comparable to the others, as it operates on a smaller set of symbols and therefore gets a lower entropy. An extreme example would be a model that maps every word to the same category. Such a model would never be ``surprised'', and its perplexity would be one.

Let's examine the next test sentence. For the first model,
    $\displaystyle Perp_1(\textrm{'valas'},\textrm{'on'},\textrm{'kala'},\textrm{'paitsi'},\textrm{'ettei'})$  
  $\displaystyle =$ $\displaystyle (P_1(\textrm{word='valas'})P_1(\textrm{word='on'})P_1(\textrm{word='kala'})$  
    $\displaystyle \cdot P_1(\textrm{word='paitsi'})P_1(\textrm{word='ettei'}))^{-\frac15}$  
  $\displaystyle =$ $\displaystyle (0.1\cdot 0.1 \cdot 0.1 \cdot 0 \cdot 0 )^{-\frac 15}$  
  $\displaystyle =$ $\displaystyle \frac{1}{0^{\frac 15}} = \infty.$  

We note that the perplexity cannot be calculated if the model assigns zero probability to any word in the test set. Often such words are simply excluded. This way the result is
    $\displaystyle Perp_1(\textrm{'valas'},\textrm{'on'},\textrm{'kala'})$  
  $\displaystyle =$ $\displaystyle \left(P_1(\textrm{word='valas'})P_1(\textrm{word='on'})P_1(\textrm{word='kala'})\right)^{-\frac13} =10.$  

To report a meaningful result, the perplexity alone is not enough: we should also count the words that the model did not recognize. In this case, $ \frac25\cdot 100\%=40\%$ of the words were out of the model's vocabulary. For the next model,
    $\displaystyle Perp_2(\textrm{'valas'},\textrm{'on'})$  
  $\displaystyle =$ $\displaystyle \left(P_2(\textrm{word$_1$=subject})\,P_2(\textrm{word$_2$=verb})\right)^{-\frac12}$  
  $\displaystyle =$ $\displaystyle (0.33\cdot 0.33)^{-\frac 12} = 3$  

This model misses 60% of the words.

Model 3 also recognizes only the first two words.

    $\displaystyle Perp_3(\textrm{'valas'},\textrm{'on'})$  
  $\displaystyle =$ $\displaystyle ( P_3(\textrm{word='valas'\vert word=first})$  
    $\displaystyle \cdot P_3(\textrm{word='on' \vert previous\_word $=$\ 'valas'}) )^{-\frac12}$  
  $\displaystyle =$ $\displaystyle (0.25\cdot 0.33)^{-\frac 12} = 3.5$  

The out-of-vocabulary (OOV) rate is 60%.
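The practice described above — excluding unseen words but reporting how many there were — can be sketched as follows; the helper name is ours, not from the exercise:

```python
# Perplexity over in-vocabulary words only, plus the OOV rate that
# must be reported alongside it.
def perplexity_with_oov(probs):
    known = [p for p in probs if p > 0]
    oov_rate = 1 - len(known) / len(probs)
    prod = 1.0
    for p in known:
        prod *= p
    return prod ** (-1 / len(known)), oov_rate

# Model 1 on 'valas on kala paitsi ettei': two words unseen.
perp, oov = perplexity_with_oov([0.1, 0.1, 0.1, 0.0, 0.0])
print(f"{perp:.0f}, OOV {oov:.0%}")  # 10, OOV 40%
```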

As before, model 2 is not comparable with the rest. Models 1 and 3 can be compared, as long as the out-of-vocabulary rates are taken into account. Model 1 covers more vocabulary, but model 3 gives a better perplexity. Creating a language model often means balancing between these properties.

To conclude, perplexity can be used to compare two language models if the results are calculated in the same way and the OOV rates are reported. When comparing results from several sources, both issues must be checked carefully to avoid wrong conclusions.

As a final conclusion of these exercises, let's list the different entropy measures: