T-61.5020 Statistical Natural Language Processing
Answers 2 -- Entropy and perplexity
Version 1.0

1.
a)
Let's use the definition of the entropy,

with given values:

b)
For the solution we need the probability (a random substantive is ). It can be obtained from the right margin probability of the given table. In addition we need the probability

The entropy of the source, as we know that the previous symbol was a substantive, is

For this we need to calculate the conditional entropy . For the word 'kissa',

As we place the probabilities for each word in , we get

What is the probability for a random word to be kissa''? As both categories and are of equal probability, the result is

We note that the distribution in (a) is actually a marginal distribution for the joint distribution in (b).

To conclude, when we know the behavior of the source more accurately, the produced words are less suprising and we can code them with less bits (0.9 bit < 2.5 bit).

c)
In the sentences of the described language, the first word is always a noun and the second word is always a verb. The noun does not depend on the previous words, and the verb depends only on the previous noun.

Let's denote the probabilities of the language as , and the probabilities given by the model as . We want to calculate the expected coding length of a sentence when it is coded with the model:

This measure is called cross-entropy.

For the model, the noun and the verb of a sentence are independent, so . By using that and writing the sum open first for the nouns and then for the verbs we get:

The average coding length (or cross-entropy) for a sentence is thus 5.01 bits.

Each sentence includes two words, so the average coding length for one word is 2.50 bits. The result equals to what was calculated in part (a). This is due to the fact that the distribution over which the expected value of the coding lengths is calculated is the same.

2.
a)
Each of the 30 elementary events has a probability of . Just place these into the definition of entropy:

b)
To generate a one letter word, the given random language should generate two symbols, i.e. word boundary after something else. The probability for this is

and there are 29 words of this kind.

Respectively, the probability of a word of two letters is

there are words of this kind, and so on.

Let's calculate the entropy:

The sum term has a well-known solution. Let's quickly go through it:

 (1)

Multiply both sides by .

 (2)

Subtract equation 2 from equation 1.
 (3) (4)

Multiply equation 4 by .

 (5)

Subtract equation 4 from equation 5 to obtain the solution:
 (6) (7)

(In order to make the subtractions and the multiplications, and .)

Applying this to the orginal problem, we obtain the result of

At first glance this may seem confusing: Shouldn't the result be same as in part (a)? A quick verification shows that it is indeed right: Expectation value for word length is 29, so entropy per symbol is approximately bits.

There is also another reason for the results not to be exactly same: The first source may generate successively two word boundaries, the second source can not. In consequence, the second source has a bit lower entropy.

3.
a)
Let's mark the perplexities for the models as , and .

The model always chooses one of ten different words with equal probabilities, so this is exactly what we should get.

The model selects always one out of three options, so also this result seems reasonable.

The last model chooses from 3.32 words on average.

The models 1 and 3 are comparable, as they operate with the same set of symbols. Of these two, model 3 seems to be much better.

Model 2 is not comparable to others, as it operates on a smaller set of symbols and gets a low entropy because of that. An extreme example would be a model for which every word goes to the same category. This kind of model would never be surprised'', and the perplexity would be one.

b)
Let's examine the next test sentence. For the first model,

We note that perplexity cannot be calculated, if the model gives a probability of zero for any word in the test set. Often those words are excluded. This way the result is

To report a meaningful result, the perplexity is not enough, but we should also count the words that were not recognized by the model. In this case, words were out of model's vocabulary. For the next model,

This model misses 60% of the words.

Also model 3 recognizes only the two first words.

The out-of-vocabulary (OOV) rate is 60%.

As before, model 2 is not comparable with the rest. Models 1 and 3 can be compared, as long as we take into account the out-of-vocabulary rates. Model 1 covers more vocabulary, but model 3 gives better perplexity. Creating a language model is often balancing between these properties.

To conclude, perplexity can be used to compare two language models, if the results are calculated similarly and the OOV rates are announced. When comparing results from several sources, both issues must be carefully observed to prevent wrong conclusions.

As a final conclusion of these exercises, let's list the different entropy measures:

• Entropy

Interpretation: Self-information of the source, or the average coding length that is needed to send a message with the optimal coding

• Cross-entropy

Interpretation: Average coding length needed to send the message using the model for the coding

• Relative entropy or Kullback-Leibler divergence

Interpretation: How many bits on average are lost if the message is coded with the model

• Perplexity

Interpretation: Average branching factor of the model for the data given by the source

svirpioj[a]cis.hut.fi