T-61.5020 Statistical Natural Language Processing
Answers 2 -- Entropy and perplexity
Version 1.0
The entropy of the source, given that we know the previous symbol was a noun, is
For this we need to calculate the conditional entropy. For the word 'kissa',
What is the probability for a random word to be ``kissa''? As both categories, nouns and verbs, are equally probable, the result is
To conclude, when we know the behavior of the source more accurately, the produced words are less surprising and we can code them with fewer bits (0.9 bits < 2.5 bits).
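As a quick numerical illustration, the following Python sketch computes the unconditional entropy of a word distribution and the conditional entropy given the previous word's category. The probabilities are made up for demonstration only; the exercise's own numbers were given on the problem sheet, so only the formulas carry over, not the values.

```python
import math

def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical unconditional word distribution (NOT the exercise's numbers).
p_word = {"kissa": 0.25, "koira": 0.25, "juoksee": 0.25, "istuu": 0.25}
print("H(W)           =", entropy(p_word), "bits")

# Hypothetical conditional distributions P(W_i | category of W_{i-1}).
p_prev = {"noun": 0.5, "verb": 0.5}
p_given_prev = {
    "noun": {"juoksee": 0.9, "istuu": 0.1},   # after a noun comes a verb
    "verb": {"kissa": 0.5, "koira": 0.5},     # after a verb comes a noun
}

# H(W_i | W_{i-1}) = sum over categories c of P(c) * H(W_i | c)
h_cond = sum(p_prev[c] * entropy(p_given_prev[c]) for c in p_prev)
print("H(W_i|W_{i-1}) =", h_cond, "bits")
```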
Let's denote the probabilities of the language as $p$, and the probabilities given by the model as $q$. We want to calculate the expected coding length of a sentence when it is coded with the model:

$$H(p, q) = -\sum_{s} p(s) \log_2 q(s),$$

where the sum runs over all possible sentences $s$.
For the model, the noun $n$ and the verb $v$ of a sentence are independent, so $q(n, v) = q(n)\,q(v)$. Using this and expanding the sum first over the nouns and then over the verbs, we get:

$$H(p, q) = -\sum_{n} p(n) \log_2 q(n) - \sum_{v} p(v) \log_2 q(v).$$
The sum evaluates to 5.00 bits per sentence. Each sentence consists of two words, so the average coding length for one word is 2.50 bits. The result equals what was calculated in part (a); this is because the distribution over which the expected coding length is calculated is the same.
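The same computation is easy to check with a short program. The probabilities below are placeholders rather than the exercise's values; the point is the cross-entropy formula and the factorization of the model into independent noun and verb parts.

```python
import math

# Hypothetical language probabilities p and model probabilities q
# (placeholder numbers, not those of the exercise).
p_noun = {"kissa": 0.5, "koira": 0.5}
p_verb = {"juoksee": 0.75, "istuu": 0.25}
q_noun = {"kissa": 0.6, "koira": 0.4}
q_verb = {"juoksee": 0.5, "istuu": 0.5}

# Expected coding length of a sentence, H(p, q) = -sum_s p(s) log2 q(s),
# where the model factorizes as q(n, v) = q(n) * q(v).  For simplicity the
# language is assumed to factorize here as well.
h_sentence = 0.0
for n, pn in p_noun.items():
    for v, pv in p_verb.items():
        p_s = pn * pv
        q_s = q_noun[n] * q_verb[v]
        h_sentence -= p_s * math.log2(q_s)

print("expected bits per sentence:", h_sentence)
print("expected bits per word:    ", h_sentence / 2)
```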
Correspondingly, the probability of a two-letter word is
Let's calculate the entropy:
Applying this to the original problem, we obtain the result of
There is also another reason why the results are not exactly the same: the first source may generate two word boundaries in succession, whereas the second source cannot. Consequently, the second source has a slightly lower entropy.
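As an illustration of this kind of calculation, the sketch below computes the per-word entropy of a toy source in which word lengths are geometrically distributed over an alphabet of equiprobable letters, once by a truncated direct sum and once by the closed form obtained from the series $\sum_k k x^{k-1} = 1/(1-x)^2$. The alphabet size and boundary probability are assumptions for illustration, not the exercise's values.

```python
import math

# Assumed toy parameters (not the exercise's): N equiprobable letters, and
# after each letter the word ends with probability q, so word lengths are
# geometrically distributed.
N = 26
q = 0.2

# Direct (truncated) sum over word lengths: there are N**k words of length k,
# each with probability (1-q)**(k-1) * q / N**k.
h_direct = 0.0
for k in range(1, 2000):
    p_len = (1 - q) ** (k - 1) * q
    h_direct += p_len * (k * math.log2(N) - math.log2(p_len))

# Closed form: H(word) = H(length) + E[length] * log2(N), with E[length] = 1/q.
h_length = (-q * math.log2(q) - (1 - q) * math.log2(1 - q)) / q
h_closed = h_length + (1 / q) * math.log2(N)

print("truncated sum:", h_direct)
print("closed form  :", h_closed)
```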
Models 1 and 3 are comparable, as they operate on the same set of symbols. Of these two, model 3 seems to be much better.
Model 2 is not comparable to the others, as it operates on a smaller set of symbols and gets a low entropy simply because of that. An extreme example would be a model that maps every word to the same category. Such a model would never be ``surprised'', and its perplexity would be one.
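In terms of perplexity this is easy to see: perplexity is $2^{H}$, where $H$ is the average number of bits per symbol, so a model that assigns probability one to every symbol it sees is never surprised and has perplexity exactly one. A small sketch with made-up probabilities:

```python
import math

def perplexity(probs):
    """Perplexity 2**H computed from the probabilities a model assigns
    to the symbols of a test sequence."""
    h = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** h

# Made-up per-symbol probabilities assigned to the same test sequence.
fine_grained = [0.5, 0.25, 0.125, 0.25]   # model distinguishing many symbols
one_category = [1.0, 1.0, 1.0, 1.0]       # every word mapped to one category

print("fine-grained model:", perplexity(fine_grained))   # 4.0
print("single category   :", perplexity(one_category))   # 1.0
```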
Model 3 also recognizes only the first two words.
As before, model 2 is not comparable with the rest. Models 1 and 3 can be compared, as long as we take the out-of-vocabulary rates into account. Model 1 covers more of the vocabulary, but model 3 gives a better perplexity. Creating a language model often means balancing between these properties.
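A possible way to carry out such a comparison in practice is sketched below: out-of-vocabulary words are skipped when the perplexity is computed and reported separately, so a model has to be judged by both numbers together. The models and test words are invented for illustration.

```python
import math

def evaluate(model, test_words):
    """Perplexity over in-vocabulary words, plus the out-of-vocabulary rate."""
    log_sum, n_in_vocab = 0.0, 0
    for w in test_words:
        if w in model:
            log_sum += math.log2(model[w])
            n_in_vocab += 1
    oov_rate = 1 - n_in_vocab / len(test_words)
    ppl = 2 ** (-log_sum / n_in_vocab) if n_in_vocab else float("inf")
    return ppl, oov_rate

# Invented unigram models and test data (not the exercise's).
model1 = {"kissa": 0.2, "koira": 0.2, "juoksee": 0.3, "istuu": 0.3}  # wide vocabulary
model3 = {"kissa": 0.5, "koira": 0.5}                                # narrow but sharper
test = ["kissa", "juoksee", "koira", "istuu"]

for name, model in [("model 1", model1), ("model 3", model3)]:
    ppl, oov = evaluate(model, test)
    print(f"{name}: perplexity {ppl:.2f}, OOV rate {oov:.0%}")
```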
As a final conclusion of these exercises, let's list the different entropy measures: