T-61.5020 Statistical Natural Language Processing
Answers 1 -- Basics of probability calculus
Version 1.0
First of the given probabilities, 
 ,
tells that is we see a word of three letters, the probability that it is
an abbreviation is 0.8, and 0.2 for something else.
Next one,
,
tells that is we see a word of three letters, the probability that it is
an abbreviation is 0.8, and 0.2 for something else.
Next one, 
 ,
tells that the probability for a random word being exactly three 
letters long is
,
tells that the probability for a random word being exactly three 
letters long is  .
.
The probability for a random word being three letter abbreviation 
is get by product of the given probabilities. First we look how 
probable it is for a word to be three letters long, then how 
probable it is to be abbreviation when being three letters:
|  | |||
|  |  | ||
|  |  | 
 ja stem ``siittää'' 
by
 ja stem ``siittää'' 
by  . The result of the recognition is
. The result of the recognition is  and correct stem
 and correct stem
 . Now we can write the given probabilities:
. Now we can write the given probabilities:
|  |  |  | |
|  |  |  | |
|  |  |  | |
|  |  |  | |
|  |  |  | |
|  |  |  | 
To answer the given question, we need the Bayes' theorem:
|  | 
|  |  |  | |
|  |  | ||
|  |  | 
To generate a one letter word, the given random language should generate two symbols, i.e. word boundary after something else. The probability for this is
|  | 
Respectively, the probability of a word of two letters is
|  | 
 words of this kind. For three letter words,
 words of this kind. For three letter words,
|  | 
 .
.
As the probability of the word is directly proportional to its
expected incidence in the test data, we can make a table similar
to the table 1.3 in the book by directly calculating probabilities.
As words of same length have equal probability, and they cannot be
sorted by frequency, we count the  value for only one word per
length. The results are presented in table 1 and drawn
to Figure 1.
 value for only one word per
length. The results are presented in table 1 and drawn
to Figure 1.
|  |  |  | 
| 15 | 1111 | 16111 | 
| 450 | 37.04 | 16648 | 
| 13064 | 1.235 | 16129 | 
| 378900 | 0.0412 | 15593 | 
| 1098800 | 0.00137 | 15073 | 
| 318660000 | 0.0000457 | 14570 | 
 remains quite
constant for a large range of
 remains quite
constant for a large range of  . Zipf's discovery may not seem
so extraordinary in this light.
. Zipf's discovery may not seem
so extraordinary in this light.
|  |  |  | |
|  |  |  | 
 .
.
Expectation value:
|  |  |  | |
|  |  | ||
|  |  | ||
|  |  | 
Variance:
|  |  |  | |
|  |  | ||
|  |  | 
Now we can use formula
|  | 
|  |  |  | 
Expectation value of the sum of independent random variables
|  |  |  | |
|  |  | ||
|  |  | ||
|  |  | ||
|  |  | ||
|  |  | 
Variance of a random variable multiplied by a constant
|  |  |  | |
|  |  | ||
|  |  | ||
|  |  | 
Variance of the sum of independent random variables
|  |  |  | |
|  |  | ||
|  | |||
|  |  | ||
|  |  | ||
|  |  | ||
|  |  | ||
|  |  | ||
|  | |||
|  |  | ||
|  |  | 
Now we have all the needed formulas. We want to calculate expectation
value for the sum  , where
, where  is the random variable corresponding
to the first throw and
 is the random variable corresponding
to the first throw and  to the second.
 to the second.
|  | 
|  | |||
|  | 
|  | 
|  | 
The expectation value and variance do not tell everything about the distribution. In figure 2 there are results for varying number of dice tosses simulated using Matlab. The shape of the distribution moves nearer to the normal (gaussian) distribution as the number of dices grow. This is why natural phenomena are often modelled using normal distribution: If many small random events affect to the result, it will be normal distributed. This is also is good excuse for transforming calculations to easier forms.
More formal proof for that the distribution will approach normal is found from http:// mathworld.wolfram.com/CentralLimitTheorem.html
The goal is to minimize the total code length
 :
:
|  | 
 and
 and 
 :
:
|  | 
|  | 
|  | 
 :
:
|  | 
 does not depend on the parameters, so we can leave 
it out. Thus we see that the same goal is obtained by maximizing the 
posterior distribution of the model:
 does not depend on the parameters, so we can leave 
it out. Thus we see that the same goal is obtained by maximizing the 
posterior distribution of the model:
|  |