T-61.5020 Statistical Natural Language Processing
Exercises 1 -- Basics of probability calculus
Version 1.0
Consider a stemming program for Finnish. By using the context, it can deduce whether the stem of the word-form ``siitä'' is ``se'' (it) or ``siittää'' (to conceive). For a word-form derived from ``se'', the program determines the right stem with probability 0.95. The same holds for the word-forms of ``siittää''. Because the stem ``se'' is much more common, only one in a thousand occurrences of ``siitä'' should be traced to the stem ``siittää''.
The program reports that for an observed word-form ``siitä'', the corresponding stem is ``siittää''. What is the probability that the program is correct?
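Once Bayes' rule is written down, the arithmetic is easy to check numerically. A minimal Python sketch (the variable names are mine; the probabilities are read off from the exercise text):

```python
# Bayes' rule for P(stem = "siittää" | program reports "siittää").
p_stem_siittaa = 0.001            # prior: 1 in 1000 "siitä" come from "siittää"
p_stem_se = 1.0 - p_stem_siittaa  # prior for the stem "se"

p_report_given_siittaa = 0.95     # program is right about "siittää" forms
p_report_given_se = 0.05          # ...and wrong about "se" forms at rate 0.05

evidence = (p_report_given_siittaa * p_stem_siittaa
            + p_report_given_se * p_stem_se)
posterior = p_report_given_siittaa * p_stem_siittaa / evidence

print(f"P(correct | report) = {posterior:.4f}")  # about 0.019, i.e. under 2 %
```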
So-called Zipf's law is often referred to when simple statistics are calculated from a language. Let the words be tabulated so that the most common word comes first ($r = 1$), and the rest follow in the order of frequency ($r = 2, 3, \dots$). Next to each word is written how many times it occurred in the text ($f$). Zipf's law asserts that

$$f \propto \frac{1}{r},$$

i.e., the product $f \cdot r$ stays approximately constant across ranks.
Does this apply to a randomly generated language that has 30 letters, including the word boundary?
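For intuition, the question can be probed empirically by simulating such a random ``monkey-typing'' language. A sketch, assuming uniform symbol probabilities and an arbitrary text length (both assumptions are mine, not fixed by the exercise):

```python
import random
import string
from collections import Counter

# Monkey-typing language: each character is drawn uniformly from
# 29 letters plus the word boundary, 30 symbols in total.
random.seed(0)
letters = list(string.ascii_lowercase) + ["å", "ä", "ö"]  # 29 letters
symbols = letters + [" "]                                 # " " = word boundary

text = "".join(random.choices(symbols, k=1_000_000))
words = [w for w in text.split(" ") if w]

# Rank-frequency table: if Zipf's law held exactly, f * r would be
# roughly constant across ranks.
for r, (word, f) in enumerate(Counter(words).most_common(10), start=1):
    print(f"rank {r:2d}  word {word!r:6}  freq {f:5d}  f*r = {f * r}")
```

Note that with uniform symbols all words of the same length are equally probable, so the observed rank-frequency curve is a step function; whether it still approximates Zipf's law is exactly what the exercise asks you to judge.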
Ideal MDL is of little practical use, since it has been shown that it is impossible to design an algorithm that finds the shortest possible Turing machine. Therefore several variations exist, of which one quite common is the so-called two-part coding scheme.
In the two-part coding scheme we first select a model class that can describe some data given the model parameters $\theta$. The goal is to describe and send a set of data, $X = (x_1, x_2, \dots, x_n)$, that is assumed to be generated by a model of the chosen class, using the minimum number of bits possible. As the receiver does not know the parameters that we choose, they must be sent as well. Denote by $L(\theta)$ the description length needed to encode the parameters, and by $L(X \mid \theta)$ the description length needed to encode the data when the parameters are known. Thus we need to minimize the total code length

$$L(X, \theta) = L(\theta) + L(X \mid \theta).$$
In statistical modeling, the coding of the data given the model corresponds to the probability distribution $P(X \mid \theta)$, and the coding of the parameters to the distribution $P(\theta)$. From information theory we know that if the probability of a message is $p$, the minimum code length for the message is $-\log_2 p$ bits. Show that the optimal selection of the parameters in the two-part coding scheme equals Maximum A Posteriori estimation.
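For reference, a sketch of the intended argument, using the definitions above. With optimal codes, $L(\theta) = -\log_2 P(\theta)$ and $L(X \mid \theta) = -\log_2 P(X \mid \theta)$, so

$$\hat{\theta} = \arg\min_{\theta}\,\bigl[L(\theta) + L(X \mid \theta)\bigr] = \arg\min_{\theta}\,\bigl[-\log_2 P(\theta) - \log_2 P(X \mid \theta)\bigr] = \arg\max_{\theta}\, P(X \mid \theta)\,P(\theta).$$

By Bayes' rule $P(\theta \mid X) = P(X \mid \theta)\,P(\theta)/P(X)$, and since $P(X)$ does not depend on $\theta$, this equals $\arg\max_{\theta} P(\theta \mid X)$, i.e., the Maximum A Posteriori estimate.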