Exercises 8 -- N-gram language models

Version 1.0

- 1.
- This task is recommended to be done without looking ahead, so that only the information given so far affects your estimates.

a) The task is to estimate the probability of the word that follows the words ``tuntumaan jo'' (*feel already*). The possible followers are the words:

- ja [*and*]
- hyvältä [*good*]
- kumisaapas [*rubber boot*]
- keväältä [*(like) spring (season)*]
- ilman [*without*]
- päihtyneeltä [*drunk*]
- turhalta [*vain*]
- koirineen [*with (his) dogs*]
- öljyiseltä [*oily*]
- Turku [a city in Finland]

Give a probability value for each word so that the values sum up to one. Compare your own estimates to ones calculated directly from a text corpus.
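A minimal sketch of how such corpus estimates could be computed, assuming a hypothetical plain-text corpus file `corpus.txt` with whitespace-separated tokens:

```python
from collections import Counter

# Count the words that follow the context "tuntumaan jo" in a corpus
# and normalize the counts into a probability distribution.
context = ("tuntumaan", "jo")
counts = Counter()

with open("corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    for line in f:
        tokens = line.lower().split()
        for i in range(len(tokens) - 2):
            if (tokens[i], tokens[i + 1]) == context:
                counts[tokens[i + 2]] += 1

total = sum(counts.values())
for word, count in counts.most_common():
    print(f"P({word} | tuntumaan jo) = {count / total:.4f}")
```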

b) Now you know the full beginning of the sentence, which is ``Leuto sää ja soidinmenonsa aloittaneet tiaiset ovat saaneet helmikuun tuntumaan jo'' (free translation: *``Mild weather and the titmice that have started their displays have made February feel already''*). Estimate the same probabilities using this full context.

c) What kind of knowledge would a language model need in order to match a human in case b)?

(The word used in the original sentence is given below.)

- ja [*and*]
- 2.
- A language model has a vocabulary of 64000 words in base forms. We know that the word history is either (1) ``vuosi joka olla'' (*year that be*) or (2) ``tämä tehtävä vaikuttaa'' (*this task appear*). Estimate the probabilities for the next word being ``olla'', ``leuto'', or ``gorilla''. Estimate both unigram and bigram probabilities using
- a) ...maximum likelihood estimates
- b) ...ML estimates with Laplace smoothing
- c) ...ML estimates with Lidstone (additive) smoothing with parameter λ.

The data needed for the estimates can be found at
`http://www.cis.hut.fi/Opinnot/T-61.5020/Exercises08/extra/ex8-2_data.txt`; a sketch of the three estimators is given below.
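A sketch of the three estimators, with the counts as placeholder arguments (take the real values from the data file above; `lam` stands for the Lidstone parameter λ):

```python
def ml_estimate(c_hw, c_h):
    """Maximum likelihood estimate P(w | h) = c(h, w) / c(h)."""
    return c_hw / c_h if c_h > 0 else 0.0  # undefined for an unseen history

def laplace_estimate(c_hw, c_h, V):
    """Laplace (add-one) smoothing over a vocabulary of size V."""
    return (c_hw + 1) / (c_h + V)

def lidstone_estimate(c_hw, c_h, V, lam):
    """Lidstone (additive) smoothing with parameter lam."""
    return (c_hw + lam) / (c_h + lam * V)

# Example with made-up counts; replace with counts from the data file.
V = 64000
print(laplace_estimate(0, 21, V))         # unseen bigram, Laplace
print(lidstone_estimate(0, 21, V, 0.05))  # unseen bigram, Lidstone
```

For unigram probabilities the same functions apply, with `c_h` replaced by the total number of words in the corpus and `c_hw` by the word's unigram count.
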
- 3.
- In the previous exercise we calculated separate smoothed distributions
for unigrams and bigrams. However, it is more sensible to combine the
estimates of n-grams of different lengths with either a back-off or an
interpolated model. For example, when we estimated the probabilities
of the words ``olla'' and ``leuto'' in the context ``vaikuttaa'', the
estimates were equal, because neither of the bigrams had any occurrences.
Yet we know from the unigram probabilities that ``olla'' is much more
likely to occur than ``leuto'', and thus we can assume that it is
also more likely in an unseen context.
Use an interpolated bigram model to calculate probabilities for the examples of the previous exercise. Smooth the bigram estimates using absolute discounting with discount parameter D; a sketch of such an estimator is given below.
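A minimal sketch of the interpolated estimate, assuming as inputs the bigram count c(h, w), the history count c(h), the number of distinct words seen after the history h, and an already-smoothed unigram probability:

```python
def interpolated_bigram(c_hw, c_h, n_followers, p_unigram, D):
    """Interpolated bigram with absolute discounting: subtract the
    discount D (0 < D < 1) from every observed bigram count and give
    the freed probability mass to the unigram distribution.
    """
    if c_h == 0:
        return p_unigram  # unseen history: fall back to the unigram model
    discounted = max(c_hw - D, 0.0) / c_h
    # Interpolation weight: the mass removed by discounting each of the
    # n_followers distinct words observed after this history.
    lam = D * n_followers / c_h
    return discounted + lam * p_unigram
```

Summed over the whole vocabulary, the discounted terms contribute 1 − D·n_followers/c(h) and the unigram terms contribute D·n_followers/c(h), so the distribution is properly normalized.
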

- 4.
- Calculate the perplexity of the following sentence: ``Kielen oppiminen on
monimutkainen ja huonosti ymmärretty tapahtumaketju.'' (*Language learning
is a complex and poorly understood chain of events.*)
Probabilities for a back-off n-gram model can be computed as follows:
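For a bigram model, one standard (Katz-style) formulation, given here as a sketch since the exact discounting scheme may vary, is

$$
P(w_i \mid w_{i-1}) =
\begin{cases}
P^{*}(w_i \mid w_{i-1}), & \text{if } C(w_{i-1} w_i) > 0 \\
\alpha(w_{i-1})\, P(w_i), & \text{otherwise},
\end{cases}
$$

where $P^{*}$ is a discounted bigram estimate and the back-off weight $\alpha(w_{i-1})$ is chosen so that the probabilities sum to one. The perplexity of a sentence $w_1 \dots w_N$ is then

$$
\mathrm{PP} = P(w_1 \dots w_N)^{-1/N} = \left( \prod_{i=1}^{N} P(w_i \mid w_{i-1}) \right)^{-1/N}.
$$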

svirpioj[a]cis.hut.fi