T-61.5020 Statistical Natural Language Processing
Answers 9 -- Statistical machine translation
Version 1.1

1.
We are trying to find the most probable translation for the Swedish sentence:

Let's use the model presented in the course book for the probability:

$$P(S \mid E) = \frac{\epsilon}{(m+1)^{l}} \prod_{j=1}^{l} \sum_{i=0}^{m} P(s_j \mid e_i),$$

where $l$ is the length of the original Swedish sentence and $m$ is the length of the translated English sentence. For the two possibilities:

Here we tried all possible translation rules for each Swedish word. Because the set of rules is very sparse, the calculation remained simple.
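This word-by-word calculation can be sketched in Python; the lexicon below is a hypothetical stand-in for the exercise's actual translation rules:

```python
# Minimal sketch of the order-free, word-for-word translation model:
# the probability of a candidate translation is the product of per-word
# rule probabilities. The rules below are hypothetical, not the exercise's.
lexicon = {
    ("vit", "white"): 0.8,   # hypothetical Swedish -> English rules
    ("vit", "pale"): 0.2,
    ("hus", "house"): 1.0,
}

def translation_prob(swedish, english):
    """P(S|E) when each Swedish word is paired with one English word."""
    p = 1.0
    for s, e in zip(swedish, english):
        p *= lexicon.get((s, e), 0.0)  # unknown pairs get probability 0
    return p

print(translation_prob(["vit", "hus"], ["white", "house"]))  # 0.8
```

Because the model ignores word order, any permutation of the English words would receive the same translation probability.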

The prior probability is obtained from the language model. Let's calculate it for both of the candidate translations:

By multiplying the prior and the translation probability, we see that the latter translation is more probable:
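The combination step can be sketched as follows; the probabilities are illustrative numbers, not those of the exercise:

```python
import math

def score(unigram_probs, translation_prob):
    """Noisy-channel score P(E) * P(S|E) with a unigram language model."""
    return math.prod(unigram_probs) * translation_prob

# Two hypothetical candidates: one built from likelier English words, one
# with a better translation fit but a rarer word.
cand_a = score([1e-3, 1e-4], 0.5)
cand_b = score([1e-3, 1e-5], 0.9)
print(cand_a > cand_b)  # True: the likelier wording wins
```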

Notice that our translation model does not care about the word order. As the unigram language model ignores the order as well, the full model gives no importance to it. Also, if we ask for the most probable sentence instead of testing given alternatives, the result will contain no articles or the word "into": adding them does not affect the translation probability but always reduces the language model probability, so the language model favours shorter sentences. By increasing the language model context to a trigram we might get a model that puts the articles and the word order better in their place.

In the general case we need some heuristics to choose the translations that will be considered, as calculating probabilities for all possible alternatives is impossible in practice.

2.
Let's use the word "tosiasia" (fact) as an example. It has occurred in 983 sentences. In order to do the normalization, we must also count the number of occurrences (the number of sentences they occurred in) for every English word.
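The counting step can be sketched on a toy sentence-aligned corpus; the data below are hypothetical:

```python
from collections import Counter

# For each English word we count the sentences it occurs in, and the
# sentences it shares with the Finnish word "tosiasia" (toy corpus).
finnish = [["se", "on", "tosiasia"], ["se", "sataa"]]
english = [["it", "is", "a", "fact"], ["it", "rains"]]

occ = Counter()    # sentences containing the English word
cooc = Counter()   # sentences containing both it and "tosiasia"
for f_sent, e_sent in zip(finnish, english):
    for e in set(e_sent):
        occ[e] += 1
        if "tosiasia" in f_sent:
            cooc[e] += 1

# Normalized co-occurrence: fraction of e's sentences shared with "tosiasia"
norm = {e: cooc[e] / occ[e] for e in occ}
print(norm["fact"], norm["it"])  # 1.0 0.5
```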

a-b)

The twenty English words that had the largest values for the number of co-occurrences and for the normalized number of co-occurrences are given in the tables below. We see that neither of the methods gave the desired results. For unnormalized frequencies, the problem lies with very common words, which occur in almost every sentence and thus also with our word. For normalized frequencies, the problem is the reverse, i.e. very rare words: if a word that occurs only once happens to occur with our word, it gets the maximum value, 1.

  the    851        this   311
  that   765        are    246
  is     720        we     243
  fact   632        not    239
  of     599        for    221
  a      523        have   210
  and    515        be     199
  to     497        which  192
  in     481        on     182
  it     318        has    173
  winkler        1.0000     residuarity      1.0000
  visarequired   1.0000     regionallevel    1.0000
  visaexempt     1.0000     redhaired        1.0000
  veiling        1.0000     poorlyfounded    1.0000
  valuejudgment  1.0000     philippic        1.0000
  undisputable   1.0000     pemelin          1.0000
  stayers        1.0000     paiania          1.0000
  semipermeable  1.0000     overcultivation  1.0000
  rulingout      1.0000     outturns         1.0000
  roentgen       1.0000     onesixth         1.0000

c)

The problem with the previous methods was that they did not take into account the bidirectionality of the translation: for an English word $e$ to be a probable translation of the Finnish word $f$, $e$ should occur in those sentences where $f$ occurred, and $f$ should occur in those sentences where $e$ occurred. In this case, both probability estimates $P(e \mid f)$ and $P(f \mid e)$ should be high. Let's use the product of those probabilities as the weight for $e$.
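As a sketch, this weight can be computed directly from sentence counts. Only the co-occurrence count 632 for "fact" and the 983 sentences of "tosiasia" appear in the exercise; the total occurrence count of 700 for "fact" below is a hypothetical figure:

```python
def weight(c_both, c_f, c_e):
    """Product P(e|f) * P(f|e) estimated from sentence counts:
    c_both = sentences with both words, c_f / c_e = sentences with each word."""
    return (c_both / c_f) * (c_both / c_e)

# "tosiasia" occurs in 983 sentences and co-occurs with "fact" in 632 of
# them; 700 is a hypothetical total for "fact".
print(round(weight(632, 983, 700), 4))
```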

The results are in the first of the two tables below. This time we found the correct translation, and a closely related word, "reality", has the next highest value.

Let's also try the chi-squared ($\chi^2$) test that was presented in the context of collocations:

$$\chi^2 = \frac{N\,(O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})},$$

where $O_{11}$ is the number of sentences containing both words, $O_{12}$ and $O_{21}$ the numbers of sentences containing only one of the two words, $O_{22}$ the number of sentences containing neither, and $N$ is the total number of sentences in the corpus. For the words that get a value over 3.841, the probability that the co-occurrences arose by chance is less than 5%.
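The statistic can be sketched from a 2x2 contingency table; the counts below are hypothetical:

```python
def chi2(o11, o12, o21, o22):
    """Chi-squared statistic for a 2x2 sentence co-occurrence table."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# Hypothetical counts: 600 sentences with both words, 383 with only the
# Finnish word, 100 with only the English word, 50000 with neither.
print(chi2(600, 383, 100, 50000) > 3.841)  # True: significant at the 5% level
```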

The words that have the largest values are in the latter table. The test seems to work very nicely: only "fact" exceeds the chosen confidence value. On the other hand, if we wanted alternative translations such as "reality", a method that gives probability values would be more convenient.

In practice, the translation probabilities are often determined iteratively using the EM algorithm. In this way one can limit the tendency of a single English word to become the translation for many Finnish words. However, a method such as the one above might be used to initialize the probabilities.
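A minimal sketch of such an EM estimation (in the style of IBM model 1) on a hypothetical two-sentence corpus; the course's actual recipe and corpus may differ:

```python
from collections import defaultdict

# Toy sentence-aligned corpus (hypothetical data).
corpus = [(["se", "on", "tosiasia"], ["it", "is", "a", "fact"]),
          (["se", "sataa"], ["it", "rains"])]

# Lexical probabilities t(e|f), initialized uniformly; a heuristic such as
# the chi-squared ranking above could be used for initialization instead.
t = defaultdict(lambda: 1.0)

for _ in range(10):                        # a few EM iterations
    count = defaultdict(float)             # expected counts c(e, f)
    total = defaultdict(float)             # expected counts c(f)
    for f_sent, e_sent in corpus:
        for e in e_sent:
            z = sum(t[(e, f)] for f in f_sent)   # normalize over alignments
            for f in f_sent:
                c = t[(e, f)] / z
                count[(e, f)] += c
                total[f] += c
    for (e, f), c in count.items():        # M-step: renormalize per f
        t[(e, f)] = c / total[f]

# "it" also co-occurs with "se", so EM shifts its mass away from "tosiasia"
print(t[("fact", "tosiasia")] > t[("it", "tosiasia")])
```

Because "it" appears in both sentence pairs while "fact" appears only with "tosiasia", the expected counts concentrate the probability mass on the correct pairing, which is exactly the competition effect the plain co-occurrence counts lack.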

  fact           -4.0184    stayers          -6.8906
  reality        -6.0493    semipermeable    -6.8906
  winkler        -6.1975    rulingout        -6.8906
  that           -6.3200    roentgen         -6.8906
  is             -6.4256    residuarity      -6.8906
  visarequired   -6.8906    regionallevel    -6.8906
  visaexempt     -6.8906    redhaired        -6.8906
  veiling        -6.8906    poorlyfounded    -6.8906
  valuejudgment  -6.8906    philippic        -6.8906
  undisputable   -6.8906    pemelin          -6.8906
  fact           17.3120    stayers          1.0000
  reality         2.2027    semipermeable    1.0000
  winkler         2.0000    rulingout        1.0000
  that            1.4287    roentgen         1.0000
  is              1.2133    residuarity      1.0000
  visarequired    1.0000    regionallevel    1.0000
  visaexempt      1.0000    redhaired        1.0000
  veiling         1.0000    poorlyfounded    1.0000
  valuejudgment   1.0000    philippic        1.0000
  undisputable    1.0000    pemelin          1.0000

svirpioj[a]cis.hut.fi