T-61.5020 Statistical Natural Language Processing
Answers 9 -- Statistical machine translation
Version 1.1

1.
We are trying to find the most probable translation $ \hat e$ for the Swedish sentence $ r$:

$\displaystyle \hat e = \operatorname*{argmax}_e P(e\vert r) = \operatorname*{argmax}_e P(e)P(r\vert e)$

Let's use the model presented in the course book for the probability $ P(r\vert e)$:

$\displaystyle P(r\vert e)=\frac1Z \sum_{a_1=0}^l\cdots\sum_{a_m=0}^l \prod_{j=1}^m P(r_j\vert e_{a_j})$    

where $m$ is the length of the original Swedish sentence and $l$ is the length of the translated English sentence. For the two possibilities:
$\displaystyle P(r\vert e_1)=1.0\cdot0.7\cdot0.9\cdot1.0\cdot1.0\cdot0.1=0.063$      
$\displaystyle P(r\vert e_2)=1.0\cdot0.7\cdot1.0\cdot1.0\cdot1.0\cdot1.0=0.7$      

Here we tried all possible translation rules for each Swedish word. Because the set of rules is very sparse, the calculation became as simple as that.
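Since the sum over alignments factorizes into a product of per-word sums, $P(r\vert e)$ can be computed (up to the constant $1/Z$) with a few lines of code. Below is a minimal sketch in Python; the lexicon and the sentences are illustrative assumptions, not the actual rule set of the exercise:

  def translation_prob(r, e, t):
      """P(r|e) up to the constant 1/Z. For each source word r_j we
      sum the lexicon probabilities t[(r_j, e_i)] over the target
      words e_i; the sum over all alignments factorizes into this
      product of sums. Alignment to the empty word is omitted."""
      p = 1.0
      for rj in r:
          p *= sum(t.get((rj, ei), 0.0) for ei in e)
      return p

  # Hypothetical sparse lexicon: t[(swedish, english)] = P(swedish|english)
  t = {("huset", "house"): 0.9, ("huset", "building"): 0.1,
       ("är", "is"): 0.7}
  print(translation_prob(["huset", "är"], ["the", "house", "is"], t))  # 0.63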

The prior probability $P(e)$ is obtained from the language model. Let's calculate it for both candidate sentences:

$\displaystyle P(e_1) = \prod_{i=1}^l P(w_i) = 0.18\cdot0.05\cdot0.01\cdot0.13\cdot0.1\cdot0.12\cdot0.02 = 2.8\cdot10^{-9}$
$\displaystyle P(e_2) = 0.18\cdot0.07\cdot0.11\cdot0.21\cdot0.01\cdot0.13\cdot0.1\cdot0.01 = 3.8\cdot10^{-10}$

By multiplying the prior and the translation probability, we see that the latter translation is more probable:

$\displaystyle P(e_1)P(r\vert e_1)=2.8\cdot10^{-9}\cdot0.063=1.8\cdot10^{-10}$
$\displaystyle P(e_2)P(r\vert e_2)=3.8\cdot10^{-10}\cdot0.7=2.6\cdot10^{-10}$

Notice that our translation model does not care about the word order. As the unigram language model ignores it as well, the full model gives no importance to the order. Also, if we asked for the most probable sentence instead of testing the given alternatives, it would contain no articles or the word ``into'': adding them does not affect the translation probability but always reduces the language model probability, so the language model favours shorter sentences. By increasing the language model context to trigrams, we might get a model that puts the articles and the word order better in their place.

In the general case we need some heuristics to choose which candidate translations to consider, as calculating probabilities for all possible alternatives is impossible in practice.
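As an illustration of the whole decision rule, here is a minimal sketch that reuses translation_prob from above; the unigram probabilities and the candidate list are assumed to be given (in practice the candidates would come from such a search heuristic):

  def unigram_prob(e, p_uni):
      """Unigram language-model prior: P(e) is the product of the
      unigram probabilities of the words in the sentence."""
      p = 1.0
      for w in e:
          p *= p_uni.get(w, 0.0)
      return p

  def best_translation(r, candidates, t, p_uni):
      """argmax_e P(e) * P(r|e) over a fixed candidate list."""
      return max(candidates,
                 key=lambda e: unigram_prob(e, p_uni) * translation_prob(r, e, t))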

2.
Let's use the word $f$ = ``tosiasia'' (fact) as an example. It occurs in 983 sentences. For the normalization, we must also count for every English word the number of sentences in which it occurs.

a-b)

The twenty English words with the largest number of co-occurrences and the largest normalized number of co-occurrences are given in the tables below. We see that neither method gave the desired result. For unnormalized frequencies, the problem lies with very common words, which occur in almost every sentence and thus also with our $f$. For normalized frequencies, the problem is reversed: very rare words dominate. If a word that occurs only once happens to occur with $f$, it gets the maximum value, $1.0$. (The counting itself is sketched in code after the tables.)

$e$        $C(e,f)$
the 851
that 765
is 720
fact 632
of 599
a 523
and 515
to 497
in 481
it 318
this 311
are 246
we 243
not 239
for 221
have 210
be 199
which 192
on 182
has 173
$e$        $\frac{C(e,f)}{C(e)}$
winkler 1.0000
visarequired 1.0000
visaexempt 1.0000
veiling 1.0000
valuejudgment 1.0000
undisputable 1.0000
stayers 1.0000
semipermeable 1.0000
rulingout 1.0000
roentgen 1.0000
residuarity 1.0000
regionallevel 1.0000
redhaired 1.0000
poorlyfounded 1.0000
philippic 1.0000
pemelin 1.0000
paiania 1.0000
overcultivation 1.0000
outturns 1.0000
onesixth 1.0000
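The counting behind both tables can be sketched as follows, assuming the corpus is given as two parallel lists of tokenized sentences (all names here are illustrative):

  from collections import Counter

  def cooccurrence_counts(fi_sents, en_sents, f="tosiasia"):
      """Count C(e) (sentences containing English word e), C(e,f)
      (sentences containing both e and the Finnish word f) and C(f)."""
      C_e, C_ef, C_f = Counter(), Counter(), 0
      for fi, en in zip(fi_sents, en_sents):
          en_words = set(en)        # count each word once per sentence
          for e in en_words:
              C_e[e] += 1
          if f in fi:
              C_f += 1
              for e in en_words:
                  C_ef[e] += 1
      return C_e, C_ef, C_f

  # (a) unnormalized: C_ef.most_common(20)
  # (b) normalized:   sort the words e by C_ef[e] / C_e[e]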

c)

The problem in the previous methods was that they did not take into account the bidirectionality of the translation: for $e$ to be a probable translation of $f$, $e$ should occur in those sentences where $f$ occurred, and also $f$ should occur in those sentences where $e$ occurred. In that case, both probability estimates $P(e \vert f) = \frac{C(e,f)}{C(f)}$ and $P(f \vert e) = \frac{C(e,f)}{C(e)}$ should be high. Let's use the product of these probabilities as the weight for $e$.

The results are in the first table below. This time we found the correct translation, and a closely related word, ``reality'', has the next highest value.
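On top of the counts from the previous sketch, the score is a one-liner (defined only for words with $C(e,f) > 0$):

  import math

  def product_score(e, C_e, C_ef, C_f):
      """log( P(e|f) * P(f|e) ) = log( C(e,f)/C(f) * C(e,f)/C(e) )."""
      return math.log((C_ef[e] / C_f) * (C_ef[e] / C_e[e]))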

Let's also try the $\chi^2$ test that was presented in the context of collocations:

$\displaystyle \chi^2 = \frac{N(O_{11}O_{22}-O_{12}O_{21})^2}{(O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})},$

where
$\displaystyle O_{11} = C(e,f)$
$\displaystyle O_{12} = C(e,\neg f) = C(e) - C(e,f)$
$\displaystyle O_{21} = C(\neg e,f) = C(f) - C(e,f)$
$\displaystyle O_{22} = C(\neg e, \neg f) = N - C(e) - C(f) + C(e,f)$

and $N$ is the number of sentences in the corpus. For words with a $\chi^2$ value over 3.841, the probability that the co-occurrences arose by chance is less than 5%.
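As a sketch, the statistic is straightforward to compute from the same counts as before (the arguments are assumed to come from the earlier counting sketch):

  def chi_square(c_e, c_f, c_ef, n):
      """chi^2 statistic of the 2x2 contingency table defined above."""
      o11 = c_ef                      # C(e, f)
      o12 = c_e - c_ef                # C(e, not f)
      o21 = c_f - c_ef                # C(not e, f)
      o22 = n - c_e - c_f + c_ef      # C(not e, not f)
      num = n * (o11 * o22 - o12 * o21) ** 2
      den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
      return num / den

  # e.g. chi_square(C_e["fact"], C_f, C_ef["fact"], N) > 3.841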

The words with the largest values are in the second table below. The test seems to work very nicely: only ``fact'' exceeds the chosen critical value. On the other hand, if we wanted alternative translations, such as ``reality'', a method that gives probability values would be more convenient.

In practice, the translation probabilities are often estimated iteratively with the EM algorithm. In that way one can prevent a single English word from being proposed as the translation of many different Finnish words. However, a method such as the one above can be used to initialize the probabilities; a sketch of the EM iteration follows the tables below.

$e$        $\log\left(\frac{C(e,f)}{C(e)} \cdot \frac{C(e,f)}{C(f)}\right)$
fact -4.0184
reality -6.0493
winkler -6.1975
that -6.3200
is -6.4256
visarequired -6.8906
visaexempt -6.8906
veiling -6.8906
valuejudgment -6.8906
undisputable -6.8906
stayers -6.8906
semipermeable -6.8906
rulingout -6.8906
roentgen -6.8906
residuarity -6.8906
regionallevel -6.8906
redhaired -6.8906
poorlyfounded -6.8906
philippic -6.8906
pemelin -6.8906
$e$        $\chi^2$
fact 17.3120
reality 2.2027
winkler 2.0000
that 1.4287
is 1.2133
visarequired 1.0000
visaexempt 1.0000
veiling 1.0000
valuejudgment 1.0000
undisputable 1.0000
stayers 1.0000
semipermeable 1.0000
rulingout 1.0000
roentgen 1.0000
residuarity 1.0000
regionallevel 1.0000
redhaired 1.0000
poorlyfounded 1.0000
philippic 1.0000
pemelin 1.0000
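As mentioned above, here is a minimal sketch of the EM iteration for the lexicon probabilities, in the style of the translation model of problem 1 (without the empty word; the corpus format and the initialization are assumptions):

  from collections import defaultdict

  def em_train(pairs, iterations=10):
      """pairs: list of (finnish_words, english_words) sentence pairs.
      Returns a lexicon t[(f, e)] approximating P(f|e)."""
      t = defaultdict(lambda: 1e-3)   # uniform start; the chi^2 or
                                      # product scores above could be
                                      # used for initialization instead
      for _ in range(iterations):
          count = defaultdict(float)  # expected counts c(f, e)
          total = defaultdict(float)  # expected counts c(e)
          for fs, es in pairs:
              for f in fs:
                  norm = sum(t[(f, e)] for e in es)
                  for e in es:
                      frac = t[(f, e)] / norm   # E-step: soft alignment
                      count[(f, e)] += frac
                      total[e] += frac
          for (f, e), c in count.items():       # M-step: re-estimate
              t[(f, e)] = c / total[e]
      return t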


