**T-61.5020 Statistical Natural Language Processing**

Answers 7 -- Word sense disabiguation

Version 1.0

- 1.
Let's start from Bayes' theorem.

Now we are interested only in the order of probabilities, not the absolute values, so we can forget the normalization term :

In the equation the latter term is the prior for the word sense. It can be estimated for example by calculating how many of the words in the training corpus have appeared in the sense . For now we concentrate on the term .Let's choose the nearest 10 words as the context:

The word we are studying would be in the middle, . Here, the order of the context words makes a difference, and this is marked by using the parentheses. Using these kind of feature vectors is in practice impossible, as two equal 10 word contexts are not likely to be found in the training or test sets. We approximate the model by assuming that the order is not significant (now using brackets):

Now we have:

Let's make the estimation easier by assuming that the words occur independently:

Finally, let's write the expression open:

In the last row the formula is written in logarithmic form. This can be done, because taking the logarithm does not affect the order of the values.None of the used approximations is totally correct, but the roughest error is probably the one of independency of the context words. However, this is the way of getting an easily feasible method.

- 2.
- Let's use the formula derived in the previous problem:

where are the words that occurred in the context.We need two estimates: probability that the word in the context occurs with the sense , and prior probability . As we have equal number of occurrences for the senses

*sataa=rain*and*sataa=number*, we can but set the prior to .Maximum likelihood (ML) estimation is applied in the course book. In our problem we were asked to use priors, so let's define a small prior that all words are of equal probability to the probability , and add it to the estimators with coefficient . This can be thought as if every known word had already occurred 0.5 times in both context types. A large emphasises the meaning of the prior, and thus a small evidence from the training set does not change it much.

is the number of known words, 85.- a)
- Let's calculate the estimators needed in the first test sentence:

We see that for the first sense, all probability mass comes from the prior. For the comparison number (i.e. unnormalized probability) we get

The same calculations for the number sense:

The comparison number is:

So according to the model, the number sense of ``sataa'' is more probable. - b)
- As we saw before, we can leave out all the words that have not occurred
in the contexts of either word, as they do not affect the order of
the comparison numbers. Let's use the tool that changes all
disambigous numbers to string ``num''. The needed probabilities are:

For the comparison numbers we get

This time the word seems to mean raining. - c)
- For the third sentence,

For the comparison numbers we get

So it seems to be a number here. - d)
- For the last sentence the given training data does not change the probabilities for any direction. And because the priors were equal, the model cannot make any decision here.

- 3.
Let's find the dictionary definitions for the words in the tested sentence. Those are compared to the dictionary definitions of two senses of the studied word. The meaning that has more mutual words with the words in the dictionary definitions of the other words (including the word itself) in the sentence is decided to be the correct one.

In this case, from the definition of ``ampuminen'', shooting, we find the words ``harjoitella'' and ``varusmies'' that are also in the test sentence. The word ``sarjatuli'' is found from the definition of ``kivääri'', so three points for shooting.

From the definition of ``ammuminen'', that is moo'ing, we find the word ``niityllä'', which is also in the test sentence. One point for moo'ing.

It seems that it is shooting for this one ().

- 4.
Let's see how many hits the Google will give:

prices go up 111000 price goes up 88100 *199100*prices slant 58 prices lean 2520 prices lurch 21 price slants 1 price leans 63 price lurches 114 *2777*

This example goes clearly for the sense

*``go up''*.What about the next example? If we do the translation and search using the given word order, we will get no hits (excluding the hits for this exercise problem). So we try to find documents where the words may occur in any order:

want shin hoof liver or snout 260 like shin hoof liver or snout 304 covet shin hoof liver or snout 219 desire shin hoof liver or snout 243 *1026*want kick poke cost or suffer 43500

We see that the verb meanings of the words win here, altough the nouns would probable be more correct. All searches are not even needed, because the first one already produces more hits than all of the other senses together. In addition, most of the hits returned by the first four searches were from dictionaries.

As the senses

*shin, hoof, liver*and*snout*are much rarer than the verbs, they are found much less. In this situation we should probably normalize the search in some way. This example was harder than the first one also because this time the sentence was not a common and fixed phrase.- 5.
The problem is to estimate the probability of the sense when we know the context .

Let's use the Naive Bayes assumption presented in first problem, i.e. that the words in the context do not depend on each other:

- Set all the words to be equally probable for both sources, and add some
noise . Without the noise the algorithm will not converge, as
all the events have equal probabilities.

Here is the number of the known words. - Set all senses to be of equal probability.

Here is the number of the different senses.

- Calculate the probability of each sense for all contexts:

- Estimate the new word probabilities using the sentence
probabilities estimated in the E-step.

- Update the prior probabilities:

The convergence of the algorithm, as E- and M-steps are iterated, is illustrated in Figure 1. In this case the priors we kept at for first 15 iterations, which improved the stability. We see that the algorithm can separate the numbers and the trees. For sentences 8 and 9 the model overlearns and sets them only to one sense. If the amount of training data would be larger, also these estimates might be more feasible.

The same algorithm can be used to, e.g., separate a set of documents to their topics. In that case, the contexts would be the full documents.

[6.]

Here we present one possible example solution step by step. The most
important points where we have made an arbitary decision that can
increase inaccuracy and could be as well done otherwise, are marked
with *italics*.

- 1)
- The first step is to clean all the extra headlines, tags and markings
away. Then we want to separate the contexts.
*Let the context for each word to be the full sentence where it occurs.*Let's change some two words, e.g. ``sade'' (rain) and ``komissio'' (commission), to a common pseudoword. At the same time we can collect the correct answers for evaluation purpose. - 2)
- Let's change all words of the contexts to a vector form. Here we could
use binary indicator vectors, but let's
*approximate those by setting a random 200-dimensional vector to each word*. If the vector has enough dimensions, it is roughly orthogonal between all different words and the approximation is quite good. - 3)
- Let's assume that
*the order of the context words does not matter*. Let's calculate the context for each word by*summing up the vectors of the words in its context and dividing the sum by the number of the vectors*. - 4)
- Let's cluster the context vectors
*using the self-organizing map (SOM)*. The number by clusters can be decided*experimentally*. For a small number of clusters it is easier to estimate the quality visually; a large number of clusters can give a finer separation. - 5)
- Last we need to evaluate the quality of the clustering. For
unsupervised methods this is sometimes hard, but in this case we can
do the following: First we look if words with different senses went
nicely to different clusters using the correct senses from the training
set. This does not prove much, because if we chose as many clusters as
we have words in the training set, we would get automatically the best
result. Instead, we use the training samples to label each cluster.
This means that the cluster that has more items of some sense A
(
*relative to the size of the training set for both senses*) alleges that all samples that go nearby have surely the sense A. We try the test set against these senses and see how many are correct.

Using the method described above we got the results in Table 1. Here we used a map of size . If no correct answers are available, it is easier to evaluate the result when we have small number of groups. For example, for words ``sade'' and ``komissio'', the results for a map were 59% and 98%. In Figure 2 we have the grouping of words ``sade'' and ``komissio'' for the map.

training | test | ||||

correct % | correct % | correct % | correct % | ||

Lappi | Pariisi | 63 | 55 | 61 | 53 |

sade | komissio | 66 | 93 | 66 | 92 |

Venäjä | tammikuu | 80 | 60 | 78 | 60 |

Halonen | TPS | 62 | 74 | 63 | 70 |

leijona | ydinvoima | 70 | 55 | 75 | 48 |

svirpioj[a]cis.hut.fi