T-61.5020 Statistical Natural Language Processing
Version 1.1

1.
Let us rearrange the given table into a format better suited to the first calculations.

Table: Modified tables. (tp = True Positives, fp = False Positives, fn = False Negatives, tn = True Negatives)
 engine 1        relevant   non-relevant
 returned        4 tp       6 fp
 not returned    2 fn       9988 tn

 engine 2        relevant   non-relevant
 returned        6 tp       4 fp
 not returned    0 fn       9990 tn

The following table gives the definitions of the first five measures and the results of applying them.

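Table: Results. Note that only the precision and recall values are in a range that is easy to interpret. (N = tp + fp + fn + tn = 10000)
 measure     definition        ratio of                  engine 1   engine 2
 precision   tp / (tp + fp)    relevants in returned     0.4000     0.6000
 recall      tp / (tp + fn)    relevants found           0.6667     1.0000
 fallout     fp / (fp + tn)    returned non-relevants    0.0006     0.0004
 accuracy    (tp + tn) / N     correctly classified      0.9992     0.9996
 error       (fp + fn) / N     incorrectly classified    0.0008     0.0004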
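As a cross-check, the five measures can be computed directly from the counts in the modified tables. The following is a minimal Python sketch; the function name `measures` and the dictionary layout are illustrative choices, not part of the exercise.

```python
def measures(tp, fp, fn, tn):
    """Return the five evaluation measures for one search engine."""
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp),    # relevant documents among the returned ones
        "recall": tp / (tp + fn),       # relevant documents that were found
        "fallout": fp / (fp + tn),      # non-relevant documents that were returned
        "accuracy": (tp + tn) / total,  # correctly classified documents
        "error": (fp + fn) / total,     # incorrectly classified documents
    }


# Counts taken from the modified tables above.
engine1 = measures(tp=4, fp=6, fn=2, tn=9988)
engine2 = measures(tp=6, fp=4, fn=0, tn=9990)

for name in engine1:
    print(f"{name:10s} {engine1[name]:.4f} {engine2[name]:.4f}")
```

Because almost all of the 10000 documents are true negatives, fallout, accuracy and error end up very close to 0 or 1 for both engines, which is why they discriminate poorly here.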

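F-measure is defined using both the precision and recall:

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$$

where $P$ stands for precision and $R$ for recall. $\alpha$ controls the weighting between them. If we choose $\alpha = 0.5$,

$$F = \frac{2PR}{P + R}$$

For the first engine $F = 0.5$ and for the second $F = 0.75$.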
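The F-measure values can be checked with a few lines of Python. This is a sketch that assumes the weighted harmonic-mean form of the F-measure with an equal weighting of 0.5; the function name `f_measure` and the parameter name `alpha` are chosen here for illustration.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall (assumed form)."""
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Precision and recall follow from the counts in the modified tables.
f_engine1 = f_measure(precision=4 / 10, recall=4 / 6)
f_engine2 = f_measure(precision=6 / 10, recall=6 / 6)

print(round(f_engine1, 3), round(f_engine2, 3))
```

With equal weighting the expression reduces to 2PR / (P + R), so a low value of either precision or recall pulls the F-measure down.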

When calculating uninterpolated average precision, we go through the list of returned documents in ranked order, and whenever a relevant document is seen, we calculate the precision over the documents processed so far. Relevant documents that were not returned are taken into account with a zero precision. Finally we take the average of these precisions.
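The procedure can be sketched in Python as follows. The ranked list used here is hypothetical, invented only to illustrate the computation, and the function name `average_precision` is likewise an assumption.

```python
def average_precision(ranking, n_relevant):
    """Uninterpolated average precision.

    ranking:    booleans in ranked order, True for a relevant document
    n_relevant: number of relevant documents in the whole collection
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            # Precision over the documents processed so far.
            precisions.append(hits / rank)
    # Relevant documents that were never returned count as precision 0.
    precisions.extend([0.0] * (n_relevant - hits))
    return sum(precisions) / n_relevant


# Hypothetical example: 5 returned documents, 3 relevant in the collection,
# of which only 2 were returned (at ranks 1 and 3).
example_ranking = [True, False, True, False, False]
ap = average_precision(example_ranking, n_relevant=3)
print(round(ap, 4))
```

In this hypothetical case the precisions are 1/1 and 2/3 for the two relevant documents found, plus one zero for the relevant document that was missed.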

2.
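Word frequencies in the documents were given: and . The total number of the documents is . Inverse Document Frequency is defined as $\mathrm{IDF}_i = \log_2 (N / \mathrm{df}_i)$, where $N$ is the total number of documents and $\mathrm{df}_i$ the number of documents containing word $w_i$, so for the word it is and for the word . Thus the first word got almost twice the weight of the second word.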

The idea in Residual Inverse Document Frequency (RIDF) is that we can model the occurrences of a word using a Poisson distribution. This works well for words that are evenly distributed in a corpus. Words that are important for the content usually occur in clusters inside the documents that discuss the corresponding topic, and therefore the Poisson distribution gives an incorrect estimate of their frequencies. In RIDF we measure the difference between the observed IDF and the IDF predicted by the Poisson distribution. The larger the difference, the more the word tells about the document. (Note: There are many errors in this section of the course book's first edition.)

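Actual calculations are the following: On average, word $w_i$ occurs $\lambda_i = \mathrm{cf}_i / N$ times in a document, where $\mathrm{cf}_i$ is the total number of occurrences of the word in the collection and $N$ the number of documents. The probability that word $w_i$ occurs $k$ times in a certain document is obtained from the Poisson distribution:

$$p(k; \lambda_i) = e^{-\lambda_i} \frac{\lambda_i^k}{k!}$$

RIDF is defined as

$$\mathrm{RIDF}_i = \mathrm{IDF}_i + \log_2 (1 - p(0; \lambda_i))$$

I.e., we take from the Poisson distribution the probability that the word occurs at least once in the document ($1 - p(0; \lambda_i)$). IDF, on the other hand, was based on the observed value of that probability ($\mathrm{df}_i / N$, where $\mathrm{df}_i$ is the number of documents containing the word).

Simplifying the expression of RIDF:

$$\mathrm{RIDF}_i = -\log_2 \frac{\mathrm{df}_i}{N} + \log_2 (1 - e^{-\lambda_i}) = \log_2 \frac{N (1 - e^{-\mathrm{cf}_i / N})}{\mathrm{df}_i}$$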

Assigning the values:

We see that RIDF weighted the word 2.5 times more than IDF. Thus both methods estimate that is a more relevant search term than .
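The IDF and RIDF computations can be sketched in Python. The sketch assumes base-2 logarithms and the form RIDF = IDF + log2(1 - p(0; lambda)); the counts below are hypothetical and are not the values given in the exercise, and the function names are illustrative.

```python
import math


def poisson(k, lam):
    """Poisson probability of exactly k occurrences with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def idf(df, n_docs):
    """Observed IDF: df is the number of documents containing the word."""
    return -math.log2(df / n_docs)


def ridf(cf, df, n_docs):
    """Observed IDF minus the IDF predicted by the Poisson model (assumed form)."""
    lam = cf / n_docs                        # mean occurrences per document
    p_at_least_once = 1.0 - poisson(0, lam)  # Poisson estimate of df / n_docs
    return idf(df, n_docs) + math.log2(p_at_least_once)


# Hypothetical counts: both words occur 1000 times in 10000 documents, but
# the first is spread over 950 documents and the second over only 200.
n_docs = 10000
ridf_even = ridf(cf=1000, df=950, n_docs=n_docs)
ridf_bursty = ridf(cf=1000, df=200, n_docs=n_docs)
print(round(ridf_even, 3), round(ridf_bursty, 3))
```

With these made-up counts the evenly spread word gets an RIDF close to zero, because the Poisson model predicts its document frequency well, while the clustered word gets a clearly positive RIDF.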

3.
The requested document-word matrix is presented in table 3.

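Table: Document-word matrix (one row per word, one column per document)
 word \ document            1  2  3  4  5  6  7
 Schumacher                 0  1  0  1  0  0  0
 rata (track)               1  1  1  0  0  1  0
 formula                    1  0  1  1  0  0  0
 kolari (crash)             0  0  1  1  0  0  0
 galaksi (galaxy)           0  0  0  0  1  1  0
 tähti (star)               0  0  1  0  0  1  1
 planeetta (planet)         0  0  0  0  0  1  1
 meteoriitti (meteorite)    0  0  0  0  1  0  0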

In Singular Value Decomposition (SVD) we decompose the document-word matrix $A$ as:

$$A = U S V^T$$

Here $U$ is an orthogonal matrix (8 x 8), $S$ is a diagonal matrix (8 x 7) and $V$ an orthogonal matrix (7 x 7). The matrices are presented in tables 4, 5, and 6.

Table: $U$ (left singular vectors, one row per word)
 Schumacher   -0.200 -0.336  0.290  0.115  0.823  0.007  0.121 -0.243
 rata         -0.590  0.007  0.184  0.686 -0.232 -0.183  0.025  0.243
 formula      -0.435 -0.464 -0.040 -0.225 -0.333  0.609  0.045 -0.243
 kolari       -0.317 -0.361 -0.108 -0.494  0.071 -0.438 -0.285  0.485
 galaksi      -0.200  0.400  0.602 -0.242 -0.053  0.028 -0.563 -0.243
 tähti        -0.464  0.376 -0.408 -0.213  0.034 -0.345  0.275 -0.485
 planeetta    -0.257  0.476 -0.234 -0.070  0.363  0.530 -0.007  0.485
 meteoriitti  -0.026  0.116  0.534 -0.336 -0.132 -0.048  0.713  0.243

Table: $S$ (singular values on the diagonal)
 2.949  0      0      0      0      0      0
 0      2.107  0      0      0      0      0
 0      0      1.459  0      0      0      0
 0      0      0      1.311  0      0      0
 0      0      0      0      1.183  0      0
 0      0      0      0      0      0.638  0
 0      0      0      0      0      0      0.46
 0      0      0      0      0      0      0

Table: $V$ (right singular vectors, one row per document)
 -0.348 -0.217  0.099  0.352 -0.478  0.669  0.152
 -0.268 -0.156  0.325  0.611  0.499 -0.275  0.316
 -0.613 -0.210 -0.255 -0.187 -0.390 -0.559  0.130
 -0.323 -0.551  0.098 -0.460  0.474  0.279 -0.261
 -0.077  0.245  0.779 -0.440 -0.157 -0.030  0.328
 -0.512  0.598  0.099  0.124  0.094  0.048 -0.587
 -0.244  0.404 -0.440 -0.216  0.335  0.290  0.583

Table: Scaled reduced matrix (two rows, one unit-length column per document)
 -0.913 -0.924 -0.971 -0.634 -0.400 -0.768 -0.646
 -0.407 -0.384 -0.238 -0.773  0.917  0.640  0.764

Table: Correlations of documents (lower triangle; rows and columns are documents 1 to 7)
 1   1.000
 2   1.000  1.000
 3   0.984  0.988  1.000
 4   0.894  0.882  0.800  1.000
 5  -0.008  0.018  0.171 -0.455  1.000
 6   0.441  0.464  0.594 -0.008  0.894  1.000
 7   0.279  0.304  0.446 -0.180  0.958  0.985  1.000

We reduce the inner dimension to two by keeping only the two largest singular values from the diagonal matrix $S$ and leaving the rest of the dimensions out of the word matrix $U$ and the document matrix $V$. Now the similarity of the documents can be compared using the matrix $B = S_2 V_2^T$, where $S_2$ holds the two largest singular values and $V_2$ the first two columns of $V$. If the columns of $B$ are scaled to unit length, the correlation between two documents is simply the dot product of their columns. Such a scaled matrix is shown in table 7. (Similarity of words could be compared using $U_2 S_2$.) From the correlation matrix (table 8) we see that the Formula 1 articles and the astronomy articles correlate much more strongly within their own group than across groups. Documents 5 and 7, which were totally uncorrelated before, are now clearly correlated. We have projected the data to a two-dimensional space, and similar articles have ended up near each other in that reduced space.
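The reduction step can be reproduced from the printed tables with a short Python sketch. It uses the two largest singular values from table 5 and the first two columns of the 7 x 7 matrix in table 6 (one row per document); variable names are illustrative. Because the inputs are rounded to three decimals, the results agree with tables 7 and 8 only up to small rounding differences.

```python
import math

# Two largest singular values (table 5).
singular_values = [2.949, 2.107]

# First two columns of the matrix in table 6, one row per document.
doc_coords = [
    [-0.348, -0.217],
    [-0.268, -0.156],
    [-0.613, -0.210],
    [-0.323, -0.551],
    [-0.077, 0.245],
    [-0.512, 0.598],
    [-0.244, 0.404],
]


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Each document's two-dimensional coordinates, weighted by the singular
# values and scaled to unit length (compare with table 7).
scaled = [unit([s * x for s, x in zip(singular_values, row)]) for row in doc_coords]

# Dot products of the unit vectors (compare with table 8).
corr = [[sum(a * b for a, b in zip(u, v)) for v in scaled] for u in scaled]

# Columns 5 and 7 of the original matrix (table 3) share no words.
doc5 = [0, 0, 0, 0, 1, 0, 0, 1]
doc7 = [0, 0, 0, 0, 0, 1, 1, 0]
raw_overlap = sum(a * b for a, b in zip(doc5, doc7))

print(raw_overlap, round(corr[4][6], 2))
```

The full decomposition itself could be computed with a numerical library, for example `numpy.linalg.svd`; in that case the signs of individual singular vectors may differ from the tables, which does not affect the correlations.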

svirpioj[a]cis.hut.fi