- 1.
- Let us first rearrange the given table into a form that is better
  suited to the first calculations.
  
 
Table:
Modified tables
      (tp = True Positives,
      fp = False Positives,
      fn = False Negatives,
      tn = True Negatives)

| engine 1 | relevant | non-relevant |
| --- | --- | --- |
| returned | 4 tp | 6 fp |
| not returned | 2 fn | 9988 tn |

| engine 2 | relevant | non-relevant |
| --- | --- | --- |
| returned | 6 tp | 4 fp |
| not returned | 0 fn | 9990 tn |
 
 
 
The following table gives the definitions of the first five measures
and the results of applying them to both engines.
  
 
 
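Table:
Results. Note that only the precision and recall values fall in a
      range that is easy to interpret. Here $N = tp + fp + fn + tn = 10000$.

| measure | definition | engine 1 | engine 2 | ratio of |
| --- | --- | --- | --- | --- |
| precision | $\frac{tp}{tp+fp}$ | 0.4 | 0.6 | relevants in returned |
| recall | $\frac{tp}{tp+fn}$ | 0.667 | 1.0 | relevants found |
| fallout | $\frac{fp}{fp+tn}$ | 0.0006 | 0.0004 | returned non-relevants |
| accuracy | $\frac{tp+tn}{N}$ | 0.9992 | 0.9996 | correctly classified |
| error | $\frac{fp+fn}{N}$ | 0.0008 | 0.0004 | incorrectly classified |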
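
The five measures can be checked with a short Python sketch. The function name `retrieval_measures` and the dictionary layout are illustrative choices, not part of the exercise; the counts are the ones given for the two engines.

```python
def retrieval_measures(tp, fp, fn, tn):
    """Return the five basic measures for one confusion table."""
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp),   # relevants among the returned
        "recall": tp / (tp + fn),      # relevants that were found
        "fallout": fp / (fp + tn),     # non-relevants that were returned
        "accuracy": (tp + tn) / total, # correctly classified documents
        "error": (fp + fn) / total,    # incorrectly classified documents
    }


engine1 = retrieval_measures(tp=4, fp=6, fn=2, tn=9988)
engine2 = retrieval_measures(tp=6, fp=4, fn=0, tn=9990)

for name in engine1:
    print(f"{name:10s} {engine1[name]:.4f} {engine2[name]:.4f}")
```

Because almost all of the 10000 documents are true negatives, accuracy, error and fallout barely differ between the engines, which is why precision and recall are the informative figures here.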
 
 
 
F-measure is defined using both the precision and recall:

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}},$$

where $P$ stands for precision and $R$ for recall, and $\alpha$ controls
the weighting between them. If we choose $\alpha = 0.5$,

$$F = \frac{2PR}{P + R}.$$

For the first engine $F = 0.5$ and for the second $F = 0.75$.

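
A minimal sketch of the F-measure computation, assuming the balanced weighting $\alpha = 0.5$ (an assumption of this sketch, as is the function name `f_measure`). Precision and recall are derived from the engine counts given earlier.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 or recall == 0:
        return 0.0  # convention: F is zero if either component is zero
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Engine 1: 4 tp, 6 fp, 2 fn. Engine 2: 6 tp, 4 fp, 0 fn.
f_engine1 = f_measure(precision=4 / 10, recall=4 / 6)
f_engine2 = f_measure(precision=6 / 10, recall=6 / 6)

print(f"engine 1: F = {f_engine1:.3f}")
print(f"engine 2: F = {f_engine2:.3f}")
```

A value of `alpha` closer to 1 moves the result toward precision, a value closer to 0 toward recall.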
When calculating uninterpolated average precision, we go through the
  list of returned documents in rank order, and whenever a relevant
  document is seen, we calculate the precision over the documents
  processed so far. Relevant documents that were not returned are taken
  into account with a precision of zero. Then we take the average over
  these precisions.
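
The procedure can be sketched as follows. The ranking used here is hypothetical, since the ranked result lists of the exercise are not reproduced in this text; only the totals (10 returned documents, 4 of them relevant, 6 relevant documents overall) match the first engine. The function name `average_precision` is likewise illustrative.

```python
def average_precision(ranked_relevance, total_relevant):
    """Uninterpolated average precision.

    ranked_relevance: booleans in rank order, True for a relevant document.
    total_relevant:   number of relevant documents in the whole collection.
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision over documents seen so far
    # Relevant documents that were never returned add zero to the sum,
    # so dividing by total_relevant accounts for them.
    return precision_sum / total_relevant


# Hypothetical ranking: relevant documents at ranks 1, 3, 6 and 10.
ranking = [True, False, True, False, False, True, False, False, False, True]
ap = average_precision(ranking, total_relevant=6)
print(f"average precision = {ap:.3f}")
```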
 
 
 
 
- 2.
- The document frequencies of the two words, $df_1$ and $df_2$, and the
  total number of documents, $N$, were given. Inverse Document Frequency
  is defined as $\mathrm{IDF}_i = \log_2 \frac{N}{df_i}$. With the given
  values, the first word got almost twice the weight of the second word.
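
A small sketch of the IDF computation. The counts below are hypothetical placeholders, not the values of the exercise, and the base-2 logarithm and the function name `idf` are assumptions of this sketch.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency, log2(N / df)."""
    if not 0 < doc_freq <= n_docs:
        raise ValueError("doc_freq must be in the range 1..n_docs")
    return math.log2(n_docs / doc_freq)


# Hypothetical collection: a rare word and a common word.
n_docs = 100_000
idf_rare = idf(n_docs, doc_freq=100)
idf_common = idf(n_docs, doc_freq=10_000)

print(f"rare word:   IDF = {idf_rare:.2f}")
print(f"common word: IDF = {idf_common:.2f}")
```

The rarer a word is across documents, the larger its IDF, so it receives more weight as a search term.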
The idea in Residual Inverse Document Frequency (RIDF) is that we
  can model the occurrences of a word using a Poisson distribution.
  This works well for words that are evenly distributed in a corpus.
  Words that are important for the content usually occur in clusters
  inside the documents that discuss the corresponding matter, and
  therefore the Poisson distribution gives an incorrect estimate of
  their document frequencies. RIDF measures the difference between the
  observed IDF and the IDF predicted by the Poisson distribution. The
  larger the difference, the more the word tells about the document.
  (Note: There are many errors in this section of the course book's
  first edition.)
 
The actual calculations are as follows. On average, word $w_i$ occurs
  $\lambda_i = cf_i / N$ times in a document, where $cf_i$ is the total
  number of occurrences of the word in the collection and $N$ is the
  number of documents. The probability that word $w_i$ occurs $k$ times
  in a certain document is obtained from the Poisson distribution:

$$p(k; \lambda_i) = e^{-\lambda_i} \frac{\lambda_i^k}{k!}$$

RIDF is defined as

$$\mathrm{RIDF}_i = \mathrm{IDF}_i - \log_2 \frac{1}{1 - p(0; \lambda_i)}$$

 I.e., we take from the Poisson distribution the probability that the word
  occurs at least once in the document ($1 - p(0; \lambda_i)$). IDF,
  on the other hand, was based on the observed value of that probability
  ($df_i / N$, the fraction of the documents that contain the word).

Simplifying the expression of RIDF:

$$\mathrm{RIDF}_i = \log_2 \frac{N}{df_i} + \log_2 \left(1 - e^{-\lambda_i}\right) = \log_2 \frac{N \left(1 - e^{-cf_i / N}\right)}{df_i}$$

Assigning the values:
    
 
 
 
We see that RIDF weighted the first word 2.5 times more than IDF.
  Thus both methods estimate that the first word is a more relevant
  search term than the second.
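
The RIDF computation can be sketched in a few lines. All counts below are hypothetical, since the values of the exercise are not reproduced in this text; the base-2 logarithm and the names `idf`, `poisson_pmf` and `ridf` are assumptions of this sketch.

```python
import math


def idf(n_docs, doc_freq):
    """Observed inverse document frequency, log2(N / df)."""
    return math.log2(n_docs / doc_freq)


def poisson_pmf(k, lam):
    """Poisson probability of exactly k occurrences with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def ridf(n_docs, doc_freq, coll_freq):
    """Residual IDF: observed IDF minus the IDF predicted by Poisson."""
    lam = coll_freq / n_docs                    # mean occurrences per document
    p_at_least_once = 1.0 - poisson_pmf(0, lam)
    return idf(n_docs, doc_freq) + math.log2(p_at_least_once)


# Hypothetical words, both occurring 1000 times in 1000 documents.
n_docs = 1000
ridf_even = ridf(n_docs, doc_freq=632, coll_freq=1000)    # spread evenly
ridf_bursty = ridf(n_docs, doc_freq=100, coll_freq=1000)  # clustered in few documents

print(f"evenly spread word: RIDF = {ridf_even:.3f}")
print(f"clustered word:     RIDF = {ridf_bursty:.3f}")
```

A word whose document frequency matches the Poisson prediction gets an RIDF near zero, while a word concentrated in few documents gets a clearly positive value.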
 
 
- 3.
- The requested document-word matrix is presented in table 3.
 
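Table:
Document-word matrix. Rows are words, columns are the documents
$d_1, \ldots, d_7$. (The Finnish index terms: rata = track, kolari = crash,
galaksi = galaxy, tähti = star, planeetta = planet, meteoriitti = meteorite.)

| | $d_1$ | $d_2$ | $d_3$ | $d_4$ | $d_5$ | $d_6$ | $d_7$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Schumacher | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| rata | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| formula | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| kolari | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| galaksi | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| tähti | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| planeetta | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| meteoriitti | 0 | 0 | 0 | 0 | 1 | 0 | 0 |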
 
 
 In Singular Value Decomposition (SVD) we decompose the matrix $A$ as:

$$A = U S V^T$$

 Here $U$ is an orthogonal $8 \times 8$ matrix, $S$ is a diagonal
$8 \times 7$ matrix and $V$ an orthogonal $7 \times 7$ matrix. The
matrices are presented in tables 4, 5, and 6.
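
The decomposition can be reproduced with NumPy, as a sketch. The variable names are illustrative, and the choice of `numpy.linalg.svd` is an assumption; the source does not say which tool produced the tables. The signs of individual singular vectors may differ from the tabulated ones, because each pair of singular vectors is only determined up to a common sign.

```python
import numpy as np

# Rows: words, columns: the seven documents (values from the document-word matrix).
words = ["Schumacher", "rata", "formula", "kolari",
         "galaksi", "tähti", "planeetta", "meteoriitti"]
A = np.array([
    [0, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 0],
], dtype=float)

# Full SVD: U is 8x8, Vt is 7x7, s holds the 7 singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Embed the singular values in an 8x7 matrix so that U @ S @ Vt rebuilds A.
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)

print(np.round(s, 3))
print("reconstruction ok:", np.allclose(U @ S @ Vt, A))
```

The printed singular values should be close to the diagonal of the tabulated singular value matrix.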
 
 
Table:
Matrix $U$ (rows correspond to words)

| | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Schumacher | -0.200 | -0.336 | 0.290 | 0.115 | 0.823 | 0.007 | 0.121 | -0.243 |
| rata | -0.590 | 0.007 | 0.184 | 0.686 | -0.232 | -0.183 | 0.025 | 0.243 |
| formula | -0.435 | -0.464 | -0.040 | -0.225 | -0.333 | 0.609 | 0.045 | -0.243 |
| kolari | -0.317 | -0.361 | -0.108 | -0.494 | 0.071 | -0.438 | -0.285 | 0.485 |
| galaksi | -0.200 | 0.400 | 0.602 | -0.242 | -0.053 | 0.028 | -0.563 | -0.243 |
| tähti | -0.464 | 0.376 | -0.408 | -0.213 | 0.034 | -0.345 | 0.275 | -0.485 |
| planeetta | -0.257 | 0.476 | -0.234 | -0.070 | 0.363 | 0.530 | -0.007 | 0.485 |
| meteoriitti | -0.026 | 0.116 | 0.534 | -0.336 | -0.132 | -0.048 | 0.713 | 0.243 |
 
 
 
 
Table:
Matrix $S$ (singular values on the diagonal)

| | | | | | | |
| --- | --- | --- | --- | --- | --- | --- |
| 2.949 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 2.107 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1.459 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1.311 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1.183 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0.638 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0.460 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
 
 
 
 
Table:
Matrix $V$ (rows correspond to the documents $d_1, \ldots, d_7$)

| | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $d_1$ | -0.348 | -0.217 | 0.099 | 0.352 | -0.478 | 0.669 | 0.152 |
| $d_2$ | -0.268 | -0.156 | 0.325 | 0.611 | 0.499 | -0.275 | 0.316 |
| $d_3$ | -0.613 | -0.210 | -0.255 | -0.187 | -0.390 | -0.559 | 0.130 |
| $d_4$ | -0.323 | -0.551 | 0.098 | -0.460 | 0.474 | 0.279 | -0.261 |
| $d_5$ | -0.077 | 0.245 | 0.779 | -0.440 | -0.157 | -0.030 | 0.328 |
| $d_6$ | -0.512 | 0.598 | 0.099 | 0.124 | 0.094 | 0.048 | -0.587 |
| $d_7$ | -0.244 | 0.404 | -0.440 | -0.216 | 0.335 | 0.290 | 0.583 |
 
 
 
 
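Table:
Scaled $S V^T$, reduced to two dimensions (each column scaled to unit length)

| | $d_1$ | $d_2$ | $d_3$ | $d_4$ | $d_5$ | $d_6$ | $d_7$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | -0.913 | -0.924 | -0.971 | -0.634 | -0.400 | -0.768 | -0.646 |
| | -0.407 | -0.384 | -0.238 | -0.773 | 0.917 | 0.640 | 0.764 |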
 
 
 
 
Table:
Correlations of documents

| | $d_1$ | $d_2$ | $d_3$ | $d_4$ | $d_5$ | $d_6$ | $d_7$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $d_1$ | 1.000 | | | | | | |
| $d_2$ | 1.000 | 1.000 | | | | | |
| $d_3$ | 0.984 | 0.988 | 1.000 | | | | |
| $d_4$ | 0.894 | 0.882 | 0.800 | 1.000 | | | |
| $d_5$ | -0.008 | 0.018 | 0.171 | -0.455 | 1.000 | | |
| $d_6$ | 0.441 | 0.464 | 0.594 | -0.008 | 0.894 | 1.000 | |
| $d_7$ | 0.279 | 0.304 | 0.446 | -0.180 | 0.958 | 0.985 | 1.000 |
 
 
 
We reduce the inner dimension to two by taking only the two largest
singular values from $S$ and leaving the rest of the dimensions out of
the matrices $U$ and $V$. Now the similarity of the documents can be
compared using the matrix $B = S V^T$. If the columns of $B$ are scaled
to unit length, it is easy to calculate the correlations between
documents as inner products of the columns. This kind of a scaled matrix
is in table 7. (Similarity of words could be compared from $U S$.) From
the correlation matrix (table 8) we see that the Formula 1 and astronomy
related articles correlate much more strongly within each group than
across the groups. The fifth and seventh documents, which share no words
and were therefore totally uncorrelated before, are now clearly
correlated. We have projected the data to a two-dimensional space, and
similar articles have ended up near each other in that reduced space.
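
The step from the scaled two-dimensional matrix to the document correlations can be sketched in plain Python. The numbers are copied from the scaled matrix in table 7; the names `scaled`, `similarity` and `corr` are illustrative. Because each column already has unit length, the inner product of two columns is the cosine of the angle between the two documents.

```python
# Two-dimensional document coordinates, one column per document (table 7).
scaled = [
    [-0.913, -0.924, -0.971, -0.634, -0.400, -0.768, -0.646],
    [-0.407, -0.384, -0.238, -0.773, 0.917, 0.640, 0.764],
]
n_docs = len(scaled[0])


def similarity(i, j):
    """Inner product of the unit-length columns i and j (0-based)."""
    return sum(row[i] * row[j] for row in scaled)


corr = [[similarity(i, j) for j in range(n_docs)] for i in range(n_docs)]

# Print the lower triangle in the layout of the correlation table.
for i in range(n_docs):
    print(" ".join(f"{corr[i][j]:6.3f}" for j in range(i + 1)))
```

Small deviations in the third decimal are expected, since the inputs are already rounded to three decimals.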