T-61.281 Statistical natural language processing

Project work


General info

The purpose of the project work is to apply one or two statistical methods to a natural language problem presented in class. You should write a report on the work and its results. The work can be thought of as a small research project.

It is preferable to complete the project work before taking the exam.

Project reports returned by 31.5.2002 will be graded during spring 2002. Reports returned later will be graded at the earliest convenience of the course personnel.

Finished reports may be returned to the post box labeled "T-61.281 harjoitustyö" in front of infolabra (3rd floor of the T-building).

Project report

Write a report describing your work.

The report should begin with a title page containing the course code and name, student name(s) and ID(s), and the topic.

In the report, briefly describe the research problem, the methods used, the experiments carried out, the results, and the conclusions, and include a list of references.

If you use a data set other than the given ones, also describe the data set and append samples of it to the report.

Attach program code as an appendix. If, in addition, you use some ready-made programs or tools, mention these in the report.

The length of the report should be 5-10 pages, not counting the program code.

Working in pairs

It is possible to do the project jointly with a partner. In this case, an extended version of the project should be carried out, for example by applying the methods to larger or additional data sets, by using several methods, or by extending the work in some other way.

Only one report is written, and it should also describe how the work was divided between the two students. The report may be somewhat longer (10-15 pages) to reflect the extended content of the project.

Working in pairs is especially recommended if you want to go more deeply into a topic and the workload would otherwise become too heavy.


Project work is graded as 5, 3, 1, or failed, where 5 means excellent and 1 means pass.

A project grade of 1 can lower the course grade, and correspondingly a 5 can raise it, when the points obtained in the exam are within 1-2 points of a grade boundary. The exception is a course grade of 5, which cannot be raised.


1. Word sense disambiguation

Apply two different methods to word sense disambiguation. One of the methods should be unsupervised and the other supervised. Apply the methods either to Finnish or English data sets. Analyze the benefits and the problems of the methods.

Alternatively, you can choose only one method, apply it to both languages, and analyze how suitable the method is for each language.
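
As an illustration of what a supervised method might look like, here is a minimal sketch of a Naive Bayes classifier over bag-of-words contexts with add-one smoothing. Naive Bayes is only one possible choice and is not prescribed by this assignment. The function names and the toy training data (contexts for the English word "bank") are invented for this example and do not come from the Senseval data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count senses and context words from (context_words, sense) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for context_words, sense in examples:
        sense_counts[sense] += 1
        for word in context_words:
            word_counts[sense][word] += 1
            vocabulary.add(word)
    return sense_counts, word_counts, vocabulary


def classify(context_words, model):
    """Return the sense with the highest log posterior for the context."""
    sense_counts, word_counts, vocabulary = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior of the sense
        # Add-one smoothing: denominator is word total plus vocabulary size.
        denom = sum(word_counts[sense].values()) + len(vocabulary)
        for word in context_words:
            if word in vocabulary:  # words unseen in training are skipped
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Toy, invented training data (not taken from Senseval).
toy_training = [
    (["money", "loan", "account"], "finance"),
    (["deposit", "money", "interest"], "finance"),
    (["river", "water", "fishing"], "river"),
    (["river", "mud", "grass"], "river"),
]
model = train_naive_bayes(toy_training)
predicted = classify(["money", "interest", "rate"], model)
print(predicted)
```

In a real experiment the context words would come from a window around each occurrence of the ambiguous word, and choices such as window size and stop-word removal are worth discussing in the report.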

English data set: Senseval

Pick at least two words to be disambiguated from the Senseval data. Report results for the same words on the Senseval test data. You can also compare your results with those obtained by other methods in the Senseval competition.

Note: the dictionary data is included only in case someone wants to apply a dictionary-based method instead. It is not necessarily needed.

Finnish data set: STT

This data set does not include correct sense taggings for any ambiguous words. However, one can use it as data for the pseudo-word disambiguation problem. For example, create a pseudo-ambiguous word by replacing all occurrences of the words 'banaani' and 'ovi' with the ambiguous word 'banaaniovi'. The original words are thus the correct senses to be recognized.
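
The pseudo-word construction can be sketched roughly as follows. This is a minimal illustration, not a required implementation: the function name and the toy sentences are invented for the example, and it matches exact word forms only, so inflected forms such as 'banaanin' are left untouched. Whether and how to handle inflection is a design decision for your own work.

```python
import re


def make_pseudoword_examples(sentences, originals, pseudoword):
    """Build (tokens, target_index, true_word) triples.

    Every exact occurrence of a word in `originals` is replaced by
    `pseudoword`; the replaced word is kept as the correct "sense".
    """
    originals = set(originals)
    examples = []
    for sentence in sentences:
        # \w+ matches Unicode letters, so Finnish characters are kept.
        tokens = re.findall(r"\w+", sentence.lower())
        replaced = [pseudoword if t in originals else t for t in tokens]
        for i, token in enumerate(tokens):
            if token in originals:
                examples.append((replaced, i, token))
    return examples


# Toy sentences invented for this example (not from the STT data).
toy_sentences = [
    "Banaani on keltainen.",
    "Ovi on auki.",
    "Hän söi banaanin.",  # inflected form: not matched by this sketch
]
examples = make_pseudoword_examples(
    toy_sentences, ["banaani", "ovi"], "banaaniovi"
)
for tokens, index, true_word in examples:
    print(tokens, index, true_word)
```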

Produce at least two pseudo-words (i.e. pairs or combinations of several words) and apply the methods to them. Report results on a separate test set held out from the STT data.
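
One possible way to hold out a test set and score the results is sketched below. The 20% test fraction, the fixed random seed, and the most-frequent-word baseline are assumptions made for this illustration, not requirements of the assignment, and all function names are hypothetical.

```python
import random
from collections import Counter


def split_examples(examples, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and hold out a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, gold):
    """Fraction of predictions that equal the correct label."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def most_frequent_baseline(train_labels, n_test):
    """Always predict the label that is most common in the training set."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n_test


# Toy labelled examples: (example id, original word), invented for this sketch.
toy_examples = [(i, "banaani") for i in range(7)] + [
    (i, "ovi") for i in range(7, 10)
]
train, test = split_examples(toy_examples)
train_labels = [label for _, label in train]
test_labels = [label for _, label in test]
baseline = most_frequent_baseline(train_labels, len(test_labels))
print("baseline accuracy:", accuracy(baseline, test_labels))
```

Reporting such a baseline next to your own results makes it easier to judge how much a method actually gains, especially when one of the original words is much more frequent than the other.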

2. Information retrieval


CACM data set


Apply two different methods, at least one of which is one of the following:

Compare the suitability of the methods for the task and discuss the advantages and disadvantages of each method.

3. Individual topic

The project work can be carried out on an individual topic, as well. First you should obtain approval for your topic from the lecturer, as follows:

Send a description of the topic you suggest, about half an A4 page long, covering the research problem, the data set you propose to examine and have at your disposal, and the methods you are considering. If necessary, discuss and refine the topic with the lecturer.

If you send the topic suggestion during spring 2002, you will get feedback and an approval decision within a week.

Practical hints for preprocessing etc.

Corpus analysis and tools are discussed in book chapter 3, which is suggested reading.

Perl, a too short description

Unix tools

The data is in file MyData.txt.gz

Select a certain field (the 3rd from the left) from each line, when the field separator is whitespace, then compress the result and write it to file res.txt.gz:

  gzcat MyData.txt.gz | awk '{ print $3 }'  | gzip -c > res.txt.gz

Select a certain field (the 3rd from the left) from each line, when the field separator is a colon:

  gzcat MyData.txt.gz | awk -F':' '{ print $3 }'

Replace all uppercase characters between A and Z with the corresponding lowercase ones:

  gzcat MyData.txt.gz | tr 'A-Z' 'a-z'

Select lines that contain the string 'important':

  gzcat MyData.txt.gz | grep 'important'

Remove all lines containing the string 'foobar':

  gzcat MyData.txt.gz | grep -v 'foobar'

Replace all occurrences of string 'banaani' with string 'banaaniovi':

  gzcat MyData.txt.gz | perl -e 'while(<>) { s/banaani/banaaniovi/g; print;}'

Krista Lagus
Last modified: Thu May 23 13:28:39 EEST 2002