It is preferable to carry out the project work before attending an exam.
The project work reports that are returned by 31.5.2002 will be graded during spring 2002. Reports returned later are graded at the earliest convenience of the course personnel.
Finished reports may be returned to the post box labelled "T-61.281 harjoitustyö" in front of infolabra (3rd floor of the T-building).
The report should begin with a title page containing the course code and name, student name(s) and ID(s), and the topic.
In the report, briefly describe the research problem, the methods used, the experiments carried out, the results, and the conclusions, and list the references.
If you use data sets other than the given ones, also describe the data set and append samples of it to the report.
Attach program code as an appendix. If, in addition, you use some ready-made programs or tools, mention these in the report.
The length of the report should be 5-10 pages, not counting the program code.
A pair writes only one report, which also describes how the work was divided between the two. The report may be somewhat longer (10-15 pages) to reflect the extended content of the project.
Working in pairs is especially recommended if you want to go more deeply into a topic and the workload would otherwise become too heavy.
A grade of 1 can lower the course grade, and correspondingly a grade of 5 can raise it, when the points obtained in the exam are close to a grade boundary (within 1-2 points of it). The exception is a course grade of 5, which cannot be raised.
Alternatively, you can choose only one method and apply it to both languages, and consider/analyze the suitability of the method for each language.
Note: the dictionary data is included only in case someone wants to apply a dictionary-based method instead. It is not necessarily needed.
Produce at least two pseudo words (i.e. pairs or combinations of several words) and apply the methods to those. Report results on a separate test set divided from the STT data.
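As a rough illustration of the pseudo word idea, the sketch below merges two words into one artificial ambiguous token while keeping the original word as the correct answer for evaluation. It is only one possible way to do this: the function name make_pseudoword, the joined token format (a_b) and the tab-separated output are assumptions made for this example, and the input is assumed to be whitespace-tokenized text with one sentence per line.

```sh
# Hypothetical helper (not part of the course material):
# make_pseudoword WORD1 WORD2
# Reads whitespace-tokenized text from stdin, one sentence per line.
# For every occurrence of WORD1 or WORD2 it prints one line:
#   <original word> TAB <sentence with that occurrence replaced by WORD1_WORD2>
# Sentences containing neither word produce no output.
# Assumes the two words are plain strings without backslashes.
make_pseudoword() {
    awk -v a="$1" -v b="$2" '
    {
        for (i = 1; i <= NF; i++) {
            if ($i == a || $i == b) {
                orig = $i            # remember the correct "sense"
                $i = a "_" b         # substitute the pseudo word
                print orig "\t" $0   # label, then the modified sentence
                $i = orig            # restore before checking the next field
            }
        }
    }'
}
```

For example, `gzcat MyData.txt.gz | make_pseudoword banaani ovi` would produce labelled instances that can then be split into training and test sets. Note that assigning to a field makes awk rejoin the line with single spaces, so the original spacing is not preserved.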
Send a description of about half an A4 page of the topic you suggest, containing the research problem, the data set you propose to examine and have at your disposal, and the methods you plan to apply. If necessary, discuss and refine the topic with the lecturer.
If you send the topic suggestion during spring 2002, you will get feedback and an approval decision within a week.
Corpus analysis and tools are discussed in book chapter 3, which is
suggested reading.
Perl, a too short description
Select a certain field (the third from the left) from each line when the field separator is whitespace, then compress the results and direct them to the file res.txt.gz:
gzcat MyData.txt.gz | awk '{ print $3 }' | gzip -c > res.txt.gz
Select a certain field (the third from the left) from each line when the field separator is a colon:
gzcat MyData.txt.gz | awk -F':' '{ print $3 }'
Replace all uppercase characters between A and Z with the corresponding lowercase ones:
gzcat MyData.txt.gz | tr "[A-Z]" "[a-z]"
Select lines that contain the string 'important':
gzcat MyData.txt.gz | grep 'important'
Remove all lines containing the string 'foobar':
gzcat MyData.txt.gz | grep -v 'foobar'
Replace all occurrences of string 'banaani' with string 'banaaniovi':
gzcat MyData.txt.gz | perl -e 'while(<>) { s/banaani/banaaniovi/g; print;}'