META-NET Challenge: Context in Machine Translation
v1.1, Jun 17 2012
=================================================
http://www.cis.hut.fi/icann2011/con-txt-mt11/
=================================================
CONTENTS
========
List of important files
Challenge data set
Challenge task
Eurovoc descriptors
Acknowledgments
LIST OF IMPORTANT FILES
=======================
README.txt This file
train/en-fi/*.xml English-Finnish training documents
train/el-fr/*.xml Greek-French training documents
test/en-fi/*.xml English-Finnish test documents
test/el-fr/*.xml Greek-French test documents
scripts/ Example scripts
EUROVOC.txt Multilingual EUROVOC descriptors
CHALLENGE DATA SET
==================
The challenge data set consists of documents from the JRC-ACQUIS
Multilingual Parallel Corpus (v3.0).
Two language-pair directions are included: English->Finnish (en-fi)
and Greek->French (el-fr). The constructed challenge training data set
contains the document context, n-best lists for translated documents,
additional contextual information, and the reference
translations. The reference translations are omitted from the documents
in the test set.
Each document is represented in one single file encoded in UTF-8.
The file format is:
...
[[ ... ]]
[ ... ]
...
...
Each document can contain both translated segments and contextual
segments. Contextual segments do not have translation candidates. All
source segments, candidate translations and reference translations are
tokenized (e.g., punctuation characters are separate tokens). The
segments in each document are in the order they appear in the
document. The candidate translations are in no specific order. Below
is an explanation and an example of each element in the document
structure.
* id, document identifier:
Positive integers (non-consecutive).
Example:
* eurovoc, EUROVOC document descriptor IDs:
Comma-separated list of EUROVOC descriptor IDs.
Example:
* id, segment identifier:
Positive integers.
Example:
* src, source text:
Text in source language, tokenized.
Example: Agreement
* ref, reference translation:
Text in target language, tokenized.
Example: [ sopimus ]
* cand, candidate translation:
Translation hypothesis in target language, tokenized.
Example: sopimuksen
* id, candidate identifier:
Positive integer.
Example: ...
* sys, MT system used to produce the candidate translation:
An integer; possible values: 1, 2, 3, or 4.
Example: ...
* scores, candidate scores:
MT system specific scores for the translation.
Lower absolute value is better. The scores are MT system specific
and cannot be directly compared.
Example: ...
* total, candidate total score:
MT system specific total score for the translation.
Lower absolute value is better.
Example: ...
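The elements above suggest a straightforward reading loop. The sketch below uses only Python's standard library; note that the element names (seg, src, cand) and the id attribute are assumptions inferred from the field names listed above, not the actual tags used in the challenge files, so adjust them to match the real format.

```python
import xml.etree.ElementTree as ET

def read_document(path):
    """Yield (segment id, source text, [(candidate id, text), ...]) for each
    translated segment; contextual segments (no candidates) are skipped."""
    root = ET.parse(path).getroot()
    for seg in root.iter("seg"):                   # assumed element name
        src = (seg.findtext("src") or "").strip()  # assumed element name
        cands = [(c.get("id"), (c.text or "").strip())
                 for c in seg.findall("cand")]     # assumed element name
        if cands:
            yield seg.get("id"), src, cands
```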
The test set (published later) will be in the same format, except the
reference translations will not be included.
CHALLENGE TASK
==============
The task is to select the best translation candidate for each source
segment in the test data, published later. To accomplish the challenge
task, any additional data sources may be used alongside the included
context information, with the exception of the JRC-ACQUIS corpus
itself. The selection can be limited to the outputs of a single MT
system (e.g. 1 or 2) or can encompass all four MT system outputs.
Please list all additional sources used in your submission. A script to
evaluate the submission with the BLEU score will be published, but
other means of evaluation will also be approved and encouraged.
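As a starting point, the per-candidate total score already yields a trivial baseline. The sketch below (plain Python; data loading and the function name select_baseline are illustrative, not part of the challenge tooling) picks, for each segment, the candidate with the smallest absolute total score, following the scoring convention described above.

```python
def select_baseline(candidates):
    """candidates: list of (candidate_id, total_score) pairs for one segment.
    Return the id of the candidate whose |total| is smallest
    ("lower absolute value is better")."""
    best_id, _best_score = min(candidates, key=lambda c: abs(c[1]))
    return best_id
```

Note that because scores are MT-system specific, such a comparison is only meaningful within the output of a single system.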
The submission should include a file that contains the selected
translations, with one integer triplet (document-id,
segment-id, candidate-id) per line. The file should be sorted first by
document number and then by segment number. Example submission file
(with comments):
2 1 14 # Document 2, segment 1, selected candidate 14
2 3 6 # Document 2, segment 3, selected candidate 6
2 4 154 # Document 2, segment 4, selected candidate 154
14 2 1 # Document 14, segment 2, selected candidate 1
14 3 3 # Document 14, segment 3, selected candidate 3
...
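The required layout and sort order can be produced with a few lines of Python. The helper name write_submission below is hypothetical, not part of the challenge tooling.

```python
def write_submission(selections, path):
    """Write a submission file: one "doc-id seg-id cand-id" triplet per
    line, sorted by document number, then by segment number.
    selections: iterable of (document_id, segment_id, candidate_id) ints."""
    rows = sorted(selections, key=lambda t: (t[0], t[1]))
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, seg_id, cand_id in rows:
            f.write(f"{doc_id} {seg_id} {cand_id}\n")
```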
The submission will be evaluated on a separate test set, which will be
published well before the submission deadline. The submission should
contain a selection for each segment that has candidate translations in
the data. The details of the submission procedure will be announced later.
EUROVOC DESCRIPTORS
===================
EUROVOC is a multilingual, multidisciplinary thesaurus covering the
activities of the EU, the European Parliament in particular. The
challenge documents contain one or more numeric EUROVOC descriptor
IDs. For more information, see .
The file EUROVOC.txt is tab-delimited and encoded in UTF-8. It
contains the 6797 EUROVOC descriptor IDs and the corresponding text in
all 4 languages (EN, EL, FR, FI) participating in the challenge (first
5 columns). The subsequent 15 columns list the 2nd-level
domains that each descriptor belongs to. You can easily infer the
top-level domain by taking only the first two digits of the 4-digit
2nd-level domain code. Please take special care with the domains
whose codes start with a zero, such as "04 POLITICS" and
"08 INTERNATIONAL RELATIONS": always treat the codes as strings, not numbers.
EUROVOC top-level domains:
---------------------------
04 POLITICS
08 INTERNATIONAL RELATIONS
10 EUROPEAN COMMUNITIES
12 LAW
16 ECONOMICS
20 TRADE
24 FINANCE
28 SOCIAL QUESTIONS
32 EDUCATION AND COMMUNICATIONS
36 SCIENCE
40 BUSINESS AND COMPETITION
44 EMPLOYMENT AND WORKING CONDITIONS
48 TRANSPORT
52 ENVIRONMENT
56 AGRICULTURE, FORESTRY AND FISHERIES
60 AGRI-FOODSTUFFS
64 PRODUCTION, TECHNOLOGY AND RESEARCH
66 ENERGY
68 INDUSTRY
72 GEOGRAPHY
76 INTERNATIONAL ORGANISATIONS
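The EUROVOC.txt layout described above can be loaded with the standard csv module. The sketch below assumes exactly the column order stated (ID, then EN/EL/FR/FI labels, then 2nd-level domain codes); load_descriptors is a hypothetical helper name. Domain codes are kept as strings so that leading zeros (e.g. "04" for POLITICS) survive, and the top-level domain is just the first two characters of a 4-digit code.

```python
import csv

def load_descriptors(path):
    """Map descriptor ID -> ([EN, EL, FR, FI labels], [top-level domains])."""
    table = {}
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            desc_id, labels, domains = row[0], row[1:5], row[5:]
            # Slice as strings, never int(), to preserve leading zeros.
            top_levels = sorted({d[:2] for d in domains if d})
            table[desc_id] = (labels, top_levels)
    return table
```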
ACKNOWLEDGMENTS
===============
Special thanks to the META-NET partners:
* FBK, Human Language Technology, Fondazione Bruno Kessler
* DFKI, Language Technology Lab
* ILSP, Institute for Language and Speech Processing
* CNRS/LIMSI, Centre National de la Recherche Scientifique,
Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
* RWTH Aachen, Human Language Technology and Pattern Recognition