META-NET Challenge: Context in Machine Translation
v1.1, Jun 17 2012
=================================================
http://www.cis.hut.fi/icann2011/con-txt-mt11/
=================================================

CONTENTS
========

List of important files
Challenge data set
Challenge task
EUROVOC descriptors
Acknowledgments

LIST OF IMPORTANT FILES
=======================

README.txt           This file
train/en-fi/*.xml    English-Finnish training documents
train/el-fr/*.xml    Greek-French training documents
test/en-fi/*.xml     English-Finnish test documents
test/el-fr/*.xml     Greek-French test documents
scripts/             Example scripts
EUROVOC.txt          Multilingual EUROVOC descriptors

CHALLENGE DATA SET
==================

The challenge data set consists of documents from the JRC-ACQUIS
Multilingual Parallel Corpus (v3.0). Two language-pair directions are
included: English->Finnish (en-fi) and Greek->French (el-fr).

The training data set contains the document context, n-best lists of
candidate translations for the translated segments, additional
contextual information, and the reference translations. The reference
translations are omitted from the documents in the test set.

Each document is stored in a single XML file encoded in UTF-8. A
document can contain both translated segments and contextual segments;
contextual segments do not have translation candidates. All source
segments, candidate translations and reference translations are
tokenized (e.g., punctuation characters are separate tokens). The
segments in each file are in the order they appear in the document.
The candidate translations are in no specific order.

The elements of the document structure are the following:

* id, document identifier: a positive integer (non-consecutive).

* eurovoc, EUROVOC document descriptor IDs: a comma-separated list of
  EUROVOC descriptor IDs.

* id, segment identifier: a positive integer.

* src, source text: text in the source language, tokenized.
  Example: Agreement

* ref, reference translation: text in the target language, tokenized.
  Example: sopimus

* cand, candidate translation: a translation hypothesis in the target
  language, tokenized.
  Example: sopimuksen

* id, candidate identifier: a positive integer.

* sys, MT system used to produce the candidate translation: an
  integer; possible values are 1, 2, 3 and 4.

* scores, candidate scores: MT-system-specific scores for the
  translation. A lower absolute value is better. The scores are MT
  system specific and cannot be compared directly across systems.

* total, candidate total score: the MT-system-specific total score for
  the translation. A lower absolute value is better.

The test set (published later) will be in the same format, except that
the reference translations will not be included.

CHALLENGE TASK
==============

The task is to select the best translation candidate for each source
segment in the test data (published later). In addition to the context
information included in the data, any additional data sources may be
used, with the exception of the JRC-ACQUIS corpus itself. The selection
can be limited to the outputs of a single MT system (e.g. 1 or 2) or
can encompass all four MT system outputs. Please list all additional
sources used in your submission.

A script to evaluate submissions with the BLEU score will be published,
but other means of evaluation will also be approved and encouraged.

The submission should include a file that contains the selected
translations, with one integer triplet (document-id, segment-id,
candidate-id) per line. The file should be sorted first by document
number and then by segment number. Example submission file (with
comments):

2 1 14    # Document 2, segment 1, selected candidate 14
2 3 6     # Document 2, segment 3, selected candidate 6
2 4 154   # Document 2, segment 4, selected candidate 154
14 2 1    # Document 14, segment 2, selected candidate 1
14 3 3    # Document 14, segment 3, selected candidate 3
...

Submissions will be evaluated on a separate test set, which will be
published well before the submission deadline. A submission should
contain a selection for every segment that has candidate translations
in the data. The details of the submission procedure will be announced
later.
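As an illustration of the end-to-end flow, the sketch below picks, for
each translated segment, the candidate whose total score has the lowest
absolute value and writes the sorted triplet file described above. It
is a minimal baseline sketch, not one of the provided example scripts:
the XML tag names (doc, seg, cand) are assumptions, since only the
attribute names are documented here, and the output file name is made
up.

    import glob
    import xml.etree.ElementTree as ET

    def select_candidates(path_glob, out_path, systems=(1, 2, 3, 4)):
        """Pick one candidate per translated segment, write triplets."""
        triplets = []
        for path in glob.glob(path_glob):
            doc = ET.parse(path).getroot()   # assumed root tag: doc
            doc_id = int(doc.get("id"))
            for seg in doc.iter("seg"):      # assumed segment tag: seg
                cands = [c for c in seg.iter("cand")
                         if int(c.get("sys")) in systems]
                if not cands:                # contextual segment: skip
                    continue
                # A lower absolute total score is better; the totals
                # are MT-system specific, so restricting 'systems' to
                # one system keeps the comparison meaningful.
                best = min(cands,
                           key=lambda c: abs(float(c.get("total"))))
                triplets.append((doc_id, int(seg.get("id")),
                                 int(best.get("id"))))
        triplets.sort()                      # by document, then segment
        with open(out_path, "w") as out:
            for doc_id, seg_id, cand_id in triplets:
                out.write("{} {} {}\n".format(doc_id, seg_id, cand_id))

    select_candidates("test/en-fi/*.xml", "my-submission.txt",
                      systems=(1,))

Restricting the selection to a single system sidesteps the caveat above
that scores cannot be compared across MT systems; a selection over all
four outputs needs some comparable signal beyond the raw totals.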
EUROVOC DESCRIPTORS
===================

EUROVOC is a multilingual, multidisciplinary thesaurus covering the
activities of the EU, and those of the European Parliament in
particular. The challenge documents contain one or more numeric EUROVOC
descriptor IDs. For more information, see the EUROVOC website.

The file EUROVOC.txt is tab-delimited and encoded in UTF-8. Its first 5
columns contain the 6797 EUROVOC descriptor IDs and the corresponding
text in all 4 languages participating in the challenge (EN, EL, FR,
FI). The subsequent 15 columns list the 2nd-level domains that each
descriptor belongs to. The top-level domain can be inferred by taking
the first two digits of the 4-digit 2nd-level domain code. Please take
special care with the domains starting with a zero, such as "POLITICS"
(04) and "INTERNATIONAL RELATIONS" (08): always treat the codes as
strings, not as numbers (see the example sketch at the end of this
file).

EUROVOC top-level domains:
--------------------------
04 POLITICS
08 INTERNATIONAL RELATIONS
10 EUROPEAN COMMUNITIES
12 LAW
16 ECONOMICS
20 TRADE
24 FINANCE
28 SOCIAL QUESTIONS
32 EDUCATION AND COMMUNICATIONS
36 SCIENCE
40 BUSINESS AND COMPETITION
44 EMPLOYMENT AND WORKING CONDITIONS
48 TRANSPORT
52 ENVIRONMENT
56 AGRICULTURE, FORESTRY AND FISHERIES
60 AGRI-FOODSTUFFS
64 PRODUCTION, TECHNOLOGY AND RESEARCH
66 ENERGY
68 INDUSTRY
72 GEOGRAPHY
76 INTERNATIONAL ORGANISATIONS

ACKNOWLEDGMENTS
===============

Special thanks to the META-NET partners:

* FBK, Human Language Technology, Fondazione Bruno Kessler
* DFKI, Language Technology Lab
* ILSP, Institute for Language and Speech Processing
* CNRS/LIMSI, Centre National de la Recherche Scientifique, Laboratoire
  d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
* RWTH Aachen, Human Language Technology and Pattern Recognition
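As noted in the EUROVOC DESCRIPTORS section, the domain codes must be
handled as strings, or the leading zero of e.g. "04" (POLITICS) is
silently lost. The sketch below reads EUROVOC.txt along the column
layout described there; the function name and the assumption that the
descriptor ID is the first of the five leading columns are
illustrative.

    def load_eurovoc(path="EUROVOC.txt"):
        """Map descriptor ID -> (texts, 2nd-level, top-level domains)."""
        descriptors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                desc_id, texts = cols[0], cols[1:5]  # ID + 4 languages
                domains = [c for c in cols[5:20] if c]  # 2nd-level codes
                # Top-level domain = first two digits of the 4-digit
                # code, kept as a string: "0431"[:2] -> "04", not 4.
                top = sorted({d[:2] for d in domains})
                descriptors[desc_id] = (texts, domains, top)
        return descriptors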