Some of our methods for morpheme discovery

In the so-called Morfessor Baseline method, we use raw text as training data to learn a model, i.e., a vocabulary of morphs. We construct a morph vocabulary, or lexicon of morphs, such that any word in the data can be formed by concatenating morphs from the lexicon. Each word in the data is rewritten as a sequence of morph pointers, which point to entries in the lexicon. We aim to find the optimal lexicon and segmentation, i.e., a set of morphs that is itself concise and also gives a concise representation of the data. This model is inspired by the Minimum Description Length (MDL) principle. The segmentation procedure resembles text segmentation (unsupervised discovery of word boundaries in text from which blanks have been removed), as no simplifying assumptions are made about the number of morphs per word.
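The trade-off between a concise lexicon and a concise representation of the data can be illustrated with a toy two-part cost. The sketch below is not the Morfessor Baseline cost function; the function name `description_length`, the uniform per-letter code, and the four-word corpus are assumptions made only for illustration.

```python
import math
from collections import Counter

def description_length(segmented_words):
    """Toy two-part cost in bits: lexicon cost plus corpus cost (illustrative only)."""
    tokens = [morph for word in segmented_words for morph in word]
    counts = Counter(tokens)
    total = len(tokens)
    # Lexicon cost: spell out every distinct morph letter by letter,
    # plus one end-of-morph marker, using a uniform code over the alphabet.
    alphabet = {ch for morph in counts for ch in morph}
    bits_per_char = math.log2(len(alphabet) + 1)
    lexicon_cost = sum((len(morph) + 1) * bits_per_char for morph in counts)
    # Corpus cost: each morph pointer costs -log2 of its relative frequency.
    corpus_cost = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_cost + corpus_cost

# Hypothetical corpus, once with whole words as morphs and once segmented.
unsegmented = [["walked"], ["walking"], ["talked"], ["talking"]]
segmented = [["walk", "ed"], ["walk", "ing"], ["talk", "ed"], ["talk", "ing"]]

cost_unsegmented = description_length(unsegmented)
cost_segmented = description_length(segmented)
```

In this toy corpus the segmented version has the lower total cost: the corpus needs more pointers, but the lexicon stores far fewer letters because "walk", "talk", "ed", and "ing" are each stored once.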

For additional information on the Morfessor Baseline method, see (Creutz and Lagus, 2002). An implementation of the algorithm is available in the Morfessor 1.0 and Morfessor 2.0 software. You can also try the on-line demonstration of the algorithm for Finnish, English, and Swedish.

In the so-called Morfessor Categories-ML method, we aim to improve the segmentation obtained using the Baseline method. A probabilistic model is trained in an unsupervised manner. The morphs are tagged with category labels, of which there are three: prefix, stem, and suffix. By learning the morph categories as well as the sequential dependencies between them, the segmentation can be refined, as it becomes possible to detect instances of over- or under-segmentation in the baseline segmentation.
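One common way to combine category labels with sequential dependencies is a hidden Markov model decoded with the Viterbi algorithm. The sketch below assumes such a model; all probabilities are hand-set for illustration rather than learned, and the names `viterbi_tags`, `FLOOR`, and the example word are hypothetical, not taken from the Morfessor software.

```python
import math

CATEGORIES = ("PRE", "STM", "SUF")  # prefix, stem, suffix
FLOOR = 1e-6  # hypothetical smoothing value for unseen morph/category pairs

def log(p):
    return math.log(p) if p > 0 else -math.inf

def viterbi_tags(morphs, start, trans, emit, end):
    """Return the most probable category sequence for a non-empty morph sequence."""
    # best[c] = (log-probability, tag path) of the best path ending in category c
    best = {c: (log(start[c]) + log(emit[c].get(morphs[0], FLOOR)), [c])
            for c in CATEGORIES}
    for morph in morphs[1:]:
        step = {}
        for c in CATEGORIES:
            e = log(emit[c].get(morph, FLOOR))
            step[c] = max(
                ((score + log(trans[prev][c]) + e, path + [c])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
        best = step
    final = [(score + log(end[c]), path) for c, (score, path) in best.items()]
    return max(final, key=lambda item: item[0])[1]

# Hand-set toy parameters (assumptions, not trained values).
start = {"PRE": 0.3, "STM": 0.7, "SUF": 0.0}   # a word cannot start with a suffix
end = {"PRE": 0.0, "STM": 0.5, "SUF": 0.5}     # a word cannot end with a prefix
trans = {
    "PRE": {"PRE": 0.1, "STM": 0.9, "SUF": 0.0},
    "STM": {"PRE": 0.1, "STM": 0.3, "SUF": 0.6},
    "SUF": {"PRE": 0.1, "STM": 0.2, "SUF": 0.7},
}
emit = {"PRE": {"un": 0.5}, "STM": {"happi": 0.5}, "SUF": {"ness": 0.5}}

tags = viterbi_tags(["un", "happi", "ness"], start, trans, emit, end)
# tags == ['PRE', 'STM', 'SUF']
```

A tag sequence that can only be produced with very low probability, such as a word starting with a suffix, is the kind of signal that could flag a questionable split in the baseline segmentation.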

For additional information on the Morfessor Categories-ML method, see (Creutz and Lagus, 2004). Categories-ML is also included in the on-line demonstration of Morfessor.

The Morfessor Categories-MAP model has a more sophisticated formulation than Categories-ML in that it is a complete maximum a posteriori model, which means that it does not need to rely on heuristics to determine the optimal size of the morph lexicon. The improvements over the Categories-ML model have been made possible by introducing a hierarchical lexicon structure: each morph in the lexicon consists either of a string of letters or of two submorphs, which are themselves present in the lexicon. The submorphs can in turn recursively consist of shorter submorphs. Not all morphs in the lexicon need to be "morpheme-like" in the sense that they carry meaning. Some morphs correspond more closely to syllables and other short fragments of words. The existence of these non-morphemes makes it possible to represent some longer morphs more economically. For example, the Finnish word "oppositio" consists of "op" (which has no meaning) and "positio" (which means "position").
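The hierarchical lexicon described above can be pictured as a small recursive data structure. The following sketch uses the "oppositio" example from the text; the class `Entry` and the functions `spell` and `stored_letters` are hypothetical names chosen for illustration and do not reflect the data structures of the Morfessor Categories-MAP software.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Entry:
    """A lexicon entry: either a string of letters or two submorph keys."""
    letters: Optional[str] = None
    parts: Optional[Tuple[str, str]] = None  # keys of two entries in the lexicon

lexicon = {
    "op": Entry(letters="op"),                    # non-morpheme fragment
    "positio": Entry(letters="positio"),          # "position"
    "oppositio": Entry(parts=("op", "positio")),  # stored as two pointers
}

def spell(key, lexicon):
    """Recursively recover the surface string of a lexicon entry."""
    entry = lexicon[key]
    if entry.parts is None:
        return entry.letters
    left, right = entry.parts
    return spell(left, lexicon) + spell(right, lexicon)

def stored_letters(lexicon):
    """Count the letters that are spelled out explicitly in the lexicon."""
    return sum(len(e.letters) for e in lexicon.values() if e.letters is not None)

surface = spell("oppositio", lexicon)
letters = stored_letters(lexicon)
```

Here "oppositio" adds two pointers to the lexicon instead of nine letters, which is the economy that the non-morpheme "op" makes possible.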

The hierarchical structure provides different mechanisms for preventing over- and under-segmentation than the heuristics used in Categories-ML. In a morpheme segmentation task, under-segmentation can be avoided by expanding a lexical item into the submorphs it consists of. In order not to create the opposite problem, over-segmentation, the substructures are only expanded as long as they do not contain non-morphemes.
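The expansion rule can be sketched as a short recursive function. This is a minimal illustration under stated assumptions: the tables `PARTS` and `NON_MORPHEMES`, the function `expand`, and the English example "handbags" are hypothetical, and the rule is read here as "do not expand an item if either of its two direct submorphs is a non-morpheme".

```python
# Composite entries map to their two submorphs; keys absent from PARTS
# are plain letter strings. All entries are made up for illustration.
PARTS = {
    "handbags": ("handbag", "s"),
    "handbag": ("hand", "bag"),
    "oppositio": ("op", "positio"),
}
NON_MORPHEMES = {"op"}  # fragments that carry no meaning

def expand(morph):
    """Expand a lexical item into submorphs, stopping before non-morphemes."""
    parts = PARTS.get(morph)
    if parts is None:
        return [morph]  # plain letter string: nothing to expand
    if any(part in NON_MORPHEMES for part in parts):
        return [morph]  # expanding would over-segment, so keep the item whole
    left, right = parts
    return expand(left) + expand(right)

print(expand("handbags"))   # ['hand', 'bag', 's']
print(expand("oppositio"))  # ['oppositio']
```

Expanding "handbags" fully avoids under-segmentation, while leaving "oppositio" intact avoids splitting off the meaningless fragment "op".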

For additional information on the Morfessor Categories-MAP method, see (Creutz and Lagus, 2005). An implementation of the algorithm is available in the Morfessor Categories-MAP software.