Laboratory of Computer and Information Science / Neural Networks Research Centre CIS Lab Helsinki University of Technology
Morpheme lattice

Background information on morpheme discovery

In the theory of linguistic morphology, morphemes are considered to be the smallest meaning-bearing elements of language. Any word form can be expressed as a combination of morphemes, as for instance the following English words: affect+ion+ate, dinner+s, eat+ing, king+'s, open+mind+ed+ness.
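As a small illustration of the "+" notation used in the examples above, the following Python sketch splits a segmented word form into its morphemes. The function name `split_morphemes` and the use of "+" as the only boundary marker are assumptions made for this illustration. Rejoining the pieces by plain concatenation happens to reproduce the spelling of these particular examples, which is not guaranteed in general.

```python
from typing import List


def split_morphemes(segmented: str, boundary: str = "+") -> List[str]:
    """Split a segmented word form such as 'eat+ing' into its morphemes.

    Hypothetical helper: assumes `boundary` never occurs inside a morpheme.
    """
    return segmented.split(boundary)


# The English examples given in the text.
examples = ["affect+ion+ate", "dinner+s", "eat+ing", "king+'s", "open+mind+ed+ness"]

for word in examples:
    morphemes = split_morphemes(word)
    # Print the concatenated form next to its list of morphemes.
    print("".join(morphemes), morphemes)
```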

Automated morphological analysis would likely benefit many natural language applications that deal with large vocabularies, such as speech recognition and machine translation. Many existing applications use words as vocabulary units. However, for some languages, e.g., Finnish and Turkish, a word-based vocabulary is infeasible, because the number of possible word forms is very high.

The figure below shows how the number of different word forms (word types) increases when going through a large English or Finnish corpus. For example, when 10 million word tokens have been observed, there are fewer than 100 000 English word types, but already more than 800 000 Finnish word types.

Word tokens versus word types
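
The kind of curve described above can be computed by counting distinct word forms while reading a corpus token by token. The following is a minimal Python sketch, not the procedure used to produce the figure: the function name `type_growth`, the sampling parameter `step`, and the toy corpus are assumptions made for illustration, and the tokens are assumed to be already split and normalised.

```python
from typing import Iterable, List, Tuple


def type_growth(tokens: Iterable[str], step: int = 1) -> List[Tuple[int, int]]:
    """Return (tokens_seen, types_seen) pairs, sampled every `step` tokens."""
    seen = set()   # distinct word forms (types) observed so far
    curve = []
    n = 0
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            curve.append((n, len(seen)))
    # Add the final point if the corpus length is not a multiple of `step`.
    if n and n % step != 0:
        curve.append((n, len(seen)))
    return curve


# Toy corpus (hypothetical data): 11 tokens, 5 types.
toy_corpus = "the cat saw the dog and the dog saw the cat".split()
curve = type_growth(toy_corpus, step=4)
print(curve)  # [(4, 3), (8, 5), (11, 5)]
```

On a real corpus, plotting the second value against the first for each language would give one curve per language, comparable in form to those in the figure.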

A Finnish word typically consists of several morphemes. Often a stem is followed by multiple suffixes. Compound words are common, containing an alternation of stems and suffixes (and sometimes prefixes). For instance:


Page maintained by morpho at mail.cis.hut.fi, last updated Thu Oct 6 11:50:09 2016