Adaptive Natural Language Processing

Our goal is to learn representations that can be used for the recognition, understanding and generation of language. This goal comprises three interrelated tasks: (1) the discovery of elements of representation (e.g. words, morphemes, phonemes), (2) the discovery of their meaning relations (syntax and semantics), and (3) the modeling of structures or "rules" of their use in natural utterances (syntax and pragmatics). The research is part of the activities of two research groups, Multimodal Interfaces and Computational Cognitive Systems.

Research topics

1. Discovery of units of representation

Morpheme discovery

The goal is to develop data-driven methods for unsupervised morphology induction, that is, methods that discover the regularities behind word formation in natural languages without annotated training data. For more information, see the page of the Morpho project.

Keywords: morphology induction, unsupervised morpheme discovery, minimum description length principle (MDL), applications of morphemes in NLP
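
The minimum description length principle named in the keywords can be illustrated with a toy cost function. The sketch below is not the Morpho project's model; it assumes a simplified two-part code in which the cost of a segmentation is the number of bits needed to spell out the morph lexicon plus the number of bits needed to encode the corpus as a sequence of morphs. The function name `description_length`, the `alphabet_size` parameter and the example words are all hypothetical.

```python
import math
from collections import Counter

def description_length(segmented_corpus, alphabet_size=26):
    """Two-part MDL cost in bits: lexicon cost plus corpus cost (toy version)."""
    tokens = [morph for word in segmented_corpus for morph in word]
    counts = Counter(tokens)
    total = len(tokens)
    # Lexicon cost: spell each distinct morph letter by letter,
    # with one extra symbol marking the end of the morph.
    bits_per_symbol = math.log2(alphabet_size + 1)
    lexicon_bits = sum((len(morph) + 1) * bits_per_symbol for morph in counts)
    # Corpus cost: each morph token costs -log2 of its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits

# The same six word forms, once as whole words and once split into stem + suffix.
unsegmented = [["walked"], ["walking"], ["talked"],
               ["talking"], ["jumped"], ["jumping"]]
segmented = [["walk", "ed"], ["walk", "ing"], ["talk", "ed"],
             ["talk", "ing"], ["jump", "ed"], ["jump", "ing"]]

cost_whole = description_length(unsegmented)
cost_split = description_length(segmented)
print(f"whole words: {cost_whole:.1f} bits, stem + suffix: {cost_split:.1f} bits")
```

Under these assumptions the split analysis is cheaper, because the shared stems and suffixes are stored only once in the lexicon. An unsupervised learner following this idea would search over segmentations for the one with the lowest total cost.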

Selected publications:

Mathias Creutz and Krista Lagus (2007). Unsupervised Models for Morpheme Segmentation and Morphology Learning. ACM Transactions on Speech and Language Processing, Volume 4, Issue 1, January 2007.
Oskar Kohonen, Sami Virpioja, and Mikaela Klami (2009). Allomorfessor: Towards unsupervised morpheme analysis. In Evaluating Systems for Multilingual and Multimodal Information Access: 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, September 17-19, 2008, Revised Selected Papers. Lecture Notes in Computer Science, 5706.

Term discovery

Selected publications:

Mari-Sanna Paukkeri, Ilari T. Nieminen, Matti Pöllä and Timo Honkela (2008). A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). Manchester, UK, August 2008.

2. Discovery of meaning relations between words/morphemes

The research on emergent linguistic and cognitive representations enables computers to deal with semantics: to process data with some access to its meaning and, eventually, to its context of use.

Keywords: Self-organising semantic maps, SOM, Word ICA, Latent Semantic Analysis (LSA), random mapping (RM), word spaces, conceptual spaces

More on emergence of linguistic representations for words.

Word sense discovery and disambiguation

A specific task in NLP where the representation of word meaning is important is the twin problem of word sense discovery and word sense disambiguation. Discovery is the process of uncovering the possible meanings of a given word, and disambiguation is the determination of which meaning is intended in a given instance or context of the word.
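
The disambiguation step can be illustrated with a very small sketch. It does not reproduce the document-space method of the publications listed below; it assumes a much simpler stand-in in which each sense comes with a hand-written set of signature words and the sense with the largest overlap with the context is chosen. The function `disambiguate`, the sense labels and the signature words are hypothetical.

```python
def disambiguate(context_words, sense_signatures):
    """Pick the sense whose signature shares the most words with the context.

    Ties are resolved in favour of the sense listed first.
    """
    context = {w.lower() for w in context_words}
    return max(sense_signatures,
               key=lambda sense: len(context & sense_signatures[sense]))

# Hypothetical sense inventory for the ambiguous word "bank".
SENSES = {
    "bank/finance": {"money", "loan", "account", "deposit"},
    "bank/river": {"river", "water", "shore", "fishing"},
}

sense_a = disambiguate("she opened an account to deposit money".split(), SENSES)
sense_b = disambiguate("they went fishing on the river bank".split(), SENSES)
print(sense_a, sense_b)
```

In this sketch the sense inventory is given in advance. Sense discovery would instead have to derive such groupings from data, for example by clustering the contexts in which the word occurs.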

Selected publications:

Lindén, K. (2004). Evaluation of Linguistic Features for Word Sense Disambiguation with Self-Organized Document Maps. Computers and the Humanities, December 2004. Keywords: soft clustering, linguistic features, word-sense disambiguation, document space
Lindén, K. and Lagus, K. (2002). Word Sense Disambiguation in Document Space. 2002 IEEE Int. Conference on Systems, Man and Cybernetics, Tunisia, October 6-9, 2002. Electronic publication (CD-ROM).

3. Modeling of sequential patterns of the elements

Discovery of constructions

Selected publications:

Krista Lagus, Oskar Kohonen, and Sami Virpioja (2009). Towards unsupervised learning of constructions from text. In Proceedings of the Workshop on Extracting and Using Constructions in NLP of the 17th Nordic Conference on Computational Linguistics, NODALIDA, May 2009. SICS Technical Report T2009:10.

Statistical language modeling

Statistical language modeling aims to find models that accurately estimate the probabilities of natural language sequences or utterances. Language models are essential in many NLP applications, such as speech recognition and machine translation. Our research concentrates on efficient ways of representing the relevant probabilities by applying unsupervised machine learning methods.
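
As a minimal illustration of estimating sequence probabilities, the sketch below builds a bigram model with add-k smoothing. This is a deliberately simple stand-in, not the group's method; the publications below work with more refined techniques such as Kneser-Ney smoothing and morph-based vocabularies. The class name `BigramModel`, the parameter `k` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

class BigramModel:
    """Bigram language model with add-k smoothing (toy version)."""

    def __init__(self, sentences, k=1.0):
        self.k = k
        self.history_counts = Counter()  # how often each token occurs as a history
        self.bigram_counts = Counter()   # how often each (previous, current) pair occurs
        self.vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + sentence + ["</s>"]
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.history_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def prob(self, prev, cur):
        """Smoothed estimate of P(cur | prev)."""
        v = len(self.vocab)
        return ((self.bigram_counts[(prev, cur)] + self.k)
                / (self.history_counts[prev] + self.k * v))

    def log_prob(self, sentence):
        """Natural-log probability of a whole sentence, including boundaries."""
        tokens = ["<s>"] + sentence + ["</s>"]
        return sum(math.log(self.prob(p, c)) for p, c in zip(tokens, tokens[1:]))

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
model = BigramModel(corpus)

seen = model.log_prob(["the", "cat", "sat"])
scrambled = model.log_prob(["sat", "cat", "the"])
print(f"seen order: {seen:.2f}, scrambled order: {scrambled:.2f}")
```

The smoothing constant keeps unseen word pairs from receiving zero probability, so the model can score any sequence over its vocabulary while still preferring word orders it has observed.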

Selected publications:

Creutz, M., Hirsimäki, T., Kurimo, M., Puurula, A., Pylkkönen, J., Siivola, V., Varjokallio, M., Arisoy, E., Saraçlar, M., and Stolcke, A. (2007). Morph-based speech recognition and modeling of out-of-vocabulary words across languages. ACM Transactions on Speech and Language Processing, Volume 5, Issue 1, Dec 2007.
Vesa Siivola, Teemu Hirsimäki and Sami Virpioja (2007). On Growing and Pruning Kneser-Ney Smoothed N-Gram Models. IEEE Transactions on Audio, Speech and Language Processing, Volume 15, Issue 5, July 2007, pp. 1617-1624.
Hirsimäki, T., Creutz, M., Siivola, V., Kurimo, M., Virpioja, S., and Pylkkönen, J. (2006). Unlimited Vocabulary Speech Recognition with Morph Language Models Applied to Finnish. Computer Speech and Language, Volume 20, Issue 4, October 2006, pp. 515-541.
Kurimo, M. and Lagus, K. (2002). An Efficiently Focusing Large Vocabulary Language Model. In International Conference on Artificial Neural Networks (ICANN'02), Madrid, Spain, August 28-30, 2002. pp. 1068-1073.
