Introduction to the problem.

Many important pattern recognition problems arise from situations where a source has a finite number of states, and in each state it emits patterns with particular stationary characteristics. This assumption can also serve as a successful approximation in many practical tasks. One example is automatic speech recognition (ASR), where the pattern sequences are produced as a speaker modifies his articulatory mechanism. Another common example is the automatic recognition of cursive handwriting, which consists of the arcs and lines drawn to form the letters.

A straightforward approach to recovering the state sequence of the system is first to divide the pattern sequence into short segments corresponding to the different states, and then to classify the segments using stationary models. The final recognition result is obtained by classifying the sequence of these intermediate results. However, the segmentation and final classification problems are sometimes tied closely together, as in the two examples above: the most likely rival candidates may require different segmentations, so a separate segmentation phase may destroy some of the information that is essential for the classification.

In pattern sequence recognition problems, the information about the state of the system is often not transmitted explicitly, but only through the stochastic features of the emitted patterns. The feature vectors computed from successive observations then form a discrete-time stochastic process in which the individual feature vectors provide only local hints about the underlying states.

The method used in this work to join the segmentation and the segment classification into a unified probabilistic approach is the hidden Markov model (HMM). The popularity of the model rests on its simple mathematical formulation, which allows efficient implementations. The model is particularly widespread in ASR, despite the fact that the Markov property, according to which the transition probability to a new state depends only on the current process state, does not hold accurately for speech.
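The way an HMM unifies segmentation and classification can be illustrated by Viterbi decoding, which finds segment boundaries and state identities jointly as a single most likely state path. The following is a minimal sketch, not the exact formulation of this thesis; the function name and the toy probabilities are invented for illustration.

```python
import numpy as np

def viterbi(log_A, log_pi, log_B):
    """Most likely state path of an HMM (a minimal illustrative sketch).

    log_A:  (S, S) log transition probabilities between states.
    log_pi: (S,)   log initial state probabilities.
    log_B:  (T, S) per-frame log emission likelihoods of each state.
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]           # best log-prob of any path ending in each state
    psi = np.zeros((T, S), dtype=int)   # backpointers: best predecessor of each state
    for t in range(1, T):
        scores = delta[:, None] + log_A     # scores[i, j]: come from state i, go to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # Backtrack from the best final state; the state changes along this
    # path are the segmentation, the state labels are the classification.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

Note that no separate segmentation pass is needed: competing hypotheses are compared as whole paths, each carrying its own segment boundaries.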

The training of HMM-based recognition models is automatic, eliminating most of the tedious manual segmentation and labeling work. For ASR, the models learn the characteristics of the states and transitions directly from the collected speech samples, requiring only initial values for the parameters and the phonetic transcriptions of the collected words. The most difficult and crucial part is learning to model the features emitted by the states, because the output densities are not easily parameterized and the models should also tolerate some variation in the data. In practice, there will probably always be states that do not receive enough data for confident density estimates, and yet the models should generalize well to unknown test data.

Artificial neural networks (ANNs) have properties that are well suited to computing the likelihoods of the states corresponding to the observed features. ANNs do not make assumptions about the parameterization of the densities as restrictive as those of conventional statistical methods; they can be flexibly extended, and they can operate at high speed using only simple processing elements. Moreover, simple training methods can be developed that allow the learning of the features that discriminate between the models.
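One common way of connecting the two model families, sketched here under the usual hybrid ANN/HMM assumption rather than as the specific method of this thesis, is to train a network to classify individual frames, so that it outputs state posteriors P(state | x), and then divide by the state priors P(state). By Bayes' rule this yields the likelihood P(x | state) scaled by P(x), and the frame term P(x) is the same for all states, so it cancels when Viterbi compares paths.

```python
import numpy as np

def posteriors_to_scaled_log_likelihoods(posteriors, priors, eps=1e-12):
    """Convert frame-wise network outputs into HMM emission scores.

    posteriors: (T, S) network outputs, each row summing to one.
    priors:     (S,)   relative state frequencies in the training data.
    Returns (T, S) log of P(state | x) / P(state), i.e. log P(x | state)
    up to the state-independent constant log P(x).
    """
    # eps guards the logarithm against exact zeros in the outputs.
    return np.log(posteriors + eps) - np.log(priors + eps)
```

These scaled scores can be fed directly into an HMM decoder in place of log emission likelihoods.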


Mikko Kurimo
11/7/1997