T-61.184 Audio mining
Special course on Information Technology IV

Seminar course description

Audio mining is a recently developed research area concerned with information retrieval and data exploration on audio input. A huge amount of audio information is already digitally stored or transmitted in speeches, videos, movies, radio and television programs, music, recorded meetings and so on. A large proportion of human-generated information is speech, and much of it is in the form of television and radio broadcasts. In multimedia retrieval, audio information, especially spoken audio, is in a key position because, when automatically transcribed by a speech recognizer, it is the closest to the written representation on which traditional search engines operate. In addition to text, however, the audio transcriptions may include other information, such as rival word candidates and the recognizer's confidence levels. A severe difficulty remains: the IR system must tolerate 20-50 %, or even more, misrecognized words in the transcribed text.

The recent wide interdisciplinary interest in speech and multimedia indexing clearly shows that indexing and retrieving audio segments based on their semantic content is a significant and challenging research problem.

While it is still very difficult to index data segments based on semantic content automatically extracted from moving or still images only, such as "Here's Mrs President interviewed", it is now becoming possible using speech recognition technology. Current developments in speech processing (especially large vocabulary continuous speech recognition), high performance computing, and access to digital audio for training the recognizers have produced several interesting pilot systems and, more recently, commercial applications as well. The idea of storing large amounts of audio material for retrieval is not new, but the traditional, at least partly manual, methods are so inefficient that the use of retrieval has so far been very limited.

The course is based on related special issues of journals, such as

  1. Accessing information in spoken audio. Special Issue of Speech Communication Vol. 32:1-2, September 2000.
  2. Automatic transcription of broadcast news data. Special Issue of Speech Communication Vol. 37:1-2, May 2002.

Additional material consists of selected journal articles on relevant topics, such as

  1. Separation of speech from music and non-speech sounds
  2. Speech segmentation by speaker and prosody information
  3. Speech indexing by sub-word units
  4. Speech recognition of broadcast news
  5. Optimizing speech recognition for information retrieval
  6. Evaluation methods for audio retrieval
  7. Indexing and retrieval of music and other non-speech audio
  8. Existing pilot systems for audio mining


Basic knowledge of pattern recognition, speech processing, speech recognition, natural language processing, and information retrieval methods is helpful.

