Tools and databases related to natural language processing

Locally installed programs

SRILM - The SRI Language Modeling Toolkit

SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation. It is similar to the CMU-Cambridge toolkit, but more flexible, with more options and more frequent updates.

The CMU-Cambridge Statistical Language Modeling Toolkit v2

Estimates and evaluates n-gram models and word frequencies from a text corpus. Supports the ARPA language model format. Very fast and easy to use.
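As a rough illustration of what estimating word frequencies and n-gram models from a corpus involves, here is a minimal Python sketch. It is not the toolkit's implementation or interface: the function names ngram_counts and mle_prob are made up for this example, the toy sentence is invented, and the probability is a plain unsmoothed relative frequency, whereas a real toolkit applies discounting and back-off.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of words) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mle_prob(tokens, ngram):
    """Unsmoothed estimate of P(last word | preceding words).

    Hypothetical helper for illustration only; no smoothing or back-off.
    """
    ngram = tuple(ngram)
    full = ngram_counts(tokens, len(ngram))
    # Count how often each history (the n-gram minus its last word) occurs.
    history = Counter()
    for gram, count in full.items():
        history[gram[:-1]] += count
    seen = history[ngram[:-1]]
    return full[ngram] / seen if seen else 0.0


# Toy corpus, invented for this example.
tokens = "the cat sat on the mat the cat ran".split()
word_freq = ngram_counts(tokens, 1)   # word frequencies
bigrams = ngram_counts(tokens, 2)     # bigram counts
p_cat_given_the = mle_prob(tokens, ("the", "cat"))
```

The same ngram_counts call with n=3 gives trigram counts; only the window length changes.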

The Hidden Markov Model Toolkit (HTK)

General-purpose Hidden Markov Model toolkit, mostly used for speech recognition.

Self-made HMM software

Includes SOM, K-means and LVQ.

Implementation of Hofmann's topic-based language model

An efficient C++ library for creating and using Hofmann's topic-based language model. The library can also be used from Perl programs.

General audio file processing tools

General tools for cutting and pasting WAV and other audio files. Includes wavetools (run any w* command with -h for help, for example wcut -h), a patched sox, and others.

The Bow Toolkit

A toolkit for statistical language modeling, text retrieval, classification and clustering. Contains a library of C code and front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).

Wavesurfer

Visualizes the spectrum, waveform, segmentation, etc. of an audio file.

IRC

Text-only IRC client for the UNIX shell.

Local databases

Isolated words: 59 speakers, ~350 words each

Data collected here at the lab. About 350 words per speaker. Includes segmentation into phonemes. NIST wav (= sph) format.

"Kielipankki" trigram statistics and tools

Trigram statistics for the "Kielipankki" data, plus tools for combining and trimming trigram files. The CMU toolkit cannot be used with this data, because the vocabulary size clearly exceeds 65000 words.
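To make "combining and trimming" concrete, the following Python sketch merges trigram count tables, drops rare trigrams, and checks the vocabulary size against the 65000 figure mentioned above. It is an illustration only and not the local tools: the one-trigram-per-line format "w1 w2 w3 count", all function names, and the sample words are assumptions made for this example.

```python
from collections import Counter

VOCAB_LIMIT = 65000  # figure quoted in the text above


def read_counts(lines):
    """Parse lines of the assumed form 'w1 w2 w3 count' into a Counter."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        counts[tuple(parts[:3])] += int(parts[3])
    return counts


def combine(*tables):
    """Merge several trigram tables by summing the counts."""
    total = Counter()
    for table in tables:
        total.update(table)
    return total


def trim(counts, min_count):
    """Keep only trigrams seen at least min_count times."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})


def vocabulary(counts):
    """Set of distinct words appearing in any trigram."""
    return {word for gram in counts for word in gram}


def exceeds_limit(vocab, limit=VOCAB_LIMIT):
    """True if the vocabulary is larger than the given limit."""
    return len(vocab) > limit


# Two tiny invented tables standing in for separate trigram files.
a = read_counts(["talo on suuri 3", "on suuri ja 1"])
b = read_counts(["talo on suuri 2", "suuri ja pieni 1"])
merged = combine(a, b)
trimmed = trim(merged, 2)
too_big = exceeds_limit(vocabulary(merged))
```

For real files, each table could be built by passing an open file object to read_counts, since it accepts any iterable of lines.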

Databases elsewhere

CSC's "Kielipankki", books database

CMU Textlearning databases