Tools and databases related to natural language processing

Locally installed programs

SRILM - The SRI Language Modeling Toolkit

SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation. It is similar to the CMU-Cambridge toolkit (CMU-SLM) but is more flexible, has more options and is updated more often.

The CMU-Cambridge Statistical Language Modeling Toolkit v2

For estimating and evaluating n-gram models and computing word frequencies from a text corpus. Supports the ARPA language model format. Very fast and easy to use.

The official home page
Online manual
Local installation: /share/puhe/CMU-Cam_Toolkit_v2/
Local "expert": Vesa.Siivola@hut.fi
Bugs and notes
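
As a rough illustration of what such a toolkit estimates, the sketch below counts n-grams in whitespace-tokenized text and computes an unsmoothed maximum-likelihood probability. It is plain Python, not the toolkit's own code or interface: the function names and the sentence markers <s> and </s> are assumptions made for this example, and a real toolkit adds smoothing and back-off on top of these counts.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count n-grams over whitespace-tokenized sentences.

    Each sentence is padded with n-1 start markers and one end marker
    (an assumption of this sketch, not a toolkit convention).
    """
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def history_counts(ngrams):
    """Sum n-gram counts over the last word to get the history counts."""
    histories = Counter()
    for gram, count in ngrams.items():
        histories[gram[:-1]] += count
    return histories


def ml_prob(ngrams, histories, gram):
    """Unsmoothed maximum-likelihood estimate of P(last word | history)."""
    total = histories[gram[:-1]]
    return ngrams[gram] / total if total else 0.0
```

With the two sentences "the cat sat" and "the cat ran", the history ("the", "cat") occurs twice, so the estimate for "sat" after "the cat" is one half. Unseen n-grams get probability zero here, which is exactly the problem that smoothing in a real toolkit addresses.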

The Hidden Markov Model Toolkit (HTK-toolkit)

General purpose Hidden Markov Modeling toolkit, mostly used for speech recognition.

Official homepage
Local installation: /share/puhe/htk/
Local "expert": Teemu.Hirsimaki@hut.fi

Self-made HMM software

HMM software with SOM, K-means and LVQ.

Official homepage
Local installation: /home/neuro/panus/sr/src/
Local "expert": Panu.Somervuo@hut.fi

Implementation of Hofmann's topic-based language model

An efficient C++ library for creating and using Hofmann's topic-based language model. The library can also be used from within Perl programs.

Intranet page. Ask Teemu if you are interested in using the library.
Experiments using the model (seminar report) topicem.ps
Local "expert": Teemu.Hirsimaki@hut.fi
The original paper: D. Gildea and T. Hofmann. Topic-based language models using EM. In Proceedings of the 6th European Conference on Speech Communication and Technology, pages 2167-2170, Budapest, Hungary, 1999.
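
For orientation, the model in the paper predicts a word as a mixture over topics, P(w | h) = sum over t of P(w | t) P(t | h), where the topic weights P(t | h) are fitted to the document history with EM. The sketch below shows that computation in plain Python under simplifying assumptions: the per-topic word distributions are given as dictionaries, the weights start uniform, unseen words get a small floor probability, and a fixed number of EM iterations is run. It does not use the local C++ library, whose interface is not described here; all names are hypothetical.

```python
def topic_weights(history, word_given_topic, iterations=20, floor=1e-12):
    """Estimate P(t | h) by EM while keeping P(w | t) fixed.

    history: list of words seen so far.
    word_given_topic: one dict per topic mapping word -> P(w | t).
    """
    n_topics = len(word_given_topic)
    weights = [1.0 / n_topics] * n_topics
    if not history:
        return weights  # no evidence yet: stay uniform
    for _ in range(iterations):
        totals = [0.0] * n_topics
        for word in history:
            # E-step: posterior probability of each topic for this word.
            joint = [weights[t] * word_given_topic[t].get(word, floor)
                     for t in range(n_topics)]
            norm = sum(joint)
            for t in range(n_topics):
                totals[t] += joint[t] / norm
        # M-step: new weights are the average posteriors over the history.
        weights = [total / len(history) for total in totals]
    return weights


def word_prob(word, weights, word_given_topic, floor=1e-12):
    """Mixture prediction P(w | h) = sum_t P(w | t) P(t | h)."""
    return sum(weight * dist.get(word, floor)
               for weight, dist in zip(weights, word_given_topic))
```

Given a sports-like topic and a politics-like topic, a history made only of sports words pushes nearly all weight onto the first topic, so sports words then receive a much higher predicted probability than politics words.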

General audio file processing tools

General tools for cutting and pasting WAV and other audio files. Includes wavetools (the w* commands; run any of them with -h for help, for example wcut -h), a patched sox, etc.

Local installation: /share/puhe/bin/
Local "expert": Teemu.Hirsimaki@hut.fi
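
To show what cutting a segment out of an audio file involves, here is a minimal sketch using Python's standard wave module. It is not wcut and makes no claim about how the local tools work or which options they take; the function name and arguments are assumptions for this example. It handles ordinary RIFF WAV files only, not NIST sph files.

```python
import wave


def cut_wav(src_path, dst_path, start_s, end_s):
    """Copy the segment [start_s, end_s) in seconds from one WAV file to another."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        n_frames = src.getnframes()
        # Clamp the requested range to the frames that actually exist.
        start = min(max(0, int(start_s * rate)), n_frames)
        end = min(max(start, int(end_s * rate)), n_frames)
        src.setpos(start)
        frames = src.readframes(end - start)
    with wave.open(dst_path, "wb") as dst:
        # The frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
```

Working in frame indices rather than byte offsets keeps the cut aligned to whole samples regardless of the sample width and channel count.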

The Bow Toolkit

A toolkit for statistical language modeling, text retrieval, classification and clustering. Contains a library of C code and front ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).

Official homepage
Local installation: /home/info/ella/ir/bow/
Local "expert": Ella.Bingham@hut.fi

Wavesurfer

Visualizes the spectrum, waveform, segmentation, etc. of an audio file.

Official homepage
Local installation: /share/puhe/irix/bin/wavesurfer
Local "expert": Teemu.Hirsimaki@hut.fi

IRC

Text-only IRC client for UNIX shell

Official homepage
Local installation: /home/info/ella/ir/irc/
Local "expert": Ella.Bingham@hut.fi

Local databases

Isolated words: 59 speakers, ~350 words each

Data collected here at the lab, about 350 words per speaker. Includes segmentation into phonemes. Stored in NIST wav (= sph) format.

Location: /share/puhe/näytteet-syksy1999/
Local "expert": Vesa.Siivola@hut.fi
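
NIST sph (SPHERE) files begin with a plain-text header, so the sampling rate and similar fields can be inspected without special tools. The sketch below parses that header in Python. It assumes the usual layout (a "NIST_1A" line, a header-size line, then "name type value" lines ending with "end_head") and has not been checked against the files in this database; the function name is hypothetical.

```python
def parse_sphere_header(data):
    """Parse a NIST SPHERE header from the leading bytes of a file.

    Returns (header_size, fields). Integer (-i) and real (-r) fields are
    converted; string fields (-sN) are truncated to their declared length.
    """
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("not a NIST SPHERE file")
    # The second 8-byte line holds the total header size, e.g. "   1024".
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[16:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, kind, value = parts
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        elif kind.startswith("-s") and kind[2:].isdigit():
            fields[name] = value[:int(kind[2:])]
    return header_size, fields
```

In practice one would read the first 16 bytes of a file to learn the header size, then read the rest of the header and pass the combined bytes to this function; the audio samples start at the returned header size.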

"Kielipankki" trigram statistics and tools

Trigrams for the "Kielipankki" data, plus tools for combining and trimming trigram files. The CMU toolkit will not work here, since the vocabulary size clearly exceeds 65000 words.

Database location /share/puhe/Kielipankki/
Tools are in /home/neuro/vsiivola/CSC/c. Running a tool with -h gives short help. There are also scripts for creating base-form trigram models, etc.; ask Vesa.
Local "expert": Vesa.Siivola@hut.fi
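
The idea of combining and trimming trigram files can be sketched in a few lines of Python. This is not the local tool set: the file layout assumed here (one "w1 w2 w3 count" entry per line) and all function names are assumptions for illustration. Because words are kept as strings rather than fixed-width indices, the sketch places no limit on the vocabulary size.

```python
from collections import Counter


def read_counts(lines):
    """Read trigram counts from lines of the assumed form 'w1 w2 w3 count'."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        counts[tuple(fields[:3])] += int(fields[3])
    return counts


def combine(*tables):
    """Add up the counts of several trigram tables."""
    total = Counter()
    for table in tables:
        total.update(table)  # Counter.update adds counts
    return total


def trim(counts, min_count):
    """Keep only trigrams seen at least min_count times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})
```

Trimming after combining, rather than before, avoids discarding trigrams that are rare in each file but frequent enough in total.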

Databases elsewhere

CSC's "Kielipankki", books database

Fewer than 100 books and articles from two newspapers, a few million words in total.
The text cannot be copied elsewhere; all relevant statistics have to be calculated on CSC machines.
Local "expert": Vesa.Siivola@hut.fi

CMU Textlearning databases

20 Newsgroups corpus: newsgroup articles, 1000 from each of the 20 groups.
4 Universities data set: contains WWW pages collected from computer science departments of various universities. Manually classified.
Original location
Local "expert": Ella.Bingham@hut.fi