Tools and databases related to natural language processing
Locally installed programs
SRILM - The SRI Language Modeling Toolkit
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation.
Similar to the CMU-Cambridge toolkit (CMU-SLM), but more flexible, with more options and more frequent updates.
The CMU-Cambridge Statistical Language Modeling Toolkit v2
Estimates and evaluates n-gram models and word frequencies from a
text corpus. Supports the ARPA language model format. Very fast and
easy to use.
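To make the terms above concrete, the following is a minimal Python sketch of word-frequency and n-gram counting with a maximum-likelihood bigram estimate. It does not use or reproduce the toolkit itself; the function names (ngram_counts, bigram_prob) and the toy sentence are assumptions made for illustration only.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples of words) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy corpus (illustrative only).
tokens = "the cat sat on the mat the cat ran".split()

unigrams = ngram_counts(tokens, 1)  # word frequencies
bigrams = ngram_counts(tokens, 2)

# Number of times each word occurs as a bigram history (first word).
history_counts = Counter()
for (w1, _w2), count in bigrams.items():
    history_counts[w1] += count


def bigram_prob(w1, w2):
    """Unsmoothed maximum-likelihood estimate of P(w2 | w1)."""
    if history_counts[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / history_counts[w1]


print(unigrams.most_common(2))
print(bigram_prob("the", "cat"))
```

A real toolkit would additionally apply smoothing and back-off so that unseen n-grams do not receive zero probability; this sketch leaves that out on purpose.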
The Hidden Markov Model Toolkit (HTK-toolkit)
General-purpose hidden Markov modeling toolkit, mostly used for speech
recognition.
Self-made HMM software
with SOM, K-means and LVQ
Implementation of Hofmann's topic-based language model
An effective C++ library for creating and using Hofmann's topic-based language
model. The library can also be called from Perl programs.
General audio file processing tools
General tools for cutting and pasting WAV and other audio files.
Includes wavetools (run any w* tool with -h for help, for example wcut -h), a patched sox, etc.
The Bow Toolkit
A toolkit for statistical language modeling, text retrieval, classification
and clustering. Contains a library of C code and front-ends for document
classification (rainbow), document retrieval (arrow) and document clustering
(crossbow).
Wavesurfer
Visualize the spectrum, waveform, segmentation, etc. of an audio file.
IRC
Text-only IRC client for the UNIX shell
Local databases
Isolated words: 59 speakers, ~350 words each
Data collected here at the lab, about 350 words per speaker. Includes
segmentation into phonemes. NIST wav (= sph) format.
"Kielipankki" trigram statistics and tools
Trigrams for the "Kielipankki" data. Tools for combining and trimming trigram files. The CMU toolkit will not work on this data, since the size of the vocabulary clearly exceeds 65000.
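As a rough illustration of what combining and trimming trigram files involves, here is a minimal Python sketch. It is not the lab's actual tooling: the line format ("w1 w2 w3 count"), the function names and the example trigrams are all assumptions. Words are kept as plain strings, so this sketch places no fixed cap on vocabulary size.

```python
from collections import Counter


def parse_trigram_lines(lines):
    """Parse lines of the assumed form 'w1 w2 w3 count' into a Counter."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        w1, w2, w3, count = fields
        counts[(w1, w2, w3)] += int(count)
    return counts


def merge_counts(*parts):
    """Combine several trigram count tables by summing their counts."""
    total = Counter()
    for part in parts:
        total.update(part)
    return total


def trim_counts(counts, min_count):
    """Keep only trigrams seen at least min_count times."""
    return Counter({tri: c for tri, c in counts.items() if c >= min_count})


# Two small hypothetical count files, given here as in-memory lines.
part_a = parse_trigram_lines(["talo on iso 3", "auto on uusi 1"])
part_b = parse_trigram_lines(["talo on iso 2", "kissa on musta 1"])

merged = merge_counts(part_a, part_b)
trimmed = trim_counts(merged, min_count=2)
print(trimmed)
```

For files too large to hold in memory, the same merge could be done as a streaming merge over sorted count files; the in-memory version is used here only to keep the sketch short.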
Databases elsewhere
CSC's "Kielipankki", books database
- Fewer than 100 books and articles from two newspapers, a few million words in total.
- The text cannot be copied elsewhere; all relevant statistics have to be
calculated on CSC machines
- Local "expert": Vesa.Siivola@hut.fi
CMU Textlearning databases
- 20 Newsgroups corpus: newsgroup articles, 1000 from each of the 20 groups.
- 4 Universities data set: contains WWW pages collected from computer
science departments of various universities. Manually classified.
- Original location
- Local "expert": Ella.Bingham@hut.fi