Software 2007 - T-61.3050

Data analysis software

Some of the example code distributed on the course web site is written in R. R is a language and environment for statistical computing and graphics, and it can be regarded as an open source implementation of the S language. R provides a wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and it is highly extensible. It is often the vehicle of choice for research in statistical methodology. For example, R is currently the de facto data analysis environment in bioinformatics via the Bioconductor Project.

One of the strengths of R is its excellent documentation, especially the numerous books written about it. The online documentation is also quite extensive and includes references to the implemented methods. A good reference for R is "Modern Applied Statistics with S" by Venables and Ripley, available from the TKK library. Someone also recommended "Mixed-Effects Models in S and S-Plus" by Pinheiro and Bates, which is supposedly available as an eBook via the TKK library. I cannot confirm this, however, because I have never succeeded in obtaining any eBook from our library; maybe you will have better luck (or ask a librarian). See the R web site for other documentation and references.

However, if you are more comfortable with some other software capable of doing the necessary computations, there is no particular reason to use R or S, unless you want to try R (after all, it is free). It is important, however, to be familiar with some tool that lets you perform basic operations on the data and plot it.

Other options include but are not limited to:

Matlab (commercial) and Octave, a largely compatible open source alternative.
Weka, licensed under the GNU General Public License. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

Other helpful software

The standard Unix command line utilities, such as wc, sort, tr, uniq and sed, as well as scripting languages like awk, Perl and Python, can make life easier in various phases of data analysis, especially when converting a data file to a suitable format or when doing some preliminary analysis or pruning of the data. For example, the following command counts the number of unique entries in the third column of a comma-separated data file:
awk -F ',' '{print $3}' < data.csv | sort -u | wc -l
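
The same pipeline idea extends to a quick frequency table. The following is only a minimal sketch: the sample file, its contents and the variable names (tmpdir, unique_count, freq_table) are made up for illustration and are not part of the course material. It assumes a POSIX-like shell with mktemp available.

```shell
# Hypothetical sample data; the file name and contents are invented for this sketch.
tmpdir=$(mktemp -d)
cat > "$tmpdir/data.csv" <<'EOF'
1,alice,red
2,bob,blue
3,carol,red
4,dave,green
5,erin,blue
EOF

# Count the unique entries in the third column.
# tr strips the padding that some wc implementations add around the number.
unique_count=$(awk -F ',' '{print $3}' < "$tmpdir/data.csv" | sort -u | wc -l | tr -d ' ')
echo "unique entries: $unique_count"

# Count how often each entry occurs, most frequent first.
# uniq -c only merges adjacent duplicates, so the input has to be sorted first.
freq_table=$(awk -F ',' '{print $3}' < "$tmpdir/data.csv" | sort | uniq -c | sort -rn)
echo "$freq_table"

# Remove the temporary sample data.
rm -rf "$tmpdir"
```

A table like this is often enough to spot misspelled category labels or unexpectedly rare values before loading the data into R or another tool.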

These tools are included in the default installations of Unix-family operating systems such as Linux and OS X. For Microsoft Windows they are available, for example, via Cygwin.