T-61.3050 Machine Learning: Basic Principles (2007): Software

Some of the example code distributed on the course web site is written in R. R is a language and environment for statistical computing and graphics, and can be regarded as an open source implementation of S. R provides a wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. It is often the vehicle of choice for research in statistical methodology. For example, R is currently the de facto data analysis environment in bioinformatics via the Bioconductor Project.

One of the strengths of R is its excellent documentation, especially the numerous books written about it. The online documentation is also quite extensive, with references to the implemented methods. A good reference for R is "Modern Applied Statistics with S" by Venables and Ripley, available from the TKK library. Someone also recommended "Mixed-Effects Models in S and S-Plus" by Pinheiro and Bates, which is supposedly available as an eBook via the TKK library. I cannot confirm this, however, because I have never succeeded in obtaining any eBook from our library; maybe you will have better luck (or ask a librarian). See the R web site for other documentation and references.

However, if you are more comfortable with some other software capable of doing the necessary computations, there is no particular reason to use R or S, unless you want to try R (after all, it is free). It is important, however, to be familiar with *some* tool that allows you to do the basic operations and plots with the data.

Other options include but are not limited to:

- The commercial Matlab and its largely compatible open source alternative, GNU Octave.
- Weka, licensed under the GNU General Public License. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

The standard Unix command line utilities, such as `wc`, `sort`, `tr`, `uniq` and `sed`, as well as scripting languages like `awk`, `perl` and `python`, can make life easier in various phases of data analysis, especially when converting a data file to a suitable format or when doing some preliminary analysis or pruning of the data. For example, the following counts the number of unique entries in the third column of a comma-separated data file:

`awk -F ',' '{print $3}' < data.csv | sort -u | wc -l`
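To make this concrete, here is a minimal sketch that combines a format conversion with the counting pipeline above. The file names `raw.txt` and `sample.csv` and their contents are made up for illustration and are not part of the course material. The sketch assumes a semicolon-separated input with one header line.

```shell
# Hypothetical raw file: semicolon-separated, with a header line.
cat > raw.txt <<'EOF'
id;length;species
1;5.1;setosa
2;4.9;setosa
3;6.3;virginica
4;5.8;versicolor
5;6.0;virginica
EOF

# Convert the separators to commas and drop the header line.
tr ';' ',' < raw.txt | sed '1d' > sample.csv

# Count the distinct values in the third column.
# (tr -d ' ' strips the padding that some wc implementations add.)
n_unique=$(awk -F ',' '{print $3}' < sample.csv | sort -u | wc -l | tr -d ' ')
echo "distinct values in column 3: $n_unique"

# Show how often each value occurs, most frequent first.
awk -F ',' '{print $3}' < sample.csv | sort | uniq -c | sort -rn
```

Note that `uniq` only collapses adjacent duplicate lines, which is why the input is sorted before `uniq -c` is applied.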

These tools come in the default installations with operating systems of the unix family like Linux and OS X. For Microsoft Windows they are available for example via Cygwin.

Page maintained by t613050@james.hut.fi, last updated Wednesday, 23-Apr-2008 09:01:38 EEST