A quick guide to using the Evolving Tree

This guide tells you how to get started with the Evolving Tree. It walks through the typical steps needed to analyze data. In this case we have a data set and we want to measure the classification rate of the Evolving Tree with n-fold cross-validation.

Preliminaries

The first things you need are the executable files and your data. Instructions on how to compile the programs can be found on this page, and information on the data format can be found here.

Go to the directory you extracted the package to. It should contain several Python scripts and the two core executable files. Copy your data file to this directory. In this document we use the file name datafile.dat.

Data pre-processing

In almost all cases you want to pre-process your data. The Evolving Tree package comes with several helper scripts for doing this efficiently. Our empirical experience has been that normalizing each variable independently with data_normalizer.py gives the best results. The following command preprocesses the data.

data_normalizer.py datafile.dat

Your normalized data is now in the file datafile.norm.dat.
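
For reference, per-variable normalization of the kind described above could look like the following sketch. This is only an illustration, assuming a file of whitespace-separated numeric values and zero-mean, unit-variance scaling; the actual scaling and file handling in data_normalizer.py may differ, and a data file containing class labels would need those columns skipped.

# Sketch of per-variable normalization (assumed zero mean, unit variance);
# the real data_normalizer.py may use a different scaling and data format.
import sys

def normalize_columns(rows):
    """Scale each variable (column) independently to zero mean, unit variance."""
    cols = list(zip(*rows))                  # transpose to work column by column
    normalized = []
    for col in cols:
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        std = var ** 0.5 or 1.0              # avoid division by zero for constant columns
        normalized.append([(x - mean) / std for x in col])
    return list(zip(*normalized))            # transpose back to rows

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        data = [[float(x) for x in line.split()] for line in f if line.strip()]
    for row in normalize_columns(data):
        print(" ".join("%g" % x for x in row))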

Unless you have a very good reason not to, randomize the order of your data vectors with data_randomizer.py. Most data sets are ordered by default, and this can bias the results heavily. Suppose you do ten-fold cross-validation with a data set that has ten classes. In the worst possible case the data is split so that all members of one class go to the test set and all the others go to the training set. The result would be a classification rate of zero, no matter how you do the training. This is a real problem; it actually happened to us during the development of the Evolving Tree.

Randomization is done just like the normalization above, but using the data_randomizer.py script.

data_randomizer.py datafile.norm.dat

The result is in the file datafile.norm.rand.dat to signify that it has been both normalized and randomized.
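
The randomization itself is conceptually simple: shuffle the order of the data vectors. A minimal sketch is shown below; the actual data_randomizer.py may differ in details such as random seeding and output file naming.

# Sketch of randomizing the order of data vectors; the real
# data_randomizer.py may handle seeding and output naming differently.
import random
import sys

with open(sys.argv[1]) as f:
    lines = [line for line in f if line.strip()]
random.shuffle(lines)                              # random order removes any class ordering
outname = sys.argv[1].replace(".dat", ".rand.dat") # assumed naming convention
with open(outname, "w") as f:
    f.writelines(lines)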

Parameter selection

Usually you want to try the training with several different parameter values. Creating a group of parameters to evaluate is quite simple. Just run the script make_params.py and answer the questions it asks. Here is a sample run; the values given after the Base: and Params: prompts are the parts you type.

make_params.py trainparams
Parameter: division threshold
Base:
120
Params:
80 100
Parameter: division factor
Base:
4
Params:

Parameter: eta0
Base:
0.3
Params:
0.2 0.4
Parameter: sigma
Base:
0.8
Params:
0.7 0.9
Parameter: tau1
Base:
4
Params:

Parameter: tau2
Base:
2
Params:

Parameter: bmu counter decay
Base:
0.8
Params:
0.9
Parameter: k-means rounds
Base:
1
Params:
2

The program goes through the parameters one by one. First it asks for a base value, and then for a range of alternative values to try. If you don't want to try different values for some parameter, just press enter (as with division factor above). After this you have your parameters in the file trainparams, which looks like this.

120 80 100
4
0.3 0.2 0.4
0.8 0.7 0.9
4
2
0.8 0.9
1 2

The format is quite straightforward. Every parameter has its own line. The first element on a line is the base value, and after it come the alternate values, if any, separated by spaces.
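
If you want to process such a file programmatically, parsing it is equally simple. The following sketch reads a trainparams-style file into one list of values per parameter, base value first; how param_rotate.py actually combines the alternatives is not shown here.

# Sketch: read a trainparams-style file into one list per parameter,
# base value first, alternate values after it.
def read_params(filename):
    params = []
    with open(filename) as f:
        for line in f:
            values = [float(x) for x in line.split()]
            if values:
                params.append(values)
    return params

print(read_params("trainparams"))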

Here are some suggested parameter values in a handy table form. They have proven quite usable on a data set of 1300 vectors with 32 dimensions and 14 classes.

division threshold    120
division factor       4
eta0                  0.3
sigma0                0.8
tau1                  4
tau2                  2
decay constant        0.8
k-means rounds        2

A detailed description of the parameters can be found in the class reference documentation.

Running the cross-validator

Now you are ready to run the actual cross-validation script param_rotate.py. We give it the files obtained above as arguments.

param_rotate.py datafile.norm.rand.dat trainparams 10

The last argument is the number of folds for the cross-validation. In this case the program does 10-fold cross-validation. At this point you can relax while numbers are being crunched. Not for too long, though, because the Evolving Tree is quite a bit faster than you'd expect. The output goes to a file of the form datafile.norm.rand.dat.<number>.results. The number changes every time you run the program, so you don't accidentally overwrite your old results.
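
To recap what n-fold cross-validation means here: the data is split into n parts, and each part in turn acts as the test set while the remaining parts are used for training. A minimal sketch of such a split is shown below; the actual partitioning inside param_rotate.py may differ.

# Sketch of an n-fold split: each fold serves as the test set once,
# the rest of the data as the training set.
def folds(data, n):
    for i in range(n):
        test = data[i::n]                                     # every n-th vector starting at i
        train = [x for j, x in enumerate(data) if j % n != i]
        yield train, test

# for train, test in folds(vectors, 10): train and evaluate the tree here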

Checking the results

Once the program has finished, you most likely want to find out how the algorithm did. You can look through the result file manually, but that is very tedious. An easier way is to use the provided helper script findbest.py to find the best result. Using it is simple.

findbest.py datafile.norm.rand.dat.2233.results
100 4 0.2 0.8 4 2 0.8
0.86 0.23 0.53 0.78 0.51 0.31 0.4 0.47 0.43 0.74 0.19 0.0 0.46 0.6
Total accuracy 0.51772054479

This output is the classification result for the best run. The first line contains the parameters used, in the same order as above. The next line contains the classification rate for each class; in this case some classes are a lot harder to classify than others. Finally there is the total classification rate, calculated as a weighted average.
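
The weighted average means that each class's rate is weighted by its share of the test vectors, so the total is not simply the mean of the per-class rates. As an illustration only (the class sizes below are hypothetical; findbest.py reports just the rates):

# Sketch: total classification rate as a weighted average of per-class rates.
# The class sizes are hypothetical; findbest.py reports only the rates.
rates = [0.86, 0.23, 0.53, 0.78, 0.51, 0.31, 0.4,
         0.47, 0.43, 0.74, 0.19, 0.0, 0.46, 0.6]
sizes = [100] * 14                      # number of test vectors in each class (made up)
total = sum(r * s for r, s in zip(rates, sizes)) / sum(sizes)
print("Total accuracy", total)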

Moving on

At this point you are pretty much done. Now you can refine your results by tweaking the parameters, try different data sets, examine ways to adapt the Evolving Tree to your specific problem domain, and write journal articles showcasing your results.


Copyright 2004 Jussi Pakkanen, Laboratory of Computer and Information Science, Helsinki University of Technology.