This guide shows you how to get started with the Evolving Tree. It walks through the typical steps needed to analyze data. In this case we have a data set and we want to measure the classification rate of the Evolving Tree with n-fold cross-validation.
The first things you need are the executable files and your data. Instructions on how to compile the programs and a description of the data format can be found in the accompanying documentation.
Go to the directory you extracted the package to. It should contain several Python scripts and the two core executable files. Copy your data file to this directory. In this document we use the file name datafile.dat.
In almost all cases you want to preprocess your data. The Evolving Tree package comes with several helper scripts for this. Our empirical experience has been that normalizing each variable independently with data_normalizer.py gives the best results. The following command preprocesses the data.
data_normalizer.py datafile.dat
Your normalized data is now in the file datafile.norm.dat.
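If you are curious what normalizing each variable independently amounts to, here is a minimal sketch. It assumes a whitespace-separated data file whose last column is the class label and z-scores each feature column; both the file layout and the choice of z-scoring are assumptions for illustration, not the actual data_normalizer.py implementation.

# Hypothetical sketch of per-variable normalization (z-scoring each
# column). The real data_normalizer.py may use a different scaling
# and file layout; the output file name is also simplified here.
import sys

def normalize_columns(infile, outfile):
    rows = [line.split() for line in open(infile) if line.strip()]
    features = [[float(x) for x in row[:-1]] for row in rows]
    labels = [row[-1] for row in rows]
    for j in range(len(features[0])):
        col = [vec[j] for vec in features]
        mean = sum(col) / len(col)
        std = (sum((x - mean) ** 2 for x in col) / len(col)) ** 0.5 or 1.0
        for vec in features:            # z-score this variable in place
            vec[j] = (vec[j] - mean) / std
    with open(outfile, "w") as out:
        for vec, label in zip(features, labels):
            out.write(" ".join("%g" % x for x in vec) + " " + label + "\n")

if __name__ == "__main__":
    normalize_columns(sys.argv[1], sys.argv[1] + ".norm")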
Unless you have a very good reason not to, randomize the order of your data vectors with data_randomizer.py. Most data sets are ordered by default, and this can lead to heavily biased results. Suppose that you do ten-fold cross-validation with a data set that has ten classes. In the worst possible case the data is split so that all members of a class go to the test set and all the others go to the training set. The result would be a classification rate of zero, no matter how you do the training. This is a real problem; it actually happened to us during the development of the Evolving Tree. Randomization is done just like the normalization above, but using the data_randomizer.py script.
data_randomizer.py datafile.norm.dat
The result is in the file datafile.norm.rand.dat; the name signifies that it has been both normalized and randomized.
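Conceptually the randomization is just a uniform shuffle of the data vectors, one vector per line. The sketch below illustrates the idea; it is not the actual data_randomizer.py code, and the output file naming is simplified.

# Hypothetical sketch of randomizing the order of the data vectors so
# that a later n-fold split does not follow the original, often
# class-ordered, layout. Not the actual data_randomizer.py code.
import random
import sys

def randomize_order(infile, outfile):
    lines = [line for line in open(infile) if line.strip()]
    random.shuffle(lines)                 # uniform random permutation
    with open(outfile, "w") as out:
        out.writelines(lines)

if __name__ == "__main__":
    randomize_order(sys.argv[1], sys.argv[1] + ".rand")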
Usually you want to try the training with several different parameter combinations. Creating a group of parameters to evaluate is quite simple. Just run the script make_params.py and answer the questions it asks. Here is a sample run; the values after each prompt are what the user types.
make_params.py trainparams
Parameter: division threshold
Base: 120
Params: 80 100
Parameter: division factor
Base: 4
Params:
Parameter: eta0
Base: 0.3
Params: 0.2 0.4
Parameter: sigma
Base: 0.8
Params: 0.7 0.9
Parameter: tau1
Base: 4
Params:
Parameter: tau2
Base: 2
Params:
Parameter: bmu counter decay
Base: 0.8
Params: 0.9
Parameter: k-means rounds
Base: 1
Params: 2
The program goes through the parameters one by one. First it asks for a base value, and then a set of alternate values to try. If you don't want to try different values for some parameter, just press enter (as with division factor above). After this your parameters are in the trainparams file, which looks like this.
120 80 100
4
0.3 0.2 0.4
0.8 0.7 0.9
4
2
0.8 0.9
1 2
The format is quite straightforward. Every parameter has its own line. The first element on a line is the base value; the alternate values, if any, follow it, separated by spaces.
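If you want to read a trainparams file from your own tools, the format is easy to parse. The function below is a sketch based on that format description, not code from the package.

# Hypothetical parser for the trainparams format: one parameter per
# line, base value first, alternates (if any) after it.
def read_params(path):
    params = []
    for line in open(path):
        values = [float(v) for v in line.split()]
        if values:
            params.append((values[0], values[1:]))  # (base, alternates)
    return params

# For the file above, read_params("trainparams")[0] is (120.0, [80.0, 100.0]).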
Here are some suggested parameter values in a handy table. They have been found quite usable on a data set of 1300 vectors with 32 dimensions and 14 classes.
division threshold | 120
division factor    | 4
eta0               | 0.3
sigma0             | 0.8
tau1               | 4
tau2               | 2
decay constant     | 0.8
k-means rounds     | 2
A detailed description of the parameters can be found in the class reference documentation.
Now you are ready to run the actual cross-validation script param_rotate.py. We give it the files obtained above as arguments.
param_rotate.py datafile.norm.rand.dat trainparams 10
The last argument is a number giving the fold count of the cross-validation. In this case the program does 10-fold cross-validation. At this point you can relax while the numbers are being crunched. Not for too long, though, because the Evolving Tree is quite a bit faster than you'd expect. The output goes to a file of the form datafile.norm.rand.dat.<number>.results. The number changes every time you run the program, so you don't accidentally overwrite your old results.
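As a reminder of what n-fold cross-validation means, here is a generic sketch of the splitting step. It illustrates the technique in general; we make no claim about how param_rotate.py implements it internally.

# Generic n-fold cross-validation split: every vector is used as test
# data exactly once. Illustrative only, not param_rotate.py internals.
def folds(data, n):
    for i in range(n):
        test = data[i::n]                                  # every n-th vector
        train = [v for j, v in enumerate(data) if j % n != i]
        yield train, test

# With 10 folds, each round trains on roughly 90% of the vectors and
# tests on the remaining 10%.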
Once the program has finished, you most likely want to find out how the algorithm did. You can look through the result file manually, but that is very tedious. An easier way is to use the provided helper script findbest.py to find the best result. Using it is simple.
findbest.py datafile.norm.rand.dat.2233.results
100 4 0.2 0.8 4 2 0.8
0.86 0.23 0.53 0.78 0.51 0.31 0.4 0.47 0.43 0.74 0.19 0.0 0.46 0.6
Total accuracy 0.51772054479
This output is the classification result for the best run. The first line contains the parameters used; they are in the same order as above. The next line contains the classification rate for each class. In this case some classes are a lot harder to classify than others. Finally there is the total classification rate, calculated as a weighted average.
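The weighted average simply weights each per-class rate by the number of test vectors in that class. The class counts in the snippet below are made up for illustration; findbest.py gets the real ones from the results.

# Total classification rate as a weighted average of per-class rates.
# The counts here are invented for illustration only.
def total_accuracy(rates, counts):
    correct = sum(r * n for r, n in zip(rates, counts))
    return correct / float(sum(counts))

print(total_accuracy([0.86, 0.23, 0.53], [100, 80, 120]))  # -> 0.56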
At this point you are pretty much done. Now you can refine your results by tweaking the parameters, try different data sets, examine ways to adapt the Evolving Tree to your specific problem domain, and write journal articles showcasing your results.
Copyright 2004 Jussi Pakkanen, Laboratory of Computer and Information Science, Helsinki University of Technology.