This document explains the data format used by the Evolving Tree. It is very similar to the data format used in SOM Toolbox. The format consists of three parts: the header, the data dimension line, and the data vectors. Here is a sample data file.
# This is the header.
# It may have several lines.
3
0.1 0.2 0.1 good
3 5.5 0 bad
0 0.2 0.2 good
1.1 2.2 2.1 bad
Let us now go through the portions one by one.
The header is at the beginning of the file. It consists of zero or more lines that start with the character '#'. The text in the header lines can be anything, as it is ignored. Note that these lines must be at the beginning of the file; they may not appear anywhere else.
Immediately after the header there is a line that consists of a single integer. It defines the dimension of the data vectors that follow.
Data vectors are listed one per line. Each line begins with n real numbers separated by spaces; these constitute the actual data vector. The value of n is specified by the data dimension line above. After the numbers comes a simple string attribute. Depending on how you want to use the Evolving Tree, this can be, for example, a class ID or a running index, but it must not be empty.
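The three parts described above can be read with a few lines of Python. The following is only a minimal sketch and not part of the Evolving Tree distribution: the function name read_etree_data is hypothetical, and the sketch assumes that fields are separated by whitespace and that blank lines may be skipped, neither of which is guaranteed by the format description.

```python
from typing import List, Tuple


def read_etree_data(text: str) -> Tuple[int, List[List[float]], List[str]]:
    """Parse data-file contents into (dimension, vectors, attributes).

    Hypothetical helper for illustration; not part of the Evolving Tree.
    """
    lines = text.splitlines()
    pos = 0
    # Header: zero or more leading lines that start with '#'. They are ignored.
    while pos < len(lines) and lines[pos].startswith("#"):
        pos += 1
    if pos == len(lines):
        raise ValueError("missing data dimension line")
    # Data dimension line: a single integer right after the header.
    dim = int(lines[pos].strip())
    pos += 1

    vectors: List[List[float]] = []
    attributes: List[str] = []
    for lineno, line in enumerate(lines[pos:], start=pos + 1):
        fields = line.split()
        if not fields:
            continue  # assumption of this sketch: blank lines are tolerated
        # Each data line must hold exactly n numbers plus one attribute.
        if len(fields) != dim + 1:
            raise ValueError(
                f"line {lineno}: expected {dim} numbers and one attribute"
            )
        vectors.append([float(x) for x in fields[:dim]])
        attributes.append(fields[dim])
    return dim, vectors, attributes


sample = """# This is the header.
# It may have several lines.
3
0.1 0.2 0.1 good
3 5.5 0 bad
"""
dim, vectors, attributes = read_etree_data(sample)
print(dim, vectors, attributes)
```

A '#' line that appears after the dimension line is not treated as a comment here; it fails the field count or number conversion, which matches the rule that header lines may only appear at the beginning of the file.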
The last attribute is not used in training at all, but is returned when the tree is queried with the etreequery program. The attribute string must not contain any spaces. It is also recommended not to have unusual characters, such as quotes ("), hashes (#) or backticks (`). They can cause problems in other programs, which can be very difficult to pinpoint and analyze.
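When generating data files from your own program, it may help to check the attribute strings before writing them. The sketch below is one possible way to do that in Python. The names validate_label and format_data_file are hypothetical and not part of the Evolving Tree; the sketch rejects empty attributes and attributes containing whitespace, and merely reports the discouraged characters, since those are a recommendation rather than a hard rule.

```python
from typing import List, Sequence

# Characters the text recommends avoiding: quote, hash and backtick.
DISCOURAGED = set("\"#`")


def validate_label(label: str) -> List[str]:
    """Raise on an invalid attribute; return discouraged characters found."""
    if not label:
        raise ValueError("attribute string must not be empty")
    if any(ch.isspace() for ch in label):
        raise ValueError(f"attribute string {label!r} must not contain spaces")
    return sorted(ch for ch in set(label) if ch in DISCOURAGED)


def format_data_file(
    vectors: Sequence[Sequence[float]],
    labels: Sequence[str],
    header_lines: Sequence[str] = (),
) -> str:
    """Build file contents: header, dimension line, then one vector per line."""
    if not vectors or len(vectors) != len(labels):
        raise ValueError("need at least one vector and one attribute per vector")
    dim = len(vectors[0])
    out = [f"# {text}" for text in header_lines]  # optional header
    out.append(str(dim))                          # data dimension line
    for vec, label in zip(vectors, labels):
        if len(vec) != dim:
            raise ValueError("all vectors must have the same dimension")
        validate_label(label)  # raises on empty or space-containing attributes
        numbers = " ".join(repr(float(x)) for x in vec)
        out.append(f"{numbers} {label}")
    return "\n".join(out) + "\n"


print(format_data_file([[0.1, 0.2, 0.1], [3, 5.5, 0]], ["good", "bad"],
                       header_lines=["This is the header."]))
print(validate_label('odd"name'))
```

Whether to turn the discouraged characters into hard errors depends on which other programs will process the file; here they are only returned so the caller can decide.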
Copyright 2004 Jussi Pakkanen, Laboratory of Computer and Information Science, Helsinki University of Technology.