The Self-Organizing Map (SOM) is a versatile tool for exploring data sets. It is an effective clustering method, and it has excellent visualization capabilities. These include techniques that use the weight vectors of the SOM to give an informative picture of the data space, and techniques that use data projections to compare data vectors or whole data sets with each other. The visualization capabilities of the SOM make it a valuable tool in data summarization and in consolidating the discovered knowledge. The SOM can also be used for regression and modeling, or as a preprocessing stage for other methods.

As part of this work, a prototype of a data mining tool was implemented. The ENTIRE program is a much-needed improvement in usability over the command-based program package SOM_PAK [22]. However, to really make use of the capabilities of the SOM, such a tool should be integrated into an existing data mining/computing environment such as Matlab by MathWorks, Inc. [28]. For a generic data mining environment, the list of the very basic operations could be as follows:

- Preprocessing tools for scaling and filtering the data as presented in section 4.1. An important additional property would be to let the user query the original data values, i.e. to reverse the scaling.
- The training options should cover at least the very basic types: hexagonal and rectangular lattices, rectangular and toroidal topologies, random and linear initialization, bubble and Gaussian neighborhood functions, and linear and inverse-of-time decrease of the learning coefficient. See sections 2.2 and 2.3.
- Quality measures for comparing SOMs, e.g. the quantization error, topographic error, the topological quantization and the map distance measure presented in section 3.2.
- Visualization, both 2D and 3D, of component planes, the u-matrix and Sammon's mapping. Labels are important, as is the possibility to control the colour mapping. See sections 2.6 and 4.3.
- Data classification and the visualization of data distributions as presented in sections 3.3 and 4.4. The user should also be able to view the local data sets and to search for the best-matching units (BMUs) of arbitrary vectors. Depending on the application area, trajectories may or may not be important.
- Postprocessing tools such as the (semi)automated clustering tools presented in section 3.3 and different kinds of labeling tools, as well as the possibility to save information about the map and the data projections for further analysis. See section 4.5.
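
A few of the basic operations listed above can be illustrated with a minimal Python sketch. It is not taken from the ENTIRE program or from SOM_PAK; the function names, the use of variance normalization as the scaling method, and the constant `c` in the inverse-of-time schedule are all assumptions made for illustration only.

```python
import math

# --- Reversible scaling (assumed here to be variance normalization) ---
def fit_scaling(column):
    """Return (mean, std) of one variable so that scaling can be undone later."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for a constant column
    return mean, std

def scale(x, mean, std):
    return (x - mean) / std

def unscale(z, mean, std):
    """Recover the original data value from a scaled value."""
    return z * std + mean

# --- Two decreasing schedules for the learning coefficient ---
def alpha_linear(t, total, alpha0):
    """Linear decrease from alpha0 at t = 0 to zero at t = total."""
    return alpha0 * (1.0 - t / total)

def alpha_inverse(t, total, alpha0, c=100.0):
    """Inverse-of-time decrease; the constant c is a hypothetical choice."""
    return alpha0 / (1.0 + c * t / total)

# --- Best-matching unit and quantization error ---
def bmu(x, codebook):
    """Index of the weight vector closest to x in Euclidean distance."""
    return min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))

def quantization_error(data, codebook):
    """Mean distance between each data vector and its best-matching unit."""
    total = sum(math.dist(x, codebook[bmu(x, codebook)]) for x in data)
    return total / len(data)
```

Storing the scaling parameters alongside the map is what makes it possible to report component values in their original units; the schedules and the quantization error are shown only in their simplest forms.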

The methods and tools presented in this work were used to analyze the pulp and paper industry worldwide and the Scandinavian industry in more detail. The hierarchical SOM was used to combine data from different areas. Such use of multiple interpretation layers introduces some additional error to the process but, on the other hand, provides a more structured solution to data fusion than simple concatenation of feature vectors.

The results were encouraging. However, much work is still needed regarding the postprocessing stage and the interpretation of results. The analysis in this work was performed by hand, and it was both time-consuming and inaccurate because it was based on visual inspection rather than exact measures from the SOM. The development and automated usage of algorithms that cluster the units of the SOM will be an essential part of future work. Such clustering should not be based only on the distance matrix of the SOM, but also on the rate of change in the values of individual component planes. This could be accomplished with hierarchical maps or with fuzzy interpretation rules.
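
As a rough illustration of the two quantities mentioned above, the following Python sketch computes, for every map unit, the mean distance to its lattice neighbours (a simple distance matrix) and the mean absolute change of a single component. It is only one possible reading of "rate of change" and not the method used in this work; the rectangular 4-neighbour lattice, the function names and the nested-list map layout are assumptions, and the map is assumed to contain at least two units.

```python
import math

def neighbours(r, c, rows, cols):
    """4-connected neighbours on a rectangular lattice (an assumption;
    a hexagonal lattice would have up to six neighbours)."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            yield rr, cc

def neighbour_mean(grid, diff):
    """Apply diff(unit, neighbour) to every neighbour pair and average per unit.

    grid[r][c] is the weight vector of the unit at lattice position (r, c).
    """
    rows, cols = len(grid), len(grid[0])
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            values = [diff(grid[r][c], grid[rr][cc])
                      for rr, cc in neighbours(r, c, rows, cols)]
            row.append(sum(values) / len(values))
        result.append(row)
    return result

def distance_matrix(grid):
    """Mean Euclidean distance from each unit to its lattice neighbours."""
    return neighbour_mean(grid, math.dist)

def component_change(grid, k):
    """Mean absolute change of component k between each unit and its neighbours."""
    return neighbour_mean(grid, lambda a, b: abs(a[k] - b[k]))
```

Units where either measure is large are candidates for cluster borders; a border that shows up in one component plane but only weakly in the distance matrix is the kind of structure a purely distance-based clustering could miss.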

All in all, the many abilities of the SOM together with its robustness and flexibility are a combination which makes the SOM a prime tool in knowledge discovery and data mining.

Tue May 27 12:40:37 EET DST 1997