In each training step, one sample vector from the input data
set is chosen randomly and a similarity measure is calculated between
it and all the weight vectors of the map. The Best-Matching Unit
(BMU), denoted as c, is the unit whose weight vector has the
greatest similarity with the input sample. The similarity is
usually defined by means of a distance measure, typically the Euclidean distance.
Formally, the BMU is defined as the neuron for which

$$\|x - m_c\| = \min_i \{\|x - m_i\|\},$$

where $x$ is the input sample, $m_i$ are the weight vectors of the map units, and $\|\cdot\|$ is the distance measure.
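For illustration, a minimal sketch of the BMU search in Python is given below; the array names and the assumption that the weight vectors are stored row-wise in a NumPy array are choices made for the example, not part of the definition above.

```python
import numpy as np

def find_bmu(x, weights):
    """Return the index of the best-matching unit (BMU).

    x       : input sample, shape (dim,)
    weights : map weight vectors, shape (n_units, dim)
    """
    # Euclidean distance from the sample to every weight vector
    distances = np.linalg.norm(weights - x, axis=1)
    # the BMU is the unit with the smallest distance (greatest similarity)
    return np.argmin(distances)
```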
After finding the BMU, the weight vectors of the SOM are updated. The weight vectors of the BMU and its topological neighbors are moved closer to the input vector in the input space. This adaptation procedure stretches the BMU and its topological neighbors towards the sample vector. This is illustrated in figure 2.2, where the input vector given to the network is marked by an x.
Figure 2.2: Updating the best matching unit (BMU) and its
neighbors towards the input sample marked with x.
The solid and dashed lines correspond to the situation before and after
updating, respectively.
The SOM update rule for the weight vector of the unit $i$ is

$$m_i(t+1) = m_i(t) + h_{ci}(t)\,[x(t) - m_i(t)],$$

where $t$ denotes time. The $x(t)$ is the input
vector randomly drawn from the input data set at time $t$ and $h_{ci}(t)$
the neighborhood kernel around the winner unit $c$ at time
$t$. The neighborhood kernel is a non-increasing function of time and of
the distance of unit $i$ from the winner unit $c$. It defines the
region of influence that the input sample has on the SOM. The kernel
is formed of two parts: the neighborhood function $h(d,t)$ and the learning
rate function $\alpha(t)$:

$$h_{ci}(t) = h(\|r_c - r_i\|,\, t)\,\alpha(t),$$

where $r_i$ is the location of unit $i$ on the map grid.
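As an illustration, one update step could be sketched in Python as follows; the array layout, the two-dimensional grid coordinates, and the argument names are assumptions made for the example, with the neighborhood function passed in as a callable.

```python
import numpy as np

def som_update(weights, grid, x, c, h, t, alpha_t):
    """One SOM update step following the rule above.

    weights : (n_units, dim)  weight vectors m_i(t)
    grid    : (n_units, 2)    locations r_i of the units on the map grid
    x       : (dim,)          input sample x(t)
    c       : int             index of the winner unit (BMU)
    h       : callable        neighborhood function h(d, t)
    t       : int             current time step
    alpha_t : float           learning rate alpha(t)
    """
    # distances of the units from the winner on the map grid: ||r_c - r_i||
    d = np.linalg.norm(grid - grid[c], axis=1)
    # neighborhood kernel h_ci(t) = h(||r_c - r_i||, t) * alpha(t)
    h_ci = h(d, t) * alpha_t
    # m_i(t+1) = m_i(t) + h_ci(t) * [x(t) - m_i(t)]
    return weights + h_ci[:, None] * (x - weights)
```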
The simplest neighborhood function is the bubble: it is constant over
the whole neighborhood of the winner unit and zero elsewhere (see
figure 2.3a). Another is the Gaussian neighborhood
function $e^{-\frac{d^2}{2\sigma_t^2}}$, where $\sigma_t$ is the
neighborhood radius at time $t$ (see figure 2.3b). It gives slightly better results but is
computationally somewhat heavier. Usually the neighborhood radius is
larger at first and is decreased linearly to one during the training.
Figure 2.3: The two basic neighborhood functions.
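For concreteness, the two basic neighborhood functions could be sketched in Python as follows; the argument names are illustrative, and either function could be passed as the callable `h` in the update sketch above.

```python
import numpy as np

def bubble(d, radius):
    """Bubble neighborhood: constant inside the radius, zero elsewhere."""
    return np.where(d <= radius, 1.0, 0.0)

def gaussian(d, sigma_t):
    """Gaussian neighborhood: exp(-d^2 / (2 * sigma_t^2))."""
    return np.exp(-d**2 / (2 * sigma_t**2))
```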
The learning rate $\alpha(t)$ is a decreasing function of time. Two
commonly used forms are a linear function and a function
inversely proportional to time:

$$\alpha(t) = \alpha_0\left(1 - \frac{t}{T}\right) \qquad \text{or} \qquad \alpha(t) = \frac{A}{t + B},$$

where $\alpha_0$ is the initial learning rate, $T$ is the training length,
and $A$ and $B$ are some suitably selected constants. Use
of the latter function type ensures that all input samples
have approximately equal influence on the training
result [31].
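Similarly, the two learning rate schedules could be sketched as follows; the constants $\alpha_0$, $T$, $A$ and $B$ are assumed to be chosen by the user.

```python
def alpha_linear(t, alpha0, T):
    """Linear learning rate: decreases from alpha0 towards zero over T steps."""
    return alpha0 * (1.0 - t / T)

def alpha_inv(t, A, B):
    """Learning rate inversely proportional to time: A / (t + B)."""
    return A / (t + B)
```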
The training is usually performed in two phases. In the first phase, a relatively large initial learning rate $\alpha_0$ and neighborhood radius are used. In the second phase, both the learning rate and the neighborhood radius are small right from the beginning. This procedure corresponds to first tuning the SOM approximately to the same space as the input data and then fine-tuning the map. If the linear initialization procedure is used, the first training phase can be skipped. There are several rules of thumb, found through experiments, for choosing suitable values [21].
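Putting the pieces together, a rough, self-contained sketch of such a two-phase training loop is given below; the linearly decreasing schedules, the Gaussian kernel, and all parameter values are illustrative assumptions, not recommendations from [21].

```python
import numpy as np

def train_phase(weights, grid, data, n_steps, alpha0, sigma0, rng):
    """One training phase with linearly decreasing learning rate and radius."""
    for t in range(n_steps):
        x = data[rng.integers(len(data))]                    # random sample x(t)
        c = np.argmin(np.linalg.norm(weights - x, axis=1))   # BMU index
        alpha_t = alpha0 * (1.0 - t / n_steps)               # learning rate alpha(t)
        sigma_t = max(1.0, sigma0 * (1.0 - t / n_steps))     # radius decreased to one
        d = np.linalg.norm(grid - grid[c], axis=1)           # grid distances to winner
        h_ci = np.exp(-d**2 / (2 * sigma_t**2)) * alpha_t    # Gaussian kernel * alpha(t)
        weights = weights + h_ci[:, None] * (x - weights)    # update rule
    return weights

# Phase 1: rough tuning with large alpha and radius; phase 2: fine-tuning.
# (weights, grid, data and rng are assumed to be initialized elsewhere;
#  the step counts and parameter values below are purely illustrative.)
# weights = train_phase(weights, grid, data, 1000, alpha0=0.5,  sigma0=5.0, rng=rng)
# weights = train_phase(weights, grid, data, 5000, alpha0=0.05, sigma0=1.0, rng=rng)
```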
There are many variants of the basic SOM. One variation theme is to use neuron-specific learning rates and neighborhood sizes. Another is to use adaptive or even growing map structures. The goal of these variations is to enable the SOM to follow the topology of the underlying data set better or to achieve better quantization results. A common drawback of these methods is that the visualization properties suffer or that the algorithm becomes computationally heavier.