Example: training and test sets

sTrainD.data: numerical measurement values for the training samples, 4 measurements /individual
sTrainD.label: true class labels of the training set: 'Setosa','Versicolor' or 'Virginica'
sTestD.data: numerical measurement values for the test samples, 4 measurements /individual
sTestD.label: true class labels of the test set: 'Setosa','Versicolor' or 'Virginica'

Let's examine only one of the measurements as an example.

Below is a plot of the values of training set variable 'Sepal width' (sTrainD.data(:,2)). For the training set, the true class corresponding to each value is known. It can be observed that the species 'Setosa' has somewhat wider sepal than the two othe species.

On the basis of this analysis, we can state that if the sepal width is larger than 3.4 mm the sample is a 'Setosa' with a rather large probability.

If we received new measurements of sepal widths, we could draw them as below.

Our task would be to guess/decide the species of the new samples on basis of the sepal withs only. On the basis of the above analysis, we could classify all the samples with sepal with over 3.4 mm as 'Setosa'. Probably we would make classification errors: we would classify 'Virginica' and 'Versicolor' samples as 'Setosas'.

In order to assess the quality of the classifier we use, the data set is usually partitioned into (at least) two, training and test sets. By knowing the correct labels also for the test set it is straightforward to compute how many of the proposed classifications were correct.

If all the samples with sepal width over 3.4mm were classified as 'Setosa', at least one 'Versicolor' (index 26) would be misclassified

NOTE! In the data set, there are four different measurements and all of them are used jointly. The classification is performed primarily by the KNN algorithm.