T-61.2010 Harjoitustyö #2: "K nearest neighbour classification"

The KNN classifier is applied to the classical iris data set.


Author of the work

Opiskelija: Matti Meikäläinen

Student ID: 12345A

E-mail address: mmeikala@cc.hut.fi

Pre-requisites

In preparation for the assignment, it is useful to familiarise yourself with the basics of pattern recognition and with problem 3 of the 3rd "paper exercise", which concerns the KNN classifier. In addition, you may want to review basic linear algebra and become acquainted with Matlab.

Description of the assignment

The purpose of this assignment is to demonstrate the nearest neighbour classifier introduced in the lectures. The used data set is the classical Iris data set that contains length and width measurements of petals and sepals in three species of the lily family.

For the assignment, the WWW pages of the course will contain a link to a personalised Matlab file for each student enrolled on the course. A training set sTrainD of 60 samples, and a test set sTestD of 30-60 samples to be classified, are loaded from this file. The training set contains 20 examples of each lily species, the test set 10-20. The correct labels of the test set are included so that the classification accuracy can be evaluated.

Your task is to:

  • (1) examine the data and produce clarifying illustrations. Do the three classes separate directly on the basis of the leaf measurements?
  • (2) compute the Euclidean distance from every test set sample (to be classified) to every sample in the training set.
  • (3) classify each test set sample x by taking its K nearest neighbours in the training set and assigning x to the class to which most of them belong. Choose some small (odd) value for K.
  • (4) calculate the percentage of incorrect classifications once all the 30-60 test vectors have been processed. Since the correct classes of the test samples are known, we can determine whether the assigned classes were correct or not.
  • (5) repeat the above with another value of K.
  • (6) answer the questions in the "Discussion" section.

The code and documentation are to be written individually using the personal data sets. However, discussion in small groups or in the newsgroup is very welcome. Plagiarism is prohibited. (The CS department's instructions on dealing with plagiarism apply.)

Documenting your work

You have to return a document and the code as email attachments by 15.1.2008 to vvi@cis.hut.fi. Use the subject "T-61.2010 assignment #2 STUD_ID", where STUD_ID is replaced by your own student ID.

The document must contain your name, student ID and email address, a short description of the phases of the assignment, your results, and your answers to the questions in (6). Convert your document into PDF.

Copy your Matlab code into a file and include it as an email attachment.

Matlab functions for handling the data sets

Hints on the general use of Matlab can also be found, e.g., in computer exercise #3 and via internet search (keywords e.g. "Matlab tutorial").

On the WWW page of the course you will also find helper functions:

  • myKnnClass.m: given a test vector "x" with a known class and a vector "distances" containing the distances from "x" to all training samples, this function classifies "x" using the K nearest neighbours rule.
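
The exact call signature of myKnnClass.m is not specified here; based on the description above, a call might look roughly like the following (the argument names and their order are assumptions of ours, not the documented interface — check "help myKnnClass" for the actual one):

```matlab
% Hypothetical usage sketch -- verify with "help myKnnClass".
% "distances" would be a 60x1 vector of distances from one test vector
% to all training samples; K is the number of neighbours to use.
predictedLabel = myKnnClass(distances, sTrainD.label, K);
```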

For matrices:

  • help: help on any Matlab command
  • doc: richer documentation of the commands
  • size: size of a matrix
  • min, max, sort: minimum, maximum and sorting
  • sum: sum of the elements of a vector
  • reshape: altering the size of a matrix
  • repmat: copying a matrix
  • diag: picks values from the diagonal of a matrix
  • for i = [1:4], i*i, end; : "for" loop
  • if (i==4), j=0, else, j=1, end; : "if" construct
  • disp: writes neatly on the screen
  • num2str: converts a number to a string
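
A few of the matrix commands above in action (a standalone snippet, not part of the assignment itself; the variable names are our own):

```matlab
A = [3 1; 2 4];
size(A)                          % size of A: 2 rows, 2 columns
[smallest, row] = min(A(:,1));   % smallest value in column 1 and its row index
sortedCol = sort(A(:,2));        % column 2 sorted in ascending order
B = repmat([1 2], 3, 1);         % 3x2 matrix made by copying the row [1 2]
s = sum(A(:,1));                 % sum of the elements of column 1
disp(['Smallest value in column 1: ' num2str(smallest)]);
```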

For drawing images:

  • figure: new window; figure(Z) opens or activates window Z
  • plot: basic command for plotting
  • hold on: directs all subsequent drawing operations to the same window without erasing the results in between
  • clf: clears a window
  • close all: closes all windows
  • subplot: divides a window into several subfigures
  • title: adds a title to a figure
  • xlabel, ylabel: name the image axes
  • legend: adds a legend
  • text: inserts text into an image
  • grid on: shows grid lines in the image
  • print: saves an image as .png, .eps, .jpg, .tif, ...
  • saveas: saves an image as .png, .eps, .jpg, .tif, ...

Data file

Fetch your own personal data file XXXXXY.mat (Topic #2) via http://www.cis.hut.fi/Opinnot/T-61.2010/Harjoitustyo/.

The data file is based on the classical Iris data that contains 150 samples of three classes (species in the lily family). The dimension of the data is four: the feature vectors consist of measurements of:

  • 'Sepal length'
  • 'Sepal width'
  • 'Petal length'
  • 'Petal width'

see e.g. http://en.wikipedia.org/wiki/Sepal. These four measurements of each individual plant result in a data matrix of size (N x 4).

The measurements originate from measuring individuals of three species in the lily family: 'Setosa', 'Versicolor' and 'Virginica'.

For this assignment the data has been partitioned in two. The training part sTrainD contains 60 samples and the to-be-classified testing part sTestD approximately 30-60 samples. The training data has been partitioned so that its first 20 samples belong to class 'Setosa', the next 20 to class 'Versicolor', and the last 20 to 'Virginica'. In the test set the classes (species) are mixed. In practice, each student's data set is slightly different.

Both variables are Matlab structs whose components can be accessed as follows: sTrainD (like sTestD) includes two fields; sTrainD.data contains the data set (60x4) and sTrainD.label contains a cell array (60x1) with the true class label of each sample. For instance, sTrainD.data(25,3) is the 3rd variable of the 25th plant. The true class of that plant ('Versicolor') is accessed via sTrainD.label{25}; note the curly braces!
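
The struct access described above can be tried out directly in the Matlab prompt, for example (XXXXXY.mat stands for your own personal file name):

```matlab
load('XXXXXY.mat');     % replace XXXXXY.mat with your own data file
size(sTrainD.data)      % the training data matrix is 60x4
sTrainD.data(25,3)      % 3rd measurement (petal length) of the 25th plant
sTrainD.label{25}       % its true class label -- note the curly braces
unique(sTrainD.label)   % the three species names occurring in the labels
```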


An example run

(1) Examining the data matrix in Matlab

Fetch the personalised data file XXXXXY.mat (Topic #2) via http://www.cis.hut.fi/Opinnot/T-61.2010/Harjoitustyo/. (NOTE! Mail Ville.Viitaniemi () tkk.fi if you cannot find a data file with your student number.)

Read the file into the Matlab workspace with the command load. This results in two variables, sTrainD and sTestD, in the workspace.

You may familiarise yourself with the data by typing sTrainD in Matlab and pressing "enter". Next sTrainD.data and "enter", and then sTrainD.label and "enter"; proceed analogously for sTestD.

All the values of the first variable (sepal length) are given by sTrainD.data(:,1). The petal widths of all 'Versicolor' plants are given by sTrainD.data(21:40,4), as the 'Versicolor' plants reside in samples 21..40. plot can be used to visualise the data. A histogram of the data, an estimate of the distribution of the numbers, can be generated with the command hist. A separate example can be accessed through this link.
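
For instance, a scatter plot of two of the four variables, coloured by class, could be produced roughly as follows (a sketch using the class layout of the training set described above; the choice of variables and plot symbols is ours):

```matlab
% Scatter plot of sepal length vs. petal width, one symbol per class.
figure(1); clf; hold on;
plot(sTrainD.data(1:20,1),  sTrainD.data(1:20,4),  'ro');  % 'Setosa'
plot(sTrainD.data(21:40,1), sTrainD.data(21:40,4), 'g+');  % 'Versicolor'
plot(sTrainD.data(41:60,1), sTrainD.data(41:60,4), 'bx');  % 'Virginica'
xlabel('Sepal length'); ylabel('Petal width');
legend('Setosa', 'Versicolor', 'Virginica'); grid on;
title('Training set: two of the four variables');

% Histogram of one variable over all training samples.
figure(2); hist(sTrainD.data(:,1));
title('Sepal length, all training samples');
```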

(2) Distance

Here we calculate the Euclidean distance from each sample to be classified to every sample in the training set, (60 x M) distances altogether. For example, the square of the distance between the 4D points [5.1000 3.5000 1.4000 0.2000] and [5.5000 2.4000 3.7000 1.0000] is E = (5.1-5.5)^2 + (3.5-2.4)^2 + (1.4-3.7)^2 + (0.2-1.0)^2 = 7.3. (In the following we only look for the smallest distances, so we do not have to take the square root: the square root preserves the relative ranking of the distances.)
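
A straightforward double loop over the two data sets is enough here; a sketch (the matrix name D and the loop variables are our own):

```matlab
% Squared Euclidean distances from every test sample to every training
% sample, collected into a (60 x M) matrix D.
M = size(sTestD.data, 1);    % number of test samples
D = zeros(60, M);
for j = 1:M
  for i = 1:60
    diffVec = sTrainD.data(i,:) - sTestD.data(j,:);
    D(i,j)  = sum(diffVec .^ 2);   % square root omitted: ranking unchanged
  end
end

% Check against the worked example above:
sum(([5.1 3.5 1.4 0.2] - [5.5 2.4 3.7 1.0]) .^ 2)   % gives 7.3
```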

(3) K-nearest neighbour

Some value is chosen for K and the test samples are classified. For parts (3) and (4) you can use the previously mentioned function myKnnClass.m.
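
For reference, the majority vote that the K nearest neighbours rule performs can also be written out by hand, roughly as follows (a sketch, assuming D is the (60 x M) squared-distance matrix from step (2); all variable names here are our own, and this is not the code of myKnnClass.m itself):

```matlab
K = 3;                        % some small odd value for K
M = size(sTestD.data, 1);
predicted = cell(M, 1);       % predicted class label of each test sample
for j = 1:M
  [dummy, order] = sort(D(:,j));               % ascending distances
  neighbourLabels = sTrainD.label(order(1:K)); % labels of K nearest samples
  % Count the votes for each distinct label and pick the most frequent.
  candidates = unique(neighbourLabels);
  votes = zeros(length(candidates), 1);
  for c = 1:length(candidates)
    votes(c) = sum(strcmp(neighbourLabels, candidates{c}));
  end
  [dummy, best] = max(votes);
  predicted{j} = candidates{best};
end
```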

(4) Classification error

The classification error is computed. An example:

K=1. Error: 3.3333%
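
The error percentage itself is a simple comparison of predicted labels against the true labels in sTestD.label; a sketch (the cell array "predicted" is an assumed name for your own classification results, not part of the provided code):

```matlab
M = length(sTestD.label);
nErrors = 0;
for j = 1:M
  if ~strcmp(predicted{j}, sTestD.label{j})   % compare to the true class
    nErrors = nErrors + 1;
  end
end
errorPct = 100 * nErrors / M;
disp(['K=' num2str(K) '. Error: ' num2str(errorPct) '%']);
```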

(5) Another value of K

Another value is chosen for K and the experiments are repeated. For example:

K=5. Error: 5%

(6) Discussion

Consider and discuss the following in your document:

  • How do the three species 'Setosa', 'Versicolor' and 'Virginica' separate from each other?
  • Which of the variables ('Sepal length', 'Sepal width', 'Petal length', 'Petal width') separated the species?
  • What kind of classification errors were obtained with different values of K? How does varying K affect the result and the computation in general?
  • What could be an alternative way to perform the classification? What would be the easiest way (of course with as small an error as possible)?
  • Additionally, imagine a data set with one variable varying between 20 and 200 metres (e.g. uniformly) and another varying between 0.35 mm and 0.94 mm. What kind of problems could arise (e.g. when calculating the distances between the points)? How could this situation be remedied?
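
As background for the last question: one widely used way of putting variables with very different scales on an equal footing is to normalise each column, for example to zero mean and unit variance using the training set's statistics (a sketch with our own variable names; whether and how to apply this is part of your discussion):

```matlab
% Z-score normalisation of each of the four columns, using the
% training set's mean and standard deviation for both sets.
mu    = mean(sTrainD.data);    % 1x4 vector of column means
sigma = std(sTrainD.data);     % 1x4 vector of column standard deviations
M = size(sTestD.data, 1);
trainN = (sTrainD.data - repmat(mu, 60, 1)) ./ repmat(sigma, 60, 1);
testN  = (sTestD.data  - repmat(mu, M, 1))  ./ repmat(sigma, M, 1);
```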
At the end of the document, please write feedback on completing the assignment and estimate the time you used. Remember to mention if you co-operated with someone else!