T-61.3050 Machine Learning: Basic Principles (2007) - Project: Web spam detection, Results
The labeled test data can be found here. Check that Antti did not make a mistake when computing the values below!
The failure rate of classification is the fraction of incorrectly labeled hosts, that is, the number of cases where your label does not equal the correct label, divided by the total number of examples. Note that hosts labeled undecided were included, which causes the results to look slightly worse.
The mean square error is computed as the sum of squared prediction errors divided by the number of examples. Here, too, the hosts labeled undecided were included.
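For concreteness, the following is a minimal Python sketch of how these two measures could be computed. All labels and scores below are invented for illustration, and the actual evaluation script may differ in details (for instance, in how the undecided hosts were scored).

    # Minimal sketch of the two evaluation measures described above.
    # Labels: 1 = spam, 0 = normal. All values below are invented.
    correct     = [1, 0, 1, 1, 0]            # true labels of the test hosts
    predicted   = [1, 0, 0, 1, 1]            # submitted class labels
    pred_scores = [0.9, 0.2, 0.4, 0.7, 0.6]  # submitted real-valued predictions

    n = len(correct)

    # Failure rate: fraction of hosts whose label differs from the truth.
    failure_rate = sum(c != p for c, p in zip(correct, predicted)) / float(n)

    # Mean square error: average squared difference between the true
    # label and the real-valued prediction.
    mse = sum((c - s) ** 2 for c, s in zip(correct, pred_scores)) / float(n)

    print(failure_rate, mse)  # 0.4 0.172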
Team-ID      Failure rate
-------------------------
Team-7ad8d   0.274588340009
Team-e5ce4   0.276368491322
Team-f6622   0.294615042279
Team-cc666   0.327547841567
Team-ee8c8   0.336003560303
Team-42b44   0.388518024032
Team-b51ed   0.392968402314
Team-21e71   0.398308856253
Team-0f77a   0.419225634179
Team-a95ea   0.425901201602
Team-ff9d8   0.435692033823
Team-d0a00   0.44147752559
Team-4040d   0.460169114375
Team-2b64d   0.481085892301
Team-9008e   0.486426346239
Team-744c7   0.534490431687
Team-fefa3   0.534490431687
Team-cee97   0.538050734312
Team-62246   0.543391188251
Team-eec8c   0.556297285269
Team-7787f   0.585224744103
Team-b982a   0.607476635514
Team-b6ba7   0.608811748999
Team-157cc   0.62082777036
Team-7a454   0.642189586115
Team-b7333   0.649154051647
Team-77b77   0.650645304851
Team-01c3b   0.659394479074
Team-2f232   0.683133066311
Team-3386e   0.688473520249
Team-aa8ee   0.692033822875
Team-c5754   0.770764119601
Team-ID      MSE
-------------------------
Team-7ad8d   0.102949478676
Team-cc666   0.138479582641
Team-e5ce4   0.155747737041
Team-21e71   0.178277616566
Team-f6622   0.195007215962
Team-4040d   0.19747817515
Team-b51ed   0.215397029472
Team-ff9d8   0.21546842703
Team-ee8c8   0.219859966722
Team-fefa3   0.22152931401
Team-d0a00   0.236595833651
Team-42b44   0.247992874144
Team-cee97   0.250493647547
Team-2b64d   0.251683049301
Team-9008e   0.284029360847
Team-0f77a   0.302909268902
Team-aa8ee   0.310055022062
Team-62246   0.310853491817
Team-a95ea   0.322837043961
Team-744c7   0.339145822332
Team-7787f   0.360582532501
Team-01c3b   0.368200532919
Team-eec8c   0.430114983831
Team-77b77   0.435028286348
Team-b7333   0.449710686174
Team-157cc   0.456305532728
Team-b6ba7   0.46011679614
Team-b982a   0.466181257262
Team-3386e   0.481379160046
Team-7a454   0.506751267284
Team-2f232   0.533866105672
Team-c5754   0.579260267661
1) Results for classification and prediction are mostly consistent: a team with good classification performance is likely to have good prediction performance as well, and vice versa.
2) There is substantial variance in the results, i.e., some teams performed very well in both tasks, while others probably did considerably worse than expected. Don't worry! In your final report you can discuss why your first approach failed (if it failed), and you can improve the method accordingly. The grade for your project will be based on the final report. This also shows that the problem is nontrivial, yet solvable. Note that more complex methods are more prone to bad performance due to programming and other implementation errors.
3) Many of you have most likely fallen victim to the following rather nasty property of the data sets: the prior probabilities of observing a "normal" and a "spam" host differ considerably between the training and test data sets. In the training data the fraction of spam hosts is approximately 10%, while in the test data about 66% of the examples are spam.
Simple Bayesian approaches that estimate the class priors from the training data will thus have problems. Note that the task is not easy: on the test data the "dummy" classifier that puts everything into the "spam" class has an error of approx. 0.33, and only 4-5 teams were able to beat this. A dummy classifier based on the training data that puts everything into the "normal" class would have an error of around 0.66; almost all teams were able to do better than this. Also, for example, k-NN will suffer from the skewed class distribution, as in the training data the nearest neighbor of any new host will most likely be "normal", simply because there are so many more of these around.
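As an illustration of the prior-mismatch point, here is a small Python sketch showing how posteriors estimated under the training priors can be reweighted. It assumes the test priors were known, and the example posterior values are invented:

    # Hypothetical sketch: correcting Bayes posteriors when the test-set
    # class priors differ from the training-set priors. The priors are the
    # approximate fractions quoted above; the example posterior is made up.
    train_prior = {"normal": 0.90, "spam": 0.10}  # ~10% spam in training data
    test_prior  = {"normal": 0.34, "spam": 0.66}  # ~66% spam in test data

    def reweight(posterior):
        # Rescale p_train(c|x) by p_test(c) / p_train(c) and renormalize;
        # this yields p_test(c|x), provided the class-conditional densities
        # p(x|c) are the same in both data sets.
        unnorm = dict((c, posterior[c] * test_prior[c] / train_prior[c])
                      for c in posterior)
        z = sum(unnorm.values())
        return dict((c, v / z) for c, v in unnorm.items())

    # A host the training-prior classifier calls "normal" with 70% confidence...
    print(reweight({"normal": 0.70, "spam": 0.30}))
    # ...flips to "spam" (p ~ 0.88) once the test priors are accounted for.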
4) A standard approach for handling a situation where one class is considerably rarer than the other is the technique discussed in Problem Set 3, Problem 1: when training the classifier, the cost function should penalize a misclassified "spam" host more heavily than a misclassified "normal" host.
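As a concrete sketch of this idea (the cost values below are assumptions, not necessarily those of Problem Set 3), an asymmetric misclassification cost simply shifts the decision threshold of a probabilistic classifier away from 0.5:

    # Cost-sensitive decision rule: with cost C_fn for missing a spam host
    # and C_fp for flagging a normal host, the expected cost is minimized
    # by predicting "spam" whenever p(spam|x) > C_fp / (C_fp + C_fn).
    C_fn = 9.0  # assumed cost of misclassifying a spam host (the rare class)
    C_fp = 1.0  # assumed cost of misclassifying a normal host

    threshold = C_fp / (C_fp + C_fn)  # = 0.1 instead of the usual 0.5

    def classify(p_spam):
        return "spam" if p_spam > threshold else "normal"

    for p in (0.05, 0.20, 0.60):
        print(p, classify(p))  # only the 0.05 host is called "normal"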
Why was the class distribution in the test data so different? If the fraction of spam hosts in the test set were also ~10%, the dummy classifier that declares everything "normal" would be _extremely hard_ to beat. Having considerably more spam hosts than normal ones in the test set allows us to distinguish classifiers/predictors that truly recognize spam from those that simply rely on the fact that spam hosts are relatively rare. Obviously, a classifier that predicts everything as "normal" would not be very useful in practice.
Page maintained by firstname.lastname@example.org, last updated Monday, 26-Nov-2007 19:03:25 EET