T-61.3050 Machine Learning: Basic Principles (2007) - Project: Web spam detection, Results
The labeled test data can be found here. Check that Antti did not make a mistake when computing the values below!
The failure rate of classification is the fraction of incorrectly labeled hosts, that is, the number of cases where your label does not equal the correct label, divided by the total number of examples. Note that hosts labeled undecided were included, which causes the results to look slightly worse.
The mean square error is computed as the sum of squared prediction errors divided by the number of examples. Here, too, the hosts labeled undecided were included.
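For concreteness, the following is a minimal Python sketch of how these two measures could be computed. All labels and scores below are invented for illustration, and the actual evaluation script may differ in details (for instance, in how the undecided hosts were scored).

    # Minimal sketch of the two evaluation measures described above.
    # Labels: 1 = spam, 0 = normal. All values below are invented.
    correct     = [1, 0, 1, 1, 0]            # true labels of the test hosts
    predicted   = [1, 0, 0, 1, 1]            # submitted class labels
    pred_scores = [0.9, 0.2, 0.4, 0.7, 0.6]  # submitted real-valued predictions

    n = len(correct)

    # Failure rate: fraction of hosts whose label differs from the truth.
    failure_rate = sum(c != p for c, p in zip(correct, predicted)) / float(n)

    # Mean square error: average squared difference between the true
    # label and the real-valued prediction.
    mse = sum((c - s) ** 2 for c, s in zip(correct, pred_scores)) / float(n)

    print(failure_rate, mse)  # 0.4 0.172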
Team-ID      Failure rate
-------------------------
Team-7ad8d   0.274588340009
Team-e5ce4   0.276368491322
Team-f6622   0.294615042279
Team-cc666   0.327547841567
Team-ee8c8   0.336003560303
Team-42b44   0.388518024032
Team-b51ed   0.392968402314
Team-21e71   0.398308856253
Team-0f77a   0.419225634179
Team-a95ea   0.425901201602
Team-ff9d8   0.435692033823
Team-d0a00   0.44147752559
Team-4040d   0.460169114375
Team-2b64d   0.481085892301
Team-9008e   0.486426346239
Team-744c7   0.534490431687
Team-fefa3   0.534490431687
Team-cee97   0.538050734312
Team-62246   0.543391188251
Team-eec8c   0.556297285269
Team-7787f   0.585224744103
Team-b982a   0.607476635514
Team-b6ba7   0.608811748999
Team-157cc   0.62082777036
Team-7a454   0.642189586115
Team-b7333   0.649154051647
Team-77b77   0.650645304851
Team-01c3b   0.659394479074
Team-2f232   0.683133066311
Team-3386e   0.688473520249
Team-aa8ee   0.692033822875
Team-c5754   0.770764119601
Team-ID      MSE
-------------------------
Team-7ad8d   0.102949478676
Team-cc666   0.138479582641
Team-e5ce4   0.155747737041
Team-21e71   0.178277616566
Team-f6622   0.195007215962
Team-4040d   0.19747817515
Team-b51ed   0.215397029472
Team-ff9d8   0.21546842703
Team-ee8c8   0.219859966722
Team-fefa3   0.22152931401
Team-d0a00   0.236595833651
Team-42b44   0.247992874144
Team-cee97   0.250493647547
Team-2b64d   0.251683049301
Team-9008e   0.284029360847
Team-0f77a   0.302909268902
Team-aa8ee   0.310055022062
Team-62246   0.310853491817
Team-a95ea   0.322837043961
Team-744c7   0.339145822332
Team-7787f   0.360582532501
Team-01c3b   0.368200532919
Team-eec8c   0.430114983831
Team-77b77   0.435028286348
Team-b7333   0.449710686174
Team-157cc   0.456305532728
Team-b6ba7   0.46011679614
Team-b982a   0.466181257262
Team-3386e   0.481379160046
Team-7a454   0.506751267284
Team-2f232   0.533866105672
Team-c5754   0.579260267661
1) Results for classification and prediction are mostly consistent: a team with good classification performance is likely to have good prediction performance as well, and vice versa.
2) There is substantial variance in the results, i.e., some teams performed very well in both tasks, while others probably did considerably worse than expected. Don't worry! In your final report you can discuss why your first approach failed (if it failed), and you can improve the method accordingly. The grade for your project will be based on the final report. This also shows that the problem is nontrivial, yet solvable. Note that more complex methods are more prone to bad performance due to programming and other implementation errors.
3) Many of you have most likely fallen victim to the following rather nasty property of the data sets: the prior probabilities of observing a "normal" and a "spam" host differ considerably between the training and test data sets. In the training data the fraction of spam hosts is approximately 10%, while in the test data about 66% of the examples are spam.
Simple Bayesian approaches that estimate the class priors from the training data will thus have problems. Note that the task is not easy: on the test data the "dummy" classifier that puts everything into the "spam" class has an error of approx. 0.33, and only 4-5 teams were able to beat this. A dummy classifier based on the training data that puts everything into the "normal" class would have an error of around 0.66; almost all teams were able to do better than this. Also, for example, k-NN will suffer from the skewed class distribution, as in the training data the nearest neighbor of any new host will most likely be "normal", simply because there are so many more of these around.
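As an illustration of the prior-mismatch point, here is a small Python sketch showing how posteriors estimated under the training priors can be reweighted. It assumes the test priors were known, and the example posterior values are invented:

    # Hypothetical sketch: correcting Bayes posteriors when the test-set
    # class priors differ from the training-set priors. The priors are the
    # approximate fractions quoted above; the example posterior is made up.
    train_prior = {"normal": 0.90, "spam": 0.10}  # ~10% spam in training data
    test_prior  = {"normal": 0.34, "spam": 0.66}  # ~66% spam in test data

    def reweight(posterior):
        # Rescale p_train(c|x) by p_test(c) / p_train(c) and renormalize;
        # this yields p_test(c|x), provided the class-conditional densities
        # p(x|c) are the same in both data sets.
        unnorm = dict((c, posterior[c] * test_prior[c] / train_prior[c])
                      for c in posterior)
        z = sum(unnorm.values())
        return dict((c, v / z) for c, v in unnorm.items())

    # A host the training-prior classifier calls "normal" with 70% confidence...
    print(reweight({"normal": 0.70, "spam": 0.30}))
    # ...flips to "spam" (p ~ 0.88) once the test priors are accounted for.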
4) A standard approach for handling a situation where one class is considerably rarer than the other is the technique discussed in Problem Set 3, Problem 1: when training the classifier, the cost function should penalize a misclassified "spam" host more heavily than a misclassified "normal" host.
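As a concrete sketch of this idea (the cost values below are assumptions, not necessarily those of Problem Set 3), an asymmetric misclassification cost simply shifts the decision threshold of a probabilistic classifier away from 0.5:

    # Cost-sensitive decision rule: with cost C_fn for missing a spam host
    # and C_fp for flagging a normal host, the expected cost is minimized
    # by predicting "spam" whenever p(spam|x) > C_fp / (C_fp + C_fn).
    C_fn = 9.0  # assumed cost of misclassifying a spam host (the rare class)
    C_fp = 1.0  # assumed cost of misclassifying a normal host

    threshold = C_fp / (C_fp + C_fn)  # = 0.1 instead of the usual 0.5

    def classify(p_spam):
        return "spam" if p_spam > threshold else "normal"

    for p in (0.05, 0.20, 0.60):
        print(p, classify(p))  # only the 0.05 host is called "normal"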
Why was the class distribution in the test data so different? If the fraction of spam hosts in the test set were also ~10%, the dummy classifier that declares everything "normal" would be _extremely hard_ to beat. Having considerably more spam hosts than normal ones in the test set allows us to distinguish classifiers/predictors that truly recognize spam from those that simply rely on the fact that spam hosts are relatively rare. Obviously, a classifier that predicts everything as "normal" would not be very useful in practice.
Page maintained by firstname.lastname@example.org, last updated Monday, 26-Nov-2007 19:03:25 EET