Machine learning project 2007

Helsinki University of Technology → Department of Computer Science and Engineering →
Laboratory of Computer and Information Science → Teaching →
T-61.3050 Machine Learning: Basic Principles → 2007 → Project - Web spam detection

Project - Web spam detection - T-61.3050 - 2007

Introduction

The basic problem of information retrieval (IR) is to find relevant documents given a query. As the number of relevant documents may be very large, it is not enough to only separate relevant documents from non-relevant ones. The IR system has to rank the documents so that they can be shown to the user in decreasing order of relevance.

Traditionally IR systems have been designed to operate in a controlled environment, such as company databases, libraries, etc. In this kind of scenarios it is less likely that a malicious user tries to interfere with the normal operation of the IR system by constructing documents that get an artificially high rank in the search results. In decentralized environments without a common "editor" such as the WWW this assumption no longer holds. In fact, as the search engine has become the default gateway to information on the web for most of the users, website designers try to make their page appear as high in the search results as possible. Typically this is called search engine optimization (SEO). In general SEO is considered a good practice also among the search engines. However, the border between acceptable measures and "spamming" can be vague.

Web spam is a term used to describe web pages (or sites) that do not have any significant content themselves, but serve only to increase the rank of some other page in the search results. Modern WWW search engines (such as Google, Microsoft and Yahoo) rank web pages using the contents of the pages as well as the link structure of the web. In short, pages that have a high indegree (number of incoming links) are considered "good". Also pages with a low indegree can be considered good if the links they receive come from other good pages. The PageRank relevance measure initially used by Google is an example of this idea.

One popular technique for artificially increasing the rank of a target page is to create a number of other pages that have some links among themselves and additionally all have a link to the target page. If the number of pages in this system (also called a link farm) is high enough, the PageRank of the target page can be considerably increased. (Similar results can be achieved also due to human intervention. In the 2004 presidential election G. W. Bush was the target of a Google bomb that made his biography at the White House website appear as the first result on Google with the query "miserable failure".)

The task

In this project the task is to use machine learning techniques for recognizing web spam. We will use the WEBSPAM-UK2006 collection compiled by University of Rome "La Sapienza" and University of Milan with the support of the DELIS EU - FET research project, published under a Creative Commons license.

The data consists of a crawl of the uk-domain (all websites whose hostname has the suffix .uk) from May 2006. We only consider the hostgraph, which has about 11400 nodes, each corresponding to one host. Note that a host may consist of multiple pages, but in this case information about all the pages has been aggregated to represent the host.

A number of hosts have been labeled by humans to three classes: "normal", "borderline" and "spam". Some of the hosts have been labeled by more than one person. Based on the ratings a spammicity score has been computed for each page. The spammicity of page p is defined as:

s(p) = 1/n * (score(p, e_1) + score(p, e_2) + ... + score(p, e_n))

where score(p, e_i) is the score assigned to page p by evaluator i, and

            
                [  0, if p is classified as "normal",
score(p, e_i) = [ 0.5 if p is classified as "borderline",
                [  1  if p is classified as "spam".

In the project you should consider two sub-tasks:

Classification: Classify pages to "spam" and "normal".
Prediction: Predict the spammicity of a page.

Note that these are two different problems, even though similar methods can be applied in both cases. Especially note that the predictor can be used for classification as well, but this may not give the best possible results.

Practical arrangements

You can work on the project either alone or in groups of two people (preferred). Both members of the team receive the same grade for this part of the course.

We are organizing a non-serious competition (or "challenge") to make the project slightly more interesting. First each participant is provided with a training set that lists the spammicity and correct class label for a number of hosts. Using this data both for training and validation you should find a classifier that recognizes spam hosts from the normal ones, and a predictor that gives estimates for the spammicity of a host.

In November we publish an unlabeled test set that you must label using your classifier and predictor. As we (the course staff) know the correct labels also for these hosts, we can compute the error of your predictions. In case of classification the error will be measured as the fraction of incorrectly labeled hosts, and in case of prediction the error will be the mean square error between your prediction and the correct spammicity.

The predictions must be sent to the course email address (t613050@james.hut.fi) no later than 23 November 2007 (Friday). Precise instructions will appear later on this page. Also, on the same day every team should hand in a preliminary version of their report that describes the work done so far. This report does not need to be very polished or complete, but it should already contain the basic ideas used in the solution.

The prediction scores of every team will be published on Monday, 26 November 2007. A number of teams with an interesting solution will be asked to prepare a short presentation (about 15 min) where they discuss their approach to the problem. Performance in the competition is not the only criteria for being chosen to give a talk. Mostly the selections will be made based on the preliminary reports. All presentations are given in the problem session on Friday, 30 November 2007.

After this the correct labels are published for the test set, and the teams are allowed to make modifications to their approach. However, please do not simply copy the method used by the teams with good performance in the competition! The idea is to give everybody an opportunity to tune their classifiers or maybe try something different altogether. This work must be discussed in the final report that should be handed in by 2 January 2008.

Also remember that web spam detection is a difficult task! Basically there is no upper limit on the complexity of the solution. Almost all research articles published on the topic employ very sophisticated and multi-phased approaches. The purpose of the project is not to (even try to!) replicate any of the methods you might find in the literature. For passing the project it is enough to use one of the basic algorithms, do the feature- and model selection parts as instructed, and submit a well written report. Do not use any method that you do not understand yourself!

As a summary, to complete the project you should deliver:

Predictions for the test set and a preliminary version of your project report by 23 November 2007.
For some of you: a presentation about your solution for 30 November 2007.
The final report by 2 January 2008.

The data

The data is stored in a number of files, each containing a different set of features computed from the crawled pages. The files are in ASCII format, one line for each host, with the features as a comma separated list. The first line of each file has the names of each feature. Brief descriptions of the features are given in separate files. You can use whatever combination of features you see fit.

Simple features: These are the hostname, number of pages under the host and length of the hostname. [download data]
Content based features: These are features based on the textual content of the pages on each host. Note that these are not available for all hosts! [download data] [download description]
Link based features: These are features based on the hostgraph. [download data] [download description]
Link based features transformed: These are different transformations (ratios, logarithms, etc.) of the link based features. [download data] [download description]
Training data: Labels for a number of hosts. [download data]
NEW: Unlabeled test data: A list of hosts for which you should predict the class and the spammicity value. [download data]