Clustering exercise

Data

All the data here (except the GO categories) has been fetched from Ulitsky et al. However, the data files are not the original ones but a subsample that has been processed into a form that is easy to read in Matlab. These files do not even have the names of the genes or the GO classes, so doing any biology with them is impossible. I will later add versions that could be used in the project work.

Use "save as" from the browser to save these to your own folder.

Expression measurements for 1475 genes in 20 conditions
Protein-protein interactions, a list of proteins that interact (index 1 corresponds to the gene in the first row of the expression data, and so on)
GO categories, a binary matrix telling for each of the 1475 genes which of the 34 GO categories it belongs to
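
The indexing convention of the interaction list can be illustrated with a small sketch. The toy data and the variable names (pairs, nGenes, A, degree) are made up, and the sketch assumes the list is stored as a two-column matrix with one interacting pair per row, which may not match the actual file.

```matlab
% Toy interaction list (made up): each row is one interacting pair,
% and each number is a row index into the expression matrix.
pairs = [1 2; 1 3; 3 4];
nGenes = 4;   % would be 1475 for the course data

% Symmetric 0/1 adjacency matrix: A(i, j) = 1 if genes i and j interact
A = zeros(nGenes);
A(sub2ind(size(A), pairs(:, 1), pairs(:, 2))) = 1;
A = max(A, A');

% Number of interaction partners of each gene
degree = sum(A, 2);
```

A matrix like this is often more convenient than the pair list when the interactions are combined with the expression data, because row i of both refers to the same gene.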

Matlab-related stuff

Start Matlab with the command "matlab".
You can build everything on top of run_cluster.m (save it into your account and modify). In Matlab you can run a script by simply writing its name (without .m) on the command line.
compareToGo.m (also requires fisherextest.m) can be used as a basis for using the GO categories for validation. It is a crude implementation of enrichment testing without any multiple-testing corrections.
We only have direct links between proteins. If you want to know also indirect links (path of length X between two proteins) you can use smoothLinks.m.
If you haven't used Matlab before, you can consult for example
The graphical interface might sometimes feel sluggish or even crash. You can also open Matlab without the GUI using "matlab -nojvm".
The computing center has a wide range of Matlab toolboxes installed, but you are not likely to need them.
You can also do the exercise using R if you want to (it starts with the command "R"); it is widely used in the bioinformatics community and thus would be the preferred language for this kind of work. On this course Matlab was chosen only because more students are familiar with it.
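
The idea of indirect links mentioned above (a path of length X between two proteins) can be sketched with powers of the adjacency matrix. This is only an illustration on made-up toy data, not necessarily what smoothLinks.m does; the names A, X and R are my own.

```matlab
% Toy symmetric adjacency matrix for a chain of proteins 1-2-3-4 (made up)
A = [0 1 0 0;
     1 0 1 0;
     0 1 0 1;
     0 0 1 0];
nGenes = size(A, 1);

% R(i, j) is true if i and j are joined by a path of at most X links.
% Adding the identity lets shorter paths count as well.
X = 2;
R = (A + eye(nGenes))^X > 0;
R(logical(eye(nGenes))) = false;   % drop self-links
```

With X = 2, proteins 1 and 3 become linked through protein 2, while proteins 1 and 4 stay unlinked because they are three steps apart.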

Solution

You can take a look at run_cluster_solution.m if you couldn't finish the exercise. It is by no means a comprehensive treatment of the problem, but it has the commands for running the clusterings as well as an attempt at some external validation using the GO classes.
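
As a rough idea of what validation by GO classes involves, here is a minimal sketch of a one-sided enrichment test for a single cluster and a single GO category, followed by a Bonferroni correction. It is not the code from compareToGo.m or run_cluster_solution.m. All numbers are made up, the variable names are my own, and the choice of Bonferroni and the number of clusters are assumptions. Only base Matlab functions are used, so no toolbox is needed.

```matlab
% Toy counts (made up): N genes in total, K of them in the GO category,
% n genes in the cluster, k genes in both the cluster and the category.
N = 20; K = 5; n = 6; k = 4;

% Logarithm of the binomial coefficient, computed with gammaln
logchoose = @(a, b) gammaln(a + 1) - gammaln(b + 1) - gammaln(a - b + 1);

% One-sided p-value: probability of an overlap of at least k genes
% when the cluster is a random draw of n genes (hypergeometric null)
i = k:min(K, n);
p = sum(exp(logchoose(K, i) + logchoose(N - K, n - i) - logchoose(N, n)));

% Bonferroni correction over all tested (cluster, category) pairs
nTests = 3 * 34;   % assumed: 3 clusters times 34 GO categories
pCorrected = min(1, p * nTests);
```

The upper tail of the hypergeometric distribution is the same quantity that a one-sided Fisher exact test computes. Bonferroni is the simplest correction and quite conservative; other corrections could be swapped in on the last line.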