T-61.6080 Special course in bioinformatics II:
Data integration and fusion in bioinformatics V P, 5 cr

Project work

General

The idea is to carry out a small-scale research project that gives you hands-on experience with data fusion techniques. The report should resemble a short conference paper: it should describe the overall goal, the choices made and their justifications, the methods, and the results with relevant illustrations, in at most 6 pages (it can be shorter). The project work accounts for approximately 3 credit points, so scale your workload accordingly.

Key dates for the project

October 26th: deadline for the project plan
November 2nd: suggestions and approval of the plan by course administrators
December 8th, 8 p.m.: deadline for the report

If necessary, we will give suggestions for improving the report by December 22nd. The deadline for the final, improved versions is in January 2007.

Topics

Individual topics

Implement and analyze the data integration method that was used in the article of your seminar presentation. We can help you with preprocessed data and more detailed instructions, depending on the project. In fact, we already have suggestions for some of you.
Suggest a topic of your own.

General topic

Canonical correlation analysis (CCA) and its extensions in mining common variation in large-scale data sets.

Task: Apply PCA, CCA, kernel-CCA and/or gCCA to mine genes that behave similarly under different stress treatments in yeast. The underlying motivation is the assumption that yeast has a set of general stress genes, the "environmental stress response (ESR)" genes, that are always affected under stress. The task is to discover this set of genes by computationally fusing several gene expression data sets, all measured under some sort of stress. Compare the abilities of the methods to discover ESR genes, and discuss both the theoretical differences between the methods and the differences in practice. Preprocessed data is available for download (gzipped archive, right-click and save), but you can also get the original data and the papers from the authors' web pages. Links to two papers providing the expression data:

Implementation: Ready-made functions for these computations exist at least in R/Bioconductor and Matlab. Successful interpretation of the results will, however, require an understanding of the methods, their connections and their output. You can use the ready-made functions or write your own implementation.

Alternative data set: If you like, you can use a leukemia data set instead of the yeast stress case. The preprocessed data is available for download. There are five leukemia subclasses, and their biological replicates are indicated by the column names. The row names give the GeneID symbol for each gene expression profile across the different conditions. The raw gene expression data was normalized by RMA, and the values used here are the logarithmic differences between the leukemia samples and measurements from normal patients. The task is then to apply PCA/CCA/kCCA/gCCA between the different leukemia subtypes, using data from all or some of the five subtypes. You can try to validate your findings by comparing them to a leukemia gene list offered by the bioinformatics.org community. Before doing this, you should convert the gene symbols on the web page to the GeneID symbols of the ALL data. Another option is to look up the GeneID identifiers of the most distinguished genes in your analysis on the NCBI web server. Please ask if you encounter any problems.

You can get the original data through the original article by the authors. Link to the paper providing the expression data:

Ross et al. 2003

References

CCA tutorial by Magnus Borga
Application of Kernel CCA
Yamanishi et al. Bioinformatics 19 (Supplement 1): i323. (2003)

Note that CCA can be generalized to more than two data sets (=gCCA) as follows (see gCCA-related article):

Whiten each data set separately
Concatenate the data sets
Perform PCA for the concatenated set
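
The three steps above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the implementation of the gCCA-related article: rows are taken to be genes and columns the measurements of one data set, whitening is done via an SVD (any other whitening differs only by a rotation), and the names `whiten` and `gcca` as well as the synthetic data are made up for this example.

```python
import numpy as np

def whiten(X, eps=1e-10):
    """Center X (rows = genes, columns = measurements) and whiten it,
    so that the sample covariance of the result is the identity."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    keep = s > eps * s.max()          # drop numerically null directions
    return U[:, keep] * np.sqrt(X.shape[0] - 1)

def gcca(datasets, n_components=2):
    """gCCA sketch: whiten each set, concatenate, then run PCA.
    Returns the leading eigenvalues and the gene scores on them."""
    Z = np.hstack([whiten(X) for X in datasets])   # steps 1 and 2
    Zc = Z - Z.mean(axis=0)
    C = Zc.T @ Zc / (Z.shape[0] - 1)               # covariance of the concatenation
    evals, evecs = np.linalg.eigh(C)               # step 3: PCA (ascending order)
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], Zc @ evecs[:, order]

# Synthetic stand-in for three expression data sets sharing one signal
# (hypothetical data, only to show the call).
rng = np.random.default_rng(0)
n_genes = 300
shared = rng.normal(size=(n_genes, 1))
sets = [shared @ rng.normal(size=(1, k)) + 0.5 * rng.normal(size=(n_genes, k))
        for k in (4, 3, 5)]
evals, scores = gcca(sets, n_components=2)
print("leading eigenvalues:", evals)
print("score matrix shape:", scores.shape)
```

Genes with large absolute scores on the leading components are the candidates for common variation across the data sets; how to threshold them is left to the project.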

With two data sets, the solution equals ordinary CCA. The interpretation of the components differs slightly between the two cases, however, because PCA handles whitened data whereas CCA operates on the original, non-whitened data sets.
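
A short derivation may help to see why the two-set case reduces to ordinary CCA. The notation is introduced here for illustration only: $\tilde X$ and $\tilde Y$ denote the whitened data sets with $n$ rows, $K$ their cross-covariance, and $\rho_i$ the canonical correlations.

```latex
% Covariance of the concatenated, whitened data [\tilde X, \tilde Y]:
C = \begin{pmatrix} I & K \\ K^{\top} & I \end{pmatrix},
\qquad K = \frac{1}{n-1}\,\tilde X^{\top}\tilde Y .

% Singular value decomposition of K (the singular values are the
% canonical correlations):
K v_i = \rho_i u_i, \qquad K^{\top} u_i = \rho_i v_i .

% Hence the PCA eigenvectors of C are built from the canonical directions:
C \begin{pmatrix} u_i \\ \pm v_i \end{pmatrix}
  = \begin{pmatrix} u_i \pm K v_i \\ K^{\top} u_i \pm v_i \end{pmatrix}
  = (1 \pm \rho_i) \begin{pmatrix} u_i \\ \pm v_i \end{pmatrix}.
```

The leading principal components of the concatenated set therefore have eigenvalues $1 + \rho_i$, and their eigenvectors stack the canonical directions expressed in the whitened coordinates, which is the difference in interpretation mentioned above.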
