Laboratory of Computer and Information Science / Neural Networks Research Centre (CIS Lab), Helsinki University of Technology

T-61.6080 Special course in bioinformatics II:
Data integration and fusion in bioinformatics V P, 5 cr

Project work


The idea is to carry out a small-scale research project that gives you hands-on experience with data fusion techniques. The report should resemble a short conference paper. It should describe the overall goal, the choices made and their justifications, the methods, and the results with relevant illustrations. The maximum length is 6 pages, but the report can be shorter. The project work accounts for approximately 3 credit points, so scale the amount of work accordingly.

Key dates for the project

If necessary, we will give suggestions for improving the report by 22nd December. The deadline for the final, improved versions is in January 2007.


Individual topics

General topic

Canonical correlation analysis (CCA) and its extensions in mining common variation in large-scale data sets.

Task Apply PCA, CCA, kernel-CCA and/or gCCA to mine genes that behave similarly under different stress treatments in yeast. The underlying assumption is that yeast has a set of general stress genes, the "environmental stress response (ESR)" genes, that are always affected under stress. The task is to discover this set of genes by computationally fusing several gene expression data sets, all measured under some form of stress. Compare the ability of the methods to discover ESR genes, and discuss both the theoretical differences between the methods and the differences observed in practice. Preprocessed data are available for download (gzipped archive), but you can also get the original data and the papers from the authors' web pages. Links to the two papers providing the expression data:

  1. Gasch et al. paper
  2. Causton et al. paper
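To make the task more concrete, here is a minimal sketch of ordinary CCA between two expression matrices, written with NumPy. It is only an illustration under stated assumptions: genes are treated as samples (rows) and conditions as variables (columns), the data are synthetic rather than the Gasch or Causton measurements, and the names `whiten`, `cca`, `shared` and the ridge parameter `reg` are hypothetical choices for this example, not part of the course material.

```python
import numpy as np

def whiten(X, reg=1e-8):
    """Center the columns of X and rescale so their covariance is (nearly) the identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    # A small ridge term (an assumption of this sketch) keeps the inverse square root stable.
    evals, evecs = np.linalg.eigh(cov + reg * np.eye(cov.shape[0]))
    return Xc @ (evecs @ np.diag(evals ** -0.5) @ evecs.T)

def cca(X, Y, reg=1e-8):
    """Return canonical correlations and canonical variates (gene scores) for X and Y."""
    Xw, Yw = whiten(X, reg), whiten(Y, reg)
    cross_cov = Xw.T @ Yw / (X.shape[0] - 1)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(cross_cov, full_matrices=False)
    return s, Xw @ U, Yw @ Vt.T

# Synthetic stand-in for two stress experiments: 500 genes, 6 and 4 conditions.
rng = np.random.default_rng(0)
n_genes = 500
shared = rng.normal(size=n_genes)  # hypothetical common stress response per gene
X = np.outer(shared, np.ones(6)) + 0.5 * rng.normal(size=(n_genes, 6))
Y = np.outer(shared, np.ones(4)) + 0.5 * rng.normal(size=(n_genes, 4))

s, u, v = cca(X, Y)
# Genes with the largest absolute score on the first pair of canonical variates.
top_genes = np.argsort(-np.abs(u[:, 0] + v[:, 0]))[:20]
print("canonical correlations:", np.round(s, 3))
print("indices of top-scoring genes:", top_genes)
```

In this sketch the first canonical correlation is driven by the shared signal, while the remaining ones reflect only noise. With real data, the genes ranked highest on the leading canonical variates would be the candidates to compare against known ESR genes.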

Implementation Ready-made functions for these computations exist at least in R/Bioconductor and Matlab. Interpreting the results successfully will, however, require understanding the methods, their connections and their output. You can use the ready-made functions or write your own implementation.

Alternative data set If you like, you can use a leukemia data set instead of the yeast stress case. The preprocessed data are available for download. There are five leukemia subclasses; their biological replicates are indicated by the column names. Row names give the GeneID identifier of each gene expression profile across the different conditions. The raw gene expression data were normalized with RMA, and the values used here are logarithmic differences between the leukemia samples and measurements from normal patients. The task is then to apply PCA/CCA/kCCA/gCCA between the different leukemia subtypes, using data from all or some of the five subtypes. You can try to validate your findings by comparing them to a leukemia gene list offered by the community. Before doing this, you should convert the gene symbols on the web page to the GeneID identifiers of the ALL data. Another option is to look up the GeneID identifiers of the most distinguished genes of the analysis on the NCBI web server. Please ask if you encounter any problems.
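One possible way to quantify the comparison against a reference gene list is a hypergeometric overlap test. The sketch below is only a suggestion and is not prescribed by the course: the function name `hypergeom_tail`, the gene identifiers and the list sizes are all hypothetical, and it assumes the identifiers have already been converted to a common naming scheme.

```python
from math import comb

def hypergeom_tail(k, N, K, n):
    """P(overlap >= k) when n genes are drawn at random from N genes, K of which are in the reference list."""
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return favourable / comb(N, n)

# Hypothetical identifiers: a universe of 1000 genes, 50 of them in the reference list.
universe = [str(i) for i in range(1000)]
reference = set(universe[:50])

# Hypothetical top-20 list from the analysis: 5 genes hit the reference list.
top_list = universe[:5] + universe[500:515]

overlap = len(reference.intersection(top_list))
p_value = hypergeom_tail(overlap, len(universe), len(reference), len(top_list))
print("overlap:", overlap, "p-value:", p_value)
```

A small p-value indicates that the overlap is larger than expected from a random selection of genes. It does not by itself show that the remaining genes on the list are biologically relevant.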

You can get the original data through the original article by the authors. Link to the paper providing the expression data:


Note that CCA can be generalized to more than two data sets (gCCA) as follows (see gCCA-related article): With two data sets, the solution equals ordinary CCA. The interpretation of the components is slightly different in the two cases, however, because PCA handles whitened data whereas CCA operates on the original, non-whitened data sets.
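The formulation itself is not reproduced on this page. One common formulation, assumed here, is to whiten each data set separately, concatenate the whitened sets column-wise, and run PCA on the result. The sketch below follows that assumption with synthetic data. The names `whiten`, `gcca` and `shared` are hypothetical. It also checks the two-set case: the covariance of two concatenated whitened sets has eigenvalues 1 ± ρ_i, where ρ_i are the canonical correlations, so the largest eigenvalue minus one should match the first canonical correlation.

```python
import numpy as np

def whiten(X, reg=1e-8):
    """Center the columns of X and rescale so their covariance is (nearly) the identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    # Small ridge term (an assumption of this sketch) for numerical stability.
    evals, evecs = np.linalg.eigh(cov + reg * np.eye(cov.shape[0]))
    return Xc @ (evecs @ np.diag(evals ** -0.5) @ evecs.T)

def gcca(datasets, reg=1e-8):
    """PCA on the column-wise concatenation of separately whitened data sets."""
    Z = np.hstack([whiten(X, reg) for X in datasets])
    cov = Z.T @ Z / (Z.shape[0] - 1)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]  # largest eigenvalue first
    return evals[order], Z @ evecs[:, order]

# Three synthetic data sets (400 genes; 5, 4 and 3 conditions) sharing one latent signal.
rng = np.random.default_rng(1)
n_genes = 400
shared = rng.normal(size=n_genes)
sets = [np.outer(shared, np.ones(p)) + 0.5 * rng.normal(size=(n_genes, p))
        for p in (5, 4, 3)]

evals3, scores3 = gcca(sets)

# Two-set check against ordinary CCA computed from the whitened cross-covariance.
evals2, _ = gcca(sets[:2])
Xw, Yw = whiten(sets[0]), whiten(sets[1])
rho = np.linalg.svd(Xw.T @ Yw / (n_genes - 1), compute_uv=False)
print("largest gCCA eigenvalue minus 1:", evals2[0] - 1.0)
print("first canonical correlation:   ", rho[0])
```

Under this formulation, eigenvalues clearly above one indicate variation shared between data sets, and the gene scores on the corresponding components can be ranked in the same way as canonical variates.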


Page maintained by, last updated Tuesday, 24-Oct-2006 11:39:09 EEST