T122.102 Special Course in Information Technology VI (P V)

Lecturers:  Prof. (pro tem) Jaakko Hollmén, Prof. Heikki Mannila 

Course Assistant:  Jouni Seppänen, M.Sc. 
Semester:  Spring 2003 
Credit points:  3-4 cr (?) 
Place:  lecture hall T4 in the computer science building 
Time:  Tuesday, 14:15-16:00; first lecture on the 21st of January 
Language:  Finnish or English 
Course Homepage:  http://www.cis.hut.fi/Opinnot/T122.102/ 
Resources:  time table, source material 
Exercises given out so far (now also including the fifth exercise):
gzipped PostScript,
PDF.
Instructions for programming exercise:
gzipped PostScript,
PDF.
(You may write your report in Finnish, Swedish, or English;
the instructions are in English to accommodate our international students!)
Data files for exercise 4.1:
Matlab, gzipped text. The array "title" lists titles of some papers published in CACM. Three matrices contain some information about links between papers. The matrix "links" has a 1 for every pair of papers such that one refers to the other. The matrix "cocite" is related to how often two papers are cited together. The matrix "coupling" is related to how many common citations two papers have. Source: Cornell.
Note: all three matrices are symmetric.
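The page does not say exactly how the three matrices were computed. As a rough illustration only, the sketch below uses the textbook definitions of co-citation and bibliographic coupling, starting from a small directed citation matrix. The toy matrix `cites` and the helper names are hypothetical and are not part of the course data files.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def is_symmetric(m):
    return m == transpose(m)

# Hypothetical toy data: cites[i][j] = 1 if paper i cites paper j.
cites = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
n = len(cites)

# links[i][j] = 1 if either paper refers to the other (direction is dropped).
links = [[1 if cites[i][j] or cites[j][i] else 0 for j in range(n)]
         for i in range(n)]

# cocite[i][j] = number of papers that cite both i and j.
cocite = matmul(transpose(cites), cites)

# coupling[i][j] = number of papers cited by both i and j.
coupling = matmul(cites, transpose(cites))
```

Under these assumed definitions all three matrices come out symmetric, which matches the note above. The matrices in the actual data files may be scaled or thresholded differently.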
Binary data – zeros and ones – arises in many practical contexts as categorical data indicating alive vs. dead, positive vs. negative, defective vs. nondefective, success vs. failure, presence vs. absence. Even whole databases can be recorded using this categorical representation, for instance in supermarket basket data, computer and telecommunications systems, text analysis, and the like. Binary data may arise as a natural way to represent the measured variable, or as a transformed representation of the original variable of interest.
One of the traditional examples of a large binary data set is the so-called market-basket data. Each binary vector indicates what a customer bought (had in the market basket) out of all items in the market. A large supermarket might have thousands of items (things you buy) and hundreds of thousands of customers. The task is to analyze such a data set, to find structure in it, to make meaningful inferences from it, and to decide what action to take. This course covers modeling of binary data using two rather complementary approaches.
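To make the representation concrete, here is a minimal sketch of how baskets can be encoded as 0/1 vectors, one row per customer and one column per item. The item list and the baskets are made-up toy values.

```python
# Hypothetical toy catalogue and baskets.
items = ["beer", "sausage", "milk", "bread"]
baskets = [
    {"beer", "sausage", "milk"},
    {"milk", "bread"},
    {"beer", "sausage"},
]

def encode(basket, items):
    """Return a 0/1 vector with a 1 for every item present in the basket."""
    return [1 if item in basket else 0 for item in items]

# One binary row per customer.
data = [encode(basket, items) for basket in baskets]

# Column sums give the number of baskets containing each item.
item_counts = [sum(column) for column in zip(*data)]
```

With thousands of items and hundreds of thousands of customers, such a matrix is mostly zeros, so in practice a sparse representation (for example, the sets themselves) is usually preferred over dense rows.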
On the one hand, binary data can be modeled using local patterns of ones in the data set. For instance, a local pattern in supermarket basket data could be: customer x bought "beer, sausage, and milk". On the other hand, global modeling involves estimating or approximating the joint probability distribution of all realizations. The global approach usually takes into account some independence relations in the data; for instance, finite mixtures of multivariate probability distributions may be used. Both kinds of models, each in its own manner, can be used to make inferences about the behavior of buying "beer, sausage, and milk". The two complementary views, local and global modeling, will form the basis of the course.
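The contrast between the two views can be sketched on a toy data set. The local view counts how often the pattern "beer, sausage, and milk" actually occurs. The global view shown here is the crudest possible one, a joint distribution in which all items are independent, so it is only a stand-in for the mixture models and Bayesian networks named in the course description. The data and function names are hypothetical.

```python
# Hypothetical toy data: rows are customers, columns are items.
items = ["beer", "sausage", "milk", "bread"]
data = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

def support(data, columns):
    """Local view: fraction of rows with a 1 in every listed column."""
    hits = sum(1 for row in data if all(row[c] for c in columns))
    return hits / len(data)

def independence_estimate(data, columns):
    """Global view (simplest case): product of the item marginals,
    i.e. the pattern probability if all items were independent."""
    p = 1.0
    for c in columns:
        p *= sum(row[c] for row in data) / len(data)
    return p

pattern = [items.index(name) for name in ("beer", "sausage", "milk")]
observed = support(data, pattern)                # frequency in the data
expected = independence_estimate(data, pattern)  # prediction of the model
```

On this toy data the observed frequency is larger than the independence prediction, which suggests that the three items are bought together more often than chance alone would explain. A richer global model, such as a finite mixture, could capture that dependence.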
The course will cover the following topics: binary data, similarity measures for binary data, transformations between continuous and binary data, local modeling based on patterns, global modeling based on probabilistic models such as finite mixtures of probability distributions and Bayesian networks, maximum entropy modeling, subspace models, hypothesis testing on binary counts, linear regression with binary outcomes, and text modeling using binary document-term matrices.
The requirements for passing the course are active participation in the lectures and seminars, a seminar presentation on a given topic, completion of the exercises given out during the course, and completion of a larger programming exercise to be handed in before the end of May 2003.
Jaakko Hollmén
Jouni Seppänen
Heikki Mannila
email: t122102@mail.cis.hut.fi