Previous courses: 2001, 2002

# T-122.102 Special Course in Information Technology VI (P V) / T-122.102 Informaatiotekniikan erikoiskurssi VI (L V)

NB: For all purposes related to the Computer and Information Science curriculum, T-122 courses are equivalent to T-61 courses.

## Analysis of binary data – Binääridatan analyysi

• Lecturer: Prof. (pro tem) Jaakko Hollmén; Prof. Heikki Mannila; Jouni Seppänen, M.Sc.
• Term: Spring 2003
• Credits: 3-4 cr (?)
• Place: lecture hall T4 in the computer science building
• Time: Tuesday, 14:15 - 16:00; first lecture on the 21st of January
• Language: Finnish or English
• Home page: http://www.cis.hut.fi/Opinnot/T-122.102/
• See also: time table, source material

Exercises given out so far (now including the fifth exercise): gzipped PostScript, PDF.

Instructions for programming exercise: gzipped PostScript, PDF.
(You may write your report in Finnish, Swedish, or English; the instructions are in English to accommodate our international students!)

Data files for exercise 4.1:

• Citation data:

Matlab, gzipped text. The array "title" lists titles of some papers published in CACM. Three matrices contain some information about links between papers. The matrix "links" has a 1 for every pair of papers such that one refers to the other. The matrix "cocite" is related to how often two papers are cited together. The matrix "coupling" is related to how many common citations two papers have. Source: Cornell.

Note: all three matrices are symmetric.
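
As a rough illustration of how matrices of this kind relate to one another, the sketch below starts from a small, made-up directed citation matrix and derives a symmetric link matrix, co-citation counts, and bibliographic coupling counts. It assumes the standard textbook definitions; the description above only says that "cocite" and "coupling" are "related to" these quantities, so the actual data files may be scaled or normalized differently. The matrix `cites` and the helper names are hypothetical.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def is_symmetric(m):
    return m == transpose(m)

# Hypothetical directed citations: cites[i][j] == 1 means paper i cites paper j.
cites = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
n = len(cites)

# links[i][j] == 1 if either paper refers to the other (symmetric by construction).
links = [[1 if cites[i][j] or cites[j][i] else 0 for j in range(n)]
         for i in range(n)]

# cocite[i][j]: number of papers that cite both i and j.
cocite = matmul(transpose(cites), cites)

# coupling[i][j]: number of papers cited by both i and j.
coupling = matmul(cites, transpose(cites))

print(is_symmetric(links), is_symmetric(cocite), is_symmetric(coupling))
```

Under these definitions, the diagonal of `cocite` counts how often each paper is cited and the diagonal of `coupling` counts how many references each paper makes, which is one reason real data sets often zero or rescale the diagonal.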

### Course description: Analysis of binary data

Binary data – zeros and ones – arises in many practical contexts as categorical data indicating alive vs. dead, positive vs. negative, defective vs. non-defective, success vs. failure, presence vs. absence. Even whole databases can be recorded using this categorical representation, for instance in supermarket basket data, computer and telecommunications systems, text analysis, and the like. Binary data may arise as a natural way to represent the measured variable, or as a transformed representation of the original variable of interest.

One of the traditional examples of a large binary data set is so-called market-basket data. Each binary vector indicates what one customer bought (had in the market basket) out of all items in the market. A large supermarket might have thousands of items (things you buy) and hundreds of thousands of customers. The task is to analyze such a data set, to find structure in it, to make meaningful inferences from it, and to decide what action to take. This course covers modeling of binary data using two rather complementary approaches.
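
To make the representation concrete, here is a minimal sketch of how a handful of transactions could be encoded as a binary matrix with one row per customer and one column per item. The baskets and item names are invented for illustration and are not course data.

```python
# Hypothetical transactions: each set holds the items one customer bought.
baskets = [
    {"beer", "sausage", "milk"},
    {"milk", "bread"},
    {"beer", "sausage"},
    {"bread"},
]

# Fix a column order over all items that occur in any basket.
items = sorted(set().union(*baskets))

# rows[i][j] == 1 exactly when customer i bought items[j].
rows = [[1 if item in basket else 0 for item in items] for basket in baskets]

# Column sums: how many customers bought each item.
item_counts = {item: sum(row[j] for row in rows) for j, item in enumerate(items)}

# Fraction of ones in the matrix; real basket data is typically very sparse.
density = sum(map(sum, rows)) / (len(rows) * len(items))

for row in rows:
    print(row)
print(item_counts, density)
```

With thousands of items and hundreds of thousands of customers, a dense list-of-lists like this would be wasteful; a sparse representation (for example, keeping the sets themselves) is the more practical choice at that scale.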

On the one hand, binary data can be modeled using local patterns of ones in the data set. For instance, a local pattern in supermarket basket data could be: customer x bought "beer, sausage, and milk". On the other hand, global modeling involves estimating or approximating the joint probability distribution of all realizations. The global approach usually takes into account some independence relations in the data; for instance, finite mixtures of multivariate probability distributions may be used. These models, each in their own manner, can be used to make inferences about the behavior of buying "beer, sausage, and milk". The two complementary views – the local and global modeling – will form the basis of the course.
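
The contrast between the two views can be sketched on toy data. The local view counts how often the pattern "beer, sausage, and milk" actually occurs; the global view answers the same question from a model of the joint distribution. For brevity the global model here is the simplest possible one, full independence between items, used as a stand-in for the finite mixtures mentioned above rather than as the course's method. The data matrix and function names are hypothetical.

```python
from math import prod

items = ["beer", "sausage", "milk", "bread"]

# Hypothetical binary data: one row per customer, one column per item.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]

def support(data, cols):
    """Local view: fraction of rows that have a one in every given column."""
    return sum(all(row[c] for c in cols) for row in data) / len(data)

def independence_estimate(data, cols):
    """Global view (simplest case): product of the single-item frequencies."""
    return prod(support(data, [c]) for c in cols)

pattern = [items.index(name) for name in ("beer", "sausage", "milk")]

observed = support(data, pattern)                # frequency counted from the data
expected = independence_estimate(data, pattern)  # frequency implied by the model

print(f"observed {observed:.3f}, independence model {expected:.3f}")
```

In this toy data the pattern occurs more often than the independence model predicts, which is the kind of discrepancy that motivates richer global models such as mixtures.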

The course will cover the following topics: binary data, similarity measures for binary data, transformations between continuous and binary data, local modeling based on patterns, global modeling based on probabilistic models such as finite mixtures of probability distributions and Bayesian networks, maximum entropy modeling, subspace models, hypothesis testing on binary counts, linear regression with binary outcomes, and text modeling using binary document-term matrices.
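
As one small example from this list, the sketch below compares two binary vectors with three common measures: Hamming distance, the simple matching coefficient, and the Jaccard coefficient. Which measures the course actually covers is not stated here, so this selection is an assumption; the vectors are made up and are assumed to have equal length.

```python
def hamming_distance(x, y):
    """Number of positions where the two vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def simple_matching(x, y):
    """Fraction of positions that agree; shared zeros count as agreement."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):
    """Shared ones divided by positions where at least one vector has a one."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 1.0  # two all-zero vectors: treat as identical

# Two hypothetical sparse baskets over six items.
x = [1, 1, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0]

print(hamming_distance(x, y), simple_matching(x, y), jaccard(x, y))
```

The two coefficients disagree noticeably on sparse data: simple matching rewards the many shared zeros, while Jaccard ignores them, which often makes Jaccard the more informative choice for basket-like data.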

### Requirements

The requirements for passing the course are active participation in the lectures and the seminars, a seminar presentation on a given topic, completion of the exercises given out during the course, and completion of a larger programming exercise to be handed in before the end of May 2003.

### Contact information

Jaakko Hollmén
Jouni Seppänen
Heikki Mannila
e-mail: t122102@mail.cis.hut.fi

http://www.cis.hut.fi/Opinnot/T-61.6060/k2003/index.shtml
Wednesday, 09-Apr-2003 13:08:32 EEST