HUT - CIS /Opinnot/T-122.102/k2003/index.shtml
Exercises given out so far (now including the fifth exercise):
gzipped PostScript,
PDF.
Instructions for programming exercise:
gzipped PostScript,
PDF.
(You may write your report in Finnish, Swedish, or English;
the instructions are in English to accommodate our international students!)
Data files for exercise 4.1:
Citation data:
Matlab,
gzipped text.
The array "title" lists titles of some papers published in
CACM.
Three matrices contain some information about links between
papers. The matrix "links" has a 1 for every pair of papers such
that one refers to the other. The matrix "cocite" is related
to how often two papers are cited together. The matrix "coupling"
is related to how many common citations two papers have.
Source: Cornell.
Note: all three matrices are symmetric.
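The description above only says that "cocite" and "coupling" are "related to" co-citation and shared citations, so the exact construction of the course data is not known. As a rough sketch, assuming the standard bibliometric definitions and a small hypothetical directed citation matrix (the name `cites` and its contents are made up for illustration), the three symmetric matrices could be derived like this:

```python
import numpy as np

# Hypothetical directed citation matrix (not the course data):
# cites[i, j] = 1 if paper i cites paper j.
cites = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

# Symmetric link indicator: 1 if either paper refers to the other.
links = ((cites + cites.T) > 0).astype(int)

# Co-citation counts: entry (j, k) is the number of papers citing both j and k.
cocite = cites.T @ cites

# Bibliographic coupling: entry (i, k) is the number of papers cited by both i and k.
coupling = cites @ cites.T

print(links)
print(cocite)
print(coupling)
```

All three results are symmetric by construction, which matches the note above. In this sketch the diagonals of `cocite` and `coupling` hold each paper's citation and reference counts; whether the course data keeps them is an open question.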
Course description: Analysis of binary data
Binary data – zeros and ones – arises in many practical contexts as
categorical data indicating alive vs. dead, positive vs. negative, defective
vs. non-defective, success vs. failure, presence vs. absence. Even whole
databases can be recorded using this categorical representation, for
instance in supermarket basket data, computer and telecommunications
systems, text analysis, and the like. Binary data may arise as a natural way
to represent the measured variable, or as a transformed representation of
the original variable of interest.
One traditional example of a large binary data set is so-called
market-basket data. Each customer is described by a binary vector that
indicates which of the items in the market the customer bought (had in the
market basket). A large supermarket might have thousands of items and
hundreds of thousands of customers.
The task is to analyze such a dataset, to find structure in it,
to make meaningful inferences upon it, and to decide what action
to take.
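As a minimal illustration of the market-basket representation, the sketch below encodes a few baskets as rows of a 0/1 matrix. The item names and baskets are invented for the example and are not course data.

```python
import numpy as np

# Hypothetical item catalogue and baskets, invented for illustration.
items = ["beer", "sausage", "milk", "bread"]
baskets = [
    {"beer", "sausage", "milk"},
    {"milk", "bread"},
    {"beer", "sausage"},
    {"beer", "sausage", "milk", "bread"},
]

# One row per customer, one column per item; 1 means the item was bought.
data = np.array([[int(item in basket) for item in items] for basket in baskets])

# Fraction of customers who bought each item.
item_freq = data.mean(axis=0)

print(data)
print(dict(zip(items, item_freq)))
```

With thousands of items and hundreds of thousands of customers such a matrix is very sparse, so a sparse storage format would probably be a better choice in practice.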
This course covers modeling of binary data using two
complementary approaches.
On the one hand, binary data can be modeled using local patterns of ones in
the data set. For instance, a local pattern in supermarket
basket data could be: customer x bought "beer, sausage, and milk". On the
other hand, global modeling involves estimating or approximating the joint
probability distribution over all realizations. The global approach usually
exploits some independence relations in the data; for instance,
finite mixtures of multivariate probability distributions may be used. Both
kinds of model, each in its own manner, can be used to make inferences about the
behavior of buying "beer, sausage, and milk". These two complementary views,
local and global modeling, form the basis of the course.
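To make the contrast concrete, the sketch below estimates how common the pattern "beer, sausage, and milk" is in two ways on a small invented data set. The local view counts the rows that contain the pattern. For the global view it uses a fully independent Bernoulli model, chosen here only because it is the simplest possible global model; it is a stand-in, not the finite mixtures or Bayesian networks treated in the course.

```python
import numpy as np

# Hypothetical basket data, invented for illustration:
# rows are customers, columns are items.
items = ["beer", "sausage", "milk", "bread"]
data = np.array([
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
])

def support(data, columns):
    """Fraction of rows that have ones in all of the given columns."""
    return data[:, columns].all(axis=1).mean()

pattern = [items.index(name) for name in ("beer", "sausage", "milk")]

# Local view: observed frequency of the pattern in the data.
local_support = support(data, pattern)

# Global view (simplest case): assume all items are bought independently,
# so the joint probability is the product of the item frequencies.
p = data.mean(axis=0)
independent_prob = p[pattern].prod()

print(local_support, independent_prob)
```

A gap between the two numbers suggests that the independence assumption misses some structure in the data, which is one reason to consider richer global models.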
The course covers the following topics: binary data, similarity measures for
binary data, transformations between continuous and binary data, local
modeling based on patterns, global modeling based on probabilistic models
such as finite mixtures of probability distributions and Bayesian networks,
maximum entropy modeling, subspace models, hypothesis testing on binary
counts, linear regression with binary outcomes, and text modeling using binary
document-term matrices.
Requirements
The requirements for passing the course are active participation in the
lectures and the seminars, a seminar presentation on a given topic, completion
of the exercises given out during the course, and completion of a larger
programming exercise to be handed in before the end of May 2003.
Contact information
Jaakko Hollmén
Jouni Seppänen
Heikki Mannila
e-mail: t122102@mail.cis.hut.fi
http://www.cis.hut.fi/Opinnot/T-122.102/k2003/index.shtml
webmaster@www.cis.hut.fi
Wednesday, 09-Apr-2003 13:08:32 EEST
Exercises given out so far (now also including the fifth exercise):
gzipped PostScript,
PDF.
Instructions for programming exercise:
gzipped PostScript,
PDF.
(You may write your report in Finnish, Swedish, or English;
the instructions are in English to accommodate our international students!)
Data files for exercise 4.1:
Citation data:
Matlab,
gzipped text.
The array "title" lists titles of some papers published in
CACM.
Three matrices contain some information about links between
papers. The matrix "links" has a 1 for every pair of papers such
that one refers to the other. The matrix "cocite" is related
to how often two papers are cited together. The matrix "coupling"
is related to how many common citations two papers have.
Source: Cornell.
Note: all three matrices are symmetric.
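The page only says that "cocite" and "coupling" are "related to" co-citation and shared citations, so the exact construction of the distributed matrices is not known here. As a rough sketch under the usual textbook definitions, the following shows how three symmetric matrices of this kind could be derived from a directed citation matrix. The matrix `cites`, the toy data, and all function names are hypothetical and are not part of the course data files.

```python
# Sketch only: standard co-citation / bibliographic-coupling counts.
# Assumption: cites[i][j] == 1 means paper i cites paper j (toy data below).

def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def symmetric_links(cites):
    """1 for every pair of papers such that one refers to the other."""
    n = len(cites)
    return [[1 if cites[i][j] or cites[j][i] else 0 for j in range(n)]
            for i in range(n)]

def cocitation_counts(cites):
    """Entry (i, j): number of papers that cite both i and j."""
    return matmul(transpose(cites), cites)

def coupling_counts(cites):
    """Entry (i, j): number of papers cited by both i and j."""
    return matmul(cites, transpose(cites))

cites = [
    [0, 0, 1, 1],  # paper 0 cites papers 2 and 3
    [0, 0, 1, 1],  # paper 1 cites papers 2 and 3
    [0, 0, 0, 1],  # paper 2 cites paper 3
    [0, 0, 0, 0],  # paper 3 cites nothing
]

links = symmetric_links(cites)
cocite = cocitation_counts(cites)
coupling = coupling_counts(cites)
```

With these definitions all three results are symmetric by construction, which is consistent with the note above, although the actual data may have been normalized or thresholded differently.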
Course description: Analysis of binary data
Binary data – zeros and ones – arises in many practical contexts as
categorical data indicating alive vs. dead, positive vs. negative, defective
vs. non-defective, success vs. failure, presence vs. absence. Even whole
databases can be recorded using this categorical representation, for
instance in supermarket basket data, computer and telecommunications
systems, text analysis, and the like. Binary data may arise as a natural way
to represent the measured variable, or as a transformed representation of
the original variable of interest.
One traditional example of a large binary data set is so-called
market-basket data. Each binary vector indicates what a customer bought (had
in the market basket) out of all items in the market. A large supermarket
might have thousands of items (things you buy) and hundreds of thousands of
customers.
The task is to analyze such a dataset, to find structure in it,
to make meaningful inferences upon it, and to decide what action
to take.
This course covers modeling of binary data using two
complementary approaches.
On the one hand, binary data can be modeled using local patterns of ones in
the data set. For instance, a local pattern in supermarket basket data
could be: customer x bought "beer, sausage, and milk". On the
other hand, global modeling involves estimating or approximating the joint
probability distribution of all realizations. The global approach usually
takes into account some independence relations in the data; for instance,
finite mixtures of multivariate probability distributions may be used. These
models, each in their own manner, can be used to make inferences about the
behavior of buying "beer, sausage, and milk". The two complementary views,
local and global modeling, form the basis of the course.
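To make the contrast concrete, here is a minimal sketch, not taken from the course material. The local view counts how often the pattern "beer, sausage, and milk" occurs; the global view is represented by the simplest possible probabilistic model, one that assumes all items are bought independently (a one-component stand-in for the finite mixtures mentioned above). The toy baskets, column order, and function names are assumptions made for illustration.

```python
from math import prod  # available in Python 3.8+

def support(rows, items):
    """Local view: fraction of rows in which every column in `items` is 1."""
    hits = sum(1 for row in rows if all(row[i] for i in items))
    return hits / len(rows)

def independence_estimate(rows, items):
    """Global view (simplest case): pattern frequency predicted by a model
    in which each item is bought independently of the others."""
    return prod(support(rows, [i]) for i in items)

# Hypothetical toy data; columns are beer, sausage, milk, bread.
BEER, SAUSAGE, MILK, BREAD = range(4)
baskets = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]

pattern = [BEER, SAUSAGE, MILK]
observed = support(baskets, pattern)                # frequency in the data
expected = independence_estimate(baskets, pattern)  # frequency under independence
```

In this toy example the pattern occurs in half of the baskets, while the independence model predicts a much lower frequency; a gap of this kind is what a richer global model, such as a mixture, would try to capture.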
The course will cover topics about binary data, similarity measures for
binary data, transformations between continuous and binary data, local
modeling based on patterns, global modeling based on probabilistic models
such as finite mixtures of probability distributions and Bayesian networks,
maximum entropy modeling, subspace models, hypothesis testing on binary
counts, linear regression with binary outcomes, and text modeling using binary
document-term matrices.
Requirements
The requirements for passing the course are active participation in the
lectures and the seminars, a seminar presentation on a given topic, completion
of the exercises given out during the course, and completion of a larger
programming exercise to be handed in before the end of May 2003.
Contact information
Jaakko Hollmén
Jouni Seppänen
Heikki Mannila
e-mail: t122102@mail.cis.hut.fi
http://www.cis.hut.fi/Opinnot/T-122.102/k2003/index.shtml
webmaster@www.cis.hut.fi
Wednesday, 09-Apr-2003 13:08:32 EEST
Exercises given out this far (now also including the fifth exercise):
gzipped PostScript,
PDF.
Instructions for programming exercise:
gzipped PostScript,
PDF.
(You may write your report in Finnish, Swedish, or English;
the instructions are in English to accommodate our international students!)
Data files for exercise 4.1:
Citation data:
Matlab,
gzipped text.
The array "title" lists titles of some papers published in
CACM.
Three matrices contain some information about links between
papers. The matrix "links" has a 1 for every pair of papers such
that one refers to the other. The matrix "cocite" is related
to how often two papers are cited together. The matrix "coupling"
is related to how many common citations two papers have.
Source: Cornell.
Note: all three matrices are symmetric.
Course description: Analysis of binary data
Binary data – zeros and ones – arises in many practical contexts as
categorical data indicating alive vs. dead, positive vs. negative, defective
vs. non-defective, success vs. failure, presence vs. absence. Even whole
databases can be recorded using this categorical representation, for
instance in supermarket basket data, computer and telecommunications
systems, text analysis, and the like. Binary data may arise as a natural way
to represent the measured variable, or as a transformed representation of
the original variable of interest.
One of the traditional examples of large binary data set is the so-called
market-basket data. The binary vector indicates what a customer bought (had
in the market basket) out of all items in the market. A large supermarket
might have thousands of items (things you buy) and hundreds of thousands of
customers.
The task is to analyze such a dataset, to find structure in it,
to make meaningful inferences upon it, and to decide what action
to take.
This course covers modeling of binary data using two,
rather complementary approaches.
On the one hand, binary data can be modeled using local patterns of ones in
the data set. For instance, an example of local pattern in a super-market
basket data could be: customer x bought "beer, sausage, and milk". On the
other hand, global modeling involves estimating or approximating the joint
probability distribution of all realizations. The global approach usually
takes into account some independence relations about the data, for instance,
finite mixtures of multivariate probability distributions may be used. These
models, each in their own manner, can be used to make inferences about the
behavior of buying "beer, sausage, and milk". The two complementary views –
the local and global modeling – will form the basis of the course.
The course will cover topics about binary data, similarity measures for
binary data, transformations between continuous and binary data, local
modeling based on patterns, global modeling based on probabilistic models
such as finite mixtures of probability distributions and Bayesian networks,
maximum entropy modeling, subspace models, hypothesis testing on binary
counts, linear regression with binary outcomes, text modeling using binary
document-term matrices.
Requirements
The requirements for passing the course are active participation in the
lectures and the seminars, a seminar presentation on a given topic, completion
of the exercises given out during the course, and completion of a larger
programming exercise to be handed in before the end of May 2003.
Contact information
Jaakko Hollmén
Jouni Seppänen
Heikki Mannila
e-mail: t122102@mail.cis.hut.fi
http://www.cis.hut.fi/Opinnot/T-122.102/k2003/index.shtml
webmaster@www.cis.hut.fi
Wednesday, 09-Apr-2003 13:08:32 EEST
Exercises given out so far (now also including the fifth exercise):
gzipped PostScript,
PDF.
Instructions for the programming exercise:
gzipped PostScript,
PDF.
(You may write your report in Finnish, Swedish, or English;
the instructions are in English to accommodate our international students!)
Data files for exercise 4.1:
Citation data:
Matlab,
gzipped text.
The array "title" lists the titles of some papers published in
CACM.
Three matrices contain information about links between
papers. The matrix "links" has a 1 for every pair of papers in which
one paper refers to the other. The matrix "cocite" is related
to how often two papers are cited together. The matrix "coupling"
is related to how many common citations two papers have.
Source: Cornell.
Note: all three matrices are symmetric.
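The page does not state exactly how "cocite" and "coupling" were computed, so the following is only a sketch of the usual textbook definitions. It starts from a made-up directed citation matrix (an assumption; the distributed "links" matrix is symmetric and so does not record direction), and all names in it are hypothetical.

```python
# Hypothetical directed data: CITES[i][j] == 1 means paper i cites paper j.
CITES = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]


def symmetric_links(cites):
    """1 for every pair of papers in which one refers to the other."""
    n = len(cites)
    return [[1 if cites[i][j] or cites[j][i] else 0 for j in range(n)]
            for i in range(n)]


def cocitation(cites):
    """Entry (i, j): number of papers that cite both i and j.
    The diagonal entry (i, i) is simply the number of papers citing i."""
    n = len(cites)
    return [[sum(cites[k][i] * cites[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def coupling(cites):
    """Entry (i, j): number of papers cited by both i and j."""
    n = len(cites)
    return [[sum(cites[i][k] * cites[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


links = symmetric_links(CITES)
cocite = cocitation(CITES)
coupled = coupling(CITES)
```

With these definitions all three matrices come out symmetric, which agrees with the note above, but the actual data files may use a different normalization.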
Exercises given out so far (now including the fifth exercise):
gzipped PostScript,
PDF.
Instructions for programming exercise:
gzipped PostScript,
PDF.
(You may write your report in Finnish, Swedish, or English;
the instructions are in English to accommodate our international students!)
Data files for exercise 4.1:
Citation data:
Matlab,
gzipped text.
The array "title" lists titles of some papers published in
CACM.
Three matrices contain some information about links between
papers. The matrix "links" has a 1 for every pair of papers such
that one refers to the other. The matrix "cocite" is related
to how often two papers are cited together. The matrix "coupling"
is related to how many common citations two papers have.
Source: Cornell.
Note: all three matrices are symmetric.
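As a rough illustration of how matrices of this kind can be derived, the sketch below builds symmetric link, co-citation, and coupling matrices from a small directed citation matrix. The toy matrix `cites` and all function names are hypothetical, and the textbook definitions used here (co-citation as C^T C, coupling as C C^T) are an assumption; the course data files may have been computed differently.

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


# Hypothetical toy data: cites[i][j] == 1 means paper i cites paper j.
cites = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]


def symmetric_links(c):
    """1 for every pair of papers such that one refers to the other."""
    n = len(c)
    return [[1 if c[i][j] or c[j][i] else 0 for j in range(n)] for i in range(n)]


def cocitation(c):
    """Entry (i, j): number of papers that cite both i and j (assumed definition)."""
    return matmul(transpose(c), c)


def coupling(c):
    """Entry (i, j): number of papers cited by both i and j (assumed definition)."""
    return matmul(c, transpose(c))


links = symmetric_links(cites)
cocite = cocitation(cites)
coupled = coupling(cites)
```

Under these definitions all three matrices are symmetric by construction, which agrees with the note above.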
Course description: Analysis of binary data
Binary data – zeros and ones – arises in many practical contexts as
categorical data indicating alive vs. dead, positive vs. negative, defective
vs. non-defective, success vs. failure, presence vs. absence. Even whole
databases can be recorded using this categorical representation, for
instance in supermarket basket data, computer and telecommunications
systems, text analysis, and the like. Binary data may arise as a natural way
to represent the measured variable, or as a transformed representation of
the original variable of interest.
One traditional example of a large binary data set is so-called
market-basket data. Each binary vector indicates what one customer bought (had
in the market basket) out of all items in the market. A large supermarket
might have thousands of items (things you buy) and hundreds of thousands of
customers.
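A minimal sketch of this representation, assuming a handful of made-up baskets: each customer becomes one row, each item one column, and an entry is 1 when the item was in the basket. The basket contents and the helper name `to_binary_rows` are hypothetical.

```python
# Hypothetical toy baskets; a real data set would have thousands of items.
baskets = [
    {"beer", "sausage", "milk"},
    {"milk", "bread"},
    {"beer", "sausage"},
    {"beer", "sausage", "milk", "bread"},
]

# Fix a column order so every row uses the same item positions.
items = sorted(set().union(*baskets))


def to_binary_rows(baskets, items):
    """Encode each basket as a 0/1 vector over the item list."""
    return [[1 if item in basket else 0 for item in items] for basket in baskets]


data = to_binary_rows(baskets, items)
```

With real supermarket data most entries are 0, so a sparse representation (for example, storing only the item sets) would usually be preferred over a dense matrix.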
The task is to analyze such a dataset, to find structure in it,
to draw meaningful inferences from it, and to decide what action
to take.
This course covers modeling of binary data using two
complementary approaches.
On the one hand, binary data can be modeled using local patterns of ones in
the data set. For instance, a local pattern in supermarket
basket data could be: customer x bought "beer, sausage, and milk". On the
other hand, global modeling involves estimating or approximating the joint
probability distribution of all realizations. The global approach usually
exploits some independence relations in the data; for instance,
finite mixtures of multivariate probability distributions may be used. Both
kinds of model, each in its own manner, can be used to make inferences about the
behavior of buying "beer, sausage, and milk". The two complementary views,
local and global modeling, form the basis of the course.
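The contrast between the two views can be sketched on a toy data set. The local view simply counts how often the pattern occurs; the global view estimates the same probability from a model of the joint distribution. The global model used here is the simplest one possible (all items independent), chosen only for illustration and not the finite mixture mentioned above; the data and function names are hypothetical.

```python
# Hypothetical toy data: rows are customers, columns follow `items`.
items = ["beer", "bread", "milk", "sausage"]
data = [
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]


def support(data, items, pattern):
    """Local view: fraction of rows in which every item of the pattern is 1."""
    cols = [items.index(p) for p in pattern]
    hits = sum(1 for row in data if all(row[c] for c in cols))
    return hits / len(data)


def independence_estimate(data, items, pattern):
    """Global view under an assumed full-independence model:
    the product of the marginal frequencies of the pattern's items."""
    n = len(data)
    prob = 1.0
    for p in pattern:
        c = items.index(p)
        prob *= sum(row[c] for row in data) / n
    return prob


pattern = ["beer", "sausage", "milk"]
observed = support(data, items, pattern)
modeled = independence_estimate(data, items, pattern)
```

A gap between the observed frequency and the model-based estimate suggests that the independence assumption misses some structure, which is one motivation for richer global models.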
The course will cover the nature of binary data, similarity measures for
binary data, transformations between continuous and binary data, local
modeling based on patterns, global modeling based on probabilistic models
such as finite mixtures of probability distributions and Bayesian networks,
maximum entropy modeling, subspace models, hypothesis testing on binary
counts, linear regression with binary outcomes, and text modeling using binary
document-term matrices.
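Two of these topics are easy to sketch in a few lines: similarity measures for binary vectors and a transformation from continuous to binary data. The measures shown (Jaccard and simple matching) and the thresholding rule are common textbook choices picked here as an assumption, not necessarily the ones treated in the course; all names and values are hypothetical.

```python
def jaccard(x, y):
    """Shared ones divided by positions where at least one vector has a one.
    Two all-zero vectors are treated as identical (a convention, assumed here)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 1.0


def simple_matching(x, y):
    """Fraction of positions where the two vectors agree (ones and zeros alike)."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


def binarize(values, threshold):
    """Map continuous values to 0/1 by comparing against a threshold."""
    return [1 if v > threshold else 0 for v in values]


x = [1, 0, 1, 1, 0]
y = [1, 0, 0, 1, 0]
jac = jaccard(x, y)
smc = simple_matching(x, y)
bits = binarize([0.2, 1.5, 0.0, 3.1], threshold=1.0)
```

Jaccard ignores positions where both vectors are 0, which tends to suit sparse data such as market baskets, whereas simple matching counts shared zeros as agreement.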
Requirements
The requirements for passing the course are active participation in the
lectures and the seminars, a seminar presentation on a given topic, completion
of the exercises given out during the course, and completion of a larger
programming exercise to be handed in before the end of May 2003.
Contact information
Jaakko Hollmén
Jouni Seppänen
Heikki Mannila
e-mail: t122102@mail.cis.hut.fi
http://www.cis.hut.fi/Opinnot/T-122.102/k2003/index.shtml
webmaster@www.cis.hut.fi
Wednesday, 09-Apr-2003 13:08:32 EEST
Exercises given out so far (now also including the fifth exercise):
gzipped PostScript,
PDF.
Instructions for programming exercise:
gzipped PostScript,
PDF.
(You may write your report in Finnish, Swedish, or English;
the instructions are in English to accommodate our international students!)
Data files for exercise 4.1:
Citation data:
Matlab,
gzipped text.
The array "title" lists titles of some papers published in
CACM.
Three matrices contain some information about links between
papers. The matrix "links" has a 1 for every pair of papers such
that one refers to the other. The matrix "cocite" is related
to how often two papers are cited together. The matrix "coupling"
is related to how many common citations two papers have.
Source: Cornell.
Note: all three matrices are symmetric.
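The description above says only that "cocite" and "coupling" are "related to" co-citation and common citations, so the exact definitions used in the data files are not known here. As a hedged sketch, the following uses the standard bibliometric counts on a small, made-up directed citation matrix (the variable `cites` and its contents are assumptions, not part of the course data). It also shows why all three derived matrices come out symmetric.

```python
# Hypothetical directed citation matrix, invented for illustration:
# cites[i][j] == 1 means paper i cites paper j.
cites = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
n = len(cites)

# Symmetric link matrix: 1 if either paper refers to the other.
links = [[1 if cites[i][j] or cites[j][i] else 0 for j in range(n)]
         for i in range(n)]

# Co-citation count: number of papers k that cite both i and j.
cocite = [[sum(cites[k][i] * cites[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]

# Bibliographic coupling count: number of papers k cited by both i and j.
coupling = [[sum(cites[i][k] * cites[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_symmetric(matrix):
    """True if matrix[i][j] == matrix[j][i] for all index pairs."""
    size = len(matrix)
    return all(matrix[i][j] == matrix[j][i]
               for i in range(size) for j in range(size))


for name, matrix in (("links", links), ("cocite", cocite), ("coupling", coupling)):
    print(name, "symmetric:", is_symmetric(matrix))
```

In this sketch the diagonal of `cocite` holds how often each paper is cited and the diagonal of `coupling` holds the length of each paper's reference list. The provided data files may treat the diagonal, or the scaling of the counts, differently.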