T-61.6020 Special Course in Computer and Information Science II P : Popular Algorithms in Data Mining and Machine Learning
Introduction
The goal of this hands-on course is to introduce and to implement some of the
most popular algorithms in Data Mining and Machine Learning. The algorithms
deal with classification, clustering, link analysis and pattern discovery.
See the topic list below.
The first introductory lecture will be given on 23.01.2008. There will be no
lecture on 30.01.2008. The rest of the lectures will consist of presentations
of the selected topics by the participants of the seminar.
Prerequisites
The course prerequisites include background in algorithms, probability theory,
linear algebra and some optimization theory. Students are assumed to have
rudimentary knowledge of Matlab and Python.
Requirements for passing the course
To pass the course a student must attend the lectures actively, give one
presentation (based on material chosen from the list), and complete all the
homeworks. There will be about 10 homeworks, one for each algorithm. In
each homework one should implement the corresponding algorithm and test it on
the given data. The student should write a short report (about 2-3 pages) for
each homework, explaining the algorithm and the results. Stubs
for the algorithms (either in Matlab or Python, depending on the algorithm)
will be provided so that students can concentrate on the core of the algorithm.
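The actual stubs are distributed separately and their interfaces may differ; purely as an illustration of the kind of core one is expected to fill in, a Python solution for the kNN homework might look like this (the function name and argument layout are hypothetical):

```python
from collections import Counter

def knn_classify(train_X, train_y, x, k=3):
    """Classify point x by majority vote among its k nearest
    training points, using Euclidean distance."""
    # Squared Euclidean distance from x to every training point.
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    # Indices of the k closest training points.
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Majority vote over the labels of those neighbours.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

A report would then describe the algorithm and show how the choice of k affects the results on the given data.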
Homeworks
There are 10 homeworks, one for each algorithm. Homeworks are divided
into two sets. The descriptions of the homeworks can be downloaded from here:
[First set],
[Second set]. The deadline for the first set
is 26.3, and the deadline for the second set is 21.5.
The stubs can be downloaded from here: [tar.gz],
[zip].
The toy datasets can be downloaded from here: [tar.gz],
[zip].
Presentations
The following table presents the dates for different topics and the respective lecturers.
Please note that the lecture slides should be sent to the course address
t616020@cis.hut.fi one week before the session
so that the course organizers can comment and provide some hints concerning the slides.
Date  | Name                 | Topic            | Slides
23.1  | Nikolaj Tatti        | Introduction     | [PDF]
6.2   | Adam Gyenge          | Decision trees   | [PDF]
13.2  | Lasse Kärkkäinen     | Mixture models   | [PDF]
20.2  | Laszlo Kozma         | kNN              | [PDF]
27.2  | Jarno Seppänen       | EM               | [PDF]
5.3   | Luis Gabriel de Alba | AdaBoost         | [PDF]
12.3  | Lauri Lahti          | APriori          | [PPT] [PDF]
19.3  | Joni Pajarinen       | PageRank/HITS    | [PDF]
2.4   | Ville Lämsä          | K-Means/Spectral | [PDF]
9.4   | Oskar Kohonen        | FP-Tree          | [PDF]
16.4  | Stevan Keraudy       | SVM              | [PDF]
      | Dusan Sovilj         | gSpan            | [PDF]
23.4  | Sami Virpioja        | BIRCH            | [PDF]
      | Ari Nevalainen       | PrefixSpan       | [PDF]
Topics
The following is a selection of topics for the seminar on Popular Algorithms.
The list is based heavily on the Top 10 Algorithms in Data
Mining from ICDM 2006. Each participant should select one topic for his/her
presentation. Available topics will be assigned at the first lecture to
participants who do not already have a topic. If we run out of topics, you can e-mail us (t616020@cis.hut.fi)
and we will provide some additional candidate topics for the presentation.
Please inform us of your preferred topic and preferred timeslots for your
presentation by sending an email to t616020@cis.hut.fi. Papers will be handed
out on a first-come, first-served basis. At the moment, all available topics are taken.
Note that some papers are unavailable outside the HUT domain.
- Decision Trees (ID3 and pruning methods)
- K Nearest Neighbours (kNN)
- Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor
Classification. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI). 18, 6 (Jun.
1996), 607-616.
http://dx.doi.org/10.1109/34.506411
- Duda, Hart and Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000. Sections 4.4 - 4.6.
- Hand, Mannila and Smyth. Principles of Data Mining. MIT, 2000. Section 10.6.
- Naive Bayes / Chow-Liu Tree Model (mixture models)
- Chow, C. and Liu, C. Approximating discrete probability distributions with dependence trees.
IEEE Transactions on Information Theory, Vol. 14, No. 3 (1968), pp. 462-467.
http://ieeexplore.ieee.org/iel5/18/22639/01054142.pdf
- H. Zhang, "The optimality of naive Bayes," presented at the 17th International FLAIRS Conference, Miami Beach, May 17-19, 2004.
http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf
- Hand, Mannila and Smyth. Principles of Data Mining. MIT, 2000. Section 10.8.
- SVM
- Thorsten Joachims, "Training Linear SVMs in Linear Time", KDD 2006, August
20-23, 2006, Philadelphia, Pennsylvania, USA.
http://www.cs.cornell.edu/People/tj/publications/joachims_06a.pdf
- Haykin, Simon. Neural Networks: A Comprehensive Foundation, 2.ed. Prentice-Hall, 1999. Chapter 6.
- Christopher Burges, "A Tutorial on Support Vector Machines for Pattern Recognition",
Data Mining and Knowledge Discovery, Vol. 2. (1998), pp. 121-167, 1998.
http://www.public.asu.edu/~jye02/CLASSES/Spring-2007/Papers/PAPERS/SVM-tutorial.pdf
Discussion of linear SVMs is sufficient (sections 1-3).
- Expectation Maximization (EM)
- Jeff A. Bilmes. "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation
for Gaussian Mixture and Hidden Markov Models", International Computer Science Institute, 1998.
http://scipp.ucsc.edu/groups/retina/articles/bilmes98gentle.pdf
Discussion of sections 1-3 is sufficient.
- Russell and Norvig. Artificial Intelligence: A Modern Approach, 2nd ed. Prentice-Hall, 2003. Section 20.3.
- Duda, Hart and Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000. Section 3.9.
- Hand, Mannila and Smyth. Principles of Data Mining. MIT, 2000. Section 8.4.
- APriori
- FP-Tree
- Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without
candidate generation. In Proceedings of the 2000 ACM SIGMOD
international Conference on Management of Data (Dallas, Texas, United
States, May 15 - 18, 2000). SIGMOD '00. ACM Press, New York, NY, 1-12.
http://doi.acm.org/10.1145/342009.335372
- PageRank / HITS
- Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. In
Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (San
Francisco, California, United States, January 25 - 27, 1998). Symposium on
Discrete Algorithms. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 668-677.
http://www.cs.cornell.edu/home/kleinber/auth.pdf
- Corso, Gullí and Romani. Fast PageRank Computation via a Sparse Linear System (Extended Abstract).
http://citeseer.ist.psu.edu/719287.html
- K-Means / Spectral Clustering
- AdaBoost
- BIRCH (extra topic)
- Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient
data clustering method for very large databases. In Proceedings of the
1996 ACM SIGMOD international Conference on Management of Data
(Montreal, Quebec, Canada, June 04 - 06, 1996). J. Widom, Ed.
SIGMOD '96. ACM Press, New York, NY, 103-114.
DOI= http://doi.acm.org/10.1145/233269.233324
- gSpan (extra topic)
- Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern
Mining. In Proceedings of the 2002 IEEE International Conference on
Data Mining (ICDM '02) (December 09 - 12, 2002). IEEE Computer
Society, Washington, DC.
http://citeseer.ist.psu.edu/yan02gspan.html
- PrefixSpan (extra topic)
- J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and
M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by
Prefix-Projected Pattern Growth. In Proceedings of the 17th
international Conference on Data Engineering (April 02 - 06,
2001). ICDE '01. IEEE Computer Society, Washington, DC.
http://www-sal.cs.uiuc.edu/~hanj/pdf/span01.pdf
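To give a flavour of the implementations the homeworks ask for, here is a minimal power-iteration sketch of PageRank, one of the listed topics, in Python. The damping factor and convergence tolerance are illustrative defaults, not values prescribed by the course:

```python
def pagerank(links, d=0.85, tol=1e-8):
    """Power iteration for PageRank on a link dictionary
    {page: [pages it links to]}. Dangling pages spread
    their rank uniformly over all pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # uniform initial distribution
    while True:
        # Every page receives the teleportation share (1 - d) / n.
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # Split this page's damped rank among its out-links.
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: distribute its rank to everyone.
                for q in pages:
                    new[q] += d * rank[p] / n
        # Stop when the L1 change between iterations is tiny.
        if sum(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new
```

Each iteration preserves the total rank mass of 1, so the returned scores form a probability distribution over the pages.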
Page maintained by ntatti@cc.hut.fi,
last updated Wednesday, 23-Apr-2008 16:24:34 EEST