Laboratory of Computer and Information Science / Neural Networks Research Centre, Helsinki University of Technology




T-61.6020 Special Course in Computer and Information Science II P : Popular Algorithms in Data Mining and Machine Learning


Lecturer: M.Sc. Nikolaj Tatti
Assistant: M.Sc. Sami Hanhijärvi
Credits (ECTS): 5
Semester: Spring 2008
Seminar sessions: On Wednesdays at 14-16 in Lecture Hall T5, Computer Science building,
Konemiehentie 2, Otaniemi, Espoo. The first lecture is on January 23rd, 2008.
Language: English
Web: http://www.cis.hut.fi/Opinnot/T-61.6020/
E-mail: t616020@cis.hut.fi

Introduction

The goal of this hands-on course is to introduce and to implement some of the most popular algorithms in Data Mining and Machine Learning. The algorithms deal with classification, clustering, link analysis and pattern discovery. See the topic list below.

The first, introductory lecture will be given on 23.01.2008. There will be no lecture on 30.01.2008. The rest of the lectures will consist of presentations of the selected topics by the participants of the seminar.

Prerequisites

The course prerequisites include background in algorithms, probability theory, linear algebra, and some optimization theory. Students are assumed to have rudimentary knowledge of Matlab and Python.

Requirements for passing the course

To pass the course a student must actively attend the lectures, give one presentation (based on material chosen from the list), and complete all the homeworks. There will be about 10 homeworks, one for each algorithm. In each homework one should implement the corresponding algorithm and test it with the given data. The student should write a short report (about 2-3 pages) for each homework, explaining the algorithm and the results. Stubs for the algorithms (either in Matlab or Python, depending on the algorithm) will be provided so that students can concentrate on the core of the algorithm.
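
For a concrete picture of the workflow, a completed core for a hypothetical kNN homework might look roughly like the sketch below (in Python; the function name, arguments, and data format are illustrative assumptions, not the actual stub distributed on this page).

    from collections import Counter
    import numpy as np

    def knn_classify(train_X, train_y, test_X, k=5):
        """Predict a label for each row of test_X by majority vote among
        the k nearest rows of train_X (Euclidean distance)."""
        predictions = []
        for x in test_X:
            # Squared Euclidean distance from x to every training point.
            dists = np.sum((train_X - x) ** 2, axis=1)
            # Labels of the k closest training points.
            nearest_labels = [train_y[i] for i in np.argsort(dists)[:k]]
            # Majority vote over those labels.
            predictions.append(Counter(nearest_labels).most_common(1)[0][0])
        return np.array(predictions)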

Homeworks

There are 10 homeworks, one for each algorithm, and they are divided into two sets. The descriptions of the homeworks can be downloaded from here: [First set], [Second set]. The deadline for the first set is 26.3 and the deadline for the second set is 21.5.

The stubs can be downloaded from here: [tar.gz], [zip].

The toy datasets can be downloaded from here: [tar.gz], [zip].

Presentations

The following table presents the dates for the different topics and the respective presenters. Please note that the lecture slides should be sent to the course address t616020@cis.hut.fi one week before the session so that the course organizers can comment on them and provide some hints concerning the slides.

Date   Name                  Topic              Slides
23.1.  Nikolaj Tatti         Introduction       [PDF]
6.2.   Adam Gyenge           Decision trees     [PDF]
13.2.  Lasse Kärkkäinen      Mixture models     [PDF]
20.2.  Laszlo Kozma          kNN                [PDF]
27.2.  Jarno Seppänen        EM                 [PDF]
5.3.   Luis Gabriel de Alba  AdaBoost           [PDF]
12.3.  Lauri Lahti           APriori            [PPT] [PDF]
19.3.  Joni Pajarinen        PageRank/HITS      [PDF]
2.4.   Ville Lämsä           K-Means/Spectral   [PDF]
9.4.   Oskar Kohonen         FP-Tree            [PDF]
16.4.  Stevan Keraudy        SVM                [PDF]
16.4.  Dusan Sovilj          gSpan              [PDF]
23.4.  Sami Virpioja         BIRCH              [PDF]
23.4.  Ari Nevalainen        PrefixSpan         [PDF]

Topics

The following is a selection of topics for the seminar on Popular Algorithms. The list is based heavily on the Top 10 Algorithms in Data Mining from ICDM 2006. Each participant should select one topic for his/her presentation. Available topics will be assigned at the first lecture to participants who do not already have a topic. If we run out of topics, you can e-mail us (t616020@cis.hut.fi) and we will provide additional candidate topics for the presentation. A short illustrative code sketch of one of the topics is given after the list.

Please inform us of your preferred topic and preferred time slots for your presentation by sending an e-mail to t616020@cis.hut.fi. Papers will be handed out on a first-come, first-served basis. At the moment, all available topics are taken.

Note that some papers are unavailable outside the HUT domain.

Decision Trees (ID3 and pruning methods)
K Nearest Neighbours (kNN)
  • Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI). 18, 6 (Jun. 1996), 607-616. http://dx.doi.org/10.1109/34.506411
  • Duda, Hart, and Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000. Sections 4.4 - 4.6.
  • Hand, Mannila and Smyth. Principles of Data Mining. MIT, 2000. Section 10.6.
Naive Bayes / Chow-Liu Tree Model (mixture models)
SVM
Expectation Maximization (EM)
  • Jeff A. Bilmes. "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", International Computer Science Institute, 1998. http://scipp.ucsc.edu/groups/retina/articles/bilmes98gentle.pdf
    Discussion of sections 1-3 is sufficient.
  • Russell and Norvig. Artificial Intelligence: A Modern Approach, 2nd ed. Prentice-Hall, 2003. Section 20.3.
  • Duda, Hart, and Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000. Section 3.9.
  • Hand, Mannila and Smyth. Principles of Data Mining. MIT, 2000. Section 8.4.
APriori
FP-Tree
  • Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD international Conference on Management of Data (Dallas, Texas, United States, May 15 - 18, 2000). SIGMOD '00. ACM Press, New York, NY, 1-12. http://doi.acm.org/10.1145/342009.335372
PageRank / HITS
  • Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, California, United States, January 25 - 27, 1998). Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, 668-677. http://www.cs.cornell.edu/home/kleinber/auth.pdf
  • Corso, Gullí and Romani. Fast PageRank Computation via a Sparse Linear System (Extended Abstract). http://citeseer.ist.psu.edu/719287.html
K-Means / Spectral Clustering
AdaBoost
BIRCH (extra topic)
  • Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD international Conference on Management of Data (Montreal, Quebec, Canada, June 04 - 06, 1996). J. Widom, Ed. SIGMOD '96. ACM Press, New York, NY, 103-114. DOI= http://doi.acm.org/10.1145/233269.233324
gSpan (extra topic)
  • Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM '02) (December 09 - 12, 2002). IEEE Computer Society, Washington, DC. http://citeseer.ist.psu.edu/yan02gspan.html
PrefixSpan (extra topic)
  • J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In Proceedings of the 17th international Conference on Data Engineering (April 02 - 06, 2001). ICDE '01. IEEE Computer Society, Washington, DC. http://www-sal.cs.uiuc.edu/~hanj/pdf/span01.pdf
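
As a rough indication of the scale of implementation expected per topic, the following is a minimal sketch of PageRank by power iteration (one of the topics above), written in Python. The damping factor, the handling of dangling pages, and the convergence test shown here are common textbook choices, not requirements of the course.

    import numpy as np

    def pagerank(adj, damping=0.85, tol=1e-9, max_iter=100):
        """PageRank by power iteration. adj[i, j] = 1 if page i links to page j."""
        n = adj.shape[0]
        # Row-normalise links into a transition matrix; pages with no
        # out-links are treated as linking uniformly to every page.
        out_degree = adj.sum(axis=1)
        P = np.where(out_degree[:, None] > 0,
                     adj / np.maximum(out_degree[:, None], 1),
                     1.0 / n)
        rank = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            # Random surfer: teleport with probability 1 - damping,
            # otherwise follow an out-link of the current page.
            new_rank = (1 - damping) / n + damping * (P.T @ rank)
            if np.abs(new_rank - rank).sum() < tol:
                rank = new_rank
                break
            rank = new_rank
        return rank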

