Courses in previous years: [ 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 ]

If you attended this course you can evaluate it here

Lecturers | M.Sc. Nikolaj Tatti |
---|---|

Assistants | M.Sc. Sami Hanhijärvi |

Credits (ECTS) | 5 |

Semester | Spring 2008 |

Seminar sessions | On Wednesdays at 14-16 in Lecture Hall T5,
computer science building, Konemiehentie 2, Otaniemi, Espoo. The first lecture is on January 23th, 2008. |

Language | English |

Web | http://www.cis.hut.fi/Opinnot/T-61.6020/ |

t616020@cis.hut.fi |

The goal of this hands-on course is to introduce and to implement some of the most popular algorithms in Data Mining and Machine Learning. The algorithms deal with classification, clustering, link analysis and pattern discovery. See the topic list below.

First introductory lecture will be given on 23.01.2008. There will be no lecture on 30.01.08. The rest of the lectures will consist on presentations of the selected topic by the participants of the seminar.

The course prerequisites include background in algorithms, probability theory, linear algebra and some optimization theory. Students are assumed to have rudimentary knowledge in Matlab and Python.

To pass the course a student must attend actively in the lectures, give one presentation (based on a material chosen from the list), and complete all the homeworks. There will about 10 homeworks, one homework for each algorithm. In each homework one should implement the corresponding algorithm and test with some the given data. The student should write a short report (of length abt. 2-3 pages) for each homework, explaining the algorithm and the results. Stubs for the algorithms (either in Matlab or Python, depending on the algorithm) will be provided so that students can concentrate on the core of the algorithm.

There are 10 homeworks, one homework for each algorithm. Homeworks are divided
into two sets. The description of the homeworks can be downloaded from here:
[First set],
[Second set]. The deadline for the first homework
is **26.3**, and the deadline for the second homework is **21.5**.

The stubs can be downloaded from here: [tar.gz], [zip].

The toy datasets can be downloaded from here: [tar.gz], [zip].

The following table presents the dates for different topics and the respective lecturers. Please note that the lecture slides should be sent to the course address t616020@cis.hut.fi one week before the session so that the course organizers can comment and provide some hints conserning the slides.

Date | Name | Topic | Slides |
---|---|---|---|

23.1 | Nikolaj Tatti | Introduction | [PDF] |

6.2 | Adam Gyenge | Decision trees | [PDF] |

13.2 | Lasse Kärkkäinen | Mixture models | [PDF] |

20.2 | Laszlo Kozma | kNN | [PDF] |

27.2 | Jarno Seppänen | EM | [PDF] |

5.3 | Luis Gabriel de Alba | AdaBoost | [PDF] |

12.3 | Lauri Lahti | APriori | [PPT] [PDF] |

19.3 | Joni Pajarinen | PageRank/HITS | [PDF] |

2.4. | Ville Lämsä | K-Means/Spectral | [PDF] |

9.4. | Oskar Kohonen | FP-Tree | [PDF] |

16.4. | Stevan Keraudy | SVM | [PDF] |

Dusan Sovilj | gSpan | [PDF] | |

23.4. | Sami Virpioja | BIRCH | [PDF] |

Ari Nevalainen | PrefixSpan | [PDF] |

The following is a selection of topis for the seminar on Popular Algorithms. The list is based heavily on Top 10 Algorithms in Data Mining from ICDM 2006. Each participant should select one topic for his/her presentation. Available topics will be assigned to participants, who don't already have a topic, at the first lecture. If we ran out of topics, you can e-mail us (t616020@cis.hut.fi) and we provide some additional candidate topics for the presentation.

Please, inform about your preferred topic and preferred timeslots for your presentation by sending an email to t616020@cis.hut.fi. Papers will be handed out in first come first serve fashion. At the moment, all available topics are taken.

Note that some papers are unavailable outside hut domain.

**Decision Trees (ID3 and pruning methods)**- J. Ross Quinlan: Induction of Decision Trees. Machine Learning 1(1): 81-106 (1986) http://www.cs.toronto.edu/~roweis/csc2515-2006/readings/quinlan.pdf
- J. Ross Quinlan: Simplifying Decision Trees. International Journal of Man-Machine Studies 27(3): 221-234 (1987) http://citeseer.ist.psu.edu/525260.html
**K Nearest Neighbours (kNN)**- Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI). 18, 6 (Jun. 1996), 607-616. http://dx.doi.org/10.1109/34.506411
- Duda et al. Wiley Interscience, 2.ed, 2000. Sections 4.4 - 4.6.
- Hand, Mannila and Smyth. Principles of Data Mining. MIT, 2000. Section 10.6.
**Naive Bayes / Chow-Liu Tree Model**(mixture models)- Chow C , Liu C, Approximating discrete probability distributions with dependence trees, Information Theory, IEEE Transactions on, Vol. 14, No. 3. (1968), pp. 462-467. http://ieeexplore.ieee.org/iel5/18/22639/01054142.pdf
- H. Zhang, "The optimality of naive Bayes," presented at 17th International FLAIRS conference, Miami Beach, May 17-19, 2004. http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf
- Hand, Mannila and Smyth. Principles of Data Mining. MIT, 2000. Section 10.8.
**SVM**- Thorsten Joachims, "Training Linear SVMs in Linear Time", KDD 2006, August 20-23, 2006, Philadelphia, Pennsylvania, USA. http://www.cs.cornell.edu/People/tj/publications/joachims_06a.pdf
- Haykin, Simon. Neural Networks: A Comprehensive Foundation, 2.ed. Prentice-Hall, 1999. Chapter 6.
- Christopher Burges, "A Tutorial on Support Vector Machines for Pattern Recognition",
Data Mining and Knowledge Discovery, Vol. 2. (1998), pp. 121-167, 1998.
http://www.public.asu.edu/~jye02/CLASSES/Spring-2007/Papers/PAPERS/SVM-tutorial.pdf

**Discussion of linear SVMs is sufficient**(sections 1-3). **Expectation Maximization (EM)**- Jeff A. Bilmes. "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation
for Gaussian Mixture and Hidden Markov Models", International Compute Science Institute, 1998.
http://scipp.ucsc.edu/groups/retina/articles/bilmes98gentle.pdf

**Discussion of sections 1-3 is sufficient.** - Russel and Norvig. Artificial Intelligence: A Modern Approach, 2.ed. Prentice-Hall, 2003. Section 20.3.
- Duda et al. Wiley Interscience, 2.ed, 2000. Sections 3.9.
- Hand, Mannila and Smyth. Principles of Data Mining. MIT, 2000. Section 8.4.
**APriori**- Agrawal R, Srikant R. "Fast Algorithms for Mining Association Rules", VLDB. Sep 12-15 1994, Chile, 487-99. http://www.acm.org/sigmod/vldb/conf/1994/P487.PDF
- Hand, Mannila and Smyth. Principles of Data Mining. MIT, 2000. Section 5.3.2.
**FP-Tree**- Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD international Conference on Management of Data (Dallas, Texas, United States, May 15 - 18, 2000). SIGMOD '00. ACM Press, New York, NY, 1-12. http://doi.acm.org/10.1145/342009.335372
**PageRank / HITS**- Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, California, United States, January 25 - 27, 1998). Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, 668-677. http://www.cs.cornell.edu/home/kleinber/auth.pdf
- Corso, Gullí and Romani. Fast PageRank Computation via a Sparse Linear System (Extended Abstract). http://citeseer.ist.psu.edu/719287.html
**K-Means / Spectral Clustering**- Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2002. http://robotics.stanford.edu/~ang/papers/nips01-spectral.ps
- D. Arthur, S. Vassilvitskii. k-means++ The Advantages of Careful Seeding. Symposium on Discrete Algorithms (SODA), 2007. http://www.stanford.edu/~sergeiv/papers/kMeansPP-soda.pdf
- Duda et al. Wiley Interscience, 2.ed, 2000. Section 10.4.3.
**AdaBoost**- A Short Introduction to Boosting Introduction to Adaboost by Freund and Schapire from 1999. http://www.site.uottawa.ca/~stan/csi5387/boost-tut-ppr.pdf
- Duda et al. Wiley Interscience, 2.ed, 2000. Sections 9.5.2.
**BIRCH**(extra topic)- Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD international Conference on Management of Data (Montreal, Quebec, Canada, June 04 - 06, 1996). J. Widom, Ed. SIGMOD '96. ACM Press, New York, NY, 103-114. DOI= http://doi.acm.org/10.1145/233269.233324
**gSpan**(extra topic)- Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM '02) (December 09 - 12, 2002). IEEE Computer Society, Washington, DC. http://citeseer.ist.psu.edu/yan02gspan.html
**PrefixSpan**(extra topic)- J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In Proceedings of the 17th international Conference on Data Engineering (April 02 - 06, 2001). ICDE '01. IEEE Computer Society, Washington, DC. http://www-sal.cs.uiuc.edu/~hanj/pdf/span01.pdf

You are at: CIS → /Opinnot/T-61.6020/2008/index.shtml

Page maintained by webmaster at cis.hut.fi, last updated Wednesday, 23-Apr-2008 16:24:34 EEST