
Tik-122.101 Special Course in Information Science V L

Modeling and mining the Web

Lecturer: Prof. (pro tem) Jaakko Hollmén
Course Assistant: M.Sc. (Tech.) Anne Patrikainen
Semester: Autumn 2003
Credit points: 4-5 cr
Place: Lecture hall B353 in the computer science building (note the change!)
Time: Wednesday, 12:15 - 14:00 (note the change!)
Language: Finnish or English, depending on the participants
Course book: Baldi, Frasconi, Smyth. Modeling the Internet and the Web: Probabilistic Methods and Algorithms.

Modeling and mining the Web - Course description

The most important goal for theoretical computer science in 1950-2000 was to understand the von Neumann computer. The most important goal for theoretical computer science from 2000 onwards is to understand the Internet.
- Christos H. Papadimitriou

There are many reasons why the Internet and the Web are exciting, albeit young, topics for scientific investigation. ...The Web can be viewed as an example of a very large distributed and dynamic system with billions of pages resulting from the uncoordinated actions of millions of individuals. After all, anyone can post a Web page on the Internet and link it to any other page. In spite of this complete lack of central control, the graphical structure of the Web is far from random and possesses emergent properties shared with other complex graphs found in social, technological, and biological systems.
- Baldi, Frasconi, Smyth (the course book)

Data analysis aims to create knowledge from data sets that consist of measurement signals from the environment. Traditionally, the measured signals come from industrial processes, telecommunication environments, or medical settings, for instance.

In the scope of this course, the whole Internet will be considered as a (very) large data set that must be analyzed. Appropriate data analysis methods for this purpose have to be developed. The end goal still remains the same: to analyze the data and create new knowledge about the problem domain. Such knowledge can be used in several settings, such as information retrieval or behavioral analysis in new digital environments.

Passing the course

To pass the course, you have to give a presentation, solve homework problems, complete a computer project, and take part in a final exam. For this, you are given 4 credit units. You can earn an extra credit unit by writing an additional essay or completing an additional research project and handing in a research report.


Knowledge of basic mathematics, statistics, and information science is required. Some familiarity with linear algebra and graph theory is especially useful.

Applying for the course

The number of participants in the course is limited to 10-15. The participants are chosen based on their major, stage of the studies, and a short letter of motivation.

If you wish to take part in the course, please send an e-mail to Anne Patrikainen. Include information on your department, major, stage of studies (if you're not a graduate student, the total number of credits and the number of credits on T-61 courses), and some reasons why the course is important to you.

The deadline for applications is Tuesday, September 9th at 12:00. We will let you know later on Tuesday whether you have been accepted to the course.

If you don't want to take the credits for the course but wish to be an auditor only, you don't have to send an application - you're welcome to listen to the presentations.


Date | Topic | Material | Presenter
10.9. | Introduction to the topic, practical arrangements, choosing the presenters | - | Jaakko Hollmén
17.9. | Text analysis 1 & Text analysis 2 (problems for Text analysis 2) | Baldi/4 + additional material | Ella Bingham (1), Nikolaj Tatti (2)
1.10. | Web as a graph 1 | Baldi/1.7,3 + articles | Satu Virtanen
8.10. | Web as a graph 2 and problems | Baldi/1.7,3 + articles | Satu Virtanen
15.10. | Link analysis 1: History, PageRank, HITS | Baldi/5.1-5.4 + articles | Anne Patrikainen
22.10. | Link analysis 2: Stability of and extensions to PageRank and HITS | Baldi/5.5-5.7 + articles | Anne Patrikainen
29.10. | WWW technologies and crawling techniques | Baldi/2 | Olli-Pekka Rinta-Koski
5.11. | Advanced crawling techniques and web dynamics | Baldi/6 | Olli-Pekka Rinta-Koski
5.11. & 19.11. at 11:15 | Search engines in practice + homework | Berry/2,5,6,7 | Yang Zhi-Rong
12.11. | NO SEMINAR | - | -
19.11. | Modeling human behavior on the web + problems | Baldi/7 | Nikolaj Tatti
26.11. | Commerce on the web: Automated recommender systems | Baldi/8 | Juha Raitio
3.12. | The future of web mining | Chakrabarti/9 + material on the analysis of the Finnish language | Jaakko Hollmén, Timo Honkela
10.12. | Feedback session for the computer project + exam | - | -

Course literature

Pierre Baldi, Paolo Frasconi, Padhraic Smyth. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. Wiley 2003. (The main course book)

Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM 1999. (Additional material)

Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers 2003. (Additional material)

Details on the practical arrangements


Each participant is required to give a presentation on one of the topics listed in the schedule above. The presentation is to last for 45 minutes. The presenter must also prepare hand-outs for the audience. The presentation slides will be published on this course web page. The slides must be sent by e-mail in ps format to Anne Patrikainen on the presentation day, at the latest.

Each presenter is also required to prepare two (or three) homework problems related to the presented topic. The problems may be calculations or computer assignments and should take around two hours to solve. Baldi's book contains good exercises, but the presenter may also make up new ones or find other sources. If the presentation is on Wednesday, the presenter should send his/her homework problem suggestions to Anne Patrikainen by the previous Monday. The lecturer and the assistant will evaluate the problems and make some modifications, if required.

The presenter is responsible for the exercise session a week after the presentation.

Note that the presentation does not need to, and should not, cover every single detail in the assigned material (for instance, a chapter of a book)! Rather, you may pick one or more topics of interest and concentrate on them.

Homework problems

Each week, the first half of the meeting takes the form of an exercise session. The previous week's speaker will have handed out two exercise problems, and the course participants will have had one week to solve them. In the exercise session, one of the participants presents the correct answers (if any), and the solutions are discussed. The previous week's speaker, who made up the problems, is responsible for the exercise session.

The homework problems will also be published on this web page on the day of the corresponding presentation.

To pass the course, each participant is required to solve at least half of the homework problems.

Computer assignment

The instructions for the computer assignment have been published.

The search engine results are now available.

To pass the course, each participant has to get at least half of the points for the computer assignment.

Final exam

There will be a small final exam covering mainly the core material of the course (Text analysis 1&2, Web as a graph 1&2, Link analysis 1&2). The topics of the second half of the course might be asked about on a very general level. The exam will be arranged in December.

To pass the course, each participant has to get at least half of the points for the final exam.


To pass the course, each participant has to get at least half of the points for each of the homework problems, the computer assignment, and the final exam. To pass with distinction, the participant has to get at least 3/4 of the total number of points.
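
The pass rule above can be written out as a small check. This is only a sketch under assumptions: the function name and the (earned, available) pairs are hypothetical, the total is assumed to be a plain sum over the three parts, and the required presentation is not modeled. The actual point scales are set by the lecturer and the assistant.

```python
def course_result(homework, assignment, exam):
    """Each argument is a hypothetical (points_earned, points_available) pair."""
    parts = [homework, assignment, exam]

    # Every part must reach at least half of its own available points.
    if any(2 * earned < available for earned, available in parts):
        return "fail"

    # Assumption: the "total number of points" is the plain sum of the parts.
    total_earned = sum(earned for earned, _ in parts)
    total_available = sum(available for _, available in parts)

    # Distinction requires at least 3/4 of the total.
    if 4 * total_earned >= 3 * total_available:
        return "pass with distinction"
    return "pass"
```

For example, `course_result((8, 10), (16, 20), (24, 30))` reaches 48 of 60 points with every part above half.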

Feedback on the presentations

After each presentation, the presenter will be given anonymous feedback by the audience. There will be feedback on both the presentation material and the presentation technique.

Data sets

  • Newsgroup data related to Text Analysis 1. The articles are collected from four newsgroups. Their topics are religion, medicine, space, and cryptography.
  • You can use the same data with Problem 2 / Text Analysis 2. Class indices: cryptography 1-87, medicine 88-174, space 175-261, religion 262-end. The index vector.
  • Sequence data related to Modeling Human Behavior on the Web. The data is in Matlab format.
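
The class index ranges listed above can be turned into per-document labels along the following lines. This is a minimal sketch: the names `CLASS_STARTS`, `class_of`, and `label_vector` are hypothetical, document indices are assumed to be 1-based as in the list, and the total number of documents (`n_docs`) must be supplied because the list only says "262-end". The index vector distributed with the data remains the authoritative source.

```python
# First 1-based document index of each class, in ascending order,
# taken from the ranges listed above.
CLASS_STARTS = [
    (1, "cryptography"),
    (88, "medicine"),
    (175, "space"),
    (262, "religion"),
]


def class_of(doc_index):
    """Return the class name for a 1-based document index."""
    if doc_index < 1:
        raise ValueError("document indices start at 1")
    label = None
    for start, name in CLASS_STARTS:
        if doc_index >= start:
            label = name  # keep the last class whose range has begun
    return label


def label_vector(n_docs):
    """Return the class names for documents 1..n_docs (n_docs is an assumption)."""
    return [class_of(i) for i in range(1, n_docs + 1)]
```

With these ranges, `class_of(87)` falls in cryptography and `class_of(88)` in medicine.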

More information

For additional information about the course, contact Jaakko Hollmén or Anne Patrikainen.
Thursday, 12-Aug-2004 13:31:37 EEST