Tik-122.101 Special Course in Information Science V L
Modeling and mining the Web
Modeling and mining the Web - Course description
The most important goal for theoretical computer science in
1950-2000 was to understand the von Neumann computer. The most
important goal for theoretical computer science from 2000 onwards is
to understand the Internet.
There are many reasons why the Internet and the Web are
exciting, albeit young, topics for scientific investigation. ...The
Web can be viewed as an example of a very large distributed and
dynamic system with billions of pages resulting from the uncoordinated
actions of millions of individuals. After all, anyone can post a Web
page on the Internet and link it to any other page. In spite of this
complete lack of central control, the graphical structure of the Web
is far from random and possesses emergent properties shared with other
complex graphs found in social, technological, and biological
Data analysis aims to create knowledge from data sets that consist of measurement signals from the environment. Traditionally, the measured signals come from industrial processes, telecommunication environments, or medical settings, for instance.
In the scope of this course, the whole Internet will be considered as a (very) large data set that must be analyzed. Appropriate data analysis methods for this purpose have to be developed. The end goal remains the same: to analyze the data and create new knowledge about the problem domain. Such knowledge can be used in several settings, such as information retrieval or behavioral analysis in new digital environments.
Passing the course
To pass the course, you have to give a presentation, solve homework problems, complete a computer project, and take part in a final exam. For this, you are given 4 credit units. You can earn an extra credit unit by writing an additional essay or completing an additional research project and handing in a research report.
Prerequisites
Knowledge of basic mathematics, statistics, and information science is required. Some familiarity with linear algebra and graph theory is especially useful.
Applying for the course
The number of participants in the course is limited to 10-15. The participants are chosen based on their major, stage of the studies, and a short letter of motivation.
If you wish to take part in the course, please send an e-mail to Anne Patrikainen. Include information on your department, major, stage of studies (if you're not a graduate student, the total number of credits and the number of credits on T-61 courses), and some reasons why the course is important to you.
The deadline for applications is Tuesday, September 9th at 12:00. We will let you know later on Tuesday whether you have been accepted to the course.
If you don't want to take the credits for the course but wish to be an auditor only, you don't have to send an application - you're welcome to listen to the presentations.
Pierre Baldi, Paolo Frasconi, Padhraic Smyth. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. Wiley 2003. (The main course book)
Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM 1999. (Additional material)
Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers 2003. (Additional material)
Details on the practical arrangements
Each participant is required to give a presentation on one of the topics listed in the schedule above. The presentation is to last for 45 minutes. The presenter must also prepare hand-outs for the audience. The presentation slides will be published on this course web page. The slides must be sent by e-mail in ps format to Anne Patrikainen on the presentation day, at the latest.
Each presenter is also required to prepare two (or three) homework problems related to the presented topic. The problems may be calculations or computer assignments and should take around two hours to solve. Baldi's book contains good exercises, but the presenter may also make up new ones or find other sources. If the presentation is on Wednesday, the presenter should send his/her homework problem suggestions to Anne Patrikainen by the preceding Monday. The lecturer and the assistant will evaluate the problems and make some modifications, if required.
The presenter is responsible for the exercise session a week after the presentation.
Note that the presentation does not need to, and should not, cover every single detail in the assigned material (for instance, a chapter of a book)! Rather, you may pick one or more topics of interest and concentrate on them.
Each week, the first half of the session will be in the form of an exercise session. The previous week's speaker has given out two exercise problems, and the course participants will have had one week to solve them. In the exercise session, the correct answers (if any) are presented by one of the participants, and the solutions are discussed. The previous week's speaker, who originally made up the problems, is responsible for the exercise session.
The homework problems will also be published on this web page on the day of the corresponding presentation.
To pass the course, each participant is required to solve at least half of the homework problems.
The instructions for the computer assignment have been published.
The search engine results are now available.
To pass the course, each participant has to get at least half of the points for the computer assignment.
There will be a small final exam covering mainly the core material of the course (Text analysis 1&2, Web as a graph 1&2, Link analysis 1&2). The topics of the second half of the course might be asked about on a very general level. The exam will be arranged in December.
To pass the course, each participant has to get at least half of the points for the final exam.
To pass the course, each participant has to get at least half of the points for each of the homework problems, the computer assignment, and the final exam. To pass with distinction, the participant has to get at least 3/4 of the total number of points.
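The point thresholds above can be sketched as a small check. This is only an illustration under stated assumptions: the function name, the component names, and the point scales are hypothetical, the total is assumed to be the plain unweighted sum of the three components, and the presentation requirement is not modeled.

```python
def course_result(scores):
    """Classify a result from {component: (points_earned, points_max)}.

    Hypothetical helper: the course page does not publish point scales
    or weights, so the total is assumed to be an unweighted sum.
    """
    # Pass rule: at least half of the points in every component.
    if not all(2 * earned >= maximum for earned, maximum in scores.values()):
        return "fail"

    total_earned = sum(earned for earned, _ in scores.values())
    total_max = sum(maximum for _, maximum in scores.values())

    # Distinction rule: at least 3/4 of the total number of points.
    if 4 * total_earned >= 3 * total_max:
        return "pass with distinction"
    return "pass"


# Example with made-up point scales (10 points per component).
example = {"homework": (6, 10), "computer assignment": (8, 10), "final exam": (9, 10)}
print(course_result(example))
```

Integer comparisons (`2 * earned >= maximum`) are used instead of division so that borderline scores such as exactly half are not affected by floating-point rounding.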
Feedback on the presentations
After each presentation, the presenter will be given anonymous feedback by the audience. There will be feedback on both the presentation material and the presentation technique.
Thursday, 12-Aug-2004 13:31:37 EEST