## Tik-122.101 Special Course in Information Science V L
## Modeling and mining the Web
## Modeling and mining the Web - Course description
Data analysis aims to create knowledge from data sets that consist of measurement signals from the environment. Traditionally, the measured signals come from, for instance, industrial processes, telecommunication environments, or medical settings. In the scope of this course, the whole Internet is considered as a (very) large data set that must be analyzed. Appropriate data analysis methods for this purpose have to be developed. The end goal remains the same: to analyze the data and create new knowledge about the problem domain. Such knowledge can be used in several settings, such as information retrieval or behavioral analysis in new digital environments.

## Passing the course
To pass the course, you have to give a presentation, solve homework problems, complete a computer project, and take part in a final exam. For this, you are given 4 credit units. You can earn an extra credit unit by writing an additional essay or by completing an additional research project and handing in a research report.

## Prerequisites
Knowledge of basic mathematics, statistics, and information science is required. Some familiarity with linear algebra and graph theory is especially useful.

## Applying for the course
The number of participants in the course is limited to 10-15. The participants are chosen based on their major, stage of studies, and a short letter of motivation. If you wish to take part in the course, please send an e-mail to Anne Patrikainen. Include information on your department, major, stage of studies (if you're not a graduate student, the total number of credits and the number of credits on T-61 courses), and some reasons why the course is important to you.
The deadline for the applications is on

If you don't want to take the credits for the course but wish to be an auditor only, you don't have to send an application - you're welcome to listen to the presentations.

## Schedule
## Course literature
- Pierre Baldi, Paolo Frasconi, Padhraic Smyth. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. Wiley 2003. (The main course book)
- Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM 1999. (Additional material)
- Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers 2003. (Additional material)

## Details on the practical arrangements
Each participant is required to give a presentation on one of the topics listed in the schedule above. The presentation is to last 45 minutes. The presenter must also prepare hand-outs for the audience. The presentation slides will be published on this course web page. The slides must be sent by e-mail in ps format to Anne Patrikainen on the presentation day at the latest.

Each presenter is also required to prepare two (or three) homework problems related to the presented topic. The problems may be calculations or computer assignments and should take around two hours to solve. Baldi's book contains good exercises, but the presenter may also make up new ones or use other sources. If the presentation is on Wednesday, the presenter should send his/her homework problem suggestions to Anne Patrikainen by the previous Monday. The lecturer and the assistant will evaluate the problems and modify them if required. The presenter is responsible for the exercise session a week after the presentation.

Note that the presentation does not need to, and should not, cover every single detail in the assigned material (for instance, a chapter of a book)! Rather, you may pick one or more topics of interest and concentrate on them.

Each week, the first half of the session will be in the form of an exercise session. The previous week's speaker has given out two exercise problems, and the course participants will have had one week to solve them. In the exercise session, the correct answers (if any) are presented by one of the participants, and the solutions are discussed. The previous week's speaker, who originally made up the problems, is responsible for the exercise session. The homework problems will also be published on this web page on the day of the corresponding presentation. To pass the course, each participant is required to solve at least half of the homework problems.
The instructions for the computer assignment have been published. The search engine results are now available. To pass the course, each participant has to get at least half of the points for the computer assignment.
There will be a small final exam covering mainly the core material of the course (Text analysis 1&2, Web as a graph 1&2, Link analysis 1&2). The topics of the second half of the course might be asked about on a very general level. The exam will be arranged in December. To pass the course, each participant has to get at least half of the points for the final exam.
To pass the course, each participant has to get at least half of the points for each of the homework problems, the computer assignment, and the final exam. To pass with distinction, the participant has to get at least 3/4 of the total number of points.
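The pass rule above can be sketched as a small check. This is only an illustration, not the official grading procedure: the function name `course_result`, the component names, and the point maxima are hypothetical, and it assumes that "total number of points" means the sum over the three scored components.

```python
def course_result(scores):
    """Return 'fail', 'pass', or 'pass with distinction'.

    scores maps a component name to (earned_points, maximum_points).
    Assumption: the distinction threshold applies to the sum over
    all scored components.
    """
    # Every component needs at least half of its own maximum.
    if not all(2 * earned >= maximum for earned, maximum in scores.values()):
        return "fail"

    total_earned = sum(earned for earned, _ in scores.values())
    total_max = sum(maximum for _, maximum in scores.values())

    # Distinction: at least 3/4 of the total (integer arithmetic, no rounding).
    if 4 * total_earned >= 3 * total_max:
        return "pass with distinction"
    return "pass"


# Hypothetical point maxima, chosen only for the example.
example = {"homework": (6, 10), "computer assignment": (5, 10), "final exam": (5, 10)}
print(course_result(example))
```

Note that a high total does not compensate for a component below half: the per-component check comes first.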
After each presentation, the presenter will be given anonymous feedback by the audience. There will be feedback on both the presentation material and the presentation technique.

## Data sets
- Newsgroup data related to Text Analysis 1. The articles are collected from four newsgroups. Their topics are religion, medicine, space, and cryptography.
  - 4news.mat. The data, vocabulary size 100 terms (MATLAB format).
  - terms4x100_100.txt. Term list (100 terms).
  - 4news_800.mat. The data, vocabulary size 800 terms (MATLAB format).
  - terms4x100_800.txt. Term list (800 terms).
  - You can use the same data with Problem 2 / Text Analysis 2. Class indices: cryptography 1-87, medicine 88-174, space 175-261, religion 262-end. The index vector.
- Sequence data related to Modeling Human Behavior in the Web. The data is in MATLAB format.
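If you work with the newsgroup data outside MATLAB, the class indices listed above can be turned into a label vector along the following lines. This is a minimal sketch: the function name `build_labels` is hypothetical, and the total number of documents has to be read from the data itself, since the page only gives "262-end" for the last class.

```python
# Class boundaries as listed on this page (1-based, inclusive).
# The last class runs to the end of the data, so its upper bound is None.
CLASS_RANGES = [
    ("cryptography", 1, 87),
    ("medicine", 88, 174),
    ("space", 175, 261),
    ("religion", 262, None),
]


def build_labels(n_docs):
    """Return a list with one class name per document, in document order."""
    if n_docs < 262:
        raise ValueError("expected at least 262 documents")
    labels = []
    for name, first, last in CLASS_RANGES:
        stop = n_docs if last is None else last
        labels.extend([name] * (stop - first + 1))
    return labels


# 300 is a made-up document count used only for illustration.
labels = build_labels(300)
print(labels[0], labels[87], labels[174], labels[261])
```

Python lists are 0-based, so document number 88 on this page is `labels[87]` in the sketch.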
## More information
For additional information about the course, contact Jaakko Hollmén or Anne Patrikainen.

http://www.cis.hut.fi/Opinnot/T-122.101/s2003/index.shtml
webmaster@www.cis.hut.fi
Thursday, 12-Aug-2004 13:31:37 EEST