Laboratory of Computer and Information Science

This page presents the primary course book for Special Course in Information Science I (Tik-61.181), autumn 1999.

Data Preparation for Data Mining

by Dorian Pyle

[Book cover]

This book is unique in that it addresses a difficult, ever-present issue in data mining: the preparation of raw data into a form suitable for mining. The issue is considered so difficult that practically nothing has actually been written about it. This book does address it, though: it covers how to deal with categorical variables, missing values, out-of-range problems, sparse data, etc.

The book also gives a really good overview of what data mining as a whole is about, from the viewpoint of the author's 25 years of experience.

What this book does not give is a thorough mathematical treatment of the issues. The book is intended for people with, as Mr. Pyle says below, "Basic knowledge of computing and forgotten high school math", which in practice means that the calculation of variance is explained in detail... Despite this lack of mathematical depth, or perhaps because of it, the book manages to give an excellent idea of how data preparation works in practice, although it does not prove why it should work. The lack of mathematical detail will be remedied with additional material in the course.

The information below was taken from Amazon, where the book (paperback) costs about $40 plus postage.

Table of contents

Data Exploration As a Process
The Nature of the World and Its Impact on Data Preparation
Data Preparation as a Process
Getting the Data: Basic Preparation
Sampling, Variability and Confidence
Handling Non-Numerical Variables
Normalizing and Redistributing Variables
Replacing Missing and Empty Values
Series Variables
Preparing the Data Set
The Data Survey
Using Prepared Data
Using the Demonstration Code on the CD
Further Reading


Data Preparation for Data Mining addresses an issue unfortunately ignored by most authorities on data mining: data preparation. Thanks largely to its perceived difficulty, data preparation has traditionally taken a backseat to the more alluring question of how best to extract meaningful knowledge. But without adequate preparation of your data, the return on the resources invested in mining is certain to be disappointing.

Dorian Pyle corrects this imbalance. A twenty-five-year veteran of what has become the data mining industry, Pyle shares his own successful data preparation methodology, offering both a conceptual overview for managers and complete technical details for IT professionals. Apply his techniques and watch your mining efforts pay off in the form of improved performance, reduced distortion, and more valuable results.

From the author

The author, Dorian Pyle, March 20, 1999

Is this book for you?
Thank you for your interest in my book!

The book is about exactly what the title suggests, how to prepare data for mining. I wrote it because in data mining, one of the most important parts of the whole process is to properly prepare the data. The importance of preparation is acknowledged at conferences, seminars, presentations and in books about data mining. Yet despite its importance, it is not really addressed in detail anywhere else.

Data mining is becoming very popular today, and many people are interested in using these new and powerful tools. Perhaps you are one of them. You may not have a background in statistics or data analysis, but you still want to get the most out of what data mining offers. But how do you begin? Most data mining books talk at length about what various algorithms do, and how to apply them to prepared data. But how do you get started? This book will help you to see the process, understand what is needed, and get the most out of your data in solving real world business problems.

Of course, data preparation is a technical subject. I do assume that you know the basics of computing, and that at some point you took high school math (although you may well have forgotten most of what you learned about it!). That's OK. Basic knowledge of computing and forgotten high school math, plus an interest in understanding how to get the most out of your data, is all you will need to understand what is in this book. There is very little math here, and even what there is can be ignored if you only want an overview.

If you are a programmer, or understand how to read computer programs, all of the tools that are described in the text are illustrated with code. Once again, you don't need to understand the code to use the tools and techniques. It's there if you want it, but this is not a book about programming. My focus throughout is on helping you to understand what to do with and to data to get the most out of it. And so that you can experiment for yourself, there are some sample data sets provided for you to explore. The code is ready compiled for you to use on the data, as well as in source form.

My book is mainly intended for people who need to work with data and to mine it. However, if you only need to understand what is involved in the preparation and mining process, and what can realistically be expected from it, this book will help you too. You will certainly want to skip the more technical parts, but there is plenty of non-technical material that will give you a good idea of the process.

I really enjoyed writing the book. I have spent a lot of my professional life working with data sets to find out what is in them and to get value out of them. I hope that you enjoy reading it, and that by doing so, you can avoid making some of the mistakes that I made along the way! Most of what I learned was as a result of discovering what didn't work, and then discovering what did on many, many projects.

I wish you much luck and success in your mining efforts.

CIS courses
September 2nd, 1999
Juha Vesanto