Data Preprocessing

Data preprocessing is a broad area that consists of a number of different strategies and techniques, which are interrelated in complex ways. We shall study some of these techniques briefly.


Sometimes “less is more”, and this is the case with aggregation, the combining of two or more data objects into a single object. Consider a data set consisting of transactions (data objects) recording the daily sales of products in various store locations for different days over the course of a year. One way to aggregate the transactions in this data set is to replace all the transactions of a single store on a given day with a single storewide transaction. This reduces the hundreds or thousands of transactions that occur daily at a store to a single daily transaction, and the number of data objects per day is reduced to the number of stores.
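As a rough illustration of the storewide aggregation described above, the sketch below sums transaction amounts per store and day. The record layout, store names, and amounts are hypothetical; the source does not specify a schema.

```python
from collections import defaultdict

# Hypothetical transactions: (store, date, product, amount).
transactions = [
    ("store_A", "2024-01-01", "soap", 3.50),
    ("store_A", "2024-01-01", "bread", 2.00),
    ("store_B", "2024-01-01", "milk", 1.25),
    ("store_A", "2024-01-02", "milk", 1.25),
    ("store_B", "2024-01-01", "soap", 3.50),
]

def aggregate_daily_sales(rows):
    """Collapse all transactions of one store on one day into a single total."""
    totals = defaultdict(float)
    for store, date, _product, amount in rows:
        totals[(store, date)] += amount
    return dict(totals)

daily_totals = aggregate_daily_sales(transactions)
for (store, date), total in sorted(daily_totals.items()):
    print(store, date, f"{total:.2f}")
```

Five transactions become three store-day objects. Summing is only one possible way to combine the amounts; an average or a count would be aggregated in the same manner.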

There are several motivations for aggregation:-

  • The smaller data sets resulting from data reduction require less memory and processing time; hence, aggregation may permit the use of more expensive data mining algorithms.
  • Aggregation can act as a change of scope or scale by providing a high-level view of the data instead of a low-level view.


Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed. In statistics, it has long been used for both the preliminary investigation of the data and the final data analysis.

Using a sample works almost as well as using the entire data set if the sample is representative, that is, if it has approximately the same property (of interest) as the original set of data.

Sampling approaches:-

  • Simple random sampling:- In this type of sampling, there is an equal probability of selecting any particular item. There are two variations of simple random sampling: sampling with replacement and sampling without replacement.

  • Stratified sampling:- When the population consists of different types of objects, with widely different numbers of objects, simple random sampling can fail to adequately represent those types of objects that are less frequent. This can cause problems when the analysis requires proper representation of all object types. Hence, a sampling scheme that can accommodate differing frequencies for the items of interest is needed. Stratified sampling, which starts with prespecified groups of objects, is such an approach. In the simplest version, an equal number of objects is drawn from each group even though the groups are of different sizes.

  • Progressive sampling:- The proper sample size can be difficult to determine, so adaptive or progressive sampling schemes are sometimes used. These approaches start with a small sample and then increase the sample size until a sample of sufficient size has been obtained.
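The first two approaches can be sketched with Python's standard `random` module. The population, group labels, and sample sizes below are made up for illustration, and the stratified variant shown is only the simplest one (an equal number of objects per group).

```python
import random
from collections import defaultdict

def sample_with_replacement(items, k, rng):
    # Each draw is independent, so the same item can be picked more than once.
    return [rng.choice(items) for _ in range(k)]

def sample_without_replacement(items, k, rng):
    # Random.sample never returns the same position twice.
    return rng.sample(items, k)

def stratified_sample(items, group_of, per_group, rng):
    """Draw the same number of objects from every group (the simplest variant)."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical population: 95 "common" objects and 5 "rare" ones.
population = [("common", i) for i in range(95)] + [("rare", i) for i in range(5)]

with_repl = sample_with_replacement(population, 10, rng)
without_repl = sample_without_replacement(population, 10, rng)
stratified = stratified_sample(population, lambda obj: obj[0], 3, rng)

# The stratified sample always contains 3 rare objects, whereas a simple
# random sample of the same size may contain none.
print(sum(1 for label, _ in stratified if label == "rare"))
```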

Dimensionality Reduction

Data sets can have a large number of features. Examples include a term-document matrix (TDM) for a set of documents, in which every term is a feature, and a set of time series consisting of the daily closing prices of various stocks over a period of 30 years.

There are a variety of benefits to dimensionality reduction.

  • Many data mining algorithms work better if the dimensionality is lower.
  • Reduction of dimensionality can lead to a more understandable model because the model may involve fewer attributes.
  • Dimensionality reduction may allow the data to be more easily visualized.
  • The amount of time and memory required by the data mining algorithm is reduced with a reduction in dimensionality.

The term dimensionality reduction is often reserved for those techniques that reduce the dimensionality of a data set by selecting new attributes that are a combination of old attributes.

The curse of dimensionality

The curse of dimensionality refers to the phenomenon that many types of data analysis become significantly harder as the dimensionality of the data increases. Specifically, as dimensionality increases, the data becomes increasingly sparse in the space that it occupies. For classification, this can mean that there are not enough data objects to allow the creation of a model that reliably assigns a class to all objects. For clustering, the definitions of density and distance between points, which are critical for clustering, become less meaningful.
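One way to see why distance becomes less meaningful is to measure how much the nearest and farthest neighbours of a point differ as the dimensionality grows. The sketch below is an illustration under assumed conditions (uniform random points in the unit hypercube, Euclidean distance, 200 points); the name `relative_contrast` is introduced here and does not come from the text.

```python
import math
import random

def relative_contrast(dim, n_points, rng):
    """(max distance - min distance) / min distance from one query point."""
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
contrast_by_dim = {dim: relative_contrast(dim, 200, rng) for dim in (2, 10, 100, 500)}
for dim, contrast in contrast_by_dim.items():
    print(dim, round(contrast, 2))
```

In low dimensions the farthest point is many times farther away than the nearest one. In high dimensions all points end up at nearly the same distance, so "near" and "far" carry little information.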

Techniques for Dimensionality reduction

  1. Principal component analysis (PCA):- It is a linear algebra technique for continuous attributes that finds new attributes (principal components) that
    1. are linear combinations of the original attributes,
    2. are orthogonal to each other, and
    3. capture the maximum amount of variation in the data.
  2. Singular value decomposition (SVD):- It is a linear algebra technique that is related to PCA and is also commonly used for dimensionality reduction.
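To make the idea of a direction of maximum variation concrete, here is a minimal pure-Python sketch that finds only the first principal component by power iteration on the covariance matrix. Power iteration is a choice made for this example, not a method named in the text; it assumes the starting vector is not orthogonal to the leading direction. The data are hypothetical. In practice a linear algebra library would normally be used.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of rows (one row per object)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_principal_component(rows, iterations=200):
    """Unit vector along the direction of maximum variance (power iteration)."""
    cov = covariance_matrix(rows)
    d = len(cov)
    vector = [1.0] * d  # arbitrary starting direction (an assumption of this sketch)
    for _ in range(iterations):
        product = [sum(cov[a][b] * vector[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return vector

# Hypothetical data lying exactly on the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(data)

# Projecting each centered object onto pc1 gives one new attribute instead of two.
means = [sum(column) / len(data) for column in zip(*data)]
scores = [sum((x - m) * w for x, m, w in zip(row, means, pc1)) for row in data]
print(pc1, scores)
```

Because the two original attributes are perfectly correlated here, the single new attribute `scores` retains all of the variation in the data.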

Feature subset selection

Another way of reducing dimensionality is to use only a subset of the features. While it might seem that such an approach would lose information, this is not the case if redundant and irrelevant features are present. Redundant features duplicate much or all of the information contained in one or more other attributes. Irrelevant features contain almost no useful information for the data mining task at hand. Redundant and irrelevant features can reduce classification accuracy and the quality of the clusters that are found.

While some irrelevant and redundant features can be removed immediately using common sense or domain knowledge, selecting the best subset of features requires a systematic approach. The ideal approach to feature selection would be to try all possible subsets of features as input and then select the one that produces the best results. Unfortunately, since the number of possible subsets of n features is $2^n$, such an approach is impractical in most situations and alternative strategies are needed.
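The exhaustive approach is easy to write down, which makes its cost easy to see. In the sketch below, the feature names and `toy_score` are hypothetical; the scoring function stands in for "run the data mining algorithm on this subset and evaluate the result".

```python
from itertools import combinations

def all_subsets(features):
    """Yield every subset of the feature list, including the empty one."""
    for size in range(len(features) + 1):
        yield from combinations(features, size)

def best_subset(features, score):
    """Exhaustive search: score every non-empty subset and keep the best."""
    candidates = (subset for subset in all_subsets(features) if subset)
    return max(candidates, key=score)

features = ["age", "income", "height", "zip_code"]  # hypothetical names
subsets = list(all_subsets(features))
print(len(subsets))  # 2**4 = 16

# Hypothetical scoring function: rewards two useful features and charges
# a small cost for every feature included.
useful = {"age", "income"}

def toy_score(subset):
    return len(useful & set(subset)) - 0.1 * len(subset)

print(best_subset(features, toy_score))
print(2 ** 30)  # number of subsets to evaluate with only 30 features
```

With four features the search is trivial, but each additional feature doubles the number of subsets to evaluate.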

Feature creation

It is frequently possible to create, from the original attributes, a new set of attributes that captures the important information in a data set much more effectively. Furthermore, the number of new attributes can be smaller than the number of original attributes, allowing us to reap all the previously described benefits of dimensionality reduction.

Related methodologies for creating new attributes are as follows:-

  1. Feature extraction:- The creation of a new set of features from the original raw data is known as feature extraction. For example, consider a set of photographs in which each image is to be classified according to whether or not it contains a human face. The raw data is a set of pixels, which is not suitable for many types of classification algorithms. However, if the data is processed to provide higher-level features, such as the presence or absence of certain types of edges and areas that are highly correlated with the presence of human faces, then a much broader set of classification techniques can be applied to this problem. Unfortunately, in the sense in which it is most commonly used, feature extraction is highly domain-specific.
  2. Feature construction:- Sometimes the features in the data set have the necessary information, but it is not in a form suitable for the data mining algorithm. In this situation, one or more new features constructed from the original features can be more useful than the original features.
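A small illustration of feature construction, using an example that is not from the text above: objects described by mass and volume may be easier to tell apart by material once a density feature is constructed. The records and field names are hypothetical.

```python
# Hypothetical objects described by mass (g) and volume (cm^3).
artifacts = [
    {"name": "a", "mass": 193.0, "volume": 10.0},
    {"name": "b", "mass": 27.0, "volume": 10.0},
    {"name": "c", "mass": 386.0, "volume": 20.0},
]

def add_density(objects):
    """Construct a new feature, density = mass / volume, for each object."""
    return [{**obj, "density": obj["mass"] / obj["volume"]} for obj in objects]

with_density = add_density(artifacts)
for obj in with_density:
    print(obj["name"], obj["density"])
```

Objects "a" and "c" differ in both mass and volume, yet the constructed feature shows that they have the same density. Neither original feature reveals this on its own.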

Discretization and Binarization

Some data mining algorithms, especially certain classification algorithms, require that data be in the form of categorical attributes. Algorithms that find association patterns require that data be in the form of binary attributes. Thus, it is often necessary to transform a continuous attribute into a categorical attribute (discretization), and both continuous and discrete attributes may need to be transformed into one or more binary attributes (binarization).

Discretization:- It is the process of converting continuous attributes into discrete attributes.
Binarization:- It is the process of converting continuous and discrete attributes into binary attributes.

The best discretization or binarization approach is the one that produces the best result for the data mining algorithm that will be used to analyze the data. It is typically not practical to apply such a criterion directly. Consequently, discretization or binarization is performed in a way that satisfies a criterion that is thought to have a relationship to good performance for the data mining task being considered.

Binarization of categorical data

A simple technique to binarize a categorical attribute is as follows.

  • If there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m-1].
  • Convert each of these m integers into a binary number with n digits, where $n = \lceil \log_{2}(m) \rceil$.
  • Since n binary digits are required to represent these integers, represent the binary numbers using n binary attributes.
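The steps above can be sketched as follows. The category names are assumptions made for the example; the number of bits is rounded up so that every integer in [0, m-1] fits.

```python
import math

def binary_encode(categories):
    """Map each category to an integer in [0, m-1] and then to n binary attributes."""
    m = len(categories)
    n = max(1, math.ceil(math.log2(m)))  # bits needed to represent m distinct integers
    table = {}
    for integer, value in enumerate(categories):
        # Most significant bit first, so bits[0] plays the role of x1.
        bits = [(integer >> shift) & 1 for shift in range(n - 1, -1, -1)]
        table[value] = (integer, bits)
    return table

# Hypothetical ordered categories (m = 5, so n = 3 binary attributes).
categories = ["awful", "poor", "OK", "good", "great"]
encoding = binary_encode(categories)
for value, (integer, bits) in encoding.items():
    print(value, integer, bits)
```

With these assumed names, "good" maps to the integer 3 and therefore to x1 = 0, x2 = 1, x3 = 1.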


Categorical value | Integer value | x1 | x2 | x3

Such a transformation can cause complications, such as creating unintended relationships among the transformed attributes. For example, in the table above, attributes x2 and x3 are correlated because information about the good value is encoded using both attributes. Furthermore, association analysis requires asymmetric binary attributes, where only the presence of the attribute (value = 1) is important. For association problems, it is therefore necessary to introduce one binary attribute for each categorical value, as shown in the table below.

Categorical value | Integer value | x1 | x2 | x3 | x4 | x5

Likewise, for association problems, it may be necessary to replace a single binary attribute with two asymmetric binary attributes.
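Both ideas, one binary attribute per categorical value and splitting a single binary attribute into two asymmetric ones, can be sketched as below. The category names and the function names are assumptions made for illustration.

```python
def one_hot_encode(categories):
    """Create one asymmetric binary attribute per categorical value."""
    return {
        value: [1 if position == index else 0 for position in range(len(categories))]
        for index, value in enumerate(categories)
    }

def split_binary(value):
    """Replace one binary attribute by two asymmetric ones: (is_true, is_false)."""
    return (1, 0) if value else (0, 1)

categories = ["awful", "poor", "OK", "good", "great"]  # hypothetical names
one_hot = one_hot_encode(categories)
print(one_hot["good"])
print(split_binary(True), split_binary(False))
```

Each value now sets exactly one attribute to 1, so no two attributes jointly encode the same categorical value.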

Discretization of continuous attributes

Discretization is typically applied to attributes that are used in classification or association analysis. In general, the best discretization depends on the algorithm being used, as well as on the other attributes being considered. Typically, however, the discretization of an attribute is considered in isolation.

Transformation of a continuous attribute to a categorical attribute involves two subtasks:

  1. Deciding how many categories to have
  2. Determining how to map the values of the continuous attribute to these categories

These subtasks are typically carried out as follows:

  1. Sort the values of the continuous attribute.
  2. Divide the sorted values into n intervals by specifying n-1 split points.
  3. Map all the values in one interval to the same categorical value.

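The result can be represented either as a set of intervals $\{(x_0, x_1], (x_1, x_2], \ldots, (x_{n-1}, x_n)\}$, where $x_0$ and $x_n$ may be $-\infty$ and $+\infty$ respectively, or equivalently as a series of inequalities $x_0 < x \le x_1, \ldots, x_{n-1} < x < x_n$.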
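Once the split points are chosen, mapping values to intervals is a lookup. The sketch below uses the standard `bisect` module; the attribute values, split points, and category labels are hypothetical.

```python
from bisect import bisect_left

def discretize(values, split_points):
    """Map each value to the index of the interval it falls in.

    With sorted split points s1 < s2 < ..., interval 0 is (-inf, s1],
    interval 1 is (s1, s2], and the last interval is (s_last, +inf).
    """
    return [bisect_left(split_points, v) for v in values]

ages = [3, 17, 18, 25, 64, 65, 90]      # hypothetical continuous attribute
split_points = [17, 64]                 # n - 1 = 2 split points give n = 3 intervals
labels = ["child", "adult", "senior"]   # hypothetical category names
categories = [labels[i] for i in discretize(ages, split_points)]
print(categories)
```

A value equal to a split point falls into the lower interval, which matches intervals that are closed on the right.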

A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised). If class information is not used, then relatively simple approaches are common. For instance, the equal width approach divides the range of the attribute into a user-specified number of intervals, each having the same width. As another example of unsupervised discretization, a clustering method such as k-means can be used.
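A minimal sketch of the equal width approach, with made-up values and a made-up number of intervals. The handling of the maximum value and of constant attributes is a choice made for this example.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]  # hypothetical data
bins = equal_width_bins(values, 4)
print(bins)
```

Nine of the ten values land in the first interval because the single large value stretches the range. This shows how sensitive equal width discretization is to outliers.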

Variable transformation

A variable transformation refers to a transformation that is applied to all the values of a variable. In other words, for each object, the transformation is applied to the value of the variable for that object. For example, if only the magnitude of a variable is important, then the values of the variable can be transformed by taking the absolute value.

Some types of variable transformations are as follows

  1. Simple function transformation:- For this type of variable transformation, a simple mathematical function is applied to each value individually. Examples include $x^k$, $\log(x)$, and $e^x$. Variable transformations should be applied with caution since they change the nature of the data. While this is what is desired, there can be problems if the nature of the transformation is not fully appreciated.

  2. Normalization or standardization:- Another type of variable transformation is the standardization or normalization of a variable. The goal of standardization or normalization is to make an entire set of values have a particular property. A traditional example is that of “standardizing a variable” in statistics. If $\bar{x}$ is the mean of the attribute values and $s_{x}$ is their standard deviation, then the transformation $$x' = (x-\bar{x})/s_{x}$$ creates a new variable that has a mean of 0 and a standard deviation of 1. If different variables are to be combined in some way, then such a transformation is often necessary to avoid having a variable with large values dominate the results of the calculation. The mean and standard deviation are strongly affected by outliers, so the above transformation is often modified. First, the mean is replaced by the median, i.e., the middle value. Second, the standard deviation is replaced by the absolute standard deviation. If x is a variable, then the absolute standard deviation of x is given by $$\sigma_{x} = \sum_{i=1}^{m} |x_{i} - \mu|$$ where $x_{i}$ is the $i^{th}$ value of the variable, $m$ is the number of objects, and $\mu$ is either the mean or the median.
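Both versions of the transformation can be sketched with the standard `statistics` module. The sample values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which form of $s_{x}$ is intended.

```python
import statistics

def standardize(values):
    """x' = (x - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumed here)
    return [(v - mean) / std for v in values]

def robust_standardize(values):
    """x' = (x - median) / absolute standard deviation, with mu = median."""
    median = statistics.median(values)
    abs_std = sum(abs(v - median) for v in values)  # sum of |x_i - mu|
    return [(v - median) / abs_std for v in values]

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical attribute values
z = standardize(values)
r = robust_standardize(values)
print([round(v, 3) for v in z])
print([round(v, 3) for v in r])
```

After `standardize`, the values have mean 0 and standard deviation 1. After `robust_standardize`, the median is 0 and the scale no longer depends on squared deviations, which reduces the influence of outliers.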