Types of Data Sets

General characteristics of data sets

  • Dimensionality:- The dimensionality of a data set is the number of attributes that the objects in the data set possess. Data with a small number of dimensions tends to be qualitatively different from moderate- or high-dimensional data. The difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality.

  • Sparsity:- For some data sets, such as those with asymmetric features, most of the attributes of an object have a value of 0; in many cases, fewer than 1% of the entries are non-zero. In practical terms, sparsity is an advantage because only the non-zero values need to be stored and manipulated.

  • Resolution:- It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions. For example, the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers. If the resolution is too fine, a pattern may not be visible or may be buried in noise.
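
The storage advantage of sparsity can be illustrated with a minimal sketch that keeps only the non-zero entries of an object. The helper names `to_sparse` and `dot_sparse` and the example vectors are assumptions made for illustration; they do not come from any particular library.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {attribute index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def dot_sparse(a, b):
    """Dot product that only touches indices that are non-zero in both vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter dictionary
    return sum(v * b[i] for i, v in a.items() if i in b)


# Hypothetical objects with 10 attributes, most of them 0.
dense_x = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]
dense_y = [0, 2, 4, 0, 0, 0, 0, 0, 0, 5]

sparse_x = to_sparse(dense_x)
sparse_y = to_sparse(dense_y)

print(sparse_x)                        # {2: 3, 6: 1}
print(dot_sparse(sparse_x, sparse_y))  # 12
```

Only two of the ten entries of `dense_x` are stored, and the dot product visits only those entries.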

Record Data

Record data is a collection of data objects, each of which consists of a fixed set of data fields (attributes). Record data is generally stored either in flat files or in relational databases.



Transaction or Market Basket Data

Transaction data is a special type of record data, where each record (transaction) involves a set of items. This type of data is called market basket data because the items in each record are the products in a person’s “market basket”.


TID  Items
1    Bread, Soda, Milk
2    Beer, Bread
3    Beer, Soda, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Soda, Diaper, Milk
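
As a rough sketch, the five transactions listed above can be held as sets of items and then turned into fixed-length 0-1 records, one attribute per item. The variable names (`transactions`, `items`, `binary_rows`) are chosen here for illustration only.

```python
# Each transaction ID maps to the set of items in that basket.
transactions = {
    1: {"Bread", "Soda", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Soda", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Soda", "Diaper", "Milk"},
}

# One attribute per distinct item, in a fixed (alphabetical) order.
items = sorted(set().union(*transactions.values()))

# 1 if the item is in the basket, 0 otherwise.
binary_rows = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

print(items)           # ['Beer', 'Bread', 'Diaper', 'Milk', 'Soda']
print(binary_rows[1])  # [0, 1, 0, 1, 1]
print(binary_rows[2])  # [1, 1, 0, 0, 0]
```

Each row has the same set of attributes, so the transactions can be treated as ordinary record data.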

The data matrix

If the data objects in a collection of data all have the same fixed set of numeric attributes, then each data object can be thought of as a point (vector) in a multidimensional space, where each dimension represents a distinct attribute describing the object. A set of such data objects can be interpreted as an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute. This matrix is called a data matrix or a pattern matrix.

Projection of x load    Projection of y load    Distance    Load    Thickness
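
A data matrix can be sketched as a list of rows, one row per object. The numeric values below are invented purely for illustration, and the helpers `column` and `euclidean` are hypothetical names. Because each row is a point in n-dimensional space, a distance between two objects is well defined.

```python
import math

# Hypothetical data matrix: m = 3 objects (rows), n = 4 numeric attributes (columns).
data_matrix = [
    [1.0, 2.0, 0.0, 4.0],
    [1.0, 0.0, 0.0, 1.0],
    [3.0, 2.0, 1.0, 0.0],
]

m = len(data_matrix)     # number of objects
n = len(data_matrix[0])  # number of attributes


def column(matrix, j):
    """Return the values of attribute j for every object."""
    return [row[j] for row in matrix]


def euclidean(p, q):
    """Euclidean distance between two objects viewed as points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(m, n)                    # 3 4
print(column(data_matrix, 0))  # [1.0, 1.0, 3.0]
distance = euclidean(data_matrix[0], data_matrix[1])
print(round(distance, 4))      # 3.6056
```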

The sparse data matrix

A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric; i.e., only non-zero values are important. Transaction data is an example of a sparse data matrix that has only 0-1 entries. Another common example is document data. In particular, if the order of terms (words) in a document is ignored, then a document can be represented as a term vector, where each term is a component (attribute) of the vector and the value of each component is the number of times the corresponding term occurs in the document. This representation of a collection of documents is often called a document-term matrix.


Document 1   3  0  5  0  2  6  0  2  0  2
Document 2   0  7  0  2  1  0  0  3  0  0
Document 3   0  1  0  0  1  2  2  0  3  0
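
The construction of a document-term matrix can be sketched as follows. The three short documents are made up for illustration and are not the documents behind the counts shown above. Splitting on whitespace is a simplifying assumption; real text would need proper tokenization.

```python
from collections import Counter

# Hypothetical documents; word order is ignored once the terms are counted.
documents = {
    "Document 1": "the team won the game",
    "Document 2": "the coach called a timeout",
    "Document 3": "the game score",
}

# Count how often each term occurs in each document.
counts = {name: Counter(text.lower().split()) for name, text in documents.items()}

# The vocabulary defines the columns (attributes) of the matrix.
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per term; absent terms count as 0.
doc_term_matrix = {
    name: [term_counts[term] for term in vocabulary]
    for name, term_counts in counts.items()
}

print(vocabulary)
# ['a', 'called', 'coach', 'game', 'score', 'team', 'the', 'timeout', 'won']
print(doc_term_matrix["Document 1"])
# [0, 0, 0, 1, 0, 1, 2, 0, 1]
```

Most entries are 0 even in this tiny example, which is why such matrices are usually stored in a sparse form.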

Graph-based data

A graph can sometimes be a convenient and powerful representation for data. There are two specific cases of graph-based data:-

  1. Data with relationships among objects:- The relationships among objects frequently convey important information. In such cases, the data is often represented as a graph. In particular, the objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between objects and by link properties, such as direction and weight.

  2. Data with objects that are graphs:- If objects have structure, that is, the objects contain subobjects that have relationships, then such objects are frequently represented as graphs. For example, the structure of chemical compounds can be represented by a graph, where the nodes are atoms and the links between the nodes are chemical bonds.
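
Both cases can be sketched with plain dictionaries. The object names, link weights, and the `degree` helper are assumptions made for illustration. The water molecule is used only as a small, well-known example of an object that is itself a graph.

```python
# Case 1: objects are nodes; links carry direction and weight.
# Hypothetical links of the form (source, target, weight).
links = [("A", "B", 1.0), ("A", "C", 0.5), ("C", "A", 2.0)]

graph = {}
for source, target, weight in links:
    graph.setdefault(source, {})[target] = weight  # directed, weighted link
    graph.setdefault(target, {})                   # make sure every node exists

print(graph)  # {'A': {'B': 1.0, 'C': 0.5}, 'B': {}, 'C': {'A': 2.0}}

# Case 2: a single object that is a graph (atoms as nodes, bonds as links).
water = {
    "atoms": {0: "O", 1: "H", 2: "H"},
    "bonds": [(0, 1), (0, 2)],
}


def degree(molecule, atom_id):
    """Number of bonds that involve the given atom."""
    return sum(atom_id in bond for bond in molecule["bonds"])


print(degree(water, 0))  # 2
```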

Ordered data

For some types of data, the attributes have relationships that involve order in time or space. The different types of ordered data are as follows:-

  1. Sequential data:- Sequential data, also referred to as temporal data, can be thought of as an extension of record data, where each record has a time associated with it.

  2. Sequence data:- Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters. It is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence. For example, the genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes.

  3. Time Series Data:- Time series data is a special type of sequential data in which each record is a time series, i.e., a series of measurements taken over time. For example, a financial data set might contain objects that are time series of the daily prices of various stocks.

  4. Spatial Data:- Some objects have spatial attributes, such as positions or areas, as well as other types of attributes. An example is weather data that is collected at a variety of geographical locations.
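
The difference between the first three kinds of ordered data can be sketched in a few lines. All dates, items, prices, and the nucleotide string below are invented for illustration.

```python
from datetime import date

# Sequential data: ordinary records, each with an associated time.
records = [
    {"time": date(2024, 1, 3), "items": ["Bread", "Milk"]},
    {"time": date(2024, 1, 1), "items": ["Beer"]},
    {"time": date(2024, 1, 2), "items": ["Soda", "Diaper"]},
]
by_time = sorted(records, key=lambda record: record["time"])
print([record["items"] for record in by_time])
# [['Beer'], ['Soda', 'Diaper'], ['Bread', 'Milk']]

# Sequence data: no time stamps, only positions in an ordered sequence.
sequence = "GGTTCCGCCTTCAGC"
position_of_first_c = sequence.index("C")
print(position_of_first_c)  # 4

# Time series data: one object is a series of measurements over time.
daily_prices = [10.0, 10.5, 10.2, 11.0]
daily_changes = [round(b - a, 2) for a, b in zip(daily_prices, daily_prices[1:])]
print(daily_changes)  # [0.5, -0.3, 0.8]
```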

Handling Non-Record data

Most data mining algorithms are designed for record data or its variations, such as transaction data and data matrices. Record-oriented techniques can be applied to non-record data by extracting features from data objects and using these features to create a record corresponding to each object.
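
As one possible sketch of this idea, chemical compounds stored as graphs can be reduced to fixed-length records by counting simple features. The choice of features, the field names, and the `to_record` helper are assumptions for illustration; other features could equally be used.

```python
from collections import Counter

# Hypothetical non-record data: each compound is a graph of atoms and bonds.
molecules = {
    "water": {"atoms": ["O", "H", "H"], "bonds": [(0, 1), (0, 2)]},
    "methane": {
        "atoms": ["C", "H", "H", "H", "H"],
        "bonds": [(0, 1), (0, 2), (0, 3), (0, 4)],
    },
}


def to_record(name, molecule):
    """Extract a fixed set of features so the object becomes a record."""
    element_counts = Counter(molecule["atoms"])
    return {
        "name": name,
        "n_atoms": len(molecule["atoms"]),
        "n_bonds": len(molecule["bonds"]),
        "n_carbon": element_counts["C"],
        "n_oxygen": element_counts["O"],
        "n_hydrogen": element_counts["H"],
    }


records = [to_record(name, molecule) for name, molecule in molecules.items()]
for record in records:
    print(record)
```

Every record now has the same attributes, so record-oriented techniques apply. The record does not say which atoms are bonded to which, so some of the information in the original graph is lost.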