Classification

In general, classification is the task of assigning objects to one of several predefined categories. Example use cases include:

  1. Spam detection
  2. Classification of galaxies based on their shapes
  3. Categorizing cells as malignant or benign based on the results of an MRI scan

Classification can be viewed as the task of mapping an input attribute set $x$ into its class label $y$.

Formal definition of classification

Formally, classification is the task of learning a target function $f$ that maps each attribute set $x$ to one of the predefined class labels $y$. The target function is also known informally as a classification model.

Usages

A classification model is useful for the following purposes.

  1. Descriptive modeling:- A classification model can serve as an explanatory tool to distinguish between objects of different classes.
  2. Predictive modeling:- A classification model can also be used to predict the class labels of unknown records.

Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories because they do not consider the implicit order among the categories.

General approach to solve a classification problem

A classification technique (or classifier) is a systematic approach to building classification models from an input data set. Each technique employs a learning algorithm (decision trees, KNN, SVM, neural networks, etc.) to identify a model that best fits the relationship between the attribute set and the class label of the input data. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before. Therefore, the key objective of the learning algorithm is to build models with good generalization capability, i.e., models that accurately predict the class labels of previously unknown records.

Procedure:-

  1. A training set consisting of records whose class labels are known is taken.
  2. The training set is used to build a classification model.
  3. The built model is applied to the test set to predict the class labels of its records.
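The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a 1-nearest-neighbor learner (the simplest case of KNN) and made-up two-attribute records; the names `train_1nn` and `predict_1nn` are hypothetical helpers, not part of any library.

```python
import math


def train_1nn(training_set):
    """'Build' a 1-nearest-neighbor model, which simply stores the labeled records."""
    return list(training_set)


def predict_1nn(model, x):
    """Return the label of the stored record closest to x (Euclidean distance)."""
    _, nearest_label = min(model, key=lambda record: math.dist(record[0], x))
    return nearest_label


# Step 1: a training set of (attribute set, class label) pairs (illustrative values).
training_set = [
    ((1.0, 1.0), 0),
    ((1.5, 2.0), 0),
    ((5.0, 5.0), 1),
    ((6.0, 5.5), 1),
]

# Step 2: use the training set to build the model.
model = train_1nn(training_set)

# Step 3: apply the built model to a test set the model has not seen.
test_set = [((1.2, 1.1), 0), ((5.5, 5.0), 1), ((3.0, 3.0), 0)]
predicted = [predict_1nn(model, x) for x, _ in test_set]
actual = [label for _, label in test_set]

print(predicted)  # labels predicted for the test records
print(actual)     # known labels, kept aside for evaluation
```

Any other learning algorithm could be swapped in for the two helper functions; the train, then predict, then compare structure stays the same.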

Confusion matrix

Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a table known as a confusion matrix. The following table depicts the confusion matrix for a binary classification problem.

$$
\begin{array}{c|cc}
 & \text{Predicted class} = 1 & \text{Predicted class} = 0 \\
\hline
\text{Actual class} = 1 & f_{11} & f_{10} \\
\text{Actual class} = 0 & f_{01} & f_{00}
\end{array}
$$

Each entry $f_{ij}$ in the above table denotes the number of records from class $i$ predicted to be of class $j$. For instance, $f_{01}$ is the number of records from class $0$ incorrectly predicted to be of class $1$. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is ($f_{11}+f_{00}$) and the total number of incorrect predictions is ($f_{01}+f_{10}$).
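
As a rough sketch of how the entries $f_{ij}$ could be counted, the snippet below tallies pairs of actual and predicted labels for a binary problem. The function name `confusion_counts` and the label lists are illustrative assumptions, not data from a real model.

```python
def confusion_counts(actual, predicted):
    """Return a dict mapping (i, j) to f_ij: records of class i predicted as class j."""
    counts = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


# Made-up labels for eight test records.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_counts(actual, predicted)

correct = f[(1, 1)] + f[(0, 0)]    # f_11 + f_00
incorrect = f[(0, 1)] + f[(1, 0)]  # f_01 + f_10

print(f)
print(correct, incorrect)
```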

We can use the confusion matrix to calculate other performance measures, such as accuracy and error rate, using the following formulas.

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11}+f_{10}+f_{01}+f_{00}}$$

$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11}+f_{10}+f_{01}+f_{00}}$$
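
The two formulas translate directly into code. The sketch below assumes the four confusion-matrix counts are already available; the counts used in the example call are made up for illustration.

```python
def accuracy(f11, f10, f01, f00):
    """Fraction of predictions that are correct."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)


def error_rate(f11, f10, f01, f00):
    """Fraction of predictions that are wrong."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)


# Illustrative counts: f11 = 3, f10 = 1, f01 = 1, f00 = 3.
acc = accuracy(3, 1, 1, 3)
err = error_rate(3, 1, 1, 3)

print(acc)  # 0.75
print(err)  # 0.25
```

Because every prediction is either correct or wrong, the two measures always sum to 1, so maximizing accuracy and minimizing error rate are the same objective.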

Most classification algorithms seek models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set.
