# Misc topics related to classification

## Class Imbalance Problem

Data sets with imbalanced class distributions are quite common in many real applications. For example, an automated inspection system that monitors products coming off a manufacturing assembly line may find that the number of defective products is significantly smaller than the number of non-defective products. Similarly, in credit card fraud detection, fraudulent transactions are outnumbered by legitimate transactions. In both of these examples, there is a disproportionate number of instances belonging to different classes. The degree of imbalance varies from one application to another, but an imbalanced class distribution presents a number of problems for existing classification algorithms.

- The accuracy measure, which is used extensively to compare the performance of classifiers, may not be well suited for evaluating models derived from imbalanced data sets.
- Detecting instances of the rare class is akin to finding a needle in a haystack. Because its instances occur infrequently, models that describe the rare class tend to be highly specialized.

## Alternate metrics

Since the accuracy measure treats every class as equally important, it may not be suitable for analyzing imbalanced data sets, where the rare class is considered more interesting than the majority class. For binary classification, the rare class is often denoted as the positive class, while the majority class is denoted as the negative class. A confusion matrix that summarizes the number of instances predicted correctly or incorrectly by a classification model is shown below.

|                     | Predicted positive | Predicted negative |
| ------------------- | ------------------ | ------------------ |
| **Actual positive** | $f_{++}$ (TP)      | $f_{+-}$ (FN)      |
| **Actual negative** | $f_{-+}$ (FP)      | $f_{--}$ (TN)      |

The following terminology is often used when referring to the counts tabulated in a confusion matrix:

- True Positive (TP) or $f_{++}$, which corresponds to the number of positive examples correctly predicted by the classification model.
- False Negative (FN) or $f_{+-}$, which corresponds to the number of positive examples wrongly predicted as negative by the classification model.
- False Positive (FP) or $f_{-+}$, which corresponds to the number of negative examples wrongly predicted as positive by the classification model.
- True Negative (TN) or $f_{--}$, which corresponds to the number of negative examples correctly predicted by the classification model.
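As a small sketch (plain Python, no libraries; the labels here are made-up illustrative values), the four counts can be tallied directly from paired lists of actual and predicted labels:

```python
# Tally confusion-matrix counts from actual vs. predicted binary labels
# (1 = positive/rare class, 0 = negative/majority class).
def confusion_counts(actual, predicted):
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1  # positive correctly predicted
        elif a == 1 and p == 0:
            fn += 1  # positive wrongly predicted as negative
        elif a == 0 and p == 1:
            fp += 1  # negative wrongly predicted as positive
        else:
            tn += 1  # negative correctly predicted
    return tp, fn, fp, tn

actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 4)
```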

The counts in a confusion matrix can also be expressed in terms of percentages. The **true positive rate** (TPR) or **sensitivity** is defined as the fraction of positive examples predicted correctly by the model:
$$TPR = \frac{TP}{TP+FN}$$
Similarly, the **true negative rate** (TNR) or **specificity** is defined as the fraction of negative examples predicted correctly by the model:
$$TNR = \frac{TN}{TN+FP}$$
The **false positive rate** (FPR) is the fraction of negative examples predicted as the positive class,
$$FPR = \frac{FP}{TN+FP}$$
while the **false negative rate** (FNR) is the fraction of positive examples predicted as the negative class:
$$FNR = \frac{FN}{TP+FN}$$
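As a minimal sketch in plain Python (the counts below are hypothetical illustrative values, not from real data), each rate follows directly from the four confusion-matrix counts:

```python
# Compute the four rates from hypothetical confusion-matrix counts.
tp, fn, fp, tn = 40, 10, 5, 45  # illustrative values

tpr = tp / (tp + fn)  # true positive rate (sensitivity)
tnr = tn / (tn + fp)  # true negative rate (specificity)
fpr = fp / (tn + fp)  # false positive rate
fnr = fn / (tp + fn)  # false negative rate

print(tpr, tnr, fpr, fnr)  # 0.8 0.9 0.1 0.2
```

Note that TPR + FNR = 1 and TNR + FPR = 1, since each pair partitions the same class.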

**Recall** and **precision** are two widely used metrics employed in applications where successful detection of one class is considered more significant than detection of the other class. Formal definitions of these metrics are:
$$Precision,\ p = \frac{TP}{TP+FP}$$
$$Recall,\ r = \frac{TP}{TP+FN}$$

Precision determines the fraction of records that actually turn out to be positive among the records the classifier has declared to be positive. The higher the precision, the lower the number of false positive errors committed by the classifier.

Recall measures the fraction of positive examples correctly predicted by the classifier. Classifiers with high recall have very few positive examples misclassified as the negative class.

Precision and recall can be summarized into another metric known as the $F_{1}$ measure: $$F_{1} = \frac{2rp}{r+p} = \frac{2 * TP}{2 * TP + FP + FN}$$

In principle, $F_{1}$ represents a harmonic mean between recall and precision, i.e., $$F_{1} = \frac{2}{\frac{1}{r} + \frac{1}{p}}$$ The harmonic mean of two numbers x and y tends to be closer to the smaller of the two numbers. Hence, a high value of the $F_{1}$ measure ensures that both precision and recall are reasonably high.
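As a small sketch (plain Python, hypothetical illustrative counts), precision, recall, and the $F_{1}$ measure can be computed in both forms and checked against each other:

```python
# Precision, recall, and F1 from hypothetical confusion-matrix counts.
tp, fp, fn = 40, 20, 10  # illustrative values

precision = tp / (tp + fp)                       # 40/60
recall    = tp / (tp + fn)                       # 40/50
f1        = 2 * tp / (2 * tp + fp + fn)          # count form
f1_hmean  = 2 / (1 / recall + 1 / precision)     # harmonic-mean form

print(round(precision, 3), round(recall, 3), round(f1, 3))
# The two F1 forms are algebraically equivalent:
assert abs(f1 - f1_hmean) < 1e-12
```

Because the harmonic mean is dominated by the smaller of its two arguments, $F_{1}$ stays low whenever either precision or recall is low, which is exactly the behavior wanted when evaluating rare-class detection.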