Evaluating the performance of a classifier

It is often useful to measure the performance if the model on the test set because such a measure provides an unbiased estimate of it’s generalization error. The accuracy or error rate computed from the the test set can also be used to compare the relative performance of different classifiers on the same domain. However to do this, the class labels of the test records must be known. We will now study some of the common methods used to evaluate the performance of a classifier.

Holdout Method

In the holdout method, the original data with labeled example is partitioned into two disjoin sets, called the training set and test set, respectively. A classification model is then induced from the training set and its performance is evaluated on the test set. The proportion of data reserved for training and for testing is typically as the discretion of the analyst (for example $\frac{2}{3}$ for training and $\frac{1}{3}$ for testing). The accuracy of the classifier can be estimated on the accuracy of the induced model on the test set.

The holdout method has several well-known limitation.

  1. Fewer labeled examples are available for training because some of the records are with-held for testing.
  2. The model may be highly dependent on the the composition of the training set and test set. The smaller the training set the higher the variance of the model. On the other hand if the training set is too large, then the estimated accuracy computed from the smaller set is less reliable.
  3. The training and test sets are no longer independent of each other, as the training and test sets are subsets of the original data, a class that is overrepresented in one subset will be under-represented in the other, and vice versa.

Random subsampling

The holdout method can be repeated several times to improve the estimation of a classifier’s performance. This approach is known as random subsampling. Let $acc_{i}$, be the model accuracy during the $i^{th}$ iteration. The overall accuracy is given by $acc_{sub} = \sum_{i=1}^{k} \frac{acc_{i}}{k}$.

Random subsampling still encounters some of the problems associated with holdout method such as.

  1. It doesn’t utilize as much data as possible for training
  2. It doesn’t have control over the number of times each record is used for testing and training, Consequently, some records might be used for training more often than others.


An alternative to random subsampling is cross-validation. In this approach, each record is used the same number of times for training and exactly once for testing. For example, suppose we choose one of the subsets for training and the other for testing. We then swap the roles of the subsets so that the previous training subset becomes testing subset and vice versa. This approach is called a two fold cross-validation. The total error is obtained by summing up the errors for both runs. The k-fold cross-validation method generalizes this approach by segmenting the data into k equal sized partitions. During each run, one of the partition is chosen for testing, while the rest of them are used for training. This procedure is repeated k times so that each partition is used for testing exactly once.

A special case of the k-fold cross-validation method sets k = N, the size of the data set. In this so-called leave-one-out approach, each test set contains only one record. This approach has an advantage of utilizing as much data as possible for training. In addition, the test sets are mutually exclusive and they effectively cover the entire data set. The drawback of this approach is that it is computationally expensive to repeat the procedure N times. Furthermore, since each test set contains only one record, the variance of the estimated performance metric tends to be high.


The methods presented so far assume that the training records are sampled without replacement. As a result, there are no duplicate records in the training and test sets. In the bootstrap approach, the training records are sampled with replacement; i.e., a record already chosen for training is put back into the original pool of records so that it is equally likely to be redrawn. If the original data has $N$ records, it can be shown that, on average, a bootstrap sample size of N contains about 63.2% of the records in the original data. This approximation follows from the fact that the probability record is chosen by a bootsrap sample $1 - (1 - \frac{1}{N})^N)$. When N is sufficiently large, the probability asymptotically approaches $1 - e^{-1} = 0.632$. Records that are not included in the bootstrap sample becomes part of the test set. The model induced from the training set is then applied to the test set to obtain an estimate of the accuracy of the bootstrap sample, $\varepsilon_{i}$

There are several variations to the bootsrap sampling approach in terms of how the overall accuracy of the classifier is computed. One of the more widely used approach is the .632 bootstrap, which computes the overall accuracy by combining the accuracies of each bootstrap sample $\varepsilon_{i}$ with the accuracy computed from a training set that contains all the labeled examples in the original data ($acc_{s}$): $$Accuracy,\ acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} (0.632 * \varepsilon_{i} + 0.368 * acc_{s})$$