# KNN classifier

Decision tree and rule-based classifiers are examples of **eager learners** because they are designed to learn a model that maps the input attributes to the class label as soon as the training data becomes available. An opposite strategy is to delay the process of modeling the training data until it is needed to classify the test examples. Techniques that employ this strategy are known as **lazy learners**. An example of a lazy learner is the **Rote classifier**, which memorizes the entire data and performs classification only if the attributes of a test instance match one of the training examples exactly. An obvious drawback of this approach is that some test records may not be classified because they do not match any training example.

## Nearest-Neighbour Classifier

One way to make the rote classifier approach more flexible is to find all the training examples that are relatively similar to the attributes of the test example. These examples, which are known as **nearest neighbours**, can be used to determine the class label of the test example. The justification for using nearest neighbours is best exemplified by the following saying: “If it walks like a duck, quacks like a duck, and looks like a duck, then it’s probably a duck.” A nearest neighbour classifier represents each example as a data point in a d-dimensional space, where d is the number of attributes. The figure below illustrates the 1, 2, and 3 nearest neighbours of a data point located at the center of each circle. The data point is classified based on the class labels of its neighbours. In the case where the neighbours have more than one label, the data point is assigned to the majority class of its neighbours. In the case where there is a tie between the classes, we may randomly choose one of them to classify the data point.
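As a concrete sketch of the idea, the following toy example (the data points and labels are invented for illustration) ranks a small 2-d training set by Euclidean distance to a test point and lists the labels of its 1, 2, and 3 nearest neighbours:

```python
import math

# Toy 2-d training set: (attributes, class label)
train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.0), "-")]
z = (1.1, 0.9)  # test point at the centre of the circles

# Rank all training examples by Euclidean distance to z
ranked = sorted(train, key=lambda rec: math.dist(rec[0], z))

for k in (1, 2, 3):
    print(k, [label for _, label in ranked[:k]])
# 1 ['+']
# 2 ['+', '+']
# 3 ['+', '+', '-']
```

With k = 3, the majority class among the neighbour labels (`'+'` here) would be assigned to the test point.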

The importance of choosing the right value of k can be understood from the previous discussion. If k is too small, then the nearest-neighbour classifier may be susceptible to overfitting because of noise in the training data. On the other hand, if k is too large, the nearest-neighbour classifier may misclassify the test instance because its list of nearest neighbours may include data points that are located far away from its neighbourhood.

## Algorithm

A high-level summary of the nearest-neighbour classification method is given below.

The algorithm computes the distance between each test example $z = (x', y')$ and all the training examples $(x, y) \in D$ to determine its nearest-neighbour list, $D_{z}$. Such computation can be costly if the number of training examples is large. However, efficient indexing techniques are available to reduce the amount of computation needed to find the nearest neighbours of a test example. Once the nearest-neighbour list is obtained, the test example is classified based on the majority class of its nearest neighbours: $$\text{Majority Voting: } y' = \underset{v}{\operatorname{argmax}} \sum_{(x_{i}, y_{i}) \in D_{z}} I(v = y_{i})$$

where $v$ is a class label, $y_{i}$ is the class label of one of the nearest neighbours, and $I(\cdot)$ is an indicator function that returns the value 1 if its argument is true and 0 otherwise.
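The majority-voting formula above can be sketched directly in code. This is a minimal illustration (the function name, toy data set, and use of Euclidean distance are choices made here, not mandated by the text), with the vote count playing the role of $\sum I(v = y_i)$:

```python
import math
from collections import Counter

def knn_classify(D, z, k):
    """Majority-voting k-NN: y' = argmax_v of the count of neighbours with label v."""
    # D_z: the k training examples nearest to z (Euclidean distance here)
    D_z = sorted(D, key=lambda xy: math.dist(xy[0], z))[:k]
    votes = Counter(y_i for _, y_i in D_z)  # per class v, the sum of I(v = y_i)
    return votes.most_common(1)[0][0]       # argmax over v

# Toy training set: (attributes, class label)
D = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]
print(knn_classify(D, (1.1, 1.0), k=3))  # "A": two of the three nearest neighbours vote A
```

Note that this brute-force version scans all of $D$ for every test example; the indexing techniques mentioned above exist precisely to avoid that full scan.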

In the majority voting approach, every neighbour has the same impact on the classification. This makes the algorithm sensitive to the choice of $k$, as shown in the previous diagram. One way to reduce the impact of $k$ is to weight the influence of each neighbour $x_{i}$ according to its distance: $w_{i} = \frac{1}{d(x', x_{i})^2}$. As a result, training examples that are located far away from $z$ have a weaker impact on the classification compared to those that are located close to $z$. Using the distance-weighted voting scheme, the class label can be determined as follows: $$\text{Distance-Weighted Voting: } y' = \underset{v}{\operatorname{argmax}} \sum_{(x_{i}, y_{i}) \in D_{z}} w_{i} \times I(v = y_{i})$$

## Characteristics of Nearest-Neighbour Classifiers

- Nearest neighbour classification is part of a more general technique known as instance-based learning, which uses specific training instances to make predictions without having to maintain an abstraction (model) derived from the data. Instance-based learning algorithms require a proximity measure to determine the similarity or distance between instances, and a classification function that returns the predicted class of a test instance based on its proximity to other instances.
- Lazy learners such as nearest-neighbour classifiers do not require model building. However, classifying a test example can be quite expensive because we need to compute the proximity values individually between the test example and every training example. In contrast, eager learners often spend the bulk of their computing resources on model building. Once a model has been built, classifying a test example is quite fast.
- Nearest neighbour classifiers make their predictions based on local information, whereas decision tree and rule-based classifiers attempt to find a global model that fits the entire input space. Because the classification decisions are made locally, nearest neighbour classifiers (with small values of k) are quite susceptible to noise.
- Nearest neighbour classifiers can produce arbitrarily shaped decision boundaries. Such boundaries provide a more flexible model representation compared to decision tree and rule-based classifiers, which are often constrained to rectilinear decision boundaries. The decision boundaries of nearest neighbour classifiers also have high variability because they depend on the composition of the training examples. Increasing the number of nearest neighbours may reduce such variability.
- Nearest neighbour classifiers can produce wrong predictions unless an appropriate proximity measure and data pre-processing steps are taken. For example, suppose we want to classify a group of people based on attributes such as height and weight. The height variable has low variability, ranging from 1.5m to 1.85m, whereas the weight attribute may vary from 40kg to 150kg. If the scales of the attributes are not taken into consideration, the proximity measure may be dominated by differences in the weights of the people.
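The scaling issue in the last point can be demonstrated numerically. In this sketch (the three people and their measurements are invented; min-max scaling over the ranges quoted above is one common choice of pre-processing), the raw Euclidean distance says a much taller person is the closest match, while the scaled distance correctly favours the person of nearly identical height:

```python
import math

# Min-max scale height (m) and weight (kg) to [0, 1] using the ranges
# quoted in the text, so neither attribute dominates the distance.
RANGES = [(1.50, 1.85), (40.0, 150.0)]

def scale(person):
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(person, RANGES))

a, b, c = (1.50, 95.0), (1.85, 96.0), (1.51, 101.0)

# Raw distances: weight dominates, so the much taller b looks closest to a.
print(math.dist(a, b) < math.dist(a, c))                              # True

# Scaled distances: c, with nearly the same height, is now closest to a.
print(math.dist(scale(a), scale(b)) < math.dist(scale(a), scale(c)))  # False
```

Standardization (subtracting the mean and dividing by the standard deviation of each attribute) is an alternative pre-processing step that achieves the same goal.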