# K Means Algorithm

The K-means clustering technique is simple, We first choose $K$ initial centroids, where K is a user specified parameter, namely, the number of clusters desired. Each point is then assigned to to the closest centroid, and each collection of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the cluster. We repeat the assignment and update steps until no points changes cluster, or equivalently, until the centroids remain the same.

## Algorithm

## Illustration

The K-means algorithm is illustrated in the following figure.

In the first step as shown in the figure, points are assigned to the initial centroids, which are all in the larger group of points. After the points are assigned to a centroid, the centroid is then updated. Again, the figure for each step shows the centroid at the begining of the step and the assignment of points to those centroids. In the second step, points are assigned to the updated centroids, and the centroids are updated again. In steps 2,3 and 4, which are shown in the figure, two of the centroids move to the small groups of points at the bottom of the figures.

For some combinations of proximity functions and types of centroids, K-means always converges to a solution.

## Assigning points to the closest centroid

To assign a point to the closest centroid, we need a proximity measure that quantifies the notion of “closest” for the specific data under consideration. Euclidean ($L_2$) distance is often used for data points in Euclidean space, while cosine similarity is more appropriate for documents. However, there may be several types of proximity measures that are appropriate for a given type of data. For example, Manhattan ($L_1$) distance can be used for Euclidean data, while the Jaccard measure is often employed for documents.

Usually, the similarity measures used for K-means are relatively simple since the algorithm repeatedly calculates the similarity of each point to each centroid. In some cases, however, such as when the data is in low-dimensional Euclidean space, it is possible to avoid computing many of the similarities, thus significantly speeding up the K-means algorithm.

## Centroids and Objective functions

Step 4 of the K-means algorithm was stated rather generally as “recompute the centroid of each cluster,” since the centroid can vary, depending on the proximity measure for the data and the goal of the clustering. The goal of the clustering is typically expressed by an objective function that depends on the proximities of the points to one another or to the cluster centroids; e.g., minimize the squared distance of each point to its closest centroid.

## Data in Euclidean Space

Consider data whose proximity measure is Euclidean distance. For our objective function, which measures the quality of a clustering, we use the sum of the squared error (**SSE**), which is also known as scatter. In other words, we calculate the error of each data point, i.e., its Euclidean distance to the closest centroid, and then compute the total sum of the squared errors. Given two different sets of clusters that are produced by two different runs of K-means, we prefer the one with the smallest squared error since this means that the prototypes (centroids) of this clustering are a better representation of the points in their cluster. SSE can be formally defined as follows

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_{i}} dist(c_{i}, x)^2$$

where $c_i$ is the $i^{th}$ cluster, $x$ is a point, and dist is the standard eucledian distance ($L_2$) distance between two objects in euclidean space. Centroid that minimizes the SSE of the cluster is the mean. The centroid of the $i^{th}$ cluster is defined as

$$c_i = \frac{1}{m_i} \sum_{x \in C_i} x$$

where $c_i$ is the $i^{th}$ cluster, $x$ is a point and $m_i$ is the number of objects in $i^{th}$ cluster.

## Document data

In document data we consider the cosine similarity measure. Here we assume that the document data is represented as a document-term matrix. Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid; this quantity is known as the cohesion of the cluster. For this objective it can be shown that the cluster centroid is, as for Euclidean data, the mean. The analogous quantity to the total SSE is the total cohesion, which is given by Equation

$$Total\ Cohesion = \sum_{i=1}^{k} \sum_{x \in C_i} cosine(x, c_i)$$

where $C_i$ is the $i^{th}$ cluster, x is a point and $c_i$ is the centroid of the $i^{th}$ cluster.

## Choosing initial centroids

When random initialization of centroids is used, different runs of K-means typically produce different total SSEs. Choosing the proper initial centroids is the key step of the basic K-means procedure. A common approach is to choose the initial centroids randomly, but the resulting clusters are often poor.

Because of the problems with using randomly selected initial centroids, which even repeated runs may not overcome, other techniques are often employed for initialization. One effective approach is to take a sample of points and cluster them using a hierarchical clustering technique. K clusters are extracted from the hierarchical clustering, and the centroids of those clusters are used as the initial centroids. This approach often works well, but is practical only if (1) the sample is relatively small, e.g., a few hundred to a few thousand (hierarchical clustering is expensive), and (2) K is relatively small compared to the sample size.

The following procedure is another approach to selecting initial centroids. Select the first point at random or take the centroid of all points. Then, for each successive initial centroid, select the point that is farthest from any of the initial centroids already selected. In this way, we obtain a set of initial centroids that is guaranteed to be not only randomly selected but also well separated. Unfortunately, such an approach can select outliers, rather than points in dense regions (clusters). Also, it is expensive to compute the farthest point from the current set of initial centroids. To overcome these problems, this approach is often applied to a sample of the points. Since outliers are rare, they tend not to show up in a random sample. In contrast, points from every dense region are likely to be included unless the sample size is very small. Also, the computation involved in finding the initial centroids is greatly reduced because the sample size is typically much smaller than the number of points.

## Space and Time complexity

The space requirements for K-means are modest because only the data points and centroids are stored. $$Storage = O \left( (M + K) \times n \right)$$ where m is the number of points and n is the number of attributes. The time requirements for K-means are also modest—basically linear in the number of data points. $$Time = O\left( I \times K \times m \times n \right)$$ where I is the number of iterations required for convergence.

## Additional Issues

**Handling Empty Clusters**

One of the problems with the basic K-means algorithm given earlier is that empty clusters can be obtained if no points are allocated to a cluster during the assignment step. If this happens, then a strategy is needed to choose a replacement centroid, since otherwise, the squared error will be larger than necessary. One approach is to choose the point that is farthest away from any current centroid. If nothing else, this eliminates the point that currently contributes most to the total squared error. Another approach is to choose the replacement centroid from the cluster that has the highest SSE. This will typically split the cluster and reduce the overall SSE of the clustering. If there are several empty clusters, then this process can be repeated several times.

**Outliers**

When the squared error criterion is used, outliers can unduly influence the clusters that are found. In particular, when outliers are present, the resulting cluster centroids (prototypes) may not be as representative as they otherwise would be and thus, the SSE will be higher as well. Because of this, it is often useful to discover outliers and eliminate them beforehand. It is important, however, to appreciate that there are certain clustering applications for which outliers should not be eliminated. When clustering is used for data compression, every point must be clustered, and in some cases, such as financial analysis, apparent outliers, e.g., unusually profitable customers, can be the most interesting points.

## Reducing the SSE with postprocessing

An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K. However, in many cases, we would like to improve the SSE, but don’t want to increase the number of clusters. This is often possible because K-means typically converges to a local minimum. Various techniques are used to “fix up” the resulting clusters in order to produce a clustering that has lower SSE. The strategy is to focus on individual clusters since the total SSE is simply the sum of the SSE contributed by each cluster. We can change the total SSE by performing various operations on the clusters, such as splitting or merging clusters. One commonly used approach is to use alternate cluster splitting and merging phases. During a splitting phase, clusters are divided, while during a merging phase, clusters are combined. In this way, it is often possible to escape local SSE minima and still produce a clustering solution with the desired number of clusters. The following are some techniques used in the splitting and merging phases.

Two strategies that decrease the total SSE by increasing the number of clusters are the following:

- Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster with the largest standard deviation for one particular attribute.
- Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen. We can easily determine this if we keep track of the SSE contributed by each point. Another approach is to choose randomly from all points or from the points with the highest SSE.

Two strategies that decrease the number of clusters, while trying to minimize the increase in total SSE, are the following:

- Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster and reassigning the points to other clusters. Ideally, the cluster that is dispersed should be the one that increases the total SSE the least.
- Merge two clusters: The clusters with the closest centroids are typically chosen, although another, perhaps better, approach is to merge the two clusters that result in the smallest increase in total SSE.

## Bisecting K-means

The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.