Monday, 20 March 2017

K-means clustering

The problem of clustering is a rather general one: If one has $m$ observations or measurements in $n$ dimensional space, how to identify $k$ clusters (classes, groups, types) of measurements and their centroids (representatives)?

The k-means method is extremely simple, rather robust and widely used in it numerous variants. It is essentially very similar (but not identical) to Lloyd's algorithm (aka Voronoi relaxation or interpolation used in computer sciences).


Let's use the following indices: $i$ counts measurements, $i \in [0, m-1]$; $j$ counts dimensions, $j \in [0, n-1]$; $l$ counts clusters, $l \in [0, k-1]$.

Each measurement in $n$-dimensional space is represented by a vector $x_i = \{x_{i, 0}, \dots x_{i, n-1}\}$, where index $i$ is counting different measurements ($i = 0, \dots, m-1$). The algorithm can be summarized as:

1. Choose randomly $k$ measurements as initial cluster centers: $c_0, ..., c_{k-1}$. Obviously, each of the clusters is also $n$-dimensional vector.

2. Compute Euclidean distance $D_{i, l}$ between every measurement $x_i$ and every cluster center $c_l$:
$$D_{i, l} = \sqrt{\sum_{j=0}^{n-1} (x_{i, j} - c_{l, j})^2}.$$
3. Assign every measurement $x_i$ to the cluster represented by the closest cluster center $c_l$.

4. Now compute new cluster centers by simply averaging all the measurements in each cluster.

5. Go back to 2. and keep iterating until none of the measurements changes its cluster in two successive iterations.

This procedure is initiated randomly and the result will be slightly different in every run. The result of clustering (and the actual number of necessary iteration) significantly depends on the initial choice of cluster centers. The easiest way to improve the algorithm is to improve the initial choice, i.e. to alter only the step 1. and then to iterate as before. There are to simple alternatives for the initialization.