Monday, 20 March 2017

K-means clustering

The problem of clustering is quite general: given $m$ observations or measurements in $n$-dimensional space, how does one identify $k$ clusters (classes, groups, types) of measurements and their centroids (representatives)?

The k-means method is extremely simple, rather robust, and widely used in its numerous variants. It is essentially very similar (but not identical) to Lloyd's algorithm (also known as Voronoi relaxation or iteration in computer science).


Let's use the following indices: $i$ counts measurements, $i \in [0, m-1]$; $j$ counts dimensions, $j \in [0, n-1]$; $l$ counts clusters, $l \in [0, k-1]$.

Each measurement in $n$-dimensional space is represented by a vector $x_i = \{x_{i, 0}, \dots x_{i, n-1}\}$, where index $i$ is counting different measurements ($i = 0, \dots, m-1$). The algorithm can be summarized as:

1. Choose randomly $k$ measurements as the initial cluster centers $c_0, ..., c_{k-1}$. Obviously, each cluster center is also an $n$-dimensional vector.

2. Compute Euclidean distance $D_{i, l}$ between every measurement $x_i$ and every cluster center $c_l$:
$$D_{i, l} = \sqrt{\sum_{j=0}^{n-1} (x_{i, j} - c_{l, j})^2}.$$
3. Assign every measurement $x_i$ to the cluster represented by the closest cluster center $c_l$.

4. Now compute new cluster centers by simply averaging all the measurements in each cluster.

5. Go back to 2. and keep iterating until none of the measurements changes its cluster in two successive iterations.
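For readers without IDL (which is used for the actual implementation later in this post), the five steps above can be sketched in Python/NumPy. `kmeans_basic` is a hypothetical helper name of mine, not the IDL KMEANS function discussed below; `x` holds the $m$ measurements as rows.

```python
import numpy as np

def kmeans_basic(x, k, rng=None):
    """Minimal k-means; x has shape (m, n): m measurements in n dimensions."""
    rng = np.random.default_rng(rng)
    m = x.shape[0]
    # 1. choose k measurements at random as the initial cluster centers
    centers = x[rng.choice(m, size=k, replace=False)]
    labels = np.full(m, -1)
    while True:
        # 2. Euclidean distances D[i, l] between measurements and centers
        d = np.sqrt(((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
        # 3. assign every measurement to the closest cluster center
        new_labels = d.argmin(axis=1)
        # 5. stop when no measurement changed its cluster
        if np.array_equal(new_labels, labels):
            return labels, centers
        labels = new_labels
        # 4. new centers are the means of the measurements in each cluster
        for l in range(k):
            if np.any(labels == l):
                centers[l] = x[labels == l].mean(axis=0)
```

Empty clusters (possible with an unlucky random start) simply keep their previous center here; a production implementation would re-seed them.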

This procedure is initiated randomly, so the result will be slightly different in every run. The result of the clustering (and the actual number of necessary iterations) depends significantly on the initial choice of cluster centers. The easiest way to improve the algorithm is to improve this initial choice, i.e. to alter only step 1 and then iterate as before. There are two simple alternatives for the initialization.

Furthest point

In the furthest-point variant, only the very first cluster center $c_0$ is chosen randomly. Then the distances of all other measurements from $c_0$ are computed and the most distant measurement is selected as $c_1$. The next center is selected as the most distant from both $c_0$ and $c_1$, and so on until all initial cluster centers are set. An issue with this method of initialization is that it is very sensitive to outliers, especially those on the edges of the measurement space.
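The furthest-point initialization can be sketched in the same NumPy style (`init_furthest` is again my own helper name):

```python
import numpy as np

def init_furthest(x, k, rng=None):
    """Furthest-point seeding; x has shape (m, n), returns k centers."""
    rng = np.random.default_rng(rng)
    # only c_0 is random
    centers = [x[rng.integers(x.shape[0])]]
    for _ in range(k - 1):
        # distance of every measurement to its nearest already-chosen center
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        # the most distant measurement becomes the next center
        centers.append(x[d.argmax()])
    return np.array(centers)
```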

k-means++

Arthur and Vassilvitskii (2007) proposed an alternative method for the initial randomization, known as k-means++. The first cluster center $c_0$ is purely random. Then the distances of all other measurements from $c_0$ are computed as before. Now these distances are used to define a probability distribution,
$$p_i = \frac{D_{i, 0}^2}{\sum_{i'=0}^{m-1} D_{i', 0}^2},$$
and the next cluster center is chosen again randomly, but now using the probability distribution $p_i$ (instead of a uniform one). The idea is again to draw new centers so that they are far from those previously chosen. The difference is that now the new centers will not necessarily lie at the very edges of the measurement space, so it is less likely that isolated outliers will be picked.
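The k-means++ seeding can be sketched similarly; the only change from the furthest-point variant is that the next center is drawn from the distribution $p_i$ rather than taken deterministically (`init_kmeanspp` is a hypothetical helper name):

```python
import numpy as np

def init_kmeanspp(x, k, rng=None):
    """k-means++ seeding; x has shape (m, n), returns k centers."""
    rng = np.random.default_rng(rng)
    # c_0 is purely random
    centers = [x[rng.integers(x.shape[0])]]
    for _ in range(k - 1):
        # squared distance to the nearest already-chosen center
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # probability distribution p_i proportional to the squared distance
        p = d2 / d2.sum()
        centers.append(x[rng.choice(x.shape[0], p=p)])
    return np.array(centers)
```

Note that measurements already chosen as centers have $D^2 = 0$ and therefore zero probability of being drawn again.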

Example of initialization

I constructed simple test data with 4 clearly separated sets in $n=2$ dimensions and then ran the three variants of the k-means initialization assuming $k=4$. The result is shown in the figure below. The first selected cluster center is yellow.

In this trivial case, the 4 clusters are correctly identified already in the first iteration for the furthest-point and k-means++ initializations. With the random initialization, it takes 4 iteration steps. The following figure shows how the solution propagates step by step.


Silhouette

Rousseeuw (1987) proposed the concept of the silhouette as an estimate of how consistent the clusters are and how appropriate the clustering is for a given dataset. He introduced a measure of dissimilarity as the mean distance between a measurement and all measurements in one cluster. Let's label with $a_i$ the dissimilarity between a measurement and the cluster to which this measurement is assigned, and with $b_i$ the smallest dissimilarity between the same measurement and any other cluster. Then the silhouette is defined for every measurement as: $$s_i = \frac{b_i - a_i}{\mathrm{max}(a_i, b_i)}.$$ The value of the silhouette averaged over the entire dataset tells us how well the data have been clustered. If the value is close to 1, the clustering is well done. If it is close to 0, the clusters found by the method are not appropriate for the given data.
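A direct NumPy transcription of this definition (assuming every cluster has at least two members, so $a_i$ is well defined; `silhouette` is my own helper name):

```python
import numpy as np

def silhouette(x, labels):
    """Per-measurement silhouette s_i; x has shape (m, n), labels shape (m,)."""
    m = x.shape[0]
    s = np.zeros(m)
    for i in range(m):
        d = np.linalg.norm(x - x[i], axis=1)
        same = labels == labels[i]
        # a_i: mean distance to the other members of its own cluster
        a = d[same & (np.arange(m) != i)].mean()
        # b_i: smallest mean distance to the members of any other cluster
        b = min(d[labels == l].mean() for l in set(labels) if l != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

For two tight, well-separated clusters this returns values close to 1 for every measurement, as expected.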

IDL implementation

IDL has its own implementation of k-means (CLUSTER; in earlier versions it was called KMEANS). To be able to experiment and play with the method, I wrote my own KMEANS function that includes the three initializations mentioned above. The result of the function is a structure with three tags: clusters, frequencies (the percentage of measurements assigned to each cluster) and silhouette (for every measurement).

IDL> n = 100   ; Number of points
IDL> m = 2     ; Number of dimensions, i.e. length of vectors
IDL> k = 4     ; Number of clusters

IDL> x = REFORM(RANDOMU(10, m*n)/4., [m, n])
IDL> x[0, 0:24] += 0.2
IDL> x[1, 0:24] += 0.2
IDL> x[0, 25:49] += 0.8
IDL> x[1, 25:49] += 0.8
IDL> x[0, 50:74] += 0.2
IDL> x[1, 50:74] += 0.8
IDL> x[0, 75:99] += 0.8
IDL> x[1, 75:99] += 0.2
IDL> y = KMEANS(x, k = k, initial = 'kmeanspp') 

; Show the initial data points
IDL> PLOT, x[0, *], x[1, *], psym = 4

; Show data points which are assigned to the cluster 2
IDL> id = WHERE(y.clusters EQ 2)
IDL> OPLOT, x[0, id], x[1, id], psym = -1


References:
Rousseeuw (1987), Silhouettes: a graphical aid to the interpretation and validation of cluster analysis
Su & Dy (2004), A Deterministic Method for Initializing K-means Clustering
Arthur & Vassilvitskii (2007), k-means++: The Advantages of Careful Seeding
Bahmani et al (2013), Scalable K-Means++

Application in solar physics:
Viticchié & Sánchez Almeida (2011), Asymmetries of the Stokes V profiles observed by HINODE SOT/SP in the quiet Sun
