Monday, February 06, 2006

Truly Unsupervised Clustering Algorithm

The first problem is the clustering algorithm itself. The most widely used algorithms need to be told how many clusters they are searching for and this means the algorithm is not entirely unsupervised.

Refinements

The classic algorithm is the k-means and many are variants of this simple procedure. Some of these are simply refinements which effectively lower the margin of error on the results. Some refinements have the side effect of producing correct clusters in a robust manner, even if the initial number of clusters sought is set to the wrong value. I confirmed this positive behaviour in the Fayyad set of refinements to the basic k-means algorithm.

Beneficial Noise

Another surprising result was that the accuracy of clustering results was significantly improved if the, possibly multi-dimensional, data was extended by another dimension with random values. The reasoning behind this effect is that the random noise actually smooths the objective functions, down which the clustering algorithm descends.

I'll investigate this affect again to confirm its validity. I've not seen others refer to it, despite a quick search at the time.

Truly Unsupervised Clustering Algorithms

However the academic publications have started mentioning other methods which don't require the number of clusters to be preset, and i'll have a look at these.