Compressive Kmeans
Abstract
The LloydMax algorithm is a classical approach to perform Kmeans clustering. Unfortunately, its cost becomes prohibitive as the training dataset grows large. We propose a compressive version of Kmeans (CKM), that estimates cluster centers from a sketch, i.e. from a drastically compressed representation of the training dataset. We demonstrate empirically that CKM performs similarly to LloydMax, for a sketch size proportional to the number of centroids times the ambient dimension, and independent of the size of the original dataset. Given the sketch, the computational complexity of CKM is also independent of the size of the dataset. Unlike LloydMax which requires several replicates, we further demonstrate that CKM is almost insensitive to initialization. For a large dataset of 10^7 data points, we show that CKM can run two orders of magnitude faster than five replicates of LloydMax, with similar clustering performance on artificial data. Finally, CKM achieves lower classification errors on handwritten digits classification.
 Publication:

arXiv eprints
 Pub Date:
 October 2016
 arXiv:
 arXiv:1610.08738
 Bibcode:
 2016arXiv161008738K
 Keywords:

 Computer Science  Machine Learning;
 Statistics  Machine Learning