Efficient Sparse Spherical k-Means for Document Clustering
Abstract
Spherical k-Means is frequently used to cluster document collections because it performs reasonably well in many settings and is computationally efficient. However, the time complexity increases linearly with the number of clusters k, which limits the suitability of the algorithm for larger values of k depending on the size of the collection. Optimizations targeted at the Euclidean k-Means algorithm largely do not apply because the cosine distance is not a metric. We therefore propose an efficient indexing structure to improve the scalability of Spherical k-Means with respect to k. Our approach exploits the sparsity of the input vectors and the convergence behavior of k-Means to reduce the number of comparisons on each iteration significantly.
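To make the role of sparsity concrete, the following is a minimal sketch of plain Spherical k-Means over sparse vectors, using a simple inverted index over centroid terms so that a document is only compared against centroids that share at least one term with it. It is not the indexing structure proposed in the paper, which the abstract does not describe in detail. The function names (normalize, build_index, assign, spherical_kmeans), the dict-based vector representation, and the naive initialization from the first k documents are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def build_index(centroids):
    """Build an inverted index: term -> list of (cluster id, centroid weight)."""
    index = defaultdict(list)
    for cid, centroid in enumerate(centroids):
        for term, weight in centroid.items():
            index[term].append((cid, weight))
    return index


def assign(doc, index, k):
    """Return the cluster with the highest cosine similarity to doc.

    Only centroids sharing a term with the document are touched; for
    unit-length vectors the dot product equals the cosine similarity.
    Ties are resolved in favor of the lowest cluster id.
    """
    scores = [0.0] * k
    for term, weight in doc.items():
        for cid, cweight in index.get(term, ()):
            scores[cid] += weight * cweight
    return max(range(k), key=scores.__getitem__)


def spherical_kmeans(docs, k, iterations=10):
    """Cluster sparse documents; returns one cluster label per document."""
    docs = [normalize(d) for d in docs]
    centroids = [dict(d) for d in docs[:k]]  # assumption: naive initialization
    labels = [-1] * len(docs)
    for _ in range(iterations):
        index = build_index(centroids)
        new_labels = [assign(d, index, k) for d in docs]
        if new_labels == labels:  # converged: no assignment changed
            break
        labels = new_labels
        # Recompute each centroid as the normalized sum of its members.
        sums = [defaultdict(float) for _ in range(k)]
        for d, c in zip(docs, labels):
            for t, w in d.items():
                sums[c][t] += w
        # An empty cluster keeps its previous centroid.
        centroids = [normalize(s) if s else centroids[c]
                     for c, s in enumerate(sums)]
    return labels


docs = [{"a": 1, "b": 1}, {"x": 1, "y": 1}, {"a": 2}, {"y": 3}]
labels = spherical_kmeans(docs, k=2)
```

In this sketch the work per document depends on the length of the posting lists for its terms rather than directly on k, although dense centroids can still make those lists long.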
 Publication:

arXiv e-prints
 Pub Date:
 July 2021
 arXiv:
 arXiv:2108.00895
 Bibcode:
 2021arXiv210800895K
 Keywords:

 Computer Science - Machine Learning;
 Computer Science - Artificial Intelligence;
 Computer Science - Data Structures and Algorithms
 E-Print:
 ACM DocEng 2021