On the optimality of kernels for high-dimensional clustering
Abstract
This paper studies the optimality of kernel methods in high-dimensional data clustering. Recent works have studied the large-sample performance of kernel clustering in the high-dimensional regime, where Euclidean distance becomes less informative. However, it is unknown whether popular methods, such as kernel k-means, are optimal in this regime. We consider the problem of high-dimensional Gaussian clustering and show that, with the exponential kernel function, the sufficient conditions for partial recovery of clusters using the NP-hard kernel k-means objective match the known information-theoretic limit up to a factor of $\sqrt{2}$ for large $k$. They also exactly match the known upper bounds for the non-kernel setting. We also show that a semi-definite relaxation of the kernel k-means procedure matches, up to constant factors, the spectral threshold, below which no polynomial-time algorithm is known to succeed. This is the first work that provides such optimality guarantees for kernel k-means as well as its convex relaxation. Our proofs demonstrate the utility of the lesser-known polynomial concentration results for random variables with exponentially decaying tails in a higher-order analysis of kernel methods.
 Publication:

arXiv e-prints
 Pub Date:
 December 2019
 arXiv:
 arXiv:1912.00458
 Bibcode:
 2019arXiv191200458C
 Keywords:

 Statistics - Machine Learning;
 Computer Science - Machine Learning