Clustering with Queries under Semi-Random Noise
Abstract
The seminal paper by Mazumdar and Saha \cite{MS17a} introduced an extensive line of work on clustering with noisy queries. Yet, despite significant progress on the problem, the proposed methods depend crucially on knowing the exact probabilities of errors of the underlying fully-random oracle. In this work, we develop robust learning methods that tolerate general semi-random noise, obtaining qualitatively the same guarantees as the best possible methods in the fully-random model. More specifically, given a set of $n$ points with an unknown underlying partition, we are allowed to query pairs of points $u,v$ to check if they are in the same cluster, but with probability $p$, the answer may be adversarially chosen. We show that information theoretically $O\left(\frac{nk \log n}{(1-2p)^2}\right)$ queries suffice to learn any cluster of sufficiently large size. Our main result is a computationally efficient algorithm that can identify large clusters with $O\left(\frac{nk \log n}{(1-2p)^2}\right) + \text{poly}\left(\log n, k, \frac{1}{1-2p} \right)$ queries, matching the guarantees of the best known algorithms in the fully-random model. As a corollary of our approach, we develop the first parameter-free algorithm for the fully-random model, answering an open question by \cite{MS17a}.
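The semi-random query model described above can be illustrated with a minimal simulation sketch. All names here (`make_semi_random_oracle`, the `adversary` callback) are hypothetical and not from the paper; the sketch only shows the model: each pairwise query answers truthfully with probability $1-p$, and with probability $p$ the answer is controlled by an adversary, who may even answer correctly (which is what makes the noise semi-random rather than fully random).

```python
import random

def make_semi_random_oracle(cluster_of, p, adversary=None):
    """Hypothetical simulator of the semi-random noisy oracle.

    cluster_of: dict mapping each point to its (hidden) cluster label.
    p: probability that a query's answer is handed to the adversary.
    adversary: optional callback (u, v, truth) -> bool; by default it
        flips the answer, which recovers the fully-random noise model.
    """
    def query(u, v):
        truth = cluster_of[u] == cluster_of[v]
        if random.random() < 1 - p:
            return truth  # truthful with probability 1 - p
        # Adversarial slot: a semi-random adversary may answer
        # arbitrarily here, including truthfully on some queries.
        return adversary(u, v, truth) if adversary else not truth
    return query

# Example: three points, two clusters {0, 1} and {2}.
cluster_of = {0: 0, 1: 0, 2: 1}
oracle = make_semi_random_oracle(cluster_of, p=0.3)
answer = oracle(0, 1)  # True with probability at least 0.7
```

Note that algorithms designed for the fully-random model can fail here precisely because the adversary need not behave like an unbiased coin; nothing in this sketch forces the adversarial answers to follow any fixed error distribution.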
 Publication:

arXiv e-prints
 Pub Date:
 June 2022
 arXiv:
 arXiv:2206.04583
 Bibcode:
 2022arXiv220604583D
 Keywords:

 Computer Science - Machine Learning;
 Computer Science - Data Structures and Algorithms
 E-Print:
 Accepted for presentation at the Conference on Learning Theory (COLT) 2022