Partial k-means to avoid outliers, mathematical programming formulations, complexity results
Abstract
A well-known difficulty with Min-Sum-of-Square Clustering (MSSC, the celebrated $k$-means problem) is handling outliers. In this paper, we propose a partial clustering variant, termed PMSSC, in which a fixed number of points are removed as outliers. We solve PMSSC with Integer Programming formulations and study complexity results that extend those known for MSSC. PMSSC is NP-hard in Euclidean space when the dimension or the number of clusters is greater than $2$. Finally, we study the one-dimensional case: unweighted PMSSC is polynomial in that case and is solved with a dynamic programming algorithm, which extends the interval clustering optimality property of MSSC. This result also holds for unweighted $k$-medoids with outliers. A weaker optimality property holds for weighted PMSSC, but whether weighted PMSSC is NP-hard in dimension one remains an open question.
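To make the one-dimensional unweighted case concrete, here is a minimal Python sketch of a dynamic program over sorted points. It is not taken from the paper. It assumes that the interval clustering property mentioned in the abstract means an optimal solution can be found in which every cluster is a contiguous run of the sorted points and the discarded points are the remaining ones. The function name `partial_kmeans_1d`, the state layout, and the straightforward $O(k \cdot m \cdot n^2)$ recurrence are illustrative choices, and the authors' algorithm may differ.

```python
from itertools import accumulate
import math


def partial_kmeans_1d(points, k, m):
    """Cost of the best split of sorted 1D points into k contiguous
    clusters after discarding exactly m points (illustrative sketch)."""
    xs = sorted(points)
    n = len(xs)
    if k < 1 or m < 0 or n - m < k:
        raise ValueError("need k >= 1, m >= 0 and at least k points kept")

    # Prefix sums of x and x^2 give the cost of any sorted run in O(1).
    s1 = [0.0] + list(accumulate(xs))
    s2 = [0.0] + list(accumulate(x * x for x in xs))

    def run_cost(i, j):
        """Sum of squared deviations from the mean of xs[i:j], with j > i."""
        total = s1[j] - s1[i]
        # Clamp at zero to absorb floating-point cancellation.
        return max(0.0, (s2[j] - s2[i]) - total * total / (j - i))

    inf = math.inf
    # dp[c][o][j]: best cost for the first j sorted points,
    # using exactly c non-empty clusters and exactly o discarded points.
    dp = [[[inf] * (n + 1) for _ in range(m + 1)] for _ in range(k + 1)]
    dp[0][0][0] = 0.0
    for c in range(k + 1):
        for o in range(m + 1):
            for j in range(1, n + 1):
                best = inf
                if o > 0:
                    # Case 1: the j-th sorted point is discarded.
                    best = dp[c][o - 1][j - 1]
                if c > 0:
                    # Case 2: the last cluster is the run xs[i:j].
                    for i in range(j):
                        prev = dp[c - 1][o][i]
                        if prev < inf:
                            best = min(best, prev + run_cost(i, j))
                dp[c][o][j] = best
    return dp[k][m][n]
```

For example, `partial_kmeans_1d([1, 2, 3, 100], k=1, m=1)` discards the point 100 and returns the cost 2.0 of the cluster {1, 2, 3}. The sketch returns only the optimal cost. Recovering the clusters and the discarded points would require storing the argmin of each transition.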
- Publication: arXiv e-prints
- Pub Date: February 2023
- DOI: 10.48550/arXiv.2302.05644
- arXiv: arXiv:2302.05644
- Bibcode: 2023arXiv230205644D
- Keywords: Computer Science - Computational Complexity; Computer Science - Computational Geometry; Computer Science - Discrete Mathematics
- E-Print: doi:10.1007/978-3-031-34020-8_22