Clustering With Side Information: From a Probabilistic Model to a Deterministic Algorithm
Abstract
In this paper, we propose a model-based clustering method (TVClust) that robustly incorporates noisy side information as soft constraints and seeks a consensus between the side information and the observed data. Our method is based on a nonparametric Bayesian hierarchical model that combines a probabilistic model for the data instances with one for the side information. An efficient Gibbs sampling algorithm is proposed for posterior inference. Using the small-variance asymptotics of our probabilistic model, we then derive a new deterministic clustering algorithm (RDP-means). It can be viewed as an extension of K-means that allows for the inclusion of side information and has the additional property that the number of clusters does not need to be specified a priori. Empirical studies compare our work with many constrained clustering algorithms from the literature, on a variety of data sets and under a variety of conditions, such as noisy side information and erroneous values of k. In these experiments, both our probabilistic and deterministic approaches perform strongly under these conditions compared with other algorithms in the literature.
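For orientation, the idea of a K-means extension that does not fix the number of clusters in advance can be sketched as follows. This is a minimal sketch of the generic DP-means scheme, in which a point whose squared distance to every centroid exceeds a penalty opens a new cluster. It is not the paper's RDP-means: the abstract does not state how side information enters the objective, so that part is omitted. The function name `dp_means` and the penalty parameter `lam` are illustrative assumptions, not names taken from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def dp_means(points, lam, max_iter=100):
    """DP-means-style clustering (illustrative sketch, no side information).

    `lam` is an assumed penalty: a point whose squared distance to every
    current centroid exceeds `lam` starts a new cluster at its own location.
    """
    centroids = [mean(points)]      # start with one cluster at the global mean
    labels = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [squared_distance(p, c) for c in centroids]
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] > lam:   # too far from everything: open a cluster
                centroids.append(tuple(p))
                best = len(centroids) - 1
            if labels[i] != best:
                labels[i] = best
                changed = True
        # Recompute centroids, dropping clusters that lost all their points.
        members = {}
        for lbl, p in zip(labels, points):
            members.setdefault(lbl, []).append(p)
        kept = sorted(members)
        remap = {old: new for new, old in enumerate(kept)}
        centroids = [mean(members[old]) for old in kept]
        labels = [remap[lbl] for lbl in labels]
        if not changed:
            break
    return labels, centroids


# Two well-separated groups; the number of clusters is never passed in.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = dp_means(points, lam=10.0)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The penalty plays the role that k plays in K-means: a larger `lam` yields fewer clusters, and a sufficiently large value leaves all points in a single cluster.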
- Publication: arXiv e-prints
- Pub Date: August 2015
- DOI: 10.48550/arXiv.1508.06235
- arXiv: arXiv:1508.06235
- Bibcode: 2015arXiv150806235K
- Keywords: Statistics - Machine Learning; Computer Science - Artificial Intelligence; Computer Science - Machine Learning; Statistics - Computation