A Computational Theory and Semi-Supervised Algorithm for Clustering
Abstract
A computational theory for clustering and a semi-supervised clustering algorithm is presented. Clustering is defined to be the obtainment of groupings of data such that each group contains no anomalies with respect to a chosen grouping principle and measure; all other examples are considered to be fringe points, isolated anomalies, anomalous clusters or unknown clusters. More precisely, after appropriate modelling under the assumption of uniform random distribution, any example whose expectation of occurrence is <1 with respect to a group is considered an anomaly; otherwise it is assigned a membership of that group. Thus, clustering is conceived as the dual of anomaly detection. The representation of data is taken to be the Euclidean distance of a point to a cluster median. This is due to the robustness properties of the median to outliers, its approximate location of centrality and so that decision boundaries are general purpose. The kernel of the clustering method is Mohammad's anomaly detection algorithm, resulting in a parameter-free, fast, and efficient clustering algorithm. Acknowledging that clustering is an interactive and iterative process, the algorithm relies on a small fraction of known relationships between examples. These relationships serve as seeds to define the user's objectives and guide the clustering process. The algorithm then expands the clusters accordingly, leaving the remaining examples for exploration and subsequent iterations. Results are presented on synthetic and realworld data sets, demonstrating the advantages over the most widely used clustering methods.
- Publication:
-
arXiv e-prints
- Pub Date:
- June 2023
- DOI:
- arXiv:
- arXiv:2306.06974
- Bibcode:
- 2023arXiv230606974M
- Keywords:
-
- Computer Science - Machine Learning