On Data-Independent Properties for Density-Based Dissimilarity Measures in Hybrid Clustering

doi:10.48550/arXiv.1609.06533

Hybrid clustering combines partitional and hierarchical clustering for computational effectiveness and versatility in cluster shape. In such clustering, a dissimilarity measure plays a crucial role in the hierarchical merging. The dissimilarity measure has great impact on the final clustering, and data-independent properties are needed to choose the right dissimilarity measure for the problem at hand. Properties for distance-based dissimilarity measures have been studied for decades, but properties for density-based dissimilarity measures have so far received little attention. Here, we propose six data-independent properties to evaluate density-based dissimilarity measures associated with hybrid clustering, regarding equality, orthogonality, symmetry, outlier and noise observations, and light-tailed models for heavy-tailed clusters. The significance of the properties is investigated, and we study some well-known dissimilarity measures based on Shannon entropy, misclassification rate, Bhattacharyya distance and Kullback-Leibler divergence with respect to the proposed properties. As none of them satisfy all the proposed properties, we introduce a new dissimilarity measure based on the Kullback-Leibler information and show that it satisfies all proposed properties. The effect of the proposed properties is also illustrated on several real and simulated data sets.
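As a hedged illustration of two of the properties named above (equality and symmetry), the following minimal sketch evaluates two of the well-known measures, Kullback-Leibler divergence and Bhattacharyya distance, between two clusters modeled as multivariate Gaussians. The Gaussian cluster model, the function names `kl_gauss` and `bhattacharyya_gauss`, and the example parameters are assumptions made for illustration only. This is not the new measure proposed in the paper, whose definition is not given in the abstract.

```python
import numpy as np


def kl_gauss(mu0, cov0, mu1, cov1):
    """Closed-form KL(N0 || N1) between two multivariate Gaussians."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    k = mu0.size
    diff = mu1 - mu0
    cov1_inv = np.linalg.inv(cov1)
    # slogdet returns (sign, log|det|); covariances are assumed positive definite
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (
        np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff - k + logdet1 - logdet0
    )


def bhattacharyya_gauss(mu0, cov0, mu1, cov1):
    """Closed-form Bhattacharyya distance between two multivariate Gaussians."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    cov = 0.5 * (cov0 + cov1)  # average covariance
    diff = mu1 - mu0
    _, logdet = np.linalg.slogdet(cov)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.125 * diff @ np.linalg.inv(cov) @ diff + 0.5 * (
        logdet - 0.5 * (logdet0 + logdet1)
    )


# Hypothetical cluster models: a tight cluster and a wider, shifted one.
mu_a, cov_a = [0.0, 0.0], np.eye(2)
mu_b, cov_b = [1.0, 0.0], 4.0 * np.eye(2)

kl_ab = kl_gauss(mu_a, cov_a, mu_b, cov_b)
kl_ba = kl_gauss(mu_b, cov_b, mu_a, cov_a)
bh_ab = bhattacharyya_gauss(mu_a, cov_a, mu_b, cov_b)
bh_ba = bhattacharyya_gauss(mu_b, cov_b, mu_a, cov_a)

print("KL(a||b) =", kl_ab, " KL(b||a) =", kl_ba)
print("Bhattacharyya(a,b) =", bh_ab, " Bhattacharyya(b,a) =", bh_ba)
print("KL(a||a) =", kl_gauss(mu_a, cov_a, mu_a, cov_a))
```

Under these assumptions, both measures are zero when the two densities are identical, but the Kullback-Leibler divergence changes when its arguments are swapped, whereas the Bhattacharyya distance does not. This is the kind of behavior a data-independent symmetry property is meant to capture.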

Publication: arXiv e-prints

Pub Date: September 2016

DOI: 10.48550/arXiv.1609.06533

arXiv: arXiv:1609.06533

Bibcode: 2016arXiv160906533M

Keywords: Statistics - Machine Learning; Computer Science - Machine Learning

E-Print: Møllersen, K., Dhar, S.S. and Godtliebsen, F. (2016) On Data-Independent Properties for Density-Based Dissimilarity Measures in Hybrid Clustering. Applied Mathematics, 7, 1674-1706.
