HEADSS: HiErArchical Data Splitting and Stitching for non-distributed clustering algorithms
Abstract
HEADSS (HiErArchical Data Splitting and Stitching) facilitates clustering at scale for algorithms that scale poorly with increasing data volume or that are intrinsically non-distributed. HEADSS automates data splitting and stitching, allowing repeatable handling, and removal, of edge effects. Implemented in conjunction with scikit's HDBSCAN, the code achieves orders-of-magnitude reductions in single-node memory requirements for both non-distributed and distributed implementations; the distributed implementation also reduces total run time by a similar order of magnitude while recovering comparable accuracy. HEADSS also establishes a hierarchy of features by using a subset of the clustering features to split the data.
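The split-and-stitch idea described above can be illustrated with a small, self-contained sketch. This is not the HEADSS API: the function names (`split_regions`, `gap_cluster`, `split_cluster_stitch`), the 1-D data, the gap-based stand-in for HDBSCAN, and the rule of keeping a cluster only when its mean falls inside a region's un-padded core are all assumptions made for illustration.

```python
from statistics import mean


def split_regions(lo, hi, n, overlap):
    """Hypothetical helper: n equal-width cores over [lo, hi), each padded by `overlap`.

    Returns a list of (core_lo, core_hi, pad_lo, pad_hi) tuples.
    """
    width = (hi - lo) / n
    regions = []
    for i in range(n):
        core_lo = lo + i * width
        core_hi = lo + (i + 1) * width
        regions.append((core_lo, core_hi, core_lo - overlap, core_hi + overlap))
    return regions


def gap_cluster(points, eps):
    """Stand-in for a real clusterer (e.g. HDBSCAN): chain sorted 1-D points
    into one cluster while consecutive gaps are <= eps."""
    clusters = []
    for p in sorted(points):
        if clusters and p - clusters[-1][-1] <= eps:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def split_cluster_stitch(points, lo, hi, n, overlap, eps):
    """Cluster each padded region independently, then stitch the results.

    Assumed stitching rule: keep a cluster only if its mean lies in the region's
    core, which drops duplicates created by the overlap.
    """
    stitched = []
    for core_lo, core_hi, pad_lo, pad_hi in split_regions(lo, hi, n, overlap):
        subset = [p for p in points if pad_lo <= p < pad_hi]
        for cluster in gap_cluster(subset, eps):
            if core_lo <= mean(cluster) < core_hi:
                stitched.append(cluster)
    return stitched


# One cluster (4.7, 4.9, 5.1) straddles the boundary at 5.0 between the two cores.
points = [1.0, 1.2, 1.4, 4.7, 4.9, 5.1, 8.6, 8.8]
clusters = split_cluster_stitch(points, lo=0.0, hi=10.0, n=2, overlap=1.0, eps=0.5)
no_overlap = split_cluster_stitch(points, lo=0.0, hi=10.0, n=2, overlap=0.0, eps=0.5)
```

With an overlap of 1.0 the straddling cluster is recovered whole and reported once; with no overlap it is cut in two at the region boundary, which is the kind of edge effect the padding is meant to remove. Only one region's points need to be held by the clusterer at a time, which is where a memory saving would come from in this toy setting.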
- Publication: Astrophysics Source Code Library
- Pub Date: January 2023
- Bibcode: 2023ascl.soft01004C
- Keywords: Software