Data Driven Method for Selection of Optimum Number of Clusters in Earth Science Data using Centroid based Self-Organising Maps Clustering
Abstract
Geospatial remotely sensed data have become increasingly prevalent since the introduction of space-based satellite programs and airborne geophysical surveys. However, the link between these multi-variate data and physical parameters is often complex and difficult to model using traditional approaches.
Unsupervised clustering methods take unlabelled vector data and attempt to gather them into coherent groups within the dataspace, where a single central vector is representative of all data in that group. This is a powerful tool in exploratory data analysis on a new dataset or when examining older datasets for new insights. Clustering usually requires expert user knowledge. Centroid clustering is dependent on the initial cluster vectors. Hierarchical clustering requires a high level of user input to determine the level in the "hierarchical tree" where the groups are realistic. Density clustering is based on the theory that the point density per unit volume in data space is larger in a cluster than outside it, but often cannot distinguish between clusters with differing densities. Spectral clustering relies on the creation of non-trivial "similarity graphs". All clustering methods suffer from the problem of determining the correct number of clusters that best represents a dataset. The self-organising map (SOM) is a centroid-based clustering method. Traditionally, a relatively simple distance metric is used to divide the dataspace and place cluster centre vectors; SOM instead uses competitive learning within a neural network to find similarities between the input vectors and assign cluster vectors. Several methods exist which attempt to determine the appropriate number of clusters for a dataset; however, these require expert user knowledge. A simple method is proposed here based on the stability of cluster vector values, over multiple clustering attempts, for varying numbers of clusters. High stability means a reliable and repeatable clustering result. The method is demonstrated on an example dataset, the rice grain model, where the number of expected clusters is well understood.
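The stability idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the SOM configuration or the stability measure, so the 1-D SOM, the function names `train_som` and `instability`, the learning-rate and neighbourhood schedules, and the score (mean distance from each cluster vector in one run to its nearest cluster vector in another run) are all assumptions chosen for illustration. A lower score indicates cluster vectors that repeat more closely across runs.

```python
import numpy as np


def train_som(data, n_units, n_iter=2000, seed=0):
    """Train a minimal 1-D SOM (illustrative) and return its unit vectors."""
    rng = np.random.default_rng(seed)
    # Initialise the units from randomly chosen data samples.
    units = data[rng.choice(len(data), n_units, replace=False)].astype(float)
    grid = np.arange(n_units)  # unit positions on the 1-D map
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (1.0 - frac)                           # decaying learning rate
        sigma = max(n_units / 2.0 * (1.0 - frac), 0.01)   # shrinking neighbourhood
        x = data[rng.integers(len(data))]                 # one random sample
        # Competitive step: find the best-matching unit for this sample.
        bmu = np.argmin(np.linalg.norm(units - x, axis=1))
        # Move the winner and its map neighbours towards the sample.
        h = np.exp(-((grid - bmu) ** 2) / (2.0 * sigma ** 2))
        units += lr * h[:, None] * (x - units)
    return units


def instability(data, n_units, n_runs=5):
    """Assumed stability score: mean nearest-unit distance between runs."""
    runs = [train_som(data, n_units, seed=s) for s in range(n_runs)]
    scores = []
    for i in range(n_runs):
        for j in range(n_runs):
            if i != j:
                # Pairwise distances between units of run i and run j.
                d = np.linalg.norm(runs[i][:, None, :] - runs[j][None, :, :], axis=2)
                scores.append(d.min(axis=1).mean())
    return float(np.mean(scores))


if __name__ == "__main__":
    # Synthetic stand-in data: three well-separated 2-D groups.
    rng = np.random.default_rng(42)
    centres = np.array([[0.0, 0.0], [4.0, 4.0], [8.0, 8.0]])
    data = np.vstack([c + 0.3 * rng.standard_normal((100, 2)) for c in centres])
    for k in range(2, 7):
        print(k, instability(data, k))
```

Under these assumptions, one would scan a range of cluster counts and favour the count whose cluster vectors vary least between runs. Matching units by nearest distance sidesteps the fact that unit ordering differs between independently initialised runs.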
The method is then applied to real-world airborne remote sensing (radiometric) data acquired over a degraded peatland site in Ireland, highlighting areas of similar peat physical properties.
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2022
- Bibcode: 2022AGUFMIN56A..03O