Fast and accurate probability density estimation in large high dimensional astronomical datasets
Abstract
Astronomical surveys will generate measurements of hundreds of attributes (e.g. color, size, shape) on hundreds of millions of sources. Analyzing these large, high dimensional data sets will require efficient algorithms for data analysis. An example of this is probability density estimation that is at the heart of many classification problems such as the separation of stars and quasars based on their colors. Popular density estimation techniques use binning or kernel density estimation. Kernel density estimation has a small memory footprint but often requires large computational resources. Binning has small computational requirements but usually binning is implemented with multi-dimensional arrays which leads to memory requirements which scale exponentially with the number of dimensions. Hence both techniques do not scale well to large data sets in high dimensions. We present an alternative approach of binning implemented with hash tables (BASH tables). This approach uses the sparseness of data in the high dimensional space to ensure that the memory requirements are small. However hashing requires some extra computation so a priori it is not clear if the reduction in memory requirements will lead to increased computational requirements. Through an implementation of BASH tables in C++ we show that the additional computational requirements of hashing are negligible. Hence this approach has small memory and computational requirements. We apply our density estimation technique to photometric selection of quasars using non-parametric Bayesian classification and show that the accuracy of the classification is same as the accuracy of earlier approaches. Since the BASH table approach is one to three orders of magnitude faster than the earlier approaches it may be useful in various other applications of density estimation in astrostatistics.
- Publication:
-
American Astronomical Society Meeting Abstracts #225
- Pub Date:
- January 2015
- Bibcode:
- 2015AAS...22542206G