Analysing billion-objects catalogue interactively: Apache Spark for physicists
Abstract
Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in industry, its adoption in the academic community remains rather limited, often restricted to software engineers. The goal of this paper is to show, with practical use cases, that the technology is mature enough to be used without excessive programming skills by astronomers or cosmologists in order to perform standard analyses over large datasets, such as those originating from future galaxy surveys.
To demonstrate this, we start from a realistic simulation corresponding to 10 years of LSST data taking (6 billion galaxies). We then design, optimize and benchmark a set of Spark Python algorithms in order to perform standard operations such as adding photometric redshift errors, measuring the selection function or computing power spectra over tomographic bins. Most of the commands execute on the full 110 GB dataset within tens of seconds and can therefore be performed interactively in order to design full-scale cosmological analyses. A Jupyter notebook summarizing the analysis is available at https://github.com/astrolabsoftware/1807.03078.
- Publication: Astronomy and Computing
- Pub Date: July 2019
- DOI: 10.1016/j.ascom.2019.100305
- arXiv: arXiv:1807.03078
- Bibcode: 2019A&C....2800305P
- Keywords: Large-scale structure of universe; Galaxies; Statistics; Catalogues; Distributed programming languages; Astrophysics - Instrumentation and Methods for Astrophysics; Astrophysics - Cosmology and Nongalactic Astrophysics
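The abstract mentions adding photometric redshift errors and assigning galaxies to tomographic bins. A minimal local sketch of that per-object logic is shown below with NumPy; in the paper these operations run as Spark DataFrame transformations over the full catalogue, and both the Gaussian error model with width `sigma0 * (1 + z)` and the value `sigma0 = 0.03` are common illustrative assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

def add_photoz_errors(z_true, sigma0=0.03, rng=None):
    """Perturb true redshifts with a Gaussian photo-z error whose
    width grows with redshift as sigma0 * (1 + z) (assumed model)."""
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = sigma0 * (1.0 + z_true)
    return z_true + rng.normal(0.0, sigma)

def tomographic_bin(z, edges):
    """Assign each object to a tomographic bin defined by `edges`;
    bin 0 is [edges[0], edges[1]), and so on."""
    return np.digitize(z, edges) - 1

# Toy catalogue of three true redshifts.
z_true = np.array([0.3, 0.8, 1.5])
z_obs = add_photoz_errors(z_true)
bins = tomographic_bin(z_obs, edges=np.array([0.0, 0.5, 1.0, 2.0]))
```

In a Spark version, the same functions would be applied column-wise to the catalogue DataFrame (e.g. via a UDF), followed by a group-by on the bin index to count objects per tomographic slice.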