Optimization of RTE+RRTMGP-C++: a CUDA implementation of the radiation package RTE+RRTMGP for atmospheric science.
Abstract
The Rapid Radiative Transfer Model for GCM applications (RRTMG) is one of the most widely adopted radiation physics packages in atmospheric physics and climate modeling. Its successor, the Radiative Transfer for Energetics (RTE) and RRTM for GCM applications - Parallel (RRTMGP) toolbox, RTE+RRTMGP, adopts a more flexible, object-oriented programming interface and balances accuracy and efficiency. Its key algorithms, which compute sources and radiative fluxes from a given incoming solar flux, gas concentrations, and thermodynamic state within grid columns, present a significant computational burden for applications adopting this scheme. Many of these models therefore resort to evaluating radiation on a coarser grid or at a reduced temporal frequency, which degrades the accuracy of the resulting tendencies. Meanwhile, almost all pre-exascale supercomputers will accommodate GPU accelerators to achieve the necessary leaps in efficiency. While efforts to leverage these compute units for the dynamics of GCMs are abundant, the column physics parameterizations are often still evaluated on the CPU. The RTE+RRTMGP-C++ package is a C++ interface to the above toolbox that maintains numerical equivalence and feature parity with the original Fortran code. The library contains a CUDA backend that enables the numerical kernels to run entirely on NVIDIA GPUs. Within the ESiWACE-2 project, we have carried out a collaborative optimization effort on this computational pipeline. We have focused on three main aspects: (i) adopting a memory pool for GPU-resident arrays, (ii) hand-optimizing kernel code to increase memory bandwidth utilization and parallelism, and (iii) tuning the compute kernels and their parallel layout in a systematic and automated way.
Using these strategies, we have achieved substantial performance gains for realistic use cases with sufficient grid columns per processor; in such situations we observe a speedup of more than a factor of 100 relative to the single-threaded CPU computation. Hence, this version of RTE+RRTMGP has the potential to become a valuable building block of upcoming extreme-resolution weather models and GPU-resident atmospheric large eddy simulations, designed to perform optimally on modern heterogeneous compute infrastructures.
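To illustrate the first of the three aspects, the following is a minimal host-side sketch of a size-bucketed memory pool. It is not taken from RTE+RRTMGP-C++: the names BufferPool and count_allocations are hypothetical, and std::malloc and std::free stand in for the device allocation calls so that the example compiles without the CUDA toolkit. The idea it shows is that scratch arrays requested repeatedly with the same size can be served from a cache, so the raw allocator is called only once per distinct size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical size-bucketed pool (illustration only, not the package's API).
// std::malloc/std::free stand in for device allocation and deallocation.
class BufferPool
{
    public:
        BufferPool() = default;
        BufferPool(const BufferPool&) = delete;
        BufferPool& operator=(const BufferPool&) = delete;

        // Free every cached block; blocks still held by callers are not tracked.
        ~BufferPool()
        {
            for (auto& bucket : free_blocks_)
                for (void* ptr : bucket.second)
                    std::free(ptr);
        }

        // Return a cached block of exactly this size if one exists,
        // otherwise fall back to a raw allocation.
        void* acquire(const std::size_t bytes)
        {
            auto it = free_blocks_.find(bytes);
            if (it != free_blocks_.end() && !it->second.empty())
            {
                void* ptr = it->second.back();
                it->second.pop_back();
                return ptr;
            }
            ++raw_allocations_;
            return std::malloc(bytes);
        }

        // Hand the block back to the pool instead of freeing it.
        void release(void* ptr, const std::size_t bytes)
        {
            assert(ptr != nullptr);
            free_blocks_[bytes].push_back(ptr);
        }

        std::size_t raw_allocations() const { return raw_allocations_; }

    private:
        std::map<std::size_t, std::vector<void*>> free_blocks_;
        std::size_t raw_allocations_ = 0;
};

// Simulate n_steps radiation calls that each need one scratch array
// of the same size, and report how many raw allocations were made.
std::size_t count_allocations(const int n_steps, const std::size_t bytes)
{
    BufferPool pool;
    for (int i = 0; i < n_steps; ++i)
    {
        void* scratch = pool.acquire(bytes);
        pool.release(scratch, bytes);
    }
    return pool.raw_allocations();
}
```

In this sketch, repeated calls with an unchanged array size trigger a single raw allocation regardless of the number of steps. A GPU version would presumably replace the stand-in calls with the device allocator and would also need to consider stream ordering, which is omitted here.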
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2021
- Bibcode: 2021AGUFM.A55A..03V