Achieving Optimal VPIC Performance on Several Modern CPU Architectures
Abstract
Two significant modifications to the VPIC particle advance implementation are being explored. The first is the use of an Array of Structs of Arrays (AoSoA) data structure for the particles, which eliminates the need to transpose particle data after loading it into vector registers and before storing it back to memory. The second is a particle sort performed every timestep, which allows particles to be processed as a double loop over cells and the particles in each cell. This second modification enables several optimizations, including hoisting the load of interpolation data and the store of current density accumulation data out of the per-cell particle loop. Together, these modifications eliminate a performance bottleneck associated with the shuffle and permute operations used in data transposes and increase the efficiency of VPIC's use of available memory bandwidth. Initial results show performance improvements greater than 2x for some of the VPIC particle kernels. Results for the complete implementation will be presented on several modern architectures, including Intel Knights Landing and IBM Power 9.
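The two modifications described above can be illustrated with a minimal C++ sketch. It is not the VPIC implementation: the names `ParticleBlock`, `CellField`, `CellCurrent`, `advance`, the block width `W`, and the coefficient `k` are all hypothetical, the field is assumed uniform within a cell, and the "push" is a simple linear kick standing in for the real, more involved particle advance. The sketch only shows the data layout and the loop structure: particles stored in AoSoA blocks so that each component is contiguous across lanes, and particles pre-sorted by cell so that the interpolation load and the current store sit outside the per-cell particle loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed SIMD width: number of particles packed into one AoSoA block.
constexpr std::size_t W = 8;

// One AoSoA block. Each component is contiguous across W particles, so a
// vector load of ux[0..W-1] needs no transpose, shuffle, or permute.
struct ParticleBlock {
    float ux[W], uy[W], uz[W];  // momentum components
    float w[W];                 // weight (0 for padding lanes)
};

// Hypothetical, simplified per-cell data (uniform field within a cell).
struct CellField   { float ex, ey, ez; };
struct CellCurrent { float jx, jy, jz; };

// Particles are assumed already sorted by cell and padded to whole blocks:
// cell c owns blocks [first_block[c], first_block[c + 1]).
void advance(std::vector<ParticleBlock>& blocks,
             const std::vector<std::size_t>& first_block,
             const std::vector<CellField>& field,
             std::vector<CellCurrent>& current,
             float k)  // illustrative kick coefficient
{
    assert(first_block.size() == field.size() + 1);
    assert(current.size() == field.size());

    for (std::size_t c = 0; c < field.size(); ++c) {
        const CellField f = field[c];          // load hoisted out of particle loop
        float jx = 0.0f, jy = 0.0f, jz = 0.0f; // cell-local current accumulators

        for (std::size_t b = first_block[c]; b < first_block[c + 1]; ++b) {
            ParticleBlock& p = blocks[b];
            // Unit-stride inner loop over lanes; a compiler can vectorize it.
            for (std::size_t i = 0; i < W; ++i) {
                p.ux[i] += k * f.ex;
                p.uy[i] += k * f.ey;
                p.uz[i] += k * f.ez;
                jx += p.w[i] * p.ux[i];
                jy += p.w[i] * p.uy[i];
                jz += p.w[i] * p.uz[i];
            }
        }

        // Single store per cell, hoisted out of the particle loop.
        current[c].jx += jx;
        current[c].jy += jy;
        current[c].jz += jz;
    }
}
```

In this sketch, padding lanes carry zero weight, so they contribute nothing to the accumulated current. This is one simple way to handle cells whose particle count is not a multiple of `W`; other choices, such as masking, are equally possible.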
This work was supported by the US Department of Energy through the Los Alamos National Laboratory. Los Alamos National Laboratory is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy (Contract No. 89233218CNA000001).
- Publication:
- APS Division of Plasma Physics Meeting Abstracts
- Pub Date:
- 2019
- Bibcode:
- 2019APS..DPPNP10032