Integrating State-of-the-Art Compute, Communication, and Autotuning Strategies to Multiply the Performance of the Application Program CPMD for Ab Initio Molecular Dynamics Simulations
We present our recent code modernizations of the ab initio molecular dynamics program CPMD (www.cpmd.org), with a special focus on the ultra-soft pseudopotential (USPP) code path. Guided by the internal instrumentation of CPMD, all time-critical routines have been revised to maximize computational throughput and to minimize communication overhead. Hybrid MPI+OpenMP parallelization has been added wherever it was missing in the program to improve scaling. For communication-intensive routines, such as the multiple distributed 3D FFTs of the electronic states and the distributed matrix-matrix multiplications related to the $\beta$-projectors of the pseudopotentials, this MPI+OpenMP parallelization now overlaps computation and communication. The necessary partitioning of the workload is optimized by an auto-tuning algorithm. In addition, the largest global MPI_Allreduce operation has been replaced by highly tuned, node-local parallelized operations that use MPI shared-memory windows to avoid inter-node communication. A batched algorithm for the multiple 3D FFTs improves the throughput of the MPI_Alltoall communication and thus the scalability of the implementation, both for the USPP code path and for the frequently used norm-conserving pseudopotential code path. The enhanced performance and scalability are demonstrated on a mid-sized benchmark system of 256 water molecules and on further water systems ranging from 32 to 2048 molecules.
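The abstract does not specify how the auto-tuning of the workload partitioning works. As an illustration of the general idea only, the following minimal Python sketch times a batched operation for several candidate batch sizes and keeps the fastest one. The names `autotune_batch_size` and `toy_cost`, the candidate list, and the cost model (a fixed latency per batch plus a penalty that grows with batch size) are assumptions made for this example and are not taken from CPMD.

```python
import math
import time
from typing import Callable, Dict, Iterable, Tuple


def autotune_batch_size(
    run_batched: Callable[[int], None],
    candidates: Iterable[int],
    repeats: int = 3,
    timer: Callable[[], float] = time.perf_counter,
) -> Tuple[int, Dict[int, float]]:
    """Time `run_batched(batch)` for each candidate and return the fastest.

    Hypothetical helper: each candidate is run `repeats` times and the
    best (smallest) elapsed time is recorded to reduce timing noise.
    """
    timings: Dict[int, float] = {}
    for batch in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = timer()
            run_batched(batch)
            best = min(best, timer() - start)
        timings[batch] = best
    if not timings:
        raise ValueError("no candidate batch sizes given")
    # Pick the batch size with the smallest recorded time.
    best_batch = min(timings, key=timings.get)
    return best_batch, timings


def toy_cost(
    n_states: int,
    batch: int,
    latency: float = 1.0,
    per_state: float = 0.1,
    penalty: float = 0.01,
) -> float:
    """Assumed cost model, for illustration only.

    Small batches pay the per-message latency many times; large batches
    pay a penalty that grows quadratically with the batch size (standing
    in for, e.g., cache pressure or reduced overlap).
    """
    n_batches = math.ceil(n_states / batch)
    return n_batches * latency + n_states * per_state + penalty * batch * batch
```

In a real code the callable passed as `run_batched` would execute the actual batched operation (for example one set of batched FFTs with the given batch size), and the tuning would typically run once during the first iterations so that its cost is amortized over the simulation.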