Phantom-GRAPE: Numerical software library to accelerate collisionless N-body simulation with SIMD instruction set on x86 architecture
Abstract
We have developed a numerical software library for collisionless N-body simulations named "Phantom-GRAPE", which greatly accelerates force calculations among particles by using a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). Our library can quickly compute not only Newton's forces but also central forces with an arbitrary shape f(r) that has a finite cutoff radius r_{cut} (i.e. f(r)=0 at r>r_{cut}). To compute such arbitrarily shaped central forces, we refer to a precalculated lookup table. We also present a new scheme for creating the lookup table, whose binning is optimized to keep good accuracy in computing forces and whose size is small enough to avoid cache misses. Using an Intel Core i7-2600 processor, we measure the performance of our library for both Newton's forces and the arbitrarily shaped central forces. In the case of Newton's forces, we achieve 2×10^{9} interactions per second with one processor core (or 75 GFLOPS if we count 38 operations per interaction), which is 20 times higher than the performance of an implementation without any explicit use of SIMD instructions, and 2 times higher than that of an implementation with the SSE instructions. With four processor cores, we obtain a performance of 8×10^{9} interactions per second (or 300 GFLOPS). In the case of the arbitrarily shaped central forces, we can calculate 1×10^{9} and 4×10^{9} interactions per second with one and four processor cores, respectively. The performance with one processor core is 6 times and 2 times higher than those of the implementations without any use of SIMD instructions and with the SSE instructions, respectively. These performances depend only weakly on the number of particles, irrespective of the force shape.
This is in good contrast with the fact that the performance of force calculations accelerated by graphics processing units (GPUs) depends strongly on the number of particles. Such a weak dependence of the performance on the number of particles is well suited to collisionless N-body simulations, since these simulations are usually performed with sophisticated N-body solvers such as Tree and TreePM methods combined with an individual timestep scheme. We conclude that collisionless N-body simulations accelerated with our library have a significant advantage over those accelerated by GPUs, especially in massively parallel environments.
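As an illustration of the lookup-table idea described in the abstract, the following is a minimal sketch in plain Python, not the library's implementation. It tabulates g(s) = f(r)/r against s = r^2 for a force with a finite cutoff and then evaluates pairwise accelerations by direct summation. The force shape (a softened force multiplied by a smooth cutoff factor), the uniform binning in r^2, the linear interpolation, and all function names are assumptions made for this example; the optimized binning scheme of the paper and its AVX vectorization are not reproduced here.

```python
import math


def make_shape(eps, r_cut):
    """Return g(s) = f(r)/r as a function of s = r^2.

    Hypothetical example shape: a softened inverse-square force times a
    smooth factor that vanishes at the cutoff, so f(r) = 0 for r >= r_cut.
    """
    rc2 = r_cut * r_cut

    def g(s):
        if s >= rc2:
            return 0.0
        return (s + eps * eps) ** -1.5 * (1.0 - s / rc2) ** 2

    return g


def build_table(g, r_cut, n_bins=256):
    """Tabulate g at n_bins + 1 points equally spaced in s over [0, r_cut^2]."""
    rc2 = r_cut * r_cut
    ds = rc2 / n_bins
    values = [g(k * ds) for k in range(n_bins + 1)]
    values[-1] = 0.0  # the force vanishes exactly at the cutoff
    return values, ds, rc2


def lookup(table, s):
    """Linearly interpolate the tabulated g at squared distance s."""
    values, ds, rc2 = table
    if s >= rc2:
        return 0.0
    t = s / ds
    k = min(int(t), len(values) - 2)  # guard against rounding at the last bin
    w = t - k
    return values[k] * (1.0 - w) + values[k + 1] * w


def accelerations(pos, mass, table):
    """Direct summation: a_i = sum_j m_j * g(|x_j - x_i|^2) * (x_j - x_i).

    The self term needs no special case: the displacement is zero and the
    softened g(0) is finite, so it contributes nothing.
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = [pos[j][c] - pos[i][c] for c in range(3)]
            s = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            coef = mass[j] * lookup(table, s)
            for c in range(3):
                acc[i][c] += coef * d[c]
    return acc
```

A call such as `accelerations(pos, mass, build_table(make_shape(0.3, 1.0), 1.0))` returns one acceleration vector per particle. With uniform bins in r^2, the interpolation error is largest at small r, where g varies fastest, which is one reason a non-uniform binning of the kind the abstract mentions can reach the same accuracy with a smaller table.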
 Publication:

New Astronomy
 Pub Date:
 February 2013
 DOI:
 10.1016/j.newast.2012.08.009
 arXiv:
 arXiv:1203.4037
 Bibcode:
 2013NewA...19...74T
 Keywords:

 Astrophysics - Instrumentation and Methods for Astrophysics;
 Physics - Computational Physics
 E-Print:
 19 pages, 11 figures, 4 tables, accepted for publication in New Astronomy