A performance evaluation of CCS QCD Benchmark on the COMA (Intel(R) Xeon Phi$^{TM}$, KNC) system
Abstract
The most computationally demanding part of Lattice QCD simulations is solving for quark propagators. Quark propagators are typically obtained with a linear equation solver running on HPC machines. The CCS QCD Benchmark is a benchmark program that solves for the Wilson-Clover quark propagator and is developed at the Center for Computational Sciences (CCS), University of Tsukuba. We optimized the benchmark program for an Intel(R) Xeon Phi$^{TM}$ (Knights Corner, KNC) system named "COMA (PACS-IX)" at CCS Tsukuba under the Intel Parallel Computing Center program. A single-precision BiCGStab solver with the overlapped Restricted Additive Schwarz (RAS) preconditioner was implemented using SIMD intrinsics, OpenMP, and MPI in the offload mode. With the reverse-offloading technique, we could reduce the communication and offloading overheads. We observed a sustained performance of $\sim 200$ GFlops for the Wilson-Clover hopping matrix multiplication on lattice sizes larger than $24^3\times 32$ on a single card of the COMA system. Good weak-scaling performance was observed on local lattice sizes larger than $24^3\times 32$.
 Publication:

arXiv e-prints
 Pub Date:
 December 2016
 arXiv:
 arXiv:1612.06556
 Bibcode:
 2016arXiv161206556B
 Keywords:

 High Energy Physics - Lattice;
 Physics - Computational Physics
 E-Print:
 7 pages, 6 figures, talk presented at the 34th International Symposium on Lattice Field Theory, 24-30 July 2016, University of Southampton, UK