RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs

Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement various asynchronous, RDMA-based sparse times dense (SpMM) and sparse times sparse (SpGEMM) algorithms, evaluating their performance running in a distributed memory setting on GPUs. Our RDMA-based implementations use the NVSHMEM communication library for direct, asynchronous one-sided communication between GPUs. We compare our asynchronous implementations to state-of-the-art bulk synchronous GPU libraries as well as a CUDA-aware MPI implementation of the SUMMA algorithm. We find that asynchronous RDMA-based implementations are able to offer favorable performance compared to bulk synchronous implementations, while also allowing for the straightforward implementation of novel work stealing algorithms.

Publication:

arXiv e-prints

Pub Date:

November 2023

DOI:

10.48550/arXiv.2311.18141

arXiv:

arXiv:2311.18141

Bibcode:

2023arXiv231118141B

Keywords:

Computer Science - Distributed, Parallel, and Cluster Computing

E-Print:

To appear in ACM International Conference on Supercomputing (ICS) 2024
