gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

doi:10.48550/arXiv.2308.05199

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to integrate lossy compression directly into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. To address these issues, we propose gZCCL, the first general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate the framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL and Cray MPI by up to 4.5X and 28.7X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.

Publication:

arXiv e-prints

Pub Date:

August 2023

DOI:

10.48550/arXiv.2308.05199

arXiv:

arXiv:2308.05199

Bibcode:

2023arXiv230805199H

Keywords:

Computer Science - Distributed, Parallel, and Cluster Computing

E-Print:

12 pages, 13 figures, and 2 tables. ICS '24

NASA/ADS
