Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM

doi:10.48550/arXiv.2007.13055

Gu, Zijing

We implemented and optimized matrix multiplications between dense and block-sparse matrices on CUDA. We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With the automatic parameter tuning in TVM, our cross-thread reduction based implementation achieved competitive or better performance compared with other state-of-the-art frameworks.
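To make the operation concrete, here is a minimal NumPy reference for multiplying a dense matrix by a block-sparse matrix. It is an illustrative sketch, not the TVM schedule or the generated CUDA code the abstract describes. The BSR-style layout (`data`, `indices`, `indptr`) and the helper names `dense_to_bsr` and `dense_bsr_matmul` are assumptions made for this example; the abstract does not specify a storage format.

```python
import numpy as np


def dense_to_bsr(w, bs_r, bs_c):
    """Pack a (K, N) matrix into a BSR-style triple, keeping only nonzero blocks.

    Hypothetical helper for illustration. Returns:
      data    -- (nnz_blocks, bs_r, bs_c) array of stored blocks
      indices -- block-column index of each stored block
      indptr  -- indptr[i]:indptr[i + 1] spans the blocks of block-row i
    """
    k, n = w.shape
    assert k % bs_r == 0 and n % bs_c == 0, "shape must be divisible by block size"
    blocks, indices, indptr = [], [], [0]
    for i in range(k // bs_r):
        for j in range(n // bs_c):
            block = w[i * bs_r:(i + 1) * bs_r, j * bs_c:(j + 1) * bs_c]
            if np.any(block):
                blocks.append(block.copy())
                indices.append(j)
        indptr.append(len(indices))
    data = np.stack(blocks) if blocks else np.zeros((0, bs_r, bs_c), dtype=w.dtype)
    return data, np.array(indices, dtype=np.int64), np.array(indptr, dtype=np.int64)


def dense_bsr_matmul(x, data, indices, indptr, n):
    """Compute y = x @ W, where W (K x N) is given in the BSR-style layout above."""
    m = x.shape[0]
    _, bs_r, bs_c = data.shape
    y = np.zeros((m, n), dtype=np.result_type(x, data))
    for i in range(len(indptr) - 1):
        # Columns of x that meet block-row i of W.
        x_panel = x[:, i * bs_r:(i + 1) * bs_r]
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            # Accumulate this block's contribution; empty blocks are never visited.
            y[:, j * bs_c:(j + 1) * bs_c] += x_panel @ data[p]
    return y
```

Each output tile is a sum of contributions from the stored blocks in its block-column. A GPU implementation could plausibly split that sum across threads, which is the general idea behind a cross-thread reduction, though how the paper's schedule does so is not stated here.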

Publication:

arXiv e-prints

Pub Date:

July 2020

DOI:

10.48550/arXiv.2007.13055

arXiv:

arXiv:2007.13055

Bibcode:

2020arXiv200713055G

Keywords:

Computer Science - Mathematical Software;
Computer Science - Distributed, Parallel, and Cluster Computing;
Computer Science - Machine Learning;
Mathematics - Numerical Analysis

NASA/ADS
