Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM
Abstract
We implemented and optimized matrix multiplication between dense and block-sparse matrices on CUDA. We used TVM, a deep learning compiler, to explore the schedule space of the operation and to generate efficient CUDA code. With TVM's automatic parameter tuning, our implementation, which is based on cross-thread reduction, achieved performance competitive with or better than that of other state-of-the-art frameworks.
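To make the operation concrete, here is a minimal NumPy reference sketch of a dense by block-sparse product. It is not the authors' TVM schedule or their CUDA kernel. It assumes the block-sparse matrix W is stored in the common BSR layout (`data`, `indices`, `indptr`) and that the product computed is `x @ W.T`. The function names `dense_to_bsr` and `bsr_dense_matmul` are hypothetical and chosen for illustration only.

```python
import numpy as np


def dense_to_bsr(w, bs_r, bs_c):
    """Convert a dense matrix to BSR arrays, keeping only non-zero blocks.

    Assumes w.shape is divisible by the block shape (bs_r, bs_c).
    """
    n, k = w.shape
    data, indices, indptr = [], [], [0]
    for i in range(n // bs_r):
        for c in range(k // bs_c):
            block = w[i * bs_r:(i + 1) * bs_r, c * bs_c:(c + 1) * bs_c]
            if np.any(block):
                data.append(block)
                indices.append(c)
        # indptr[i + 1] marks where block row i ends in `indices`/`data`.
        indptr.append(len(indices))
    data = np.array(data).reshape(-1, bs_r, bs_c)
    return data, np.array(indices, dtype=np.int64), np.array(indptr, dtype=np.int64)


def bsr_dense_matmul(x, data, indices, indptr, n_rows):
    """Compute x @ W.T, where W (n_rows x K) is given in BSR form."""
    bs_r, bs_c = data.shape[1], data.shape[2]
    y = np.zeros((x.shape[0], n_rows), dtype=x.dtype)
    for i in range(len(indptr) - 1):
        for j in range(indptr[i], indptr[i + 1]):
            c = indices[j]
            # Each stored block contributes a partial sum over its column
            # range; this is the reduction axis of the multiplication.
            y[:, i * bs_r:(i + 1) * bs_r] += x[:, c * bs_c:(c + 1) * bs_c] @ data[j].T
    return y
```

Only the stored blocks are visited, so the work scales with the number of non-zero blocks rather than with the full size of W. The inner accumulation over blocks in a block row is the reduction that a GPU implementation would need to parallelize.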
- Publication: arXiv e-prints
- Pub Date: July 2020
- DOI: 10.48550/arXiv.2007.13055
- arXiv: arXiv:2007.13055
- Bibcode: 2020arXiv200713055G
- Keywords: Computer Science - Mathematical Software; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Machine Learning; Mathematics - Numerical Analysis