Spartus: A 9.4 TOp/s FPGA-based LSTM Accelerator Exploiting Spatio-Temporal Sparsity
Abstract
Long Short-Term Memory (LSTM) recurrent networks are frequently used for tasks involving time-sequential data such as speech recognition. Unlike previous LSTM accelerators that either exploit spatial weight sparsity or temporal activation sparsity, this paper proposes a new accelerator called "Spartus" that exploits spatio-temporal sparsity to achieve ultra-low latency inference. Spatial sparsity is induced using a new Column-Balanced Targeted Dropout (CBTD) structured pruning method, producing structured sparse weight matrices for a balanced workload. The pruned networks running on Spartus hardware achieve weight sparsity levels of up to 96% and 94% with negligible accuracy loss on the TIMIT and the Librispeech datasets. To induce temporal sparsity in LSTM, we extend the previous DeltaGRU method to the DeltaLSTM method. Combining spatio-temporal sparsity with CBTD and DeltaLSTM saves on weight memory access and associated arithmetic operations. The Spartus architecture is scalable and supports real-time online speech recognition when implemented on small and large FPGAs. Spartus per-sample latency for a single DeltaLSTM layer of 1024 neurons averages 1 us. Exploiting spatio-temporal sparsity on our test LSTM network using the TIMIT dataset leads to 46X speedup of Spartus over its theoretical hardware performance to achieve 9.4 TOp/s effective batch-1 throughput and 1.1 TOp/s/W power efficiency.
- Publication:
-
arXiv e-prints
- Pub Date:
- August 2021
- DOI:
- 10.48550/arXiv.2108.02297
- arXiv:
- arXiv:2108.02297
- Bibcode:
- 2021arXiv210802297G
- Keywords:
-
- Computer Science - Hardware Architecture;
- Computer Science - Artificial Intelligence;
- Computer Science - Computer Vision and Pattern Recognition;
- Computer Science - Machine Learning
- E-Print:
- Accepted for publication in IEEE Transactions on Neural Networks and Learning Systems, 2022