Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models

doi:10.48550/arXiv.2409.18654

Current automatic speech recognition systems struggle to model long speech sequences because of the quadratic complexity of Transformer-based models. Selective state space models such as Mamba have performed well on long-sequence modeling in natural language processing and computer vision tasks. However, their use in speech technology tasks remains under-explored. We propose Speech-Mamba, which incorporates selective state space modeling into Transformer neural architectures. In Speech-Mamba, long-sequence representations from selective state space models are complemented by lower-level representations from Transformer-based modeling. Speech-Mamba is better able to model long-range dependencies because it scales near-linearly with sequence length.
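
The scaling contrast in the abstract can be illustrated with a minimal sketch. It is not the Speech-Mamba implementation: the function names, the scalar state, and the fixed `decay` parameter are assumptions chosen for illustration, and the recurrence is a plain (non-selective) one. It only shows why full self-attention does work proportional to the square of the sequence length while a state space recurrence does one update per step.

```python
def attention_pair_count(seq_len: int) -> int:
    """Query-key score computations in full self-attention: one per pair."""
    return seq_len * seq_len


def scan_step_count(seq_len: int) -> int:
    """State updates in a sequential state space recurrence: one per step."""
    return seq_len


def toy_ssm_scan(inputs, decay=0.9):
    """Toy scalar recurrence h_t = decay * h_{t-1} + x_t.

    Hypothetical illustration only: a selective model such as Mamba makes
    its parameters depend on the input, which this fixed-decay toy does not.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        state = decay * state + x  # single O(1) update per time step
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, attention_pair_count(n), scan_step_count(n))
```

Growing the sequence tenfold multiplies the attention pair count by one hundred but the recurrence step count by only ten, which is the sense in which a state space model scales linearly in this sketch.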

Publication:

arXiv e-prints

Pub Date:

September 2024

DOI:

10.48550/arXiv.2409.18654

arXiv:

arXiv:2409.18654

Bibcode:

2024arXiv240918654G

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing;
Computer Science - Sound

E-Print:

8 pages
