MASV: Speaker Verification with Global and Local Context Mamba

doi:10.48550/arXiv.2412.10989

Deep learning models such as convolutional neural networks (CNNs) and transformers have shown impressive capabilities in speaker verification, gaining considerable attention in the research community. However, CNN-based approaches struggle to model long audio sequences effectively, resulting in suboptimal verification performance. Transformer-based methods, on the other hand, are often hindered by high computational demands, which limits their practicality. This paper presents the MASV model, a novel architecture that integrates the Mamba module into the ECAPA-TDNN framework. By introducing the Local Context Bidirectional Mamba and the Tri-Mamba block, the model captures both global and local context within audio sequences. Experimental results demonstrate that the MASV model substantially enhances verification performance, surpassing existing models in both accuracy and efficiency.

Publication: arXiv e-prints

Pub Date: December 2024

DOI: 10.48550/arXiv.2412.10989

arXiv: arXiv:2412.10989

Bibcode: 2024arXiv241210989L

Keywords: Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Sound
