ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention

doi:10.48550/arXiv.2305.12121

In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification (SV) that improves upon existing work by replacing temporal pooling with Asymmetric Cross Attention (ACA). ACA distills large, variable-length sequences into small, fixed-size latents by attending a small query to large key and value matrices. In ACA-Net, we build a Multi-Layer Aggregation (MLA) block using ACA to generate fixed-size identity vectors from variable-length inputs. Through global attention, ACA-Net acts as an efficient global feature extractor that adapts to temporal variability. Existing SV models, by contrast, apply a fixed pooling function over the temporal dimension, which may obscure information about the signal's non-stationary temporal variability. Our experiments on WSJ0-1talker show that ACA-Net achieves a 5% relative improvement in EER over a strong baseline while using only 1/5 of the parameters.
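The core idea in the abstract, a small query attending to large key and value matrices so that the output size no longer depends on the input length, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a single attention head, omits the MLA block entirely, and the function name, the projection matrices `w_q`, `w_k`, `w_v`, and all dimensions (4 latents, 80-dimensional input frames, 32-dimensional attention space) are assumptions chosen only for illustration.

```python
import numpy as np


def asymmetric_cross_attention(latents, inputs, w_q, w_k, w_v):
    """Single-head cross attention where a small latent array queries a long sequence.

    latents: (n_latents, d_latent) small query array (learned in a real model)
    inputs:  (n_frames, d_input)   variable-length frame-level features
    Returns: (n_latents, d_attn)   fixed-size output, independent of n_frames
    """
    q = latents @ w_q                              # (n_latents, d_attn)
    k = inputs @ w_k                               # (n_frames, d_attn)
    v = inputs @ w_v                               # (n_frames, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (n_latents, n_frames)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # weighted sum over all frames


# Hypothetical sizes, chosen for illustration only.
rng = np.random.default_rng(0)
n_latents, d_latent, d_input, d_attn = 4, 16, 80, 32
latents = rng.standard_normal((n_latents, d_latent))
w_q = rng.standard_normal((d_latent, d_attn)) * 0.1
w_k = rng.standard_normal((d_input, d_attn)) * 0.1
w_v = rng.standard_normal((d_input, d_attn)) * 0.1

# Two utterances of different lengths map to outputs of the same shape.
short_out = asymmetric_cross_attention(
    latents, rng.standard_normal((120, d_input)), w_q, w_k, w_v)
long_out = asymmetric_cross_attention(
    latents, rng.standard_normal((950, d_input)), w_q, w_k, w_v)
print(short_out.shape, long_out.shape)  # (4, 32) (4, 32)
```

Because the attention weights are computed from the input itself, the weighting over frames changes per utterance, in contrast to a fixed pooling function such as a plain temporal mean. The score matrix has shape (n_latents, n_frames), so the cost grows linearly with the number of frames when the latent count is held fixed.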

Publication: arXiv e-prints

Pub Date: May 2023

DOI: 10.48550/arXiv.2305.12121

arXiv: arXiv:2305.12121

Bibcode: 2023arXiv230512121Y

Keywords: Computer Science - Sound; Computer Science - Machine Learning; Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print: Accepted to INTERSPEECH 2023
