VoiceVector: Multimodal Enrolment Vectors for Speaker Separation

doi:10.48550/arXiv.2501.01401

We present a transformer-based architecture for separating the voice of a target speaker from multiple other speakers and ambient noise. We achieve this by using two separate neural networks: (A) an enrolment network designed to craft speaker-specific embeddings, exploiting various combinations of audio and visual modalities; and (B) a separation network that accepts both the noisy signal and enrolment vectors as inputs, and outputs the clean signal of the target speaker. The novelties are: (i) the enrolment vector can be produced from audio only, from audio-visual data (using lip movements), or from visual data alone (using lip movements from silent video); and (ii) the separation can be flexibly conditioned on multiple positive and negative enrolment vectors. We compare with previous methods and obtain superior performance.
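
The abstract does not specify how several positive and negative enrolment vectors are combined or how they condition the separation network, so the following is only an illustrative sketch under assumptions of our own, not the authors' method. It assumes enrolment vectors are fixed-length arrays, pools each set by averaging, and replaces the transformer separation network with a toy per-frame sigmoid mask. The names `pool_enrolments` and `toy_separation`, the averaging step, and all shapes are hypothetical.

```python
import numpy as np


def pool_enrolments(positives, negatives, dim):
    """Average each set of enrolment vectors; an empty set pools to zeros.

    Assumption: mean pooling. The paper may combine vectors differently.
    """
    def mean_or_zero(vectors):
        if len(vectors) == 0:
            return np.zeros(dim)
        return np.mean(np.stack(vectors), axis=0)

    # Concatenate so a downstream network can tell the two roles apart.
    return np.concatenate([mean_or_zero(positives), mean_or_zero(negatives)])


def toy_separation(noisy_frames, conditioning, weights):
    """Hypothetical stand-in for the separation network: a sigmoid mask."""
    # Repeat the conditioning vector for every frame and append it to the features.
    tiled = np.tile(conditioning, (noisy_frames.shape[0], 1))
    features = np.concatenate([noisy_frames, tiled], axis=1)
    mask = 1.0 / (1.0 + np.exp(-(features @ weights)))
    return noisy_frames * mask


rng = np.random.default_rng(0)
noisy = rng.standard_normal((5, 8))        # 5 frames, 8 features per frame
positives = [rng.standard_normal(4), rng.standard_normal(4)]  # target speaker
negatives = [rng.standard_normal(4)]       # an interfering speaker
cond = pool_enrolments(positives, negatives, dim=4)
weights = rng.standard_normal((8 + 8, 8))  # random, untrained weights
estimate = toy_separation(noisy, cond, weights)
```

Because the weights here are random, `estimate` carries no meaningful separation; the sketch only shows how a variable number of positive and negative vectors could be reduced to one fixed-size conditioning input.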

Publication:

arXiv e-prints

Pub Date:

January 2025

DOI:

10.48550/arXiv.2501.01401

arXiv:

arXiv:2501.01401

Bibcode:

2025arXiv250101401R

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print:

2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
