Multi-resolution location-based training for multi-channel continuous speech separation

doi:10.48550/arXiv.2301.06458

The performance of automatic speech recognition (ASR) systems degrades severely when speech from multiple talkers overlaps. In meeting environments, speech separation is typically performed to improve the robustness of ASR systems. Recently, location-based training (LBT) was proposed as a new training criterion for multi-channel talker-independent speaker separation. Assuming a fixed array geometry, LBT outperforms the widely used permutation-invariant training on fully overlapped utterances and in matched reverberant conditions. This paper extends LBT to conversational multi-channel speaker separation. We introduce multi-resolution LBT to estimate the complex spectrograms from low to high time and frequency resolutions. With multi-resolution LBT, convolutional kernels are assigned consistently based on speaker locations in physical space. Evaluation results show that multi-resolution LBT consistently outperforms other competitive methods on the recorded LibriCSS corpus.

Publication: arXiv e-prints

Pub Date: January 2023

DOI: 10.48550/arXiv.2301.06458

arXiv: arXiv:2301.06458

Bibcode: 2023arXiv230106458T

Keywords: Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Sound

E-Print: Submitted to ICASSP 23
