ConSep: a Noise- and Reverberation-Robust Speech Separation Framework by Magnitude Conditioning
Abstract
Speech separation has recently made significant progress thanks to the fine-grained vision used in time-domain methods. However, several studies have shown that adopting Short-Time Fourier Transform (STFT) for feature extraction could be beneficial when encountering harsher conditions, such as noise or reverberation. Therefore, we propose a magnitude-conditioned time-domain framework, ConSep, to inherit the beneficial characteristics. The experiment shows that ConSep promotes performance in anechoic, noisy, and reverberant settings compared to two celebrated methods, SepFormer and Bi-Sep. Furthermore, we visualize the components of ConSep to strengthen the advantages and cohere with the actualities we have found in preliminary studies.
- Publication:
-
arXiv e-prints
- Pub Date:
- March 2024
- DOI:
- 10.48550/arXiv.2403.01792
- arXiv:
- arXiv:2403.01792
- Bibcode:
- 2024arXiv240301792H
- Keywords:
-
- Computer Science - Sound;
- Electrical Engineering and Systems Science - Audio and Speech Processing