Time-Domain Multi-Modal Bone/Air Conducted Speech Enhancement
Abstract
Previous studies have shown that integrating video signals, as a complementary modality, can improve speech enhancement (SE) performance. However, video clips usually contain large amounts of data, demand considerable computational resources, and may therefore complicate the SE system. As an alternative source, a bone-conducted speech signal has a moderate data size while preserving speech-phoneme structures, and thus complements its air-conducted counterpart. In this study, we propose a novel time-domain multi-modal SE structure that leverages both bone- and air-conducted signals. We examine two ensemble-learning-based strategies, early fusion (EF) and late fusion (LF), for integrating the two types of speech signals, and adopt a deep-learning-based fully convolutional network (FCN) to perform the enhancement. Experimental results on a Mandarin corpus indicate that the proposed multi-modal SE structure, which integrates bone- and air-conducted signals, significantly outperforms its single-source counterparts (using a bone- or air-conducted signal only) across various speech evaluation metrics. In addition, the LF strategy achieves better results than the EF strategy within this multi-modal SE structure.
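The abstract describes the two fusion strategies only at a high level. The sketch below is a minimal, illustrative PyTorch rendering of the general idea, not the authors' exact model: the class names `EarlyFusionFCN` and `LateFusionFCN`, and all layer counts, filter widths, and kernel sizes are assumptions made for illustration. In EF, the two raw waveforms are stacked as input channels of a single time-domain FCN; in LF, each modality passes through its own convolutional branch before the deep features are fused and decoded.

```python
# Minimal sketch (not the authors' exact architecture): early fusion (EF)
# vs. late fusion (LF) of bone- and air-conducted waveforms with a
# time-domain fully convolutional network. All sizes are illustrative.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    # 1-D convolution over raw waveform samples; padding keeps length fixed.
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=55, padding=27),
        nn.BatchNorm1d(out_ch),
        nn.LeakyReLU(0.3),
    )


class EarlyFusionFCN(nn.Module):
    """EF: stack the two waveforms as input channels, enhance with one FCN."""

    def __init__(self, width=30, depth=4):
        super().__init__()
        layers = [conv_block(2, width)]  # 2 input channels: air + bone
        layers += [conv_block(width, width) for _ in range(depth - 1)]
        layers += [nn.Conv1d(width, 1, kernel_size=55, padding=27)]  # waveform out
        self.net = nn.Sequential(*layers)

    def forward(self, air, bone):
        x = torch.stack([air, bone], dim=1)  # (batch, 2, samples)
        return self.net(x).squeeze(1)


class LateFusionFCN(nn.Module):
    """LF: encode each modality separately, fuse deep features, then decode."""

    def __init__(self, width=30, depth=2):
        super().__init__()

        def branch():
            return nn.Sequential(
                conv_block(1, width),
                *[conv_block(width, width) for _ in range(depth - 1)],
            )

        self.air_branch, self.bone_branch = branch(), branch()
        self.fusion = nn.Sequential(
            conv_block(2 * width, width),
            nn.Conv1d(width, 1, kernel_size=55, padding=27),
        )

    def forward(self, air, bone):
        fa = self.air_branch(air.unsqueeze(1))    # per-modality features
        fb = self.bone_branch(bone.unsqueeze(1))
        return self.fusion(torch.cat([fa, fb], dim=1)).squeeze(1)


if __name__ == "__main__":
    air = torch.randn(4, 16000)   # 1 s of noisy air-conducted speech at 16 kHz
    bone = torch.randn(4, 16000)  # time-aligned bone-conducted speech
    print(EarlyFusionFCN()(air, bone).shape)  # torch.Size([4, 16000])
    print(LateFusionFCN()(air, bone).shape)   # torch.Size([4, 16000])
```

The design trade-off the paper evaluates follows directly from this structure: EF lets the network exploit cross-modal correlations from the first layer onward, while LF lets each branch specialize in its modality before fusion, which the reported results favor.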
- Publication:
- IEEE Signal Processing Letters
- Pub Date:
- 2020
- DOI:
- 10.1109/LSP.2020.3000968
- arXiv:
- arXiv:1911.09847
- Bibcode:
- 2020ISPL...27.1035Y
- Keywords:
- Electrical Engineering and Systems Science - Audio and Speech Processing;
- Computer Science - Sound;
- Electrical Engineering and Systems Science - Signal Processing
- E-Print:
- multi-modal, bone/air-conducted signals, speech enhancement, fully convolutional network