Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset

doi:10.48550/arXiv.2412.18839

Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground-truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper, video, and text data from two speakers and benchmark all methods on this dataset. Speech samples and the dataset are available at https://diff-nam.github.io/DiffNAM/.

Publication:

arXiv e-prints

Pub Date:

December 2024

DOI:

10.48550/arXiv.2412.18839

arXiv:

arXiv:2412.18839

Bibcode:

2024arXiv241218839S

Keywords:

Computer Science - Sound;
Computer Science - Artificial Intelligence;
Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print:

Accepted at IEEE ICASSP 2025
