Audio-Visual Speech Enhancement with Score-Based Generative Models
Abstract
This paper introduces an audio-visual speech enhancement system that leverages score-based generative models, also known as diffusion models, conditioned on visual information. In particular, we exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading. The layer-wise features of its transformer-based encoder are aggregated, time-aligned, and incorporated into the noise conditional score network. Experimental evaluations show that the proposed audio-visual speech enhancement system yields improved speech quality and reduces generative artifacts such as phonetic confusions compared with its audio-only counterpart. The latter is supported by the word error rate of a downstream automatic speech recognition model, which decreases noticeably, especially at low input signal-to-noise ratios.
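The abstract describes three conditioning steps: aggregating layer-wise encoder features, time-aligning them to the audio feature rate, and injecting them into the score network. The sketch below illustrates one common way to realize the first two steps, a learnable weighted sum over transformer layers followed by linear interpolation along the time axis; the class name, dimensions, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualConditioner(nn.Module):
    """Hypothetical conditioning module: aggregates layer-wise transformer
    features with learnable softmax weights, then time-aligns the result
    to the audio feature rate before it is fed to the score network."""

    def __init__(self, num_layers: int, embed_dim: int, cond_dim: int):
        super().__init__()
        # one scalar weight per encoder layer (softmax-normalized sum)
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(embed_dim, cond_dim)

    def forward(self, layer_feats: torch.Tensor, target_len: int) -> torch.Tensor:
        # layer_feats: (num_layers, batch, video_frames, embed_dim)
        w = F.softmax(self.layer_weights, dim=0)
        agg = torch.einsum("l,lbtd->btd", w, layer_feats)
        # time-align video-rate features (e.g. 25 fps) to the audio
        # feature rate by linear interpolation along the time axis
        agg = agg.transpose(1, 2)                    # (batch, dim, T_video)
        agg = F.interpolate(agg, size=target_len,
                            mode="linear", align_corners=False)
        agg = agg.transpose(1, 2)                    # (batch, T_audio, dim)
        return self.proj(agg)

# usage: 12 encoder layers, 768-dim embeddings, 256-dim conditioning
cond = VisualConditioner(num_layers=12, embed_dim=768, cond_dim=256)
feats = torch.randn(12, 2, 50, 768)   # a 2-second clip at 25 fps
c = cond(feats, target_len=200)       # align to 200 audio frames
print(c.shape)                        # torch.Size([2, 200, 256])
```

How the aligned embedding enters the noise conditional score network (e.g. via concatenation or cross-attention) is a design choice the abstract does not specify, so it is left out of this sketch.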
- Publication:
- arXiv e-prints
- Pub Date:
- June 2023
- DOI:
- 10.48550/arXiv.2306.01432
- arXiv:
- arXiv:2306.01432
- Bibcode:
- 2023arXiv230601432R
- Keywords:
- Electrical Engineering and Systems Science - Audio and Speech Processing;
- Computer Science - Machine Learning
- E-Print:
- Submitted to ITG Conference on Speech Communication