Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
Abstract
Audio-Visual Speech-to-Speech Translation (AVS2S) typically prioritizes improving translation quality and naturalness. However, an equally critical aspect of audio-visual content is lip-synchrony, that is, ensuring that the movements of the lips match the spoken content, which is essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, representing a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, without any degradation in translation quality.
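The abstract does not spell out how the lip-synchrony loss is formulated or weighted, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that LSE-D (Lip Sync Error Distance, lower is better) can be approximated as the mean Euclidean distance between paired audio and lip-region embeddings from some SyncNet-style model, and that the synchrony term is added to the translation loss as a weighted sum. The names `lse_d`, `total_loss`, `relative_reduction`, and the parameter `sync_weight` are hypothetical and chosen for illustration.

```python
import math


def lse_d(audio_emb, video_emb):
    """Mean Euclidean distance between paired audio and lip embeddings.

    Assumption: each argument is a list of equal-length vectors, one per
    time window, produced by a SyncNet-style model (not specified here).
    """
    if len(audio_emb) != len(video_emb) or not audio_emb:
        raise ValueError("need the same non-zero number of audio and video windows")
    dists = [math.dist(a, v) for a, v in zip(audio_emb, video_emb)]
    return sum(dists) / len(dists)


def total_loss(translation_loss, sync_loss, sync_weight=0.1):
    """Weighted sum of the two terms; sync_weight is a hypothetical hyperparameter."""
    return translation_loss + sync_weight * sync_loss


def relative_reduction(baseline, improved):
    """Fractional reduction of a lower-is-better metric."""
    return (baseline - improved) / baseline


# Toy embeddings: first window is 5.0 apart (3-4-5 triangle), second is identical.
audio = [[0.0, 0.0], [1.0, 1.0]]
video = [[3.0, 4.0], [1.0, 1.0]]
sync = lse_d(audio, video)          # (5.0 + 0.0) / 2
loss = total_loss(2.0, sync)        # 2.0 + 0.1 * sync

# Illustrative arithmetic only: if the 9.2% figure is read as a reduction of
# the averaged score, the baseline would be about 10.67 / (1 - 0.092).
# The abstract does not report this number directly.
implied_baseline = 10.67 / (1 - 0.092)
```

A weighted sum keeps the translation objective dominant while still penalizing audio-visual mismatch; in practice the weight would need tuning, and the abstract gives no value for it.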
- Publication: arXiv e-prints
- Pub Date: December 2024
- arXiv: arXiv:2412.16530
- Bibcode: 2024arXiv241216530G
- Keywords: Computer Science - Sound; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Multimedia; Electrical Engineering and Systems Science - Audio and Speech Processing
- E-Print: Accepted at ICASSP, 4 pages