Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
Abstract
Lip reading aims to predict spoken language by analyzing lip movements. Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers due to their sensitivity to variations in visual information such as lip appearances. To address this challenge, speaker adaptive lip reading technologies have advanced by focusing on effectively adapting a lip reading model to target speakers in the visual modality. The effectiveness of adapting language information, such as vocabulary choice, of the target speaker has not been explored in the previous works. Moreover, existing datasets for speaker adaptation have limited vocabulary size and pose variations, limiting the validation of previous speaker-adaptive methods in real-world scenarios. To address these issues, we propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both vision and language levels. Specifically, we integrate prompt tuning and the LoRA approach, applying them to a pre-trained lip reading model to effectively adapt the model to target speakers. In addition, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3. It contains a vocabulary of approximately 100K words, offers diverse pose variations, and enables the validation of adaptation methods in wild, sentence-level lip reading for the first time. Through various experiments, we demonstrate that the existing speaker-adaptive method also improves performance in the wild at the sentence level. Moreover, with the proposed adaptation method, we show that the proposed method achieves larger improvements when applied to the target speaker, compared to the previous works.
- Publication:
-
arXiv e-prints
- Pub Date:
- September 2024
- DOI:
- 10.48550/arXiv.2409.00986
- arXiv:
- arXiv:2409.00986
- Bibcode:
- 2024arXiv240900986Y
- Keywords:
-
- Computer Science - Computer Vision and Pattern Recognition;
- Computer Science - Computation and Language;
- Electrical Engineering and Systems Science - Audio and Speech Processing;
- Electrical Engineering and Systems Science - Image and Video Processing
- E-Print:
- Code available: https://github.com/JeongHun0716/Personalized-Lip-Reading