Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

doi:10.48550/arXiv.2411.16789

Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we generate detailed textual descriptions of sign language components using MLLMs. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be effectively utilized in SLT.

Publication:

arXiv e-prints

Pub Date:

November 2024

DOI:

10.48550/arXiv.2411.16789

arXiv:

arXiv:2411.16789

Bibcode:

2024arXiv241116789K

Keywords:

Computer Science - Computer Vision and Pattern Recognition;
Computer Science - Computation and Language
