LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
Abstract
CLIP is a foundational multimodal model that aligns image and text features into a shared space using contrastive learning on large-scale image-text pairs. Its strength lies in leveraging natural language as a rich supervisory signal. With the rapid progress of large language models (LLMs), we explore their potential to further enhance CLIP's multimodal representation learning. This work introduces a fine-tuning approach that integrates LLMs with the pretrained CLIP visual encoder, leveraging LLMs' advanced text understanding and open-world knowledge to improve CLIP's ability to process long and complex captions. To address the challenge of LLMs' autoregressive nature, we propose a caption-to-caption contrastive learning framework to enhance the discriminative power of their outputs. Our method achieves substantial performance gains on various downstream tasks, demonstrating the effectiveness of combining LLMs with CLIP for enhanced multimodal learning.
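The abstract does not include code, but the core idea of caption-to-caption contrastive learning can be illustrated briefly. The sketch below is a minimal, assumption-laden example, not the authors' implementation: it applies a symmetric InfoNCE loss to pairs of LLM-derived caption embeddings, treating two captions of the same image as a positive pair and all other captions in the batch as negatives. The function name `caption_contrastive_loss`, the temperature value, and the use of pooled hidden states as embeddings are hypothetical choices for illustration.

```python
# Minimal sketch (not the paper's code) of a caption-to-caption contrastive
# objective: embeddings of two captions describing the same image form a
# positive pair; captions of other images in the batch act as negatives.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(embed_a: torch.Tensor,
                             embed_b: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired caption embeddings.

    embed_a, embed_b: (batch, dim) embeddings of two captions per image,
    e.g. pooled hidden states from an LLM (an assumption for this sketch).
    """
    a = F.normalize(embed_a, dim=-1)
    b = F.normalize(embed_b, dim=-1)
    logits = a @ b.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    loss_ab = F.cross_entropy(logits, targets)       # a -> b direction
    loss_ba = F.cross_entropy(logits.t(), targets)   # b -> a direction
    return (loss_ab + loss_ba) / 2

if __name__ == "__main__":
    # Random placeholder embeddings stand in for real LLM caption features.
    a = torch.randn(8, 1024)
    b = torch.randn(8, 1024)
    print(caption_contrastive_loss(a, b).item())
```

Training the LLM output space (or an adapter on top of it) with such an objective would make caption embeddings more discriminative, which is the property the abstract identifies as missing from purely autoregressive text features.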
- Publication: arXiv e-prints
- Pub Date: November 2024
- DOI: 10.48550/arXiv.2411.04997
- arXiv: arXiv:2411.04997
- Bibcode: 2024arXiv241104997H
- Keywords: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Computation and Language