Joint representation learning for text and 3D point cloud
Abstract
Recent advancements in vision-language pre-training (e.g., CLIP) have enabled 2D vision models to benefit from language supervision. However, joint representation learning of 3D point clouds and text remains under-explored due to challenges in acquiring 3D-Text data pairs. Prior works project point clouds into 2D depth maps and apply CLIP directly, but this sacrifices 3D structural information and limits their applicability. In this paper, we put forward Text4Point, a novel framework to construct language-guided 3D models for dense prediction tasks. Text4Point utilizes 2D images as a bridge to connect the point cloud and language modalities. It follows a pre-training and fine-tuning paradigm. During pre-training, we leverage dense contrastive learning to align the image and point cloud representations using readily available RGB-D data. Together with the well-aligned image and text features provided by CLIP, the point cloud features are implicitly aligned with the text embeddings. Further, we propose a Text Querying Module to integrate language information into 3D representation learning by querying text embeddings with point cloud features. For fine-tuning, the model learns 3D representations under informative language guidance without 2D images. Extensive experiments demonstrate that Text4Point consistently improves performance on various dense prediction tasks.
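The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the loss form or the module design, so the sketch assumes a symmetric InfoNCE loss over matched point/pixel features for the dense contrastive step and a single scaled dot-product attention step for the text querying step. All function names, the temperature value, and the toy features are hypothetical, and plain Python lists stand in for tensors to keep the example self-contained.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def dense_info_nce(point_feats, pixel_feats, temperature=0.07):
    """Symmetric InfoNCE over matched point/pixel feature pairs.

    Assumption: point_feats[i] and pixel_feats[i] describe the same scene
    location (e.g. paired through RGB-D projection); all other entries in
    the batch serve as negatives. The temperature is illustrative.
    """
    p = [l2_normalize(f) for f in point_feats]
    q = [l2_normalize(f) for f in pixel_feats]
    n = len(p)
    loss = 0.0
    for i in range(n):
        # Point -> pixel direction: the matching pixel should score highest.
        lp_pq = log_softmax([dot(p[i], q[j]) / temperature for j in range(n)])
        # Pixel -> point direction: the matching point should score highest.
        lp_qp = log_softmax([dot(q[i], p[j]) / temperature for j in range(n)])
        loss -= 0.5 * (lp_pq[i] + lp_qp[i])
    return loss / n


def query_text(point_feat, text_embeds):
    """Use one point feature as a query over text embeddings (keys and values).

    Assumption: scaled dot-product attention stands in for the Text
    Querying Module, whose actual design is not given in the abstract.
    """
    scale = math.sqrt(len(point_feat))
    log_w = log_softmax([dot(point_feat, t) / scale for t in text_embeds])
    weights = [math.exp(w) for w in log_w]
    dim = len(text_embeds[0])
    return [sum(w * t[d] for w, t in zip(weights, text_embeds)) for d in range(dim)]
```

Under these assumptions, correctly paired features yield a lower loss than mismatched ones, and `query_text` returns a convex combination of the text embeddings weighted toward the embedding most similar to the point feature.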
- Publication:
- Pattern Recognition
- Pub Date:
- March 2024
- DOI:
- arXiv:
- arXiv:2412.18930
- Bibcode:
- 2024PatRe.14710086H
- Keywords:
- Point cloud;
- Multi-modal learning;
- Representation learning;
- Computer Science - Computer Vision and Pattern Recognition
- E-Print:
- 24 pages, 9 figures, accepted in ACCV2024