P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders

doi:10.48550/arXiv.2408.10007

3D pre-training is crucial to 3D perception tasks. However, limited by the difficulty of collecting clean 3D data, 3D pre-training has consistently faced data scaling challenges. Inspired by semi-supervised learning, which leverages limited labeled data together with a large amount of unlabeled data, we propose a novel self-supervised pre-training framework that uses both real 3D data and pseudo-3D data lifted from images by a large depth estimation model. Another challenge is efficiency. Previous methods such as Point-BERT and Point-MAE employ k nearest neighbors to embed 3D tokens, which requires quadratic time complexity. To pre-train efficiently on such a large amount of data, we propose a linear-time-complexity token embedding strategy and a training-efficient 2D reconstruction target. Our method achieves state-of-the-art performance in 3D classification and few-shot learning while maintaining high pre-training and downstream fine-tuning efficiency.
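The two ideas named in the abstract, lifting an image to pseudo-3D points with a depth map and embedding tokens without a k-nearest-neighbor search, can be illustrated with a minimal sketch. This is not the paper's implementation: the pinhole camera model, the intrinsics `fx`, `fy`, `cx`, `cy`, the voxel-hashing grouping, and the function names `lift_depth_to_points` and `voxel_tokens` are all assumptions chosen for illustration, and the abstract does not describe the actual embedding strategy.

```python
from collections import defaultdict


def lift_depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) into 3D points.

    Assumes a pinhole camera; fx, fy, cx, cy are hypothetical intrinsics.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip pixels with no valid depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def voxel_tokens(points, voxel_size):
    """Group points into voxels in a single pass over the points.

    Hashing each point to its voxel index takes expected O(N) time,
    in contrast to an all-pairs nearest-neighbor search.
    Returns one centroid per occupied voxel (an illustrative choice).
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append((x, y, z))
    tokens = {}
    for key, pts in buckets.items():
        n = len(pts)
        tokens[key] = tuple(sum(c) / n for c in zip(*pts))
    return tokens


# Toy 2x2 depth map; the zero entry marks an invalid pixel.
depth = [[1.0, 1.0], [2.0, 0.0]]
points = lift_depth_to_points(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
tokens = voxel_tokens(points, voxel_size=1.0)
```

In practice the depth map would come from a monocular depth estimator and the grouping would run on a GPU; the sketch only shows why a hash-based grouping scales linearly with the number of points.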

Publication:

arXiv e-prints

Pub Date:

August 2024

DOI:

10.48550/arXiv.2408.10007

arXiv:

arXiv:2408.10007

Bibcode:

2024arXiv240810007C

Keywords:

Computer Science - Computer Vision and Pattern Recognition

E-Print:

Under review. Pre-print
