Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving
Abstract
Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety critical applications such as autonomous driving where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants. Compared to the recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in the unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation. Experiments on the Waymo Open Dataset show that our approach outperforms the prior work by significant margins on various unsupervised 3D perception tasks.
- Publication:
-
arXiv e-prints
- Pub Date:
- September 2023
- DOI:
- arXiv:
- arXiv:2309.14491
- Bibcode:
- 2023arXiv230914491N
- Keywords:
-
- Computer Science - Computer Vision and Pattern Recognition
- E-Print:
- ICCV 2023