DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models
Abstract
We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://www.robot-learning.uk/dinobot.
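The abstract describes a two-stage pipeline: image-level features (the ViT's CLS token) to retrieve the most similar demonstration object, and pixel-level features (patch tokens) to align the end-effector with the novel object via correspondences before replaying the demonstration. Below is a minimal sketch of that idea, not the authors' implementation: it loads a real DINO ViT from torch.hub, but the `demo_memory` structure and all function names are hypothetical illustrations.

```python
# Sketch of DINOBot-style retrieval and alignment with DINO ViT features.
# Assumptions: images are preprocessed (1, 3, H, W) tensors; each entry of
# demo_memory is a dict with a precomputed 'embedding' (hypothetical schema).
import torch
import torch.nn.functional as F

model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

@torch.no_grad()
def image_embedding(img):
    """Image-level descriptor: the ViT's CLS-token output, L2-normalised."""
    return F.normalize(model(img), dim=-1)               # (1, 384)

@torch.no_grad()
def patch_features(img):
    """Pixel-level descriptors: per-patch tokens from the last layer."""
    tokens = model.get_intermediate_layers(img, n=1)[0]  # (1, 1 + N, 384)
    return F.normalize(tokens[:, 1:, :], dim=-1)         # drop CLS -> (1, N, 384)

def retrieve(live_img, demo_memory):
    """Return the stored demonstration whose object looks most like the live one."""
    q = image_embedding(live_img)
    sims = [F.cosine_similarity(q, d['embedding']).item() for d in demo_memory]
    return demo_memory[max(range(len(sims)), key=sims.__getitem__)]

def mutual_nn_matches(live_img, demo_img):
    """Mutual nearest-neighbour patch correspondences between the two images;
    correspondences like these can drive the end-effector alignment step."""
    a, b = patch_features(live_img)[0], patch_features(demo_img)[0]
    sim = a @ b.T                        # (N, N) patch-to-patch similarity
    ab, ba = sim.argmax(dim=1), sim.argmax(dim=0)
    return [(i, j.item()) for i, j in enumerate(ab) if ba[j] == i]
```

The split mirrors the abstract's claim: retrieval needs only a global, image-level comparison, while alignment needs spatially grounded, pixel-level matches between the live view and the retrieved demonstration's view.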
- Publication: arXiv e-prints
- Pub Date: February 2024
- DOI: 10.48550/arXiv.2402.13181
- arXiv: arXiv:2402.13181
- Bibcode: 2024arXiv240213181D
- Keywords: Computer Science - Robotics; Computer Science - Machine Learning
- E-Print: To appear at 2024 IEEE International Conference on Robotics and Automation (ICRA)