End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering
Abstract
We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/
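The abstract describes framing navigation as a single question-answering step: the VLM receives the current visual observation and directly selects the next action, with no separate perception, planning, or control modules and no fine-tuning. Below is a minimal sketch of what such a zero-shot action-selection loop could look like. The `query_vlm` helper, the discrete action set, and the prompt wording are illustrative assumptions, not the paper's actual implementation; the linked repository contains the real one.

```python
# Minimal sketch of a zero-shot VLM-as-policy loop (illustrative only).
# `query_vlm` is a hypothetical helper that sends an image plus a text prompt
# to some vision-language model and returns its text reply; the action set and
# prompt below are assumptions, not the VLMnav implementation.

from typing import Callable

ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]

PROMPT_TEMPLATE = (
    "You are navigating a building. Goal: {goal}.\n"
    "The image shows your current first-person view.\n"
    "Which action should you take next? Answer with exactly one of: {actions}."
)


def select_action(image_bytes: bytes, goal: str,
                  query_vlm: Callable[[bytes, str], str]) -> str:
    """Frame action selection as a visual question and parse the VLM's answer."""
    prompt = PROMPT_TEMPLATE.format(goal=goal, actions=", ".join(ACTIONS))
    reply = query_vlm(image_bytes, prompt).lower()
    # Pick the first known action mentioned in the reply; default to "stop".
    for action in ACTIONS:
        if action in reply:
            return action
    return "stop"


def navigate(env, goal: str, query_vlm: Callable[[bytes, str], str],
             max_steps: int = 100) -> None:
    """Closed-loop navigation: observe, ask the VLM, act, repeat."""
    for _ in range(max_steps):
        image = env.get_observation()   # current first-person RGB frame as bytes
        action = select_action(image, goal, query_vlm)
        env.step(action)                # execute the chosen action in one step
        if action == "stop":
            break
```

The sketch only illustrates the decision structure the abstract describes, i.e. one VLM query per step mapping an observation directly to an action; the paper's actual prompting scheme and action space may differ.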
- Publication: arXiv e-prints
- Pub Date: November 2024
- arXiv: arXiv:2411.05755
- Bibcode: 2024arXiv241105755G
- Keywords: Computer Science - Robotics; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition