LingoQA: Visual Question Answering for Autonomous Driving
Abstract
We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that they fall short of human performance: GPT-4V responds truthfully to 59.6% of the questions, compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient with human evaluations, surpassing existing techniques such as METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.
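As a rough illustration of the agreement measure cited above, the following minimal sketch computes a Spearman rank correlation between two score lists. It is not the authors' evaluation code: the function names `average_ranks` and `spearman`, and the score values, are hypothetical and chosen only for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores (made up, not taken from the paper): human ratings
# and an automatic metric's scores for the same five systems.
human_scores = [0.2, 0.9, 0.5, 0.7, 0.1]
metric_scores = [0.3, 0.8, 0.6, 0.65, 0.05]
rho = spearman(human_scores, metric_scores)
print(f"Spearman correlation: {rho:.3f}")
```

Because the coefficient depends only on rank order, a metric can score highly without matching the human ratings in absolute value, provided it orders the systems the same way.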
- Publication: arXiv e-prints
- Pub Date: December 2023
- DOI: 10.48550/arXiv.2312.14115
- arXiv: arXiv:2312.14115
- Bibcode: 2023arXiv231214115M
- Keywords: Computer Science - Robotics; Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
- E-Print: Accepted to ECCV 2024. Benchmark and dataset are available at https://github.com/wayveai/LingoQA/