LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Abstract
LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction session, significantly improving tool-use performance and enabling new scenarios.
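The abstract describes a skill repository from which tools are activated in response to user inputs. The paper does not expose a public API for this, but the idea can be sketched as a registry of callable tool wrappers plus a dispatch step that executes the tool call the model emits. The sketch below is purely illustrative; `SkillRepository`, the tool names, and the action format are assumptions, not the authors' actual interface.

```python
# Hypothetical sketch of a "skill repository": the assistant emits a structured
# tool call, the runtime dispatches it to a registered pre-trained model, and
# the observation is folded back into the dialogue for the final answer.
from typing import Callable, Dict


class SkillRepository:
    """Registry mapping tool names to pre-trained model wrappers."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, tool: Callable[..., str]) -> None:
        self._tools[name] = tool

    def invoke(self, name: str, **kwargs) -> str:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)


# Placeholder tools standing in for real detection / retrieval models.
repo = SkillRepository()
repo.register("detect", lambda image, query: f"boxes for '{query}' in {image}")
repo.register("retrieve", lambda query: f"knowledge snippets for '{query}'")

# The model's output names a tool and its arguments; the runtime executes it.
action = {"tool": "detect", "args": {"image": "photo.jpg", "query": "dog"}}
observation = repo.invoke(action["tool"], **action["args"])
print(observation)
```

Keeping the image available to every tool call is what the abstract means by the query being "directly grounded" throughout the session: tools receive the original image, not just a text summary of it.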
- Publication:
- arXiv e-prints
- Pub Date:
- November 2023
- DOI:
- 10.48550/arXiv.2311.05437
- arXiv:
- arXiv:2311.05437
- Bibcode:
- 2023arXiv231105437L
- Keywords:
- Computer Science - Computer Vision and Pattern Recognition;
- Computer Science - Artificial Intelligence;
- Computer Science - Computation and Language;
- Computer Science - Machine Learning;
- Computer Science - Multimedia
- E-Print:
- 25 pages, 25M file size. Project Page: https://llava-vl.github.io/llava-plus/