Multimodal LLMs Struggle with Basic Visual Network Analysis: a VNA Benchmark
Abstract
We evaluate the zero-shot ability of GPT-4 and LLaVA to perform simple Visual Network Analysis (VNA) tasks on small-scale graphs. We evaluate the Vision-Language Models (VLMs) on five tasks spanning three foundational network science concepts: identifying the node(s) of maximal degree in a rendered graph, determining whether a signed triad is balanced or unbalanced, and counting connected components. The tasks are designed to be easy for a human who understands the underlying graph-theoretic concepts; each can be solved by counting the appropriate elements in the graph. We find that while GPT-4 consistently outperforms LLaVA, both models struggle with every VNA task we propose. We publicly release the first benchmark for evaluating VLMs on foundational VNA tasks.
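The three concepts all reduce to simple counting. As a minimal sketch (not from the paper), the ground truth for each task could be computed with networkx as below; the example graph and the edge signs are illustrative assumptions, not the benchmark's actual stimuli.

```python
# Sketch: computing ground-truth answers for the three VNA concepts.
# The graph and signed triad here are hypothetical examples.
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 0), (3, 4)])  # triangle plus one extra edge: two components

# Concept 1: node(s) of maximal degree.
max_deg = max(d for _, d in G.degree())
max_deg_nodes = [n for n, d in G.degree() if d == max_deg]

# Concept 2: structural balance of a signed triad.
# A triad is balanced iff the product of its three edge signs is positive.
triad_signs = [+1, -1, -1]  # hypothetical signed triangle
is_balanced = triad_signs[0] * triad_signs[1] * triad_signs[2] > 0

# Concept 3: counting connected components.
n_components = nx.number_connected_components(G)

print(max_deg_nodes, is_balanced, n_components)  # [0, 1, 2] True 2
```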
- Publication: arXiv e-prints
- Pub Date: May 2024
- DOI: 10.48550/arXiv.2405.06634
- arXiv: arXiv:2405.06634
- Bibcode: 2024arXiv240506634W
- Keywords: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Computation and Language
- E-Print: 11 pages, 3 figures