Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet a critical gap persists: `conceptualization', the ability to recognize and reason about the same concept despite variations in visual form, a basic capability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models failed entirely on isomorphism detection and showed only limited success on path and cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations of current AI models in visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: this https URL
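The core idea of representation-invariant reasoning can be made concrete at toy scale. The sketch below is not the paper's code or dataset; it is a hypothetical, self-contained illustration of the isomorphism-detection task: two edge lists describe the same 4-node cycle under different vertex labelings (as a drawing would use different layouts), and a brute-force check over all vertex permutations confirms they share one structure despite differing surface forms.

```python
# Hedged sketch (not the VGA implementation): brute-force graph
# isomorphism for tiny graphs, illustrating that structure is
# invariant under relabeling/re-drawing.
from itertools import permutations


def is_isomorphic(edges_a, edges_b, n):
    """Test whether two undirected graphs on nodes 0..n-1 are isomorphic
    by trying every vertex permutation (feasible only for small n)."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    if len(a) != len(b):
        return False
    for perm in permutations(range(n)):
        # Relabel every edge of graph A under this candidate mapping.
        mapped = {frozenset((perm[u], perm[v])) for u, v in a}
        if mapped == b:
            return True
    return False


# The same 4-cycle under two different labelings: a square drawing
# versus a crossed, visually dissimilar drawing.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
relabeled = [(2, 0), (0, 3), (3, 1), (1, 2)]
print(is_isomorphic(square, relabeled, 4))  # True

# A 3-edge path is not isomorphic to a triangle plus an isolated node.
print(is_isomorphic([(0, 1), (1, 2), (2, 3)],
                    [(0, 1), (1, 2), (2, 0)], 4))  # False
```

A model with genuine conceptualization should, like this exhaustive check, return the same verdict regardless of how the graph is drawn; the paper's finding is that current vision models instead fail when the layout changes.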
@article{babaiee2025_2506.06242,
  title={Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models},
  author={Zahra Babaiee and Peyman M. Kiasari and Daniela Rus and Radu Grosu},
  journal={arXiv preprint arXiv:2506.06242},
  year={2025}
}