Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet a critical gap persists: `conceptualization', the ability to recognize and reason about the same concept despite variations in visual form, a basic capability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models failed entirely on isomorphism detection and showed only limited success on path and cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations of current AI models in visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: this https URL
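The core idea of representation-invariant reasoning can be made concrete at toy scale. The sketch below is not the paper's code or dataset; it is a hypothetical, self-contained illustration of the isomorphism-detection task: two edge lists describe the same 4-node cycle under different vertex labelings (as a drawing would use different layouts), and a brute-force check over all vertex permutations confirms they share one structure despite differing surface forms.

```python
# Hedged sketch (not the VGA implementation): brute-force graph
# isomorphism for tiny graphs, illustrating that structure is
# invariant under relabeling/re-drawing.
from itertools import permutations


def is_isomorphic(edges_a, edges_b, n):
    """Test whether two undirected graphs on nodes 0..n-1 are isomorphic
    by trying every vertex permutation (feasible only for small n)."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    if len(a) != len(b):
        return False
    for perm in permutations(range(n)):
        # Relabel every edge of graph A under this candidate mapping.
        mapped = {frozenset((perm[u], perm[v])) for u, v in a}
        if mapped == b:
            return True
    return False


# The same 4-cycle under two different labelings: a square drawing
# versus a crossed, visually dissimilar drawing.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
relabeled = [(2, 0), (0, 3), (3, 1), (1, 2)]
print(is_isomorphic(square, relabeled, 4))  # True

# A 3-edge path is not isomorphic to a triangle plus an isolated node.
print(is_isomorphic([(0, 1), (1, 2), (2, 3)],
                    [(0, 1), (1, 2), (2, 0)], 4))  # False
```

A model with genuine conceptualization should, like this exhaustive check, return the same verdict regardless of how the graph is drawn; the paper's finding is that current vision models instead fail when the layout changes.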
@article{babaiee2025_2506.06242,
  title={Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models},
  author={Zahra Babaiee and Peyman M. Kiasari and Daniela Rus and Radu Grosu},
  journal={arXiv preprint arXiv:2506.06242},
  year={2025}
}