VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models

23 February 2025
Jen-Tse Huang
Dasen Dai
Jen-Yuan Huang
Youliang Yuan
Xiaoyuan Liu
Wenxuan Wang
Wenxiang Jiao
Pinjia He
Zhaopeng Tu
Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable advancements in multimodal understanding; however, their fundamental visual cognitive abilities remain largely underexplored. To bridge this gap, we introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT), a well-established psychometric assessment of human cognition. VisFactor digitalizes vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks including spatial reasoning, perceptual speed, and pattern recognition. We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL, using VisFactor under diverse prompting strategies like Chain-of-Thought and Multi-Agent Debate. Our findings reveal a concerning deficiency in current MLLMs' fundamental visual cognition, with performance frequently approaching random guessing and showing only marginal improvements even with advanced prompting techniques. These results underscore the critical need for focused research to enhance the core visual reasoning capabilities of MLLMs. To foster further investigation in this area, we release our VisFactor benchmark at this https URL.
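The abstract reports that MLLM accuracy on several subtests "frequently approach[es] random guessing." The sketch below (Python) illustrates one way such a comparison could be computed for a multiple-choice subtest: score the model's selected options against the answer key and compare the resulting accuracy to the expected accuracy of uniform random guessing. This is a minimal illustration, not the authors' evaluation code; the item fields and subtest structure are assumptions made for the example.

"""Minimal sketch (not the authors' code): score a hypothetical
multiple-choice visual-cognition subtest and compare model accuracy
to the random-guess baseline."""

from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """One hypothetical multiple-choice item from a VisFactor-style subtest."""
    question_id: str
    num_options: int      # e.g. 4 or 5 answer choices
    correct_option: int   # index of the correct choice
    model_option: int     # index the MLLM selected


def accuracy(items: List[Item]) -> float:
    """Fraction of items the model answered correctly."""
    return sum(it.model_option == it.correct_option for it in items) / len(items)


def chance_level(items: List[Item]) -> float:
    """Expected accuracy of uniform random guessing over the same items."""
    return sum(1.0 / it.num_options for it in items) / len(items)


if __name__ == "__main__":
    # Toy subtest: three 4-option items, of which the model gets one right.
    subtest = [
        Item("sr_01", 4, correct_option=2, model_option=2),
        Item("sr_02", 4, correct_option=0, model_option=3),
        Item("sr_03", 4, correct_option=1, model_option=0),
    ]
    print(f"accuracy = {accuracy(subtest):.2f}, "
          f"chance level = {chance_level(subtest):.2f}")

A model whose accuracy stays near the chance level across a subtest provides little evidence of the underlying visual-cognitive ability that subtest is meant to measure, which is the pattern the paper reports for several current MLLMs.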

@article{huang2025_2502.16435,
  title={VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models},
  author={Jen-Tse Huang and Dasen Dai and Jen-Yuan Huang and Youliang Yuan and Xiaoyuan Liu and Wenxuan Wang and Wenxiang Jiao and Pinjia He and Zhaopeng Tu},
  journal={arXiv preprint arXiv:2502.16435},
  year={2025}
}