
Affordance Benchmark for MLLMs

Main: 6 pages, 4 figures, 1 table; bibliography: 1 page
Abstract

Affordance theory posits that environments inherently offer action possibilities that shape perception and behavior. While Multimodal Large Language Models (MLLMs) excel in vision-language tasks, their ability to perceive affordance, which is crucial for intuitive and safe interactions, remains underexplored. To address this, we introduce A4Bench, a novel benchmark designed to evaluate the affordance perception abilities of MLLMs across two dimensions: 1) Constitutive Affordance, assessing understanding of inherent object properties through 1,282 question-answer pairs spanning nine sub-disciplines, and 2) Transformative Affordance, probing dynamic and contextual nuances (e.g., misleading, time-dependent, cultural, or individual-specific affordance) with 718 challenging question-answer pairs. Evaluating 17 MLLMs (nine proprietary and eight open-source) against human performance, we find that proprietary models generally outperform open-source counterparts, but all exhibit limited capabilities, particularly in transformative affordance perception. Furthermore, even top-performing models, such as Gemini-2.0-Pro (18.05% overall exact match accuracy), significantly lag behind human performance (best: 85.34%, worst: 81.25%). These findings highlight critical gaps in the environmental understanding of MLLMs and provide a foundation for advancing AI systems toward more robust, context-aware interactions. The dataset is available at this https URL.
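
The scores reported above are exact-match accuracies over the benchmark's question-answer pairs. As a rough illustration only (not the authors' evaluation code), the following is a minimal sketch of how such a metric might be computed, assuming string-valued answers and a simple normalization rule; the function names and normalization are illustrative assumptions.

```python
# Minimal sketch (assumption, not from the paper): exact-match accuracy
# for question-answer evaluation of MLLM outputs.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace/trailing periods so 'B.' matches 'b'."""
    return text.strip().rstrip(".").lower()

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions that exactly match the reference answers."""
    assert len(predictions) == len(references), "prediction/reference length mismatch"
    if not references:
        return 0.0
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)

if __name__ == "__main__":
    preds = ["B", "c.", "A"]
    refs = ["B", "C", "D"]
    print(f"Exact match accuracy: {exact_match_accuracy(preds, refs):.2%}")  # 66.67%
```
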

@article{wang2025_2506.00893,
  title={Affordance Benchmark for MLLMs},
  author={Junying Wang and Wenzhe Li and Yalun Wu and Yingji Liang and Yijin Guo and Chunyi Li and Haodong Duan and Zicheng Zhang and Guangtao Zhai},
  journal={arXiv preprint arXiv:2506.00893},
  year={2025}
}