CIVET: Systematic Evaluation of Understanding in VLMs

5 June 2025
Massimo Rizzoli, Simone Alghisi, Olha Khomyn, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi
arXiv: 2506.05146
Main: 8 pages · Appendix: 9 pages · Bibliography: 2 pages · 8 figures · 19 tables
Abstract

While Vision-Language Models (VLMs) have achieved competitive performance on various tasks, their comprehension of the underlying structure and semantics of a scene remains understudied. To investigate VLMs' understanding, we study their capability to recognize object properties and relations in a controlled and interpretable manner. To this end, we introduce CIVET, a novel and extensible framework for systematiC evaluatIon Via controllEd sTimuli. CIVET addresses the lack of standardized, systematic evaluation for assessing VLMs' understanding, enabling researchers to test hypotheses with statistical rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of stimuli, free from annotation noise, dataset-specific biases, and uncontrolled scene complexity. Our findings reveal that 1) current VLMs can accurately recognize only a limited set of basic object properties; 2) their performance heavily depends on the position of the object in the scene; and 3) they struggle to understand basic relations among objects. Furthermore, a comparative evaluation with human annotators reveals that VLMs still fall short of human-level accuracy.
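The abstract's central idea, probing models with an exhaustive factorial set of fully controlled stimuli rather than noisy natural images, can be illustrated with a short sketch. The Python below is a hypothetical illustration only, not the CIVET API: the factor names, the query_vlm stub, and the position-stratified scoring are assumptions chosen for demonstration.

from dataclasses import dataclass
from itertools import product
from collections import defaultdict

# Hypothetical sketch of controlled-stimulus evaluation in the spirit of
# CIVET. Each stimulus fixes every scene factor, so accuracy differences
# can be attributed to the single factor under study.

COLORS = ["red", "green", "blue"]
SHAPES = ["cube", "sphere"]
POSITIONS = ["top-left", "top-right", "bottom-left", "bottom-right", "center"]

@dataclass(frozen=True)
class Stimulus:
    color: str
    shape: str
    position: str

def query_vlm(stimulus: Stimulus, question: str) -> str:
    """Stand-in for a real VLM call (e.g., rendering the scene and sending
    image plus question to a model). Here it returns the ground truth so
    the script runs end to end; a real evaluation would parse the model's
    free-form answer instead."""
    return getattr(stimulus, question)

def evaluate() -> None:
    # Exhaustive factorial design: every factor combination appears once.
    stimuli = [Stimulus(c, s, p) for c, s, p in product(COLORS, SHAPES, POSITIONS)]
    correct_by_position = defaultdict(list)
    for stim in stimuli:
        for question in ("color", "shape"):
            answer = query_vlm(stim, question)
            correct_by_position[stim.position].append(answer == getattr(stim, question))
    # Accuracy stratified by object position, one of the factors the paper
    # reports as strongly affecting VLM performance.
    for position, outcomes in correct_by_position.items():
        print(f"{position}: {sum(outcomes) / len(outcomes):.2f}")

if __name__ == "__main__":
    evaluate()

Because every combination of factors appears exactly once, an accuracy gap across positions can be attributed to position itself rather than to confounded scene content, which is what allows hypotheses to be tested with statistical rigor.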

@article{rizzoli2025_2506.05146,
  title   = {CIVET: Systematic Evaluation of Understanding in VLMs},
  author  = {Massimo Rizzoli and Simone Alghisi and Olha Khomyn and Gabriel Roccabruna and Seyed Mahed Mousavi and Giuseppe Riccardi},
  journal = {arXiv preprint arXiv:2506.05146},
  year    = {2025}
}