AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

Dong She
Xianrong Yao
Liqun Chen
Jinghe Yu
Yang Gao
Zhanpeng Jin
Main: 7 pages · Bibliography: 4 pages · Appendix: 17 pages · 38 figures · 11 tables
Abstract

Vision-Language Models (VLMs) have demonstrated strong perceptual capabilities, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark spanning three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To mitigate these limitations, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.
