AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

Dong She
Xianrong Yao
Liqun Chen
Jinghe Yu
Yang Gao
Zhanpeng Jin
Main: 7 pages · Bibliography: 4 pages · Appendix: 17 pages · 38 figures · 11 tables
Abstract

Vision-Language Models (VLMs) have demonstrated strong perceptual capabilities, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark spanning three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To mitigate these limitations, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.
