What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation

12 February 2026

Zhenlong Yuan

Xiangyan Qu

Jing Tang

Rui Chen

Lei Sun

Ruidong Chen

Hongwei Yu

Chengxuan Qian

Xiangxiang Chu

Shuo Li

Yuyin Zhou



Main: 8 Pages

10 Figures

Bibliography: 3 Pages

6 Tables

Appendix: 7 Pages

Abstract

Multimodal Large Language Models (MLLMs) have shown promise in bridging visual and textual reasoning, yet their reasoning in Open-Vocabulary Human-Object Interaction (OV-HOI) is limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose ImagineAgent, an agentic framework that combines cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. It then dynamically invokes tools, including retrieval augmentation, image cropping, and diffusion models, to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on the SWIG-HOI and HICO-DET datasets show state-of-the-art performance while requiring approximately 20% of the training data used by existing methods, validating the robustness and efficiency of our approach.
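
The abstract does not state the form of the composite reward. As a purely illustrative sketch, one simple way to balance prediction accuracy against tool efficiency is to subtract a capped per-call penalty from a correctness term. The names `Episode`, `composite_reward`, `tool_penalty`, and `max_calls`, and the linear form itself, are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One rollout of a hypothetical tool-using agent."""
    correct: bool    # did the predicted interaction match the label?
    tool_calls: int  # how many tools (retrieval, crop, diffusion) were invoked


def composite_reward(ep: Episode, tool_penalty: float = 0.1, max_calls: int = 5) -> float:
    """Assumed reward: accuracy term minus a capped linear tool-usage penalty."""
    accuracy_term = 1.0 if ep.correct else 0.0
    # Cap the penalty so a long tool chain cannot dominate the accuracy signal.
    efficiency_term = tool_penalty * min(ep.tool_calls, max_calls)
    return accuracy_term - efficiency_term


if __name__ == "__main__":
    print(composite_reward(Episode(correct=True, tool_calls=2)))
    print(composite_reward(Episode(correct=False, tool_calls=10)))
```

Under this assumed form, a correct answer reached with fewer tool calls scores higher than the same answer reached with more, which is the trade-off the abstract describes.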
