Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval

21 March 2025

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) covers diverse tasks whose visual content manipulation intents span domain, scene, object, and attribute. The key challenge in ZS-CIR is to modify a reference image according to the manipulation text so as to accurately retrieve a target image, especially when the reference image is missing essential target content. In this paper, we propose a novel prediction-based mapping network, named PrediCIR, that adaptively predicts the missing target visual content of reference images in the latent space before mapping, for accurate ZS-CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content of a target view, coupled with an action that encodes the manipulation intent derived from existing image-caption pairs. Then, a target content prediction module trains a world model as a predictor to adaptively predict the missing visual information in the latent space, guided by the user intention expressed in the manipulation text. Together, the two modules map an image with the predicted relevant information to a pseudo-word token without extra supervision. Our model shows strong generalization ability on six ZS-CIR tasks. It obtains consistent and significant performance gains ranging from 1.73% to 4.45% over the best prior methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at this https URL.
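
The source-view construction and latent-space prediction described in the abstract can be illustrated with a toy sketch. The abstract does not specify the masking strategy, the predictor architecture, or the training loss, so everything below is an assumption made for illustration: the helper names (`make_source_view`, `toy_predictor`, `masked_latent_loss`), the random patch masking, the mean-plus-action predictor, and the squared-error loss are hypothetical and are not taken from PrediCIR.

```python
import random


def make_source_view(target_patches, mask_ratio=0.5, seed=0):
    """Hypothetical helper: build a "source view" by zeroing a random
    subset of the target view's patch embeddings (assumed strategy)."""
    rng = random.Random(seed)
    num_patches = len(target_patches)
    num_masked = int(round(mask_ratio * num_patches))
    masked_idx = sorted(rng.sample(range(num_patches), num_masked))
    masked = set(masked_idx)
    source_view = [
        [0.0] * len(patch) if i in masked else list(patch)
        for i, patch in enumerate(target_patches)
    ]
    return source_view, masked_idx


def toy_predictor(source_view, action, masked_idx):
    """Stand-in for a learned world model: fill every missing patch with
    the mean visible patch shifted by the action vector (assumption).
    Requires at least one visible patch."""
    masked = set(masked_idx)
    visible = [p for i, p in enumerate(source_view) if i not in masked]
    dim = len(action)
    mean_visible = [sum(p[d] for p in visible) / len(visible) for d in range(dim)]
    filled = [m + a for m, a in zip(mean_visible, action)]
    return [
        list(filled) if i in masked else list(patch)
        for i, patch in enumerate(source_view)
    ]


def masked_latent_loss(predicted, target_patches, masked_idx):
    """Mean squared error over the masked (missing) patches only.
    Requires at least one masked patch."""
    squared = [
        (predicted[i][d] - target_patches[i][d]) ** 2
        for i in masked_idx
        for d in range(len(target_patches[i]))
    ]
    return sum(squared) / len(squared)


# Toy target view: 8 patch embeddings of dimension 4.
target = [[float(4 * i + d) for d in range(4)] for i in range(8)]
source, missing = make_source_view(target, mask_ratio=0.5, seed=0)
action = [0.0, 0.0, 0.0, 0.0]  # placeholder for an embedded manipulation intent
prediction = toy_predictor(source, action, missing)
loss = masked_latent_loss(prediction, target, missing)
```

In a real system the predictor would be a trained network and the action would come from a text encoder; the sketch only shows how a loss restricted to the omitted content ties the source view, the action, and the target view together.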

@article{tang2025_2503.17109,
  title={Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval},
  author={Yuanmin Tang and Jing Yu and Keke Gai and Jiamin Zhuang and Gang Xiong and Gaopeng Gou and Qi Wu},
  journal={arXiv preprint arXiv:2503.17109},
  year={2025}
}
