
MC-LLaVA: Multi-Concept Personalized Vision-Language Model

Abstract

Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at this https URL.
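
The abstract names two mechanisms without giving their details: initializing concept tokens from visual token information, and aggregating location confidence maps at inference. The sketch below is only an illustration of what such steps could look like, not the authors' implementation. The chunked mean pooling, the cell-wise averaging, and all function names (`init_concept_tokens`, `aggregate_confidence`, `peak_location`) are assumptions made for this example.

```python
from typing import List, Tuple

Vector = List[float]
Grid = List[List[float]]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def init_concept_tokens(visual_tokens: List[Vector], num_tokens: int) -> List[Vector]:
    """Hypothetical initializer (assumption, not from the paper's text here):
    split a concept's visual tokens into `num_tokens` contiguous chunks and
    use each chunk's mean as the starting embedding of one concept token."""
    if not 1 <= num_tokens <= len(visual_tokens):
        raise ValueError("num_tokens must be between 1 and len(visual_tokens)")
    size, extra = divmod(len(visual_tokens), num_tokens)
    tokens, start = [], 0
    for i in range(num_tokens):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        tokens.append(mean_vector(visual_tokens[start:end]))
        start = end
    return tokens


def aggregate_confidence(maps: List[Grid]) -> Grid:
    """Assumed aggregation: average same-shaped 2D confidence maps cell by cell."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(cols)] for r in range(rows)]


def peak_location(conf_map: Grid) -> Tuple[int, int]:
    """Return the (row, col) of the highest-confidence cell."""
    cells = [(r, c) for r in range(len(conf_map)) for c in range(len(conf_map[0]))]
    return max(cells, key=lambda rc: conf_map[rc[0]][rc[1]])


if __name__ == "__main__":
    feats = [[0.0, 2.0], [2.0, 4.0], [10.0, 10.0], [20.0, 30.0]]
    print(init_concept_tokens(feats, 2))
    merged = aggregate_confidence([[[0.0, 1.0], [0.0, 0.0]], [[0.0, 0.5], [1.0, 0.0]]])
    print(merged, peak_location(merged))
```

In this toy setting, starting concept tokens from pooled visual features rather than random values reflects the abstract's stated motivation of reducing joint-training cost; the actual pooling and aggregation used by MC-LLaVA may differ.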

@article{an2025_2411.11706,
  title={MC-LLaVA: Multi-Concept Personalized Vision-Language Model},
  author={Ruichuan An and Sihan Yang and Ming Lu and Renrui Zhang and Kai Zeng and Yulin Luo and Jiajun Cao and Hao Liang and Ying Chen and Qi She and Shanghang Zhang and Wentao Zhang},
  journal={arXiv preprint arXiv:2411.11706},
  year={2025}
}