ResearchTrend.AI


AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models

19 June 2025
Yuan Zhang
Chun-Kai Fan
Tao Huang
Ming Lu
Sicheng Yu
Junwen Pan
Kuan Cheng
Qi She
Shanghang Zhang
VLM, LRM
Main: 10 pages · 8 figures · 3 tables · Bibliography: 4 pages · Appendix: 5 pages
Abstract

Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and a single hand-crafted prompt rarely exploits the complementary benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose AutoV, which learns to automatically select the optimal visual prompt from various candidates based on the given textual query and input image. To train AutoV, we develop an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM: we feed a set of visual prompts into the LVLM and rank them according to the prediction losses they induce. Using this ranking as a supervision signal, we train AutoV to choose the optimal visual prompt for a given query. Experimental results indicate that AutoV enhances the performance of various LVLMs across multiple popular image understanding tasks. For instance, LLaVA-OV with AutoV achieves a 1.7% accuracy gain on LLaVA^Wild, and AutoV boosts Qwen2.5-VL by 1.9% on MMMU, highlighting its potential as an optimal visual prompting method for LVLMs.
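The abstract describes a two-step recipe: rank candidate visual prompts by the prediction loss a pre-trained LVLM assigns to each, then use that ranking to supervise the selector. The following is a minimal sketch of that supervision signal; the hinge-style pairwise loss and the margin of 1.0 are assumptions for illustration, as the abstract does not specify the loss form, and the losses and scores here are placeholder numbers rather than real LVLM outputs.

```python
def rank_prompts_by_loss(losses):
    """Return candidate-prompt indices sorted from best (lowest LVLM
    prediction loss) to worst -- the supervision signal described in
    the abstract."""
    return sorted(range(len(losses)), key=lambda i: losses[i])

def pairwise_ranking_loss(scores, ranking, margin=1.0):
    """Hinge-style pairwise loss for training the selector: its score
    for a better-ranked prompt should exceed the score for any
    worse-ranked prompt by at least `margin` (assumed value)."""
    loss = 0.0
    for a in range(len(ranking)):
        for b in range(a + 1, len(ranking)):
            better, worse = ranking[a], ranking[b]
            loss += max(0.0, margin - (scores[better] - scores[worse]))
    return loss

# Placeholder LVLM losses for three candidate visual prompts.
ranking = rank_prompts_by_loss([0.9, 0.2, 0.5])  # -> [1, 2, 0]
# A selector that scores candidates in the same order incurs no loss.
perfect = pairwise_ranking_loss([0.0, 5.0, 3.0], ranking)
# A selector that prefers the worst candidate is penalized.
wrong = pairwise_ranking_loss([5.0, 0.0, 3.0], ranking)
```

In practice the scores would come from a trainable retrieval model over (image, query, prompt) features, with gradients flowing through `pairwise_ranking_loss`; this sketch only shows how the loss-derived ranking turns into a training objective.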

@article{zhang2025_2506.16112,
  title={AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models},
  author={Yuan Zhang and Chun-Kai Fan and Tao Huang and Ming Lu and Sicheng Yu and Junwen Pan and Kuan Cheng and Qi She and Shanghang Zhang},
  journal={arXiv preprint arXiv:2506.16112},
  year={2025}
}