un²CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

30 May 2025
Yinqi Li
Jiahe Zhao
Hong Chang
Ruibing Hou
Shiguang Shan
Xilin Chen
CLIP · VLM
Main: 8 pages · Appendix: 6 pages · Bibliography: 5 pages · 9 figures · 8 tables
Abstract

Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences between images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. This work therefore focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative model, unCLIP, provides a suitable framework for achieving this goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding; in other words, it inverts the CLIP image encoder. Compared with discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the conditional input space of unCLIP aligns with CLIP's original image-text embedding space. We therefore propose to invert unCLIP (dubbed un²CLIP) to improve the CLIP model. In this way, the improved image encoder gains unCLIP's ability to capture visual detail while preserving its alignment with the original text encoder. We evaluate the improved CLIP on various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un²CLIP significantly improves over the original CLIP and previous CLIP-improvement methods. Code and models will be available at this https URL.
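
The sketch below illustrates the training idea the abstract describes: finetune the CLIP image encoder against a frozen unCLIP-style generator, so that the image embedding must retain enough visual detail for the generator's (diffusion-style) reconstruction objective. This is only a minimal toy sketch; the module names, dimensions, noise schedule, and the exact loss are illustrative assumptions, not the authors' implementation, and the stand-in networks exist only so the script runs end to end.

```python
# Minimal sketch of the un²CLIP idea (assumed from the abstract, not the paper's code):
# the unCLIP generator is frozen; only the CLIP image encoder is updated, so the
# embedding is pushed to carry visual detail while the text encoder stays untouched.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, IMG_PIXELS = 512, 3 * 64 * 64  # toy sizes for this sketch


class ToyCLIPImageEncoder(nn.Module):
    """Stand-in for a pretrained CLIP image encoder (trainable in un²CLIP)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(IMG_PIXELS, EMB_DIM)

    def forward(self, images):
        return F.normalize(self.proj(images.flatten(1)), dim=-1)


class ToyUnCLIPGenerator(nn.Module):
    """Stand-in for a pretrained unCLIP generator: predicts the noise added to an
    image, conditioned on a CLIP image embedding. Kept frozen during training."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(IMG_PIXELS + EMB_DIM, IMG_PIXELS)

    def forward(self, noisy_images, embedding):
        x = torch.cat([noisy_images.flatten(1), embedding], dim=-1)
        return self.net(x).view_as(noisy_images)


encoder = ToyCLIPImageEncoder()                         # initialized from CLIP in practice
generator = ToyUnCLIPGenerator().requires_grad_(False)  # frozen unCLIP generator
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-5)

for step in range(3):  # toy loop with random data standing in for real images
    images = torch.rand(8, 3, 64, 64)
    noise = torch.randn_like(images)
    noisy_images = images + noise                       # simplified "forward diffusion"

    embedding = encoder(images)                         # condition from the trainable encoder
    pred_noise = generator(noisy_images, embedding)     # frozen generator tries to denoise

    # The frozen generator can only denoise well if the embedding carries visual
    # detail, so this loss pushes detail into the CLIP image embedding.
    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the text encoder is left untouched and the conditional input space of unCLIP coincides with CLIP's image-text embedding space, the finetuned image encoder can, per the abstract, keep its alignment with the original text encoder; any additional regularization the paper may use is omitted here.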

@article{li2025_2505.24517,
  title={un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP},
  author={Yinqi Li and Jiahe Zhao and Hong Chang and Ruibing Hou and Shiguang Shan and Xilin Chen},
  journal={arXiv preprint arXiv:2505.24517},
  year={2025}
}