CLIP's Visual Embedding Projector is a Few-shot Cornucopia

7 October 2024
Mohammad Fahes
Tuan-Hung Vu
Andrei Bursuc
Patrick Pérez
Raoul de Charette
    VLM
Abstract

We consider the problem of adapting a contrastively pretrained vision-language model like CLIP (Radford et al., 2021) for few-shot classification. The literature addresses this problem by learning a linear classifier on the frozen visual features, optimizing word embeddings, or learning external feature adapters. We introduce an alternative way for few-shot CLIP adaptation without adding "external" parameters to optimize. We find that simply fine-tuning the embedding projection matrix of the vision encoder leads to better performance than all baselines. Furthermore, we show that regularizing training with the distance between the fine-tuned and pretrained matrices adds reliability for adapting CLIP, making the results stable across different learning rates in the "validation-free" setting. This simple approach, coined ProLIP, yields state-of-the-art performance on 11 few-shot classification benchmarks, few-shot cross-dataset transfer, domain generalization, and base-to-new class generalization. We also show that ProLIP significantly outperforms prompt tuning when extended to the task of test-time adaptation, while being an order of magnitude faster to train. Code will be made available at: this https URL.

View on arXiv
@article{fahes2025_2410.05270,
  title={CLIP's Visual Embedding Projector is a Few-shot Cornucopia},
  author={Mohammad Fahes and Tuan-Hung Vu and Andrei Bursuc and Patrick Pérez and Raoul de Charette},
  journal={arXiv preprint arXiv:2410.05270},
  year={2025}
}