We consider the problem of adapting a contrastively pretrained vision-language model such as CLIP (Radford et al., 2021) for few-shot classification. The literature addresses this problem by learning a linear classifier on the frozen visual features, optimizing word embeddings, or learning external feature adapters. We introduce an alternative approach to few-shot CLIP adaptation that adds no "external" parameters to optimize. We find that simply fine-tuning the embedding projection matrix of the vision encoder leads to better performance than all baselines. Furthermore, we show that regularizing training with the distance between the fine-tuned and pretrained matrices makes CLIP adaptation more reliable, keeping results stable across different learning rates in the "validation-free" setting. This simple approach, coined ProLIP, yields state-of-the-art performance on 11 few-shot classification benchmarks, few-shot cross-dataset transfer, domain generalization, and base-to-new class generalization. We also show that ProLIP significantly outperforms prompt tuning when extended to the task of test-time adaptation, while being one order of magnitude faster to train. Code will be made available at: this https URL.
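To make the idea concrete, here is a minimal sketch of projector-only fine-tuning with a "stay close to the pretrained weights" regularizer, in the spirit of the abstract. It is not the authors' implementation: the hyperparameters (learning rate, regularization weight), the prompt template, the class names, and the choice of a squared Frobenius distance are illustrative assumptions. It assumes the OpenAI `clip` package and a ViT backbone, where the visual projection is exposed as `model.visual.proj`.

```python
# Sketch: fine-tune only CLIP's visual embedding projection matrix on few-shot data,
# regularized by the distance to the pretrained matrix. Assumptions are noted inline.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
model.float()  # train in fp32 for numerical stability

# Freeze everything except the visual embedding projection matrix.
for p in model.parameters():
    p.requires_grad_(False)
proj = model.visual.proj            # ViT variants expose the projector as a Parameter
proj.requires_grad_(True)
proj_pretrained = proj.detach().clone()  # frozen copy used by the regularizer

# Class text embeddings are computed once and kept fixed.
classnames = ["cat", "dog"]  # placeholder few-shot classes
with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {c}." for c in classnames]).to(device)
    text_feats = F.normalize(model.encode_text(tokens), dim=-1)

optimizer = torch.optim.AdamW([proj], lr=1e-4)  # illustrative learning rate
lambda_reg = 1.0                                # illustrative regularization weight

def training_step(images, labels):
    """One few-shot step: cross-entropy + distance to the pretrained projector."""
    img_feats = F.normalize(model.encode_image(images), dim=-1)
    logits = model.logit_scale.exp() * img_feats @ text_feats.t()
    reg = (proj - proj_pretrained).pow(2).sum()  # squared Frobenius distance (assumed form)
    loss = F.cross_entropy(logits, labels) + lambda_reg * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only `model.visual.proj` receives gradients, the optimizer state and the regularizer both operate on a single matrix, which is what keeps this adaptation lightweight compared to prompt tuning or external adapters.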
@article{fahes2025_2410.05270,
  title   = {CLIP's Visual Embedding Projector is a Few-shot Cornucopia},
  author  = {Mohammad Fahes and Tuan-Hung Vu and Andrei Bursuc and Patrick Pérez and Raoul de Charette},
  journal = {arXiv preprint arXiv:2410.05270},
  year    = {2025}
}