We consider the problem of adapting a contrastively pretrained vision-language model such as CLIP (Radford et al., 2021) for few-shot classification. The literature addresses this problem by learning a linear classifier on the frozen visual features, optimizing word embeddings, or learning external feature adapters. We introduce an alternative approach to few-shot CLIP adaptation that adds no "external" parameters to optimize. We find that simply fine-tuning the embedding projection matrix of the vision encoder leads to better performance than all baselines. Furthermore, we show that regularizing training with the distance between the fine-tuned and pretrained matrices makes CLIP adaptation more reliable, keeping results stable across different learning rates in the "validation-free" setting. This simple approach, coined ProLIP, yields state-of-the-art performance on 11 few-shot classification benchmarks, few-shot cross-dataset transfer, domain generalization, and base-to-new class generalization. We also show that ProLIP significantly outperforms prompt tuning when extended to the task of test-time adaptation, while being one order of magnitude faster to train. Code will be made available at: this https URL.
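To make the idea concrete, here is a minimal sketch of projector-only fine-tuning with a "stay close to the pretrained weights" regularizer, in the spirit of the abstract. It is not the authors' implementation: the hyperparameters (learning rate, regularization weight), the prompt template, the class names, and the choice of a squared Frobenius distance are illustrative assumptions. It assumes the OpenAI `clip` package and a ViT backbone, where the visual projection is exposed as `model.visual.proj`.

```python
# Sketch: fine-tune only CLIP's visual embedding projection matrix on few-shot data,
# regularized by the distance to the pretrained matrix. Assumptions are noted inline.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
model.float()  # train in fp32 for numerical stability

# Freeze everything except the visual embedding projection matrix.
for p in model.parameters():
    p.requires_grad_(False)
proj = model.visual.proj            # ViT variants expose the projector as a Parameter
proj.requires_grad_(True)
proj_pretrained = proj.detach().clone()  # frozen copy used by the regularizer

# Class text embeddings are computed once and kept fixed.
classnames = ["cat", "dog"]  # placeholder few-shot classes
with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {c}." for c in classnames]).to(device)
    text_feats = F.normalize(model.encode_text(tokens), dim=-1)

optimizer = torch.optim.AdamW([proj], lr=1e-4)  # illustrative learning rate
lambda_reg = 1.0                                # illustrative regularization weight

def training_step(images, labels):
    """One few-shot step: cross-entropy + distance to the pretrained projector."""
    img_feats = F.normalize(model.encode_image(images), dim=-1)
    logits = model.logit_scale.exp() * img_feats @ text_feats.t()
    reg = (proj - proj_pretrained).pow(2).sum()  # squared Frobenius distance (assumed form)
    loss = F.cross_entropy(logits, labels) + lambda_reg * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only `model.visual.proj` receives gradients, the optimizer state and the regularizer both operate on a single matrix, which is what keeps this adaptation lightweight compared to prompt tuning or external adapters.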
@article{fahes2025_2410.05270,
  title   = {CLIP's Visual Embedding Projector is a Few-shot Cornucopia},
  author  = {Mohammad Fahes and Tuan-Hung Vu and Andrei Bursuc and Patrick Pérez and Raoul de Charette},
  journal = {arXiv preprint arXiv:2410.05270},
  year    = {2025}
}