
CLIP Embeddings for AI-Generated Image Detection: A Few-Shot Study with Lightweight Classifier

Abstract

Verifying the authenticity of AI-generated images is a growing challenge for social media platforms. While vision-language models (VLMs) such as CLIP excel at multimodal representation, their capacity for AI-generated image classification remains underexplored, since such labels are absent during pre-training. This work investigates whether CLIP embeddings inherently contain information indicative of AI generation. The proposed pipeline extracts visual embeddings with a frozen CLIP model, feeds these embeddings to lightweight networks, and fine-tunes only the final classifier. Experiments on the public CIFAKE benchmark reach 95% accuracy without any language reasoning. Few-shot adaptation to a curated custom dataset with only 20% of the data raises performance to 85%. A closed-source baseline (Gemini-2.0) achieves the best zero-shot accuracy yet fails on specific styles. Notably, certain image types, such as wide-angle photographs and oil paintings, remain significantly harder to classify. These results expose previously unexplored difficulties in detecting particular classes of AI-generated images, raising new and more specific questions in this domain that merit further investigation.
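The frozen-encoder setup described in the abstract can be sketched as follows in PyTorch. The CLIP image encoder is stood in here by random 512-dimensional feature vectors (in practice they would come from a frozen encoder such as open_clip's ViT-B/32 via `encode_image`); the head architecture, embedding dimension, and training hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Lightweight classifier head trained on top of frozen CLIP image embeddings.
# Assumption: 512-dim embeddings (ViT-B/32-sized); the paper's exact head may differ.
class EmbeddingClassifier(nn.Module):
    def __init__(self, embed_dim=512, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # two classes: real vs. AI-generated
        )

    def forward(self, x):
        return self.net(x)

# Stand-in for frozen CLIP features: in real use these would be produced by
# clip_model.encode_image(images) under torch.no_grad(), so only the head trains.
torch.manual_seed(0)
feats = torch.randn(32, 512)          # hypothetical batch of embeddings
labels = torch.randint(0, 2, (32,))   # hypothetical real/fake labels

head = EmbeddingClassifier()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)  # only head parameters updated
loss_fn = nn.CrossEntropyLoss()

for _ in range(5):  # tiny illustrative training loop
    opt.zero_grad()
    loss = loss_fn(head(feats), labels)
    loss.backward()
    opt.step()

preds = head(feats).argmax(dim=1)
print(preds.shape)  # torch.Size([32])
```

Because gradients flow only through the small head, few-shot adaptation of this kind is cheap: re-running the loop above on a handful of labeled embeddings from a new domain is all that fine-tuning requires.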

@article{ou2025_2505.10664,
  title={CLIP Embeddings for AI-Generated Image Detection: A Few-Shot Study with Lightweight Classifier},
  author={Ziyang Ou},
  journal={arXiv preprint arXiv:2505.10664},
  year={2025}
}