Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

1 June 2025
Yunqi Hong
Sohyun An
Andrew Bai
Neil Y. C. Lin
Cho-Jui Hsieh
    VLM
Main: 9 Pages
10 Figures
Bibliography: 3 Pages
2 Tables
Appendix: 4 Pages
Abstract

Although Multimodal Large Language Models (MLLMs) show promising results on general zero-shot image classification, fine-grained image classification remains challenging. It demands precise attention to subtle visual details that distinguish visually similar subcategories, details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance MLLM fine-grained classification capabilities in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image, thereby boosting classification accuracy. AutoSEP iteratively improves the description prompt using unlabeled data, guided by an instance-level classification scoring function. It requires only black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets, where it consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP improves on average by 13 percent over standard zero-shot classification and by 5 percent over the best-performing baselines. Code is available at: this https URL
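The loop the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the scoring function, so the sketch assumes one plausible reading of "instance-level classification" in which a prompt scores well when each unlabeled image can be matched back to its own generated description. The callables `describe`, `match`, and `propose` are hypothetical stand-ins for black-box MLLM calls, and the toy stubs at the bottom exist only so the sketch runs without a model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical black-box interfaces (assumptions, not from the paper):
Describe = Callable[[str, str], str]         # (prompt, image) -> description
Match = Callable[[str, Sequence[str]], int]  # (image, descriptions) -> chosen index
Propose = Callable[[str], List[str]]         # prompt -> candidate rewrites


def instance_score(prompt: str, images: Sequence[str],
                   describe: Describe, match: Match) -> float:
    """Assumed score: fraction of images matched to their own description."""
    descriptions = [describe(prompt, img) for img in images]
    hits = sum(match(img, descriptions) == i for i, img in enumerate(images))
    return hits / len(images)


def optimize_prompt(seed: str, images: Sequence[str], describe: Describe,
                    match: Match, propose: Propose,
                    rounds: int = 3) -> Tuple[str, float]:
    """Greedy search: keep a candidate only if it scores strictly higher."""
    best, best_score = seed, instance_score(seed, images, describe, match)
    for _ in range(rounds):
        for candidate in propose(best):
            score = instance_score(candidate, images, describe, match)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


# Toy stand-ins so the sketch runs without an MLLM (purely illustrative).
def toy_describe(prompt: str, image: str) -> str:
    # A prompt that mentions the beak yields a detailed description;
    # otherwise only the coarse category (the last word) is reported.
    return image if "beak" in prompt else image.split()[-1]


def toy_match(image: str, descriptions: Sequence[str]) -> int:
    # Pick the first description consistent with the image, or -1 if none.
    return next((i for i, d in enumerate(descriptions) if d in image), -1)


def toy_propose(prompt: str) -> List[str]:
    return [prompt + " Focus on the beak."]


toy_images = ["red-beak sparrow", "yellow-beak sparrow"]
best_prompt, best_score = optimize_prompt(
    "Describe the bird.", toy_images, toy_describe, toy_match, toy_propose)
```

In the toy setup, the generic seed prompt produces identical descriptions for both images, so they cannot be told apart, while the rewritten prompt surfaces the distinguishing attribute. No labels are used at any point, which is the property the abstract emphasizes.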

@article{hong2025_2506.03195,
  title={Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs},
  author={Yunqi Hong and Sohyun An and Andrew Bai and Neil Y. C. Lin and Cho-Jui Hsieh},
  journal={arXiv preprint arXiv:2506.03195},
  year={2025}
}