
Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis

Computer Vision and Pattern Recognition (CVPR), 2025
16 January 2025
A. Chowdhury
Dipanjyoti Paul
Zheda Mai
Zanming Huang
Ziheng Zhang
Kazi Sajeed Mehrab
Elizabeth G. Campolongo
Daniel Rubenstein
Charles V. Stewart
Anuj Karpatne
T. Berger-Wolf
Yu-Chuan Su
Wei-Lun Chao
Main: 8 Pages
24 Figures
Bibliography: 3 Pages
5 Tables
Appendix: 12 Pages
Abstract

We present a simple approach to make pre-trained Vision Transformers (ViTs) interpretable for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as bird species. Pre-trained ViTs, such as DINO, have demonstrated remarkable capabilities in extracting localized, discriminative features. However, saliency maps like Grad-CAM often fail to identify these traits, producing blurred, coarse heatmaps that highlight entire objects instead. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to address this limitation. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To correctly classify an image, the true-class prompt must attend to unique image patches not present in other classes' images (i.e., traits). As a result, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a "free lunch," requiring only a modification to the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM easy to train and apply, in stark contrast to other interpretable methods that require designing specific models and training processes. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate the superior interpretation capability of Prompt-CAM. The source code and demo are available at this https URL.
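
The mechanism described in the abstract (one prompt per class, classification from each prompt's output, and reading the predicted class's attention as a trait map) can be illustrated with a toy, single-head sketch in plain Python. This is not the authors' implementation: the function names, the shared scoring vector `w`, the use of raw patch vectors as both keys and values, and the random inputs are all assumptions made for illustration, and a real ViT would use learned projections, multiple heads, and multiple layers.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prompt_attention(prompt_query, patch_keys):
    """Scaled dot-product attention of one class prompt over image patches."""
    scale = math.sqrt(len(prompt_query))
    return softmax([dot(prompt_query, k) / scale for k in patch_keys])


def attend(attn, patch_values):
    """Attention-weighted sum of patch vectors (the prompt's output)."""
    dim = len(patch_values[0])
    return [sum(a * v[d] for a, v in zip(attn, patch_values)) for d in range(dim)]


def score_prompts(outputs, w):
    """ASSUMPTION: one shared vector w scores every class prompt's output."""
    return [dot(w, z) for z in outputs]


random.seed(0)
num_classes, num_patches, dim = 3, 4, 8

# Hypothetical stand-ins for learned class prompts and ViT patch features.
prompts = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_classes)]
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
w = [random.gauss(0, 1) for _ in range(dim)]

# Each class prompt attends to the patches and produces its own output.
attn = [prompt_attention(p, patches) for p in prompts]
outputs = [attend(a, patches) for a in attn]

# One logit per class prompt; the highest-scoring prompt gives the prediction.
logits = score_prompts(outputs, w)
probs = softmax(logits)
pred = max(range(num_classes), key=lambda c: logits[c])

# The predicted class's attention over patches plays the role of the trait map.
trait_map = attn[pred]
```

In this sketch, `trait_map` holds one weight per patch; reshaping those weights onto the image grid would give a heatmap in the spirit of the per-head attention maps the abstract describes.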
