
VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection

Abstract

As object detectors are increasingly deployed as black-box cloud services or pre-trained models with restricted access to the original training data, the challenge of zero-shot object-level out-of-distribution (OOD) detection arises. This task is crucial for ensuring the reliability of detectors in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pre-trained vision-language models like CLIP, directly applying such models to object-level OOD detection is challenging due to the loss of contextual information and the reliance on image-level alignment. To tackle these challenges, we introduce a new method that leverages visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level OOD detection. Our method preserves critical contextual information and improves the ability to differentiate between ID and OOD objects, achieving competitive performance across different benchmarks.
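For the exact VisTa method, the paper itself is the reference; as a rough, hypothetical illustration of the general idea behind CLIP-style zero-shot OOD scoring that the abstract builds on, one can compare an object's visual embedding against text embeddings of the ID class names and flag a low maximum cosine similarity as OOD. The random vectors below are placeholders standing in for real CLIP features; the function name and threshold logic are illustrative, not the authors' implementation:

```python
import numpy as np

def max_similarity_ood_score(obj_emb, text_embs):
    """Max cosine similarity between one object embedding and a bank of
    ID class text embeddings. Low values suggest the object is OOD."""
    obj = obj_emb / np.linalg.norm(obj_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.max(txt @ obj))

rng = np.random.default_rng(0)
# Placeholder for CLIP text features of 5 ID class names (e.g., COCO classes).
text_embs = rng.normal(size=(5, 512))
# An object embedding resembling ID class 2, plus a small perturbation.
id_obj = text_embs[2] + 0.1 * rng.normal(size=512)
# An unrelated (OOD) object embedding.
ood_obj = rng.normal(size=512)

print(max_similarity_ood_score(id_obj, text_embs))
print(max_similarity_ood_score(ood_obj, text_embs))
```

In this toy setup the ID-like object scores close to 1.0 while the random OOD object scores near 0, so thresholding the score separates the two; the abstract's contribution lies in how the object embedding and the ID text space are constructed, which this sketch does not capture.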

@article{zhang2025_2503.22291,
  title={VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection},
  author={Bin Zhang and Xiaoyang Qu and Guokuan Li and Jiguang Wan and Jianzong Wang},
  journal={arXiv preprint arXiv:2503.22291},
  year={2025}
}