VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection

As object detectors are increasingly deployed as black-box cloud services or as pre-trained models without access to the original training data, zero-shot object-level out-of-distribution (OOD) detection becomes essential for ensuring detector reliability in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pre-trained vision-language models such as CLIP, applying these models directly at the object level is difficult: isolating objects discards their surrounding context, and CLIP's training aligns text with whole images rather than regions. To address these challenges, we introduce VisTa, a method that adapts CLIP for zero-shot object-level OOD detection through visual prompts, which preserve critical contextual information, and a text-augmented construction of the in-distribution (ID) space. Our method improves the separation of ID and OOD objects and achieves competitive performance across different benchmarks.
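As a rough illustration only (not the paper's implementation), the core zero-shot scoring idea behind CLIP-based OOD detection — comparing an object embedding against a set of ID text embeddings and thresholding the best match — can be sketched with toy random vectors standing in for CLIP features; all names and dimensions below are hypothetical:

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def id_score(object_embedding, id_text_embeddings):
    """Max cosine similarity of an object embedding to any ID class text embedding.

    Higher means more ID-like; an object is flagged OOD when this score
    falls below a chosen threshold. (Toy stand-in for CLIP-style scoring;
    the paper's visual prompting and text augmentation are not reproduced.)
    """
    obj = normalize(object_embedding)
    txt = normalize(id_text_embeddings)
    return float((txt @ obj).max())

rng = np.random.default_rng(0)
id_texts = rng.normal(size=(5, 8))                   # 5 toy ID class text embeddings
id_object = id_texts[2] + 0.05 * rng.normal(size=8)  # object near an ID prototype
ood_object = rng.normal(size=8)                      # unrelated object embedding

print(id_score(id_object, id_texts), id_score(ood_object, id_texts))
```

In practice the embeddings would come from a frozen CLIP image and text encoder, with the paper's visual prompts supplying context around each detected object and text augmentation enriching the set of ID prototypes.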
@article{zhang2025_2503.22291,
  title={VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection},
  author={Bin Zhang and Xiaoyang Qu and Guokuan Li and Jiguang Wan and Jianzong Wang},
  journal={arXiv preprint arXiv:2503.22291},
  year={2025}
}