AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time
- TTAVLM

While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test time is not yet fully understood. Existing robustness benchmarks focus mainly on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios in which shifts can occur in both the audio and visual modalities, we introduce AVROBUSTBENCH, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. AVROBUSTBENCH comprises four audio-visual benchmark datasets, AUDIOSET-2C, VGGSOUND-2C, KINETICS-2C, and EPICKITCHENS-2C, each incorporating 75 bimodal audio-visual corruptions that are co-occurring and correlated. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods offer minimal improvements in performance under bimodal corruptions on VGGSOUND-2C and KINETICS-2C. We further propose AV2C, a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves improvements on VGGSOUND-2C. We hope that AVROBUSTBENCH will steer the development of more effective and robust audio-visual TTA approaches. Our code is available.
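To make the entropy-penalizing idea concrete, below is a minimal PyTorch sketch of one online TTA step with late audio-visual fusion: fused predictions are computed, high-entropy (unreliable) samples are filtered out, and the remaining samples' entropy is minimized. The function names (tta_step, entropy), the additive-logit fusion, and the fixed entropy threshold are illustrative assumptions, not the paper's actual AV2C implementation.

import torch
import torch.nn.functional as F

def entropy(p, eps=1e-8):
    # Shannon entropy of each row of a (B, C) probability matrix.
    return -(p * (p + eps).log()).sum(dim=1)

def tta_step(audio_model, visual_model, audio, video, optimizer, threshold=2.0):
    # One online adaptation step on a test batch.
    # `optimizer` would typically cover only a small set of parameters
    # (e.g., normalization-layer affines), as is common in TTA methods.
    logits_a = audio_model(audio)                  # (B, num_classes)
    logits_v = visual_model(video)                 # (B, num_classes)
    fused = F.softmax(logits_a + logits_v, dim=1)  # simple late fusion (assumed)

    ent = entropy(fused)
    mask = ent < threshold                         # penalize/exclude high-entropy samples
    if mask.any():
        loss = ent[mask].mean()                    # minimize entropy of confident samples
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return fused.argmax(dim=1)                     # predictions for the current batch

In this sketch the threshold acts as the penalty on high-entropy samples: they contribute no gradient, so adaptation is driven only by predictions the fused model is already relatively confident about.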
@article{maharana2025_2506.00358,
  title   = {$\texttt{AVROBUSTBENCH}$: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time},
  author  = {Sarthak Kumar Maharana and Saksham Singh Kushwaha and Baoming Zhang and Adrian Rodriguez and Songtao Wei and Yapeng Tian and Yunhui Guo},
  journal = {arXiv preprint arXiv:2506.00358},
  year    = {2025}
}