VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), each of which offers only a fragmented view of fairness. We propose VIBE, a framework that evaluates generative bias through open-ended tasks, such as personalized recommendation, using real-world human recordings. Unlike MCQs, our method allows stereotypical associations to manifest organically without predefined options, making it easily extensible to new tasks. Evaluating 11 state-of-the-art LALMs reveals systematic biases in realistic scenarios. We find that gender cues often trigger larger distributional shifts than accent cues, indicating that current LALMs reproduce social stereotypes.
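As a rough illustration of what a "distributional shift" between voice conditions could look like, the sketch below compares the frequency distributions of open-ended recommendations elicited under two speaker attributes using Jensen-Shannon divergence. This is an assumption for illustration only, not necessarily the metric VIBE uses; the category names and counts are hypothetical.

from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def shift(outputs_a, outputs_b):
    """JS divergence between two sets of categorical model outputs."""
    cats = sorted(set(outputs_a) | set(outputs_b))
    ca, cb = Counter(outputs_a), Counter(outputs_b)
    p = np.array([ca[c] for c in cats], dtype=float)
    q = np.array([cb[c] for c in cats], dtype=float)
    p, q = p / p.sum(), q / q.sum()
    # scipy returns the JS distance (a square root); square it for divergence
    return jensenshannon(p, q, base=2) ** 2

# Hypothetical recommendations for the same prompt, varying only the voice.
male   = ["action"] * 60 + ["romance"] * 10 + ["sci-fi"] * 30
female = ["action"] * 20 + ["romance"] * 55 + ["sci-fi"] * 25
us     = ["action"] * 45 + ["romance"] * 25 + ["sci-fi"] * 30
uk     = ["action"] * 40 + ["romance"] * 30 + ["sci-fi"] * 30

print(f"gender shift: {shift(male, female):.3f}")  # larger divergence
print(f"accent shift: {shift(us, uk):.3f}")        # smaller divergence

Under this toy setup, a larger gender-conditioned divergence than accent-conditioned divergence would mirror the paper's reported finding.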