OpenAI's Whisper has achieved significant success in Automatic Speech Recognition. However, it has consistently been found to hallucinate, particularly on non-speech segments, which limits its broader application in complex industrial settings. In this paper, we introduce a novel method to reduce Whisper's hallucination on non-speech segments without using any pre- or post-processing techniques. Specifically, we benchmark the contribution of each self-attention head in the Whisper-large-v3 decoder to the hallucination problem by performing a head-wise mask. Our findings reveal that only 3 of the 20 heads account for over 75% of the hallucinations on the UrbanSound dataset. We then fine-tune these three "crazy" heads using a collection of non-speech data. The results show that our best fine-tuned model, namely Calm-Whisper, achieves over 80% reduction in non-speech hallucination with less than 0.1% WER degradation on LibriSpeech test-clean and test-other.
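The head-wise mask described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `masked_multihead_attention`, the tensor sizes, and the masked head indices are all hypothetical, chosen only to show how zeroing one head's output removes its contribution before the heads are concatenated. A real decoder layer would also apply an output projection after concatenation, which is omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_multihead_attention(q, k, v, head_mask):
    """Scaled dot-product attention with a per-head 0/1 mask.

    q, k, v:   arrays of shape (heads, time, dim)
    head_mask: array of shape (heads,), 1.0 keeps a head, 0.0 silences it
    Returns an array of shape (time, heads * dim).
    """
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (heads, time, time)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                           # (heads, time, dim)
    per_head = per_head * head_mask[:, None, None]   # zero out masked heads
    # Concatenate heads along the feature axis.
    return per_head.transpose(1, 0, 2).reshape(q.shape[1], -1)

rng = np.random.default_rng(0)
n_heads, t, d = 20, 4, 8          # 20 heads as in the abstract; t and d are toy sizes
q, k, v = (rng.standard_normal((n_heads, t, d)) for _ in range(3))

mask = np.ones(n_heads)
mask[[1, 6, 11]] = 0.0            # hypothetical head indices, for illustration only
out = masked_multihead_attention(q, k, v, mask)
```

Repeating the evaluation with a different single head masked each time, and counting how many non-speech clips still produce text, gives one way to rank heads by their contribution to hallucination.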
@article{wang2025_2505.12969,
  title={Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down},
  author={Yingzhi Wang and Anas Alhmoud and Saad Alsahly and Muhammad Alqurishi and Mirco Ravanelli},
  journal={arXiv preprint arXiv:2505.12969},
  year={2025}
}