Automatic Lyrics Transcription (ALT) aims to recognize lyrics from singing voices, similar to Automatic Speech Recognition (ASR) for spoken language, but faces added complexity due to domain-specific properties of the singing voice. While foundation ASR models show robustness across various speech tasks, their performance degrades on singing voice, especially in the presence of musical accompaniment. This work focuses on this performance gap and explores Low-Rank Adaptation (LoRA) for ALT, investigating both single-domain and dual-domain fine-tuning strategies. We propose using a consistency loss to better align vocal and mixture encoder representations, improving transcription on mixtures without relying on singing voice separation. Our results show that while naïve dual-domain fine-tuning underperforms, structured training with consistency loss yields modest but consistent gains, demonstrating the potential of adapting ASR foundation models for music.
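The consistency loss described above penalizes disagreement between the encoder representations of the clean vocal and the full mixture for the same audio segment. A minimal sketch of this idea, assuming a mean-squared-error formulation and time-aligned feature tensors of shape (time, dim) (the paper's exact loss and feature shapes are not specified here):

```python
import numpy as np

def consistency_loss(vocal_feats: np.ndarray, mixture_feats: np.ndarray) -> float:
    """Hypothetical consistency term: mean-squared error between the
    encoder features of the isolated vocal and of the music mixture.
    Minimizing it pushes the mixture encoder toward the vocal encoder's
    representation, so transcription on mixtures needs no source separation."""
    return float(np.mean((mixture_feats - vocal_feats) ** 2))

# Toy example with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
vocal = rng.standard_normal((50, 512))
mixture = vocal + 0.1 * rng.standard_normal((50, 512))  # mixture is a perturbed view
loss = consistency_loss(vocal, mixture)
```

In training, this term would typically be added to the transcription (e.g. CTC or cross-entropy) loss with a weighting coefficient.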
@article{huang2025_2506.02339,
  title={Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss},
  author={Jiawen Huang and Felipe Sousa and Emir Demirel and Emmanouil Benetos and Igor Gadelha},
  journal={arXiv preprint arXiv:2506.02339},
  year={2025}
}