Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes

12 June 2025

Main: 4 pages

9 figures

Bibliography: 1 page

Abstract

Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation on a single minute of data robust. First, instead of estimating the adaptation parameters with cross-entropy on a single error-prone hypothesis, or "pseudo-label", this paper proposes a novel loss function, the conditional entropy over complete hypotheses. Using multiple hypotheses makes adaptation more robust to errors in the initial recognition. Second, a "speaker code" characterises a speaker in a vector short enough that it requires little data to estimate. On a far-field noise-augmented version of Common Voice, the proposed scheme yields a 20% relative improvement in word error rate with one minute of adaptation data, rising to 29% with 10 minutes.
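
The contrast between the two losses can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so this is an assumption-laden toy: an N-best list of hypothesis log-scores stands in for the set of complete hypotheses, and the function names (`nbest_posteriors`, `pseudo_label_loss`, `hypothesis_entropy`) are hypothetical, not taken from the paper.

```python
import math

def nbest_posteriors(log_scores):
    """Normalise hypothesis log-scores into a distribution over the N-best list.
    Assumption: the N-best list approximates the full hypothesis space."""
    m = max(log_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

def pseudo_label_loss(log_scores):
    """Cross-entropy against the single best hypothesis: -log p(best)."""
    return -math.log(max(nbest_posteriors(log_scores)))

def hypothesis_entropy(log_scores):
    """Entropy of the distribution over whole hypotheses: -sum_h p(h) log p(h)."""
    return -sum(p * math.log(p) for p in nbest_posteriors(log_scores) if p > 0.0)

# Hypothetical log-scores for three competing transcriptions of one utterance.
scores = [-4.1, -4.3, -9.0]
print(pseudo_label_loss(scores), hypothesis_entropy(scores))
```

In this toy setting the pseudo-label loss commits to the top hypothesis even when a competitor scores almost as well, whereas the entropy is computed from every hypothesis in the list, so no single, possibly wrong, transcription is treated as ground truth.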


@article{dalen2025_2506.10653,
  title={Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes},
  author={Rogier C. van Dalen and Shucong Zhang and Titouan Parcollet and Sourav Bhattacharya},
  journal={arXiv preprint arXiv:2506.10653},
  year={2025}
}
