Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

4 June 2025
Pradeep Rangappa, Andres Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esaú Villatoro-Tello, Bidisha Sharma, Petr Motlíček, Kadri Hacioğlu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke
Main: 3 pages · 2 figures · 4 tables · Bibliography: 2 pages
Abstract

Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer against a 7500-hour baseline, comparing it to a CER-based approach that relies on hypotheses from three ASR systems. Fine-tuning on the full 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the training set to 100 hours (1.4%) with comparable performance; the same trend holds on Fisher English.
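The abstract only names the filtering stages; the sketch below shows how such a multi-stage selection pipeline could be wired together in Python. Everything here is illustrative: the Segment fields, the thresholds, the entity-based ranking, and the 100-hour budget heuristic are assumptions, not the paper's actual implementation (the paper's WER predictor and NER tagger are separate models, stubbed here as precomputed per-segment scores).

from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    """One pseudo-labeled utterance (all field names are illustrative)."""
    duration_s: float
    whisper_hyp: str      # pseudo-label from the Whisper (encoder-decoder) model
    zipformer_hyp: str    # pseudo-label from the Zipformer (transducer) model
    predicted_wer: float  # score from an external WER-prediction model (assumed precomputed)
    entity_count: int     # entities found by an external NER tagger (assumed precomputed)

def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cross_system_cer(hyp_a: str, hyp_b: str) -> float:
    """CER between the two systems' hypotheses; low disagreement is used
    as a proxy for pseudo-label reliability."""
    return edit_distance(hyp_a, hyp_b) / max(len(hyp_a), 1)

def select_segments(segments: List[Segment],
                    max_predicted_wer: float = 0.15,
                    max_cross_cer: float = 0.10,
                    budget_hours: float = 100.0) -> List[Segment]:
    """Multi-stage filter: gate on predicted WER and cross-system CER,
    rank survivors (entity-rich, confident segments first), then keep
    adding segments until the duration budget is exhausted."""
    survivors = [
        s for s in segments
        if s.predicted_wer <= max_predicted_wer
        and cross_system_cer(s.whisper_hyp, s.zipformer_hyp) <= max_cross_cer
    ]
    survivors.sort(key=lambda s: (-s.entity_count, s.predicted_wer))
    selected, total_s = [], 0.0
    for s in survivors:
        if total_s + s.duration_s > budget_hours * 3600:
            break
        selected.append(s)
        total_s += s.duration_s
    return selected

The design choice worth noting is the cross-system CER gate: when an encoder-decoder model and a transducer model independently produce near-identical transcripts, the pseudo-label is much more likely to be correct, which is the intuition behind combining hypotheses from multiple ASR systems.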

View on arXiv: https://arxiv.org/abs/2506.03681
@article{rangappa2025_2506.03681,
  title={Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering},
  author={Pradeep Rangappa and Andres Carofilis and Jeena Prakash and Shashi Kumar and Sergio Burdisso and Srikanth Madikeri and Esau Villatoro-Tello and Bidisha Sharma and Petr Motlicek and Kadri Hacioglu and Shankar Venkatesan and Saurabh Vyas and Andreas Stolcke},
  journal={arXiv preprint arXiv:2506.03681},
  year={2025}
}