Length Aware Speech Translation for Video Dubbing

Main: 4 pages · Bibliography: 1 page · 2 tables
Abstract

In video dubbing, aligning translated audio with the source audio is a significant challenge. Our focus is on achieving this efficiently, tailored for real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths (short, normal, and long) using predefined tags. Additionally, we introduced length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass. This approach maintained BLEU scores comparable to a baseline without length awareness while significantly enhancing synchronization quality between source and target audio, achieving mean opinion score (MOS) gains of 0.34 for Spanish and 0.65 for Korean.
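The abstract describes generating short, normal, and long translation variants via predefined tags, then matching target audio length to the source. As a minimal sketch of the length-matching step only: given tag-conditioned candidates with estimated durations, pick the one closest to the source audio's duration. The tag names, candidate texts, and duration estimates below are illustrative assumptions, not the paper's implementation.

```python
def pick_best_length(source_dur_s, candidates):
    """Select the tag-conditioned translation whose estimated spoken
    duration is closest to the source audio duration.

    candidates: dict mapping a length tag (e.g. "<short>") to a
    (translation_text, estimated_duration_seconds) pair.
    Returns (tag, translation_text).
    """
    best_tag = min(candidates,
                   key=lambda tag: abs(candidates[tag][1] - source_dur_s))
    return best_tag, candidates[best_tag][0]


# Hypothetical candidates produced under three length tags (illustrative only).
candidates = {
    "<short>": ("Hola", 0.6),
    "<normal>": ("Hola, que tal", 1.1),
    "<long>": ("Hola, como estas hoy", 1.8),
}

tag, text = pick_best_length(1.0, candidates)
# For a 1.0 s source clip, the "<normal>" candidate (1.1 s) is the closest fit.
```

In the paper's LABS setting, all three candidates would come from a single decoding pass rather than three separate ones; this sketch only shows the final duration-matching selection.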

@article{chadha2025_2506.00740,
  title={Length Aware Speech Translation for Video Dubbing},
  author={Harveen Singh Chadha and Aswin Shanmugam Subramanian and Vikas Joshi and Shubham Bansal and Jian Xue and Rupeshkumar Mehta and Jinyu Li},
  journal={arXiv preprint arXiv:2506.00740},
  year={2025}
}