CMU's IWSLT 2025 Simultaneous Speech Translation System

16 June 2025

Main: 4 pages, Bibliography: 2 pages, 1 figure, 2 tables

Abstract

This paper presents CMU's submission to the IWSLT 2025 Simultaneous Speech Translation (SST) task, which translates unsegmented English speech into Chinese and German text in a streaming manner. Our end-to-end speech-to-text system combines a chunkwise causal Wav2Vec 2.0 speech encoder, an adapter, and Qwen2.5-7B-Instruct as the decoder. We train it with a two-stage simultaneous training procedure on robust speech segments curated from the LibriSpeech, CommonVoice, and VoxPopuli datasets, using a standard cross-entropy loss. The model supports adjustable latency through a configurable latency multiplier. On the ACL60/60 development set, our system achieves 44.3 BLEU for English-to-Chinese and 25.1 BLEU for English-to-German translation, with computation-aware latencies of 2.7 and 2.3 seconds and theoretical latencies of 2.2 and 1.7 seconds, respectively.
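The abstract does not define how the latency multiplier works, so the following is only a minimal sketch of one plausible reading: the system buffers a fixed number of incoming speech chunks before each decoding step, so a larger multiplier trades latency for more context per step. The names `stream_translate`, `translate_step`, and `latency_multiplier` are hypothetical and are not taken from the paper or its code; the chunks are plain strings standing in for audio.

```python
from typing import Callable, Iterator, List, Sequence


def stream_translate(
    chunks: Sequence[str],
    translate_step: Callable[[List[str]], str],
    latency_multiplier: int = 1,
) -> Iterator[str]:
    """Yield one partial translation per group of buffered chunks.

    Assumption (not from the paper): the multiplier is the number of
    input chunks read before each call to the decoder.
    """
    if latency_multiplier < 1:
        raise ValueError("latency_multiplier must be >= 1")

    buffered: List[str] = []
    for index, chunk in enumerate(chunks, start=1):
        buffered.append(chunk)
        is_last = index == len(chunks)
        # Decode once enough chunks are buffered, or flush at end of input.
        if len(buffered) == latency_multiplier or is_last:
            yield translate_step(buffered)
            buffered = []


if __name__ == "__main__":
    # Stand-in "decoder" that just joins the chunk labels it received.
    fake_decoder = lambda group: "+".join(group)
    for partial in stream_translate(["c1", "c2", "c3"], fake_decoder, 2):
        print(partial)
```

With a multiplier of 1 the stand-in decoder runs after every chunk; raising it makes each step wait for more input, which is the general latency and quality trade-off that an adjustable multiplier exposes.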


@article{ouyang2025_2506.13143,
  title={CMU's IWSLT 2025 Simultaneous Speech Translation System},
  author={Siqi Ouyang and Xi Xu and Lei Li},
  journal={arXiv preprint arXiv:2506.13143},
  year={2025}
}
