This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance, with an average WER/CER of 16.63% on the private test set using Gemma3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.
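The abstract describes an encoder-projector-LLM pipeline: Whisper encoder frames are temporally downsampled and linearly mapped into the decoder LLM's embedding space. Below is a minimal sketch of one common projector design (frame stacking plus a linear layer), using NumPy for illustration. The stacking factor, the target embedding dimension, and the name `project` are assumptions for this sketch, not details from the paper; only the 1280-dim Whisper-large-v3 encoder output is standard.

```python
import numpy as np

# Whisper-large-v3 encoder emits 1280-dim frames; the LLM embedding size
# (3584, roughly Qwen2.5-7B scale) and the stacking factor are assumptions.
ENC_DIM, LLM_DIM, STACK = 1280, 3584, 4

rng = np.random.default_rng(0)
W = rng.standard_normal((ENC_DIM * STACK, LLM_DIM)) * 0.02  # projector weights

def project(frames: np.ndarray) -> np.ndarray:
    """Stack every STACK consecutive encoder frames (temporal downsampling),
    then linearly map the stacked vectors into the LLM embedding space."""
    T = frames.shape[0] // STACK * STACK          # drop trailing remainder frames
    stacked = frames[:T].reshape(-1, ENC_DIM * STACK)
    return stacked @ W                            # shape: (T // STACK, LLM_DIM)

enc_out = rng.standard_normal((100, ENC_DIM))     # 100 encoder frames
tokens = project(enc_out)
print(tokens.shape)                               # (25, 3584)
```

The projected vectors act as "speech tokens" prepended to the LLM's text prompt; in the three-stage recipe described above, such a projector is typically trained first while the encoder and LLM stay frozen.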
@article{nguyen2025_2506.13596,
  title={Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems},
  author={Tuan Nguyen and Long-Vu Hoang and Huy-Dat Tran},
  journal={arXiv preprint arXiv:2506.13596},
  year={2025}
}