WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction

6 June 2025
Jakaria Islam Emon
Kazi Tamanna Alam
Md Abu Salek
Main: 3 pages, 2 figures, 2 tables; bibliography: 1 page
Abstract

Mean Opinion Score (MOS) prediction for text-to-music systems requires evaluating both overall musical quality and text prompt alignment. This paper introduces WhisQ, a multimodal architecture that addresses this dual-assessment challenge through sequence-level co-attention and optimal transport regularization. WhisQ employs the pretrained Whisper Base model for temporal audio encoding and Qwen 3, a 0.6B Small Language Model (SLM), for text encoding; both preserve sequence structure for fine-grained cross-modal modeling. The architecture features specialized prediction pathways: overall music quality (OMQ) is predicted from pooled audio embeddings, while text alignment (TA) leverages bidirectional sequence co-attention between audio and text. A Sinkhorn optimal transport loss further enforces semantic alignment in the shared embedding space. On the MusicEval Track-1 dataset, WhisQ achieves substantial improvements over the baseline: 7% in Spearman correlation for OMQ and 14% for TA. Ablation studies reveal that optimal transport regularization provides the largest performance gain (a 10% SRCC improvement), demonstrating the importance of explicit cross-modal alignment for text-to-music evaluation.
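To make the dual-head design and the optimal transport regularizer concrete, below is a minimal PyTorch sketch of a WhisQ-style predictor. It is not the authors' implementation: it assumes pre-computed encoder states (e.g. from a Whisper Base audio encoder and a Qwen 3 0.6B text encoder), and the hidden sizes, mean pooling, use of nn.MultiheadAttention for the bidirectional co-attention, and the Sinkhorn iteration count are illustrative placeholders.

# Illustrative sketch of a WhisQ-style cross-modal MOS predictor (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


def sinkhorn_ot_loss(audio_seq, text_seq, eps=0.05, n_iters=20):
    """Entropic OT distance between one item's audio and text token sets."""
    a = F.normalize(audio_seq, dim=-1)          # (Ta, D)
    t = F.normalize(text_seq, dim=-1)           # (Tt, D)
    cost = 1.0 - a @ t.T                        # cosine cost, (Ta, Tt)
    # Uniform marginals over audio and text tokens.
    mu = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)
    nu = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                    # Sinkhorn-Knopp iterations
        v = nu / (K.T @ u + 1e-8)
        u = mu / (K @ v + 1e-8)
    P = torch.diag(u) @ K @ torch.diag(v)       # transport plan
    return (P * cost).sum()


class WhisQSketch(nn.Module):
    """Dual-head MOS predictor: OMQ from pooled audio, TA from co-attended audio/text."""

    def __init__(self, audio_dim=512, text_dim=1024, shared_dim=256, n_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        # Bidirectional sequence co-attention (audio attends to text, text attends to audio).
        self.a2t = nn.MultiheadAttention(shared_dim, n_heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(shared_dim, n_heads, batch_first=True)
        self.omq_head = nn.Sequential(nn.Linear(shared_dim, 128), nn.GELU(), nn.Linear(128, 1))
        self.ta_head = nn.Sequential(nn.Linear(2 * shared_dim, 128), nn.GELU(), nn.Linear(128, 1))

    def forward(self, audio_seq, text_seq):
        # audio_seq: (B, Ta, audio_dim), e.g. Whisper Base encoder states;
        # text_seq:  (B, Tt, text_dim),  e.g. hidden states from a 0.6B SLM.
        a = self.audio_proj(audio_seq)
        t = self.text_proj(text_seq)
        # OMQ pathway: overall music quality from pooled audio embeddings only.
        omq = self.omq_head(a.mean(dim=1)).squeeze(-1)
        # TA pathway: bidirectional co-attention, pooled and concatenated.
        a_attends_t, _ = self.a2t(query=a, key=t, value=t)
        t_attends_a, _ = self.t2a(query=t, key=a, value=a)
        ta_feat = torch.cat([a_attends_t.mean(dim=1), t_attends_a.mean(dim=1)], dim=-1)
        ta = self.ta_head(ta_feat).squeeze(-1)
        # OT regularizer pulling the two token sets together in the shared space.
        ot = torch.stack([sinkhorn_ot_loss(ai, ti) for ai, ti in zip(a, t)]).mean()
        return omq, ta, ot


# Example usage with dummy encoder outputs and dummy MOS targets (weights are placeholders).
model = WhisQSketch()
audio = torch.randn(2, 150, 512)
text = torch.randn(2, 32, 1024)
omq_pred, ta_pred, ot_reg = model(audio, text)
loss = F.mse_loss(omq_pred, torch.rand(2)) + F.mse_loss(ta_pred, torch.rand(2)) + 0.1 * ot_reg

The key design point the sketch mirrors is the split supervision: the OMQ score never sees the text, while the TA score is computed only after both sequences have attended to each other, with the Sinkhorn term acting as an auxiliary alignment regularizer on the shared embedding space.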

@article{emon2025_2506.05899,
  title={WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction},
  author={Jakaria Islam Emon and Kazi Tamanna Alam and Md. Abu Salek},
  journal={arXiv preprint arXiv:2506.05899},
  year={2025}
}