
ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs

Main: 4 pages
4 figures
Bibliography: 2 pages
Appendix: 1 page
Abstract

Large Language Models (LLMs) are widely used in Spoken Language Understanding (SLU). Recent SLU models process audio directly by adapting speech input into LLMs for better multimodal learning. A key consideration for these models is the cross-modal alignment between the text and audio modalities, which indicates whether the LLM can associate semantic meaning with audio segments. While various methods exist for fusing these modalities, there is no standard metric to evaluate alignment quality in LLMs. In this work, we propose a new metric, ALAS (Automatic Latent Alignment Score). Our study examines the correlation between audio and text representations across transformer layers for two different tasks (Spoken Question Answering and Emotion Recognition). We show that our metric behaves as expected across different layers and tasks.
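As a rough illustration of the layer-wise comparison the abstract describes, the sketch below scores the similarity between audio and text hidden states at each transformer layer. The mean-pooling, the cosine-similarity choice, and the function name `alignment_scores` are assumptions made for this example; they are not the paper's exact ALAS procedure.

```python
import torch
import torch.nn.functional as F

def alignment_scores(audio_states, text_states):
    """Per-layer alignment between audio and text hidden states.

    audio_states, text_states: lists with one tensor per transformer layer,
    each shaped [num_tokens, hidden_dim], holding the hidden states of the
    audio span and of the paired text span, respectively.

    Returns one score per layer: the cosine similarity between the
    mean-pooled audio and text representations at that layer.
    """
    scores = []
    for audio_h, text_h in zip(audio_states, text_states):
        audio_vec = audio_h.mean(dim=0)  # pool audio tokens -> [hidden_dim]
        text_vec = text_h.mean(dim=0)    # pool text tokens  -> [hidden_dim]
        scores.append(F.cosine_similarity(audio_vec, text_vec, dim=0).item())
    return scores
```

Plotting such scores over layer index would show where in the network the two modalities converge; the paper's metric is built around this kind of layer-by-layer correlation.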

@article{mousavi2025_2505.19937,
  title={ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs},
  author={Pooneh Mousavi and Yingzhi Wang and Mirco Ravanelli and Cem Subakan},
  journal={arXiv preprint arXiv:2505.19937},
  year={2025}
}