
Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition

Jiamin Xie, Ju Lin, Yiteng Huang, Tyler Vuong, Zhaojiang Lin, Zhaojun Yang, Peng Su, Prashant Rawat, Sangeeta Srivastava, Ming Sun, Florian Metze
Main: 4 pages · Bibliography: 1 page · 1 figure · 4 tables
Abstract

Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech recognition. However, the ability of Speech LLMs to comprehend and process multi-channel audio with spatial cues remains a relatively unexplored area of research. In this work, we present directional-SpeechLlama, a novel approach that leverages the microphone array of smart glasses to achieve directional speech recognition, source localization, and bystander cross-talk suppression. To enhance the model's understanding of directivity, we propose two key techniques: serialized directional output training (S-DOT) and contrastive direction data augmentation (CDDA). Experimental results show that directional-SpeechLlama effectively captures the relationship between textual cues and spatial audio, yielding strong performance on both speech recognition and source localization tasks.
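
The abstract does not spell out how S-DOT structures its training targets, but serialized output training for multi-talker ASR conventionally concatenates per-talker transcripts with speaker-change tokens; a directional variant would plausibly prefix each segment with a token encoding that talker's direction of arrival. The sketch below illustrates that idea only; the <dir_k> and <sc> token formats, the 30-degree angular bins, and the Utterance fields are illustrative assumptions, not the paper's specification.

from dataclasses import dataclass

@dataclass
class Utterance:
    text: str        # reference transcript for one talker
    azimuth: float   # direction of arrival in degrees, 0-360 (assumed convention)
    start: float     # onset time in seconds, used to order the serialization

def direction_token(azimuth: float, bin_width: float = 30.0) -> str:
    """Quantize a direction of arrival into a discrete direction token.
    The bin width and <dir_k> format are hypothetical choices."""
    bin_index = int(azimuth % 360.0 // bin_width)
    return f"<dir_{bin_index}>"

def serialize_targets(utterances: list[Utterance]) -> str:
    """Concatenate talker transcripts in onset order, each prefixed with
    its direction token and separated by a speaker-change token <sc>."""
    ordered = sorted(utterances, key=lambda u: u.start)
    segments = [f"{direction_token(u.azimuth)} {u.text}" for u in ordered]
    return " <sc> ".join(segments)

# Example: two overlapping talkers at roughly opposite directions.
mixture = [
    Utterance("turn left at the corner", azimuth=45.0, start=0.2),
    Utterance("what time is it", azimuth=210.0, start=0.9),
]
print(serialize_targets(mixture))
# -> "<dir_1> turn left at the corner <sc> <dir_7> what time is it"

Training against such direction-tagged serialized targets would let a single autoregressive decoder emit transcription and localization jointly, which is one plausible reading of how the model "captures the relationship between textual cues and spatial audio."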

@article{xie2025_2506.14973,
  title={Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition},
  author={Jiamin Xie and Ju Lin and Yiteng Huang and Tyler Vuong and Zhaojiang Lin and Zhaojun Yang and Peng Su and Prashant Rawat and Sangeeta Srivastava and Ming Sun and Florian Metze},
  journal={arXiv preprint arXiv:2506.14973},
  year={2025}
}