
LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs

Abstract

Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by enabling analysis of the diversity and overlap of selected prompts across different tasks.
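Below is a minimal, illustrative PyTorch sketch of the kind of learnable key-value prompt pool described in the abstract. The pool size, embedding dimension, cosine-similarity scoring, and top-k selection are assumptions made for illustration; the exact scoring function and the way selected soft tokens are injected into the LLM in LiSTEN may differ.

# Hedged sketch of a learnable key-value prompt pool with dynamic selection.
# All names, shapes, and the cosine top-k scoring are illustrative assumptions,
# not the authors' exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPool(nn.Module):
    def __init__(self, pool_size=20, prompt_len=8, dim=768, top_k=4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, dim))                # learnable keys
        self.values = nn.Parameter(torch.randn(pool_size, prompt_len, dim))  # learnable soft prompt tokens
        self.top_k = top_k

    def forward(self, query):
        # query: (batch, dim), e.g. a pooled audio/instruction embedding
        scores = F.cosine_similarity(query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)  # (batch, pool_size)
        topk = scores.topk(self.top_k, dim=-1)                                # (batch, top_k)
        selected = self.values[topk.indices]                                  # (batch, top_k, prompt_len, dim)
        prompts = selected.flatten(1, 2)                                      # (batch, top_k * prompt_len, dim)
        # An auxiliary matching loss pulls the selected keys toward the query.
        match_loss = (1.0 - topk.values).mean()
        return prompts, match_loss

pool = PromptPool()
audio_summary = torch.randn(2, 768)          # stand-in for a pooled encoder output
soft_tokens, match_loss = pool(audio_summary)
print(soft_tokens.shape)                     # torch.Size([2, 32, 768]); prepended to the LLM input sequence

The selected soft tokens would typically be concatenated with the audio and text embeddings before being fed to the (frozen or lightly adapted) LLM, so only the small prompt pool contributes trainable parameters.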

@article{mousavi2025_2505.18517,
  title={LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs},
  author={Pooneh Mousavi and Shubham Gupta and Cem Subakan and Mirco Ravanelli},
  journal={arXiv preprint arXiv:2505.18517},
  year={2025}
}