Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved text and audio token output to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels particularly in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of a token-based vocoder in enhancing overall performance for AQAA tasks.
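
For readers unfamiliar with the preference-optimization step mentioned in the abstract, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not taken from the paper: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the paper's actual training setup (including how DPO is combined with model merging) may differ.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All inputs are summed log-probabilities of a full response under the
    trainable policy or the frozen reference model. Names and the default
    beta are hypothetical, chosen only for this illustration.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(0.0, 0.0, 0.0, 0.0))
    # Policy favors the chosen response more than the reference does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss decreases as the policy assigns relatively more probability to the preferred response than the reference model does, which is the behavior preference optimization aims for.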

@article{huang2025_2506.08967,
  title={Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model},
  author={ Ailin Huang and Bingxin Li and Bruce Wang and Boyong Wu and Chao Yan and Chengli Feng and Heng Wang and Hongyu Zhou and Hongyuan Wang and Jingbei Li and Jianjian Sun and Joanna Wang and Mingrui Chen and Peng Liu and Ruihang Miao and Shilei Jiang and Tian Fei and Wang You and Xi Chen and Xuerui Yang and Yechang Huang and Yuxiang Zhang and Zheng Ge and Zheng Gong and Zhewei Huang and Zixin Zhang and Bin Wang and Bo Li and Buyun Ma and Changxin Miao and Changyi Wan and Chen Xu and Dapeng Shi and Dingyuan Hu and Enle Liu and Guanzhe Huang and Gulin Yan and Hanpeng Hu and Haonan Jia and Jiahao Gong and Jiaoren Wu and Jie Wu and Jie Yang and Junzhe Lin and Kaixiang Li and Lei Xia and Longlong Gu and Ming Li and Nie Hao and Ranchen Ming and Shaoliang Pang and Siqi Liu and Song Yuan and Tiancheng Cao and Wen Li and Wenqing He and Xu Zhao and Xuelin Zhang and Yanbo Yu and Yinmin Zhong and Yu Zhou and Yuanwei Liang and Yuanwei Lu and Yuxiang Yang and Zidong Yang and Zili Zhang and Binxing Jiao and Heung-Yeung Shum and Jiansheng Chen and Jing Li and Xiangyu Zhang and Xinhao Zhang and Yibo Zhu and Daxin Jiang and Shuchang Zhou and Chen Hu },
  journal={arXiv preprint arXiv:2506.08967},
  year={2025}
}
