
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

10 June 2025
Ailin Huang
B. Li
Bruce Wang
Boyong Wu
Chao Yan
Chengli Feng
Heng Wang
Hongyu Zhou
Hongyuan Wang
Jingbei Li
Jianjian Sun
Joanna Wang
Mingrui Chen
Peng Liu
Ruihang Miao
Shilei Jiang
Tian Fei
Wang You
Xi Chen
Xuerui Yang
Yechang Huang
Yuxiang Zhang
Z. Ge
Zheng Gong
Zhewei Huang
Zixin Zhang
Bin Wang
Bo Li
Buyun Ma
Changxin Miao
Changyi Wan
C. Xu
Dapeng Shi
Dingyuan Hu
Enle Liu
Guanzhe Huang
Gulin Yan
Hanpeng Hu
Haonan Jia
Jiahao Gong
J. Wu
Jie Wu
J. Yang
J. Lin
K. Li
Lei Xia
Longlong Gu
Ming Li
Nie Hao
Ranchen Ming
Shaoliang Pang
Siqi Liu
Song Yuan
Tiancheng Cao
W. Li
Wenqing He
Xu Zhao
X. Zhang
Yanbo Yu
Y. Zhong
Yu Zhou
Yuanwei Liang
Yuanwei Lu
Y. Yang
Zidong Yang
Zili Zhang
Binxing Jiao
H. Shum
Jiansheng Chen
Jing Li
Xiangyu Zhang
X. Zhang
Yibo Zhu
Daxin Jiang
Shuchang Zhou
Chen Hu
Main: 8 pages
3 figures
Bibliography: 3 pages
2 tables
Appendix: 1 page
Abstract

Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved text and audio token output to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of the token-based vocoder in enhancing overall performance for AQAA tasks.

@article{huang2025_2506.08967,
  title={Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model},
  author={Ailin Huang and Bingxin Li and Bruce Wang and Boyong Wu and Chao Yan and Chengli Feng and Heng Wang and Hongyu Zhou and Hongyuan Wang and Jingbei Li and Jianjian Sun and Joanna Wang and Mingrui Chen and Peng Liu and Ruihang Miao and Shilei Jiang and Tian Fei and Wang You and Xi Chen and Xuerui Yang and Yechang Huang and Yuxiang Zhang and Zheng Ge and Zheng Gong and Zhewei Huang and Zixin Zhang and Bin Wang and Bo Li and Buyun Ma and Changxin Miao and Changyi Wan and Chen Xu and Dapeng Shi and Dingyuan Hu and Enle Liu and Guanzhe Huang and Gulin Yan and Hanpeng Hu and Haonan Jia and Jiahao Gong and Jiaoren Wu and Jie Wu and Jie Yang and Junzhe Lin and Kaixiang Li and Lei Xia and Longlong Gu and Ming Li and Nie Hao and Ranchen Ming and Shaoliang Pang and Siqi Liu and Song Yuan and Tiancheng Cao and Wen Li and Wenqing He and Xu Zhao and Xuelin Zhang and Yanbo Yu and Yinmin Zhong and Yu Zhou and Yuanwei Liang and Yuanwei Lu and Yuxiang Yang and Zidong Yang and Zili Zhang and Binxing Jiao and Heung-Yeung Shum and Jiansheng Chen and Jing Li and Xiangyu Zhang and Xinhao Zhang and Yibo Zhu and Daxin Jiang and Shuchang Zhou and Chen Hu},
  journal={arXiv preprint arXiv:2506.08967},
  year={2025}
}