AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion

28 May 2025

Main: 4 pages, 1 figure, 3 tables; Bibliography: 1 page

Abstract

Diffusion models have significantly improved the quality and diversity of audio generation but are hindered by slow inference. Rectified flow speeds up inference by learning straight-line ordinary differential equation (ODE) paths. However, this approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. To address these limitations while leveraging advanced pre-trained diffusion models, this study integrates pre-trained models with the rectified diffusion method to improve the efficiency of text-to-audio (TTA) generation. Specifically, we propose AudioTurbo, which learns first-order ODE paths from deterministic noise-sample pairs generated by a pre-trained TTA model. Experiments on the AudioCaps dataset demonstrate that our model, with only 10 sampling steps, outperforms prior models and, compared with a flow-matching-based acceleration model, reduces inference to 3 steps.
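The idea of learning a first-order path from noise-sample pairs can be illustrated with a small sketch. It is not the AudioTurbo training recipe: it assumes the simple linear interpolation used in rectified flow, random arrays stand in for the noise and for the pre-trained model's outputs, and the names `interpolate`, `velocity_target`, `euler_sample`, and `oracle` are hypothetical. It shows only why a path with constant velocity can be integrated in very few Euler steps.

```python
import numpy as np

def interpolate(x0, eps, t):
    """Point at time t on the straight line from sample x0 (t=0) to noise eps (t=1)."""
    return (1.0 - t) * x0 + t * eps

def velocity_target(x0, eps):
    """Constant velocity of that straight path: d x_t / d t = eps - x0."""
    return eps - x0

def euler_sample(velocity_fn, eps, num_steps):
    """Integrate from t=1 (noise) back to t=0 (sample) with fixed-size Euler steps."""
    x = eps.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
eps = rng.standard_normal((4, 8))  # stand-in for the initial noise
x0 = rng.standard_normal((4, 8))   # stand-in for the teacher output paired with eps

# Assumption: an oracle velocity field that knows the pairing exactly.
# A trained network would only approximate this from (eps, x0) pairs.
oracle = lambda x, t: velocity_target(x0, eps)

x_3 = euler_sample(oracle, eps, num_steps=3)
x_10 = euler_sample(oracle, eps, num_steps=10)
```

With an exactly straight path, the 3-step and 10-step results both land on the paired sample up to floating-point error. A learned model only approximates the velocity field, so the step count still matters in practice.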


@article{zhao2025_2505.22106,
  title={AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion},
  author={Junqi Zhao and Jinzheng Zhao and Haohe Liu and Yun Chen and Lu Han and Xubo Liu and Mark Plumbley and Wenwu Wang},
  journal={arXiv preprint arXiv:2505.22106},
  year={2025}
}
