Fast Text-to-Audio Generation with Adversarial Post-Training

13 May 2025
Zachary Novack
Zach Evans
Zack Zukowski
Josiah Taylor
CJ Carr
Julian Parker
Adnan Al-Sinan
Gian Marco Iodice
Julian McAuley
Taylor Berg-Kirkpatrick
Jordi Pons
Abstract

Text-to-audio systems, while increasingly performant, are slow at inference time, making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to compare against their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating ≈12 s of 44.1 kHz stereo audio in ≈75 ms on an H100, and in ≈7 s on a mobile edge device, the fastest text-to-audio model to our knowledge.
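
To put the reported latencies in perspective, the short sketch below converts them into a real-time factor (seconds of audio produced per second of compute). It uses only the approximate figures quoted in the abstract; the function name `real_time_factor` and the variable names are illustrative and do not come from the paper, and the results inherit the "≈" uncertainty of the inputs.

```python
# Back-of-the-envelope arithmetic on the approximate figures in the abstract.
# All names here are illustrative, not taken from the paper or its code.

AUDIO_MS = 12_000        # ~12 s of generated audio, in milliseconds
SAMPLE_RATE_HZ = 44_100  # 44.1 kHz
CHANNELS = 2             # stereo

H100_LATENCY_MS = 75     # ~75 ms on an H100
MOBILE_LATENCY_MS = 7_000  # ~7 s on a mobile edge device


def real_time_factor(audio_ms: float, latency_ms: float) -> float:
    """Return seconds of audio generated per second of wall-clock compute."""
    return audio_ms / latency_ms


# Total number of individual sample values in one generated clip.
total_samples = (AUDIO_MS // 1000) * SAMPLE_RATE_HZ * CHANNELS

h100_rtf = real_time_factor(AUDIO_MS, H100_LATENCY_MS)
mobile_rtf = real_time_factor(AUDIO_MS, MOBILE_LATENCY_MS)

print(f"samples per clip: {total_samples}")
print(f"H100 real-time factor:   ~{h100_rtf:.0f}x")
print(f"mobile real-time factor: ~{mobile_rtf:.1f}x")
```

Under these figures, generation is roughly 160 times faster than real time on the H100 and still faster than real time (about 1.7 times) on the mobile device.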

@article{novack2025_2505.08175,
  title={Fast Text-to-Audio Generation with Adversarial Post-Training},
  author={Zachary Novack and Zach Evans and Zack Zukowski and Josiah Taylor and CJ Carr and Julian Parker and Adnan Al-Sinan and Gian Marco Iodice and Julian McAuley and Taylor Berg-Kirkpatrick and Jordi Pons},
  journal={arXiv preprint arXiv:2505.08175},
  year={2025}
}