
DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization

3 June 2025

Main: 4 pages

1 figure

Bibliography: 1 page

4 tables

Abstract

Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. While existing methods rely on task-specific training, we explore whether pretrained diffusion models, originally designed for audio generation, can inherently perform separation without further training. In this study, we introduce a training-free framework leveraging generative priors for zero-shot LASS. Analyzing naive adaptations, we identify key limitations arising from modality-specific challenges. To address these issues, we propose Diffusion-Guided Mask Optimization (DGMO), a test-time optimization framework that refines spectrogram masks for precise, input-aligned separation. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision. This work expands the application of diffusion models beyond generation, establishing a new paradigm for zero-shot audio separation. The code is available at this https URL.
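
To make the idea of test-time spectrogram mask optimization concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy magnitude spectrogram of the mixture and a hand-made `reference` spectrogram that stands in for whatever guidance a pretrained diffusion model would supply. The function name `optimize_mask`, the sigmoid parameterization, the squared-error loss, and the plain gradient-descent update are all illustrative assumptions. What the sketch shows is only that a mask confined to [0, 1] and applied to the mixture keeps the output tied to the input while being fitted to a reference.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def optimize_mask(mixture, reference, steps=2000, lr=1.0):
    """Fit a mask in (0, 1) so that mask * mixture approximates reference.

    mixture, reference: magnitude spectrograms as lists of rows (freq x time).
    Hypothetical helper: minimizes a per-bin squared error by gradient descent.
    """
    n_freq, n_time = len(mixture), len(mixture[0])
    # Unconstrained logits; the sigmoid keeps the mask between 0 and 1.
    logits = [[0.0] * n_time for _ in range(n_freq)]
    for _ in range(steps):
        for f in range(n_freq):
            for t in range(n_time):
                m = sigmoid(logits[f][t])
                x = mixture[f][t]
                err = m * x - reference[f][t]
                # d/dlogit of err**2, using dm/dlogit = m * (1 - m)
                grad = 2.0 * err * x * m * (1.0 - m)
                logits[f][t] -= lr * grad
    return [[sigmoid(v) for v in row] for row in logits]


# Toy data (assumed): a 2x2 mixture and the mask we hope to recover.
ideal = [[0.9, 0.2], [0.1, 0.8]]
mixture = [[1.0, 2.0], [2.0, 1.0]]
# Stand-in for a diffusion-derived reference of the target source.
reference = [[m * x for m, x in zip(m_row, x_row)]
             for m_row, x_row in zip(ideal, mixture)]

mask = optimize_mask(mixture, reference)
separated = [[m * x for m, x in zip(m_row, x_row)]
             for m_row, x_row in zip(mask, mixture)]
```

Because the separated output is always the mixture scaled bin by bin, it cannot contain energy that the input lacks, which is one plausible reading of "input-aligned" in the abstract.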


@article{lee2025_2506.02858,
  title={DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization},
  author={Geonyoung Lee and Geonhee Han and Paul Hongsuck Seo},
  journal={arXiv preprint arXiv:2506.02858},
  year={2025}
}
