ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2506.16833
10
0

Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training

20 June 2025
Jianyuan Feng
Guangzheng Li
Yangfei Xu
    VLM
ArXiv (abs)PDFHTML
Main:4 Pages
5 Figures
Bibliography:1 Pages
2 Tables
Abstract

Language-queried Audio Separation (LASS) employs linguistic queries to isolate target sounds based on semantic descriptions. However, existing methods face challenges in aligning complex auditory features with linguistic context while preserving separation precision. Current research efforts focus primarily on text description augmentation and architectural innovations, yet the potential of integrating pre-trained self-supervised learning (SSL) audio models and Contrastive Language-Audio Pretraining (CLAP) frameworks, capable of extracting cross-modal audio-text relationships, remains underexplored. To address this, we present HybridSep, a two-stage LASS framework that synergizes SSL-based acoustic representations with CLAP-derived semantic embeddings. Our framework introduces Adversarial Consistent Training (ACT), a novel optimization strategy that treats diffusion as an auxiliary regularization loss while integrating adversarial training to enhance separation fidelity. Experiments demonstrate that HybridSep achieves significant performance improvements over state-of-the-art baselines (e.g., AudioSep, FlowSep) across multiple metrics, establishing new benchmarks for LASS tasks.

View on arXiv
@article{feng2025_2506.16833,
  title={ Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training },
  author={ Jianyuan Feng and Guangzheng Li and Yangfei Xu },
  journal={arXiv preprint arXiv:2506.16833},
  year={ 2025 }
}
Comments on this paper