This paper aims to achieve single-channel target speech extraction (TSE) in enclosures by utilizing distance clues and room information. Recent works have verified the feasibility of distance clues for the TSE task: a distance clue implies the sound source's direct-to-reverberation ratio (DRR) and can therefore be utilized in speech separation and TSE systems. However, such distance clues are significantly influenced by the room's acoustic characteristics, such as its dimensions and reverberation time, making it challenging for TSE systems that rely solely on distance clues to generalize across different rooms. To solve this, we suggest providing room environmental information (room dimensions and reverberation time) to distance-based TSE for better generalization. Specifically, we propose a distance- and environment-based TSE model in the time-frequency (TF) domain with learnable distance and room embeddings. Results on both simulated and real collected datasets demonstrate its feasibility. Demonstration materials are available at this https URL.
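To illustrate why a distance clue alone is ambiguous across rooms, the following minimal sketch estimates the DRR from the classic diffuse-field approximation, in which the critical distance is roughly 0.057 * sqrt(V / T60) for an omnidirectional source. This textbook model is an assumption chosen for illustration and is not the method of the paper; the function names and the two example rooms are hypothetical.

```python
import math


def critical_distance(volume_m3: float, rt60_s: float, directivity: float = 1.0) -> float:
    """Critical distance (m) under the diffuse-field approximation.

    Assumption: Sabine-style diffuse field; directivity=1.0 is an omnidirectional source.
    """
    return 0.057 * math.sqrt(directivity * volume_m3 / rt60_s)


def drr_db(distance_m: float, volume_m3: float, rt60_s: float) -> float:
    """Approximate DRR (dB): direct energy falls as 1/d^2, reverberant energy is constant."""
    d_c = critical_distance(volume_m3, rt60_s)
    return 20.0 * math.log10(d_c / distance_m)


# Hypothetical rooms: (width, depth, height) in metres and reverberation time in seconds.
rooms = {
    "small_dry": ((4.0, 3.0, 2.5), 0.3),
    "large_reverberant": ((10.0, 8.0, 3.0), 0.9),
}

for name, ((w, d, h), rt60) in rooms.items():
    volume = w * d * h
    # The same 1 m source distance maps to a different DRR in each room.
    print(f"{name}: DRR at 1 m = {drr_db(1.0, volume, rt60):.2f} dB")
```

Under this approximation, the same source distance yields a different DRR in each room, which is the ambiguity that supplying room dimensions and reverberation time is meant to resolve.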
@article{shi2025_2505.14433,
  title={Single-Channel Target Speech Extraction Utilizing Distance and Room Clues},
  author={Runwu Shi and Zirui Lin and Benjamin Yen and Jiang Wang and Ragib Amin Nihal and Kazuhiro Nakadai},
  journal={arXiv preprint arXiv:2505.14433},
  year={2025}
}