Signal-Guided Source Separation
State-of-the-art separation of desired signal components (DSCs) from a mixture is achieved using time-frequency masks or filters estimated by a deep neural network (DNN). The DSCs are typically defined at training time, or alternatively during inference via a reference signal (RS). In the latter case, an auxiliary DNN typically extracts signal characteristics (SCs) from the RS and estimates a set of adaptive weights (AWs) for the first DNN. In both cases, information about the DSCs is stored in the DNN weights. Current methods using audio RSs estimate time-invariant AWs. Applications where the RS and DSCs exhibit time-variant SCs, i.e., where they cannot be assigned to a specific class like speech, require time-variant AWs. An example is acoustic echo cancellation with the loudspeaker signal as RS. We propose a method to extract time-variant AWs from an RS and additionally show that current time-invariant AW methods can be employed for universal source separation. To avoid strong scaling between the estimate and the mixture, we propose training with the dual scale-invariant signal-to-distortion ratio in a TasNet-inspired DNN. We evaluate the proposed AW systems under various acoustic conditions and show the scenario-dependent advantages of time-variant over time-invariant AWs.
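As a point of reference for the loss mentioned above: the abstract's "dual" scale-invariant signal-to-distortion ratio is the paper's specific objective, but it builds on the standard SI-SDR metric. The sketch below shows only that standard single-term SI-SDR in NumPy; the function name `si_sdr` and the `eps` stabilizer are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB.

    The estimate is decomposed into a component along the target and a
    residual, so any global rescaling of the estimate leaves the value
    unchanged (illustrative sketch, not the paper's dual SI-SDR loss).
    """
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    # Optimal scaling of the target toward the estimate (projection coefficient).
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target          # target-aligned component
    noise = estimate - projection        # remaining distortion
    return 10.0 * np.log10(np.dot(projection, projection)
                           / (np.dot(noise, noise) + eps))
```

For example, multiplying the estimate by any nonzero constant leaves the score (essentially) unchanged, which is exactly the scale invariance the training objective exploits.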