
Towards Generalized Source Tracing for Codec-Based Deepfake Speech

Main: 6 pages, 5 figures; Bibliography: 2 pages
Abstract

Recent attempts at source tracing for codec-based deepfake speech (CodecFake), generated by neural audio codec-based speech generation (CoSG) models, have exhibited suboptimal performance. In particular, how to train source tracing models on simulated CoSG data while maintaining strong performance on real CoSG-generated audio remains an open challenge. In this paper, we show that models trained solely on codec-resynthesized data tend to overfit to non-speech regions and struggle to generalize to unseen content. To mitigate these challenges, we introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.

@article{chen2025_2506.07294,
  title={Towards Generalized Source Tracing for Codec-Based Deepfake Speech},
  author={Xuanjun Chen and I-Ming Lin and Lin Zhang and Haibin Wu and Hung-yi Lee and Jyh-Shing Roger Jang},
  journal={arXiv preprint arXiv:2506.07294},
  year={2025}
}