
Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion

19 June 2025

Markus Frohmann

Gabriel Meseguer-Brocal

Markus Schedl

Elena V. Epure

Main: 4 Pages

4 Figures

Bibliography: 5 Pages

8 Tables

Appendix: 4 Pages

Abstract

The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at this https URL.
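
To make the late-fusion idea concrete, here is a minimal sketch, not the authors' implementation. It assumes that each view (an embedding of the transcribed lyrics and an embedding of speech features) has already been computed by a separate upstream model, that fusion is done by simple concatenation, and that the classifier is a plain logistic regression. All function names, embedding sizes, and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np

def fuse_views(lyrics_emb, speech_emb):
    """Late fusion by concatenation: each view is embedded separately
    (by hypothetical upstream models) and only joined at the end."""
    return np.concatenate([lyrics_emb, speech_emb], axis=-1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression classifier with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        grad = p - y                      # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_proba(X, w, b):
    """Probability that each fused example is AI-generated (label 1)."""
    return sigmoid(X @ w + b)

# Synthetic stand-ins for the two views (assumption: NOT real embeddings).
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n).astype(float)         # 1 = AI-generated
lyrics_emb = rng.normal(size=(n, 8)) + 1.0 * y[:, None]
speech_emb = rng.normal(size=(n, 4)) + 0.5 * y[:, None]

X = fuse_views(lyrics_emb, speech_emb)                # shape (n, 8 + 4)
w, b = train_logreg(X, y)
acc = ((predict_proba(X, w, b) > 0.5) == y).mean()    # training accuracy
print(f"fused feature dim: {X.shape[1]}, train accuracy: {acc:.2f}")
```

Because the views are only combined at the final step, either upstream model could in principle be swapped out without retraining the other, which is one plausible reading of "modular" in the abstract.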


@article{frohmann2025_2506.15981,
  title={Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion},
  author={Markus Frohmann and Gabriel Meseguer-Brocal and Markus Schedl and Elena V. Epure},
  journal={arXiv preprint arXiv:2506.15981},
  year={2025}
}
