Leveraging large multimodal models for audio-video deepfake detection: a pilot study

Songjun Cao
Yuqi Li
Yunpeng Luo
Jianjun Yin
Long Ma
Main: 4 pages · Bibliography: 1 page · 3 figures · 3 tables
Abstract

Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they perform well on curated benchmarks but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification task ("Is this video real or fake?"). Built on Qwen 2.5 Omni, it jointly analyzes the audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by full fine-tuning of the audio-visual encoders. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on the Mavos-DD dataset.
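The prompted yes/no formulation above can be illustrated with a minimal sketch. The prompt template and the `parse_answer` helper below are illustrative assumptions, not the authors' released code; the actual model call to Qwen 2.5 Omni is elided.

```python
# Hypothetical sketch of the prompted yes/no formulation: the detector
# answers "Is this video real or fake?" and the free-form response is
# mapped to a binary label. Prompt text and parsing logic are assumptions.

PROMPT = "Is this video real or fake? Answer with 'real' or 'fake'."

def parse_answer(response: str) -> str:
    """Map a free-form model response to a binary label."""
    text = response.strip().lower()
    if "fake" in text:
        return "fake"
    if "real" in text:
        return "real"
    return "unknown"  # model gave an unusable answer

print(parse_answer("The video appears to be FAKE."))  # → fake
```

In practice the prompt would be issued alongside the audio and visual streams of the clip under test, with the SFT objective supervising the yes/no answer.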
