Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors

Face swapping manipulations in video streams represent an increasing threat in remote video communications, due to advances in automated and real-time tools. Recent literature proposes to characterize and exploit the visual artifacts that swapping algorithms introduce in video frames when dealing with challenging physical scenes, such as face occlusions. This paper investigates the effectiveness of this approach by benchmarking CNN-based data-driven models on two data corpora (including a newly collected one) and analyzing their generalization capabilities with respect to different acquisition sources and swapping algorithms. The results confirm the excellent performance of general-purpose CNN architectures when operating within the same data source, but reveal a significant difficulty in robustly characterizing occlusion-based visual cues across datasets. This highlights the need for specialized detection strategies to deal with such artifacts.
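A minimal sketch of the kind of intra- vs. cross-source benchmark the abstract describes, not the authors' code: an ImageNet-pretrained general-purpose CNN is fine-tuned as a binary real/swapped frame classifier on one corpus and evaluated on a second corpus from a different acquisition source. Directory names (`corpus_A`, `corpus_B`) and the specific backbone (ResNet-18) are illustrative assumptions.

```python
# Sketch: fine-tune a general-purpose CNN for real/swapped frame detection on one
# corpus, then compare intra-source vs. cross-source accuracy. Paths and backbone
# are hypothetical; the paper does not specify them.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Standard ImageNet preprocessing for the pretrained backbone.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Assumed folder layout: <root>/{real,swapped}/*.png (frames already extracted).
train_set  = datasets.ImageFolder("corpus_A/train", transform=preprocess)
intra_test = datasets.ImageFolder("corpus_A/test", transform=preprocess)  # same source
cross_test = datasets.ImageFolder("corpus_B/test", transform=preprocess)  # different source

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

# General-purpose CNN: pretrained ResNet-18 with a 2-class head (real vs. swapped).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for epoch in range(3):  # short fine-tuning run, for illustration only
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

@torch.no_grad()
def accuracy(dataset):
    """Frame-level detection accuracy on a given corpus."""
    model.eval()
    loader = DataLoader(dataset, batch_size=64)
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

print(f"intra-source accuracy: {accuracy(intra_test):.3f}")
print(f"cross-source accuracy: {accuracy(cross_test):.3f}")
```

Comparing the two printed accuracies mirrors the generalization analysis described in the abstract: a large gap between intra-source and cross-source performance would indicate that the detector relies on source-specific cues rather than robustly characterizing the occlusion-based artifacts themselves.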