Concat-ID: Towards Universal Identity-Preserving Video Synthesis

Abstract

We present Concat-ID, a unified framework for identity-preserving video generation. Concat-ID employs Variational Autoencoders to extract image features, which are concatenated with video latents along the sequence dimension, leveraging solely 3D self-attention mechanisms without the need for additional modules. A novel cross-video pairing strategy and a multi-stage training regimen are introduced to balance identity consistency and facial editability while enhancing video naturalness. Extensive experiments demonstrate Concat-ID's superiority over existing methods in both single and multi-identity generation, as well as its seamless scalability to multi-subject scenarios, including virtual try-on and background-controllable generation. Concat-ID establishes a new benchmark for identity-preserving video synthesis, providing a versatile and scalable solution for a wide range of applications.
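The core mechanism described above — concatenating VAE-extracted image features with video latents along the sequence dimension so a single self-attention pass mixes them — can be sketched as follows. This is an illustrative toy, not the paper's implementation: the shapes, the single-head attention with identity projections, and the token counts are all assumptions chosen for clarity, standing in for the backbone's 3D self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Toy single-head self-attention with identity Q/K/V projections,
    # standing in for the 3D self-attention of the video backbone.
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

rng = np.random.default_rng(0)

# Hypothetical shapes: 4 reference-image tokens and 16 video-latent
# tokens, each with 8 channels (illustrative stand-ins for VAE outputs).
img_tokens = rng.standard_normal((4, 8))
vid_tokens = rng.standard_normal((16, 8))

# Concat-ID's core idea: concatenate along the sequence dimension so
# identity tokens and video tokens attend to each other jointly,
# with no extra cross-attention or adapter modules.
combined = np.concatenate([img_tokens, vid_tokens], axis=0)  # (20, 8)
out = self_attention(combined)

# Only the video portion of the sequence is kept as the output.
video_out = out[img_tokens.shape[0]:]                        # (16, 8)
```

The point of the sketch is that identity conditioning rides on the existing attention pathway: the only architectural change is a longer token sequence, which is why the method scales to multiple identities by concatenating more reference tokens.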

@article{zhong2025_2503.14151,
  title={Concat-ID: Towards Universal Identity-Preserving Video Synthesis},
  author={Yong Zhong and Zhuoyi Yang and Jiayan Teng and Xiaotao Gu and Chongxuan Li},
  journal={arXiv preprint arXiv:2503.14151},
  year={2025}
}