
LayerFlow: A Unified Model for Layer-aware Video Generation

Main: 7 Pages · 9 Figures · Bibliography: 2 Pages · 3 Tables · Appendix: 1 Page
Abstract

We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants such as decomposing a blended video into layers, or generating the background for a given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for the different layers as sub-clips and leverage layer embeddings to distinguish each clip and its corresponding layer-wise prompt. In this way, we seamlessly support all of the aforementioned variants in one unified framework. To address the lack of high-quality layer-wise training videos, we design a multi-stage training strategy that accommodates static images with high-quality layer annotations. Specifically, we first train the model on low-quality video data. We then tune a motion LoRA to make the model compatible with static frames. Afterward, we train a content LoRA on a mixture of high-quality layered images and copy-pasted video data. During inference, we remove the motion LoRA, thus generating smooth videos with the desired layers.
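
As a rough illustration of the sub-clip organization described above, the sketch below tags each layer's token sequence with a learned layer embedding before concatenating the sub-clips for joint attention. The layer ids, tensor shapes, and module names are our own assumptions for illustration, not the paper's released code.

    import torch
    import torch.nn as nn

    class LayerEmbedding(nn.Module):
        # Learned embedding that tags each layer's sub-clip tokens
        # (assumed ids: 0 = foreground, 1 = background, 2 = blended scene).
        def __init__(self, num_layers: int = 3, dim: int = 1024):
            super().__init__()
            self.embed = nn.Embedding(num_layers, dim)

        def forward(self, clip_tokens: torch.Tensor, layer_id: int) -> torch.Tensor:
            # clip_tokens: (batch, tokens, dim) for one layer's sub-clip.
            tag = self.embed(torch.tensor(layer_id, device=clip_tokens.device))
            return clip_tokens + tag  # broadcast the (dim,) tag over all tokens

    def assemble_sub_clips(fg, bg, blend, layer_emb: LayerEmbedding) -> torch.Tensor:
        # Concatenate the tagged sub-clips along the token axis so the
        # diffusion transformer can attend across all three layers jointly.
        return torch.cat(
            [layer_emb(fg, 0), layer_emb(bg, 1), layer_emb(blend, 2)], dim=1
        )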
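The staged LoRA training can likewise be pictured with a minimal low-rank adapter that is switched off at inference, mirroring the removal of the motion LoRA. The rank, scaling factor, and the enabled flag below are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # A linear layer with a low-rank residual adapter that can be
        # disabled, e.g. to drop the motion LoRA at inference time.
        def __init__(self, base: nn.Linear, rank: int = 16, scale: float = 1.0):
            super().__init__()
            self.base = base
            self.down = nn.Linear(base.in_features, rank, bias=False)
            self.up = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.up.weight)  # zero-init: adapter starts as a no-op
            self.scale = scale
            self.enabled = True

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out = self.base(x)
            if self.enabled:
                out = out + self.scale * self.up(self.down(x))
            return out

Because the adapter is purely additive, setting enabled = False (analogous to removing the motion LoRA) cleanly restores the base model's behavior, which is what lets the motion prior of the pretrained video model re-emerge at inference.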

@article{ji2025_2506.04228,
  title={LayerFlow: A Unified Model for Layer-aware Video Generation},
  author={Sihui Ji and Hao Luo and Xi Chen and Yuanpeng Tu and Yiyang Wang and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2506.04228},
  year={2025}
}