Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training

Main: 9 Pages
12 Figures
Bibliography: 4 Pages
4 Tables
Appendix: 8 Pages
Abstract

Text-image-to-video (TI2V) generation is a critical problem in controllable video generation, which uses both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is resource-intensive and limited to a few predefined conditioning settings. To tackle this issue, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, a dynamic control mechanism adjusts the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. We also provide further insights into our method through detailed ablation studies and analysis.
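The abstract outlines a three-step procedure: invert the condition images into noisy latents, swap randomly chosen latent patches into the conditioned frames during denoising, and modulate how many patches are swapped per frame. Below is a minimal sketch of the patch-swapping step only, assuming latents shaped [frames, channels, H, W] and a per-frame swap ratio standing in for the dynamic control mechanism; all names, shapes, and the schedule are assumptions for illustration, not the authors' implementation.

```python
import torch

def random_patch_swap(video_latents, image_latents, cond_positions,
                      patch_size=4, swap_ratio=0.3):
    """Replace randomly chosen latent patches of conditioned frames with
    patches from the inverted condition images (hypothetical sketch).

    video_latents: [T, C, H, W] noisy video latents at the current step
    image_latents: dict {frame_index: [C, H, W]} inverted condition images
    cond_positions: frame indices that receive visual conditioning
    swap_ratio: fraction of patches replaced (proxy for conditioning strength)
    """
    T, C, H, W = video_latents.shape
    grid_h, grid_w = H // patch_size, W // patch_size   # patch grid size
    out = video_latents.clone()
    for t in cond_positions:
        # Pick a random subset of patch locations for this frame.
        n_patches = grid_h * grid_w
        n_swap = int(swap_ratio * n_patches)
        chosen = torch.randperm(n_patches)[:n_swap]
        for p in chosen:
            i, j = divmod(int(p), grid_w)
            ys, xs = i * patch_size, j * patch_size
            # Copy the corresponding patch from the inverted condition image.
            out[t, :, ys:ys + patch_size, xs:xs + patch_size] = \
                image_latents[t][:, ys:ys + patch_size, xs:xs + patch_size]
    return out
```

In practice, the swap ratio would presumably decay over denoising steps and vary with a frame's distance from the conditioned positions, trading fidelity to the condition images against the model's creative freedom.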

@article{lai2025_2505.20629,
  title={Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training},
  author={Bolin Lai and Sangmin Lee and Xu Cao and Xiang Li and James M. Rehg},
  journal={arXiv preprint arXiv:2505.20629},
  year={2025}
}