CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

10 April 2026

Haoyu Zhao, Zihao Zhang, Jiaxi Gu, Haoran Chen, Qingping Zheng, Pin Tang, Yeyin Jin, Yuang Zhang, Junqi Cheng, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang

Main: 12 pages; Bibliography: 3 pages; Appendix: 17 pages; 11 figures; 9 tables

Abstract

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either offer imprecise camera control through text prompts or rely on manually specified camera trajectory parameters, which is labor-intensive and limits their use in automated scenarios. To address these issues, we propose CT-1 (Camera Transformer 1), a Vision-Language-Camera model that transfers spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to learn complex camera trajectory distributions. The estimated trajectories are then integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework bridges the gap between spatial reasoning and video synthesis, yielding faithful, high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
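The abstract does not spell out the form of the Wavelet-based Regularization Loss, so the following is only a minimal sketch of what a frequency-domain penalty on camera trajectories could look like, not the authors' formulation. It assumes a trajectory is a list of per-frame parameter vectors, uses a Haar wavelet, and compares coefficients with an L1 distance; the names `haar_step`, `mean_abs_diff`, `wavelet_regularization_loss`, `levels`, and `detail_weight` are all hypothetical.

```python
import math


def haar_step(seq):
    """One Haar decomposition level along the time axis.

    seq: list of frames, each a list of floats (hypothetical camera parameters).
    Returns (approximation, detail), each with half as many frames.
    """
    s = math.sqrt(2.0)
    approx, detail = [], []
    for i in range(0, len(seq), 2):
        a, b = seq[i], seq[i + 1]
        approx.append([(x + y) / s for x, y in zip(a, b)])  # low-frequency part
        detail.append([(x - y) / s for x, y in zip(a, b)])  # high-frequency part
    return approx, detail


def mean_abs_diff(p, q):
    """Mean absolute difference over all frames and parameters."""
    diffs = [abs(x - y) for fp, fq in zip(p, q) for x, y in zip(fp, fq)]
    return sum(diffs) / len(diffs)


def wavelet_regularization_loss(pred, target, levels=2, detail_weight=1.0):
    """L1 distance between Haar coefficients of two trajectories (assumed form)."""
    n = len(pred)
    if n == 0 or n != len(target) or n % (2 ** levels) != 0:
        raise ValueError("trajectories must share a length divisible by 2**levels")
    loss = 0.0
    a_p, a_t = pred, target
    for _ in range(levels):
        a_p, d_p = haar_step(a_p)
        a_t, d_t = haar_step(a_t)
        loss += detail_weight * mean_abs_diff(d_p, d_t)  # per-level detail band
    return loss + mean_abs_diff(a_p, a_t)  # coarsest approximation band


# Usage: a smooth dolly-like target versus a prediction with frame-to-frame jitter.
target = [[0.1 * i, 0.0] for i in range(8)]
jittery = [[x + (0.05 if i % 2 else 0.0), y] for i, (x, y) in enumerate(target)]
print(wavelet_regularization_loss(jittery, target))
```

Separating the detail bands from the coarse approximation lets `detail_weight` control how strongly high-frequency jitter is penalized relative to the overall path; in practice such a term would be added to the main training objective rather than used alone.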
