
Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing

Jialun Liu
Tian Li
Xiao Cao
Yukuo Ma
Gonghu Shang
Haibin Huang
Chi Zhang
Xiangzhen Chang
Zhiyong Huang
Jiakui Hu
Zuoxin Li
Yuanzhi Liang
Cong Liu
Junqi Liu
Robby T. Tan
Haitong Tang
Qizhen Weng
Yifan Xu
Liying Yang
Xiaoyan Yang
Peng Yu
Shiwen Zhang
Xuelong Li
Main: 12 pages
10 figures
Bibliography: 6 pages
Abstract

Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while diffusion-based generators perform high-quality video synthesis conditioned on these structured signals. To enable joint training across heterogeneous video tasks, we introduce a task-aware data processing pipeline that unifies multimodal inputs into a structured instruction format while preserving task-specific constraints. Tele-Omni supports a wide range of video-centric tasks, including text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing. By decoupling instruction parsing from video synthesis and combining it with task-aware data design, Tele-Omni achieves flexible multimodal control while maintaining strong temporal coherence and visual consistency. Experimental results demonstrate that Tele-Omni achieves competitive performance across multiple tasks.
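To make the idea of a structured instruction format with task-specific constraints more concrete, the following is a minimal Python sketch. It is not the authors' pipeline: the class `UnifiedInstruction`, the function `validate`, the table `TASK_RULES`, and every constraint in it (for example, that first-last-frame generation takes exactly two images) are hypothetical and chosen only for illustration. Only the five task names come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical per-task rules: each maps (num_images, num_videos) to a bool.
# Task names follow the abstract; the rules themselves are assumptions.
TASK_RULES: Dict[str, Callable[[int, int], bool]] = {
    "text_to_video": lambda i, v: i == 0 and v == 0,
    "image_to_video": lambda i, v: i == 1 and v == 0,
    "first_last_frame": lambda i, v: i == 2 and v == 0,
    "in_context_generation": lambda i, v: i + v >= 1,
    "in_context_editing": lambda i, v: v >= 1,
}


@dataclass
class UnifiedInstruction:
    """One record in an assumed unified format shared by all tasks."""
    task: str
    text: str
    images: List[str] = field(default_factory=list)  # paths to reference images
    videos: List[str] = field(default_factory=list)  # paths to reference videos


def validate(instr: UnifiedInstruction) -> UnifiedInstruction:
    """Return the instruction unchanged, or raise ValueError if it breaks its task's rule."""
    rule = TASK_RULES.get(instr.task)
    if rule is None:
        raise ValueError(f"unknown task: {instr.task}")
    if not instr.text.strip():
        raise ValueError("text instruction must be non-empty")
    if not rule(len(instr.images), len(instr.videos)):
        raise ValueError(
            f"{instr.task}: unsupported input counts "
            f"(images={len(instr.images)}, videos={len(instr.videos)})"
        )
    return instr


if __name__ == "__main__":
    ok = validate(UnifiedInstruction("image_to_video", "slow pan left", images=["frame.png"]))
    print(ok.task, len(ok.images))
```

Under these assumptions, all tasks share one record type, so a single training loop could consume them jointly, while the per-task rule keeps malformed samples (such as a first-last-frame request with one image) out of the data.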
