Tele-Omni: A Unified Multimodal Framework for Video Generation and Editing

10 February 2026

Jialun Liu

Tian Li

Xiao Cao

Yukuo Ma

Gonghu Shang

Haibin Huang

Chi Zhang

Xiangzhen Chang

Zhiyong Huang

Jiakui Hu

Zuoxin Li

Yuanzhi Liang

Cong Liu

Junqi Liu

Robby T. Tan

Haitong Tang

Qizhen Weng

Yifan Xu

Liying Yang

Xiaoyan Yang

Peng Yu

Shiwen Zhang

Xuelong Li

DiffM

VGen

Main: 12 Pages

10 Figures

Bibliography: 6 Pages

Abstract

Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while diffusion-based generators perform high-quality video synthesis conditioned on these structured signals. To enable joint training across heterogeneous video tasks, we introduce a task-aware data processing pipeline that unifies multimodal inputs into a structured instruction format while preserving task-specific constraints. Tele-Omni supports a wide range of video-centric tasks, including text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing. By decoupling instruction parsing from video synthesis and combining it with task-aware data design, Tele-Omni achieves flexible multimodal control while maintaining strong temporal coherence and visual consistency. Experimental results demonstrate that Tele-Omni achieves competitive performance across multiple tasks.
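
The abstract describes unifying multimodal inputs into a structured instruction format while preserving task-specific constraints, but it does not specify the schema. The following is a minimal illustrative sketch of what such a format might look like, not the authors' implementation. The names `Task`, `Instruction`, and `validate`, the field layout, and the per-task input counts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Task(Enum):
    # The five task families listed in the abstract; the short codes are hypothetical.
    TEXT_TO_VIDEO = "t2v"
    IMAGE_TO_VIDEO = "i2v"
    FIRST_LAST_FRAME = "flf2v"
    IN_CONTEXT_GENERATION = "ic_gen"
    IN_CONTEXT_EDITING = "ic_edit"


@dataclass
class Instruction:
    """One structured instruction record (hypothetical schema)."""
    task: Task
    text: str = ""
    images: List[str] = field(default_factory=list)  # paths to reference images
    videos: List[str] = field(default_factory=list)  # paths to reference videos


def validate(inst: Instruction) -> None:
    """Raise ValueError if the inputs violate the assumed constraints for the task."""
    n_img, n_vid = len(inst.images), len(inst.videos)
    has_text = bool(inst.text.strip())

    if inst.task is Task.TEXT_TO_VIDEO:
        ok = has_text and n_img == 0 and n_vid == 0
    elif inst.task is Task.IMAGE_TO_VIDEO:
        ok = n_img == 1 and n_vid == 0            # assumed: exactly one start frame
    elif inst.task is Task.FIRST_LAST_FRAME:
        ok = n_img == 2 and n_vid == 0            # assumed: first and last frame
    elif inst.task is Task.IN_CONTEXT_GENERATION:
        ok = has_text and (n_img + n_vid) >= 1    # assumed: at least one reference
    else:  # Task.IN_CONTEXT_EDITING
        ok = has_text and n_vid >= 1              # assumed: a source video to edit

    if not ok:
        raise ValueError(
            f"invalid inputs for {inst.task.value}: "
            f"text={has_text}, images={n_img}, videos={n_vid}"
        )


if __name__ == "__main__":
    validate(Instruction(Task.TEXT_TO_VIDEO, text="a cat walking on a beach"))
    validate(Instruction(Task.FIRST_LAST_FRAME, images=["first.png", "last.png"]))
    print("all example instructions are valid")
```

Keeping the task type explicit in each record is one way a single model could be trained jointly on heterogeneous tasks while each sample still carries its own input requirements.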
