Kling-Omni Technical Report

Kling Team
Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu
Main: 34 pages, 26 figures; Bibliography: 3 pages; 1 table
Abstract

We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual-language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality, intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations show that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Beyond serving as a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating, and interacting with dynamic and complex worlds.
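The report does not specify how the unified multimodal representation is constructed; as an illustration only, the sketch below shows one plausible way to pack heterogeneous conditions (a text instruction, reference-image tokens, and video-context tokens) into a single sequence that a generative backbone could attend over. All names here (`Segment`, `pack_multimodal_sequence`) are hypothetical and not taken from Kling-Omni.

```python
# Hypothetical sketch, not the Kling-Omni implementation: pack text, image,
# and video conditions into one flat token sequence with modality tags.
from dataclasses import dataclass
from typing import List, Literal, Tuple

Modality = Literal["text", "image", "video"]

@dataclass
class Segment:
    modality: Modality   # which input stream the tokens came from
    tokens: List[int]    # already-tokenized ids (e.g. text BPE or visual codes)

def pack_multimodal_sequence(segments: List[Segment]) -> List[Tuple[Modality, int, int]]:
    """Interleave all segments into one sequence of (modality, position, token)
    triples, so a single sequence model can condition on every input jointly."""
    unified = []
    for seg in segments:
        for pos, tok in enumerate(seg.tokens):
            unified.append((seg.modality, pos, tok))
    return unified

if __name__ == "__main__":
    prompt = Segment("text", tokens=[101, 4523, 88, 102])    # an edit instruction
    reference = Segment("image", tokens=[7, 42, 42, 9])      # codes of a reference image
    context = Segment("video", tokens=[3, 3, 5, 5, 8, 8])    # codes of context frames
    seq = pack_multimodal_sequence([prompt, reference, context])
    print(len(seq), seq[:3])
```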
