Kling-Omni Technical Report

Kling Team
Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu
Main: 34 pages, 26 figures; Bibliography: 3 pages; 1 table
Abstract

We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual-language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality, intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations show that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Beyond serving as a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating, and interacting with dynamic and complex worlds.
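The report does not specify how the unified multimodal representation is constructed; as an illustration only, the sketch below shows one plausible way to pack heterogeneous conditions (a text instruction, reference-image tokens, and video-context tokens) into a single sequence that a generative backbone could attend over. All names here (`Segment`, `pack_multimodal_sequence`) are hypothetical and not taken from Kling-Omni.

```python
# Hypothetical sketch, not the Kling-Omni implementation: pack text, image,
# and video conditions into one flat token sequence with modality tags.
from dataclasses import dataclass
from typing import List, Literal, Tuple

Modality = Literal["text", "image", "video"]

@dataclass
class Segment:
    modality: Modality   # which input stream the tokens came from
    tokens: List[int]    # already-tokenized ids (e.g. text BPE or visual codes)

def pack_multimodal_sequence(segments: List[Segment]) -> List[Tuple[Modality, int, int]]:
    """Interleave all segments into one sequence of (modality, position, token)
    triples, so a single sequence model can condition on every input jointly."""
    unified = []
    for seg in segments:
        for pos, tok in enumerate(seg.tokens):
            unified.append((seg.modality, pos, tok))
    return unified

if __name__ == "__main__":
    prompt = Segment("text", tokens=[101, 4523, 88, 102])    # an edit instruction
    reference = Segment("image", tokens=[7, 42, 42, 9])      # codes of a reference image
    context = Segment("video", tokens=[3, 3, 5, 5, 8, 8])    # codes of context frames
    seq = pack_multimodal_sequence([prompt, reference, context])
    print(len(seq), seq[:3])
```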
