KlingAvatar 2.0 Technical Report

15 December 2025
Kling Team
Jialu Chen
Yikang Ding
Zhixue Fang
Kun Gai
Yuan Gao
Kang He
Jingyun Hua
Boyuan Jiang
Mingming Lao
Xiaohan Li
Hui Liu
Jiwen Liu
Xiaoqiang Liu
Yuan Liu
Shun Lu
Yongsen Mao
Yingchao Shao
Huafeng Shi
Xiaoyu Shi
Peiqin Sun
Songlin Tang
Pengfei Wan
Chao Wang
Xuebo Wang
Haoxian Zhang
Yuanxing Zhang
Yan Zhou
Main: 10 pages, 9 figures, 1 table. Bibliography: 4 pages.
Abstract

Avatar video generation models have achieved remarkable progress in recent years. However, prior work is inefficient at generating long-duration, high-resolution videos and suffers from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both the spatial and temporal dimensions. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, then refines them into high-resolution, temporally coherent sub-clips using a first-last-frame strategy, preserving smooth temporal transitions across long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities, infer the underlying user intent, and convert the inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned, long-form, high-resolution video generation, delivering enhanced visual clarity, realistic lip and teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
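The abstract describes a pipeline with three stages: a director stage that expands multimodal inputs into a storyline, a blueprint stage that lays out low-resolution keyframes spanning the full duration, and a refinement stage that turns adjacent keyframe pairs into high-resolution sub-clips conditioned on their first and last frames. The sketch below illustrates that control flow only; every interface in it (the experts' refine method, the generators' sample method, the Frame type, the negative_director object) is a hypothetical stand-in, since the report does not publish an implementation.

```python
# Illustrative sketch of the spatio-temporal cascade outlined in the abstract.
# All model interfaces here are hypothetical stand-ins, not a published API.

from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    pixels: object       # e.g. an H x W x C array
    resolution: tuple    # (height, width)

def co_reasoning_director(text, image, audio, experts, rounds=3):
    """Three modality-specific LLM experts converse over several turns,
    weighing modality priorities and expanding the inputs into a storyline."""
    storyline = {"text": text, "image": image, "audio": audio}
    for _ in range(rounds):
        for expert in experts:          # one expert per input modality
            storyline = expert.refine(storyline)
    return storyline

def generate_long_video(storyline, blueprint_model, refiner_model,
                        negative_director, n_keyframes=8):
    # Stage 1: low-resolution blueprint keyframes carrying the global
    # semantics and motion of the whole video; the Negative Director
    # supplies a refined negative prompt for instruction alignment.
    keyframes: List[Frame] = blueprint_model.sample(
        storyline,
        negative_prompt=negative_director.refine(storyline),
        num_frames=n_keyframes,
    )

    # Stage 2: refine each adjacent keyframe pair into a high-resolution
    # sub-clip, conditioning on the first and last frame so consecutive
    # sub-clips share their boundary frames.
    sub_clips = []
    for first, last in zip(keyframes[:-1], keyframes[1:]):
        clip = refiner_model.sample(storyline,
                                    first_frame=first, last_frame=last)
        sub_clips.append(clip)

    # Concatenate, dropping the duplicated boundary frame between clips.
    video = list(sub_clips[0])
    for clip in sub_clips[1:]:
        video.extend(clip[1:])
    return video
```

The first-last-frame conditioning is the load-bearing design choice in this reading: because each sub-clip shares its boundary frames with its neighbors, the concatenated long video inherits the blueprint's global motion and remains temporally smooth, rather than drifting clip by clip.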
