The Trinity of Consistency as a Defining Principle for General World Models

26 February 2026

Jingxuan Wei

Siyuan Li

Yuhang Xu

Zheng Sun

Junjie Jiang

Hexuan Jin

Caijun Jia

Honghao He

Xinglong Xu

Xi Bai

Chang Yu

Yumou Liu

Junnan Zhu

Xuanhe Zhou

Jintao Chen

Xiaobin Hu

Shancheng Pang

Bihui Yu

Ran He

Zhen Lei

Stan Z. Li

Conghui He

Shuicheng Yan

Cheng Tan

VGen

Main: 97 Pages

52 Figures

Bibliography: 22 Pages

15 Tables

Abstract

The construction of World Models capable of learning, simulating, and reasoning about objective physical laws constitutes a foundational challenge in the pursuit of Artificial General Intelligence. Recent advances, exemplified by video generation models such as Sora, have demonstrated the potential of data-driven scaling laws to approximate physical dynamics, while the emerging Unified Multimodal Model (UMM) offers a promising architectural paradigm for integrating perception, language, and reasoning. Despite these advances, the field still lacks a principled theoretical framework that defines the properties essential to a General World Model. In this paper, we propose that a World Model must be grounded in the Trinity of Consistency: Modal Consistency as the semantic interface, Spatial Consistency as the geometric basis, and Temporal Consistency as the causal engine. Through this tripartite lens, we systematically review the evolution of multimodal learning, revealing a trajectory from loosely coupled specialized modules toward unified architectures that enable the synergistic emergence of internal world simulators. To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios. CoW-Bench evaluates both video generation models and UMMs under a unified evaluation protocol. Our work establishes a principled pathway toward general world models, clarifying both the limitations of current systems and the architectural requirements for future progress.
