STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation

16 June 2025
Jiamin Wang
Yichen Yao
Xiang Feng
Hang Wu
Yaming Wang
Qingqiu Huang
Yuexin Ma
Xinge Zhu
Main: 6 pages · 6 figures · Bibliography: 1 page
Abstract

The generation of temporally consistent, high-fidelity driving videos over extended horizons is a fundamental challenge in autonomous driving world modeling. Existing approaches often suffer from error accumulation and feature misalignment due to inadequate decoupling of spatio-temporal dynamics and limited cross-frame feature propagation. To address these limitations, we present STAGE (Streaming Temporal Attention Generative Engine), a novel auto-regressive framework that pioneers hierarchical feature coordination and multi-phase optimization for sustainable video synthesis. To achieve high-quality long-horizon driving video generation, we introduce Hierarchical Temporal Feature Transfer (HTFT) and a novel multi-stage training strategy. HTFT improves temporal consistency across video frames throughout generation by modeling the temporal and denoising processes separately and transferring denoising features between frames. The multi-stage training strategy divides training into three stages, using model decoupling and simulation of the auto-regressive inference process to accelerate convergence and reduce error accumulation. Experiments on the nuScenes dataset show that STAGE significantly surpasses existing methods on the long-horizon driving video generation task. We also explored STAGE's ability to generate unlimited-length driving videos: we generated 600 frames of high-quality driving video on nuScenes, far exceeding the maximum length achievable by existing methods.

View on arXiv
@article{wang2025_2506.13138,
  title={STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation},
  author={Jiamin Wang and Yichen Yao and Xiang Feng and Hang Wu and Yaming Wang and Qingqiu Huang and Yuexin Ma and Xinge Zhu},
  journal={arXiv preprint arXiv:2506.13138},
  year={2025}
}