OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Jinghui Lu
Jiayi Guan
Zhijian Huang
Jinlong Li
Guang Li
Lingdong Kong
Yingyan Li
Han Wang
Shaoqing Xu
Yuechen Luo
Fang Li
Chenxu Dang
Junli Wang
Tao Xu
Jing Wu
Jianhua Wu
Xiaoshuai Hao
Wen Zhang
Tianyi Jiang
Lingfeng Zhang
Lei Zhou
Yingbo Tang
Jie Wang
Yinfeng Gao
Xizhou Bu
Haochen Tian
Yihang Qiu
Feiyang Jia
Lin Liu
Yigu Ge
Hanbing Li
Yuannan Shen
Jianwei Cui
Hongwei Xie
Bing Wang
Haiyang Sun
Jingwei Zhao
Jiahui Huang
Pei Liu
Zeyu Zhu
Yuncheng Jiang
Zibin Guo
Chuhong Gong
Hanchao Leng
Kun Ma
Naiyang Wang
Guang Chen
Kuiyuan Yang
Hangjun Ye
Long Chen
Main: 22 pages · 22 figures · 13 tables · Bibliography: 8 pages · Appendix: 19 pages
Abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is because purely linguistic latent representations compress a symbolic abstraction of the world rather than the causal dynamics that actually govern driving. We therefore present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs the text CoT, we introduce a visual world-model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency and providing direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: this https URL
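The latency argument in the abstract can be made concrete with a back-of-the-envelope count of sequential forward passes. This is a hypothetical sketch, not the paper's code: the token counts and function names are illustrative assumptions, and the point is only that latent tokens prefilled in one parallel pass remove the per-reasoning-token decoding cost of explicit CoT.

```python
# Toy latency model (illustrative assumptions, not from the paper):
# each autoregressively generated token costs one sequential forward pass,
# while a prefill of any number of tokens costs a single parallel pass.

def explicit_cot_passes(n_reasoning_tokens: int, n_answer_tokens: int) -> int:
    # Explicit CoT: reasoning tokens and answer tokens are both decoded
    # one at a time.
    return n_reasoning_tokens + n_answer_tokens

def latent_cot_passes(n_latent_tokens: int, n_answer_tokens: int) -> int:
    # Latent CoT as described in the abstract: all latent tokens are
    # prefilled together (one parallel pass), then the answer (e.g. the
    # trajectory) is decoded as usual.
    return 1 + n_answer_tokens

# With, say, 128 explicit reasoning tokens vs. 8 latent tokens and a
# 16-token answer, the sequential-pass count drops from 144 to 17 --
# i.e., latent CoT runs at essentially answer-only latency.
print(explicit_cot_passes(128, 16))  # → 144
print(latent_cot_passes(8, 16))      # → 17
```

Under this model the latent variant's cost is independent of how much reasoning is compressed into the latents, which is why the abstract can claim answer-only speed regardless of reasoning depth.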
