OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Jinghui Lu
Jiayi Guan
Zhijian Huang
Jinlong Li
Guang Li
Lingdong Kong
Yingyan Li
Han Wang
Shaoqing Xu
Yuechen Luo
Fang Li
Chenxu Dang
Junli Wang
Tao Xu
Jing Wu
Jianhua Wu
Xiaoshuai Hao
Wen Zhang
Tianyi Jiang
Lingfeng Zhang
Lei Zhou
Yingbo Tang
Jie Wang
Yinfeng Gao
Xizhou Bu
Haochen Tian
Yihang Qiu
Feiyang Jia
Lin Liu
Yigu Ge
Hanbing Li
Yuannan Shen
Jianwei Cui
Hongwei Xie
Bing Wang
Haiyang Sun
Jingwei Zhao
Jiahui Huang
Pei Liu
Zeyu Zhu
Yuncheng Jiang
Zibin Guo
Chuhong Gong
Hanchao Leng
Kun Ma
Naiyang Wang
Guang Chen
Kuiyuan Yang
Hangjun Ye
Long Chen
Main: 22 pages · 22 figures · 13 tables · Bibliography: 8 pages · Appendix: 19 pages
Abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is because purely linguistic latent representations compress a symbolic abstraction of the world rather than the causal dynamics that actually govern driving. We therefore present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs the text CoT, we introduce a visual world-model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency and providing direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: this https URL
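The latency argument in the abstract can be made concrete with a back-of-the-envelope count of sequential forward passes. This is a hypothetical sketch, not the paper's code: the token counts and function names are illustrative assumptions, and the point is only that latent tokens prefilled in one parallel pass remove the per-reasoning-token decoding cost of explicit CoT.

```python
# Toy latency model (illustrative assumptions, not from the paper):
# each autoregressively generated token costs one sequential forward pass,
# while a prefill of any number of tokens costs a single parallel pass.

def explicit_cot_passes(n_reasoning_tokens: int, n_answer_tokens: int) -> int:
    # Explicit CoT: reasoning tokens and answer tokens are both decoded
    # one at a time.
    return n_reasoning_tokens + n_answer_tokens

def latent_cot_passes(n_latent_tokens: int, n_answer_tokens: int) -> int:
    # Latent CoT as described in the abstract: all latent tokens are
    # prefilled together (one parallel pass), then the answer (e.g. the
    # trajectory) is decoded as usual.
    return 1 + n_answer_tokens

# With, say, 128 explicit reasoning tokens vs. 8 latent tokens and a
# 16-token answer, the sequential-pass count drops from 144 to 17 --
# i.e., latent CoT runs at essentially answer-only latency.
print(explicit_cot_passes(128, 16))  # → 144
print(latent_cot_passes(8, 16))      # → 17
```

Under this model the latent variant's cost is independent of how much reasoning is compressed into the latents, which is why the abstract can claim answer-only speed regardless of reasoning depth.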
