MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving

Lingjun Zhang
Yujian Yuan
Changjie Wu
Xinyuan Chang
Xin Cai
Shuang Zeng
Linzhe Shi
Sijin Wang
Hang Zhang
Mu Xu
Main: 13 pages, 12 figures, 11 tables; Bibliography: 5 pages
Abstract

Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. However, Chain-of-Thought (CoT), the reasoning strategy most widely used with VLMs, faces critical challenges. Existing textual CoT leaves a large gap between the semantic space of text and the physical space of trajectories. Although a recent approach replaces text with future images as the CoT process, it lacks clear planning-oriented guidance for generating images with accurate scene evolution. To address these challenges, we propose MindDriver, a progressive multimodal reasoning framework that enables a VLM to imitate human-like progressive thinking for autonomous driving: semantic understanding, semantic-to-physical imagination, and trajectory planning in physical space. To obtain aligned reasoning processes for MindDriver, we develop a feedback-guided automatic annotation pipeline that generates aligned multimodal reasoning training data. We further introduce a progressive reinforcement fine-tuning method that optimizes this alignment through progressive, high-level reward-based learning. MindDriver demonstrates superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluations. Code is available at this https URL.
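The three-stage reasoning loop described above can be pictured as follows. This is a minimal, hypothetical sketch of the progressive pipeline (textual understanding, imagined future image, physical-space trajectory); the class, method, and variable names are illustrative placeholders rather than the authors' actual interfaces, and the stub bodies stand in for the real VLM and image-generation components.

```python
# Hypothetical sketch of a progressive multimodal reasoning loop in the
# spirit of MindDriver. Names (ProgressiveReasoner, understand, imagine,
# plan) are our own placeholders, not the paper's API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReasoningState:
    scene_description: str                  # stage 1: semantic understanding (text)
    imagined_future: bytes                  # stage 2: generated future scene image
    trajectory: List[Tuple[float, float]]   # stage 3: planned (x, y) waypoints


class ProgressiveReasoner:
    """Mimics human-like progressive thinking: text -> image -> trajectory."""

    def understand(self, camera_frames: List[bytes]) -> str:
        # Stage 1: a VLM summarizes agents, lanes, and intent in text.
        return "ego approaching intersection; pedestrian crossing; light is green"

    def imagine(self, description: str) -> bytes:
        # Stage 2: semantic-to-physical imagination -- render how the scene
        # should evolve, conditioned on a planning-oriented objective.
        return b"<predicted future frame>"

    def plan(self, future_image: bytes) -> List[Tuple[float, float]]:
        # Stage 3: decode waypoints in physical space from the imagined scene.
        return [(0.0, 0.0), (0.5, 2.0), (1.0, 4.0)]

    def drive(self, camera_frames: List[bytes]) -> ReasoningState:
        desc = self.understand(camera_frames)
        future = self.imagine(desc)
        return ReasoningState(desc, future, self.plan(future))


if __name__ == "__main__":
    state = ProgressiveReasoner().drive([b"<front camera frame>"])
    print(state.trajectory)
```

The point of the staging is that each step hands the next one a representation closer to physical space, which is what the paper's feedback-guided annotation and progressive reinforcement fine-tuning are designed to keep aligned.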
