DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models

Main: 9 pages; Bibliography: 4 pages; Appendix: 12 pages; 5 figures; 18 tables
Abstract

Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by users of production-level autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from users' actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that state-of-the-art vision-language models (VLMs) require both vision and language guidance for accurate action prediction: on average, accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. Our evaluation supports precise identification of model bottlenecks with robust and consistent results, thus providing new insights and a rigorous foundation for advancing human-like decisions in autonomous driving.
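As an illustration of the ablation reported above, the sketch below shows one way the vision/language comparison could be computed: action-prediction accuracy is measured with both inputs, with each one withheld, and with neither, and the drops are reported. This is a hypothetical sketch, not the authors' evaluation code; the Sample fields and the model.predict_action interface are assumptions.

from dataclasses import dataclass

@dataclass
class Sample:
    image: object        # camera input for the driving scenario (assumed field)
    language: str        # language-task context, e.g., navigation information (assumed field)
    action_label: str    # high-level discrete action taken by the human driver (assumed field)

def accuracy(model, samples, use_vision=True, use_language=True):
    """Fraction of QA pairs whose predicted action matches the human action label."""
    correct = 0
    for s in samples:
        pred = model.predict_action(  # hypothetical model interface
            image=s.image if use_vision else None,
            language=s.language if use_language else None,
        )
        correct += int(pred == s.action_label)
    return correct / len(samples)

def ablation_report(model, samples):
    """Accuracy drop when vision, language, or both inputs are withheld."""
    full = accuracy(model, samples)
    return {
        "full": full,
        "drop_without_vision": full - accuracy(model, samples, use_vision=False),
        "drop_without_language": full - accuracy(model, samples, use_language=False),
        "drop_without_either": full - accuracy(model, samples, use_vision=False, use_language=False),
    }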

@article{hao2025_2506.05667,
  title={DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models},
  author={Yuhan Hao and Zhengning Li and Lei Sun and Weilong Wang and Naixin Yi and Sheng Song and Caihong Qin and Mofan Zhou and Yifei Zhan and Peng Jia and Xianpeng Lang},
  journal={arXiv preprint arXiv:2506.05667},
  year={2025}
}