Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy

12 December 2025

Kechun Xu

Zhenjie Zhu

Anzhe Chen

Shuqi Zhao

Qing Huang

Yifei Yang

Haojian Lu

Rong Xiong

Masayoshi Tomizuka

Yue Wang


Main: 17 Pages

17 Figures

Bibliography: 2 Pages

Appendix: 1 Page

Abstract

The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. While co-training with external reasoning data helps, it demands experience-driven tuning and adds data-related overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and causes it to forget language. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visual-action prior, which supports seeing-to-act, and a language-conditioned likelihood, which enables prompt-to-specify. This factorization inherently preserves generalization and promotes instruction following. We further incorporate pre- and post-contact phases to better leverage pre-trained foundation models. An information-theoretic analysis formally validates the method's effectiveness in mitigating shortcut learning. Extensive experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods. The project page is available at: this https URL.
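
The decomposition named in the abstract can be read as a direct application of Bayes' rule. The sketch below uses notation assumed here for illustration, not taken from the paper: $a$ is the action, $v$ the visual observation, and $\ell$ the language instruction.

```latex
% Assumed notation: a = action, v = visual observation, \ell = language instruction.
\[
\pi(a \mid v, \ell)
  \;=\; \frac{p(a \mid v)\, p(\ell \mid v, a)}{p(\ell \mid v)}
  \;\propto\;
  \underbrace{p(a \mid v)}_{\text{visual-action prior}}
  \;\cdot\;
  \underbrace{p(\ell \mid v, a)}_{\text{language-conditioned likelihood}}
\]
```

Under this reading, the prior proposes actions from vision alone (seeing-to-act), and the likelihood reweights them by how well they match the instruction (prompt-to-specify). The denominator $p(\ell \mid v)$ does not depend on $a$, so it acts only as a normalizer. How the paper parameterizes and trains each factor may differ from this sketch.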
