
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

William Chen
Jagdeep Singh Bhatia
Catherine Glossop
Nikhil Mathihalli
Ria Doshi
Andy Tang
Danny Driess
Karl Pertsch
Sergey Levine
Main: 8 pages · Appendix: 15 pages · Bibliography: 4 pages · 29 figures · 6 tables
Abstract

Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks.

Website: this http URL
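As a rough illustration of the hierarchy the abstract describes, the sketch below shows a command-conditioned control loop in Python: a high-level VLM emits commands at one of several abstraction levels (subtask, motion, or grounded pixel coordinate), and a steerable low-level VLA consumes the latest command at the control rate. All names here (SteeringCommand, high_level_step, vla.predict, the re-planning interval) are hypothetical assumptions for exposition, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

@dataclass
class SteeringCommand:
    # Abstraction levels mirror those named in the abstract; the schema itself
    # is an assumption, not taken from the paper.
    level: Literal["subtask", "motion", "point"]
    text: str                                 # e.g. "pick up the mug" or "move gripper left"
    pixel: Optional[Tuple[int, int]] = None   # grounded 2D target when level == "point"

def high_level_step(vlm, image, goal: str) -> SteeringCommand:
    """The high-level reasoner (a learned embodied reasoner or an off-the-shelf
    VLM prompted via in-context learning) decomposes the goal into the next
    low-level command. Sketch only; `vlm.query` is hypothetical."""
    reply = vlm.query(image, prompt=f"Goal: {goal}. Emit the next command.")
    return SteeringCommand(level=reply["level"], text=reply["text"],
                           pixel=reply.get("pixel"))

def control_loop(vlm, vla, env, goal: str, horizon: int = 200):
    """Hierarchical rollout: the VLM re-plans at a coarse rate while the
    command-conditioned VLA acts at every control step."""
    obs = env.reset()
    command = high_level_step(vlm, obs["image"], goal)
    for t in range(horizon):
        if t % 20 == 0:  # assumed re-planning interval; not specified in the abstract
            command = high_level_step(vlm, obs["image"], goal)
        action = vla.predict(obs["image"], command)  # steerable, command-conditioned policy
        obs, done = env.step(action)
        if done:
            break
```

The key design point the abstract argues for is visible in the signature of `vla.predict`: the policy is conditioned on a structured command that can be as coarse as a subtask string or as fine as a pixel coordinate, rather than on a single natural-language task instruction.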
