LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction

Vision-language-action (VLA) models have demonstrated strong semantic understanding and zero-shot generalization, yet most existing systems assume an accurate low-level controller with a hand-crafted action "vocabulary" such as end-effector pose or root velocity. This assumption confines prior work to quasi-static tasks and precludes the agile, whole-body behaviors required by humanoid whole-body control (WBC) tasks. To close this gap in the literature, we start by introducing the first sim-to-real-ready, vision-language, closed-loop benchmark for humanoid WBC, comprising over 150 tasks from 10 categories. We then propose LeVERB: Latent Vision-Language-Encoded Robot Behavior, a hierarchical latent instruction-following framework for humanoid vision-language WBC, the first of its kind. At the top level, a vision-language policy learns a latent action vocabulary from synthetically rendered kinematic demonstrations; at the low level, a reinforcement-learned WBC policy consumes these latent verbs to generate dynamics-level commands. In our benchmark, LeVERB can zero-shot attain an 80% success rate on simple visual navigation tasks, and a 58.5% success rate overall, outperforming a naive hierarchical whole-body VLA implementation by 7.8 times.
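To make the hierarchical interface concrete, below is a minimal, illustrative sketch of a two-rate control loop in which a high-level vision-language policy emits a fixed-size latent "verb" and a low-level whole-body policy consumes it at every control step. All class names, dimensions (LATENT_DIM, PROPRIO_DIM, ACTION_DIM), and update rates are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration.
LATENT_DIM = 32    # size of the latent "verb" vector
PROPRIO_DIM = 45   # proprioceptive observation size for the WBC policy
ACTION_DIM = 23    # dynamics-level command size (e.g., joint targets)


class VisionLanguagePolicy:
    """Stand-in for the high-level policy: maps an image and a language
    instruction to a latent verb. A real system would use a trained VLA;
    here we return a deterministic pseudo-random vector."""

    def __call__(self, image: np.ndarray, instruction: str) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(instruction)) % (2**32))
        return rng.standard_normal(LATENT_DIM).astype(np.float32)


class WholeBodyControlPolicy:
    """Stand-in for the low-level RL policy: maps proprioception plus the
    latent verb to bounded dynamics-level commands at a high control rate."""

    def __init__(self) -> None:
        rng = np.random.default_rng(0)
        self.w = 0.01 * rng.standard_normal((ACTION_DIM, PROPRIO_DIM + LATENT_DIM))

    def __call__(self, proprio: np.ndarray, latent_verb: np.ndarray) -> np.ndarray:
        x = np.concatenate([proprio, latent_verb])
        return np.tanh(self.w @ x)


def control_loop(image: np.ndarray, instruction: str, steps_per_verb: int = 50) -> np.ndarray:
    """Two-rate loop: the vision-language policy refreshes the latent verb
    infrequently, while the WBC policy runs every control step."""
    vla = VisionLanguagePolicy()
    wbc = WholeBodyControlPolicy()
    latent_verb = vla(image, instruction)            # slow, semantic level
    proprio = np.zeros(PROPRIO_DIM, dtype=np.float32)
    action = np.zeros(ACTION_DIM, dtype=np.float32)
    for _ in range(steps_per_verb):                   # fast, dynamics level
        action = wbc(proprio, latent_verb)
        # apply `action` to the robot/simulator and read back proprio here
    return action


if __name__ == "__main__":
    dummy_image = np.zeros((224, 224, 3), dtype=np.uint8)
    print(control_loop(dummy_image, "walk to the red chair").shape)
```

The key design point the sketch tries to convey is the interface: the only coupling between the two levels is the latent verb vector, so the high-level policy never needs to reason about joint-level dynamics and the low-level policy never needs to parse vision or language.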
@article{xue2025_2506.13751,
  title={LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction},
  author={Haoru Xue and Xiaoyu Huang and Dantong Niu and Qiayuan Liao and Thomas Kragerud and Jan Tommy Gravdahl and Xue Bin Peng and Guanya Shi and Trevor Darrell and Koushil Sreenath and Shankar Sastry},
  journal={arXiv preprint arXiv:2506.13751},
  year={2025}
}