
HoloBrain-0 Technical Report

Xuewu Lin
Tianwei Lin
Yun Du
Hongyu Xie
Yiwei Jin
Jiawei Li
Shijie Wu
Qingze Wang
Mengdi Li
Mengao Zhao
Ziang Li
Chaodong Huang
Hongzhe Bi
Lichao Huang
Zhizhong Su
Main: 17 pages · 13 figures · 14 tables · Bibliography: 5 pages · Appendix: 10 pages
Abstract

In this work, we introduce HoloBrain-0, a comprehensive Vision-Language-Action (VLA) framework that bridges the gap between foundation model research and reliable real-world robot deployment. The core of our system is a novel VLA architecture that explicitly incorporates robot embodiment priors, including multi-view camera parameters and kinematic descriptions (URDF), to enhance 3D spatial reasoning and support diverse embodiments. We validate this design through a scalable "pre-train then post-train" paradigm, achieving state-of-the-art results on simulation benchmarks such as RoboTwin 2.0, LIBERO, and GenieSim, as well as strong results on challenging long-horizon real-world manipulation tasks. Notably, our efficient 0.2B-parameter variant rivals significantly larger baselines, enabling low-latency on-device deployment. To further accelerate research and practical adoption, we fully open-source the entire HoloBrain ecosystem, which includes: (1) powerful pre-trained VLA foundation models; (2) post-trained checkpoints for multiple simulation suites and real-world tasks; and (3) RoboOrchard, a full-stack VLA infrastructure for data curation, model training, and deployment. Together with standardized data collection protocols, this release provides the community with a complete, reproducible path toward high-performance robotic manipulation.
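The abstract states that the architecture conditions on multi-view camera parameters and URDF kinematic descriptions; the sketch below illustrates one generic way such embodiment priors could be tokenized for a transformer-based policy. It is a minimal sketch, not HoloBrain-0's actual implementation: the module name `EmbodimentPriorEncoder`, the feature layouts, and the dimensions are all illustrative assumptions.

```python
# A minimal sketch (not HoloBrain-0's actual design) of injecting robot
# embodiment priors -- per-view camera parameters and URDF-derived joint
# descriptors -- into a VLA backbone as extra conditioning tokens.
import torch
import torch.nn as nn


class EmbodimentPriorEncoder(nn.Module):
    """Encodes camera parameters and kinematic descriptors into tokens."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        # Each camera view is summarized by a flattened 3x3 intrinsic
        # matrix (9) plus a 4x4 world-to-camera extrinsic matrix (16).
        self.camera_proj = nn.Linear(25, d_model)
        # Each URDF joint is summarized by an assumed feature vector:
        # joint-type one-hot (3), axis (3), origin xyz+rpy (6),
        # position limits (2) -> 14 values.
        self.joint_proj = nn.Linear(14, d_model)

    def forward(self, cam_params: torch.Tensor, joint_feats: torch.Tensor):
        # cam_params:  (B, num_views, 25)
        # joint_feats: (B, num_joints, 14)
        cam_tokens = self.camera_proj(cam_params)    # (B, num_views, d)
        joint_tokens = self.joint_proj(joint_feats)  # (B, num_joints, d)
        # Concatenate along the token axis so a transformer backbone can
        # attend to embodiment priors alongside vision-language tokens.
        return torch.cat([cam_tokens, joint_tokens], dim=1)


if __name__ == "__main__":
    enc = EmbodimentPriorEncoder(d_model=256)
    cams = torch.randn(2, 3, 25)    # batch of 2, 3 camera views
    joints = torch.randn(2, 7, 14)  # a 7-DoF arm described by the URDF
    print(enc(cams, joints).shape)  # torch.Size([2, 10, 256])
```

Encoding the priors as tokens, rather than baking them into the weights, is one plausible route to the cross-embodiment support the abstract claims, since a new robot only changes the conditioning inputs.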
