JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

Tianle Zhang
Zhihao Yuan
Dafeng Chi
Peidong Liu
Dongwei Li
Kejun Hu
Likui Zhang
Junnan Nie
Ziming Wei
Zengjue Chen
Yili Tang
Jiayi Li
Zhiyuan Xiang
Mingyang Li
Tianci Luo
Hanwen Wan
Ao Li
Linbo Zhai
Zhihao Zhan
Xiaodong Bai
Jiakun Cai
Peng Cao
Kangliang Chen
Siang Chen
Yixiang Dai
Shuai Di
Yicheng Gong
Chenguang Gui
Yucheng Guo
Peng Hao
Qingrong He
Haoyang Huang
Kunrui Huang
Zhixuan Huang
Shibo Jin
Yixiang Jin
Anson Li
Dongjiang Li
Jiawei Li
Ruodai Li
Yihang Li
Yuzhen Li
Jiaming Liang
Fangsheng Liu
Jing Long
Mingxi Luo
Xing Pan
Hui Shen
Xiaomeng Tian
Daming Wang
Song Wang
Junwu Xiong
Hang Xu
Wanting Xu
Zhengcheng Yu
He Zhang
Jiyao Zhang
Lin Zhao
Chen Zhou
Nan Duan
Yuzheng Zhuang
Liang Lin
Main: 14 pages · 8 figures · 9 tables · Bibliography: 3 pages · Appendix: 5 pages
Abstract

Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, while substantial differences across robot embodiments impede effective transfer of behavioral knowledge. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA employs a multi-source, multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. By training on this heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods on both simulation and real-world benchmarks, especially on diverse tasks that demand generalization.
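To make the idea of explicit action-space unification more concrete, the sketch below shows one plausible way to map heterogeneous data sources (egocentric human videos and real-robot logs) into a single shared action representation. This is a minimal illustrative example only; the class names, adapter functions, and action dimensions (`UnifiedAction`, `from_human_video`, `from_robot_log`, an end-effector delta pose plus gripper openness) are assumptions for exposition and are not taken from the JoyAI-RA paper.

```python
# Minimal, hypothetical sketch of action-space unification across embodiments.
# All names and dimensions here are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
import numpy as np

@dataclass
class UnifiedAction:
    """Shared action representation: end-effector delta pose + gripper openness."""
    delta_position: np.ndarray  # (3,) Cartesian displacement of the end effector
    delta_rotation: np.ndarray  # (3,) axis-angle rotation increment
    gripper: float              # 0.0 = closed, 1.0 = open

def from_human_video(wrist_traj: np.ndarray, finger_gap: np.ndarray, t: int) -> UnifiedAction:
    """Map an egocentric human-hand trajectory into the unified space (hypothetical)."""
    delta_pos = wrist_traj[t + 1] - wrist_traj[t]
    openness = float(np.clip(finger_gap[t] / finger_gap.max(), 0.0, 1.0))
    return UnifiedAction(delta_pos, np.zeros(3), openness)

def from_robot_log(eef_poses: np.ndarray, gripper_cmd: np.ndarray, t: int) -> UnifiedAction:
    """Map a real-robot end-effector log into the same unified space (hypothetical)."""
    delta_pos = eef_poses[t + 1, :3] - eef_poses[t, :3]
    delta_rot = eef_poses[t + 1, 3:6] - eef_poses[t, 3:6]
    return UnifiedAction(delta_pos, delta_rot, float(gripper_cmd[t]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wrist = rng.normal(size=(10, 3)); gap = rng.uniform(0.01, 0.08, size=10)
    eef = rng.normal(size=(10, 6)); grip = rng.integers(0, 2, size=10)
    # Both sources now yield samples in one action space, so a single VLA policy
    # head can be trained on the mixed, heterogeneous data.
    print(from_human_video(wrist, gap, 0))
    print(from_robot_log(eef, grip, 0))
```

Under this kind of scheme, trajectories from human videos and robot teleoperation can be mixed in one pretraining corpus, which is the role the abstract attributes to action-space unification in bridging the human-to-robot embodiment gap.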
