$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

22 April 2025
Physical Intelligence
Kevin Black
Noah Brown
James Darpinian
Karan Dhabalia
Danny Driess
Adnan Esmail
Michael Equi
Chelsea Finn
Niccolo Fusai
Manuel Y. Galliker
Dibya Ghosh
Lachy Groom
Karol Hausman
Brian Ichter
Szymon Jakubczak
Tim Jones
Liyiming Ke
Devin LeBlanc
Sergey Levine
Adrian Li-Bell
Mohith Mothukuri
Suraj Nair
Karl Pertsch
Allen Z. Ren
Lucy Xiaoyang Shi
Laura M. Smith
Jost Tobias Springenberg
Kyle Stachowicz
James Tanner
Q. Vuong
Homer Walke
Anna Walling
Haohuan Wang
Lili Yu
Ury Zhilinsky
    LM&Ro
    VLM
Abstract

In order for robots to be useful, they must perform practically relevant tasks in the real world, outside of the lab. While vision-language-action (VLA) models have demonstrated impressive results for end-to-end robot control, it remains an open question how far such models can generalize in the wild. We describe $\pi_{0.5}$, a new model based on $\pi_0$ that uses co-training on heterogeneous tasks to enable broad generalization. $\pi_{0.5}$ uses data from multiple robots, high-level semantic prediction, web data, and other sources to enable broadly generalizable real-world robotic manipulation. Our system uses a combination of co-training and hybrid multi-modal examples that combine image observations, language commands, object detections, semantic subtask prediction, and low-level actions. Our experiments show that this kind of knowledge transfer is essential for effective generalization, and we demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills, such as cleaning a kitchen or bedroom, in entirely new homes.
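The abstract describes the system only at a high level. As a rough illustration of what a "hybrid multi-modal example" and weighted co-training over heterogeneous data sources could look like, the Python sketch below is a minimal, hypothetical rendering: the class, field, and function names (HybridExample, sample_cotraining_batch, the mixture weights) are assumptions for illustration, not the authors' actual data format or training code.

# Hypothetical sketch of a hybrid multi-modal example and source-weighted
# co-training, loosely following the abstract; all names are illustrative.
from dataclasses import dataclass
from typing import Optional
import random

import numpy as np


@dataclass
class HybridExample:
    # Every example carries observations and a command; the remaining targets
    # (detections, semantic subtask, low-level actions) may or may not be
    # present depending on the data source.
    images: np.ndarray                        # e.g. (num_cameras, H, W, 3)
    language_command: str                     # e.g. "clean the kitchen"
    object_detections: Optional[list] = None  # e.g. [("cup", (x0, y0, x1, y1)), ...]
    semantic_subtask: Optional[str] = None    # e.g. "pick up the sponge"
    actions: Optional[np.ndarray] = None      # e.g. (horizon, action_dim)


def sample_cotraining_batch(sources, weights, batch_size):
    # Draw a batch by first picking a data source (multi-robot data, web data,
    # semantic subtask prediction, ...) with fixed mixture weights, then
    # sampling an example from that source uniformly at random.
    names = list(sources)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(batch_size):
        source = random.choices(names, weights=probs, k=1)[0]
        batch.append(random.choice(sources[source]))
    return batch

In the actual model, examples of this kind would presumably be tokenized jointly so that a single backbone predicts detections, subtasks, and actions; the sketch only illustrates the data-mixing side of co-training.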

@article{intelligence2025_2504.16054,
  title={$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization},
  author={Physical Intelligence and Kevin Black and Noah Brown and James Darpinian and Karan Dhabalia and Danny Driess and Adnan Esmail and Michael Equi and Chelsea Finn and Niccolo Fusai and Manuel Y. Galliker and Dibya Ghosh and Lachy Groom and Karol Hausman and Brian Ichter and Szymon Jakubczak and Tim Jones and Liyiming Ke and Devin LeBlanc and Sergey Levine and Adrian Li-Bell and Mohith Mothukuri and Suraj Nair and Karl Pertsch and Allen Z. Ren and Lucy Xiaoyang Shi and Laura Smith and Jost Tobias Springenberg and Kyle Stachowicz and James Tanner and Quan Vuong and Homer Walke and Anna Walling and Haohuan Wang and Lili Yu and Ury Zhilinsky},
  journal={arXiv preprint arXiv:2504.16054},
  year={2025}
}