
GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation

Abstract

Learning manipulation skills from human demonstration videos offers a promising path toward generalizable and interpretable robotic intelligence, particularly through the lens of actionable affordances. However, transferring such knowledge remains challenging due to: 1) a lack of large-scale datasets with precise affordance annotations, and 2) insufficient exploration of affordances in diverse manipulation contexts. To address these gaps, we introduce HOVA-500K, a large-scale, affordance-annotated dataset comprising 500,000 images across 1,726 object categories and 675 actions. We also release a standardized benchmarking suite for multi-modal affordance reasoning. Built upon HOVA-500K, we present GLOVER++, a global-to-local affordance training framework that effectively transfers actionable affordance knowledge from human demonstrations to downstream open-vocabulary reasoning tasks. GLOVER++ achieves state-of-the-art results on the HOVA-500K benchmark and demonstrates strong generalization across diverse downstream robotic manipulation tasks. By explicitly modeling actionable affordances, GLOVER++ facilitates robust transfer across scenes, modalities, and tasks. We hope that HOVA-500K and the GLOVER++ framework will serve as valuable resources for bridging the gap between human demonstrations and robotic manipulation capabilities.
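To make the input/output contract of open-vocabulary affordance reasoning concrete, the sketch below shows the kind of interface the abstract describes: an RGB image plus a free-form action/object query mapped to a per-pixel affordance heatmap, from which a manipulator can pick an interaction point. This is a minimal, self-contained illustration; the class and method names (`AffordancePredictor`, `predict_heatmap`) are hypothetical placeholders and do not come from the GLOVER++ codebase, and the heatmap here is a dummy Gaussian so the example runs end to end.

```python
import numpy as np

# Hypothetical interface sketch: AffordancePredictor and predict_heatmap are
# illustrative names, not part of the actual GLOVER++ release.
class AffordancePredictor:
    """Toy stand-in for an open-vocabulary affordance model: given an RGB
    image and a free-form action/object query, return a per-pixel affordance
    heatmap in [0, 1]."""

    def predict_heatmap(self, image: np.ndarray, query: str) -> np.ndarray:
        h, w = image.shape[:2]
        # Placeholder: a real model would fuse visual and language features;
        # here we return a centered Gaussian so the example is runnable.
        ys, xs = np.mgrid[0:h, 0:w]
        cy, cx = h / 2, w / 2
        heatmap = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (0.1 * h * w))
        return heatmap / heatmap.max()

# Usage: query an image for where to "grasp the mug handle", then take the
# highest-scoring pixel as a candidate interaction point for a manipulator.
image = np.zeros((480, 640, 3), dtype=np.uint8)
model = AffordancePredictor()
heatmap = model.predict_heatmap(image, "grasp the mug handle")
point = np.unravel_index(np.argmax(heatmap), heatmap.shape)
print(f"candidate interaction point (row, col): {point}")
```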

View on arXiv: https://arxiv.org/abs/2505.11865
@article{ma2025_2505.11865,
  title={GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation},
  author={Teli Ma and Jia Zheng and Zifan Wang and Ziyao Gao and Jiaming Zhou and Junwei Liang},
  journal={arXiv preprint arXiv:2505.11865},
  year={2025}
}