
Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning

Main: 8 pages, 6 figures, 5 tables; Bibliography: 4 pages; Appendix: 3 pages
Abstract

Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present Inter-token Contrast (ICon), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. Project website: this https URL
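The abstract does not reproduce the exact loss, so the following is only a minimal PyTorch-style sketch of a token-level contrastive auxiliary objective in the spirit described: agent tokens are pulled together and pushed away from environment tokens. The function name inter_token_contrast, the boolean agent/environment patch mask, and the weighting scheme are illustrative assumptions, not the paper's formulation.

# Hypothetical sketch (assumptions noted above), not the paper's exact loss.
import torch
import torch.nn.functional as F

def inter_token_contrast(tokens: torch.Tensor,
                         agent_mask: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Supervised-contrastive-style loss over ViT patch tokens.

    tokens:     (B, N, D) patch-token features from the ViT encoder.
    agent_mask: (B, N) boolean mask, True where a patch shows the robot body.
    """
    B, N, D = tokens.shape
    z = F.normalize(tokens, dim=-1)                          # cosine-similarity space
    sim = torch.einsum("bnd,bmd->bnm", z, z) / temperature   # (B, N, N) pairwise similarities

    # Positive pairs: tokens with the same label (agent-agent or env-env), excluding self-pairs.
    same = agent_mask.unsqueeze(2) == agent_mask.unsqueeze(1)
    eye = torch.eye(N, dtype=torch.bool, device=tokens.device)
    pos = same & ~eye

    # Log-softmax over all other tokens, averaged over the positive pairs of each token.
    logits = sim.masked_fill(eye, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    pos_count = pos.sum(dim=-1).clamp(min=1)
    loss = -(log_prob * pos).sum(dim=-1) / pos_count
    return loss.mean()

# Illustrative auxiliary use inside a behavior-cloning step:
#   total_loss = bc_loss(policy(obs), actions) + lambda_icon * inter_token_contrast(tokens, mask)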

@article{wang2025_2505.18487,
  title={Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning},
  author={Junlin Wang and Zhiyun Lin},
  journal={arXiv preprint arXiv:2505.18487},
  year={2025}
}
