
Vision in Action: Learning Active Perception from Human Demonstrations

Haoyu Xiong
Xiaomeng Xu
Jimmy Wu
Yifan Hou
Jeannette Bohg
Shuran Song
9 pages (main) + 5 pages (bibliography), 7 figures
Abstract

We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space between the robot and the human operator. To mitigate VR motion sickness caused by latency in the robot's physical movements, the interface uses an intermediate 3D scene representation, enabling real-time view rendering on the operator side while asynchronously updating the scene with the robot's latest observations. Together, these design elements enable the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.
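The key idea behind the teleoperation interface is to decouple the operator's view rendering from the robot's (latency-bound) observation stream: the VR headset always renders from the most recently cached 3D scene at display rate, while a separate process updates that cache whenever a new robot observation arrives. The sketch below illustrates this asynchronous structure; it is not the authors' implementation, and all class, function, and callback names (SharedScene, get_robot_observation, get_headset_pose, render_view) are hypothetical placeholders.

```python
import threading
import time


class SharedScene:
    """Holds the most recent 3D scene representation (e.g., a colored point cloud)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._points = None  # latest scene from the robot's head-mounted camera

    def update(self, points):
        with self._lock:
            self._points = points

    def snapshot(self):
        with self._lock:
            return self._points


def scene_update_loop(scene, get_robot_observation):
    # Runs at the slower, latency-bound rate of the robot's camera stream.
    while True:
        points = get_robot_observation()  # blocking call; returns a 3D observation
        scene.update(points)


def render_loop(scene, get_headset_pose, render_view, hz=72.0):
    # Runs at VR display rate, independent of robot latency, so the rendered
    # view always tracks the operator's head motion (mitigating motion sickness).
    period = 1.0 / hz
    while True:
        pose = get_headset_pose()      # current 6-DoF head pose from the HMD
        points = scene.snapshot()      # latest cached scene (possibly slightly stale)
        if points is not None:
            render_view(points, pose)  # re-project the cached scene at this pose
        time.sleep(period)


if __name__ == "__main__":
    scene = SharedScene()
    # The two loops would be supplied with real device callbacks in practice:
    # threading.Thread(target=scene_update_loop, args=(scene, get_robot_observation), daemon=True).start()
    # render_loop(scene, get_headset_pose, render_view)
```

Under this design, the operator's viewpoint changes are served locally and immediately, while the robot's 6-DoF neck asynchronously catches up to the commanded head pose and refreshes the shared scene.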

@article{xiong2025_2506.15666,
  title={Vision in Action: Learning Active Perception from Human Demonstrations},
  author={Haoyu Xiong and Xiaomeng Xu and Jimmy Wu and Yifan Hou and Jeannette Bohg and Shuran Song},
  journal={arXiv preprint arXiv:2506.15666},
  year={2025}
}