Estimating the 3D pose of the hand and a potential hand-held object from monocular images is a longstanding challenge. Yet, existing methods are specialized, focusing on either the bare hand or the hand interacting with an object. No existing method can flexibly handle both scenarios, and performance degrades when a method is applied to the scenario it was not designed for. In this paper, we propose UniHOPE, a unified approach for general 3D hand-object pose estimation that flexibly adapts to both scenarios. Technically, we design a grasp-aware feature fusion module to integrate hand-object features, with an object switcher that dynamically controls hand-object pose estimation according to the grasping status. Further, to boost the robustness of hand pose estimation regardless of object presence, we generate realistic de-occluded image pairs to train the model to handle object-induced hand occlusions, and formulate multi-level feature enhancement techniques for learning occlusion-invariant features. Extensive experiments on three commonly-used benchmarks demonstrate UniHOPE's SOTA performance in both hand-only and hand-object scenarios. Code will be released at this https URL.
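To make the grasp-aware fusion idea concrete, below is a minimal sketch (not the authors' released code) of how an object switcher could predict a grasping probability and use it to gate how strongly object features are fused into hand features; all module names, shapes, and the gating scheme are illustrative assumptions.

```python
# Illustrative sketch of a grasp-aware feature fusion block with an object switcher.
# The switcher predicts a grasping probability and gates the object feature branch,
# so hand-only inputs rely mostly on hand features. Names/shapes are assumptions.
import torch
import torch.nn as nn

class GraspAwareFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Object switcher: maps the object feature map to a grasping probability.
        self.switcher = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(dim, 1), nn.Sigmoid(),
        )
        # Simple fusion of the concatenated hand and (gated) object features.
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, hand_feat, obj_feat):
        # hand_feat, obj_feat: (B, dim, H, W) feature maps from the backbone.
        p_grasp = self.switcher(obj_feat)          # (B, 1) grasping probability
        gate = p_grasp.view(-1, 1, 1, 1)           # broadcast over spatial dims
        fused = self.fuse(torch.cat([hand_feat, gate * obj_feat], dim=1))
        return fused, p_grasp

# Usage: when no object is grasped, the switcher output tends toward 0 and the
# object branch contributes little to the fused feature.
hand_feat = torch.randn(2, 256, 8, 8)
obj_feat = torch.randn(2, 256, 8, 8)
fused, p = GraspAwareFusion()(hand_feat, obj_feat)
print(fused.shape, p.shape)  # torch.Size([2, 256, 8, 8]) torch.Size([2, 1])
```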
@article{wang2025_2503.13303,
  title   = {UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation},
  author  = {Yinqiao Wang and Hao Xu and Pheng-Ann Heng and Chi-Wing Fu},
  journal = {arXiv preprint arXiv:2503.13303},
  year    = {2025}
}