HuMoCon: Concept Discovery for Human Motion Understanding

We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract semantically meaningful and generalizable features. HuMoCon addresses two key challenges in motion concept discovery for understanding and reasoning: the lack of explicit multi-modality feature alignment, and the loss of high-frequency information in masked autoencoding frameworks. Our approach integrates a feature alignment strategy that leverages video for contextual understanding and motion for fine-grained interaction modeling, together with a velocity reconstruction mechanism that enhances high-frequency feature expression and mitigates temporal over-smoothing. Comprehensive experiments on standard benchmarks demonstrate that HuMoCon enables effective motion concept discovery and significantly outperforms state-of-the-art methods in training large models for human motion understanding. We will open-source the associated code with our paper.
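To illustrate why a velocity term can counteract temporal over-smoothing, here is a minimal sketch. It is not the formulation used in HuMoCon, which the abstract does not specify. It assumes velocity is a first-order temporal difference and that the velocity error is simply added to the position error. The names `velocity`, `mse`, `reconstruction_loss`, and `velocity_weight` are hypothetical.

```python
def velocity(seq):
    """First-order temporal difference: v[t] = x[t+1] - x[t] (an assumed definition)."""
    return [b - a for a, b in zip(seq, seq[1:])]


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("sequences must have equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def reconstruction_loss(pred, target, velocity_weight=1.0):
    """Position error plus a weighted velocity error (hypothetical combination)."""
    position_term = mse(pred, target)
    velocity_term = mse(velocity(pred), velocity(target))
    return position_term + velocity_weight * velocity_term


# A toy 1-D joint trajectory that alternates rapidly (high-frequency motion).
target = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
# An over-smoothed reconstruction that collapses to the temporal mean.
smoothed = [0.5] * 6

position_error = mse(smoothed, target)                      # 0.25
velocity_error = mse(velocity(smoothed), velocity(target))  # 1.0
total = reconstruction_loss(smoothed, target)               # 1.25
```

In this toy case the over-smoothed output has a modest position error (0.25) but a velocity error four times larger (1.0), so adding the velocity term penalizes the loss of rapid changes much more strongly than a position-only objective would.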
@article{fang2025_2505.20920,
  title={HuMoCon: Concept Discovery for Human Motion Understanding},
  author={Qihang Fang and Chengcheng Tang and Bugra Tekin and Shugao Ma and Yanchao Yang},
  journal={arXiv preprint arXiv:2505.20920},
  year={2025}
}