Efficient Egocentric Action Recognition with Multimodal Data

The increasing availability of wearable XR devices opens new perspectives for Egocentric Action Recognition (EAR) systems, which can provide deeper human understanding and situational awareness. However, deploying real-time algorithms on these devices can be challenging due to the inherent trade-offs between portability, battery life, and computational resources. In this work, we systematically analyze the impact of sampling frequency across different input modalities (RGB video and 3D hand pose) on egocentric action recognition performance and CPU usage. By exploring a range of configurations, we provide a comprehensive characterization of the trade-offs between accuracy and computational efficiency. Our findings reveal that reducing the sampling rate of RGB frames, when complemented with higher-frequency 3D hand pose input, can preserve high accuracy while significantly lowering CPU demands. Notably, we observe up to a 3x reduction in CPU usage with minimal to no loss in recognition performance. This highlights the potential of multimodal input strategies as a viable approach to achieving efficient, real-time EAR on XR devices.
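To make the asymmetric-rate idea concrete, below is a minimal PyTorch sketch of a two-stream model that consumes sparsely sampled RGB frames alongside higher-frequency 3D hand pose sequences and fuses them for classification. This is not the authors' implementation; the module choices, feature dimensions, joint layout, and sampling rates are all illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's model) of
# asymmetric-rate multimodal EAR: low-rate RGB + high-rate 3D hand pose.
import torch
import torch.nn as nn

class AsymmetricRateEAR(nn.Module):
    def __init__(self, num_classes: int, rgb_dim: int = 512, pose_dim: int = 128):
        super().__init__()
        # Lightweight per-frame RGB encoder (stand-in for any CNN backbone).
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, rgb_dim),
        )
        # Assumed pose layout: 2 hands x 21 joints x 3 coords = 126 values/sample.
        self.pose_encoder = nn.GRU(input_size=126, hidden_size=pose_dim,
                                   batch_first=True)
        self.classifier = nn.Linear(rgb_dim + pose_dim, num_classes)

    def forward(self, rgb: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        # rgb:  (B, T_rgb, 3, H, W)  sampled at a low rate (e.g. 2 fps)
        # pose: (B, T_pose, 126)     sampled at a higher rate (e.g. 30 fps)
        b, t, c, h, w = rgb.shape
        rgb_feat = self.rgb_encoder(rgb.reshape(b * t, c, h, w))
        rgb_feat = rgb_feat.reshape(b, t, -1).mean(dim=1)  # temporal average
        _, pose_h = self.pose_encoder(pose)                # final GRU state
        fused = torch.cat([rgb_feat, pose_h[-1]], dim=-1)
        return self.classifier(fused)

# Example: a 4-second clip with RGB at 2 fps (8 frames) and pose at 30 fps.
model = AsymmetricRateEAR(num_classes=10)
logits = model(torch.randn(1, 8, 3, 224, 224), torch.randn(1, 120, 126))
print(logits.shape)  # torch.Size([1, 10])
```

The per-frame RGB cost dominates CPU usage, so lowering T_rgb while keeping the cheap pose stream dense is what the paper's reported trade-off exploits.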
@article{calzavara2025_2506.01757,
  title   = {Efficient Egocentric Action Recognition with Multimodal Data},
  author  = {Marco Calzavara and Ard Kastrati and Matteo Macchini and Dushan Vasilevski and Roger Wattenhofer},
  journal = {arXiv preprint arXiv:2506.01757},
  year    = {2025}
}