Efficient Egocentric Action Recognition with Multimodal Data

The increasing availability of wearable XR devices opens new perspectives for Egocentric Action Recognition (EAR) systems, which can provide deeper human understanding and situational awareness. However, deploying real-time algorithms on these devices can be challenging due to the inherent trade-offs between portability, battery life, and computational resources. In this work, we systematically analyze the impact of sampling frequency across different input modalities (RGB video and 3D hand pose) on egocentric action recognition performance and CPU usage. By exploring a range of configurations, we provide a comprehensive characterization of the trade-offs between accuracy and computational efficiency. Our findings reveal that reducing the sampling rate of RGB frames, when complemented with higher-frequency 3D hand pose input, can preserve high accuracy while significantly lowering CPU demands. Notably, we observe up to a 3x reduction in CPU usage with minimal to no loss in recognition performance. This highlights the potential of multimodal input strategies as a viable approach to achieving efficient, real-time EAR on XR devices.
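To make the asymmetric-rate idea concrete, below is a minimal PyTorch sketch of a two-stream model that consumes sparsely sampled RGB frames alongside higher-frequency 3D hand pose sequences and fuses them for classification. This is not the authors' implementation; the module choices, feature dimensions, joint layout, and sampling rates are all illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's model) of
# asymmetric-rate multimodal EAR: low-rate RGB + high-rate 3D hand pose.
import torch
import torch.nn as nn

class AsymmetricRateEAR(nn.Module):
    def __init__(self, num_classes: int, rgb_dim: int = 512, pose_dim: int = 128):
        super().__init__()
        # Lightweight per-frame RGB encoder (stand-in for any CNN backbone).
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, rgb_dim),
        )
        # Assumed pose layout: 2 hands x 21 joints x 3 coords = 126 values/sample.
        self.pose_encoder = nn.GRU(input_size=126, hidden_size=pose_dim,
                                   batch_first=True)
        self.classifier = nn.Linear(rgb_dim + pose_dim, num_classes)

    def forward(self, rgb: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        # rgb:  (B, T_rgb, 3, H, W)  sampled at a low rate (e.g. 2 fps)
        # pose: (B, T_pose, 126)     sampled at a higher rate (e.g. 30 fps)
        b, t, c, h, w = rgb.shape
        rgb_feat = self.rgb_encoder(rgb.reshape(b * t, c, h, w))
        rgb_feat = rgb_feat.reshape(b, t, -1).mean(dim=1)  # temporal average
        _, pose_h = self.pose_encoder(pose)                # final GRU state
        fused = torch.cat([rgb_feat, pose_h[-1]], dim=-1)
        return self.classifier(fused)

# Example: a 4-second clip with RGB at 2 fps (8 frames) and pose at 30 fps.
model = AsymmetricRateEAR(num_classes=10)
logits = model(torch.randn(1, 8, 3, 224, 224), torch.randn(1, 120, 126))
print(logits.shape)  # torch.Size([1, 10])
```

The per-frame RGB cost dominates CPU usage, so lowering T_rgb while keeping the cheap pose stream dense is what the paper's reported trade-off exploits.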
@article{calzavara2025_2506.01757,
  title   = {Efficient Egocentric Action Recognition with Multimodal Data},
  author  = {Marco Calzavara and Ard Kastrati and Matteo Macchini and Dushan Vasilevski and Roger Wattenhofer},
  journal = {arXiv preprint arXiv:2506.01757},
  year    = {2025}
}