Scaling Human Activity Recognition: A Comparative Evaluation of Synthetic Data Generation and Augmentation Techniques

Human activity recognition (HAR) is often limited by the scarcity of labeled datasets due to the high cost and complexity of real-world data collection. To mitigate this, recent work has explored generating virtual inertial measurement unit (IMU) data via cross-modality transfer. While video-based and language-based pipelines have each shown promise, they differ in assumptions and computational cost. Moreover, their effectiveness relative to traditional sensor-level data augmentation remains unclear. In this paper, we present a direct comparison of these two virtual IMU generation approaches against classical data augmentation techniques. We construct a large-scale virtual IMU dataset spanning 100 diverse activities from Kinetics-400 and simulate sensor signals at 22 body locations. The three data generation strategies are evaluated on benchmark HAR datasets (UTD-MHAD, PAMAP2, HAD-AW) using four popular models. Results show that virtual IMU data significantly improves performance over real or augmented data alone, particularly under limited-data conditions. We offer practical guidance on choosing data generation strategies and highlight the distinct advantages and disadvantages of each approach.
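The abstract does not specify which sensor-level augmentations were used as the classical baseline; the sketch below assumes common transforms from the HAR literature (jittering, per-channel scaling, and random 3D rotation applied to a single 3-axis accelerometer window). Function names and parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def jitter(x, sigma=0.05, rng=None):
    """Add Gaussian noise to every sample of a (T, C) sensor window."""
    rng = rng or np.random.default_rng()
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, sigma=0.1, rng=None):
    """Multiply each channel by a random factor drawn around 1."""
    rng = rng or np.random.default_rng()
    factors = rng.normal(1.0, sigma, size=(1, x.shape[1]))
    return x * factors

def rotate(x, rng=None):
    """Apply a random 3D rotation to a (T, 3) accelerometer window."""
    rng = rng or np.random.default_rng()
    # Draw a random orthogonal matrix via QR decomposition of a Gaussian matrix
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))          # fix column signs for a unique Q
    if np.linalg.det(q) < 0:          # ensure a proper rotation (det = +1)
        q[:, 0] *= -1
    return x @ q.T

# Example: augment a 2-second window of 3-axis accelerometer data at 50 Hz
window = np.random.default_rng(0).normal(size=(100, 3))
augmented = rotate(scale(jitter(window)))
```

In practice such transforms would be applied independently to each of the body-worn sensor streams (here, 22 locations) before training the HAR models.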
@article{leng2025_2506.07612,
  title   = {Scaling Human Activity Recognition: A Comparative Evaluation of Synthetic Data Generation and Augmentation Techniques},
  author  = {Zikang Leng and Archith Iyer and Thomas Plötz},
  journal = {arXiv preprint arXiv:2506.07612},
  year    = {2025}
}