
Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images

Main: 5 pages
Bibliography: 1 page
Appendix: 1 page
Figures: 6
Tables: 7
Abstract

Recognising human activity in a single photo enables indexing, safety, and assistive applications, but still images lack the motion cues that video-based methods rely on. On 285 MSCOCO images labelled as walking, running, sitting, or standing, CNNs trained from scratch reach 41% accuracy, while fine-tuning the multimodal CLIP model raises accuracy to 76%, showing that contrastive vision-language pre-training substantially improves still-image action recognition and offers a practical basis for real-world deployment.
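As a concrete illustration of the fine-tuning setup the abstract describes, the sketch below attaches a linear classification head to a pre-trained CLIP image encoder and trains it with cross-entropy on the four action classes. The checkpoint name, the linear head, and the hyperparameters are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch (not the authors' code) of fine-tuning CLIP for
# 4-way still-image action classification.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

LABELS = ["walking", "running", "sitting", "standing"]

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class ClipActionClassifier(nn.Module):
    def __init__(self, clip_model, num_classes=len(LABELS)):
        super().__init__()
        self.clip = clip_model
        # Linear head on CLIP's projected image embedding (512-d for ViT-B/32).
        self.head = nn.Linear(clip_model.config.projection_dim, num_classes)

    def forward(self, pixel_values):
        feats = self.clip.get_image_features(pixel_values=pixel_values)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalise, as CLIP does
        return self.head(feats)

model = ClipActionClassifier(clip)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR: assumed
criterion = nn.CrossEntropyLoss()

def train_step(images, label_ids):
    """images: list of PIL images; label_ids: LongTensor of class indices."""
    inputs = processor(images=images, return_tensors="pt")
    logits = model(inputs["pixel_values"])
    loss = criterion(logits, label_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A scratch-CNN baseline of the kind the abstract compares against would replace `ClipActionClassifier` with a randomly initialised convolutional network trained end-to-end on the same 285 images.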

@article{mahanta2025_2506.13458,
  title={Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images},
  author={Cristina Mahanta and Gagan Bhatia},
  journal={arXiv preprint arXiv:2506.13458},
  year={2025}
}