Keystep Recognition using Graph Neural Networks

We pose keystep recognition as a node classification task and propose GLEVR, a flexible graph-learning framework for fine-grained keystep recognition that effectively leverages long-term dependencies in egocentric videos. Our approach constructs a graph in which each clip of the egocentric video corresponds to a node. The constructed graphs are sparse and computationally efficient, and substantially outperform existing larger models. During training, we further leverage alignment between egocentric and exocentric videos to improve inference on egocentric videos, and add automatic captioning as an additional modality: each clip of each exocentric video (when available) and each video caption becomes an additional node. We examine several strategies for defining connections across these nodes. Extensive experiments on the Ego-Exo4D dataset show that our flexible graph-based framework notably outperforms existing methods.
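As a rough illustration of the graph construction described above (a minimal sketch, not the authors' implementation: the feature dimension, temporal window size, and mean aggregation are all assumptions), the snippet below treats each egocentric clip as a node, connects nodes within a short temporal window so the graph stays sparse, and performs one mean-aggregation message-passing step, the basic operation underlying GNN node classification.

```python
import numpy as np

def build_temporal_graph(num_clips, window=2):
    """Sparse adjacency: connect each clip node to clips within a temporal window."""
    adj = np.zeros((num_clips, num_clips), dtype=float)
    for i in range(num_clips):
        for j in range(max(0, i - window), min(num_clips, i + window + 1)):
            if i != j:
                adj[i, j] = 1.0
    return adj

def mean_aggregate(adj, features):
    """One message-passing step: average each node's neighbors with the node itself."""
    deg = adj.sum(axis=1, keepdims=True) + 1.0  # +1 counts the self node
    return (adj @ features + features) / deg

rng = np.random.default_rng(0)
num_clips, feat_dim = 10, 4                     # toy sizes, chosen for illustration
features = rng.normal(size=(num_clips, feat_dim))  # placeholder clip embeddings
adj = build_temporal_graph(num_clips, window=2)
smoothed = mean_aggregate(adj, features)        # node features after one GNN layer
```

In a full model, exocentric-clip and caption nodes would be added alongside the egocentric ones (with cross-view edges defined by one of the connection strategies), and the smoothed node features would be fed to a per-node classifier over keystep labels.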
@article{romero2025_2506.01102,
  title={Keystep Recognition using Graph Neural Networks},
  author={Julia Lee Romero and Kyle Min and Subarna Tripathi and Morteza Karimzadeh},
  journal={arXiv preprint arXiv:2506.01102},
  year={2025}
}