Graph Neural Networks are well suited to capture latent interactions between various entities in the spatio-temporal domain (e.g. videos). However, when an explicit structure is not available, it is not obvious what atomic elements should be represented as nodes. Current works generally use pre-trained object detectors or fixed, predefined regions to extract graph nodes. In contrast, our proposed model learns nodes that dynamically attach to salient space-time regions relevant for a higher-level task, without using any object-level supervision. Constructing these localised, adaptive nodes gives our model an inductive bias towards object-centric representations, and we show that it discovers regions that are well correlated with objects in the video. The localised nodes are the key component of the method, and visualising their regions leads to a more explainable model. In extensive ablation studies and experiments on two challenging datasets, we show superior performance to previous graph neural network models for video classification.
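As a minimal sketch of the core idea, one way to realise nodes that "dynamically attach to salient space-time regions" without object-level supervision is soft attention pooling over a backbone's space-time feature volume. This is an illustration under assumed names, not the authors' implementation: `SalientNodePooling`, `feat_dim`, and `num_nodes` are all hypothetical.

```python
import torch
import torch.nn as nn

class SalientNodePooling(nn.Module):
    """Pools a space-time feature volume into a fixed set of learned nodes,
    each with its own predicted attention map over space and time.
    (Illustrative sketch only; not the paper's actual architecture.)"""
    def __init__(self, feat_dim: int, num_nodes: int):
        super().__init__()
        # A single projection predicts one attention logit per node per location.
        self.attn = nn.Linear(feat_dim, num_nodes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, H, W, C) backbone features for a video clip.
        B, T, H, W, C = feats.shape
        flat = feats.reshape(B, T * H * W, C)   # (B, THW, C)
        logits = self.attn(flat)                # (B, THW, N)
        # Softmax over all space-time locations so each node's map sums to 1:
        # the node "attaches" to the regions it weights highly, and the maps
        # themselves can be visualised for explainability.
        maps = logits.softmax(dim=1)            # (B, THW, N)
        nodes = maps.transpose(1, 2) @ flat     # (B, N, C)
        return nodes  # node features, ready to feed into a GNN

# Usage: pool 8 nodes from a (batch=2, T=4, H=14, W=14, C=256) feature volume.
pool = SalientNodePooling(feat_dim=256, num_nodes=8)
nodes = pool(torch.randn(2, 4, 14, 14, 256))
print(nodes.shape)  # torch.Size([2, 8, 256])
```

Because the attention maps are produced by a learned, differentiable module rather than a pre-trained detector or a fixed grid, the regions each node covers can adapt end-to-end to whatever the downstream classification task finds salient.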