
Advancing Video Self-Supervised Learning via Image Foundation Models

Main: 7 pages
7 figures
Bibliography: 1 page
10 tables
Abstract

In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose an advancing video self-supervised learning (AdViSe) approach, aimed at significantly reducing the training overhead of video representation models using pre-trained IFMs. Specifically, we first introduce temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning approach, playback rate perception, to train the temporal modules while freezing the IFM components. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by 3.4× and GPU memory usage by 8.2×. This study offers fresh insights into low-cost video self-supervised learning based on pre-trained IFMs. Code is available at this https URL.
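
The abstract does not spell out how playback rate perception builds its training targets, so the following is only a minimal sketch of the general idea under stated assumptions: a clip is sampled from a video with a frame stride equal to a randomly chosen playback rate, and the index of that rate serves as the classification label. The candidate rates in `PLAYBACK_RATES` and the helper names `sample_clip_indices` and `make_pretext_example` are hypothetical and are not taken from the paper or its code.

```python
import random

# Hypothetical candidate playback rates; the paper's actual choices are not given here.
PLAYBACK_RATES = (1, 2, 4, 8)


def sample_clip_indices(num_frames, clip_len, rate, rng):
    """Return clip_len frame indices spaced `rate` frames apart, from a random start."""
    span = (clip_len - 1) * rate + 1  # number of source frames the clip covers
    if span > num_frames:
        raise ValueError("video too short for this clip length and rate")
    start = rng.randrange(num_frames - span + 1)
    return [start + i * rate for i in range(clip_len)]


def make_pretext_example(num_frames, clip_len, rng):
    """Pick a playback rate at random; return (frame indices, rate-class label)."""
    label = rng.randrange(len(PLAYBACK_RATES))
    indices = sample_clip_indices(num_frames, clip_len, PLAYBACK_RATES[label], rng)
    return indices, label


if __name__ == "__main__":
    rng = random.Random(0)
    indices, label = make_pretext_example(num_frames=200, clip_len=16, rng=rng)
    print("rate:", PLAYBACK_RATES[label], "first indices:", indices[:4])
```

In a setup like the one the abstract describes, the frames selected this way would pass through the frozen IFM, and only the temporal modules and a small classification head would be trained to predict `label`.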

@article{wu2025_2505.19218,
  title={Advancing Video Self-Supervised Learning via Image Foundation Models},
  author={Jingwei Wu and Zhewei Huang and Chang Liu},
  journal={arXiv preprint arXiv:2505.19218},
  year={2025}
}