arxiv:2505.19218

Advancing Video Self-Supervised Learning via Image Foundation Models

Published on May 25, 2025

Abstract

In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose Advancing Video Self-supervised learning (AdViSe), an approach aimed at significantly reducing the training overhead of video representation models by building on pre-trained IFMs. Specifically, we first attach temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning task, playback rate perception, to train the temporal modules while keeping the IFM components frozen. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by 3.4× and GPU memory usage by 8.2×. This study offers fresh insights into low-cost video self-supervised learning based on pre-trained IFMs. Code is available at https://github.com/JingwWu/advise-video-ssl.

AI-generated summary

AdViSe reduces training overhead and GPU memory usage for video representation models by integrating temporal modeling modules with pre-trained image foundation models.
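To make the recipe concrete, here is a minimal PyTorch sketch of the frozen-IFM-plus-temporal-head design described in the abstract. It is an illustration under stated assumptions, not the authors' released implementation (see the linked repository for that): a torchvision ResNet-50 stands in for the IFM, the 3D-conv temporal head only loosely mirrors the paper's ResNet3D modules, and the playback-rate class count (`num_rates`) is hypothetical.

```python
# Minimal sketch of the AdViSe idea: a frozen image backbone plus a
# trainable temporal head, supervised by playback rate perception.
# Names and hyperparameters are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torchvision.models as models


class AdViSeSketch(nn.Module):
    def __init__(self, num_rates: int = 4):
        super().__init__()
        # Frozen image foundation model; a ResNet-50 stands in for the IFM.
        ifm = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.ifm = nn.Sequential(*list(ifm.children())[:-2])  # drop avgpool/fc
        for p in self.ifm.parameters():
            p.requires_grad = False
        self.ifm.eval()  # also freeze BatchNorm running statistics

        # Trainable temporal modeling module: a 3D conv over per-frame
        # features, loosely mirroring the paper's ResNet3D-style head.
        self.temporal = nn.Sequential(
            nn.Conv3d(2048, 512, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.BatchNorm3d(512),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        # Classify the clip's playback rate (the self-supervised pretext task).
        self.head = nn.Linear(512, num_rates)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, H, W) video frames
        b, t, c, h, w = clip.shape
        with torch.no_grad():  # the IFM stays frozen
            feats = self.ifm(clip.flatten(0, 1))  # (B*T, 2048, h', w')
        # Regroup frame features into a 5D video tensor: (B, 2048, T, h', w')
        feats = feats.unflatten(0, (b, t)).permute(0, 2, 1, 3, 4)
        z = self.temporal(feats).flatten(1)  # (B, 512)
        return self.head(z)                  # playback-rate logits


if __name__ == "__main__":
    model = AdViSeSketch(num_rates=4)
    clip = torch.randn(2, 8, 3, 224, 224)  # two clips of 8 frames each
    print(model(clip).shape)               # torch.Size([2, 4])
```

Because gradients flow only through the temporal head and the classifier, the frozen IFM can run under `torch.no_grad()`; savings of the kind the paper reports (less training time and GPU memory) come from skipping backbone gradients and their activations.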

