arxiv:2507.07966

Scaling RL to Long Videos

Published on Jul 10
· Submitted by Yukang on Jul 11
#1 Paper of the day

Abstract

A framework for scaling vision-language models to long videos using reinforcement learning, achieving strong performance on various reasoning tasks with a specialized training infrastructure.

AI-generated summary

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
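
The efficiency gain described for MR-SP comes from two ideas: video frame embeddings are computed once and cached, so repeated RL rollouts and prefilling on the same video skip the vision encoder, and the resulting long token sequence is sharded across GPUs with sequence parallelism. The sketch below illustrates both ideas in plain PyTorch; it is not the MR-SP or Long-RL implementation, and every name in it (`VideoEmbeddingCache`, `shard_sequence`, the stand-in encoder) is a hypothetical assumption for illustration.

```python
# Illustrative sketch only -- NOT the MR-SP / Long-RL implementation.
# Idea 1: cache per-video embeddings so each RL rollout reuses them.
# Idea 2: shard the long embedding sequence across ranks (sequence parallelism).
import torch


class VideoEmbeddingCache:
    """Encode each video once; later rollouts on the same video hit the cache."""

    def __init__(self, vision_encoder):
        self.vision_encoder = vision_encoder
        self._cache = {}  # video_id -> [num_frames, hidden] tensor

    @torch.no_grad()
    def get(self, video_id: str, frames: torch.Tensor) -> torch.Tensor:
        # frames: [num_frames, C, H, W]
        if video_id not in self._cache:
            self._cache[video_id] = self.vision_encoder(frames)
        return self._cache[video_id]


def shard_sequence(embeddings: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Give each rank a contiguous slice of the long frame-token sequence."""
    num_tokens = embeddings.shape[0]
    per_rank = (num_tokens + world_size - 1) // world_size
    start, end = rank * per_rank, min((rank + 1) * per_rank, num_tokens)
    return embeddings[start:end]


if __name__ == "__main__":
    # Stand-in "vision encoder": mean-pool each frame into a 1024-dim vector.
    encoder = lambda frames: frames.flatten(1).mean(dim=1, keepdim=True).repeat(1, 1024)
    cache = VideoEmbeddingCache(encoder)

    # A toy clip (a real hour-long video would be ~3,600 frames).
    frames = torch.randn(128, 3, 32, 32)
    emb = cache.get("video_0001", frames)        # encoded once
    emb_again = cache.get("video_0001", frames)  # cache hit on a later rollout
    assert emb is emb_again

    # Sequence parallelism over 8 GPUs: rank 0 holds its slice of the tokens.
    local = shard_sequence(emb, rank=0, world_size=8)
    print(local.shape)  # torch.Size([16, 1024])
```

In a real setup the cached embeddings would feed the vLLM-based rollout engine and each rank's slice would take part in attention with cross-rank communication; the sketch only shows the caching and sharding bookkeeping.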

Community

Paper submitter

We introduce a full-stack framework that scales VLMs to long videos with RL. It includes a dataset named LongVideo-Reason with 52K QA pairs and reasoning annotations, and a Multi-modal Reinforcement Sequence Parallelism (MR-SP) system that speeds up long video RL training by 2.1x and supports hour-long videos (e.g., 3,600 frames / around 256k tokens) on a single node of 8 A100 GPUs. In addition, our codebase supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. Code is available at https://github.com/NVlabs/Long-RL

arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/scaling-rl-to-long-videos
