Scaling RL to Long Videos - LongVILA‑R1 and MR‑SP (Overview)


10 Jul 2025, 00:00 Z

TL;DR Long‑RL introduces a full‑stack recipe for long‑video reasoning: (1) LongVideo‑Reason (104K QA pairs with reasoning), (2) a two‑stage training pipeline (Chain‑of‑Thought SFT then RL), and (3) a training system (MR‑SP) that adds sequence parallelism and a vLLM‑based engine with cached video embeddings to accelerate RL on hour‑long videos.

What is “Scaling RL to Long Videos”?

The work presents a practical way to improve long‑video understanding for vision‑language models (VLMs) using reinforcement learning. Key components:

  • LongVideo‑Reason: ~104K long‑video QA pairs with high‑quality reasoning annotations, spanning sports, games, vlogs and more.
  • Two‑stage training: start with Chain‑of‑Thought supervised fine‑tuning (CoT‑SFT), then optimize with RL for long‑horizon reasoning.
  • MR‑SP training stack: Multi‑modal Reinforcement Sequence Parallelism integrates sequence parallelism and a vLLM‑based engine that caches video embeddings for faster rollout/prefill.

Reported results (paper/repo): LongVILA‑R1‑7B achieves 65.1% / 71.1% on VideoMME (without / with subtitles), outperforming LongVILA‑7B on several benchmarks. It supports up to 8,192 frames per video with configurable FPS, and the MR‑SP system reports up to 2.1× speed‑up for long‑video RL training. On a single A100 node (8 GPUs), RL training on hour‑long videos (~3,600 frames) is supported.
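The frame numbers above fit together: an hour of video sampled at 1 FPS is ~3,600 frames, well under the 8,192-frame ceiling. A small sketch (hypothetical helper, not the repo's API) of FPS-based sampling with a frame cap:

```python
def sample_frame_indices(duration_s: float, native_fps: float,
                         target_fps: float, max_frames: int = 8192) -> list[int]:
    """Pick frame indices at target_fps; uniformly subsample past max_frames."""
    total = int(duration_s * native_fps)
    step = native_fps / target_fps
    idx = [int(i * step) for i in range(int(duration_s * target_fps))]
    idx = [i for i in idx if i < total]
    if len(idx) > max_frames:
        stride = len(idx) / max_frames
        idx = [idx[int(i * stride)] for i in range(max_frames)]
    return idx

# One hour at 30 FPS native, sampled at 1 FPS -> 3,600 frames.
print(len(sample_frame_indices(3600, 30, 1)))  # -> 3600
```

A two-hour video at 2 FPS would exceed the cap and be subsampled back down to 8,192 frames.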

Why it matters

  • Long‑context reasoning: Extends VLMs from short clips to hour‑scale content with explicit reasoning signals and RL optimization.
  • Efficiency: Sequence parallelism, cached embeddings, and vLLM prefilling reduce training overheads at long horizons.
  • Generality: The released system targets multiple modalities (video, text, audio), supports different backbones (e.g., VILA, Qwen), and even extends to image/video generation models.
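The sequence-parallelism piece can be pictured as follows: a single long multimodal token sequence is split into contiguous chunks, one per GPU, so no single device has to hold the whole hour-long context. A minimal sharding sketch (illustrative, not the MR‑SP partitioning code):

```python
def shard_sequence(tokens: list[int], world_size: int) -> list[list[int]]:
    """Split one long sequence into contiguous, near-equal chunks, one per rank."""
    base, rem = divmod(len(tokens), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)  # first `rem` ranks get one extra
        shards.append(tokens[start:start + size])
        start += size
    return shards

seq = list(range(10_000))          # stand-in for a long multimodal token sequence
shards = shard_sequence(seq, 8)    # e.g., 8 GPUs on one A100 node
print([len(s) for s in shards])    # -> [1250, 1250, 1250, 1250, 1250, 1250, 1250, 1250]
```

In a real system each rank would run attention over its shard and exchange activations via collectives; the sketch only shows the partitioning that makes per-GPU memory scale as roughly 1/world_size of the sequence.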

Quick start

Installation (from repo):

git clone https://github.com/NVlabs/Long-RL.git
cd Long-RL
pip install -e .

# Optional (Qwen Omni support)
bash vllm_replace.sh

Single‑node training example (8 GPUs):

bash examples/new_supports/qwen2_5_vl_3b_video_grpo.sh $VIDEO_PATH

Multi‑node launcher:

bash scripts/srun_multi_nodes.sh examples/new_supports/qwen2_5_vl_3b_video_grpo.sh 2
