Scaling RL to Long Videos — LongVILA‑R1 and MR‑SP (Overview)
10 Jul 2025
TL;DR Long‑RL introduces a full‑stack recipe for long‑video reasoning: (1) LongVideo‑Reason (104K QA pairs with reasoning), (2) a two‑stage training pipeline (Chain‑of‑Thought SFT then RL), and (3) a training system (MR‑SP) that adds sequence parallelism and a vLLM‑based engine with cached video embeddings to accelerate RL on hour‑long videos.
What is “Scaling RL to Long Videos”?
The work presents a practical way to improve long‑video understanding for vision‑language models (VLMs) using reinforcement learning. Key components:
- LongVideo‑Reason: ~104K long‑video QA pairs with high‑quality reasoning annotations, spanning sports, games, vlogs and more.
- Two‑stage training: start with Chain‑of‑Thought supervised fine‑tuning (CoT‑SFT), then optimize with RL for long‑horizon reasoning.
- MR‑SP training stack: Multi‑modal Reinforcement Sequence Parallelism integrates sequence parallelism with a vLLM‑based rollout engine that caches video embeddings to speed up rollout and prefill (a conceptual sketch follows this list).
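To see why cached embeddings matter, note that group‑based RL (e.g., GRPO) samples many responses per prompt, so re‑encoding thousands of frames for every rollout would dominate training time. Below is a minimal conceptual sketch of the caching idea in Python; it is not the Long‑RL API, and `vision_encoder`, `policy_llm`, and `_embedding_cache` are hypothetical stand‑ins.

```python
import torch

# Conceptual sketch of the cached-embedding idea, NOT the Long-RL API.
_embedding_cache: dict[str, torch.Tensor] = {}

def get_video_embeddings(video_id: str, frames: torch.Tensor, vision_encoder) -> torch.Tensor:
    # Encode each video's frames once and reuse the result: in group-based
    # RL many rollouts share the same video prefix, so the expensive visual
    # prefill is amortized across the whole group.
    if video_id not in _embedding_cache:
        with torch.no_grad():
            _embedding_cache[video_id] = vision_encoder(frames)
    return _embedding_cache[video_id]

def sample_rollout_group(video_id, frames, question, vision_encoder, policy_llm, group_size=8):
    video_emb = get_video_embeddings(video_id, frames, vision_encoder)
    # Reuse the cached embeddings for every sampled response instead of
    # re-encoding thousands of frames per rollout.
    return [policy_llm.generate(video_emb, question) for _ in range(group_size)]
```

In the actual system this caching lives inside the vLLM‑based engine and is combined with sequence parallelism; the sketch only illustrates why the prefill cost amortizes across a rollout group.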
Reported results (paper/repo):
- LongVILA‑R1‑7B reaches 65.1% / 71.1% on VideoMME (without / with subtitles), outperforming LongVILA‑7B on several benchmarks.
- It processes up to 8,192 frames per video, with configurable FPS.
- MR‑SP reports up to a 2.1× speed‑up for long‑video RL training, and a single A100 node (8 GPUs) supports RL training on hour‑long videos (~3,600 frames; a quick arithmetic check follows this list).
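As a back‑of‑the‑envelope check on that frame budget, here is a hedged sketch; `num_frames` is a hypothetical helper, not repo code.

```python
# Hypothetical frame-budget helper (illustration only, not the repo's API):
# sample at a target FPS, capped at the reported 8,192-frame limit.
def num_frames(duration_s: float, fps: float, max_frames: int = 8192) -> int:
    return min(int(duration_s * fps), max_frames)

# An hour-long video sampled at 1 FPS yields 3,600 frames, matching the
# single-node (8x A100) RL training example above.
assert num_frames(3600, 1.0) == 3600

# A three-hour video at 1 FPS would be clipped to the 8,192-frame cap.
assert num_frames(10800, 1.0) == 8192
```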
Links:
- Code (GitHub): https://github.com/NVlabs/Long-RL
- Demo (Gradio): https://long-rl.hanlab.ai
Why it matters
- Long‑context reasoning: Extends VLMs from short clips to hour‑scale content with explicit reasoning signals and RL optimization.
- Efficiency: Sequence parallelism, cached embeddings, and vLLM prefilling reduce training overheads at long horizons.
- Generality: The released system targets multiple modalities (video, text, audio), supports different backbones (e.g., VILA, Qwen), and even extends to image/video generation models.
Quick start
Installation (from repo):
git clone https://github.com/NVlabs/Long-RL.git
cd Long-RL
pip install -e .
# Optional (Qwen Omni support)
bash vllm_replace.sh

Single‑node training (8 GPUs; example):

bash examples/new_supports/qwen2_5_vl_3b_video_grpo.sh $VIDEO_PATH

Multi‑node launcher:
bash scripts/srun_multi_nodes.sh examples/new_supports/qwen2_5_vl_3b_video_grpo.sh 2