Video‑RAG: Visually‑Aligned Retrieval‑Augmented Long Video Comprehension (Overview)

20 Nov 2024, 00:00 Z

TL;DR Video‑RAG extracts audio transcripts, on‑screen text, and object cues from long videos, aligns them to frames/clips, and feeds them as auxiliary texts alongside sampled frames to an LVLM in a single‑turn retrieval step. This training‑free approach improves accuracy on long‑video QA/understanding while keeping compute low and remaining model‑agnostic.

Context (Nov–Dec 2024)

The paper reports: submitted 20 Nov 2024; last revised 20 Dec 2024. It demonstrates consistent gains on long‑video benchmarks (Video‑MME, MLVU, LongVideoBench) and highlights that a strong open model (e.g., a 72B LVLM) with Video‑RAG can surpass some proprietary systems on these tasks.

What is Video‑RAG?

Large video‑language models struggle with hour‑long videos due to context limits and information dispersion. Video‑RAG uses a retrieval‑augmented approach that turns raw video into a compact, visually‑aligned text corpus which the LVLM can efficiently search and reason over in one pass.

Key ideas:

  • Visually‑aligned auxiliary texts: Extract ASR (audio transcripts), OCR (on‑screen text), and open‑vocabulary object detections; timestamp and align them to frames/clips.
  • Single‑turn retrieval: One lightweight retrieval step selects the most relevant snippets based on the user query.
  • Plug‑and‑play with any LVLM: Works as an input‑side augmentation; no LVLM fine‑tuning required.
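The visually‑aligned auxiliary texts above can be kept in a simple timestamped record. A minimal sketch, assuming a flat per‑snippet schema (the field names and the `[start-end|source]` rendering are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class EvidenceRecord:
    start_s: float   # snippet start time in seconds
    end_s: float     # snippet end time in seconds
    source: str      # "asr" (transcript), "ocr" (on-screen text), or "det" (object tags)
    text: str        # the extracted auxiliary text

def as_prompt_line(rec: EvidenceRecord) -> str:
    """Render one record as a line of auxiliary text for the LVLM prompt."""
    return f"[{rec.start_s:.1f}-{rec.end_s:.1f}s|{rec.source}] {rec.text}"

rec = EvidenceRecord(12.0, 15.5, "ocr", "FINAL LAP")
print(as_prompt_line(rec))  # [12.0-15.5s|ocr] FINAL LAP
```

Because every record carries its timestamps, retrieved snippets stay aligned to the frames the LVLM sees.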

Benefits:

  • Training‑free: No new model weights to train.
  • Low overhead: Single‑turn retrieval keeps compute reasonable for long videos.
  • Broad compatibility: Integrates with different LVLM backbones.

Pipeline at a glance

  1. Pre-processing/Indexing
    • Sample frames/clips from the long video.
    • Run ASR over audio; OCR over frames; open-vocabulary detection (objects/scenes).
    • Build a temporally aligned “evidence” store mapping timestamps → (text spans, object tags).
  2. Retrieval (single-turn)
    • Embed the user query and the evidence store (or use sparse retrieval).
    • Select top-k clips/evidence that best match the query.
  3. LVLM reasoning
    • Feed the sampled frames plus the retrieved, timestamp-aligned auxiliary texts to the LVLM and answer the query in a single pass.
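The single-turn retrieval in step 2 can be sketched end to end. The paper uses embedding-based retrieval; as a stand-in, this stdlib-only version scores evidence snippets against the query with bag-of-words cosine similarity, so the evidence store and example lines here are purely illustrative:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query: str, evidence: list[str], k: int = 2) -> list[str]:
    """One-shot retrieval: rank all evidence snippets against the query once."""
    q = Counter(query.lower().split())
    ranked = sorted(evidence,
                    key=lambda e: cosine(q, Counter(e.lower().split())),
                    reverse=True)
    return ranked[:k]

# Toy evidence store: timestamped ASR/OCR/detection snippets (illustrative).
store = [
    "[0-30s|asr] welcome to the cooking show",
    "[31-60s|ocr] oven at 350 degrees for 20 minutes",
    "[61-90s|det] oven tray cookies",
]
print(retrieve_top_k("oven temperature", store, k=2))
```

The top-k snippets are then prepended to the LVLM prompt alongside the sampled frames; because retrieval happens once per query, the overhead stays a single lightweight pass over the evidence store.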
