Video‑RAG — Visually‑Aligned Retrieval‑Augmented Long Video Comprehension (Overview)
20 Nov 2024, 00:00 Z
TL;DR Video‑RAG extracts audio transcripts, on‑screen text, and object cues from long videos, aligns them to frames/clips, retrieves the query‑relevant snippets in a single turn, and feeds them as auxiliary texts alongside sampled frames to an LVLM. This training‑free approach improves accuracy on long‑video QA/understanding while keeping compute low and remaining model‑agnostic.
Context (Nov–Dec 2024)
The paper was submitted on 20 Nov 2024 and last revised on 20 Dec 2024. It demonstrates consistent gains on long‑video benchmarks (Video‑MME, MLVU, LongVideoBench) and highlights that a strong open model (e.g., a 72B LVLM) paired with Video‑RAG can surpass some proprietary systems on these tasks.
References:
- Repo (reference): https://github.com/Leon1207/Video-RAG-master
What is Video‑RAG?
Large video‑language models struggle with hour‑long videos due to context limits and information dispersion. Video‑RAG uses a retrieval‑augmented approach that turns raw video into a compact, visually‑aligned text corpus which the LVLM can efficiently search and reason over in one pass.
Key ideas:
- Visually‑aligned auxiliary texts: Extract ASR (audio transcripts), OCR (on‑screen text), and open‑vocabulary object detections; timestamp and align them to frames/clips (a data‑model sketch follows this list).
- Single‑turn retrieval: One lightweight retrieval step selects the most relevant snippets based on the user query.
- Plug‑and‑play with any LVLM: Works as an input‑side augmentation; no LVLM fine‑tuning required.
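To make the alignment concrete, here is a minimal sketch of what one visually‑aligned evidence record could look like. The schema and field names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class EvidenceRecord:
    """One visually-aligned auxiliary-text snippet (illustrative schema)."""
    start_s: float  # clip/frame start time in seconds
    end_s: float    # clip/frame end time in seconds
    modality: str   # "asr" (transcript), "ocr" (on-screen text), or "det" (object tags)
    text: str       # the auxiliary text itself

# Three records aligned to the same 10-second clip:
evidence = [
    EvidenceRecord(30.0, 40.0, "asr", "welcome back to the cooking show"),
    EvidenceRecord(30.0, 40.0, "ocr", "Step 2: Preheat oven to 180 C"),
    EvidenceRecord(30.0, 40.0, "det", "oven; baking tray; person"),
]
```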
Benefits:
- Training‑free: No new model weights to train.
- Low overhead: Single‑turn retrieval keeps compute reasonable for long videos.
- Broad compatibility: Integrates with different LVLM backbones.
Pipeline at a glance
- Pre-processing/Indexing (sketch below)
  - Sample frames/clips from the long video.
  - Run ASR over the audio, OCR over sampled frames, and open-vocabulary detection (objects/scenes).
  - Build a temporally aligned “evidence” store mapping timestamps → (text spans, object tags).
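A sketch of this indexing step, reusing the `EvidenceRecord` schema above. Whisper is one real ASR option whose segments carry start/end timestamps; `extract_frame`, `run_ocr`, and `run_detector` are hypothetical placeholders for a frame grabber, an OCR engine, and an open‑vocabulary detector, since the paper does not prescribe specific tools here.

```python
import whisper  # assumption: openai-whisper for ASR; any timestamped ASR works

def build_evidence_store(video_path: str, frame_times: list[float]) -> list[EvidenceRecord]:
    """Index a long video into timestamped auxiliary texts (illustrative)."""
    store: list[EvidenceRecord] = []

    # 1) ASR: Whisper returns segments with start/end times in seconds.
    asr = whisper.load_model("base").transcribe(video_path)
    for seg in asr["segments"]:
        store.append(EvidenceRecord(seg["start"], seg["end"], "asr", seg["text"].strip()))

    # 2) OCR and open-vocabulary detection over sampled frames.
    #    extract_frame, run_ocr, run_detector are hypothetical placeholders.
    for t in frame_times:
        frame = extract_frame(video_path, t)
        for text in run_ocr(frame):
            store.append(EvidenceRecord(t, t, "ocr", text))
        for tag in run_detector(frame):
            store.append(EvidenceRecord(t, t, "det", tag))
    return store
```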
- Retrieval (single-turn; sketch below)
  - Embed the user query and the evidence store (or use sparse retrieval).
  - Select the top-k clips/evidence snippets that best match the query.
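A minimal dense-retrieval sketch for this step, assuming a sentence-transformers text encoder; the specific encoder and the value of k are assumptions, not choices fixed by the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve(query: str, store: list[EvidenceRecord], k: int = 10) -> list[EvidenceRecord]:
    """Single-turn retrieval: one embed-and-rank pass, no iterative loops (illustrative)."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder; any text encoder works
    q = encoder.encode([query], normalize_embeddings=True)                     # shape (1, d)
    docs = encoder.encode([e.text for e in store], normalize_embeddings=True)  # shape (n, d)
    scores = (docs @ q.T).ravel()      # cosine similarity via normalized dot products
    top = np.argsort(-scores)[:k]      # indices of the k highest-scoring snippets
    return [store[i] for i in top]
```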
- LVLM reasoning (sketch below)
  - Feed the sampled frames, the retrieved auxiliary texts, and the user query to the LVLM for a single-pass answer.
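To close the loop, here is a sketch of how retrieved evidence could be formatted into the LVLM prompt. The exact template is an assumption; the sampled frames are passed separately through the backbone's usual image interface.

```python
def build_prompt(query: str, hits: list[EvidenceRecord]) -> str:
    """Format retrieved evidence as auxiliary text for the LVLM (illustrative template)."""
    lines = [f"[{e.start_s:.0f}-{e.end_s:.0f}s][{e.modality}] {e.text}" for e in hits]
    return (
        "Auxiliary evidence extracted from the video:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {query}\n"
        + "Answer using both the sampled frames and the evidence above."
    )
```

Because all augmentation happens on the input side, the same prompt assembly works with any LVLM backbone, which is what makes the approach plug-and-play.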