SpatialVID — A Large-Scale Video Dataset with Spatial Annotations (Overview)
11 Sep 2025, 00:00 Z
TL;DR: SpatialVID curates more than 21,000 hours of raw video into ~2.7M clips (~7,089 hours) and annotates them with per‑frame camera poses, depth maps, dynamic masks, structured captions, and motion instructions. It is a resource aimed at scaling spatial intelligence research and data‑hungry video/3D models.
What is SpatialVID?
SpatialVID is a large‑scale dataset designed to address the shortage of diverse, high‑quality spatial annotations for real‑world dynamic videos. It focuses on rich geometric and semantic labels that are useful for 3D perception, reconstruction, and video generation systems.
Key components:
- Per‑frame camera poses (camera motion produced by the annotation pipeline for dynamic scenes)
- Dense depth maps across time
- Dynamic object masks
- Structured captions
- Serialized motion instructions
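To make the list above concrete, here is a minimal Python sketch of how one clip's annotations might be held in memory. It is not the dataset's actual schema: the class name `ClipAnnotation`, every field name, the 4x4 pose layout, and the use of file paths for depth maps and masks are all assumptions made for illustration. Check the repository for the real file formats.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: each pose is a 4x4 matrix stored as nested lists (row-major).
Matrix4 = List[List[float]]


@dataclass
class ClipAnnotation:
    """Hypothetical container for one clip's annotations (not the official schema)."""

    clip_id: str
    poses: List[Matrix4]            # one camera pose per frame
    depth_paths: List[str]          # one depth-map file per frame (assumed layout)
    mask_paths: List[str]           # one dynamic-mask file per frame (assumed layout)
    caption: Dict[str, str]         # structured caption fields
    motion_instructions: List[str]  # serialized motion instructions

    @property
    def num_frames(self) -> int:
        return len(self.poses)

    def validate(self) -> None:
        """Check that per-frame annotations agree in length and poses are 4x4."""
        n = self.num_frames
        if len(self.depth_paths) != n or len(self.mask_paths) != n:
            raise ValueError("per-frame annotations must have one entry per frame")
        for pose in self.poses:
            if len(pose) != 4 or any(len(row) != 4 for row in pose):
                raise ValueError("each pose must be a 4x4 matrix")
```

A loader built along these lines can call `validate()` once per clip to catch misaligned per-frame annotations before training.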
References:
- Paper (arXiv): https://arxiv.org/abs/2509.09676
- Project page: https://nju-3dv.github.io/projects/SpatialVID/
Scale and curation pipeline
According to the paper and project materials, the pipeline:
- Collects more than 21,000 hours of raw, in‑the‑wild video.
- Filters and segments the corpus into approximately 2.7 million clips, totalling ~7,089 hours of dynamic content.
- Runs an annotation pipeline that adds camera poses, depth maps, dynamic masks, captions, and motion instructions.
The intent is to improve model generalization by broadening the diversity of scenes and motion patterns and by enriching the annotations.
Note: Always consult the upstream repo for the latest statistics and updates, as numbers and splits may evolve.
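As a quick sanity check on the headline figures, the short calculation below derives the average clip length and the fraction of raw footage retained. It uses only the rounded numbers quoted above, so both results are approximate. Because the raw corpus is stated as "more than 21,000 hours", the retained fraction is an upper bound.

```python
# Rounded figures quoted in the overview; the upstream statistics may differ.
RAW_HOURS = 21_000      # lower bound on raw video collected
CLIP_HOURS = 7_089      # total duration of the curated clips
NUM_CLIPS = 2_700_000   # approximate clip count

avg_clip_seconds = CLIP_HOURS * 3600 / NUM_CLIPS
retained_fraction = CLIP_HOURS / RAW_HOURS

print(f"average clip length: {avg_clip_seconds:.1f} s")      # 9.5 s
print(f"retained fraction (at most): {retained_fraction:.1%}")  # 33.8%
```

In other words, the curated clips average a little under ten seconds each, and roughly a third of the raw footage at most survives filtering.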
Downloading the dataset
The authors provide downloads via multiple mirrors and helper utilities:
- Hugging Face organization: https://huggingface.co/SpatialVID
- ModelScope org: https://www.modelscope.cn/organization/SpatialVID
- Repo utilities: utils/download_SpatialVID.py (dataset) and utils/download_YouTube.py (raw videos)
Example (see repository for exact arguments and latest instructions):
git clone --recursive https://github.com/NJU-3DV/SpatialVID.git
cd SpatialVID
conda create -n SpatialVID python=3.10.13
conda activate SpatialVID
pip install -r requirements/requirements.txt
# Optional: scoring dependencies
pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
pip install -r requirements/requirements_scoring.txt
# Use helper script to download datasets (adjust prefixes/paths)
python utils/download_SpatialVID.py --help
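The repository's helper script is the documented route. For scripted use from Python, the sketch below shows how the Hugging Face mirror could instead be fetched with `snapshot_download` from the `huggingface_hub` package. The repository id `SpatialVID/SpatialVID` is a placeholder assumption, not a confirmed name, so check the organization page for the actual dataset repositories. The helper names `build_snapshot_kwargs` and `download_snapshot` are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional

# Assumption: placeholder repo id; look up the real one on the organization page.
DEFAULT_REPO_ID = "SpatialVID/SpatialVID"


def build_snapshot_kwargs(
    local_dir: str,
    repo_id: str = DEFAULT_REPO_ID,
    allow_patterns: Optional[Iterable[str]] = None,
) -> Dict[str, object]:
    """Assemble keyword arguments for huggingface_hub.snapshot_download."""
    kwargs: Dict[str, object] = {
        "repo_id": repo_id,
        "repo_type": "dataset",  # dataset repo rather than a model repo
        "local_dir": str(Path(local_dir)),
    }
    if allow_patterns is not None:
        # Restrict the download to matching files, e.g. annotations only.
        kwargs["allow_patterns"] = list(allow_patterns)
    return kwargs


def download_snapshot(local_dir: str, **options) -> str:
    """Download the dataset snapshot and return the local path."""
    # Imported lazily so the module loads without the package installed
    # (pip install huggingface_hub).
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_snapshot_kwargs(local_dir, **options))
```

Calling `download_snapshot("data/spatialvid", allow_patterns=["*.json"])` would then fetch only the files matching that pattern. The pattern is also illustrative, since the actual file layout is defined upstream.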