Stand-In — A Lightweight and Plug-and-Play Identity Control for Video Generation (Overview)
11 Aug 2025, 00:00 Z
TL;DR Stand-In bolts identity control onto Wan 2.1 text-to-video backbones with only ~1% additional parameters. A conditional image branch, restricted self-attention, and conditional position mapping lock in the reference face so teams can run subject-driven, pose-guided, stylized, or face-swapped videos without retraining the full model.
What is Stand-In?
Stand-In is a lightweight identity-preserving add-on for diffusion video generators, announced on 11 August 2025 alongside an arXiv preprint and open-source repo. Rather than fine-tuning all parameters, the authors insert a small conditional branch that ingests a reference image and steers the video backbone (Wan 2.1–14B in the release) so the generated subject keeps consistent facial features.
Identity control hinges on two pieces: restricted self-attention that gates the influence of reference features, and conditional position mapping that aligns the reference embedding with frame locations. The framework learns from roughly 2,000 image–video pairs yet surpasses heavier baselines on face similarity and naturalness. Because the add-on is modular, the team shows it working with subject-driven video generation, community LoRAs, VACE pose control, stylization, and even experimental face swapping.
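The exact masking rule is not spelled out in this overview, but the gating idea behind restricted self-attention can be sketched as a blockwise attention mask: video tokens may read from both video and reference tokens, while reference tokens attend only among themselves, so identity features flow one way into the frames. Everything below (the token counts and the `build_restricted_mask` helper) is illustrative, not the authors' implementation.

```python
import numpy as np

def build_restricted_mask(n_video: int, n_ref: int) -> np.ndarray:
    """Blockwise mask for restricted self-attention (illustrative sketch).

    True = attention allowed. Video tokens (first n_video rows) may
    attend everywhere; reference tokens (last n_ref rows) attend only
    to other reference tokens, keeping the reference branch read-only.
    """
    n = n_video + n_ref
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_video, :] = True          # video tokens -> video + reference
    mask[n_video:, n_video:] = True   # reference tokens -> reference only
    return mask

mask = build_restricted_mask(n_video=4, n_ref=2)
# Reference rows never look at video columns, so the reference
# embedding steers generation without being overwritten by it.
assert not mask[4:, :4].any()
```

In practice a mask like this would be passed to the backbone's attention layers; the one-way flow is what lets the adapters inject identity cues without disturbing the base text-to-video computation.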
Links:
- Project page: https://www.stand-in.tech
- Hugging Face weights: https://huggingface.co/BowenXue/Stand-In
Key ideas
- Conditional image branch: feeds a reference portrait through Stand-In’s adapters so the video model gets identity cues without replacing its base text-to-video path (paper abstract + README).
- Restricted self-attention with conditional position mapping: constrains attention to identity-relevant regions and aligns the reference to temporal positions for stable facial structure (paper abstract).
- Tiny parameter overhead: training adds ~1% parameters relative to Wan 2.1 yet beats full-parameter methods on face similarity and naturalness metrics (README callout).
- Data efficiency: identity adapters converge with about 2,000 paired samples, keeping compute manageable for custom subjects (paper abstract).
- Task compatibility: the released toolkit covers subject-driven T2V, pose-referenced video generation via VACE, stylization with community LoRAs, and experimental face swapping (README usage + news log).
Model lineup & availability
- Stand-In v1.0 adapters (153 M parameters) targeting Wan2.1-14B-T2V.
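The ~1% overhead claim from the README is easy to sanity-check from the release numbers above (153 M adapter parameters against the 14 B-parameter Wan2.1 backbone):

```python
# Quick arithmetic check of the parameter-overhead claim.
adapter_params = 153e6     # Stand-In v1.0 adapters
backbone_params = 14e9     # Wan2.1-14B-T2V backbone
overhead = adapter_params / backbone_params
print(f"{overhead:.1%}")   # prints "1.1%"
```

So the adapters add roughly 1.1% of the backbone's parameter count, consistent with the "~1%" figure.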