“UMO Stills” — Multi‑Identity Consistency with an OmniGen2‑Class Image Base (Pattern Overview)
21 Sep 2025, 00:00 Z
TL;DR “UMO stills” refers to a production pattern for multi‑identity consistency: curate identity‑clean stills with a strong image base (e.g., an OmniGen2‑class generator), index them in an identity bank, then use retrieval‑guided conditioning (plus optional LoRA adapters) to keep subjects consistent across long or multi‑scene videos.
What problem does it solve?
Long or multi‑scene videos with several recurring subjects (actors, hosts, avatars) often drift in face details, outfits, or accessories. Typical video models can maintain a single identity in a short clip, but multi‑identity and long‑form projects need stronger anchors and repeatable retrieval.
“UMO stills” addresses this by:
- Creating identity‑clean, high‑SNR still frames for each subject
- Indexing those stills with robust embeddings for retrieval
- Feeding anchor imagery and retrieval hints back into the video pipeline
This elevates identity stability while keeping creative control (pose, camera motion, scene swaps) intact.
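The identity bank at the heart of this pattern can be sketched as a small data model: one record per curated still, holding an embedding plus metadata tags, with cosine-threshold deduplication to maintain the "gold" set. A minimal sketch; the `IdentityStill` schema, `dedup` helper, and the 0.98 threshold are illustrative assumptions, not a fixed API.

```python
import math
from dataclasses import dataclass, field

@dataclass
class IdentityStill:
    """One curated anchor still for a subject (hypothetical schema)."""
    identity_id: str
    image_path: str
    embedding: list          # face/ID or CLIP-family vector
    tags: dict = field(default_factory=dict)  # e.g. {"hair": "short", "glasses": True}

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dedup(stills, threshold=0.98):
    """Keep a 'gold' set: drop any still whose embedding is a near-duplicate
    (cosine similarity >= threshold) of one already kept."""
    gold = []
    for s in stills:
        if all(cosine(s.embedding, g.embedding) < threshold for g in gold):
            gold.append(s)
    return gold
```

In production the embeddings would come from a face/ID model and the bank would live in a vector store; the list scan here just makes the thresholding logic concrete.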
Core components
- OmniGen2-class image base (stills)
  - Use a modern text-to-image generator (OmniGen2-class) to produce or refine identity-clean stills (front/¾ profile, neutral expression, key outfits).
  - Enforce quality gates: resolution ≥ 1024 px, sharpness, exposure, and minimal motion blur; crop to consistent head/torso framing.
- Identity bank (embeddings + metadata)
  - Compute embeddings with a face/ID model and a general visual encoder (CLIP-family). Store per-identity vectors plus rich tags (hair, outfit, glasses, accessories).
  - Deduplicate with cosine thresholding; maintain a curated “gold” set.
- Retrieval-guided conditioning (video)
  - At generation time, query the bank by prompt + rough frame description to fetch the nearest anchor still(s).
  - Condition the video model on anchor crops (channel concatenation, reference frames, or adapter inputs) and prompt constraints.
  - Optionally blend LoRA adapters per identity/outfit for stronger lock-in.
- Consistency checks and feedback
  - During generation, run face/ID similarity on sampled frames. If drift > τ, nudge guidance (increase identity weight, swap the anchor, or re-seed).
  - For long takes, insert “refresh” keyframes (UMO stills) at scene boundaries.
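The retrieval and feedback steps above can be sketched in a few lines: fetch the nearest anchor still by cosine similarity, then flag sampled frames whose ID similarity to that anchor has drifted past the threshold τ. A minimal sketch under stated assumptions: `nearest_anchor`, `check_drift`, the `(still_id, embedding)` bank layout, and the definition of drift as `1 - similarity` are all illustrative, not part of any fixed API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest_anchor(query_emb, bank, k=1):
    """Return the k bank entries best matching the query embedding.
    `bank` is a list of (still_id, embedding) pairs."""
    ranked = sorted(bank, key=lambda entry: cosine(query_emb, entry[1]),
                    reverse=True)
    return ranked[:k]

def check_drift(frame_embs, anchor_emb, tau=0.35):
    """Return indices of sampled frames whose drift (1 - ID similarity to
    the anchor) exceeds tau -- candidates for re-guidance or re-seeding."""
    return [i for i, f in enumerate(frame_embs)
            if 1.0 - cosine(f, anchor_emb) > tau]
```

Flagged frames would trigger the corrective actions listed above: raise the identity-conditioning weight, swap in a different anchor, or re-seed the segment.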
Suggested pipeline (high level)
- Curate stills
  - Generate/refine 6–12 stills per identity with the image base (clean backgrounds, neutral to expressive variants).
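The quality gates mentioned earlier (minimum resolution, sharpness) can be sketched as a curation filter. The sketch below uses variance of a 4-neighbour Laplacian as a blur proxy, a common heuristic; the function names and both thresholds (`min_side=1024` from the gate above, `min_sharpness=50.0`) are illustrative assumptions.

```python
def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over a grayscale image
    (list of rows of floats). Low variance suggests a blurry image."""
    h, w = len(gray), len(gray[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def passes_quality_gate(width, height, gray, min_side=1024, min_sharpness=50.0):
    """Resolution and sharpness gate for candidate stills; thresholds are
    illustrative and would be tuned per pipeline."""
    if min(width, height) < min_side:
        return False
    return laplacian_variance(gray) >= min_sharpness
```

Exposure and framing checks would slot in alongside the same gate before a still is admitted to the identity bank.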