UniAnimate - Taming Unified Video Diffusion for Consistent Human Image Animation (Overview)

21 Sep 2025, 00:00 Z

TL;DR: UniAnimate targets consistent human image animation by handling the reference image, pose guidance, and noise inside a single video diffusion model, and it supports both short and long videos. It reduces identity‑alignment overhead, introduces a unified noise input for better long‑term behavior, and explores state‑space temporal modeling as a lighter replacement for the temporal Transformer.

What is UniAnimate?

UniAnimate focuses on human image animation: given a reference image (identity) and a target pose sequence, synthesize a coherent video following the poses while preserving the subject’s appearance. The core idea is to avoid maintaining a separate identity/reference branch by embedding reference image, pose guidance, and noise video into a shared feature space within one unified video diffusion model.

Links:

Project: https://unianimate.github.io/
Repo: https://github.com/ali-vilab/UniAnimate
Related (new model): UniAnimate‑DiT based on Wan2.1 - https://github.com/ali-vilab/UniAnimate-DiT

Key ideas

Unified feature space: The reference image, pose guidance, and noise are mapped into a common space inside a single video diffusion model to reduce optimization complexity and improve temporal coherence.
Unified noise input: Supports both random‑noise starts and first‑frame‑conditioned inputs, which helps extend sequence length and stabilize identity across longer videos.
Efficient temporal modeling: Explores replacing the temporal Transformer with a state‑space model (SSM) for long‑sequence efficiency.
Practical engineering tips (from repo):
- CPU offload for CLIP/VAE (set CPU_CLIP_VAE: True) to cut GPU memory (reported ~12 GB for 32×768×512).
- Multi‑segment parallel denoising on large‑VRAM GPUs via context_batch_size > 1.
- Noise prior option that can improve background/appearance preservation in long videos.
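
One way to picture the context_batch_size tip: a long video is denoised in segments, and the setting decides how many segments go through the model together. The sketch below only shows that grouping. The helper name batch_segments and the segment labels are hypothetical, and the repo's actual scheduling may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def batch_segments(segments: Sequence[T], context_batch_size: int) -> List[List[T]]:
    """Group segments into batches of at most context_batch_size items.

    Hypothetical helper for illustration; not taken from the UniAnimate repo.
    """
    if context_batch_size < 1:
        raise ValueError("context_batch_size must be >= 1")
    return [
        list(segments[i:i + context_batch_size])
        for i in range(0, len(segments), context_batch_size)
    ]


if __name__ == "__main__":
    segments = ["seg0", "seg1", "seg2", "seg3", "seg4"]
    # context_batch_size = 1: one segment per model call (lowest memory).
    print(batch_segments(segments, 1))
    # context_batch_size = 2: two segments per call (fewer calls, more VRAM).
    print(batch_segments(segments, 2))
```

Larger batches mean fewer model calls per denoising step at the cost of more GPU memory, which is why the tip is aimed at large‑VRAM GPUs.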

Quickstart (inference)

Generate a short clip (32 frames, e.g., 512×768), then adjust settings:

# Install dependencies per repo instructions (PyTorch + deps)
# ...

# Short video generation
python inference.py --cfg configs/UniAnimate_infer.yaml

# Increase resolution (e.g., 768×1216) in configs/UniAnimate_infer.yaml:
#   resolution: [768, 1216]
# Then re‑run the same command.
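
Before raising the resolution, it can help to estimate how much more pixel data each clip carries. The sketch below is plain arithmetic on the two resolutions mentioned above; the function name is hypothetical, and the ratio is only a rough indicator, since actual memory and runtime depend on the model and on settings such as CPU offload.

```python
from typing import Tuple


def relative_pixel_count(
    base: Tuple[int, int], target: Tuple[int, int], frames: int = 32
) -> float:
    """Ratio of total pixels per clip between two resolutions.

    Hypothetical helper; the frame count cancels out when it is the same
    for both clips, but it is kept explicit for clarity.
    """
    base_pixels = base[0] * base[1] * frames
    target_pixels = target[0] * target[1] * frames
    return target_pixels / base_pixels


if __name__ == "__main__":
    ratio = relative_pixel_count((512, 768), (768, 1216))
    print(f"768x1216 carries {ratio:.3f}x the pixels of 512x768")
```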

Generate long videos (sliding window with temporal overlap):

python inference.py --cfg configs/UniAnimate_infer_long.yaml

# In this config, test_list entries contain:
# [frame_interval, reference_image_path, driving_pose_sequence_path]
#   frame_interval=1 → use every pose frame; 2 → use every second pose frame.
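
The frame_interval sampling and the sliding window with temporal overlap can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repo's implementation: both function names are hypothetical, the window length of 32 mirrors the short‑clip length above, and the overlap of 8 frames is an arbitrary example value. How overlapping frames are blended is not covered here.

```python
from typing import List, Tuple


def sample_pose_frames(num_pose_frames: int, frame_interval: int) -> List[int]:
    """Indices of the pose frames kept for a given frame_interval (assumed reading)."""
    if frame_interval < 1:
        raise ValueError("frame_interval must be >= 1")
    return list(range(0, num_pose_frames, frame_interval))


def sliding_windows(
    num_frames: int, window: int = 32, overlap: int = 8
) -> List[Tuple[int, int]]:
    """Split [0, num_frames) into windows of `window` frames sharing `overlap` frames.

    Returns half-open (start, end) pairs. The last window is aligned to the
    end of the sequence, so it may overlap its neighbor by more than `overlap`.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= window:
        return [(0, num_frames)]
    stride = window - overlap
    windows = []
    start = 0
    while start + window < num_frames:
        windows.append((start, start + window))
        start += stride
    windows.append((num_frames - window, num_frames))
    return windows


if __name__ == "__main__":
    kept = sample_pose_frames(160, frame_interval=2)  # 80 pose frames remain
    for start, end in sliding_windows(len(kept)):
        print(f"segment covers sampled frames {start}..{end - 1}")
```

Each consecutive pair of segments shares some frames, which is what lets a long video be assembled from fixed‑length clips without hard cuts at the boundaries.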
