UniAnimate - Taming Unified Video Diffusion for Consistent Human Image Animation (Overview)

Download printable cheat-sheet (CC-BY 4.0)

21 Sep 2025, 00:00 Z

TL;DR UniAnimate targets consistent human image animation by unifying modalities inside a single video diffusion model, supporting both short and long videos. It reduces identity‑alignment overhead, introduces a unified noise input for better long‑term behavior, and considers state‑space temporal modeling to replace heavy temporal Transformers.

What is UniAnimate?

UniAnimate focuses on human image animation: given a reference image (identity) and a target pose sequence, synthesize a coherent video following the poses while preserving the subject’s appearance. The core idea is to avoid maintaining a separate identity/reference branch by embedding reference image, pose guidance, and noise video into a shared feature space within one unified video diffusion model.

Links:


Key ideas

  • Unified feature space: Reference image, posture guidance, and noise are mapped into a common space inside a single video diffusion model to reduce optimization complexity and improve temporal coherence.
  • Unified noise input: Supports random‑noise starts and first‑frame‑conditioned inputs, helping extend sequence length and stabilize identity across longer videos.
  • Efficient temporal modeling: Explores replacing temporal Transformer with a state‑space model (SSM) for long‑sequence efficiency.
  • Practical engineering tips (from repo):
    • CPU offload for CLIP/VAE (set CPU_CLIP_VAE: True) to cut GPU memory (reported ~12 GB for 32×768×512).
    • Multi‑segment parallel denoising on large‑VRAM GPUs via context_batch_size > 1.
    • Noise prior option that can improve background/appearance preservation in long videos.

Quickstart (inference)

Generate a short clip (32 frames, e.g., 512×768), then adjust settings:

# Install dependencies per repo instructions (PyTorch + deps)
# ...

# Short video generation
python inference.py --cfg configs/UniAnimate_infer.yaml

# Increase resolution (e.g., 768×1216) in configs/UniAnimate_infer.yaml:
#   resolution: [768, 1216]
# Then re‑run the same command.

Generate long videos (sliding window with temporal overlap):

python inference.py --cfg configs/UniAnimate_infer_long.yaml

# In this config, test_list entries contain:
# [frame_interval, reference_image_path, driving_pose_sequence_path]
#   frame_interval=1 → use every pose frame; 2 → sample every two frames.

Turn AI video into a repeatable engine

Build an AI-assisted video pipeline with hook-first scripts, brand-safe edits, and multi-platform delivery.