OmniAvatar — Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation (Overview)
23 Jun 2025, 00:00 Z
TL;DR OmniAvatar adapts Wan 2.1 into an audio-driven avatar generator with full-body motion. Pixel-wise multi-hierarchical audio embeddings improve lip sync, while LoRA-based training keeps prompt creativity intact. The release ships LoRA weights (14B and 1.3B) plus inference code, so teams can render 480p avatars with controllable prompts and audio guidance on-prem.
What is OmniAvatar?
OmniAvatar is a research framework for producing audio-driven avatar videos that move beyond face-only animation. The team combines Wan 2.1 text-to-video backbones with new audio conditioning so characters maintain lip sync and natural body dynamics, even in conversational or performance settings. The work was posted to arXiv on 23 June 2025 and open-sourced with inference code a day later.
The method introduces a pixel-wise multi-hierarchical audio embedding that slots into the latent space of the diffusion model. By pairing that with lightweight LoRA adaptation, OmniAvatar keeps Wan’s ability to follow creative prompts while fusing in speech nuances that drive torso, arm, and facial motion in sync with the soundtrack.
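The paper does not spell out the embedding math in this overview, but the idea of pooling audio features at several temporal scales and broadcasting them over the latent's spatial grid can be sketched as follows. This is a minimal illustration, not the repo's implementation: the function name, the projection matrices, and the pooling scales are all hypothetical.

```python
import numpy as np

def multi_scale_audio_embed(audio_feats, latent_shape, scales=(1, 4, 16), seed=0):
    """Hedged sketch of a pixel-wise multi-hierarchical audio embedding.

    audio_feats:  (T_audio, D) frame-level audio features (e.g. wav2vec2-style).
    latent_shape: (T, C, H, W) shape of the diffusion video latent.
    scales:       hypothetical pooling windows; small windows keep phoneme
                  detail, large windows capture broader rhythm.
    """
    T, C, H, W = latent_shape
    T_audio, D = audio_feats.shape
    rng = np.random.default_rng(seed)  # stand-in for learned projections
    embed = np.zeros((T, C))
    for s in scales:
        # average-pool audio frames in windows of size s (coarser cues)
        n = T_audio // s
        pooled = audio_feats[: n * s].reshape(n, s, D).mean(axis=1)
        # resample the pooled sequence to the latent's temporal length
        idx = np.linspace(0, n - 1, T).round().astype(int)
        # hypothetical per-scale projection to the latent channel dim
        W_proj = rng.standard_normal((D, C)) / np.sqrt(D)
        embed += pooled[idx] @ W_proj
    # broadcast the per-frame embedding over every spatial position,
    # i.e. "pixel-wise" conditioning of the latent
    return np.broadcast_to(embed[:, :, None, None], (T, C, H, W))
```

In a real model the projections would be learned and the result added to (or concatenated with) the latent before denoising; the sketch only shows how multiple temporal scales can coexist in one pixel-aligned conditioning tensor.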
Links:
- Project page: https://omni-avatar.github.io/
- Hugging Face weights: https://huggingface.co/OmniAvatar/OmniAvatar-14B
Key ideas
- Pixel-wise multi-hierarchical audio embeddings: audio is encoded across scales so the diffusion latent receives fine-grained phoneme cues and broader rhythm, sharpening lip sync in diverse scenes (abstract).
- Adaptive body animation: conditioning extends to upper-body pose and gestures, so avatars react naturally in podcasts, dialogues, dynamic scenes, and singing use cases (abstract + project page).
- LoRA-based alignment: OmniAvatar adds LoRA adapters on top of Wan 2.1 (14B and 1.3B) rather than retraining end-to-end, preserving prompt controllability for styling and camera direction (GitHub README).
- Decoupled guidance: guidance scales let you tune prompt adherence versus audio faithfulness (`guidance_scale` vs `audio_scale`), with audio CFG in the recommended 4–6 band for reliable lip sync (GitHub README).
- Runtime efficiency knobs: supports FSDP, TeaCache, and per-layer persistence settings so teams can trade VRAM for speed, e.g., 4×A800 with FSDP drops sampling to 4.8 s/it while keeping ~14.3 GB per GPU (GitHub README).
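The decoupled guidance above follows the common two-condition classifier-free guidance pattern: one scale pulls the sample toward the text prompt, a second pulls it further toward the audio. A minimal sketch, assuming this standard formulation (the function name and argument layout are illustrative, not the repo's exact API):

```python
import numpy as np

def decoupled_cfg(eps_uncond, eps_text, eps_full, guidance_scale, audio_scale):
    """Hedged sketch of decoupled classifier-free guidance.

    eps_uncond: noise prediction with no conditioning
    eps_text:   prediction conditioned on the text prompt only
    eps_full:   prediction conditioned on text prompt + audio

    guidance_scale weights prompt adherence; audio_scale weights how
    strongly the audio signal steers lip sync and motion.
    """
    return (eps_uncond
            + guidance_scale * (eps_text - eps_uncond)
            + audio_scale * (eps_full - eps_text))
```

With this split, raising `audio_scale` into the README's recommended 4–6 band strengthens lip sync without also exaggerating prompt adherence, since the two terms are scaled independently.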
Model lineup & availability
- OmniAvatar-14B LoRA + audio condition weights (pairs with Wan2.1-T2V-14B and wav2vec2-base-960h).