HuMo - Human‑Centric Video Generation via Collaborative Multi‑Modal Conditioning (Overview)

10 Sep 2025, 00:00 Z

TL;DR: HuMo targets human‑centric video generation with collaborative multi‑modal conditioning. It conditions on text, reference images, and audio; trains progressively across two sub‑tasks (subject consistency, then audio‑visual sync); and applies time‑adaptive classifier‑free guidance (CFG) at inference for per‑modality control.

What is HuMo?

HuMo is a research framework for generating controllable human videos from combinations of text, images, and audio. It focuses on two hard sub‑tasks: keeping the subject consistent (face, clothing, identity) and aligning generated motion and lip dynamics with audio.

Links:

Paper (arXiv): https://arxiv.org/abs/2509.08519
Project: https://phantom-video.github.io/HuMo/
Repo: https://github.com/Phantom-video/HuMo

Key ideas

Paired tri‑modal data: Curates paired text, reference images, and audio to supervise collaborative multimodal control.
Progressive multimodal training: Two‑stage scheme that first builds subject preservation, then introduces audio‑visual sync on top of it.
Minimal‑invasive image injection: Preserves the base model’s prompt‑following and visual fidelity while injecting identity/appearance.
Focus‑by‑predicting for audio: Beyond audio cross‑attention, the model is guided to associate audio with facial regions for improved sync.
Time‑adaptive CFG (inference): Dynamically adjusts guidance weights across denoising steps for finer per‑modality control.

These design choices aim to unify separate sub‑tasks under one model rather than maintaining specialized models per task.
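
To make the time‑adaptive CFG idea concrete, here is a minimal sketch. It is not HuMo's implementation. It assumes a nested multi‑condition guidance formula and a hypothetical two‑phase schedule; the names `GuidanceWeights`, `weights_for_step`, and `combine_predictions`, the switch point, and every numeric weight are illustrative assumptions rather than values from the paper or repo.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GuidanceWeights:
    """Per-modality guidance scales (illustrative values, not HuMo's)."""
    text: float
    image: float
    audio: float


def weights_for_step(step: int, num_steps: int,
                     switch_fraction: float = 0.5) -> GuidanceWeights:
    """Hypothetical two-phase schedule over denoising steps.

    Assumption: early steps lean on text, later steps shift weight
    toward image and audio. The schedule HuMo uses may differ.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    if step / num_steps < switch_fraction:
        return GuidanceWeights(text=7.5, image=2.0, audio=1.0)
    return GuidanceWeights(text=3.0, image=3.0, audio=5.5)


def combine_predictions(uncond: List[float], text: List[float],
                        text_image: List[float], full: List[float],
                        w: GuidanceWeights) -> List[float]:
    """Nested classifier-free guidance over three conditions.

    Each list stands in for one model output: no condition, text only,
    text + image, and text + image + audio.
    """
    return [
        u + w.text * (t - u) + w.image * (ti - t) + w.audio * (f - ti)
        for u, t, ti, f in zip(uncond, text, text_image, full)
    ]
```

In this formulation, setting all three weights to 1 recovers the fully conditioned prediction, and raising a single weight amplifies only the contribution of that modality. Calling `weights_for_step` once per denoising step is what makes the guidance time‑adaptive.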

Models and availability

HuMo‑17B: research‑grade quality; supports 480p and 720p output at a higher compute cost.
HuMo‑1.7B: lighter; generates 480p video in about 8 minutes on a 32 GB GPU (per the project README), with audio‑visual sync largely retained relative to the 17B model.

Weights and example code are available from the project’s Hugging Face hub and GitHub repo.

Quickstart (from repo docs)

Environment setup:

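# Create and activate an isolated Python 3.11 environment
conda create -n humo python=3.11
conda activate humo
# Install the CUDA 12.4 builds of PyTorch first; flash_attn builds against the installed torch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 \
  --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.6.3
# Remaining Python dependencies (run from the repo root), then ffmpeg
pip install -r requirements.txt
conda install -c conda-forge ffmpeg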
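
Before moving on, it can help to confirm that the pieces installed above are visible to the active interpreter. The following is a small sketch, not part of the HuMo repo; `check_environment` is a hypothetical helper that only tests whether each module can be located and whether `ffmpeg` is on the `PATH`, without importing anything heavy.

```python
import importlib.util
import shutil
import sys


def check_environment(modules=("torch", "torchvision", "torchaudio", "flash_attn")):
    """Report which prerequisites are visible to this interpreter."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        report[name] = importlib.util.find_spec(name) is not None
    # shutil.which returns None when the executable is not on PATH
    report["ffmpeg"] = shutil.which("ffmpeg") is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Any entry reported as `False` points to a setup step that likely needs to be rerun inside the `humo` environment.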

Model prep (abbrev.):

# HuMo checkpoints
huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo
# Wan2.1 base model files
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B
# Whisper model used for audio
huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3
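
Large downloads sometimes stop partway, so a quick check that each target directory exists and is non‑empty can save a failed first run. This is a sketch under the assumption that the `--local-dir` paths above are used unchanged; `missing_weight_dirs` is a hypothetical helper and does not validate file contents or checksums.

```python
from pathlib import Path

# Directory names taken from the --local-dir arguments in the download commands.
EXPECTED_WEIGHT_DIRS = ("HuMo", "Wan2.1-T2V-1.3B", "whisper-large-v3")


def missing_weight_dirs(root="./weights", expected=EXPECTED_WEIGHT_DIRS):
    """Return the expected sub-directories that are absent or empty under root."""
    root_path = Path(root)
    missing = []
    for name in expected:
        target = root_path / name
        # Treat a directory with no entries as missing (e.g. an aborted download)
        if not target.is_dir() or not any(target.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_weight_dirs()
    print("All weight directories present" if not gaps else f"Missing: {gaps}")
```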
