MeiGen MultiTalk — Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
28 May 2025, 00:00 Z
TL;DR MultiTalk extends Wan 2.1 with label-aware audio injection, partial fine-tuning, and multi-task training so you can drive two or more performers from separate speech tracks, keep prompts responsive, and render 15-second clips at 480p/720p.
What is MeiGen MultiTalk?
MeiGen's MultiTalk is a research and open-source framework for generating multi-person conversational video that stays lip-synced to multi-stream audio. The team builds on Wan 2.1 I2V-14B, injects audio labels through a novel Label Rotary Position Embedding (L-RoPE), and keeps the base model's instruction-following intact via selective, partial parameter fine-tuning. MultiTalk can animate humans, stylised avatars, and cartoon characters and supports both short clips and 15-second streaming segments.
Links:
- Paper (arXiv): https://arxiv.org/abs/2505.22647
- Project page: https://meigen-ai.github.io/multi-talk/
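The "selective, partial parameter fine-tuning" above can be approximated as a name-pattern mask over the backbone's parameters: freeze everything by default, then unfreeze only the audio-conditioning layers. A minimal sketch, assuming hypothetical layer names (`audio_cross_attn`, `audio_proj`) — the actual layers MultiTalk trains are defined in the paper and repo, not here.

```python
def trainable_mask(param_names,
                   patterns=("audio_cross_attn", "audio_proj")):
    """Mark which parameters stay trainable during partial fine-tuning.

    Everything is frozen by default; only names matching one of the
    illustrative `patterns` are trained, which preserves the base
    model's prompt-following weights. The pattern names are
    assumptions, not the layers MultiTalk actually unfreezes.
    """
    return {name: any(p in name for p in patterns)
            for name in param_names}
```

With a PyTorch model you would iterate `model.named_parameters()` and set each tensor's `requires_grad` from this mask before building the optimizer.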
Why it matters for production teams
- Orchestrate conversational explainers or interviews without filming on set, binding the right voice track to the right digital actor.
- Rapidly localise dialogue: swap out speech tracks (real or TTS) while preserving interaction prompts and body language.
- Mix humans and stylised avatars in the same shot—handy for brand mascots, product walk-throughs, and hybrid live-action/animated content.
- Scale creative testing with minimal compute: TeaCache, LoRA accelerators, and INT8 options cut render times and shrink VRAM needs enough to run on RTX 4090-class hardware.
Core innovations
- Label Rotary Position Embedding (L-RoPE): tags each audio stream so the diffusion backbone knows which character to drive, stopping cross-talk and mismatched lip-sync.
- Partial parameter training: fine-tunes a subset of layers to retain Wan's prompt-following while specialising for audio-person binding.
- Multi-task curriculum: stages training over talking-head, talking-body, and multi-person data to balance fidelity, instruction following, and motion diversity.
- Time-aligned conditioning stacks: aligns audio features, reference frames, and text prompts per denoising step for tighter conversational timing.
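The core trick behind L-RoPE — giving each speaker's audio and video tokens a shared label offset in rotary-position space — can be sketched in a few lines. The offset value, dimensions, and labeling scheme below are illustrative assumptions; the paper specifies the real label ranges.

```python
import numpy as np

def rope_tables(positions, dim, base=10000.0):
    """Standard RoPE cos/sin tables for 1-D token positions."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(positions, inv_freq)   # (n_tokens, dim/2)
    return np.cos(angles), np.sin(angles)

def label_aware_positions(base_positions, labels, offset=25.0):
    """Shift each token's position by its speaker label (the L-RoPE idea).

    Audio tokens from stream k and the video patches of person k share
    label k, so their rotary phases line up and cross-attention binds
    the right voice to the right face, while mismatched pairings are
    pushed apart. `offset` is an illustrative constant, not the
    paper's value.
    """
    return (np.asarray(base_positions, dtype=float)
            + offset * np.asarray(labels))
```

In a two-person scene, person-1's audio and video tokens would carry label 0 and person-2's label 1; the shifted positions then feed `rope_tables` before attention.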