3DV-TON - Textured 3D-Guided Consistent Video Try-on via Diffusion Models


24 Apr 2025, 00:00 Z

TL;DR 3DV-TON replaces per-frame warping with textured, animatable 3D human guidance and a diffusion UNet initialised from Stable Diffusion 1.5 + AnimateDiff. It cuts VFID on ViViD to 10.97 (from 17.29 for CatV2TON), introduces the 130-video HR-VVT benchmark at 720p, and ships code plus weights so teams can stand up consistent try-on pilots.

What is 3DV-TON?

3DV-TON is a video try-on framework from Alibaba DAMO Academy that keeps garment identity and motion consistent across frames. Instead of relying on pure 2D warping, it reconstructs a single textured 3D mesh from a keyframe, animates it with SMPL sequences estimated from the video, and feeds that textured guidance to a diffusion UNet. The paper debuts at ACM MM 2025 and arrives with a project page, inference code, model weights, and a new evaluation set.

Why it matters for video commerce teams

  • Preserves logos, textures, and fabric flow on moving bodies, the details shoppers rely on when judging a garment.
  • Tackles the usual "good stills, jittery footage" failure mode with explicit motion references rather than heavier temporal smoothing.
  • Adds HR-VVT (130 videos at 1280×720) so you can evaluate beyond the low-res, single-view ViViD standard.
  • Ships open weights and a reproducible preprocessing stack (masking, SMPL fitting, 3D reconstruction), making it viable for in-house experimentation.

Inside the pipeline

  1. Adaptive keyframe selection chooses the cleanest video frame and runs a 2D image try-on (CatVTON or similar) to produce an initial garment-wearing person.
  2. Animatable textured 3D mesh: ECON-style reconstruction with SMPL-X refinement (10 iterations) creates a clothed human mesh. The authors freeze the pose parameters and optimise only shape, translation, and camera scale, so reconstruction completes in about 30 seconds.
  3. Video-driven animation: SMPL sequences estimated from the video by GVHMR (a video-based human pose and shape estimator) animate the mesh so textures follow body motion without stretching.
  4. Rectangular masking expands the edit region, preventing original garment leakage before the diffusion pass.
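The rectangular masking in step 4 can be illustrated with a small sketch. This is not the authors' implementation: the function name `rectangular_mask`, the `pad` parameter, and the list-of-lists mask format are assumptions made for illustration, and a real pipeline would operate on image arrays per frame. The idea shown is only that a tight garment silhouette is replaced by its (optionally padded) bounding rectangle, so the outline of the original garment is not carried into the diffusion pass.

```python
from typing import List

# Rows of 0/1 values; a hypothetical stand-in for a per-frame image mask.
Mask = List[List[int]]


def rectangular_mask(mask: Mask, pad: int = 0) -> Mask:
    """Return the padded bounding rectangle of the non-zero region of `mask`.

    Illustrative only: `pad` (in pixels) is an assumed knob, not a
    documented 3DV-TON parameter.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0

    # Indices of rows and columns that contain at least one masked pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(row[x] for row in mask)]

    # An empty mask stays empty.
    if not rows:
        return [[0] * width for _ in range(height)]

    # Expand the bounding box by `pad`, clamped to the frame borders.
    y0 = max(rows[0] - pad, 0)
    y1 = min(rows[-1] + pad, height - 1)
    x0 = max(cols[0] - pad, 0)
    x1 = min(cols[-1] + pad, width - 1)

    return [
        [1 if y0 <= y <= y1 and x0 <= x <= x1 else 0 for x in range(width)]
        for y in range(height)
    ]


# A diamond-shaped silhouette becomes a filled 3x3 rectangle.
garment = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
expanded = rectangular_mask(garment)
```

Applying the same function to every frame's mask yields a rectangular edit region per frame. A larger `pad` trades more regenerated background for a lower chance that remnants of the source garment survive.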
