3DV-TON — Textured 3D-Guided Consistent Video Try-on via Diffusion Models
24 Apr 2025
TL;DR 3DV-TON replaces per-frame warping with textured, animatable 3D human guidance and a diffusion UNet initialised from Stable Diffusion 1.5 + AnimateDiff. It cuts VFID on ViViD to 10.97 (from 17.29 for CatV2TON), introduces the 130-video HR-VVT benchmark at 720p, and ships code plus weights so teams can stand up consistent try-on pilots.
What is 3DV-TON?
3DV-TON is a video try-on framework from Alibaba DAMO Lab that keeps garment identity and motion consistent across frames. Instead of trusting pure 2D warping, it reconstructs a single textured 3D mesh from a keyframe, animates it with video-driven SMPL sequences, and feeds that textured guidance to a diffusion UNet. The paper debuts at ACM MM 2025 and arrives with a project page, inference code, model weights, and a new evaluation set.
Links:
- Code + checkpoints: https://github.com/2y7c3/3DV-TON
- Project page (teasers, comparisons): https://2y7c3.github.io/3DV-TON/
Why it matters for video commerce teams
- Drives conversion by preserving logos, textures, and fabric flow when shoppers see garments on moving bodies.
- Tackles the usual "good stills, jittery footage" failure mode with explicit motion references rather than heavier temporal smoothing.
- Adds HR-VVT (130 videos at 1280×720) so you can evaluate beyond the low-res, single-view ViViD standard.
- Ships open weights and a reproducible preprocessing stack (masking, SMPL fitting, 3D reconstruction), making it viable for in-house experimentation.
Inside the pipeline
- Adaptive keyframe selection chooses the cleanest video frame and runs a 2D image try-on (CatVTON or similar) to produce an initial garment-wearing person.
- Animatable textured 3D mesh: ECON-style reconstruction with SMPL-X refinement (10 iterations) creates a clothed human mesh. The team freezes pose parameters and optimises only shape, translation, and camera scale, so reconstruction completes in ~30 s.
- Video-driven animation: SMPL sequences from video-based human pose estimation (GVHMR) drive the mesh so its texture follows body motion without stretching.
- Rectangular masking expands the edit region, preventing original garment leakage before the diffusion pass.
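The keyframe-selection step above can be approximated with a simple per-frame quality score. The paper's actual scoring criteria are not reproduced here; this sketch uses variance of a finite-difference Laplacian as a stand-in sharpness heuristic, and the function name `select_keyframe` is illustrative, not from the released code.

```python
import numpy as np

def select_keyframe(frames: list) -> int:
    """Pick the index of the sharpest grayscale frame as the 'cleanest' candidate.

    Sharpness is scored by the variance of a 4-neighbour Laplacian; a real
    pipeline would also weigh pose frontality and garment visibility.
    """
    def laplacian_var(gray: np.ndarray) -> float:
        # 4-neighbour discrete Laplacian via array rolls (wrap-around edges
        # are acceptable for a coarse score).
        lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0)
               + np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4.0 * gray)
        return float(lap.var())

    scores = [laplacian_var(np.asarray(f, dtype=np.float64)) for f in frames]
    return int(np.argmax(scores))
```

A blurred or textureless frame scores near zero, while a frame with crisp edges scores high, so the sharpest frame wins.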
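The rectangular-masking step is straightforward to sketch: dilate the garment segmentation to its bounding rectangle (plus a margin) so no pixels of the original garment survive into the inpainting region. This is a minimal NumPy sketch under that assumption; the `margin` parameter and function name are illustrative.

```python
import numpy as np

def rectangular_mask(garment_mask: np.ndarray, margin: int = 16) -> np.ndarray:
    """Expand a binary garment mask to its bounding rectangle plus a margin.

    Masking the full rectangle, rather than the tight garment silhouette,
    prevents leftover pixels of the original garment from leaking into the
    diffusion model's output.
    """
    ys, xs = np.nonzero(garment_mask)
    if ys.size == 0:  # empty mask: nothing to expand
        return np.zeros_like(garment_mask)
    h, w = garment_mask.shape
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, h)
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, w)
    rect = np.zeros_like(garment_mask)
    rect[y0:y1, x0:x1] = 1
    return rect
```

The trade-off is deliberate: a larger edit region costs the model some background fidelity but removes the stronger failure mode of the original garment bleeding through.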