3DV-TON — Textured 3D-Guided Consistent Video Try-on via Diffusion Models
24 Apr 2025
TL;DR 3DV-TON replaces per-frame warping with textured, animatable 3D human guidance and a diffusion UNet initialised from Stable Diffusion 1.5 + AnimateDiff. It cuts VFID on ViViD to 10.97 (from 17.29 for CatV2TON), introduces the 130-video HR-VVT benchmark at 720p, and ships code plus weights so teams can stand up consistent try-on pilots.
What is 3DV-TON?
3DV-TON is a video try-on framework from Alibaba DAMO Lab that keeps garment identity and motion consistent across frames. Instead of trusting pure 2D warping, it reconstructs a single textured 3D mesh from a keyframe, animates it with video-driven SMPL sequences, and feeds that textured guidance to a diffusion UNet. The paper debuts at ACM MM 2025 and arrives with a project page, inference code, model weights, and a new evaluation set.
Links:
- Code + checkpoints: https://github.com/2y7c3/3DV-TON
- Project page (teasers, comparisons): https://2y7c3.github.io/3DV-TON/
Why it matters for video commerce teams
- Drives conversion by preserving logos, textures, and fabric flow when shoppers see garments on moving bodies.
- Tackles the usual "good stills, jittery footage" failure mode with explicit motion references rather than heavier temporal smoothing.
- Adds HR-VVT (130 videos at 1280×720) so you can evaluate beyond the low-res, single-view ViViD standard.
- Ships open weights and a reproducible preprocessing stack (masking, SMPL fitting, 3D reconstruction), making it viable for in-house experimentation.
Inside the pipeline
- Adaptive keyframe selection chooses the cleanest video frame and runs a 2D image try-on (CatVTON or similar) to produce an initial garment-wearing person.
- Animatable textured 3D mesh: ECON-style reconstruction with SMPL-X refinement (10 iterations) creates a clothed human mesh. The team freezes pose parameters and only optimises shape, translation, and camera scale so reconstruction completes in ~30s.
- Video-driven animation: SMPL sequences from GVHMR/Video-based HPS rig the mesh so textures follow body motion without stretching.
- Rectangular masking expands the edit region, preventing original garment leakage before the diffusion pass.
- Diffusion UNet: Stable Diffusion 1.5 backbone with AnimateDiff temporal blocks. Guidance features from the garment image and textured video are fused through self-attention layers, producing 32-frame clips at 768×576.
Training facts: VITON-HD + DressCode (paired images) and ViViD (paired videos) form the training mix; the model trains for 40k steps on A800 GPUs at lr 1e-5.
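To make the flow concrete, here is a minimal orchestration sketch of the stages above. All function names and bodies are illustrative stubs, not the repo's actual API; they only mirror the guidance-building steps and the final diffusion pass.

```python
from typing import Any, List


def select_keyframe(frames: List[Any]) -> Any:
    """Adaptive keyframe selection: score frames, pick the cleanest one."""
    return frames[0]  # placeholder; the paper scores pose/occlusion quality


def image_tryon(keyframe: Any, garment: Any) -> Any:
    """2D image try-on on the keyframe (e.g. CatVTON) to dress the person once."""
    ...


def reconstruct_textured_mesh(tryon_frame: Any) -> Any:
    """ECON-style clothed-human mesh with SMPL-X shape/translation refinement."""
    ...


def animate_and_render(mesh: Any, smpl_sequence: Any) -> Any:
    """Rig the textured mesh with video-driven SMPL poses and render per frame."""
    ...


def run_pipeline(frames, garment, smpl_sequence, diffusion_unet):
    keyframe = select_keyframe(frames)
    mesh = reconstruct_textured_mesh(image_tryon(keyframe, garment))
    textured_guidance = animate_and_render(mesh, smpl_sequence)
    # The SD 1.5 + AnimateDiff UNet fuses garment-image and textured-video
    # features via self-attention and denoises a 32-frame clip at 768x576.
    return diffusion_unet(frames, garment, textured_guidance)
```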
Benchmarks and numbers
- ViViD benchmark (paired, 768×576): 3DV-TON hits SSIM 0.899 and LPIPS 0.052 while slashing VFID (I3D) to 10.97, versus 17.29 for CatV2TON and 18.20 for ViViD. Even with a larger rectangular mask, it keeps VFID (ResNeXt) at 0.203.
- HR-VVT (new, 720p): against ViViD and CatV2TON, 3DV-TON records SSIM 0.880, LPIPS 0.086, paired VFID (I3D) 10.77, and paired VFID (ResNeXt) 0.142. In unpaired VFID, the method still leads with 14.55 on I3D.
- User study (130 clips, 20 raters): 3DV-TON scores 69% preference on ViViD and 86% on HR-VVT for overall quality, favoured for both fidelity and motion coherence.
- Ablations: adding SMPL guidance raises SSIM from 0.858 to 0.880; injecting textured 3D guidance pushes it to 0.909 and halves VFID (I3D) from 5.24 to 2.38.
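If you want to reproduce the frame-level metrics on your own clips, a minimal sketch follows. It assumes paired uint8 RGB clips of identical shape and uses the `lpips` and `scikit-image` packages; VFID additionally requires I3D/ResNeXt video features and is out of scope here.

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import structural_similarity

loss_fn = lpips.LPIPS(net="alex")  # backbone choice is an assumption, not the paper's


def frame_metrics(pred: np.ndarray, gt: np.ndarray):
    """pred/gt: (T, H, W, 3) uint8 clips. Returns mean SSIM and mean LPIPS."""
    ssims, lps = [], []
    for p, g in zip(pred, gt):
        ssims.append(structural_similarity(p, g, channel_axis=-1, data_range=255))
        # LPIPS expects NCHW float tensors scaled to [-1, 1]
        tp = torch.from_numpy(p).permute(2, 0, 1).float()[None] / 127.5 - 1
        tg = torch.from_numpy(g).permute(2, 0, 1).float()[None] / 127.5 - 1
        with torch.no_grad():
            lps.append(loss_fn(tp, tg).item())
    return float(np.mean(ssims)), float(np.mean(lps))
```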
HR-VVT benchmark at a glance
- 130 evaluation videos spanning 50 tops, 40 bottoms, 40 dresses with outdoor/indoor scenarios.
- Distributed via Hugging Face (`2y7c3/HR-VVT`) with garment crops and metadata.
- Challenges legacy baselines that overfit to studio views; expect more occlusions, camera pans, limb crossings, and spectators in frame.
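To fetch the benchmark, a one-liner against the Hugging Face Hub should suffice. Treating `2y7c3/HR-VVT` as a dataset repo is an assumption; check the project page for the exact hosting and layout.

```python
from huggingface_hub import snapshot_download

# Assumption: HR-VVT is hosted as a Hub dataset repo under the id from the post.
local_dir = snapshot_download(repo_id="2y7c3/HR-VVT", repo_type="dataset")
print("HR-VVT downloaded to", local_dir)
```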
Working with the open-source release
- Clone the repo and install requirements (`pip install -r requirements.txt`).
- Pull base assets into `./ckpts`: Stable Diffusion image variations, the AnimateDiff motion module, the SD VAE, and the 3DV-TON checkpoints.
- Generate agnostic masks with the provided cloth masker (`python preprocess/seg_mask.py --type overall`).
- Run GVHMR to recover SMPL sequences, then reconstruct textured meshes with the supplied renderer swap.
- Edit `configs/inference/demo_test.yaml` and run `python infer.py --config ...` to render a 32-frame try-on clip.
Expect full pipeline latency around one minute per 32-frame video (30s for reconstruction + 35s for diffusion without cross-attention).
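A thin Python wrapper over the CLI steps above can make pilots repeatable. This sketch assumes it runs from the repo root with `./ckpts` populated, and it only covers the masking and inference commands quoted above; GVHMR and mesh reconstruction run as separate steps.

```python
import subprocess


def run(cmd: list[str]) -> None:
    """Echo and execute one pipeline step, failing fast on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 1) Agnostic masks via the provided cloth masker ("overall" = full-body).
run(["python", "preprocess/seg_mask.py", "--type", "overall"])

# 2) Diffusion inference: renders a 32-frame try-on clip per the YAML config.
run(["python", "infer.py", "--config", "configs/inference/demo_test.yaml"])
```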
Implementation notes for production pilots
- The rectangular mask can be tuned; ViViD-style tight masks give marginally higher SSIM but lose robustness when garments swing (see the sketch after this list).
- Freezing SMPL pose keeps reconstruction stable when limbs fall out of frame—useful for UGC where camera crops are aggressive.
- HR-VVT pairs well with qualitative review: include motion-specific checks (arm raises, scene transitions) when curating your own QA set.
- Integrate the cloth masker into upload workflows to enforce consistent agnostic conditioning before inference.
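As a reference point for mask tuning, here is an illustrative rectangular-mask expansion: grow a tight agnostic garment mask to a padded bounding rectangle so swinging garments stay inside the edit region. The `pad_ratio` value is a tunable assumption, not a number from the paper.

```python
import numpy as np


def rectangular_mask(tight_mask: np.ndarray, pad_ratio: float = 0.1) -> np.ndarray:
    """tight_mask: (H, W) 0/1 garment mask. Returns a filled, padded rectangle."""
    ys, xs = np.nonzero(tight_mask)
    if ys.size == 0:
        return np.zeros_like(tight_mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    pad_y = int((y1 - y0) * pad_ratio)
    pad_x = int((x1 - x0) * pad_ratio)
    out = np.zeros_like(tight_mask)
    h, w = tight_mask.shape
    out[max(0, y0 - pad_y):min(h, y1 + pad_y + 1),
        max(0, x0 - pad_x):min(w, x1 + pad_x + 1)] = 1
    return out
```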
Limitations and open questions
- Residual temporal jitter appears when the 3D guidance fails (e.g., occluded arms); diffusion still needs reliable geometry input.
- The rectangular mask increases generated area, which can soften SSIM vs. tighter baselines—worth balancing per product shot.
- Dataset and code currently assume single-person, front-facing videos; multi-person scenes or extreme camera rotations remain open work.
- As the authors note, strong generative power raises misuse risk (e.g., deepfakes); incorporate consent and watermarking policies early.
Next steps for your team
- Stress-test on your catalogue videos: start with 10–20 SKUs mixing textures (denim, prints) to validate the textured guidance advantage.
- Compare against existing try-on stacks using VFID (I3D) and user preference studies; the HR-VVT protocol is a ready-made template.
- If you need faster turnaround, monitor the repo’s TODO list (integrated image try-on and 3D guidance release) or swap in lighter SMPL estimators once code lands.
- Feed results back into merchandising flows—successful trials can seed PDP loops, social ads, or live shopping overlays.