VoxCPM 2 Full SFT on a 24 GB GPU - The Run That Actually Fit

10 May 2026, 00:00 Z

60-second takeaway
VoxCPM 2 full SFT did fit on a single RTX 3090 Ti, but vanilla full fine-tuning did not.
The working stack was gradient_checkpointing: true, PagedAdamW8bit, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, batch_size: 1, gradient accumulation, and a cleaned, grouped train/validation/test split.
The final 9000-step checkpoint was not the best checkpoint. Held-out validation selected step_0002000.
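That selection was a separate pass over the saved checkpoints, not something the trainer did for us. A minimal sketch of the idea, where evaluate() is a hypothetical scoring helper you supply (it loads a checkpoint and returns mean loss on the validation manifest); nothing here is a VoxCPM API:

# Post-hoc checkpoint selection: score every saved checkpoint on the
# held-out validation manifest and keep the one with the lowest loss.
# evaluate() is a hypothetical, user-supplied callable.
from pathlib import Path

def select_best(ckpt_root, val_manifest, evaluate):
    scores = {}
    for ckpt in sorted(Path(ckpt_root).glob("step_*")):
        scores[ckpt] = evaluate(ckpt, val_manifest)
        print(f"{ckpt.name}: val_loss={scores[ckpt]:.4f}")
    return min(scores, key=scores.get)

# best = select_best("checkpoints/full_sft", "manifest_validation.jsonl", evaluate)
# In our run this pass picked step_0002000 over the final 9000-step save.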

Who this is for

This is for engineers trying to fine-tune a modern open-source TTS model on one 24 GB NVIDIA GPU:

  • RTX 3090
  • RTX 3090 Ti
  • RTX 4090
  • A10
  • L40 / L40S

The practical question is not whether VoxCPM 2 has an official full-SFT entrypoint. It does. The practical question is:

What has to change before a 2B voice model can actually complete full SFT on a 24 GB card?
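A back-of-envelope makes the problem concrete before the answer. The byte counts below are the standard sizes for mixed-precision training with stock AdamW, not measured numbers from our run:

# Why vanilla full fine-tuning overflows 24 GB for a ~2B model.
# Standard sizes: bf16 weights and grads at 2 bytes each; stock AdamW
# keeps two fp32 moment tensors at 4 bytes each. Activations, the CUDA
# context, and framework buffers come on top of this.
params = 2e9
weights_gb = params * 2 / 1e9       # bf16 weights, ~4 GB
grads_gb = params * 2 / 1e9         # bf16 gradients, ~4 GB
adamw_gb = params * (4 + 4) / 1e9   # exp_avg + exp_avg_sq, ~16 GB
print(weights_gb + grads_gb + adamw_gb)  # ~24.0 -- the whole card, before activations

An 8-bit paged optimizer cuts those two moment tensors to roughly 2 bytes per parameter and lets them page out to host RAM under pressure, which is most of what turns this from impossible into merely tight.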

Short answer

Use LoRA first if you are still exploring a dataset. For VoxCPM 2, LoRA is fast, the checkpoints are small, and a 9000-step run completed in about 2.4 hours in our setup.
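For orientation only: LoRA freezes the base weights and trains small low-rank adapters, which is why its checkpoints stay small. The sketch below uses Hugging Face peft with a toy module and made-up projection names; the VoxCPM repo ships its own LoRA finetune path, which is what we actually ran:

# Generic LoRA illustration with peft; VoxCPM's own LoRA entrypoint
# differs in detail. q_proj/v_proj are stand-in attention projections.
import torch.nn as nn
from peft import LoraConfig, get_peft_model

class TinyAttention(nn.Module):      # stand-in for a real transformer block
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(64, 64)
        self.v_proj = nn.Linear(64, 64)

lora_cfg = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["q_proj", "v_proj"],
                      lora_dropout=0.05)
model = get_peft_model(TinyAttention(), lora_cfg)
model.print_trainable_parameters()   # only the rank-16 adapters train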

Move to full SFT when the target voice needs deeper adaptation than an adapter can provide. In our run, full SFT completed 9000 steps in about 5 hours with the memory stack below:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
VOXCPM_PAGED_OPTIM=1 \
python scripts/train_voxcpm_finetune_full_sft.py \
  --config_path conf/voxcpm_v2/voxcpm_finetune_all_female01.yaml
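Two pieces of that stack are generic PyTorch rather than VoxCPM-specific. Our reading is that VOXCPM_PAGED_OPTIM=1 swaps in bitsandbytes' paged 8-bit AdamW inside the training script; the standalone equivalent looks roughly like this:

# Standalone equivalent of the allocator flag plus the paged optimizer.
import os
# Must be set before the first CUDA allocation; setting it before
# importing torch is the safe order.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for the 2B model
optim = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-5)
# The 8-bit moment tensors live in paged buffers, so optimizer state can
# spill to host RAM under pressure instead of raising CUDA out-of-memory.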

The config also needs:

  • gradient_checkpointing: true
  • batch_size: 1
  • grad_accum_steps: 8
  • clean, grouped train/validation/test manifests (see the split sketch after this list)
  • frequent checkpoint saves
  • post-hoc validation in a separate process
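On the manifests: grouped means every clip from one source recording lands in exactly one split, so validation loss is not inflated by near-duplicates. A minimal sketch with scikit-learn, assuming each JSONL row carries a group key such as a recording id (the file names and field name are illustrative):

# Grouped manifest split: clips from one recording never straddle splits.
# Assumes each JSONL row has a "group" field (e.g. source recording id).
import json
from sklearn.model_selection import GroupShuffleSplit

with open("manifest_all.jsonl") as f:
    rows = [json.loads(line) for line in f]
groups = [r["group"] for r in rows]

# Carve off ~10% of groups as test, then ~10% of the rest as validation.
outer = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
trainval_idx, test_idx = next(outer.split(rows, groups=groups))

inner = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
tr, va = next(inner.split(trainval_idx, groups=[groups[i] for i in trainval_idx]))

splits = {"train": [rows[trainval_idx[i]] for i in tr],
          "validation": [rows[trainval_idx[i]] for i in va],
          "test": [rows[i] for i in test_idx]}
for name, subset in splits.items():
    with open(f"manifest_{name}.jsonl", "w") as out:
        out.writelines(json.dumps(r) + "\n" for r in subset)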

The important result:

Result   Value
Model    VoxCPM 2, about 2B parameters
GPU      RTX 3090 Ti, 24 GB
