60-second takeaway
VoxCPM 2 full SFT did fit on a single RTX 3090 Ti, but a vanilla full fine-tuning launch did not. The working stack was gradient_checkpointing: true, PagedAdamW8bit, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, batch_size: 1, gradient accumulation, and a cleaned, grouped train/validation/test split. The final 9000-step checkpoint was not the best checkpoint: held-out validation selected step_0002000.
Who this is for
This is for engineers trying to fine-tune a modern open-source TTS model on one 24 GB NVIDIA GPU:
RTX 3090
RTX 3090 Ti
RTX 4090
A10
L40 / L40S
The practical question is not whether VoxCPM 2 has an official full-SFT entrypoint. It does. The practical question is:
What has to change before a 2B voice model can actually complete full SFT on a 24 GB card?
Short answer
Use LoRA first if you are still exploring a dataset. For VoxCPM 2, LoRA is fast, the checkpoints are small, and a 9000-step run completed in about 2.4 hours in our setup.
Move to full SFT when the target voice needs deeper adaptation than an adapter can provide. In our run, full SFT completed 9000 steps in about 5 hours with the memory stack below.
Two OOM symptoms told us which memory class to attack. OOM during the first forward pass means activation memory. OOM on the next forward after step 0 succeeds usually means optimizer state.
Adam state is lazy: it appears only after the first optimizer.step(). If the run dies before that point, changing the optimizer cannot fix the failure.
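You can watch that lazy allocation on a toy module. A minimal sketch (a stand-in linear layer, not VoxCPM) that prints allocated VRAM around the first step:

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096).cuda()        # stand-in module, not VoxCPM
optimizer = torch.optim.AdamW(model.parameters())

mib = lambda: torch.cuda.memory_allocated() / 2**20

print(f"before first step: {mib():.0f} MiB")   # weights only
out = model(torch.randn(8, 4096, device="cuda")).sum()
out.backward()
print(f"after backward:    {mib():.0f} MiB")   # weights + grads
optimizer.step()                               # exp_avg / exp_avg_sq allocated here
print(f"after first step:  {mib():.0f} MiB")   # + two fp32 state tensors per param
```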
The working memory stack
The successful stack attacked each memory class separately:
| Constraint | Fix | Why it mattered |
| --- | --- | --- |
| Activation memory | Gradient checkpointing | Stores fewer activations and recomputes layers during backward |
| Optimizer state | PagedAdamW8bit | Quantizes optimizer state and pages it through CPU memory |
| Allocation churn | expandable_segments:True | Reduces PyTorch allocator fragmentation under checkpointing and paging |
| Per-microbatch memory | batch_size: 1 | Keeps the forward pass small enough |
| Gradient quality | grad_accum_steps: 8 | Restores effective batch size without increasing microbatch memory |
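Wired together, the stack looks roughly like this. The model loader, dataloader, and learning rate are placeholders, and the gradient-checkpointing toggle assumes an HF-style model; PagedAdamW8bit comes from bitsandbytes:

```python
import os
# Must be set before the first CUDA allocation to take effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
import bitsandbytes as bnb

model = load_model().cuda()            # hypothetical loader for the 2B model
model.gradient_checkpointing_enable()  # HF-style toggle; else wrap blocks in torch.utils.checkpoint
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-5)

GRAD_ACCUM = 8                                # effective batch = 1 x 8
for step, batch in enumerate(train_loader):   # hypothetical loader, microbatch size 1
    loss = model(**batch).loss / GRAD_ACCUM   # scale for accumulation
    loss.backward()
    if (step + 1) % GRAD_ACCUM == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```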
This ran at the ceiling. A VRAM snapshot during the production run showed the training process using about 23,074 MiB, with total GPU use around 24,112 MiB. In other words: it fit, but there was almost no spare headroom.
That matters for future recipes. Inline validation, audio generation, larger clips, or larger microbatches can push the run back over the edge. For heavy validation, use a separate post-hoc process without optimizer state loaded.
The dataset problem we almost missed
The first 2000-step preflight used the old manifest and completed, but the logs were noisy. The giveaway was a large stop-loss outlier around step 1610.
The manifest audit found:
| Problem class | Count | Share |
| --- | --- | --- |
| Empty-text rows | 5,229 | 30.66% |
| Boilerplate rows | 86 | 0.50% |
| Sub-100ms audio | 439 | 2.57% |
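A minimal audit sketch in pandas, assuming hypothetical text and duration_ms columns and a hypothetical boilerplate list, not the project's actual schema:

```python
import pandas as pd

m = pd.read_csv("manifest.csv")  # hypothetical path and schema

# Hypothetical examples; the real list comes from eyeballing repeated strings.
BOILERPLATE = {"[music]", "subscribe to our channel"}

empty_text  = m["text"].fillna("").str.strip().eq("")
boilerplate = m["text"].isin(BOILERPLATE)
too_short   = m["duration_ms"] < 100

for name, mask in [("empty-text", empty_text),
                   ("boilerplate", boilerplate),
                   ("sub-100ms audio", too_short)]:
    print(f"{name}: {mask.sum()} rows ({mask.mean():.2%})")

# The run below dropped empty-text and boilerplate rows before re-splitting.
m[~(empty_text | boilerplate)].to_csv("manifest_clean.csv", index=False)
```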
For TTS, empty text paired with real audio is not harmless. The model receives audio signal, but the text label says to stop immediately. With batch_size: 1 and grad_accum_steps: 8, a 31 percent contamination rate means almost every optimizer step sees at least one bad row:
P(step has at least one bad row) = 1 - (1 - p)^8
At p = 0.31, that is about 95 percent of steps.
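A one-line check of that estimate:

```python
p, k = 0.31, 8                    # contamination rate, rows per optimizer step
print(f"{1 - (1 - p) ** k:.1%}")  # -> 94.9%
```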
After removing empty-text and boilerplate rows, the clean preflight changed immediately:
| Metric | Dirty manifest | Clean manifest | Improvement |
| --- | --- | --- | --- |
| Max grad_norm | 50.96 | 6.09 | 8.4x lower |
| Max post-warmup grad_norm | 46.06 | about 3.6 | about 13x lower |
| Max loss/stop | 27.000 | 0.345 | 78x lower |
| Step 1610 loss/stop | 27.0 | 0.027 | normal step |
The cleaned split kept 11,973 rows, then grouped by parent recording before assigning train, validation, and test. That avoided slice leakage across splits.
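A grouped split like this can be reproduced with scikit-learn's GroupShuffleSplit. The manifest path and parent_id column below are assumptions about the schema, not the project's actual layout:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical manifest: one row per audio slice, plus a parent_id column
# naming the source recording the slice was cut from.
manifest = pd.read_csv("manifest_clean.csv")

# Hold out ~10% of parent recordings for validation + test.
outer = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=0)
train_ix, hold_ix = next(outer.split(manifest, groups=manifest["parent_id"]))
train, hold = manifest.iloc[train_ix], manifest.iloc[hold_ix]

# Split the holdout in half, still grouped by parent recording.
inner = GroupShuffleSplit(n_splits=1, test_size=0.50, random_state=0)
val_ix, test_ix = next(inner.split(hold, groups=hold["parent_id"]))
val, test = hold.iloc[val_ix], hold.iloc[test_ix]

# No parent recording may appear in two splits, so slices cannot leak.
assert set(train["parent_id"]).isdisjoint(val["parent_id"])
assert set(train["parent_id"]).isdisjoint(test["parent_id"])
assert set(val["parent_id"]).isdisjoint(test["parent_id"])
```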
Production run
The final run used the clean grouped split:
| Split | Rows |
| --- | --- |
| Train | 10,776 |
| Validation | 599 |
| Test | 598 |
The production launch used the known-good full-SFT stack:
gradient checkpointing enabled
PagedAdamW8bit
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
batch_size: 1
grad_accum_steps: 8
train-only process, with validation handled post-hoc
Completion evidence:
| Item | Value |
| --- | --- |
| Final logged train step | 8999 |
| Final checkpoint | step_0009000 |
| Latest checkpoint state | {"step": 9000} |
| Final loss/diff | 0.757150 |
| Final loss/stop | 0.000005 |
| Final grad_norm | 0.871526 |
| Error scan | no Traceback, OutOfMemory, Killed, or RuntimeError markers |
| Run size before cleanup | 159 GB |
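The error scan itself is just a marker search over the training log; the log path here is hypothetical:

```python
from pathlib import Path

log = Path("runs/full_sft/train.log").read_text(errors="replace")  # hypothetical path
markers = ("Traceback", "OutOfMemory", "Killed", "RuntimeError")
print({m: log.count(m) for m in markers})  # all zeros on a clean run
```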
The run completed, but the disk cost was not subtle. Full-SFT checkpoints are operational artifacts, not tiny adapter files.
Checkpoint selection
After the 9000-step run, every numbered checkpoint was evaluated on the 599-row held-out validation set in a separate process.
| Checkpoint | loss/total | loss/diff | loss/stop |
| --- | --- | --- | --- |
| step_0000000 | 1.068616 | 0.916918 | 0.151699 |
| step_0001000 | 0.888265 | 0.838856 | 0.049409 |
| step_0002000 | 0.885899 | 0.827947 | 0.057952 |
| step_0003000 | 0.930273 | 0.823368 | 0.106905 |
| step_0004000 | 0.915051 | 0.818194 | 0.096857 |
| step_0005000 | 0.931633 | 0.814075 | 0.117557 |
| step_0006000 | 0.959932 | 0.811755 | 0.148178 |
| step_0007000 | 0.955816 | 0.809483 | 0.146333 |
| step_0008000 | 0.959506 | 0.808552 | 0.150954 |
| step_0008999 | 0.960920 | 0.808400 | 0.152520 |
| step_0009000 | 0.960920 | 0.808400 | 0.152520 |
If you select by the configured total validation objective, the answer is step_0002000.
The late checkpoints had the lowest diffusion loss, but their stop loss got worse. That is exactly the final-checkpoint trap: training can keep improving one scalar while the validation objective tells a more complicated story.
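Mechanically, selection by the total objective is a one-liner over the numbers above:

```python
# loss/total per checkpoint, copied from the validation table above
val_total = {
    "step_0000000": 1.068616, "step_0001000": 0.888265,
    "step_0002000": 0.885899, "step_0003000": 0.930273,
    "step_0004000": 0.915051, "step_0005000": 0.931633,
    "step_0006000": 0.959932, "step_0007000": 0.955816,
    "step_0008000": 0.959506, "step_0008999": 0.960920,
    "step_0009000": 0.960920,
}
best = min(val_total, key=val_total.get)
print(best, val_total[best])  # step_0002000 0.885899
```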
Listening pass
Before checkpoint cleanup, we generated the same no-reference transcript from every numbered checkpoint:
Today I walked through the quiet morning light, and every small detail felt clear, steady, and close.
All 11 samples generated successfully. Durations ranged from 6.24 seconds to 7.68 seconds, so none were empty or obviously truncated.
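A sketch of that pass, with load_checkpoint, synthesize, and save_wav as hypothetical stand-ins for whatever inference entrypoint your checkpoint format exposes:

```python
from pathlib import Path

TRANSCRIPT = ("Today I walked through the quiet morning light, and every "
              "small detail felt clear, steady, and close.")

for ckpt in sorted(Path("runs/full_sft").glob("step_*")):
    model = load_checkpoint(ckpt)                 # hypothetical: load one checkpoint
    wav, sr = synthesize(model, TRANSCRIPT)       # hypothetical: no-reference synthesis
    save_wav(f"listen/{ckpt.name}.wav", wav, sr)  # hypothetical writer
    # Duration is the cheap sanity check: empty or truncated renders stand out.
    print(ckpt.name, f"{len(wav) / sr:.2f}s")
```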
The listener's read was:
The fine-tuned checkpoints all sounded useful and Singaporean.
The base-like checkpoint did not sound Singaporean.
There was no obvious single-transcript listening winner across the fine-tuned checkpoints.
That is why the practical selection is step_0002000: validation had a clear winner, and the same-transcript listening pass did not give a strong reason to override it.
What to copy
For a 24 GB full-SFT attempt, copy the recipe shape, not every exact number:
keep all numbered checkpoints until validation and listening are done
validate on a held-out manifest with zero source-recording overlap
generate the same transcript from each checkpoint
pick by validation unless listening clearly contradicts it
What not to copy
Do not copy these mistakes:
Do not launch vanilla AdamW and hope 24 GB is enough.
Do not assume an 8-bit optimizer fixes first-forward OOM.
Do not run long full SFT before empty-text cleanup.
Do not random-split sliced audio rows without grouping by source recording.
Do not trust the final checkpoint just because it is final.
Do not delete middle checkpoints before validation.
LoRA vs full SFT after this run
This run changed our priors, but not the starting recommendation.
LoRA is still the first experiment for VoxCPM 2 because it is faster and easier to compare. Full SFT is now a realistic escalation on 24 GB, not a fantasy. The trade is operational:
| Path | Best use |
| --- | --- |
| VoxCPM 2 LoRA | fast dataset check, adapter A/B tests, small artifacts |
| VoxCPM 2 full SFT | deeper voice adaptation, branded voice defaults, validation-backed production checkpoints |
The key question is not "LoRA or full SFT". The key is knowing when the evidence says the adapter boundary is the bottleneck.
Sources and related posts
VoxCPM official repository - VoxCPM 2 release notes, CLI examples, and official fine-tuning entrypoints.