VoxCPM 1.5 LoRA Finetuning on IMDA NSC FEMALE_01

07 Feb 2026, 00:00 Z

60-second takeaway
VoxCPM produced a strong practical result in this benchmark once we stabilized dataset prep and used a clean inference protocol.
The best balance of quality and stability came from step_0004000 in our run.
Prompted inference can over-copy prompt noise, so prompt quality matters as much as checkpoint choice.

Where this fits

  • For founders: VoxCPM is a viable production candidate from this benchmark.
  • For engineers: use this page for the training recipe, checkpoint selection logic, and inference defaults.

For the series overview matrix, see:

Experiment setup

  • Base model: VoxCPM1.5
  • Dataset: IMDA NSC FEMALE_01
  • Audio prep: resampled to 44.1 kHz for the VoxCPM1.5 path
  • Hardware: RTX 3090 Ti 24 GB
  • Training mode: LoRA fine-tuning
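
Since the setup depends on every clip being at 44.1 kHz, a quick header check before training can catch files that missed the resampling step. The following is a minimal sketch using only the Python standard library. The function name and the directory layout are assumptions for illustration, not part of the original run, and the `wave` module only reads PCM WAV files.

```python
import wave
from pathlib import Path

TARGET_SR = 44_100  # 44.1 kHz, the rate stated in the setup above


def find_wrong_sample_rate(wav_dir, target_sr=TARGET_SR):
    """Return (path, sample_rate) for each .wav under wav_dir whose header rate differs from target_sr.

    Hypothetical helper: it only inspects WAV headers and does not resample.
    """
    mismatches = []
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        # wave.open raises wave.Error for non-PCM files (for example float WAVs).
        with wave.open(str(path), "rb") as wav_file:
            sample_rate = wav_file.getframerate()
        if sample_rate != target_sr:
            mismatches.append((path, sample_rate))
    return mismatches
```

An empty return value means every PCM WAV header under the directory reports 44.1 kHz. Any listed file would need resampling with whatever audio tool the pipeline already uses.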

Best checkpoint logic

We tracked validation total loss across steps and selected the checkpoint range that looked best on both the loss trend and listening tests:

  • The lowest recorded validation total loss in this run was at step_0004000.
  • Later checkpoints remained usable, but were not consistently better on subjective naturalness.
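
The selection logic above (loss trend first, listening second) can be sketched as a shortlist step: keep every checkpoint whose validation loss is close to the minimum, then audition only those. This is an illustrative sketch. The function name and the 2% relative tolerance are assumptions, not values from this run.

```python
def shortlist_checkpoints(val_losses, rel_tolerance=0.02):
    """Return checkpoint names whose validation loss is within rel_tolerance of the minimum.

    val_losses maps a checkpoint name (e.g. "step_0004000") to its validation
    total loss. The result is sorted best first. rel_tolerance is a hypothetical
    default; tune it to how flat the loss curve is.
    """
    if not val_losses:
        raise ValueError("val_losses is empty")
    best = min(val_losses.values())
    cutoff = best + abs(best) * rel_tolerance
    keep = [(loss, name) for name, loss in val_losses.items() if loss <= cutoff]
    return [name for _, name in sorted(keep)]
```

The shortlist only narrows the candidates. The final pick still comes from listening, since later checkpoints with similar loss were not consistently better on naturalness.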

Audio evidence

Best practical sample (this run)

Settings: no-prompt, no denoiser, long text test.

Failure modes we saw

  • Prompted outputs can strongly inherit room noise from the prompt clip.
  • Denoisers can clean hiss but also shift timbre and perceived bandwidth.
  • Long-form outputs are sensitive to prompt clip quality and consistency.
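
Because prompt noise can carry over into the output, it may help to screen prompt clips before using them. The sketch below gives a crude noise-floor estimate for a 16-bit PCM WAV by averaging the RMS level of its quietest frames. The function name, the 20 ms frame size, and the 10% quiet fraction are assumptions for illustration. This was not part of the original run, and no pass/fail threshold is implied.

```python
import array
import math
import sys
import wave


def noise_floor_dbfs(wav_path, frame_ms=20, quiet_fraction=0.1):
    """Estimate the noise floor (dBFS) as the mean RMS of the quietest frames.

    Hypothetical helper: expects 16-bit PCM WAV input. The result is floored
    at about -90 dBFS so that digital silence does not hit log10(0).
    """
    with wave.open(wav_path, "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        samples = array.array("h", wav_file.readframes(wav_file.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Frame length in interleaved samples (all channels pooled together).
    frame_len = max(1, sample_rate * frame_ms // 1000) * channels
    rms_values = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms_values.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    if not rms_values:
        raise ValueError("clip is shorter than one frame")

    rms_values.sort()
    n_quiet = max(1, int(len(rms_values) * quiet_fraction))
    quiet_rms = sum(rms_values[:n_quiet]) / n_quiet
    return 20 * math.log10(max(quiet_rms, 1.0) / 32768.0)
```

Comparing this number across candidate prompt clips gives a rough ranking of which ones are cleanest. It does not replace listening, since steady hum and reverberation affect the output differently than broadband hiss.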

Recommended inference settings

For this exact benchmark setup:

  • Start from step_0004000 as the default checkpoint.
  • Use no-prompt generation first to gauge the model's own naturalness without prompt influence.
  • Add a prompt only when you need stronger speaker lock.
  • Use a denoiser only when hiss or noise is clearly audible.
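
The recommendations above amount to a small decision rule, which can be written down as a settings record so that every test run states its choices explicitly. This is a sketch only. `InferencePlan` and `plan_inference` are hypothetical names, and the fields are not VoxCPM parameters. They would need to be mapped onto whatever inference entry point you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferencePlan:
    """Hypothetical settings record; field names are illustrative, not VoxCPM's API."""
    checkpoint: str
    use_prompt: bool
    use_denoiser: bool


def plan_inference(need_speaker_lock=False, hiss_audible=False,
                   checkpoint="step_0004000"):
    """Encode the defaults above: no prompt and no denoiser unless there is a stated reason."""
    return InferencePlan(
        checkpoint=checkpoint,
        use_prompt=need_speaker_lock,  # add a prompt only for stronger speaker lock
        use_denoiser=hiss_audible,     # denoise only when hiss or noise is clearly audible
    )
```

Calling `plan_inference()` with no arguments yields the baseline used for the best practical sample in this run: step_0004000, no prompt, no denoiser.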

Engineer appendix

Key paths from this run
