CosyVoice LoRA Fine-Tuning - What Worked, What Didn't, and What the Rerun Fixed


25 Mar 2026, 00:00 Z

60-second takeaway
Our first CosyVoice3 run used full SFT (all 506M params) and failed after epoch 1 - massive overfitting, 174 GB of checkpoints, grad_norm explosion.
The corrected LoRA rerun (2.16M trainable params via PEFT) reached its best checkpoint at epoch 12, with 8.3 MB adapters and stable gradients.
This post covers the diagnosis, the fix, 9 known pitfalls, and the companion repo.
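
For context on the grad_norm diagnosis: a minimal monitoring sketch, assuming a plain PyTorch training loop. `clip_grad_norm_` returns the pre-clipping total norm, so it doubles as a logging hook; the clip value and alert threshold here are assumptions, not numbers from our runs.

```python
import torch

def training_step(model, loss, optimizer, max_norm=1.0):
    # max_norm and the alert threshold are illustrative assumptions.
    optimizer.zero_grad()
    loss.backward()
    # Returns the total gradient norm *before* clipping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if float(grad_norm) > 10 * max_norm:
        print(f"warning: grad_norm spiked to {float(grad_norm):.1f}")
    optimizer.step()
    return grad_norm
```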

Where this fits

This is an engineering note in the IMDA NSC Voice Cloning Finetuning Benchmark 2026 series. The companion repo is at instavar/cosyvoice3-lora-finetuning.

  • For founders: the LoRA rerun produced a viable checkpoint (epoch 12). Listening evaluation is pending before production deployment.
  • For engineers: the companion repo has the PEFT-integrated training script, 9 pitfalls, and corrected hyperparameters; a minimal sketch of the PEFT wiring follows this list.
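
As a rough guide to what the PEFT integration looks like, here is a minimal sketch of wrapping a model with LoRA adapters. The rank, alpha, dropout, and target module names are placeholder assumptions; the rerun reported 2.16M trainable params (0.44%), but this note does not spell out the exact LoRA config.

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

def wrap_with_lora(base_model: nn.Module):
    config = LoraConfig(
        r=8,                                  # assumed rank
        lora_alpha=16,                        # assumed scaling
        lora_dropout=0.05,                    # assumed dropout
        target_modules=["q_proj", "v_proj"],  # assumed attention projections
    )
    model = get_peft_model(base_model, config)
    # Prints "trainable params: ... || all params: ... || trainable%: ..."
    model.print_trainable_parameters()
    return model
```

The printed trainable percentage is the quickest way to confirm the adapter config matches the intended budget before launching a long run.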

LoRA rerun results

The corrected LoRA run (PEFT, 2.16M trainable params, LR 5e-5) on IMDA NSC FEMALE_01 (16,535 train / 870 dev utterances):

| Metric | LoRA rerun | Failed full-SFT run |
| --- | --- | --- |
| Trainable params | 2.16M (0.44%) | 506M (100%) |
| Best CV loss | 3.044 (epoch 12) | 2.900 (epoch 1) |
| Epochs before overfit | 12 | 1 |
| Checkpoint size | 8.3 MB (adapters) | 174 GB |
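
The checkpoint-size row is easy to sanity-check from the parameter count alone; the snippet below is plain arithmetic plus the standard PEFT save call (the output directory name is hypothetical).

```python
# 2.16M fp32 params: 2.16e6 * 4 bytes ≈ 8.6 MB (8.2 MiB),
# consistent with the 8.3 MB adapter files reported above.
print(f"fp32 adapter size ≈ {2.16e6 * 4 / 1e6:.1f} MB")

# With PEFT, save_pretrained() on the wrapped model writes only the
# adapter weights and config, not the 506M-param base model:
# model.save_pretrained("lora_adapter/")  # hypothetical output directory
```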
