CosyVoice Fine-Tuning Guide - LoRA, Data Requirements, and Voice Quality

25 Mar 2026, 00:00 Z

60-second takeaway
Our first CosyVoice3 run used full SFT (all 506M params) and failed after epoch 1: severe overfitting, 174 GB of checkpoints, and an exploding grad_norm.
The corrected LoRA rerun (2.16M trainable params via PEFT) reached its best checkpoint at epoch 12, with 8.3 MB adapters and stable gradients.
This post covers the diagnosis, the fix, 9 known pitfalls, and the companion repo.
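
As a quick sanity check on the figures above, the sketch below works out what share of the model the LoRA run actually trained and roughly how large the adapter file should be. The parameter counts come from this post; the float32 storage assumption (4 bytes per parameter) and the helper names are illustrative, not taken from the companion repo.

```python
# Figures quoted in this post.
TOTAL_PARAMS = 506_000_000   # parameters updated by the full-SFT run
LORA_PARAMS = 2_160_000      # trainable parameters in the LoRA rerun

# Assumption (not stated in the post): adapter weights are saved as float32.
BYTES_PER_PARAM = 4


def trainable_fraction(trainable: int, total: int) -> float:
    """Share of the model's parameters that receive gradient updates."""
    return trainable / total


def adapter_size_mib(params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate on-disk size of the adapter weights in MiB."""
    return params * bytes_per_param / 2**20


if __name__ == "__main__":
    print(f"trainable share: {trainable_fraction(LORA_PARAMS, TOTAL_PARAMS):.2%}")
    print(f"adapter size:    {adapter_size_mib(LORA_PARAMS):.1f} MiB")
```

Under these assumptions the LoRA run updates roughly 0.43% of the parameters, and the weights alone come to about 8.2 MiB, which is in line with the 8.3 MB adapters reported above once file metadata and rounding are allowed for.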

If you searched for CosyVoice fine-tuning guide, CosyVoice fine-tuning data requirements, CosyVoice fine-tuning consumer GPU, or CosyVoice voice cloning quality, start here. The important lesson is not that CosyVoice is bad. It is that the training mode decides the outcome: full SFT failed quickly in our run, while the corrected LoRA path produced a stable checkpoint that still needs listening evaluation before production use.

Where this fits

This is an engineering note in the IMDA NSC Voice Cloning Finetuning Benchmark 2026 series. The companion repo is at instavar/cosyvoice3-lora-finetuning.

  • For founders: the LoRA rerun produced a viable checkpoint (epoch 12). Listening evaluation is pending before production deployment.
  • For engineers: the companion repo has the PEFT-integrated training script, 9 pitfalls, and corrected hyperparameters.
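
A minimal sketch of what "best checkpoint" can mean in practice, assuming it is chosen as the epoch with the lowest dev-set loss. The post does not state the exact selection criterion, and the function name and the loss values in the usage note are placeholders, not numbers from our run.

```python
from typing import Mapping


def best_epoch(dev_loss_by_epoch: Mapping[int, float]) -> int:
    """Return the epoch with the lowest dev loss.

    Ties go to the earliest epoch, so a later checkpoint is kept
    only when it is strictly better.
    """
    if not dev_loss_by_epoch:
        raise ValueError("no epochs logged")
    # Iterate epochs in ascending order; min() keeps the first minimum it sees.
    return min(sorted(dev_loss_by_epoch), key=lambda epoch: dev_loss_by_epoch[epoch])
```

For example, `best_epoch({1: 3.0, 2: 2.5, 3: 2.7})` picks epoch 2 from these made-up values. A lowest-loss checkpoint is still only a candidate: it needs the listening evaluation mentioned above before it counts as production-ready.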

Decision shortcut:

  • Use this CosyVoice LoRA path when you want a consumer-GPU fine-tuning experiment with small adapter files and explicit checkpoint control.
  • Use CosyVoice 2 or CosyVoice 3 zero-shot first when you need a quality reference before spending training time.
  • Use Qwen3-TTS LoRA when you need adapter scale control at inference time.
  • Use VoxCPM 1.5 when you need the lowest-friction fine-tuned voice that already passed our production listening checks.

LoRA rerun results

The corrected LoRA run (PEFT, 2.16M trainable params, LR 5e-5) on IMDA NSC FEMALE_01 (16,535 train / 870 dev utterances):

Metric              LoRA rerun    Failed full-SFT run
Trainable params    2.16M         506M
