# CosyVoice LoRA Fine-Tuning — What Worked, What Didn't, and the Rerun Plan
25 Mar 2026, 00:00 Z
## 60-second takeaway
CosyVoice3 LoRA fine-tuning on IMDA NSC FEMALE_01 did not reach production quality in our first run. CosyVoice2 baseline audio was acceptable as a control.
The failure was configuration-specific rather than a model limitation: we saw checkpoint drift, strong prompt sensitivity, and operational fragility in the training loop.
We have a clear rerun plan with tighter checkpoint gating and a fixed prompt harness — this post documents what to reproduce and what to change.
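The "tighter checkpoint gating" in the rerun plan can be made concrete as a simple accept/reject rule applied at every eval step. A minimal sketch, assuming hypothetical metric names (`wer`, `spk_sim`) and thresholds that are illustrative, not taken from the benchmark:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    step: int
    wer: float       # ASR word error rate on held-out prompts (lower is better)
    spk_sim: float   # speaker-similarity score vs. reference (higher is better)

def gate_checkpoint(candidate: EvalResult, best: EvalResult,
                    max_wer_regression: float = 0.02,
                    min_spk_sim: float = 0.80) -> bool:
    """Accept a checkpoint only if it keeps speaker similarity above a
    floor and does not regress WER beyond a small tolerance vs. the
    best checkpoint seen so far; anything else is treated as drift."""
    if candidate.spk_sim < min_spk_sim:
        return False
    if candidate.wer > best.wer + max_wer_regression:
        return False
    return True

best = EvalResult(step=2000, wer=0.11, spk_sim=0.86)
drifted = EvalResult(step=4000, wer=0.19, spk_sim=0.71)  # checkpoint drift
print(gate_checkpoint(drifted, best))  # → False: fails both gates
```

The point is that a drifted checkpoint is rejected automatically instead of being discovered by ear after training finishes, which is how the first run failed.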
## Where this fits
This is a run-specific engineering note in the IMDA NSC Voice Cloning Finetuning Benchmark 2026 series. The other models in that series — VoxCPM 1.5 and Qwen3-TTS LoRA — both produced deployable results. CosyVoice3 is the outlier that needs a rerun before it can be evaluated fairly.
- For founders: do not deploy the current CosyVoice3 run. Use VoxCPM or Qwen3-TTS while the rerun is pending.
- For engineers: use this page as the diagnostic handoff for the next CosyVoice LoRA run.
## CosyVoice2 vs CosyVoice3: what the benchmark found
| | CosyVoice2 (baseline/control) | CosyVoice3 (current run) |
| --- | --- | --- |
| Training mode | No finetune — zero-shot reference | Full SFT via LoRA path (`train_cosyvoice3_lora.py`) |
| Qualitative result | Acceptable naturalness, usable as control | Did not reach production quality in this run |
| Long-form stability | Stable | Unstable beyond ~20 seconds |
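The "fixed prompt harness" from the rerun plan can be as simple as running one frozen prompt list against every checkpoint, so results stay comparable across runs. A minimal sketch, with an assumed `synthesize` callable (a stub stands in for the real model here) and a 20-second flag that mirrors the instability threshold seen in this run:

```python
from typing import Callable, List, Tuple

# Frozen prompt list: never edited between runs, so prompt sensitivity
# shows up as a checkpoint difference, not a prompt difference.
FIXED_PROMPTS: List[str] = [
    "Short sanity-check sentence.",
    "long " * 80,  # deliberately long text to probe >20 s stability
]

def run_harness(synthesize: Callable[[str], float],
                prompts: List[str],
                max_stable_s: float = 20.0) -> List[Tuple[str, float, bool]]:
    """Return (prompt, duration_s, flagged) per prompt. Clips longer than
    max_stable_s get flagged for manual listening, since this run was
    unstable beyond roughly 20 seconds of generated audio."""
    results = []
    for p in prompts:
        dur = synthesize(p)  # seconds of audio the checkpoint produced
        results.append((p, dur, dur > max_stable_s))
    return results

# Stub synthesizer for illustration only: ~0.06 s of audio per character.
fake_tts = lambda text: 0.06 * len(text)
for prompt, dur, flagged in run_harness(fake_tts, FIXED_PROMPTS):
    print(f"{dur:5.1f}s flagged={flagged} :: {prompt[:30]}")
```

In the real rerun, `synthesize` would wrap the CosyVoice3 inference call and the flagged clips would go straight into the listening queue.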