CosyVoice LoRA Fine-Tuning - What Worked, What Didn't, and What the Rerun Fixed

Download printable cheat-sheet (CC-BY 4.0)

25 Mar 2026, 00:00 Z

60-second takeaway
Our first CosyVoice3 run used full SFT (all 506M params) and failed after epoch 1: massive overfitting, 174 GB of checkpoints, and a grad_norm explosion.
The corrected LoRA rerun (2.16M trainable params via PEFT) reached its best checkpoint at epoch 12, with 8.3 MB adapters and stable gradients.
This post covers the diagnosis, the fix, 9 known pitfalls, and the companion repo.

Where this fits

This is an engineering note in the IMDA NSC Voice Cloning Finetuning Benchmark 2026 series. The companion repo is at instavar/cosyvoice3-lora-finetuning.

For founders: the LoRA rerun produced a viable checkpoint (epoch 12). Listening evaluation is pending before production deployment.
For engineers: the companion repo has the PEFT-integrated training script, 9 pitfalls, and corrected hyperparameters.

LoRA rerun results

Results of the corrected LoRA run (PEFT, 2.16M trainable params, LR 5e-5) on IMDA NSC FEMALE_01 (16,535 train / 870 dev utterances), next to the failed full-SFT run:

Metric	LoRA rerun	Failed full-SFT run
Trainable params	2.16M (0.44%)	506M (100%)
Best CV loss	3.044 (epoch 12)	2.900 (epoch 1)
Epochs before overfit	12	1
Checkpoint size	8.3 MB (adapter)	174 GB (checkpoints)
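
As a sanity check on the trainable-parameter row, LoRA adds rank × (in_features + out_features) parameters to each adapted linear layer. The sketch below counts them for one hypothetical configuration: 24 transformer blocks, hidden width 896, key/value width 128, adapters on the q/k/v/o projections at rank 16. These shapes and the rank are assumptions for illustration and are not stated in this post; the companion repo is the authority on the actual LoRA config.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LinearShape:
    """Input/output width of one linear layer that receives a LoRA adapter."""
    in_features: int
    out_features: int


def lora_param_count(layers: Iterable[LinearShape], rank: int) -> int:
    """Count LoRA parameters: A is (rank x in), B is (out x rank) per layer."""
    return sum(rank * (l.in_features + l.out_features) for l in layers)


# ASSUMED shapes (not from the post): one attention block with q/k/v/o projections.
HIDDEN, KV_WIDTH, NUM_BLOCKS, RANK = 896, 128, 24, 16
block = [
    LinearShape(HIDDEN, HIDDEN),    # q_proj
    LinearShape(HIDDEN, KV_WIDTH),  # k_proj
    LinearShape(HIDDEN, KV_WIDTH),  # v_proj
    LinearShape(HIDDEN, HIDDEN),    # o_proj
]
adapter_params = lora_param_count(block * NUM_BLOCKS, RANK)
print(adapter_params)  # 2162688, i.e. about 2.16M
```

Under these assumed shapes the count lands on roughly 2.16M, which matches the table, but other combinations of rank and target modules could give the same total.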

Aspect	CosyVoice2 (baseline/control)	CosyVoice3 (first run - full SFT)
Training mode	No finetune - zero-shot reference	Full SFT (all 506M params trained)
Qualitative result	Acceptable naturalness, usable as control	Did not reach production quality
Long-form stability	Stable	Unstable beyond ~20 seconds
Linguistic consistency	Consistent	Weak in listening checks

Situation	Recommended model
Need deployable output now on 24 GB GPU	VoxCPM 1.5 (step 4000)
Need LoRA-style fine-tuning with scale control	Qwen3-TTS 1.7B (epoch 10, scale 0.3–0.35)
Need full SFT baseline with crash recovery	IndexTTS2 (step 14000)
CosyVoice3 LoRA fine-tuning	Use companion repo, best checkpoint at epoch 12
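
Picking a "best checkpoint" by CV loss amounts to taking the epoch with the lowest logged validation loss. The helper below is a minimal sketch of that selection; the function name and the loss curve are hypothetical, and the numbers are synthetic placeholders rather than values logged by this run. Lowest CV loss is only a first filter: the failed full-SFT run logged a lower CV loss than the LoRA rerun, and listening evaluation is still pending.

```python
from typing import Dict, Tuple


def best_checkpoint(cv_loss_by_epoch: Dict[int, float]) -> Tuple[int, float]:
    """Return (epoch, loss) with the lowest CV loss; ties go to the earliest epoch."""
    if not cv_loss_by_epoch:
        raise ValueError("no CV losses recorded")
    epoch = min(cv_loss_by_epoch, key=lambda e: (cv_loss_by_epoch[e], e))
    return epoch, cv_loss_by_epoch[epoch]


# Synthetic curve for illustration only; NOT the run's logged values.
synthetic_cv_loss = {10: 3.10, 11: 3.06, 12: 3.04, 13: 3.07, 14: 3.12}
best_epoch, best_loss = best_checkpoint(synthetic_cv_loss)
print(best_epoch, best_loss)
```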

Where this fits

LoRA rerun results

Need consented AI voiceovers?

What went wrong in the first run

Audio evidence

CosyVoice2 baseline/control

CosyVoice3 representative sample (this run - not production-ready)

Failure mode analysis

1. Checkpoint quality drift

2. Prompt sensitivity and long-text decoding

3. Operational fragility: large checkpoint churn

What the LoRA rerun fixed

Companion repo

When to use CosyVoice vs alternatives

FAQ

Related deep dives

Related Posts
