60-second takeaway
Our first CosyVoice3 run used full SFT (all 506M params) and failed after epoch 1: massive overfitting, 174 GB of checkpoints, and a grad_norm explosion. The corrected LoRA rerun (2.16M trainable params via PEFT) reached its best checkpoint at epoch 12, with 8.3 MB adapters and stable gradients. This post covers the diagnosis, the fix, 9 known pitfalls, and the companion repo.
Listening evaluation of the epoch 12 checkpoint is pending. The LoRA approach produced a much more stable training run with 12 epochs of genuine improvement before divergence.
What went wrong in the first run
| | CosyVoice2 (baseline/control) | CosyVoice3 (first run - full SFT) |
| --- | --- | --- |
| Training mode | No finetune - zero-shot reference | Full SFT (all 506M params trained) |
| Qualitative result | Acceptable naturalness, usable as control | Did not reach production quality |
| Long-form stability | Stable | Unstable beyond ~20 seconds |
| Linguistic consistency | Consistent | Weak in listening checks |
The critical discovery: the first run used full SFT via the upstream training pipeline, not LoRA. The train_cosyvoice3_lora.py script existed but was not the training path that produced the evaluated checkpoints. This caused catastrophic forgetting of the pretrained model's generalization.
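A cheap guard against repeating this mistake is to print the trainable-parameter count before launching a run. A minimal sketch (the report_trainable helper is ours, not part of the upstream pipeline):

```python
import torch

def report_trainable(model: torch.nn.Module) -> None:
    """Sanity-check which training path is active before spending GPU hours.

    Full SFT should report roughly 506M trainable params here; a correctly
    applied LoRA wrap should report roughly 2.16M (0.44%). Running this at
    startup would have caught the wrong-training-path mistake immediately.
    """
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```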
Audio evidence
CosyVoice2 baseline/control
CosyVoice3 representative sample (this run - not production-ready)
Listen to the two clips side by side. The CosyVoice2 clip is cleaner on naturalistic prosody. The CosyVoice3 clip shows the instability in long-form decoding that blocked production deployment.
Failure mode analysis
Three contributing factors explain the current run's outcome:
1. Checkpoint quality drift
CosyVoice3 LoRA training showed a pattern where early epochs were noticeably better than later ones. We did not have a strict per-epoch validation gate in place, so the run continued past the best region. The practical lesson: CosyVoice3 needs tighter checkpoint gating with explicit listening checkpoints every 2–3 epochs, not just loss curve tracking.
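A minimal version of that gate, assuming the training loop exposes a per-epoch CV loss (the function and file names here are illustrative, not from the upstream trainer):

```python
import torch
from pathlib import Path

def gate_checkpoint(epoch: int, cv_loss: float, adapter_state: dict,
                    best: dict, out_dir: Path, listen_every: int = 2) -> None:
    """Per-epoch gate: keep the best-CV-loss checkpoint in its own file so
    later, worse epochs can never overwrite it, and flag checkpoints that
    are due for a human listening check."""
    out_dir.mkdir(parents=True, exist_ok=True)
    torch.save(adapter_state, out_dir / f"epoch_{epoch:03d}.pt")
    if cv_loss < best.get("loss", float("inf")):
        best.update(loss=cv_loss, epoch=epoch)
        torch.save(adapter_state, out_dir / "best.pt")
    if epoch % listen_every == 0:
        # loss curves alone missed the quality drift; schedule human review
        print(f"epoch {epoch}: queue epoch_{epoch:03d}.pt for listening review")
```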
2. Prompt sensitivity and long-text decoding
CosyVoice3 in this run was sensitive to prompt formatting. Small changes to how the text was segmented before inference changed output quality significantly. Long-text generation (>20s) showed linguistic inconsistency - words dropped or merged in a way that sounded unnatural on listening review. VoxCPM and Qwen3-TTS were more robust to prompt formatting variation.
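Since the instability appears past roughly 20 seconds, the practical mitigation is to segment text before inference and synthesize chunk by chunk. A sketch of the idea (the 80-character cap is an illustrative guess, not a measured threshold):

```python
import re

def segment_text(text: str, max_chars: int = 80) -> list[str]:
    """Split text into short chunks so no single decode reaches the
    >20 s region where this run became unstable.

    Splits on sentence-final punctuation, then greedily packs sentences
    into chunks no longer than max_chars."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?。！？])\s*", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```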
3. Operational fragility: large checkpoint churn
The training loop produced large checkpoint files at short intervals. Without an explicit retention policy, older (potentially better) checkpoints were overwritten. By the time we evaluated, some of the best early-epoch checkpoints were no longer available. This is the same issue that affected the IndexTTS2 run, but with a more acute impact here because the best zone was earlier.
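Retention can be a few lines of code. A sketch, assuming checkpoints are named epoch_NNN.pt and per-epoch CV losses are recorded (both assumptions about layout, not the upstream pipeline's actual behavior):

```python
from pathlib import Path

def prune_checkpoints(ckpt_dir: Path, cv_losses: dict[int, float],
                      keep_last: int = 3, keep_best: int = 3) -> None:
    """Delete checkpoints except the most recent keep_last epochs and the
    keep_best epochs by CV loss, so early good checkpoints survive."""
    epochs = sorted(cv_losses)
    keep = set(epochs[-keep_last:])
    keep |= set(sorted(epochs, key=cv_losses.get)[:keep_best])
    for epoch in epochs:
        path = ckpt_dir / f"epoch_{epoch:03d}.pt"
        if epoch not in keep and path.exists():
            path.unlink()
```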
What the LoRA rerun fixed
The corrected LoRA run addressed all three failure modes:
1. LoRA instead of full SFT (see the sketch after this list)
└─ 2.16M trainable params (0.44%) via PEFT
└─ 8.3 MB checkpoints (480x smaller)
└─ No catastrophic forgetting - base model frozen
2. Stable training curve
└─ 12 epochs of genuine improvement (vs 1 for full SFT)
└─ Grad norm stayed at 1.4–4.0 (no explosion)
└─ CV loss bottomed at 3.044 (epoch 12)
3. All checkpoints preserved
└─ 200 checkpoints at 8.3 MB each = 1.7 GB total
└─ Best epoch (12) available for evaluation
└─ Compare: old run had 43 checkpoints at ~4 GB each ≈ 174 GB
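The sketch referenced in item 1: wrapping the base model with PEFT so only small adapters train and get saved. The rank and target_modules below are placeholders; the values that produced 2.16M trainable params are in the companion repo's train_cosyvoice3_lora.py.

```python
import torch
from peft import LoraConfig, get_peft_model

def wrap_with_lora(base_model: torch.nn.Module):
    """Freeze the 506M-param base model and attach small LoRA adapters."""
    config = LoraConfig(
        r=8,                                  # placeholder rank
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # placeholder module names
    )
    model = get_peft_model(base_model, config)
    model.print_trainable_parameters()  # sanity check: expect ~0.44% trainable
    return model

# Saving a PEFT model writes only the adapter weights (MB, not GB):
#   model.save_pretrained("checkpoints/epoch_012")   # ~8.3 MB adapter
```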
Is CosyVoice LoRA fine-tuning worth trying on a 24GB GPU?
Yes. The LoRA rerun peaked at 6.96 GB VRAM on an RTX 3090 Ti. The companion repo (instavar/cosyvoice3-lora-finetuning) has the PEFT-integrated training script and 9 documented pitfalls.
Why did the first run fail?
The first run used full SFT (all 506M parameters), not LoRA. This caused catastrophic forgetting after epoch 1. The corrected LoRA rerun (2.16M params, LR 5e-5) trained stably for 12 epochs.
What learning rate should I use?
5e-5 for LoRA (not 1e-5, which was used for full SFT). The first run's artifact path female01_cv3_lr1e5_run1 reflects the full SFT LR. The LoRA rerun used female01_cv3_lora_lr5e5_run1.
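For concreteness, a hedged sketch of how the two learning rates map onto a run (AdamW is our assumption; only the learning rates come from our runs):

```python
import torch

def make_optimizer(model: torch.nn.Module, lora: bool = True) -> torch.optim.Optimizer:
    """5e-5 matches female01_cv3_lora_lr5e5_run1; 1e-5 matches the failed
    full-SFT run female01_cv3_lr1e5_run1. AdamW is an assumption, not a
    documented detail of the upstream pipeline."""
    lr = 5e-5 if lora else 1e-5
    trainable = (p for p in model.parameters() if p.requires_grad)
    return torch.optim.AdamW(trainable, lr=lr)
```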