60-second takeaway
Our first CosyVoice3 run used full SFT (all 506M params) and failed after epoch 1 - massive overfitting, 174 GB of checkpoints, grad_norm explosion. The corrected LoRA rerun (2.16M trainable params via PEFT) reached its best checkpoint at epoch 12, with 8.3 MB adapters and stable gradients. This post covers the diagnosis, the fix, 9 known pitfalls, and the companion repo.
If you searched for CosyVoice fine-tuning guide, CosyVoice fine-tuning data requirements, CosyVoice fine-tuning consumer GPU, or CosyVoice voice cloning quality, start here. The important lesson is not that CosyVoice is bad. It is that the training mode decides the outcome: full SFT failed quickly in our run, while the corrected LoRA path produced a stable checkpoint that still needs listening evaluation before production use.
Listening evaluation of the epoch 12 checkpoint is pending. The LoRA approach produced a much more stable training run with 12 epochs of genuine improvement before divergence.
What went wrong in the first run
| | CosyVoice2 (baseline/control) | CosyVoice3 (first run - full SFT) |
| --- | --- | --- |
| Training mode | No finetune - zero-shot reference | Full SFT (all 506M params trained) |
| Qualitative result | Acceptable naturalness, usable as control | Did not reach production quality |
| Long-form stability | Stable | Unstable beyond ~20 seconds |
| Linguistic consistency | Consistent | Weak in listening checks |
The critical discovery: the first run used full SFT via the upstream training pipeline, not LoRA. The train_cosyvoice3_lora.py script existed but was not the training path that produced the evaluated checkpoints. This caused catastrophic forgetting of the pretrained model's generalization.
Audio evidence
CosyVoice2 baseline/control
CosyVoice3 representative sample (this run - not production-ready)
Listen to the two clips side by side. The CosyVoice2 clip is cleaner on naturalistic prosody. The CosyVoice3 clip shows the instability in long-form decoding that blocked production deployment.
Failure mode analysis
Three contributing factors explain the outcome of the first run:
1. Checkpoint quality drift
The first CosyVoice3 run (full SFT, as described above) showed a pattern where early epochs were noticeably better than later ones. We had no strict per-epoch validation gate in place, so the run continued past its best region. The practical lesson: CosyVoice3 needs tighter checkpoint gating, with explicit listening checks every 2-3 epochs rather than loss-curve tracking alone.
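The gating idea can be sketched in a few lines of Python. This is a minimal illustration, not the training script from the companion repo: the function name `gate_checkpoints`, the `patience` and `listen_every` parameters, and the loss values in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class GateDecision:
    best_epoch: int            # 1-indexed epoch with the lowest validation loss seen
    best_loss: float
    should_stop: bool          # True once `patience` epochs pass without improvement
    listen_epochs: List[int]   # epochs flagged for a manual listening check


def gate_checkpoints(
    val_losses: Sequence[float],
    patience: int = 3,
    listen_every: int = 2,
    min_delta: float = 0.0,
) -> GateDecision:
    """Hypothetical per-epoch gate: track the best epoch and stop after a plateau.

    val_losses holds one validation loss per epoch (index 0 is epoch 1).
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")

    best_epoch, best_loss = 1, val_losses[0]
    epochs_since_best = 0
    should_stop = False
    last_epoch = len(val_losses)

    for epoch, loss in enumerate(val_losses, start=1):
        if epoch == 1:
            continue  # epoch 1 is the initial best by definition
        if loss < best_loss - min_delta:
            best_epoch, best_loss = epoch, loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                should_stop = True
                last_epoch = epoch
                break

    # Listening checks are scheduled independently of the loss curve.
    listen_epochs = list(range(listen_every, last_epoch + 1, listen_every))
    return GateDecision(best_epoch, best_loss, should_stop, listen_epochs)
```

With made-up losses such as `[3.9, 3.5, 3.2, 3.3, 3.4, 3.6, 3.1]` and `patience=3`, the gate stops at epoch 6 and reports epoch 3 as the best region. The listening schedule is kept separate on purpose, since loss alone did not predict audio quality in this run.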
2. Prompt sensitivity and long-text decoding
CosyVoice3 in this run was sensitive to prompt formatting. Small changes to how the text was segmented before inference changed output quality significantly. Long-text generation (>20s) showed linguistic inconsistency - words dropped or merged in a way that sounded unnatural on listening review. VoxCPM and Qwen3-TTS were more robust to prompt formatting variation.
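Because segmentation changed the output so much, it helps to hold segmentation fixed when comparing checkpoints or models. The sketch below is one hypothetical way to do that. The `segment_text` helper, its `max_chars` default, and the English-punctuation sentence splitter are assumptions for illustration, not the preprocessing CosyVoice itself uses.

```python
import re
from typing import List


def segment_text(text: str, max_chars: int = 120) -> List[str]:
    """Deterministically split text into chunks of whole sentences.

    Sentences are split on ., ! or ? followed by whitespace, then packed
    greedily into chunks of at most max_chars characters. A single sentence
    longer than max_chars is kept whole rather than cut mid-sentence.
    """
    normalized = " ".join(text.split())  # collapse newlines and repeated spaces
    if not normalized:
        return []

    sentences = re.split(r"(?<=[.!?])\s+", normalized)
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in segment_text("One. Two!  Three?\nFour.", max_chars=10):
        print(repr(chunk))
```

Running every model through the same segmenter means a quality difference can be attributed to the model or checkpoint rather than to how the text happened to be split.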
3. Operational fragility: large checkpoint churn
The training loop produced large checkpoint files at short intervals. Without an explicit retention policy, older (potentially better) checkpoints were overwritten. By the time we evaluated, some of the best early-epoch checkpoints were no longer available. This is the same issue that affected the IndexTTS2 run, but with a more acute impact here because the best zone was earlier.
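A retention policy does not need to be elaborate. The sketch below shows one possible rule: keep the best few epochs by validation loss, every Nth epoch, and the latest epoch. The function names and the `keep_best` / `keep_every` defaults are hypothetical, and the code only decides which epochs to keep; it does not touch files.

```python
from typing import Dict, List, Set


def checkpoints_to_keep(
    val_loss_by_epoch: Dict[int, float],
    keep_best: int = 3,
    keep_every: int = 3,
) -> Set[int]:
    """Return the epochs whose checkpoints should be retained."""
    if not val_loss_by_epoch:
        return set()
    epochs = sorted(val_loss_by_epoch)
    # Best-k by validation loss; ties go to the earlier epoch.
    best = sorted(epochs, key=lambda e: (val_loss_by_epoch[e], e))[:keep_best]
    # Periodic snapshots protect against loss and audio quality disagreeing.
    periodic = [e for e in epochs if e % keep_every == 0]
    latest = epochs[-1]  # always keep the most recent epoch for resuming
    return set(best) | set(periodic) | {latest}


def checkpoints_to_delete(
    val_loss_by_epoch: Dict[int, float],
    keep_best: int = 3,
    keep_every: int = 3,
) -> List[int]:
    """Return the epochs that are safe to delete under the policy above."""
    keep = checkpoints_to_keep(val_loss_by_epoch, keep_best, keep_every)
    return sorted(set(val_loss_by_epoch) - keep)
```

For large full-SFT checkpoints a rule like this bounds disk usage while still protecting the early epochs that turned out to matter in this run.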
Streaming and long-form reality check
CosyVoice is attractive when a demo clip sounds natural, but production use depends on whether the same voice survives streaming, prompt variation, and longer text. Treat CosyVoice as a stronger candidate for pre-produced narration than for live voice agents until your own first-audio latency and chunk-stability checks pass.
| Failure mode | What users notice | What to test before shipping |
| --- | --- | --- |
| First-chunk latency | The assistant feels slow even if the full clip renders quickly | Measure time to first audible audio, not only full-file real-time factor |
| Stutter | Short repeats or broken rhythm during streaming | Test the same text in batch and streaming paths, then compare artifacts |
| Voice drift | The speaker changes across chunks or segments | Reuse the same reference handling and compare multi-paragraph outputs |
| Swallowed words | Words disappear or merge in longer prompts | Keep a held-out script with names, numbers, punctuation, and short clauses |
| Prompt leakage | Formatting or reference text affects the spoken output | Test clean text, markdown-like text, and segmented text separately |
| Long-text collapse | Quality drops after the first 20 to 30 seconds | Evaluate the actual passage length you plan to publish, not only a one-sentence sample |
| Checkpoint churn | A later checkpoint sounds worse even when loss looks better | Preserve checkpoints and run listening checks every few epochs |
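The first-chunk latency check can be sketched independently of any particular TTS API. The helper below assumes only that your streaming path yields audio chunks as an iterable of bytes; `measure_streaming_latency` and its injectable `clock` parameter are hypothetical names for this example, and wiring it to CosyVoice's own streaming interface is left to your integration.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_streaming_latency(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a chunk stream and return (time_to_first_chunk, total_time, n_chunks).

    Times are in seconds, measured from the moment this function is called.
    Pass a lazy generator so that synthesis happens inside the timed loop.
    """
    start = clock()
    first = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # empty chunks carry no audio, so they do not count
        if first is None:
            first = clock() - start
        count += 1
    total = clock() - start
    if first is None:
        raise ValueError("stream produced no audio chunks")
    return first, total, count
```

Reporting the first-chunk time next to the total time makes the gap visible: a run can have a good full-file real-time factor and still feel slow if the first value is large.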
The practical split is simple: use CosyVoice when you can batch-generate, review, cut, and regenerate audio before publishing. Prefer Qwen3-TTS for latency-first voice agents, VoxCPM 1.5 for the lowest-friction fine-tuned production path, or the TTS model decision tree when licensing, hardware, or voice-cloning quality is the first constraint.
What the LoRA rerun fixed
The corrected LoRA run addressed all three failure modes:
1. LoRA instead of full SFT
└─ 2.16M trainable params (0.44%) via PEFT
└─ 8.3 MB checkpoints (480x smaller)
└─ No catastrophic forgetting - base model frozen
2. Stable training curve
└─ 12 epochs of genuine improvement (vs 1 for full SFT)
└─ Grad norm stayed at 1.4-4.0 (no explosion)
└─ CV loss bottomed at 3.044 (epoch 12)
3. All checkpoints preserved
└─ 200 checkpoints at 8.3 MB each = 1.7 GB total
└─ Best epoch (12) available for evaluation
└─ Compare: old run had 43 checkpoints at ~4 GB each = 174 GB
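The storage figures above can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted in this post and assumes decimal units (1 GB = 1000 MB), which is a guess about how the sizes were reported.

```python
MB_PER_GB = 1000  # assumption: sizes are reported in decimal units

# Figures quoted in this post
lora_ckpt_mb = 8.3
lora_ckpt_count = 200
sft_total_gb = 174
sft_ckpt_count = 43

# Total disk used by all LoRA adapters
lora_total_gb = lora_ckpt_mb * lora_ckpt_count / MB_PER_GB

# Average size of one full-SFT checkpoint
sft_ckpt_gb = sft_total_gb / sft_ckpt_count

# How many times smaller a LoRA adapter is than a full-SFT checkpoint
size_ratio = sft_ckpt_gb * MB_PER_GB / lora_ckpt_mb

print(f"LoRA total: {lora_total_gb:.2f} GB")
print(f"Full-SFT checkpoint: {sft_ckpt_gb:.2f} GB each")
print(f"Size ratio: about {size_ratio:.0f}x")
```

The exact ratio depends on rounding: dividing 174 GB by 43 gives a little over 4 GB per checkpoint, so the ratio lands slightly above the 480x obtained from a flat 4 GB.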
Is CosyVoice LoRA fine-tuning worth trying on a 24GB GPU?
Yes. The LoRA rerun peaked at 6.96 GB VRAM on an RTX 3090 Ti. The companion repo (instavar/cosyvoice3-lora-finetuning) has the PEFT-integrated training script and 9 documented pitfalls.
Why did the first run fail?
The first run used full SFT (all 506M parameters), not LoRA. This caused catastrophic forgetting after epoch 1. The corrected LoRA rerun (2.16M params, LR 5e-5) trained stably for 12 epochs.
What learning rate should I use?
5e-5 for LoRA (not 1e-5, which was used for full SFT). The first run's artifact path female01_cv3_lr1e5_run1 reflects the full SFT LR. The LoRA rerun used female01_cv3_lora_lr5e5_run1.