CosyVoice 2 vs 3 - Voice Cloning Quality Compared (2026)
07 Feb 2026, 00:00 Z
Experiment Status: LoRA rerun completed - best checkpoint at epoch 12. Listening evaluation pending.
60-second takeaway
The first CosyVoice3 run (full SFT) failed after epoch 1 with catastrophic overfitting. A corrected LoRA rerun (2.16M params via PEFT) reached its best checkpoint at epoch 12 with stable training.
CosyVoice2 baseline audio remains the control. The LoRA rerun tooling, along with nine documented pitfalls, is published at instavar/cosyvoice3-lora-finetuning.
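For intuition on where a parameter count like 2.16M comes from, here is a minimal sketch of LoRA's trainable-parameter arithmetic. The layer dimensions, rank, and module choices below are illustrative assumptions, not the configuration used in this run:

```python
def lora_param_count(layers, rank):
    """Trainable params added by LoRA: each adapted weight W (d_out x d_in)
    gains two low-rank factors, A (rank x d_in) and B (d_out x rank)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in layers)

# Hypothetical example: 24 transformer blocks, adapting two 1024x1024
# projections per block at rank 16 -- all numbers invented for illustration.
layers = [(1024, 1024)] * (24 * 2)
print(lora_param_count(layers, rank=16))  # 1,572,864 for this toy config
```

The same formula, evaluated against the actual target modules and rank reported by PEFT, should reproduce the run's 2.16M figure.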
Where this fits
- For founders: do not deploy this CosyVoice3 run as-is.
- For engineers: use this page as a diagnostic handoff for the next rerun.
Series overview:
For the full cross-model comparison, see the TTS Model Decision Tree - CosyVoice 3 is recommended for pre-produced content when zero-shot consistency matters most.
Result summary
CosyVoice2 is included as a baseline/control and produced acceptable qualitative output on our selected sample. CosyVoice3 finetuning in this run did not reach production-ready quality, with unstable long-form behavior and weaker linguistic consistency in listening checks.
Audio evidence
CosyVoice2 baseline/control
CosyVoice3 representative sample from this run
What this does and does not mean
This conclusion is specific to our exact setup: dataset shape, checkpoint path, prompt handling, and inference configuration. It should not be interpreted as a universal model-family ranking.
Likely contributors in this run
- Checkpoint quality drift after early epochs.
- Sensitivity to prompt formatting and long-text decoding behavior.
- Operational fragility from large checkpoint churn and unstable inference zones.
LoRA rerun update
The corrected LoRA rerun has been completed. Key results:
- Best checkpoint: epoch 12 (CV loss 3.044)
- Training stability: 12 epochs of improvement before divergence (vs 1 for full SFT)
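The best-checkpoint logic above amounts to picking the epoch with the lowest cross-validation loss and discarding everything after the curve turns. A minimal sketch (the loss values besides epoch 12's 3.044 are invented for illustration):

```python
def best_checkpoint(cv_losses):
    """Select the epoch with the lowest CV loss; later epochs with
    rising loss indicate divergence and are ignored."""
    best_epoch = min(cv_losses, key=cv_losses.get)
    return best_epoch, cv_losses[best_epoch]

# Illustrative loss curve; only epoch 12 / 3.044 come from the run.
losses = {10: 3.101, 11: 3.062, 12: 3.044, 13: 3.090, 14: 3.155}
print(best_checkpoint(losses))  # (12, 3.044)
```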