CosyVoice 2 vs 3 - Voice Cloning Quality Compared (2026)
07 Feb 2026, 00:00 Z
Experiment Status: LoRA rerun completed - best checkpoint at epoch 12. Listening evaluation pending.
60-second takeaway
The first CosyVoice3 run (full SFT) failed after epoch 1 with catastrophic overfitting. A corrected LoRA rerun (2.16M trainable parameters via PEFT) trained stably and reached its best checkpoint at epoch 12.
CosyVoice2 baseline audio remains the control. The LoRA rerun tools and 9 pitfalls are published at instavar/cosyvoice3-lora-finetuning.
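As a rough illustration of how a "best checkpoint" and an overfitting onset can be read off per-epoch loss logs, here is a minimal sketch. The function names and every loss value below are hypothetical placeholders chosen only to mirror the shape described above (divergence right after epoch 1 for full SFT, a minimum at epoch 12 for LoRA); they are not measurements from these runs and not part of the published tools.

```python
from typing import Optional, Sequence


def best_epoch(val_losses: Sequence[float]) -> int:
    """Return the 1-based epoch with the lowest validation loss."""
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    # Ties resolve to the earliest epoch, because min() keeps the first minimum.
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


def diverges_after(
    train_losses: Sequence[float], val_losses: Sequence[float]
) -> Optional[int]:
    """Return the last 1-based epoch before training loss keeps falling
    while validation loss starts rising, or None if that never happens."""
    for i in range(1, min(len(train_losses), len(val_losses))):
        train_fell = train_losses[i] < train_losses[i - 1]
        val_rose = val_losses[i] > val_losses[i - 1]
        if train_fell and val_rose:
            return i  # index i is epoch i + 1, so epoch i was the last good one
    return None


# Hypothetical placeholder curves, NOT logged values from the actual runs.
sft_train = [2.0, 1.2, 0.7, 0.4]
sft_val = [2.1, 2.6, 3.3, 4.0]
lora_val = [2.10, 1.95, 1.84, 1.76, 1.70, 1.65, 1.61,
            1.58, 1.56, 1.545, 1.535, 1.53, 1.54, 1.55]

if __name__ == "__main__":
    print("full SFT diverges after epoch:", diverges_after(sft_train, sft_val))
    print("LoRA best epoch:", best_epoch(lora_val))
```

A loss-based pick like this only nominates a checkpoint; it does not replace the listening evaluation.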
This page is the quality-comparison view for CosyVoice 2 vs CosyVoice 3 voice cloning in 2026. Read it alongside the full CosyVoice fine-tuning guide, which covers the data, VRAM, LoRA, and rerun details.
Where this fits
- For founders: do not deploy this CosyVoice3 run as-is.
- For engineers: use this page as a diagnostic handoff for the next rerun.
Quick read:
- CosyVoice2 remains the cleaner control sample from this evidence set.
- CosyVoice3 is still attractive for zero-shot quality, but this specific fine-tuned run did not clear production listening review.
- The corrected CosyVoice3 LoRA rerun is operationally healthier than the full-SFT run, but quality promotion still depends on listening evaluation.
Series overview:
For the full cross-model comparison, see the TTS Model Decision Tree: CosyVoice 3 is recommended for pre-produced content when zero-shot consistency matters most.
Result summary
CosyVoice2 is included as a baseline/control and produced acceptable qualitative output on our selected sample. CosyVoice3 fine-tuning in this run did not reach production-ready quality: listening checks showed unstable long-form behavior and weaker linguistic consistency.