CosyVoice 2 vs 3 - Voice Cloning Quality Compared (2026)


07 Feb 2026, 00:00 Z

Experiment Status: LoRA rerun completed - best checkpoint at epoch 12. Listening evaluation pending.

60-second takeaway

The first CosyVoice3 run (full SFT) failed after epoch 1 with catastrophic overfitting. A corrected LoRA rerun (2.16M trainable parameters via PEFT) trained stably and reached its best checkpoint at epoch 12.
CosyVoice2 baseline audio remains the control. The LoRA rerun tooling and a list of 9 pitfalls are published at instavar/cosyvoice3-lora-finetuning.

This page is the quality-comparison view for CosyVoice 2 vs CosyVoice 3 voice cloning. Read it alongside the full CosyVoice fine-tuning guide, which covers the data, VRAM, LoRA, and rerun details.

Where this fits

  • For founders: do not deploy this CosyVoice3 run as-is.
  • For engineers: use this page as a diagnostic handoff for the next rerun.

Quick read:

  • CosyVoice2 remains the cleaner control sample from this evidence set.
  • CosyVoice3 is still attractive for zero-shot quality, but this specific fine-tuned run did not clear production listening review.
  • The corrected CosyVoice3 LoRA rerun is operationally healthier than the full-SFT run, but quality promotion still depends on listening evaluation.
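The gating logic implied by the last point can be sketched in a few lines. This is a minimal illustration, not the tooling from the rerun: it assumes the best checkpoint is chosen by lowest validation loss (the selection metric is not stated on this page), and the names EpochResult, best_checkpoint, and promotion_decision, as well as the loss values, are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EpochResult:
    """Hypothetical per-epoch record; the field names are assumptions."""
    epoch: int
    val_loss: float


def best_checkpoint(results: Sequence[EpochResult]) -> EpochResult:
    """Return the epoch with the lowest validation loss (earliest epoch wins ties)."""
    if not results:
        raise ValueError("no epoch results to compare")
    return min(results, key=lambda r: (r.val_loss, r.epoch))


def promotion_decision(best: EpochResult, listening_passed: Optional[bool]) -> str:
    """Gate promotion on the listening evaluation, not on loss alone.

    listening_passed is None while the evaluation is still pending.
    """
    if listening_passed is None:
        return f"hold: epoch {best.epoch} awaits listening evaluation"
    if listening_passed:
        return f"promote: epoch {best.epoch}"
    return f"reject: epoch {best.epoch} failed listening evaluation"


# Placeholder losses for illustration only, NOT measurements from this experiment.
history = [EpochResult(1, 0.9), EpochResult(2, 0.7), EpochResult(3, 0.8)]
best = best_checkpoint(history)
print(promotion_decision(best, listening_passed=None))
```

The point of separating the two functions is that a healthy training curve only nominates a checkpoint; the listening result is what decides whether it ships.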

Series overview:

For the full cross-model comparison, see the TTS Model Decision Tree. CosyVoice 3 is recommended for pre-produced content when zero-shot consistency matters most.

Result summary

CosyVoice2 is included as a baseline/control and produced acceptable qualitative output on our selected sample. CosyVoice3 fine-tuning in this run did not reach production-ready quality: listening checks showed unstable long-form behavior and weaker linguistic consistency.

Audio evidence

CosyVoice2 baseline/control

CosyVoice3 representative sample from this run
