CosyVoice 2 vs 3 - Voice Cloning Quality Compared (2026)

07 Feb 2026, 00:00 Z

Experiment Status: LoRA rerun completed - best checkpoint at epoch 12. Listening evaluation pending.

60-second takeaway

The first CosyVoice3 run (full SFT) failed after epoch 1 with catastrophic overfitting. A corrected LoRA rerun (2.16M trainable parameters via PEFT) trained stably and reached its best checkpoint at epoch 12.
The CosyVoice2 baseline audio remains the control. The LoRA rerun tooling and a list of 9 pitfalls are published at instavar/cosyvoice3-lora-finetuning.
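
For context on where a figure like 2.16M trainable parameters comes from: in the standard LoRA formulation, each adapted linear layer gains two low-rank matrices, so the added parameter count is rank × (d_in + d_out) per layer. The sketch below does that arithmetic in plain Python. The layer shapes, layer count, and rank are hypothetical placeholders, not the configuration of this run, and the result is not expected to reproduce the 2.16M figure.

```python
from typing import Iterable, Tuple


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape (d_out, d_in)."""
    if min(d_in, d_out, rank) <= 0:
        raise ValueError("dimensions and rank must be positive")
    # A is (rank x d_in) and B is (d_out x rank); the base weight stays frozen.
    return rank * d_in + d_out * rank


def total_lora_params(layer_shapes: Iterable[Tuple[int, int]], rank: int) -> int:
    """Sum the LoRA parameters over every adapted layer, given (d_in, d_out) pairs."""
    return sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in layer_shapes)


# Hypothetical example only: 24 blocks with 4 adapted 896x896 projections each.
HYPOTHETICAL_SHAPES = [(896, 896)] * (24 * 4)

if __name__ == "__main__":
    total = total_lora_params(HYPOTHETICAL_SHAPES, rank=8)
    print(f"{total / 1e6:.2f}M trainable LoRA parameters")
```

Swapping in the real adapted-layer shapes and rank from a training config should give a number that can be checked against what the training log reports as trainable.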

Where this fits

  • For founders: do not deploy either CosyVoice3 run yet. The full SFT run failed, and the LoRA rerun still awaits listening evaluation.
  • For engineers: use this page as a diagnostic handoff for the next rerun.

For the full cross-model comparison, see the TTS Model Decision Tree: CosyVoice3 is recommended for pre-produced content when zero-shot consistency matters most.

Result summary

CosyVoice2 is included as a baseline/control and produced acceptable qualitative output on our selected sample. The first CosyVoice3 fine-tuning run (full SFT) did not reach production-ready quality: long-form output was unstable, and linguistic consistency was weaker in listening checks.

Audio evidence

CosyVoice2 baseline/control

CosyVoice3 representative sample from this run

What this does and does not mean

This conclusion is specific to our exact setup: dataset shape, checkpoint path, prompt handling, and inference configuration. It should not be interpreted as a universal model-family ranking.

Likely contributors in this run

  • Checkpoint quality drift after early epochs.
  • Sensitivity to prompt formatting and long-text decoding behavior.
  • Operational fragility from large checkpoint churn and unstable inference zones.

LoRA rerun update

The corrected LoRA rerun has been completed. Key results:

  • Best checkpoint: epoch 12 (CV loss 3.044)
  • Training stability: 12 epochs of improvement before divergence (vs 1 for full SFT)
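
The two results above can be derived mechanically from a per-epoch CV loss log. The following is a minimal sketch, assuming the losses are available as a plain list ordered by epoch. The function names are hypothetical and not part of the published tooling. In the example curves, only the epoch-12 value of 3.044 comes from this page; every other number is a made-up placeholder chosen to mimic the reported shape.

```python
from typing import Sequence, Tuple


def best_checkpoint(cv_losses: Sequence[float]) -> Tuple[int, float]:
    """Return (1-based epoch, loss) for the lowest CV loss; earliest epoch wins ties."""
    if not cv_losses:
        raise ValueError("no epochs recorded")
    index = min(range(len(cv_losses)), key=lambda i: cv_losses[i])
    return index + 1, cv_losses[index]


def epochs_of_improvement(cv_losses: Sequence[float]) -> int:
    """Count the leading epochs in which CV loss kept setting a new minimum."""
    best = float("inf")
    count = 0
    for loss in cv_losses:
        if loss >= best:
            break  # first epoch that fails to improve: treat as divergence
        best = loss
        count += 1
    return count


# Placeholder curves: only 3.044 at epoch 12 is a reported value.
LORA_PLACEHOLDER = [4.2, 4.0, 3.9, 3.8, 3.7, 3.6, 3.5, 3.4, 3.3, 3.2, 3.1, 3.044, 3.2, 3.5]
FULL_SFT_PLACEHOLDER = [3.9, 4.6, 5.3]

if __name__ == "__main__":
    print(best_checkpoint(LORA_PLACEHOLDER), epochs_of_improvement(LORA_PLACEHOLDER))
    print(best_checkpoint(FULL_SFT_PLACEHOLDER), epochs_of_improvement(FULL_SFT_PLACEHOLDER))
```

Treating the first non-improving epoch as divergence is a deliberately strict choice. A noisier loss curve would likely need a patience window before declaring that training has turned.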
