Best Open-Source TTS Models for Production in 2026


25 Mar 2026, 00:00 Z

60-second takeaway
We ran a consistent single-speaker benchmark on four open-source TTS models using IMDA NSC FEMALE_01 on an RTX 3090 Ti (24GB).
VoxCPM 1.5 and Qwen3-TTS 1.7B both produced deployable outputs. IndexTTS2 gave a stable full-SFT baseline. CosyVoice3 finetuning did not reach production quality in this run (rerun pending).
If you need something deployable today on a 24GB GPU, start with VoxCPM or Qwen3-TTS LoRA.

What this benchmark covers

This is a practitioner-oriented comparison, not an academic leaderboard. We evaluated four models under the same conditions:

  • Dataset: IMDA NSC FEMALE_01 - a single-speaker set with a natural Singaporean English accent
  • Hardware: one NVIDIA RTX 3090 Ti (24 GB VRAM)
  • Goal: produce voice-cloned audio suitable for AI-generated video narration (A-roll use case)
  • Evaluation: qualitative listening on naturalness, long-text stability, accent retention, and operational friction

We are not measuring WER or MOS scores from automated tools. We are measuring whether the output sounds production-ready to a human listener on a video platform.

The four models

VoxCPM 1.5

VoxCPM 1.5 uses a LoRA finetuning path that fits within 24GB VRAM without modification. Training is straightforward with standard train/val splits.
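The split itself is unremarkable. As a minimal sketch, here is a deterministic train/val split over utterance IDs; the `FEMALE_01_*` ID format and the 5% validation ratio are illustrative assumptions, not VoxCPM's actual tooling:

```python
import random

def split_manifest(utterance_ids, val_fraction=0.05, seed=42):
    """Shuffle utterance IDs deterministically and split into train/val lists."""
    ids = sorted(utterance_ids)      # stable order before shuffling
    rng = random.Random(seed)        # fixed seed -> reproducible split
    rng.shuffle(ids)
    n_val = max(1, int(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]  # (train, val)

# Placeholder IDs in the style of a single-speaker corpus
ids = [f"FEMALE_01_{i:05d}" for i in range(1000)]
train, val = split_manifest(ids)
print(len(train), len(val))  # 950 50
```

Pinning the seed matters more than the ratio: it lets you compare checkpoints across runs against the same held-out clips.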

| Dimension | Result |
| --- | --- |
| Finetuning approach | LoRA |
| Best checkpoint (this run) | step_0004000 |
| Long-text stability | Good |
| Prompt sensitivity | Moderate - use clean prompt clips |
| Production-ready? | Yes |
