Best Open-Source TTS Models for Production in 2026
25 Mar 2026
60-second takeaway
We ran a consistent single-speaker benchmark on four open-source TTS models using IMDA NSC FEMALE_01 on an RTX 3090 Ti (24GB).
VoxCPM 1.5 and Qwen3-TTS 1.7B both produced deployable outputs. IndexTTS2 gave a stable full-SFT baseline. CosyVoice3 finetuning did not reach production quality in this run (rerun pending).
If you need something deployable today on a 24GB GPU, start with VoxCPM or Qwen3-TTS LoRA.
What this benchmark covers
This is a practitioner-oriented comparison, not an academic leaderboard. We evaluated four models under the same conditions:
- Dataset: IMDA NSC FEMALE_01, a single-speaker set with a natural Singaporean English accent
- Hardware: one NVIDIA RTX 3090 Ti (24 GB VRAM)
- Goal: produce voice-cloned audio suitable for AI-generated video narration (A-roll use case)
- Evaluation: qualitative listening on naturalness, long-text stability, accent retention, and operational friction
We are not measuring WER or MOS scores from automated tools. We are measuring whether the output sounds production-ready to a human listener on a video platform.
The four models
VoxCPM 1.5
VoxCPM 1.5 uses a LoRA finetuning path that fits within 24GB VRAM without modification. Training is straightforward with standard train/val splits.
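A "standard train/val split" here just means holding out a small, reproducible slice of the single-speaker clips for validation. As a minimal sketch (the directory layout, 95/5 ratio, and seed are illustrative assumptions, not values from this benchmark):

```python
import random
from pathlib import Path

def split_manifest(wav_dir: str, val_fraction: float = 0.05, seed: int = 42):
    """Split a single-speaker corpus into (train, val) file lists.

    `wav_dir` is a hypothetical directory of FEMALE_01 clips; the
    ratio and seed are example values, not from the benchmark run.
    """
    files = sorted(Path(wav_dir).glob("*.wav"))  # sort first for determinism
    rng = random.Random(seed)                    # fixed seed -> reproducible shuffle
    rng.shuffle(files)
    n_val = max(1, int(len(files) * val_fraction))
    return files[n_val:], files[:n_val]
```

Keeping the split deterministic matters when you compare checkpoints across runs: the validation clips must stay identical or the comparison is meaningless.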
| Dimension | Result |
| --- | --- |
| Finetuning approach | LoRA |
| Best checkpoint (this run) | step_0004000 |
| Long-text stability | Good |
| Prompt sensitivity | Moderate; use clean prompt clips |
| Production-ready? | Yes |