60-second takeaway
There are now nine credible open-source TTS models you could deploy. The problem is not finding one that works — it is picking the one that fits your constraints. We benchmarked most of them on IMDA NSC FEMALE_01 using an RTX 3090 Ti (24GB). This article gives you a decision tree: start with your use case (real-time streaming, audiobook, edge deployment, multilingual, fine-tuned voice cloning) and land on a specific model with a specific configuration. If you just want a quick answer: Qwen3-TTS for real-time, CosyVoice 3 or VoxCPM 1.5 for pre-produced content, Chatterbox for fast fine-tuning, Kokoro for edge.
Where this fits
For founders: Your team is about to pick a TTS model. This decision tree prevents the most expensive mistake — choosing a model that is technically impressive but wrong for your deployment constraints. A real-time product cannot tolerate CosyVoice 3's compute overhead. An audiobook pipeline does not need Qwen3-TTS's 97ms latency. Match the model to the use case before writing any integration code.
For engineers: This is the routing logic we use internally. Each recommendation is grounded in first-party benchmarks — not paper claims, not leaderboard scores, not vibes. We include the specific checkpoints, LoRA scales, and failure modes we observed so you can reproduce or skip straight to deployment.
How to use this decision tree
Start from your use case, not from a model name. The models below overlap in capability — most of them can do voice cloning, most of them fit on a 24GB GPU, and most of them produce decent output in zero-shot mode. The differences emerge when you add constraints: latency budget, fine-tuning requirements, target language, or deployment hardware.
Read the decision tree table first. If your situation maps cleanly to one row, jump to that model's deep-dive section. If you are torn between two models, the deep-dive sections include the trade-offs we observed.
The decision tree
| Your situation | Recommended model | Why |
|---|---|---|
| Need real-time streaming (< 100 ms latency) | Qwen3-TTS 1.7B | 97 ms first-packet latency, robust to formatting variation |
| Voice cloning | CosyVoice 3 (zero-shot) or VoxCPM 1.5 (fine-tuned) | Flow-matching quality for zero-shot; VoxCPM if fine-tuning is needed |
| Deploy something today, minimal setup | VoxCPM 1.5 | Lowest-friction LoRA path; step-4000 checkpoint deployable |
| Need LoRA adapter control post-training | Qwen3-TTS 1.7B | Scale parameter (0.3–0.35) tunes output without retraining |
| Full SFT for maximum voice fidelity | IndexTTS2 | Most reproducible full-SFT baseline; best checkpoint at step 14000 |
| Edge / on-device deployment | Kokoro (82M) | 82M parameters, a fraction of the compute cost |
| Multilingual (EN + CN + JP) | Fish Speech S2 Pro | 300K+ hours multilingual training data; ELO 1339 in TTS Arena; LoRA fine-tuned on FEMALE_01 |
| Non-American English accent retention | VoxCPM 1.5 or IndexTTS2 | Both retained the IMDA NSC FEMALE_01 Singaporean English accent well |
| Fine-tuning with minimal reference audio | Qwen3-TTS 1.7B | 3-second minimum, 10–15 s optimal, then plateau |
| Fastest fine-tuning turnaround | Chatterbox (0.5B) | 512 samples in 2 min 20 s on a single GPU — fastest fine-tuning of any model here |
| Maximum expressiveness + voice cloning | Higgs Audio V2 (3B) | 10M+ hours pre-training, top trending on HuggingFace, Llama 3.2 backbone |
Real-time streaming: Qwen3-TTS 1.7B
If your product needs first-packet latency under 100ms, Qwen3-TTS is the only model in this list that reliably delivers it. We measured 97ms first-packet latency on our RTX 3090 Ti benchmark — fast enough for conversational interfaces, live dubbing, and interactive voice agents.
What makes it work for real-time:
Streaming-native architecture — the model generates audio incrementally, not as a single batch
10 languages supported out of the box
3-second voice cloning — you can enrol a new speaker from a single short clip
Robust to input formatting variation — handles punctuation, numbers, and abbreviations without special preprocessing
LoRA fine-tuning: Supported, and this is where the model shines post-training. The lora_scale parameter (0.3–0.35 optimal in our benchmark) lets you control how much the adapter influences output without retraining. This is a deployment-time knob, not a training-time decision. Run a 5-sample listening test at scales 0.2, 0.3, 0.35, and 0.5 before committing.
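The scale sweep is easy to script. Below is a hedged sketch: `synthesize()` is a placeholder we made up to stand in for your actual adapter-scaled inference call — wire in the real Qwen3-TTS pipeline before using it.

```python
from pathlib import Path

# Scales and sample count match the listening-test recommendation above.
SCALES = [0.2, 0.3, 0.35, 0.5]
SAMPLES = [
    "The quick brown fox jumps over the lazy dog.",
    "Please call me back at half past three this afternoon.",
    "Turn left at the second traffic light, then go straight.",
    "The meeting has been moved to Thursday morning.",
    "Thank you for holding; someone will be with you shortly.",
]

def synthesize(text: str, lora_scale: float) -> bytes:
    """Placeholder — swap in the real adapter-scaled inference call."""
    return f"{lora_scale}:{text}".encode()

def run_scale_sweep(out_dir: str = "scale_sweep") -> list[Path]:
    """Write one file per (scale, sample) pair for blind listening."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for scale in SCALES:
        for i, text in enumerate(SAMPLES):
            path = out / f"scale_{scale}_sample_{i}.wav"
            path.write_bytes(synthesize(text, lora_scale=scale))
            written.append(path)
    return written
```

Listen to the resulting 20 files blind (shuffle them first) and pick the scale, not the file — the point is to commit to one deployment-time `lora_scale` value.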
Current limitation: Single-speaker fine-tuning only. Multi-speaker LoRA is not yet supported. If you need multiple fine-tuned voices, you need separate adapters.
Known pitfalls (from our fine-tuning runs and companion repo):
| Pitfall | Symptom | Fix |
|---|---|---|
| Double label-shift bug in `sft_12hz.py` | Speech progressively accelerates each epoch until unintelligible | Apply PR #178 — replace with `F.cross_entropy()` to avoid HuggingFace's internal shift |
| Missing `text_projection` call (line 93) | Hard crash on 0.6B model; silent wrong embeddings on 1.7B | Apply PR #188 (commit 680d4e9) |
| Default LR too high (2e-5) | Pure noise output, infinite generation (no EOS), apparent divergence | Use 2e-6 instead (validated in GitHub issue #39) |
| Audio not at 24 kHz | Crash deep in training with no early warning | Resample all audio to 24 kHz before codec prep: `ffmpeg -i in.wav -ar 24000 out.wav` |
| LoRA scale 1.0 at inference | Over-steered, forced-sounding output | Use 0.3–0.35; run a 5-sample listening test before committing |
| EOS token failures (~0.5% of inferences) | Infinite token generation, hangs | Set an explicit `eos_token_id` list and a `max_new_tokens` cap |
| Cold-start decoder distortion | First inference in a new process produces corrupted audio | Prepend silence codec tokens as warm-up, then trim |
| Progressive timbre shift across chunks | Voice changes between long-text chunks | Fix the random seed before each chunk; extract the speaker embedding once and reuse it |
| Val evaluation crash on small val sets | `RuntimeError: zero-dimensional tensor cannot be concatenated` | Bug in the evaluation function — needs a guard for an empty loss tensor |
| Inference segfaults mid-epoch sweep | Process crashes partway through checkpoint evaluation | Batch inference defensively; do not assume a loop completes |
| Val loss plateaus after epoch 10 | Train loss keeps dropping but val loss stalls at ~10.3 | Stop at epoch 10 — further training overfits without quality gain |
The double label-shift bug is the most impactful: it affects every training run on the official script and is not documented in the upstream README. If your fine-tuned output sounds increasingly fast with each epoch, this is almost certainly the cause.
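Why a double shift produces accelerating speech is easier to see with toy data. This is a torch-free illustration using list indices as token positions; the actual fix is PR #178's direct `F.cross_entropy()` call.

```python
def hf_style_loss_alignment(logits, labels):
    """What a HuggingFace CausalLM loss does internally:
    position t's logits are scored against labels[t + 1]."""
    return logits[:-1], labels[1:]

tokens = [10, 11, 12, 13, 14]

# Correct: hand the loss UNSHIFTED labels; it shifts exactly once.
_, targets_ok = hf_style_loss_alignment(tokens, tokens)
assert targets_ok == [11, 12, 13, 14]   # each position predicts the next token

# Bug: the training script pre-shifts labels, then the loss shifts again.
pre_shifted = tokens[1:]
_, targets_buggy = hf_style_loss_alignment(tokens[:-1], pre_shifted)
assert targets_buggy == [12, 13, 14]    # every target is one token TOO FAR
# ahead — the model learns to skip forward, which you hear as speech that
# gets a little faster every epoch.
```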
Pre-produced content: CosyVoice 3 or VoxCPM 1.5
Pre-produced content (audiobooks, video narration, podcast intros) does not need sub-100ms latency. It needs the highest possible naturalness and consistency across long passages. Two models fit this use case, and which one you pick depends on whether you need fine-tuning.
CosyVoice 3 — best zero-shot quality
CosyVoice 3 uses a flow matching architecture that produces extremely consistent zero-shot output. If you have a reference clip and do not want to fine-tune, this is the model to start with.
Strengths:
Flow matching produces smooth, natural prosody even on first attempt
Extremely high speaker consistency in zero-shot mode
Strong on long-form passages — audiobook chapters, 5-minute narration blocks
Trade-offs:
Higher compute cost than Qwen3-TTS or VoxCPM at inference time
Our fine-tuning run did not reach production quality (rerun pending) — if you need fine-tuning, use VoxCPM instead
Not suitable for real-time streaming due to compute overhead
VoxCPM 1.5 — best fine-tuned path
If zero-shot is not enough and you need a fine-tuned voice for pre-produced content, VoxCPM 1.5 is the path of least resistance.
Strengths:
Lowest setup friction of any LoRA fine-tuning path we tested
Best checkpoint (step 4000) produced deployable output in our first run
No-prompt generation was the cleanest — prompted inference copied room noise from the reference clip
Strong accent retention on Singaporean English (IMDA NSC FEMALE_01)
Trade-offs:
LoRA only — no full SFT path if you need deeper model adaptation
Requires 44.1 kHz audio resampling before training (skip this and training diverges)
Known pitfalls:
44.1 kHz resampling required — VoxCPM expects 44.1 kHz audio (not 24 kHz like Qwen3-TTS). Skip this and training diverges silently — loss looks normal but output quality degrades.
Prompted inference copies room noise — if the reference clip has any background noise, it bleeds into the output. Use no-prompt generation for production; only use prompted mode when strong speaker lock is required.
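Because the resampling failure is silent, it is worth failing fast before training starts. A stdlib-only preflight sketch (WAV/PCM files only; the function name is ours, not part of VoxCPM):

```python
import wave

def assert_sample_rate(wav_path: str, expected_hz: int = 44100) -> int:
    """Raise before training if a WAV is not at the rate the model expects."""
    with wave.open(wav_path, "rb") as w:
        rate = w.getframerate()
    if rate != expected_hz:
        raise ValueError(
            f"{wav_path}: {rate} Hz, expected {expected_hz} Hz — "
            f"resample (e.g. ffmpeg -ar {expected_hz}) before training"
        )
    return rate
```

Run it over the whole corpus; one mismatched file is enough to degrade a run.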
Edge / on-device deployment: Kokoro
Kokoro is the outlier in this list — 82M parameters, compared to 1.7B for Qwen3-TTS. If your deployment target is a mobile device, an embedded system, or any environment where you cannot run a multi-billion-parameter model, Kokoro is the only viable option here.
What makes it work for edge:
82M parameters — fits on devices with limited RAM and no GPU
Quality is comparable to larger models at a fraction of the compute cost
Suitable for on-device voice assistants, kiosk applications, and offline scenarios
Trade-offs:
Smaller model capacity means less flexibility for unusual prosody or complex text
Fewer languages than Fish Speech S2 Pro or Qwen3-TTS
Fine-tuning ecosystem is less mature than the larger models
When to use Kokoro over a cloud-hosted larger model: When latency to a server is unacceptable (offline scenarios, high-frequency short utterances), when compute budget per inference is extremely constrained, or when you need to keep audio generation entirely on-device for privacy reasons.
Multilingual requirements: Fish Speech S2 Pro
If your product needs to serve English, Chinese, and Japanese from a single model, Fish Speech S2 Pro is the strongest option. It was trained on 300K+ hours of multilingual data — an order of magnitude more than most open-source TTS models.
First-party experience: We fine-tuned Fish Speech S2 Pro with LoRA on IMDA NSC FEMALE_01 (March 2026). The model is a 4.6B parameter DualAR Transformer — only 18.4M parameters are trainable via LoRA, making fine-tuning feasible on a single 24GB GPU. The training pipeline requires three preparation steps (VQ extraction, protobuf building, then LoRA training) — more setup than Chatterbox or VoxCPM but well-documented. Our pilot run (64 steps) completed in 6 minutes 25 seconds.
Strong on EN, CN, and JP — the three languages with the deepest training coverage
ELO 1339 in TTS Arena, which tracks human preference across multilingual scenarios
Active community and regular model updates
LoRA fine-tuning supported — 18.4M trainable params out of 4.6B total
Trade-offs:
Three-step data pipeline (VQ → protobuf → train) adds setup complexity compared to simpler models
DualAR architecture is more complex to deploy than standard autoregressive models
If you only need English, the compute overhead of multilingual capability is wasted
4.6B base model requires more VRAM at inference than smaller alternatives
Known pitfalls:
Three-step data pipeline has version sensitivity — the VQ extraction → protobuf building → LoRA training pipeline requires matching protobuf versions. Our first protobuf build attempt failed with ImportError: cannot import name 'builder' from 'google.protobuf.internal'. Upgrading protobuf fixed it, but this is not obvious from the docs.
VQ extraction is slow — processing 12,057 files (14.45 hours of FEMALE_01 audio) took ~22 minutes. Budget this into your pipeline setup time.
4.6B base model, 18.4M trainable — the LoRA approach keeps most parameters frozen, but the base model still needs to fit in VRAM for inference. Fits on 24GB, but leave headroom.
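Since the protobuf failure above surfaces mid-pipeline, a cheap preflight can catch it first. A hedged stdlib-only sketch — it probes the exact module path from the error message and returns False rather than crashing when protobuf is missing or too old:

```python
import importlib

def protobuf_builder_available() -> bool:
    """True iff google.protobuf.internal.builder imports — the module our
    Fish Speech protobuf build step failed on with an older protobuf."""
    try:
        importlib.import_module("google.protobuf.internal.builder")
        return True
    except ImportError:
        return False
```

If this returns False, upgrade protobuf before running the build step.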
When to use Fish Speech over Qwen3-TTS for multilingual: Qwen3-TTS supports 10 languages, but Fish Speech's deeper training data on EN/CN/JP produces more natural cross-language output for those three specifically. If your use case is primarily EN+CN+JP, Fish Speech wins. If you need broader language coverage (10 languages), Qwen3-TTS is more versatile.
Fast fine-tuning: Chatterbox
Chatterbox is a 0.5B parameter model built on Llama that has emerged as the fastest fine-tunable TTS model in the open-source ecosystem. In blind listening tests, it beats ElevenLabs at a 63.75% preference rate (Resemble AI benchmark). It was the #1 trending TTS model on HuggingFace.
First-party experience: We fine-tuned Chatterbox on IMDA NSC FEMALE_01 (512 samples, March 2026). Training completed in 2 minutes 20 seconds on a single GPU — orders of magnitude faster than IndexTTS2 or VoxCPM fine-tuning. Two checkpoints were saved (step 384 and step 512), with a final training loss of 1.26.
What makes it work for rapid iteration:
0.5B parameters means fine-tuning is extremely fast and fits comfortably on 24GB
The training pipeline uses standard HuggingFace Trainer — no custom training loop required
Supports both TTS and voice conversion (VC) modes
Multilingual support including English, Chinese, and more
Trade-offs:
Smaller model (0.5B) means less capacity for complex prosody compared to 1.7B+ models
Fine-tuning ecosystem is newer than Qwen3-TTS or IndexTTS2 — fewer community recipes
Quality evaluation on FEMALE_01 is still pending full listening comparison against our other benchmarked models
Known pitfalls:
Minimum dataset size — our pilot run with 64 samples produced loss=0.0 (no meaningful gradients). The model needs a minimum of ~256 samples to learn. Our 512-sample run produced loss 1.26 and generated usable checkpoints.
Quality evaluation pending — we have fine-tuned Chatterbox on FEMALE_01 but have not yet completed a formal listening comparison against IndexTTS2 or VoxCPM on the same corpus. The training metrics look healthy but training loss alone does not predict perceptual quality.
When to use Chatterbox: When you need to iterate on fine-tuning rapidly — testing multiple speaker profiles, dataset sizes, or training configurations. The 2-minute training cycle means you can run 30 experiments in the time it takes IndexTTS2 to complete one. Start with Chatterbox for exploration, then validate the best configuration against IndexTTS2 or VoxCPM for production deployment.
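Given the minimum-dataset pitfall above, a guard before launching a run saves a wasted (if short) training cycle. The threshold is empirical from our runs, not an official Chatterbox constant:

```python
# Below ~256 samples our Chatterbox runs produced loss=0.0 and no usable
# gradients; 512 samples trained normally (final loss 1.26).
MIN_TRAIN_SAMPLES = 256

def check_dataset_size(samples: list) -> int:
    """Refuse to start a fine-tune that cannot produce meaningful gradients."""
    if len(samples) < MIN_TRAIN_SAMPLES:
        raise ValueError(
            f"only {len(samples)} samples; Chatterbox fine-tuning needs "
            f"~{MIN_TRAIN_SAMPLES}+ in our experience"
        )
    return len(samples)
```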
Expressive generation: Higgs Audio V2
Higgs Audio V2 is a 3B parameter model built on Llama 3.2, pre-trained on over 10 million hours of audio data. It is currently the top trending TTS model on HuggingFace (as of March 2026), positioned as an industry-leading model for expressive audio generation and multilingual voice cloning.
What makes it notable:
10M+ hours of pre-training data — the largest training corpus of any open-source TTS model
Llama 3.2 3B backbone provides strong language understanding
Expressive generation: captures whisper, vibrato, breathiness, and emotional variation
Multilingual voice cloning from short reference clips
Trade-offs:
3B parameters requires more VRAM than Chatterbox (0.5B) or Kokoro (82M)
Newer model — community recipes and fine-tuning guides are still emerging
We have not yet run IMDA NSC benchmarks on this model — the data below is from community evaluations, not first-party
When to consider Higgs Audio V2: When expressiveness matters more than latency — audiobook narration, character voices, or content where emotional range is a quality differentiator. The 10M-hour pre-training corpus gives it a broader stylistic range than models trained on smaller datasets. If you need fine-grained control over speaking style without fine-tuning, Higgs Audio V2 is worth evaluating.
Status: Community-benchmarked only. We plan to add IMDA NSC FEMALE_01 benchmarks in a future update.
Fine-tuning: when zero-shot is not enough
Zero-shot voice cloning has improved dramatically — CosyVoice 3 and Qwen3-TTS both produce usable output from a single reference clip. But "usable" is not "production-ready" for every use case. Fine-tuning is worth the effort when:
You need consistent output across hundreds of utterances (audiobook-length content)
The target voice has distinctive characteristics that zero-shot does not capture (regional accent, specific speaking rhythm)
You are building a branded voice that must sound identical every time
Reference audio requirements
This is the most common question we get. The answer is more nuanced than "more is better":
| Reference audio length | What to expect |
|---|---|
| 3 seconds | Minimum viable for Qwen3-TTS voice cloning. Speaker identity is captured but prosody is approximate. |
| 10–15 seconds | Optimal range. Captures speaker identity, natural rhythm, and accent characteristics. |
| 15+ seconds | Diminishing returns. Quality plateaus — additional audio does not meaningfully improve output. |
| 30+ minutes (full dataset) | Required for full SFT (IndexTTS2). LoRA paths (VoxCPM, Qwen3-TTS) do not need this much. |
The practical takeaway: For LoRA fine-tuning, prepare 10–15 seconds of clean reference audio per speaker. For full SFT, prepare a full dataset (IMDA NSC FEMALE_01 is ~30 minutes of labelled speech). Going beyond the optimal range adds training time without improving output quality.
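The duration bands above can be enforced at intake. A stdlib sketch (WAV/PCM only; the band labels are ours, summarising the table):

```python
import wave

def classify_reference_clip(wav_path: str) -> str:
    """Map a reference clip's duration onto the bands described above."""
    with wave.open(wav_path, "rb") as w:
        seconds = w.getnframes() / w.getframerate()
    if seconds < 3:
        return "too short — below the 3 s cloning minimum"
    if seconds < 10:
        return "viable — identity captured, prosody approximate"
    if seconds <= 15:
        return "optimal — identity, rhythm, and accent captured"
    return "plateau — audio beyond 15 s adds little for LoRA paths"
```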
Full SFT: IndexTTS2
IndexTTS2 is the model to use when you need maximum voice fidelity and are willing to invest in full SFT training. In our benchmark it outperformed state-of-the-art baselines on WER, speaker similarity, and emotional fidelity. The official IndexTTS2 repo provides inference only — we wrote and open-sourced the fine-tuning pipeline: instavar/indextts2-finetuning.
Key details:
Best checkpoint: step 14000
Requires explicit checkpoint retention — do not rely on automatic deletion
Crash recovery during training requires careful resume management
Keep all checkpoints until you have completed a listening evaluation sweep
Known pitfalls:
Checkpoint auto-deletion — the default retention policy deletes older checkpoints before you can evaluate them. Keep ALL checkpoints until listening eval is complete. The best checkpoint (step 14000 in our run) was not the final step (15949).
transformers version pinning — requires exactly transformers==4.52.1. Older versions throw KeyError: 'qwen3' during model loading due to the Qwen emotion model inside IndexTTS2.
Crash recovery requires explicit management — if training crashes mid-run, resume logic needs manual intervention. Log the last successful step and resume from there.
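"Log the last successful step" can be as simple as a small state file written atomically (tmp file, then rename) so a crash mid-write cannot corrupt it. A minimal sketch — names and file layout are ours, not part of IndexTTS2:

```python
from pathlib import Path

def record_last_step(step: int, state_file: str = "last_good_step.txt") -> None:
    """Persist the last successfully completed training step, atomically."""
    tmp = Path(state_file + ".tmp")
    tmp.write_text(str(step))
    tmp.replace(state_file)  # rename is atomic on POSIX filesystems

def resume_from(state_file: str = "last_good_step.txt") -> int:
    """Return the step to resume from, or 0 for a fresh run."""
    p = Path(state_file)
    return int(p.read_text()) if p.exists() else 0
```

Call `record_last_step()` after each checkpoint save, and feed `resume_from()` into your trainer's resume argument after a crash.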
Emerging alternative: F5-TTS
F5-TTS is an emerging fine-tunable model with a smaller but growing community. It is worth evaluating if you are running voice cloning experiments and want an alternative to VoxCPM or IndexTTS2. The model is capable, but the ecosystem (recipes, community checkpoints, debugging resources) is less mature than the top-tier options.
When to consider F5-TTS: When you have already tried VoxCPM or IndexTTS2 and want to compare outputs, or when the F5-TTS community has published a recipe specifically matching your language or accent.
Hardware constraints
All nine models in this guide fit on a 24GB GPU for inference. Fine-tuning constraints are tighter:
| Model | Parameters | Inference (24GB GPU) | Fine-tuning (24GB GPU) | Notes |
|---|---|---|---|---|
| Qwen3-TTS 1.7B | 1.7B | ✅ | ✅ (LoRA) | SDPA backend recommended for long-text inference |
| CosyVoice 3 | — | ✅ | ⚠️ Rerun pending | Flow matching is compute-heavy; inference fits but is slower |
| IndexTTS2 | — | ✅ | ✅ (full SFT) | Keep all checkpoints; save more frequently than the default |
| VoxCPM 1.5 | — | ✅ | ✅ (LoRA) | 44.1 kHz resampling required before training |
| Kokoro | 82M | ✅ | ✅ | Fits on much less than 24GB |
| Fish Speech S2 Pro | 4.6B (18.4M trainable) | ✅ | ✅ (LoRA) | Three-step data pipeline; DualAR adds inference overhead |
| F5-TTS | — | ✅ | ✅ | Community recipes still maturing |
| Chatterbox | 0.5B | ✅ | ✅ | 512 samples in 2 min 20 s; fastest fine-tuning of any model |
| Higgs Audio V2 | 3B | ✅ | ✅ | Larger model; 10M+ hours pre-training |
Consumer GPU reality check: An RTX 3090, RTX 3090 Ti, or RTX 4090 (all 24GB class) can run every model here for both training and inference. You do not need an A100 or H100 to get started. The constraint is recipe availability and checkpoint management, not VRAM.
Cross-model patterns: what broke the same way everywhere
After running fine-tuning experiments across these models on the same IMDA NSC corpus, three patterns appeared consistently:
1. The best checkpoint is never the last one.
VoxCPM peaked at step 4000 (not later steps). IndexTTS2 peaked at step 14000 (not the final 15949). Qwen3-TTS peaked at epoch 10 (not epoch 17). This is not a coincidence — TTS models overfit to training prosody quickly, and the last checkpoint has the lowest training loss but not the best perceptual quality. Keep all checkpoints. Evaluate by listening, not by loss curve.
2. Sample rate mismatches are the #1 silent failure.
Qwen3-TTS requires 24 kHz. VoxCPM requires 44.1 kHz. IndexTTS2 works with both but prefers 44.1 kHz. Fish Speech defaults to its own codec sample rate. If you switch between models without checking sample rate, training may appear to work (loss decreases normally) but output quality is degraded. Always verify sample rate before every training run.
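The rates above are worth encoding as a preflight table so the check happens every run, not just when someone remembers. A sketch with the values from our runs (Fish Speech is omitted because its codec defines its own rate):

```python
# Required corpus sample rates per model, from the runs described above.
EXPECTED_SR = {
    "qwen3-tts": 24000,
    "voxcpm": 44100,
    "indextts2": 44100,  # accepts 24 kHz too, but 44.1 kHz is preferred
}

def preflight_sample_rate(model: str, observed_hz: int) -> None:
    """Abort before training if the corpus rate does not match the model."""
    expected = EXPECTED_SR.get(model)
    if expected is None:
        raise KeyError(f"no known sample rate for {model!r}")
    if observed_hz != expected:
        raise ValueError(
            f"{model} expects {expected} Hz, corpus is at {observed_hz} Hz"
        )
```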
3. Scale and LR defaults are wrong for fine-tuning.
Qwen3-TTS default LR (2e-5) causes noise; use 2e-6. Qwen3-TTS default LoRA scale (1.0) over-steers; use 0.3–0.35. These are not edge cases — they affect every fine-tuning run. No model's default hyperparameters are tuned for single-speaker fine-tuning on a small corpus. Always sweep before committing.
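Collected in one place, the overrides above look like this. Key names are illustrative — map them onto your training script's actual flags:

```python
# Overrides we'd apply before any single-speaker Qwen3-TTS fine-tune,
# based on the failure modes described in this article.
QWEN3_TTS_FINETUNE_OVERRIDES = {
    "learning_rate": 2e-6,  # default 2e-5 diverged to pure noise
    "sample_rate": 24000,   # resample before codec prep
    "max_epochs": 10,       # val loss plateaued at ~10.3 after epoch 10
}

QWEN3_TTS_INFERENCE_OVERRIDES = {
    "lora_scale": 0.3,      # default 1.0 over-steers; sweep 0.2-0.5 first
}
```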
FAQ
CosyVoice 3 or Qwen3-TTS — which should I pick?
They solve different problems. CosyVoice 3 produces the best zero-shot quality for pre-produced content — audiobooks, video narration, anything where you batch-generate and review before publishing. Qwen3-TTS is the real-time model — 97ms first-packet latency, streaming-native, and the only option here if your product needs conversational response times. If latency does not matter, CosyVoice 3 for zero-shot, VoxCPM 1.5 for fine-tuned.
How much reference audio do I need for voice cloning?
3 seconds minimum (Qwen3-TTS), 10–15 seconds optimal (all models). Beyond 15 seconds, quality plateaus for LoRA fine-tuning. Full SFT (IndexTTS2) benefits from a complete dataset (~30 minutes), but the marginal gain per additional minute drops sharply after the first 15 seconds of high-quality audio.
Zero-shot or fine-tuned — when does fine-tuning become worth it?
Fine-tune when: you need consistent output across 50+ utterances, the target voice has a distinctive accent that zero-shot misses, or you are building a branded voice. Stay with zero-shot when: you are prototyping, the voice is a standard accent, or you cannot invest the 2–4 hours of training and evaluation time.
Which model retains Singaporean English accent best?
VoxCPM 1.5 and IndexTTS2 both retained the IMDA NSC FEMALE_01 accent well after fine-tuning. CosyVoice 3 zero-shot also handles non-American accents — it does not flatten to General American the way some models do. We specifically benchmark on Singaporean English because accent retention is a failure mode that most English-centric benchmarks miss entirely.
Can I run these on a consumer GPU?
Yes. All nine models fit on a 24GB GPU (RTX 3090, RTX 3090 Ti, RTX 4090) for both inference and fine-tuning. Kokoro (82M) fits on much less. You do not need data-centre hardware to get started. See our 24GB GPU guide for exact VRAM profiles.
What about proprietary models like ElevenLabs or PlayHT?
This guide covers open-source models only. Proprietary APIs (ElevenLabs, PlayHT, Azure Neural TTS) are viable but introduce vendor lock-in, per-character pricing, and data residency concerns. If you need full control over voice data, on-premise deployment, or want to avoid per-inference costs at scale, open-source is the path. The models in this guide match or exceed proprietary quality for single-speaker fine-tuned use cases.
Sources
All recommendations in this article are grounded in first-party benchmarks run on IMDA NSC FEMALE_01 using an RTX 3090 Ti (24GB). For detailed per-model results: