60-second takeaway
There are now ten credible open-source TTS models you could deploy. The problem is not finding one that works - it is picking the one that fits your constraints. We benchmarked most of them on IMDA NSC FEMALE_01 using an RTX 3090 Ti (24GB). This article gives you a decision tree: start with your use case (real-time streaming, audiobook, edge deployment, multilingual, fine-tuned voice cloning) and land on a specific model with a specific configuration. If you just want a quick answer: Qwen3-TTS for real-time, CosyVoice 3 or VoxCPM 1.5 for pre-produced content, Chatterbox for fast fine-tuning, Supertonic for CPU-first ONNX deployment, and Kokoro for very small edge footprints.
If you searched for best TTS model 2026, open source TTS model comparison, CosyVoice vs Qwen3-TTS, F5-TTS quality review, or TTS inference speed, this is the routing page. It compares model choice by use case, not by a single leaderboard score.
Where this fits
For founders: Your team is about to pick a TTS model. This decision tree prevents the most expensive mistake - choosing a model that is technically impressive but wrong for your deployment constraints. A real-time product cannot tolerate CosyVoice 3's compute overhead. An audiobook pipeline does not need Qwen3-TTS's 97ms latency. Match the model to the use case before writing any integration code.
For engineers: This is the routing logic we use internally. Each recommendation is grounded in first-party benchmarks - not paper claims, not leaderboard scores, not vibes. We include the specific checkpoints, LoRA scales, and failure modes we observed so you can reproduce or skip straight to deployment.
How to use this decision tree
Start from your use case, not from a model name. The models below overlap in capability - most of them can do voice cloning, most of them fit on a 24GB GPU, and most of them produce decent output in zero-shot mode. The differences emerge when you add constraints: latency budget, fine-tuning requirements, target language, or deployment hardware.
Read the decision tree table first. If your situation maps cleanly to one row, jump to that model's deep-dive section. If you are torn between two models, the deep-dive sections include the trade-offs we observed.
Quick model-fit matrix
Forum threads around local TTS keep returning to the same practical questions: will it run locally, does it clone well enough, can it stream, what does it cost to operate, and what failure should I expect first? Use this matrix as a routing layer before comparing demos.
| Model | Best for | Avoid if | Streaming / real-time | Hardware footprint | First failure to expect |
| --- | --- | --- | --- | --- | --- |
| | | You need a simple inference-only upstream workflow | Batch and reviewed content | 24GB validated | Crash recovery and checkpoint retention |
| Chatterbox | Fast fine-tuning experiments | You need mature recipes across many languages | Promising, still validate locally | Smaller model, 24GB easy | Too-small datasets and reference-audio guideline gaps |
| Fish Speech | EN, CN, and JP multilingual workflows | You only need English and want the lowest compute | More deployment work | Larger base model | Cross-lingual consistency and emotion control |
| Kokoro | Edge, CPU-adjacent, and low-compute deployment | You need strong custom voice cloning | Good for short local responses | Best small-footprint option here | Limited cloning flexibility |
| Supertonic 3 | CPU-first on-device TTS via ONNX Runtime | You need first-party voice-cloning fine-tuning | Fast cached local synthesis | Mac CPU smoke used ~547 MB RSS | Voice ownership, license fit, and long-form stability |
| Dia | Expressive experiments and community demos | You need a proven fine-tuning pipeline | Validate per implementation | Depends on runner | Reproducibility and deployment maturity |
| Orpheus | Expressive speech experiments | You need low-risk production cloning | Validate per implementation | Depends on runner | Quality variance and recipe maturity |
| OpenVoice or XTTS | Baseline local cloning and older tooling | You need current best quality or low latency | Usually not the fastest path | Often easy to run | Clone similarity disappointing on new speakers |
| ElevenLabs | Hosted API quality and low setup friction | You need local privacy, offline use, or no per-use cost | Strong hosted option | Cloud API | Vendor dependence, data exposure, and commercial terms |
For commercial deployment, check each model license and voice-consent workflow before treating a local model as cheaper or safer. Local generation removes a vendor dependency, but it does not remove consent, dataset provenance, or product liability questions.
Which model should I try first?
| Your first constraint | Start here | Why |
| --- | --- | --- |
| Realtime voice agent | Qwen3-TTS | Start with a streaming-native path, then verify first-audio latency in your own stack. |
| Local ElevenLabs-style cloning | F5-TTS or Qwen3-TTS, then compare | F5 has a lighter experimentation path; Qwen3 gives stronger realtime and LoRA controls. |
| CPU-only or local app TTS | Supertonic 3 | The Python SDK ran on Apple Silicon with ONNX Runtime CPU, ~547 MB peak RSS, and no discrete VRAM. |
| 8GB or 12GB VRAM | Kokoro or Supertonic for edge, F5-TTS for tests | Treat this as exploration territory unless you can accept small batches and limited validation. |
| 16GB VRAM | F5-TTS first | It is the most plausible fine-tuning experiment below the 24GB class, with caveats. |
| 24GB VRAM | VoxCPM 1.5, Qwen3-TTS, CosyVoice | This is the validated consumer-GPU class for the Instavar benchmark series. |
| Long-form narration | CosyVoice 3 or VoxCPM 1.5 | Batch workflows can review and regenerate segments, which matters more than first-packet latency. |
| Closest possible cloned speaker | IndexTTS2 or VoxCPM 2 | Full-SFT paths are heavier, but they are the right comparison when voice fidelity matters most. |
| Commercial-safe open-source deployment | License check first, then model | The model choice is secondary if the license, training data, or voice consent does not fit the use. |
| Emotion control | Qwen3-TTS, Fish Speech, or Higgs | Treat emotion tags as a feature to test, not a guarantee. |
The decision tree
| Your situation | Recommended model | Why |
| --- | --- | --- |
| Need real-time streaming (< 100ms latency) | Qwen3-TTS 1.7B | 97ms first-packet latency, robust to formatting variation |
| Pre-produced content (audiobook, video A-roll) | CosyVoice 3 (zero-shot) or VoxCPM 1.5 (fine-tuned) | Flow matching quality for zero-shot; VoxCPM if fine-tuning is needed |
| Deploy something today, minimal setup | VoxCPM 1.5 | Lowest friction LoRA path, step 4000 deployable |
| Need LoRA adapter control post-training | Qwen3-TTS 1.7B | Scale parameter (0.3 to 0.35) tunes output without retraining |
| Full SFT for maximum voice fidelity | IndexTTS2 or VoxCPM 2 | IndexTTS2 is the most reproducible older baseline; VoxCPM 2 full SFT now works on 24GB with the right memory stack |
| Edge / on-device deployment | Supertonic 3 or Kokoro | Supertonic is the CPU-first ONNX path; Kokoro is the smallest-footprint option. |
| Multilingual (EN + CN + JP) | Fish Speech S2 Pro | 300K+ hours multilingual training data, ELO 1339 in TTS Arena, LoRA fine-tuned on FEMALE_01 |
| Non-American English accent retention | VoxCPM 1.5 or IndexTTS2 | Both retained IMDA NSC FEMALE_01 Singaporean English accent well |
| Fine-tuning with minimal reference audio | Qwen3-TTS 1.7B | 3-second minimum, 10 to 15s optimal, then plateau |
| Fastest fine-tuning turnaround | Chatterbox (0.5B) | 512 samples in 2 min 20s on a single GPU - fastest fine-tuning of any model here |
| Maximum expressiveness + voice cloning | Higgs Audio V2 (3B) | 10M+ hours pre-training, top trending on HuggingFace, Llama 3.2 backbone |
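The decision tree above can be treated as a plain lookup, which is a convenient way to keep the routing logic in one reviewable place. The sketch below transcribes the table; the short keys (`realtime_streaming`, `edge`, and so on) and the `recommend` helper are hypothetical names chosen for illustration, not part of any model's tooling.

```python
# Routing table transcribed from the decision tree above.
# The short keys are hypothetical labels chosen for this sketch.
DECISION_TREE = {
    "realtime_streaming": "Qwen3-TTS 1.7B",
    "pre_produced_zero_shot": "CosyVoice 3",
    "pre_produced_fine_tuned": "VoxCPM 1.5",
    "deploy_today": "VoxCPM 1.5",
    "lora_scale_control": "Qwen3-TTS 1.7B",
    "full_sft": "IndexTTS2 or VoxCPM 2",
    "edge": "Supertonic 3 or Kokoro",
    "multilingual_en_cn_jp": "Fish Speech S2 Pro",
    "accent_retention": "VoxCPM 1.5 or IndexTTS2",
    "minimal_reference_audio": "Qwen3-TTS 1.7B",
    "fastest_fine_tuning": "Chatterbox (0.5B)",
    "expressive_cloning": "Higgs Audio V2 (3B)",
}


def recommend(use_case: str) -> str:
    """Return the recommended model for a use-case key, or fail loudly."""
    try:
        return DECISION_TREE[use_case]
    except KeyError:
        known = ", ".join(sorted(DECISION_TREE))
        raise ValueError(
            f"unknown use case {use_case!r}; expected one of: {known}"
        ) from None


print(recommend("realtime_streaming"))
```

Failing on an unknown key, rather than returning a default model, mirrors the advice to start from the use case and not from a model name.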
Real-time streaming: Qwen3-TTS 1.7B
If your product needs first-packet latency under 100ms, Qwen3-TTS is the only model in this list that reliably delivers it. We measured 97ms first-packet latency on our RTX 3090 Ti benchmark - fast enough for conversational interfaces, live dubbing, and interactive voice agents.
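Because first-packet latency depends on your own stack, it is worth measuring it directly instead of relying on the 97ms figure. The sketch below times how long a streaming call takes to yield its first audio chunk. It assumes your TTS client exposes audio as a Python iterator of byte chunks; `fake_stream` is a hypothetical stand-in and not the Qwen3-TTS API.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, the first chunk)."""
    start = time.perf_counter()
    first = next(iter(stream))  # blocks until the engine yields audio
    return time.perf_counter() - start, first


def fake_stream(delay_s: float = 0.02, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call."""
    for _ in range(chunks):
        time.sleep(delay_s)  # simulated synthesis time per chunk
        yield b"\x00" * 960


latency_s, chunk = time_to_first_chunk(fake_stream())
print(f"first chunk after {latency_s * 1000:.1f} ms ({len(chunk)} bytes)")
```

In a real measurement, create the stream inside the timed region so that request setup counts toward the latency, and report a distribution over many requests instead of a single sample.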
What makes it work for real-time:
Streaming-native architecture - the model generates audio incrementally, not as a single batch
10 languages supported out of the box
3-second voice cloning - you can enrol a new speaker from a single short clip
Robust to input formatting variation - handles punctuation, numbers, and abbreviations without special preprocessing
LoRA fine-tuning: Supported, and this is where the model shines post-training. The lora_scale parameter (0.3 to 0.35 optimal in our benchmark) lets you control how much the adapter influences output without retraining. This is a deployment-time knob, not a training-time decision. Run a 5-sample listening test at scales 0.2, 0.3, 0.35, and 0.5 before committing.
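To see why a scale value can be changed at deployment time, it helps to look at how a LoRA update is generally applied: the effective weight is the frozen base weight plus a scaled low-rank product. The sketch below assumes `lora_scale` acts as that multiplier, which is how LoRA scaling works in general; the Qwen3-TTS implementation itself is not shown in this article, and the tiny matrices are made-up values for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_lora(base: Matrix, lora_b: Matrix, lora_a: Matrix, scale: float) -> Matrix:
    """Effective weight = base + scale * (B @ A). The base is never modified."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


# Illustrative values only: a 2x2 base weight and a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, -0.5]]

for scale in (0.2, 0.3, 0.35, 0.5):
    print(scale, apply_lora(base, lora_b, lora_a, scale))
```

Since the adapter is only ever added on top of frozen weights, sweeping the scale needs no retraining, which is what makes the listening test across 0.2 to 0.5 cheap to run.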
Current limitation: Single-speaker fine-tuning only. Multi-speaker LoRA is not yet supported. If you need multiple fine-tuned voices, you need separate adapters.
Known pitfalls (from our fine-tuning runs and companion repo):
| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Double label-shift bug in sft_12hz.py | Speech progressively accelerates each epoch until unintelligible | Apply PR #178 - replace with F.cross_entropy() to avoid HuggingFace's internal shift |
| Missing text_projection call (line 93) | Hard crash on 0.6B model; silent wrong embeddings on 1.7B | Apply PR #188 (commit 680d4e9) |
| Default LR too high (2e-5) | Pure noise output, infinite generation (no EOS), apparent divergence | Use 2e-6 instead (validated across GitHub issue #39) |
| Audio not at 24kHz | Crash deep in training with no early warning | Resample all audio to 24kHz before codec prep: ffmpeg -i in.wav -ar 24000 out.wav |
| LoRA scale 1.0 at inference | Over-steered, forced-sounding output | Use 0.3 to 0.35; run 5-sample listening test before committing |
| EOS token failures (~0.5% of inferences) | Infinite token generation, hangs | Set explicit eos_token_id list and max_new_tokens cap |
| Cold-start decoder distortion | First inference in a new process produces corrupted audio | Prepend silence codec tokens as warm-up, then trim |
| Progressive timbre shift across chunks | Voice changes between long-text chunks | Fix random seed before each chunk; extract speaker embedding once and reuse |
| Val evaluation crash on small val sets | RuntimeError: zero-dimensional tensor cannot be concatenated | Bug in evaluation function - needs guard for empty loss tensor |
| Inference segfaults mid-epoch sweep | Process crashes partway through checkpoint evaluation | Batch inference defensively; do not assume a loop completes |
| Val loss plateaus after epoch 10 | Train loss keeps dropping but val loss stalls at ~10.3 | Stop at epoch 10 - further training overfits without quality gain |
The double label-shift bug is the most impactful: it affects every training run on the official script and is not documented in the upstream README. If your fine-tuned output sounds increasingly fast with each epoch, this is almost certainly the cause.
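The mechanism behind a double label shift can be shown without any ML library. A causal language-model loss scores the prediction at position t against the token at position t + 1, and HuggingFace models apply that shift internally when given labels. If the training script has already shifted the labels, every prediction is scored against the token two positions ahead. The sketch below illustrates this general mechanism with a toy sequence; it is not a reproduction of sft_12hz.py, and `shift_pairs` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def shift_pairs(inputs: Sequence[str], labels: Sequence[str]) -> List[Tuple[str, str]]:
    """Mimic the shift a causal-LM loss applies internally:
    the prediction made at position t is scored against labels[t + 1]."""
    return list(zip(inputs[:-1], labels[1:]))


tokens = ["a", "b", "c", "d", "e"]

# Correct: pass labels aligned with the inputs and let the loss shift once.
single = shift_pairs(tokens, tokens)

# Buggy: the script shifts labels itself, then the loss shifts them again.
pre_shifted = tokens[1:] + ["<pad>"]
double = shift_pairs(tokens, pre_shifted)

print("single shift:", single)
print("double shift:", double)
```

With the double shift, each position is trained to predict the token after next. A model rewarded for skipping ahead is one plausible explanation for output that speeds up with every epoch.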
Pre-produced content: CosyVoice 3 and VoxCPM 1.5
Pre-produced content (audiobooks, video narration, podcast intros) does not need sub-100ms latency. It needs the highest possible naturalness and consistency across long passages. Two models fit this use case, and which one you pick depends on whether you need fine-tuning.
CosyVoice 3 - best zero-shot quality
CosyVoice 3 uses a flow matching architecture that produces extremely consistent zero-shot output. If you have a reference clip and do not want to fine-tune, this is the model to start with.
Strengths:
Flow matching produces smooth, natural prosody even on first attempt
Extremely high speaker consistency in zero-shot mode
Strong on long-form passages - audiobook chapters, 5-minute narration blocks
Trade-offs:
Higher compute cost than Qwen3-TTS or VoxCPM at inference time
LoRA fine-tuning is viable (best checkpoint at epoch 12) - see companion repo for PEFT integration and 9 pitfalls
Not suitable for real-time streaming due to compute overhead
VoxCPM 1.5 - lowest-friction fine-tuning
If zero-shot is not enough and you need a fine-tuned voice for pre-produced content, VoxCPM 1.5 is the path of least resistance.
Strengths:
Lowest setup friction of any LoRA fine-tuning path we tested
Best checkpoint (step 4000) produced deployable output in our first run
No-prompt generation was the cleanest - prompted inference copied room noise from the reference clip
Strong accent retention on Singaporean English (IMDA NSC FEMALE_01)
Trade-offs:
Use VoxCPM 1.5 LoRA when iteration speed matters; use VoxCPM 2 full SFT when you need deeper model adaptation and can tolerate a heavier recipe
Requires 44.1 kHz audio resampling before training (skip this and output quality degrades even though the loss looks normal)
Known pitfalls:
44.1 kHz resampling required - VoxCPM expects 44.1 kHz audio (not 24 kHz like Qwen3-TTS). Skip this and training fails silently: the loss looks normal but output quality degrades.
Prompted inference copies room noise - if the reference clip has any background noise, it bleeds into the output. Use no-prompt generation for production; only use prompted mode when strong speaker lock is required.
Edge and on-device deployment: Supertonic 3 and Kokoro
Kokoro and Supertonic 3 are the edge candidates in this list, but they solve different edge problems. Kokoro is the smallest-footprint model here. Supertonic 3 is the CPU-first ONNX Runtime option when you want local TTS in an app, browser-adjacent workflow, or desktop utility without running a large GPU model.
We ran a bounded Supertonic 3 smoke test on Apple Silicon macOS with the Python SDK and cached model assets. The run generated a valid mono 44.1 kHz PCM WAV, used the default CPUExecutionProvider, and peaked at about 547 MB resident memory during short synthesis. The local model cache was 385 MB at ~/.cache/supertonic3. This is not a voice-cloning quality benchmark, but it is strong evidence that the model belongs in the on-device deployment lane.
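A peak-memory figure like the one above can be reproduced on your own hardware with the standard library. The sketch below reads the peak resident set size of the current process after running a workload. It is not the harness used for the smoke test, and the `synthesize` callable is a hypothetical placeholder for whatever Supertonic or Kokoro call you are measuring. The `resource` module is available on Unix-like systems only.

```python
import resource
import sys
from typing import Callable


def peak_rss_mb() -> float:
    """Peak resident set size of this process in MB.

    ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_and_report(synthesize: Callable[[], object]) -> float:
    """Run the workload once, then return the process-wide peak RSS."""
    synthesize()
    return peak_rss_mb()


# Placeholder workload: replace the lambda with a real synthesis call.
peak = run_and_report(lambda: bytearray(20 * 1024 * 1024))
print(f"peak RSS: {peak:.0f} MB")
```

The value is a high-water mark for the whole process, so run the measurement in a fresh process per model to avoid carrying over memory from earlier work.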
What makes it work for edge:
Supertonic 3 runs through ONNX Runtime and ships public Python SDK examples
The default local path does not require discrete VRAM on macOS
The upstream release adds 31-language support and targets fewer repeat/skip failures
Kokoro remains the lighter option when the smallest model footprint matters most
Both are suitable for on-device voice assistants, kiosk applications, offline narration, and privacy-sensitive local workflows
Trade-offs:
Smaller models have less headroom for unusual prosody or complex long-form narration
Supertonic's public path is best treated as inference-first unless you use Supertone's voice-building workflow
Kokoro has limited cloning flexibility compared with larger voice-cloning models
License and voice-consent terms still need a production review before commercial deployment
When to use Supertonic or Kokoro over a cloud-hosted larger model: Use Supertonic when you want a local ONNX TTS runtime with broad language coverage and acceptable memory use on consumer hardware. Use Kokoro when the smallest possible footprint matters more than language breadth or custom voice flexibility. Choose either over a cloud-hosted model when server latency, offline behavior, local privacy, or per-inference cost matters more than maximum voice-clone fidelity.
Multilingual requirements: Fish Speech S2 Pro
If your product needs to serve English, Chinese, and Japanese from a single model, Fish Speech S2 Pro is the strongest option. It was trained on 300K+ hours of multilingual data - an order of magnitude more than most open-source TTS models.
First-party experience: We fine-tuned Fish Speech S2 Pro with LoRA on IMDA NSC FEMALE_01 (March 2026). The model is a 4.6B parameter DualAR Transformer - only 18.4M parameters are trainable via LoRA, making fine-tuning feasible on a single 24GB GPU. The training pipeline requires three preparation steps (VQ extraction, protobuf building, then LoRA training) - more setup than Chatterbox or VoxCPM but well-documented. Our pilot run (64 steps) completed in 6 minutes 25 seconds.
Strong on EN, CN, and JP - the three languages with the deepest training coverage
ELO 1339 in TTS Arena, which tracks human preference across multilingual scenarios
Active community and regular model updates
LoRA fine-tuning supported - 18.4M trainable params out of 4.6B total
Trade-offs:
Three-step data pipeline (VQ → protobuf → train) adds setup complexity compared to simpler models
DualAR architecture is more complex to deploy than standard autoregressive models
If you only need English, the compute overhead of multilingual capability is wasted
4.6B base model requires more VRAM at inference than smaller alternatives
Known pitfalls:
Three-step data pipeline has version sensitivity - the VQ extraction → protobuf building → LoRA training pipeline requires matching protobuf versions. Our first protobuf build attempt failed with ImportError: cannot import name 'builder' from 'google.protobuf.internal'. Upgrading protobuf fixed it, but this is not obvious from the docs.
VQ extraction is slow - processing 12,057 files (14.45 hours of FEMALE_01 audio) took ~22 minutes. Budget this into your pipeline setup time.
4.6B base model, 18.4M trainable - the LoRA approach keeps most parameters frozen, but the base model still needs to fit in VRAM for inference. Fits on 24GB, but leave headroom.
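The figures quoted above are easier to budget with once they are turned into ratios. The sketch below only restates numbers from this section; because the extraction time is given as roughly 22 minutes, the derived rates are approximate.

```python
# All inputs are taken from the text above; the runtime is approximate.
TOTAL_PARAMS = 4.6e9
TRAINABLE_PARAMS = 18.4e6
FILES = 12_057
AUDIO_HOURS = 14.45
EXTRACTION_MINUTES = 22

trainable_pct = 100 * TRAINABLE_PARAMS / TOTAL_PARAMS
files_per_second = FILES / (EXTRACTION_MINUTES * 60)
realtime_factor = AUDIO_HOURS * 60 / EXTRACTION_MINUTES

print(f"trainable share: {trainable_pct:.2f}% of parameters")
print(f"VQ extraction: {files_per_second:.1f} files/s")
print(f"VQ extraction: {realtime_factor:.1f}x faster than real time")
```

So LoRA trains about 0.4% of the weights, and VQ extraction ran at roughly 9 files per second on this corpus. Scale the extraction budget linearly with hours of audio as a first estimate.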
When to use Fish Speech over Qwen3-TTS for multilingual: Qwen3-TTS supports 10 languages, but Fish Speech's deeper training data on EN/CN/JP produces more natural cross-language output for those three specifically. If your use case is primarily EN+CN+JP, Fish Speech wins. If you need broader language coverage (10 languages), Qwen3-TTS is more versatile.
Fast fine-tuning: Chatterbox
Chatterbox is a 0.5B parameter model built on Llama and the fastest model to fine-tune in this guide. In a blind listening test reported by Resemble AI, the model's developer, listeners preferred it over ElevenLabs 63.75% of the time. It was the #1 trending TTS model on HuggingFace.
First-party experience: We fine-tuned Chatterbox on IMDA NSC FEMALE_01 (512 samples, March 2026). Training completed in 2 minutes 20 seconds on a single GPU - orders of magnitude faster than IndexTTS2 or VoxCPM fine-tuning. Two checkpoints were saved (step 384 and step 512), with a final training loss of 1.26.
What makes it work for rapid iteration:
0.5B parameters means fine-tuning is extremely fast and fits comfortably on 24GB
The training pipeline uses standard HuggingFace Trainer - no custom training loop required
Supports both TTS and voice conversion (VC) modes
Multilingual support including English, Chinese, and more
Trade-offs:
Smaller model (0.5B) means less capacity for complex prosody compared to 1.7B+ models
Fine-tuning ecosystem is newer than Qwen3-TTS or IndexTTS2 - fewer community recipes
Quality evaluation on FEMALE_01 is still pending full listening comparison against our other benchmarked models
Known pitfalls:
Minimum dataset size - our pilot run with 64 samples produced loss=0.0 (no meaningful gradients). The model needs a minimum of ~256 samples to learn. Our 512-sample run produced loss 1.26 and generated usable checkpoints.
Quality evaluation pending - we have fine-tuned Chatterbox on FEMALE_01 but have not yet completed a formal listening comparison against IndexTTS2 or VoxCPM on the same corpus. The training metrics look healthy but training loss alone does not predict perceptual quality.
When to use Chatterbox: When you need to iterate on fine-tuning rapidly - testing multiple speaker profiles, dataset sizes, or training configurations. The 2-minute training cycle means you can run 30 experiments in the time it takes IndexTTS2 to complete one. Start with Chatterbox for exploration, then validate the best configuration against IndexTTS2 or VoxCPM for production deployment.
Expressive generation: Higgs Audio V2
Higgs Audio V2 is a 3B parameter model built on Llama 3.2, pre-trained on over 10 million hours of audio data. It is currently the top trending TTS model on HuggingFace (as of March 2026), positioned as an industry-leading model for expressive audio generation and multilingual voice cloning.
What makes it notable:
10M+ hours of pre-training data - the largest training corpus of any model in this guide
Llama 3.2 3B backbone provides strong language understanding
Expressive generation: captures whisper, vibrato, breathiness, and emotional variation
Multilingual voice cloning from short reference clips
Trade-offs:
3B parameters requires more VRAM than Chatterbox (0.5B) or Kokoro (82M)
Newer model - community recipes and fine-tuning guides are still emerging
We have not yet run IMDA NSC benchmarks on this model - the claims in this section come from community evaluations, not first-party tests
When to consider Higgs Audio V2: When expressiveness matters more than latency - audiobook narration, character voices, or content where emotional range is a quality differentiator. The 10M-hour pre-training corpus gives it a broader stylistic range than models trained on smaller datasets. If you need fine-grained control over speaking style without fine-tuning, Higgs Audio V2 is worth evaluating.
Status: Community-benchmarked only. We plan to add IMDA NSC FEMALE_01 benchmarks in a future update.
Fine-tuning: when zero-shot is not enough
Zero-shot voice cloning has improved dramatically - CosyVoice 3 and Qwen3-TTS both produce usable output from a single reference clip. But "usable" is not "production-ready" for every use case. Fine-tuning is worth the effort when:
You need consistent output across hundreds of utterances (audiobook-length content)
The target voice has distinctive characteristics that zero-shot does not capture (regional accent, specific speaking rhythm)
You are building a branded voice that must sound identical every time
Reference audio requirements
This is the most common question we get. The answer is more nuanced than "more is better":
| Reference audio length | What to expect |
| --- | --- |
| 3 seconds | Minimum viable for Qwen3-TTS voice cloning. Speaker identity is captured but prosody is approximate. |
| 10 to 15 seconds | Optimal range. Captures speaker identity, natural rhythm, and accent characteristics. |
| 15+ seconds | Diminishing returns. Quality plateaus - additional audio does not meaningfully improve output. |
| 30+ minutes (full dataset) | Required for full SFT paths such as IndexTTS2 and VoxCPM 2. LoRA paths such as VoxCPM 1.5 and Qwen3-TTS can start with less, but still benefit from clean coverage. |
The practical takeaway: For zero-shot voice cloning, prepare 10 to 15 seconds of clean reference audio per speaker. For LoRA or full SFT, prepare a labelled dataset and audit it before training. Full SFT is less forgiving of dirty manifests because every trainable weight can learn the bad rows.
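For zero-shot cloning, the duration guidance can be applied as a quick pre-flight check on each reference clip. The sketch below maps a clip length to the bands described in this section. The function name is hypothetical, and treating 3 to 10 seconds as "viable" is an interpolation between the stated 3-second minimum and the 10 to 15 second optimum.

```python
def reference_audio_band(seconds: float) -> str:
    """Classify a zero-shot reference clip by duration."""
    if seconds < 3:
        return "too short: below the 3-second minimum for Qwen3-TTS cloning"
    if seconds < 10:
        # Assumption: interpolated between the minimum and the optimal range.
        return "viable: speaker identity captured, prosody approximate"
    if seconds <= 15:
        return "optimal: identity, rhythm, and accent are captured"
    return "plateau: extra audio adds little for zero-shot cloning"


for clip_seconds in (2.0, 3.0, 12.0, 40.0):
    print(f"{clip_seconds:>5.1f}s -> {reference_audio_band(clip_seconds)}")
```

This check covers prompt clips only. Datasets for LoRA or full SFT need a manifest audit, not a duration threshold.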
Full SFT: IndexTTS2 and VoxCPM 2
IndexTTS2 is the model to use when you need maximum voice fidelity and are willing to invest in full SFT training. In our benchmark, it outperformed SOTA on WER, speaker similarity, and emotional fidelity. The official IndexTTS2 repo provides inference only - we wrote and open-sourced the fine-tuning pipeline: instavar/indextts2-finetuning.
VoxCPM 2 is the newer full-SFT result. Vanilla full SFT did not fit on 24GB, but the run succeeded on an RTX 3090 Ti with gradient checkpointing, PagedAdamW8bit, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, a clean 90/5/5 split, and post-hoc validation. This makes VoxCPM 2 the better case study for "full SFT on consumer GPUs" while IndexTTS2 remains the more established reference point.
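Two pieces of that memory stack are independent of the training code and can be sketched with the standard library: setting the allocator variable before PyTorch initialises CUDA, and producing a deterministic 90/5/5 split. The helper name, seed, and utterance IDs below are hypothetical, and the split only shuffles rows. A clean split in practice may also need de-duplication and manifest checks that are not shown. Gradient checkpointing and the paged 8-bit optimizer depend on the training script and are left out.

```python
import os
import random
from typing import List, Sequence, Tuple

# Must be set before torch initialises CUDA; value taken from the text above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def split_90_5_5(
    rows: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[str], List[str]]:
    """Deterministically shuffle rows and split them into train/val/test."""
    if len(rows) < 20:
        raise ValueError("need at least 20 rows for a 90/5/5 split")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split stable
    n_val = len(shuffled) * 5 // 100
    n_test = len(shuffled) * 5 // 100
    n_train = len(shuffled) - n_val - n_test
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


rows = [f"utt_{i:04d}" for i in range(200)]  # hypothetical utterance IDs
train, val, test = split_90_5_5(rows)
print(len(train), len(val), len(test))
```

Persist the three lists to disk once and reuse them across runs, so that held-out validation stays comparable between checkpoints.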
Key details (IndexTTS2 run):
Best checkpoint: step 14000
Requires explicit checkpoint retention - do not rely on automatic deletion
Crash recovery during training requires careful resume management
Keep all checkpoints until you have completed a listening evaluation sweep
Known pitfalls:
Checkpoint auto-deletion - the default retention policy deletes older checkpoints before you can evaluate them. Keep ALL checkpoints until listening eval is complete. The best checkpoint (step 14000 in our run) was not the final step (15949).
transformers version pinning - requires exactly transformers==4.52.1. Older versions throw KeyError: 'qwen3' during model loading due to the Qwen emotion model inside IndexTTS2.
Crash recovery requires explicit management - if training crashes mid-run, resume logic needs manual intervention. Log the last successful step and resume from there.
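Resume management is easier when the latest checkpoint is found by step number and nothing is deleted along the way. The sketch below assumes checkpoints are saved as directories named `checkpoint-<step>`, which is a hypothetical naming scheme; adapt the pattern to whatever your training script writes.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

STEP_PATTERN = re.compile(r"^checkpoint-(\d+)$")  # hypothetical naming scheme


def latest_checkpoint(run_dir: Path) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, if any."""
    best_step, best_path = -1, None
    for child in run_dir.iterdir():
        match = STEP_PATTERN.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Demonstration in a temporary directory with made-up step numbers.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for step in (2000, 14000, 4000):
        (run_dir / f"checkpoint-{step}").mkdir()
    resume_from = latest_checkpoint(run_dir)
    resume_name = resume_from.name if resume_from else None
    kept = sorted(p.name for p in run_dir.iterdir())

print("resume from:", resume_name)
print("still on disk:", kept)
```

Comparing steps as integers matters: a plain string sort would rank `checkpoint-4000` above `checkpoint-14000`. The helper only reads the directory, so every checkpoint remains available for the listening sweep.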
F5-TTS is an emerging fine-tunable model with a smaller but growing community. It is worth evaluating if you are running voice cloning experiments and want an alternative to VoxCPM or IndexTTS2. The model is capable, but the ecosystem (recipes, community checkpoints, debugging resources) is less mature than the top-tier options.
When to consider F5-TTS: When you have already tried VoxCPM or IndexTTS2 and want to compare outputs, or when the F5-TTS community has published a recipe specifically matching your language or accent.
Hardware constraints
The GPU-backed models in this guide fit on a 24GB GPU for inference. Fine-tuning constraints are tighter, while CPU-first and edge models have a different memory profile:
| Model | Parameters | Inference on 24GB | Fine-tuning on 24GB | Notes |
| --- | --- | --- | --- | --- |
| | | | | Keep all checkpoints; save more frequently than default |
| VoxCPM 1.5 | - | ✅ | ✅ (LoRA) | 44.1 kHz resampling required before training |
| VoxCPM 2 | 2B | ✅ | ✅ (LoRA or full SFT) | Full SFT needs gradient checkpointing, paged 8-bit optimizer state, clean manifests, and post-hoc validation |
| Kokoro | 82M | ✅ | ✅ | Fits on much less than 24GB |
| Supertonic 3 | 99M | ✅ | Not covered here | CPU-first ONNX path; macOS smoke peaked at ~547 MB RSS with 385 MB local model cache |
| Fish Speech S2 Pro | 4.6B (18.4M trainable) | ✅ | ✅ (LoRA) | Three-step data pipeline; DualAR adds inference overhead |
| F5-TTS | - | ✅ | ✅ | Community recipes still maturing |
| Chatterbox | 0.5B | ✅ | ✅ | 512 samples in 2min 20s; fastest fine-tuning of any model |
| Higgs Audio V2 | 3B | ✅ | ✅ | Larger model; 10M+ hours pre-training |
Consumer GPU reality check: An RTX 3090, RTX 3090 Ti, or RTX 4090 (all 24GB class) can run every model here for inference and can fine-tune the practical recipes we have validated. You do not need an A100 or H100 to get started. The constraint is recipe discipline: memory stack, dataset audit, checkpoint retention, and validation strategy.
Cross-model patterns: what broke the same way everywhere
After fine-tuning nine models on the same IMDA NSC corpus, three patterns appeared consistently:
1. The best checkpoint is never the last one.
VoxCPM 1.5 LoRA peaked at step 4000 (not later steps). VoxCPM 2 full SFT selected step 2000 by held-out validation, not the final 9000-step checkpoint. IndexTTS2 peaked at step 14000 (not the final 15949). Qwen3-TTS peaked at epoch 10 (not epoch 17). This is not a coincidence - TTS models overfit to training prosody quickly, and the last checkpoint has the lowest training loss but not the best perceptual quality. Keep all checkpoints. Evaluate by validation and listening, not by training loss alone.
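Selecting by validation instead of by recency is a one-line change once the validation metric is logged per checkpoint. The sketch below uses made-up loss values purely for illustration; they are not results from any run described here.

```python
from typing import Dict


def best_checkpoint(val_loss_by_step: Dict[int, float]) -> int:
    """Return the step with the lowest held-out validation loss."""
    return min(val_loss_by_step, key=val_loss_by_step.get)


# Hypothetical values for illustration only, not measured results.
val_loss_by_step = {500: 2.41, 1000: 2.18, 1500: 2.23, 2000: 2.31}

best = best_checkpoint(val_loss_by_step)
last = max(val_loss_by_step)
print(f"best step by validation: {best}; last step: {last}")
```

Validation loss narrows the candidates, but the final choice should still come from listening to the top few checkpoints.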
2. Sample rate mismatches are the #1 silent failure.
Qwen3-TTS requires 24 kHz. VoxCPM requires 44.1 kHz. IndexTTS2 works with both but prefers 44.1 kHz. Fish Speech defaults to its own codec sample rate. If you switch between models without checking sample rate, training may appear to work (loss decreases normally) but output quality is degraded. Always verify sample rate before every training run.
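A sample-rate audit is cheap to automate before every training run. The sketch below uses Python's standard `wave` module to flag files whose rate differs from what the target model expects. The expected rates come from the text above; the dictionary keys and helper names are hypothetical. The `wave` module reads uncompressed PCM WAV only, so other formats need a different reader.

```python
import tempfile
import wave
from pathlib import Path
from typing import Dict, Iterable

# Expected training sample rates as stated in the text above.
EXPECTED_HZ = {"qwen3-tts": 24_000, "voxcpm": 44_100}


def find_rate_mismatches(paths: Iterable[Path], expected_hz: int) -> Dict[str, int]:
    """Map file name -> actual sample rate for every file that does not match."""
    mismatches = {}
    for path in paths:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
        if rate != expected_hz:
            mismatches[path.name] = rate
    return mismatches


def write_silence(path: Path, rate: int, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit silent WAV, used here only as test input."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))


with tempfile.TemporaryDirectory() as tmp:
    good, bad = Path(tmp) / "good.wav", Path(tmp) / "bad.wav"
    write_silence(good, 24_000)
    write_silence(bad, 44_100)
    report = find_rate_mismatches([good, bad], EXPECTED_HZ["qwen3-tts"])

print(report)
```

Running this over the full manifest and refusing to start training when the report is non-empty turns a silent failure into an immediate one.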
3. Scale and LR defaults are wrong for fine-tuning.
Qwen3-TTS default LR (2e-5) causes noise; use 2e-6. Qwen3-TTS default LoRA scale (1.0) over-steers; use 0.3 to 0.35. These are not edge cases - they affect every fine-tuning run. No model's default hyperparameters are tuned for single-speaker fine-tuning on a small corpus. Always sweep before committing.
FAQ
CosyVoice 3 or Qwen3-TTS - which should I pick?
They solve different problems. CosyVoice 3 produces the best zero-shot quality for pre-produced content - audiobooks, video narration, anything where you batch-generate and review before publishing. Qwen3-TTS is the real-time model - 97ms first-packet latency, streaming-native, and the only option here if your product needs conversational response times. If latency does not matter, use CosyVoice 3 for zero-shot, VoxCPM 1.5 LoRA for the fastest fine-tuned path, or VoxCPM 2 full SFT when you want the deeper adaptation path.
How much reference audio do I need for voice cloning?
3 seconds minimum (Qwen3-TTS), 10 to 15 seconds optimal for zero-shot cloning. Beyond 15 seconds, zero-shot reference quality often plateaus. LoRA and full SFT are different: they need labelled training rows, not just a longer prompt clip. Full SFT paths such as IndexTTS2 and VoxCPM 2 benefit from a complete clean dataset, and the manifest audit matters as much as duration.
Zero-shot or fine-tuned - when does fine-tuning become worth it?
Fine-tune when: you need consistent output across 50+ utterances, the target voice has a distinctive accent that zero-shot misses, or you are building a branded voice. Stay with zero-shot when: you are prototyping, the voice is a standard accent, or you cannot invest the 2 to 4 hours of training and evaluation time.
Which model retains Singaporean English accent best?
VoxCPM 1.5 and IndexTTS2 both retained the IMDA NSC FEMALE_01 accent well after fine-tuning. CosyVoice 3 zero-shot also handles non-American accents - it does not flatten to General American the way some models do. We specifically benchmark on Singaporean English because accent retention is a failure mode that most English-centric benchmarks miss entirely.
Can I run these on a consumer GPU?
Yes. The GPU-backed models in this guide fit on a 24GB GPU (RTX 3090, RTX 3090 Ti, RTX 4090) for inference, and the validated fine-tuning recipes fit with the caveats in the hardware table. Kokoro (82M) fits on much less, and Supertonic 3 ran in our macOS CPU smoke test without discrete VRAM. You do not need data-centre hardware to get started. See our 24GB GPU guide for exact VRAM profiles.
What about proprietary models like ElevenLabs or PlayHT?
This guide focuses on open-source models; ElevenLabs appears in the model-fit matrix only as a hosted reference point. Proprietary APIs (ElevenLabs, PlayHT, Azure Neural TTS) are viable but introduce vendor lock-in, per-character pricing, and data residency concerns. If you need full control over voice data or on-premise deployment, or want to avoid per-inference costs at scale, open-source is the path. The models in this guide can match or exceed proprietary quality for single-speaker fine-tuned use cases.
Sources
All recommendations in this article are grounded in first-party benchmarks run on IMDA NSC FEMALE_01 using an RTX 3090 Ti (24GB). For detailed per-model results: