60-second takeaway
We ran a consistent single-speaker benchmark on four open-source TTS models using IMDA NSC FEMALE_01 on an RTX 3090 Ti (24GB). VoxCPM 1.5 and Qwen3-TTS 1.7B both produced deployable LoRA outputs. IndexTTS2 gave a stable full-SFT baseline, and VoxCPM 2 full SFT is now validated on 24GB with the right memory stack. CosyVoice3's first full-SFT run failed, but the corrected LoRA run is the stable path. If you need something deployable today on a 24GB GPU, start with VoxCPM or Qwen3-TTS LoRA.
What this benchmark covers
This is a practitioner-oriented comparison, not an academic leaderboard. We evaluated four models under the same conditions:
Dataset: IMDA NSC FEMALE_01 - a single-speaker set with a natural Singaporean English accent
Hardware: one NVIDIA RTX 3090 Ti (24 GB VRAM)
Goal: produce voice-cloned audio suitable for AI-generated video narration (A-roll use case)
Evaluation: qualitative listening on naturalness, long-text stability, accent retention, and operational friction
We are not measuring WER or MOS with automated tools. We are measuring whether the output sounds production-ready to a human listener on a video platform.
The four models
VoxCPM 1.5
VoxCPM 1.5 uses a LoRA finetuning path that fits within 24GB VRAM without modification. Training is straightforward with standard train/val splits.
Finetuning approach: LoRA
Best checkpoint (this run): step_0004000
Long-text stability: Good
Prompt sensitivity: Moderate - use clean prompt clips
Key insight: No-prompt generation at step 4000 gave the best naturalness. Prompted inference copied prompt room noise into the output, which was audible on studio playback. Use prompt only when strong speaker lock is required.
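The train/val split mentioned above can be made reproducible with a hash-based assignment. The sketch below is a minimal illustration, not the recipe used in this benchmark: the utterance IDs, the 5% validation fraction, and the function name are all assumptions.

```python
import hashlib


def assign_split(utterance_id: str, val_fraction: float = 0.05) -> str:
    """Deterministically assign an utterance to 'train' or 'val'.

    Hashing the ID (instead of shuffling) keeps the assignment stable when
    clips are later added to or removed from the manifest.
    """
    digest = hashlib.sha256(utterance_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "val" if bucket < val_fraction else "train"


# Hypothetical utterance IDs; real IDs come from your own manifest.
utterance_ids = [f"female_01_{i:05d}" for i in range(2000)]
splits = {uid: assign_split(uid) for uid in utterance_ids}

val_ids = [uid for uid, split in splits.items() if split == "val"]
print(f"train={len(utterance_ids) - len(val_ids)} val={len(val_ids)}")
```

Because the split depends only on the ID, a clip never moves between train and val across reruns, which keeps held-out listening comparisons between checkpoints honest.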
VoxCPM 2
VoxCPM 2 is the newer full-SFT data point. LoRA fits comfortably, and full SFT now fits on an RTX 3090 Ti when the run uses gradient checkpointing, paged 8-bit optimizer state, allocator tuning, and a clean manifest.
Finetuning approach: LoRA or full SFT
Best checkpoint evidence: Full SFT selected step 2000 by held-out validation in the split run
Memory stack, dataset cleanup, disk, and validation
Production-ready: Promising - validation-selected checkpoint available
Key insight: Full SFT on a 24GB consumer GPU is no longer just theoretical, but it is not the low-friction starting point. Use VoxCPM 1.5 LoRA when you need a fast deployable path; use VoxCPM 2 full SFT when you are testing whether deeper adaptation beats adapter speed.
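To see why the 8-bit optimizer state matters for full SFT on 24GB, it helps to estimate optimizer memory on its own. The sketch below is back-of-envelope arithmetic, not a measurement from this run: the parameter count is a hypothetical placeholder, and the estimate ignores weights, gradients, activations, and the small per-block overhead that 8-bit quantization adds.

```python
GIB = 1024**3


def adam_state_gib(n_params: int, bytes_per_value: int) -> float:
    """Memory for Adam-style optimizer state: two moment tensors per parameter."""
    return 2 * n_params * bytes_per_value / GIB


# Hypothetical model size for illustration only; substitute the real count.
n_params = 2_000_000_000

fp32_state = adam_state_gib(n_params, bytes_per_value=4)
int8_state = adam_state_gib(n_params, bytes_per_value=1)

print(f"fp32 Adam state:  {fp32_state:.1f} GiB")
print(f"8-bit Adam state: {int8_state:.1f} GiB")
print(f"saved:            {fp32_state - int8_state:.1f} GiB")
```

The other parts of the memory stack target different costs: gradient checkpointing reduces activation memory by recomputing activations during the backward pass, and paging lets optimizer state spill to CPU memory during spikes rather than shrinking it.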
Qwen3-TTS 1.7B
Qwen3-TTS 1.7B with LoRA was the model where adapter scale mattered most. Scale 1.0 over-steered and produced noisy outputs; scale 0.3 to 0.35 sounded stable.
Finetuning approach: LoRA
Best checkpoint (this run): Epoch 10
Best LoRA scale: 0.3 to 0.35
Long-text stability: Good with SDPA backend
Prompt sensitivity: Low - robust to formatting variation
Production-ready: Yes
Key insight: The scale sweep matters more than checkpoint selection alone. Run a quick 5-sample listening test at scales 0.2, 0.3, 0.35, and 0.5 before committing to a checkpoint. Scale 1.0 is almost always wrong for this benchmark.
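The scale sweep above can be planned as a small grid before any audio is rendered. The sketch below only builds the list of (scale, sentence) jobs and output filenames; it does not call Qwen3-TTS. The test sentences, the `epoch_10` label, and the file-naming scheme are hypothetical, and the synthesis step is left to whatever inference script you already use.

```python
from itertools import product

# Scales from the recommendation above.
SCALES = [0.2, 0.3, 0.35, 0.5]

# Hypothetical test sentences; pick five that resemble your real narration.
SENTENCES = [
    "Welcome back to the channel.",
    "Today we are looking at three ways to cut rendering time.",
    "The total comes to one thousand two hundred and fifty dollars.",
    "Have you tried restarting the application?",
    "That is all for this episode, see you next week.",
]


def build_sweep(scales, sentences, checkpoint="epoch_10"):
    """Return one job dict per (scale, sentence) pair for a listening test."""
    jobs = []
    for scale, (idx, text) in product(scales, enumerate(sentences)):
        jobs.append({
            "lora_scale": scale,
            "text": text,
            # Filename encodes checkpoint, scale, and sentence index so clips
            # can be grouped by scale afterwards.
            "out_wav": f"{checkpoint}_scale{scale:.2f}_s{idx:02d}.wav",
        })
    return jobs


jobs = build_sweep(SCALES, SENTENCES)
print(len(jobs), "clips to render")
print(jobs[0]["out_wav"])
```

Four scales times five sentences gives 20 short clips, which is small enough to listen through in one sitting before committing to a checkpoint.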
IndexTTS2
IndexTTS2 uses full SFT (not LoRA). It requires more careful checkpoint management because the training loop had crash recovery issues in our run.
Finetuning approach: Full SFT
Best checkpoint (this run): model_step14000.pth
Long-text stability: Good
Crash recovery: Required explicit resume management
Production-ready: Yes - with operational caution
Key insight: Keep ALL checkpoints until you've done a listening eval sweep. The retention policy deleted older checkpoints before we could test them. Pin the best checkpoint explicitly once identified - don't rely on automatic deletion logic.
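One way to pin a checkpoint is to copy it out of the directory that the training loop prunes. The sketch below is an assumption about how such a guard could look, not IndexTTS2's own tooling: the directory layout, the `pinned` folder name, and both helper functions are hypothetical, and the demo runs against empty placeholder files in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def pin_checkpoint(ckpt: Path, pinned_dir: Path) -> Path:
    """Copy a checkpoint to a directory that cleanup code never touches."""
    pinned_dir.mkdir(parents=True, exist_ok=True)
    target = pinned_dir / ckpt.name
    shutil.copy2(ckpt, target)
    return target


def prune_checkpoints(ckpt_dir: Path, keep_last: int) -> list[Path]:
    """Delete all but the `keep_last` highest-step checkpoints; return what was removed."""
    ckpts = sorted(ckpt_dir.glob("model_step*.pth"),
                   key=lambda p: int(p.stem.removeprefix("model_step")))
    removed = ckpts[:-keep_last] if keep_last > 0 else ckpts
    for path in removed:
        path.unlink()
    return removed


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run"
    run_dir.mkdir()
    # Placeholder files standing in for real checkpoints.
    for step in (12000, 14000, 16000, 18000):
        (run_dir / f"model_step{step}.pth").touch()

    pinned = pin_checkpoint(run_dir / "model_step14000.pth", Path(tmp) / "pinned")
    removed = prune_checkpoints(run_dir, keep_last=2)
    pinned_survived = pinned.exists()

    print("removed:", [p.name for p in removed])
    print("pinned copy survives:", pinned_survived)
```

In the demo, pruning removes step 14000 from the run directory, but the pinned copy is untouched because it lives outside the pruned path. Copying costs disk space, which is the trade-off for not depending on retention logic.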
CosyVoice3
CosyVoice3 was the outlier. Our first full-SFT run did not reach production quality, while the corrected LoRA rerun was much more stable.
Finetuning approach: LoRA for the corrected run
Run status: Stable LoRA rerun; listening evaluation still pending
Start with VoxCPM 1.5 step 4000. It had the lowest setup friction and the cleanest no-prompt output in our run. LoRA training is straightforward and the checkpoint selection rule is simple.
If you need LoRA-style adapter control
Use Qwen3-TTS 1.7B LoRA. The scale parameter is a post-training knob: you can adjust adapter strength for different content types without running another training cycle.
If you need the most reproducible full-SFT baseline
Use IndexTTS2 for the established full-SFT baseline. Use VoxCPM 2 full SFT if you specifically want to test the newer consumer-GPU full-SFT path. It is feasible on 24GB, but only with gradient checkpointing, paged 8-bit optimizer state, clean manifests, and post-hoc validation.
If you want to evaluate CosyVoice
Use CosyVoice2 as a zero-shot baseline and evaluate the corrected CosyVoice3 LoRA checkpoint by listening before deployment. Do not deploy the failed CosyVoice3 full-SFT run.
What is IMDA NSC FEMALE_01?
IMDA NSC is the National Speech Corpus published by Singapore's Infocomm Media Development Authority. FEMALE_01 is a single-speaker subset with natural Singaporean English. We use it as a benchmark voice because its distinctive accent profile stress-tests speaker similarity in voice cloning: a model that sounds natural on this speaker is more likely to hold up on other non-American-English speakers.
Audio evidence
All audio samples from this benchmark are published in the individual model deep dives. Listen to them side by side before making a deployment decision.
Can I run these models on a single RTX 3090 Ti (24GB)?
Yes. The validated paths fit within 24GB VRAM for training or inference. The full feasibility notes - including peak VRAM, runtime, and recipe availability - are covered in Voice Cloning on a 24GB GPU: What Actually Works in 2026.
Which model has the best Singaporean English accent retention?
In this benchmark, VoxCPM and IndexTTS2 both retained the FEMALE_01 accent profile well. Qwen3-TTS at the right scale also retained it. CosyVoice3 (current run) had inconsistent retention.
Are any of these models commercially licensed for production use?
License status varies. IndexTTS2 uses a research license with commercial use restrictions. VoxCPM, Qwen3-TTS, and CosyVoice have varying commercial terms - verify the latest license on each model's repository before deploying.
What's the difference between LoRA and full SFT for TTS finetuning?
LoRA (Low-Rank Adaptation) trains a small adapter on top of a frozen base model. It uses less VRAM and trains faster, but the adapter's strength needs tuning (the lora_scale parameter). Full SFT (Supervised Fine-Tuning) updates all model weights. It requires more VRAM and longer training but tends to converge more reliably for voice profiles with strong accent characteristics.
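The trade-off can be made concrete with a parameter count for a single weight matrix. The sketch below is illustrative arithmetic under assumed dimensions (a 2048 by 2048 projection and rank 16, neither taken from any of the benchmarked models). It follows the usual LoRA formulation W' = W + scale * (B A), with the scale treated as a plain multiplier; individual toolkits may define their scale differently.

```python
def full_sft_params(d_out: int, d_in: int) -> int:
    """Full SFT updates every entry of the weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains B (d_out x rank) and A (rank x d_in); W stays frozen."""
    return rank * (d_out + d_in)


def apply_lora(w, b, a, scale):
    """Return W + scale * (B @ A) using plain nested lists."""
    rank = len(a)
    return [
        [w[i][j] + scale * sum(b[i][k] * a[k][j] for k in range(rank))
         for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Assumed dimensions for illustration only.
d_out, d_in, rank = 2048, 2048, 16
full = full_sft_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full SFT: {full:,} trainable values")
print(f"LoRA:     {lora:,} trainable values ({100 * lora / full:.2f}% of full)")

# Tiny worked example: the same adapter applied at two strengths.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]          # d_out x rank
a = [[0.5, 0.5]]            # rank x d_in
print(apply_lora(w, b, a, scale=1.0))
print(apply_lora(w, b, a, scale=0.3))
```

Lowering the scale shrinks the adapter's contribution without touching the frozen weights, which is why the strength can be swept after training instead of retraining.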