60-second takeaway
VoxCPM 1.5, VoxCPM 2, Qwen3-TTS 1.7B, IndexTTS2, and CosyVoice3 all fit within 24GB VRAM on an RTX 3090 Ti for the validated training or inference paths we tested. The real constraint is not just memory - it is having a working recipe. LoRA paths (VoxCPM 1.5, Qwen3-TTS, CosyVoice3) are the fastest iteration loops. Full SFT paths (IndexTTS2, VoxCPM 2) work but need explicit checkpoint management, memory controls, and validation. If you have a 24GB GPU and want to start today, VoxCPM 1.5 LoRA is the path of least resistance.
Who this is for
This guide is for engineers who have a single consumer or prosumer GPU (RTX 3090, RTX 3090 Ti, RTX 4090, or similar 24GB class) and want to fine-tune a TTS model for custom voice cloning. We ran all benchmarks on an RTX 3090 Ti (24 GB VRAM) on a Tailscale-connected remote desktop.
The question we're answering: which open-source TTS models can you actually run - training and inference - on a single 24GB GPU in 2026?
The short answer
| Model | VRAM fit (24GB) | Training mode | Recipe maturity | Deployable result? |
|---|---|---|---|---|
| VoxCPM 1.5 | ✅ Fits | LoRA | Mature | ✅ Yes |
| VoxCPM 2 | ✅ Fits | LoRA or full SFT | Validated with memory controls | ✅ Validation-selected checkpoint |
| Qwen3-TTS 1.7B | | | | |
The validated paths fit. The differentiator is not just raw VRAM - it is recipe stability, checkpoint management, dataset cleanliness, and whether you need LoRA speed or full-SFT depth.
VoxCPM 1.5 on 24GB
VRAM profile: Fits comfortably within 24GB for both LoRA training and inference.
Training mode: LoRA fine-tuning. The 44.1 kHz audio prep requirement is the main setup step - resample your dataset before training.
Recipe: Standard LoRA train/val split. No custom modifications required for 24GB. Training to step 9000 is feasible in a single session.
What to watch:
Audio resampling to 44.1 kHz is mandatory. Skip this and training diverges.
Validation loss is a reliable guide here - step 4000 was the best in our benchmark, but your dataset may differ.
No-prompt inference is more stable than prompted inference for clean output.
Expected runtime: Training to step 9000 on FEMALE_01 completed in a single GPU session without memory pressure.
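The 44.1 kHz requirement is easy to check before a run starts. The following is a minimal sketch, assuming the dataset is a directory of uncompressed PCM WAV files; the function name and directory layout are illustrative and not part of the VoxCPM tooling. It only reports files that still need resampling and leaves the resampling itself to whatever tool you already use.

```python
import wave
from pathlib import Path

TARGET_SAMPLE_RATE = 44_100  # 44.1 kHz, the rate the VoxCPM 1.5 recipe expects


def find_wrong_sample_rate(dataset_dir, target=TARGET_SAMPLE_RATE):
    """Return (path, sample_rate) for every .wav file not at the target rate.

    Assumes uncompressed PCM WAV files, which is what the stdlib wave module reads.
    """
    mismatches = []
    for wav_path in sorted(Path(dataset_dir).rglob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != target:
            mismatches.append((wav_path, rate))
    return mismatches
```

An empty result means every WAV file found is already at the target rate; anything else is a list of files to resample before launching training.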
VoxCPM 2 on 24GB
VRAM profile: LoRA fits comfortably. Full SFT fits only with gradient checkpointing, paged 8-bit optimizer state, and allocator tuning.
Training mode: LoRA for fast adaptation, full SFT for deeper adaptation. The full-SFT run used gradient_checkpointing: true, PagedAdamW8bit, and PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
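As a rough illustration of how these three controls can be wired into a training entry point, here is a sketch under stated assumptions: the config dictionary and its "optimizer" key are hypothetical, the learning rate is left to the caller, and only `PagedAdamW8bit` (from the bitsandbytes library) and the `PYTORCH_CUDA_ALLOC_CONF` variable come from the run described above.

```python
import os

# The allocator setting must be in place before torch initialises CUDA,
# so set it before the first `import torch` in the training entry point.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# Hypothetical config fragment: the key names are illustrative, not the
# actual schema of the VoxCPM 2 training config.
FULL_SFT_MEMORY_CONFIG = {
    "gradient_checkpointing": True,
    "optimizer": "PagedAdamW8bit",
}


def build_paged_optimizer(params, lr):
    """Build the paged 8-bit AdamW optimizer from bitsandbytes.

    Imported lazily so this module still loads on machines without bitsandbytes.
    """
    import bitsandbytes as bnb

    return bnb.optim.PagedAdamW8bit(params, lr=lr)
```

Using `setdefault` keeps any value already exported in the shell, which makes it easier to override the allocator setting per run.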
Recipe: Clean grouped 90/5/5 split, train-only production run, then post-hoc validation across saved checkpoints without optimizer state loaded.
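One way to build a grouped split is sketched below. It assumes that "grouped" means rows sharing a group id (for example a recording session) must stay in the same split; the `group_key` field name is an assumption, and our actual split script may differ. Because whole groups are assigned, the 90/5/5 ratio holds only approximately.

```python
import hashlib
from collections import defaultdict


def grouped_split(rows, group_key, val_pct=5, test_pct=5):
    """Assign whole groups to train/val/test using a stable hash of the group id.

    Hashing keeps the assignment deterministic across runs and machines, and
    guarantees that rows sharing a group never straddle two splits.
    """
    splits = defaultdict(list)
    for row in rows:
        digest = hashlib.sha256(str(row[group_key]).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
        if bucket < val_pct:
            name = "val"
        elif bucket < val_pct + test_pct:
            name = "test"
        else:
            name = "train"
        splits[name].append(row)
    return dict(splits)
```

Keeping groups intact is what prevents near-duplicate utterances from leaking between train and validation, which would make the held-out numbers look better than they are.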
What to watch:
Vanilla full SFT runs out of memory (OOM) on 24GB. Optimizer changes alone do not help if the OOM happens at the first forward pass, before any optimizer state is allocated.
Empty-text manifest rows are toxic to stop loss. Clean the manifest before launching a long run.
The final checkpoint is not necessarily the best. Our held-out validation selected step 2000 over the final 9000-step checkpoint.
Expected runtime: About 2.4 hours for 9000-step LoRA and about 5 hours for a 9000-step full-SFT run at the validated effective batch settings.
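The manifest cleanup mentioned under "What to watch" can be as small as the sketch below. It assumes a JSONL manifest with one JSON object per line and a `text` field; both the format and the field name are assumptions, so adjust them to the manifest your recipe actually uses.

```python
import json


def drop_empty_text_rows(manifest_in, manifest_out, text_field="text"):
    """Copy a JSONL manifest, skipping rows whose text is missing or blank.

    Returns (kept, dropped) so the counts can go into the run log.
    """
    kept = dropped = 0
    with open(manifest_in, encoding="utf-8") as src, \
         open(manifest_out, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines in the file itself
            row = json.loads(line)
            text = row.get(text_field)
            if isinstance(text, str) and text.strip():
                dst.write(json.dumps(row, ensure_ascii=False) + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Writing to a new file rather than editing in place keeps the original manifest available for auditing which rows were removed.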
Qwen3-TTS 1.7B on 24GB
VRAM profile: Fits within 24GB for LoRA training and inference.
Training mode: LoRA. Requires JSONL dataset prep and codec preprocessing before training. The sft_12hz.py script handles the codec path.
Recipe: LoRA train/val/test split. The lora_scale parameter at inference time is the key tuning knob - not just checkpoint selection.
What to watch:
Run the codec preprocessing step before training. Skipping it causes silent failures.
Use SDPA (Scaled Dot-Product Attention) backend for inference - it reduces VRAM pressure and improves long-text stability.
Scale sweep at inference: test 0.2, 0.3, 0.35, 0.5. Scale 1.0 almost always over-steers.
Deterministic decode (fixed seed) is required for reproducible listening comparisons.
Expected runtime: Training to epoch 10 on FEMALE_01 fits within one GPU session. The codec prep adds ~15 minutes of CPU time upfront.
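The scale sweep and the fixed-seed requirement combine naturally into one loop. This is a sketch only: `synthesize` is a hypothetical callable standing in for whatever inference entry point you use, its parameter names are assumptions, and the seed value is arbitrary.

```python
SCALES = (0.2, 0.3, 0.35, 0.5)  # sweep values from the list above


def run_scale_sweep(synthesize, text, scales=SCALES, seed=1234):
    """Call `synthesize` once per LoRA scale, always with the same seed.

    `synthesize` is a hypothetical callable (text, lora_scale, seed) -> audio.
    Holding the seed constant means differences between outputs come from
    the scale, not from sampling noise.
    """
    results = {}
    for scale in scales:
        results[scale] = synthesize(text=text, lora_scale=scale, seed=seed)
    return results
```

Running the same sweep for each candidate checkpoint gives a grid of checkpoint by scale, which is the comparison that matters when scale is as influential as checkpoint choice.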
IndexTTS2 on 24GB
VRAM profile: Fits within 24GB for full SFT training and inference.
Training mode: Full SFT with resume support. This is more memory-intensive than LoRA paths but still fits 24GB without quantization.
Recipe: Process manifests (FEMALE_01_44k format) → full SFT with explicit resume management. The training loop has a crash-prone resume path in some versions - use explicit checkpoint save paths and keep crash logs.
What to watch:
Do not rely on automatic checkpoint retention. Keep all checkpoints manually until you have done a listening sweep.
The best validation region in our run was around step 13800, but the nearest saved checkpoint was step 14000. This is typical - save more frequently than you think you need to.
Crashes during training are recoverable with the right resume path. Keep detailed logs.
Expected runtime: Training to step 15000+ was achievable on 24GB, but required crash recovery in our run.
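The gap between the best validation step and the nearest saved checkpoint is simple to compute once validation losses and saved steps are recorded. A minimal sketch follows; the helper names are ours, and the saved-step list in the usage note is hypothetical apart from step 14000.

```python
def best_validation_step(val_losses):
    """Return the step with the lowest validation loss from a {step: loss} dict."""
    return min(val_losses, key=val_losses.get)


def nearest_saved_checkpoint(target_step, saved_steps):
    """Return the saved step closest to target_step (ties go to the earlier step)."""
    return min(sorted(saved_steps), key=lambda step: abs(step - target_step))
```

For example, `nearest_saved_checkpoint(13800, [12000, 13000, 14000, 15000])` returns `14000`. The distance between the two steps is a direct measure of how much a shorter save interval would have bought you.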
CosyVoice3 on 24GB
VRAM profile: Fits within 24GB for LoRA fine-tuning.
Training mode: LoRA via PEFT-integrated tooling. Our earlier full-SFT run failed after epoch 1; the corrected LoRA run was the stable path.
Recipe: Available (train_cosyvoice3_lora.py, infer_cosyvoice3_lora.py). The recipe works at the hardware level - the issues in our run were checkpoint management and prompt handling, not VRAM.
What to watch:
Tighter checkpoint gating (explicit save every N epochs) is required before the run can be evaluated properly.
Long-text generation (>20s) was unstable in the current run configuration.
Expected runtime: LoRA training fits comfortably within 24GB, but setting up proper checkpoint gating adds time before the first reliable evaluation.
Hardware notes: RTX 3090 Ti specifics
All runs were on an RTX 3090 Ti with 24 GB GDDR6X. A few GPU-specific observations:
Thermal throttling: Long training runs (4+ hours) on the 3090 Ti can trigger thermal throttling under poor airflow. Monitor GPU temperature and ensure adequate case ventilation.
Memory bandwidth: The 3090 (936 GB/s) and 3090 Ti (1008 GB/s) are in the same bandwidth class, so full-SFT feasibility is determined more by memory controls than by the small bandwidth and clock difference.
A100/H100 comparison: These 24GB consumer runs are roughly 2 to 4x slower than equivalent runs on an A100 80GB. For production-scale fine-tuning (larger datasets, more epochs), a cloud A100 is significantly faster. The 24GB path is viable for prototyping and single-speaker benchmarking.
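For the thermal and memory monitoring mentioned above, polling `nvidia-smi` from a side process is usually enough. The sketch below assumes `nvidia-smi` is on the PATH; the temperature limit is deliberately left as a required argument, since the right threshold depends on your card and cooling.

```python
import subprocess

QUERY = "temperature.gpu,memory.used,memory.total"


def parse_gpu_line(line):
    """Parse one line of `nvidia-smi --format=csv,noheader,nounits` output."""
    temp_c, used_mib, total_mib = (int(field.strip()) for field in line.split(","))
    return {"temp_c": temp_c, "used_mib": used_mib, "total_mib": total_mib}


def read_gpu_stats():
    """Query every visible GPU once; requires nvidia-smi on PATH."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [parse_gpu_line(line) for line in output.splitlines() if line.strip()]


def is_running_hot(stats, limit_c):
    """True if any GPU in `stats` is at or above the caller-chosen limit."""
    return any(gpu["temp_c"] >= limit_c for gpu in stats)
```

Calling `read_gpu_stats()` every few seconds and logging the result alongside the training step makes it possible to correlate slowdowns with temperature after the fact.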
Practical setup checklist
Before starting any of these runs on a 24GB GPU:
□ Verify CUDA version matches model requirements (check README)
□ Pre-process and resample dataset to model-required sample rate
□ Set explicit checkpoint save paths (do not rely on defaults)
□ Confirm available disk space (full SFT checkpoints can be several GB each)
□ Set up crash recovery / resume path before starting long runs
□ Run a 10-minute smoke test (1 epoch, small batch) before committing to full training
□ Keep a training log for each run (model, dataset, LR, steps, VRAM peak)
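Two of the checklist items, the disk-space check and the per-run training log, are easy to script. This is a sketch with illustrative names; the log fields follow the checklist (model, dataset, LR, steps, VRAM peak), while the JSONL format and the `note` field are our assumptions.

```python
import json
import shutil
import time


def free_disk_gb(path):
    """Free space in GiB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def append_run_log(log_path, model, dataset, lr, steps, vram_peak_gb, note=""):
    """Append one JSON line per run so results stay comparable later."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "model": model,
        "dataset": dataset,
        "lr": lr,
        "steps": steps,
        "vram_peak_gb": vram_peak_gb,
        "note": note,
    }
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record) + "\n")
    return record
```

An append-only JSONL file survives crashes mid-run and can be loaded later with a few lines of Python when comparing recipes.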
FAQ
Can I use a 16GB GPU (such as the RTX 4080) instead?
Not without modification. VoxCPM 1.5, Qwen3-TTS, and CosyVoice3 LoRA may be achievable with reduced batch size and gradient checkpointing, but we have not tested this. IndexTTS2 and VoxCPM 2 full SFT are unlikely to fit 16GB without more aggressive quantization, sharding, or offload.
How long does a full benchmark run take on a 3090 Ti?
Rough estimates for FEMALE_01-scale datasets (single speaker, about 2 to 5 hours of audio):
VoxCPM 1.5 or VoxCPM 2 LoRA to step 9000: about 2.5 to 6 hours depending on recipe and effective batch
VoxCPM 2 full SFT to step 9000: about 5 hours with gradient checkpointing and paged optimizer state
Qwen3-TTS LoRA to epoch 10: 4 to 8 hours
IndexTTS2 full SFT to step 15000: 8 to 16 hours (with crash recovery overhead)
CosyVoice3 LoRA: checkpoint gating and listening evaluation are the main overheads
Is LoRA always better than full SFT for 24GB runs?
Not necessarily. LoRA is faster and uses less VRAM, but full SFT can produce more stable adaptation for some voice profiles. In our benchmark, the LoRA models (VoxCPM 1.5, Qwen3-TTS, CosyVoice3 rerun) were easier to iterate. Full SFT (IndexTTS2, VoxCPM 2) also worked but required more operational discipline. Choose based on your iteration speed, disk budget, and validation requirements.
What is IMDA NSC FEMALE_01?
IMDA NSC is Singapore's National Speech Corpus. FEMALE_01 is a single-speaker set with natural Singaporean English. We use it as a benchmark because the accent profile stress-tests speaker similarity in voice cloning. See IMDA NSC Voice Cloning Finetuning Benchmark 2026 for the full methodology.