Voice Cloning on a 24GB GPU — What Actually Works in 2026
25 Mar 2026, 00:00 Z
60-second takeaway
VoxCPM 1.5, Qwen3-TTS 1.7B, IndexTTS2, and CosyVoice3 all fit within 24GB VRAM on an RTX 3090 Ti for both training and inference.
The real constraint is not memory — it is having a working recipe. LoRA paths (VoxCPM, Qwen3-TTS) have the most mature recipes. Full SFT paths (IndexTTS2, CosyVoice3) work but need explicit checkpoint management.
If you have a 24GB GPU and want to start today, VoxCPM 1.5 LoRA is the path of least resistance.
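To see why LoRA is the path of least resistance on 24 GB, a back-of-envelope VRAM budget helps. The sketch below uses standard rules of thumb (bf16 weights at 2 bytes/param, Adam with fp32 master weights and two moment buffers at ~12 bytes/param, bf16 gradients at 2 bytes/param) and an assumed ~1% trainable-parameter fraction for LoRA; these are illustrative estimates, not measurements from our benchmarks, and they ignore activations and KV caches.

```python
# Rough VRAM budget for fine-tuning a ~1.7B-param model.
# Assumptions (not measured): bf16 weights, Adam optimizer,
# LoRA adapters covering ~1% of parameters.

def sft_train_vram_gb(params_b: float) -> float:
    """Full SFT: bf16 weights (2 B/param) + fp32 Adam master weights
    and two moment buffers (~12 B/param) + bf16 grads (2 B/param)."""
    return params_b * (2 + 12 + 2)

def lora_train_vram_gb(params_b: float, lora_frac: float = 0.01) -> float:
    """LoRA: frozen bf16 base weights; optimizer state and gradients
    exist only for the small adapter fraction."""
    return params_b * 2 + params_b * lora_frac * (2 + 12 + 2)

size_b = 1.7  # e.g. a Qwen3-TTS-1.7B-class model
print(f"full SFT ~{sft_train_vram_gb(size_b):.1f} GB before activations")
print(f"LoRA     ~{lora_train_vram_gb(size_b):.1f} GB before activations")
```

The arithmetic (~27 GB of weights-plus-optimizer state for full SFT vs ~4 GB for LoRA) is why the full-SFT rows in the table below need gradient checkpointing and careful batch sizing to squeeze under 24 GB, while the LoRA rows fit with room to spare.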
Who this is for
This guide is for engineers who have a single consumer or prosumer GPU (RTX 3090, RTX 3090 Ti, RTX 4090, or similar 24GB class) and want to fine-tune a TTS model for custom voice cloning. We ran all benchmarks on an RTX 3090 Ti (24 GB VRAM) on a Tailscale-connected remote desktop.
The question we're answering: which open-source TTS models can you actually run — training and inference — on a single 24GB GPU in 2026?
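The "explicit checkpoint management" caveat for the full-SFT paths is mostly about disk: each checkpoint of a multi-billion-parameter model is several gigabytes, and a long run will fill a workstation drive quickly. A minimal rotation helper, written here as a hypothetical sketch (it is not part of any model's official training scripts, and it assumes the common `checkpoint-<step>` directory naming), looks like this:

```python
# Keep only the N most recent checkpoint-<step> directories.
# Hypothetical helper, assuming Trainer-style "checkpoint-1000"
# directory names; adapt the prefix/parsing to your training script.
import os
import shutil

def rotate_checkpoints(ckpt_dir: str, keep: int = 3) -> list[str]:
    """Delete all but the `keep` highest-step checkpoints; return survivors."""
    ckpts = sorted(
        (d for d in os.listdir(ckpt_dir) if d.startswith("checkpoint-")),
        key=lambda d: int(d.split("-")[1]),  # order by training step
    )
    for old in ckpts[:-keep]:
        shutil.rmtree(os.path.join(ckpt_dir, old))
    return ckpts[-keep:]
```

Calling this after each save bounds disk usage at roughly `keep` checkpoints; if you also want to preserve the best-validation checkpoint, exclude it from the deletion list before pruning.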
The short answer
| Model | VRAM fit (24GB) | Training mode | Recipe maturity | Deployable result? |
| --- | --- | --- | --- | --- |
| VoxCPM 1.5 | ✅ Fits | LoRA | Mature | ✅ Yes |
| Qwen3-TTS 1.7B | ✅ Fits | LoRA | Mature | ✅ Yes |
| IndexTTS2 | ✅ Fits | Full SFT |