Voice Cloning on a 24GB GPU - What Actually Works in 2026


25 Mar 2026, 00:00 Z

60-second takeaway
VoxCPM 1.5, VoxCPM 2, Qwen3-TTS 1.7B, IndexTTS2, and CosyVoice3 all fit within 24GB VRAM on an RTX 3090 Ti for the validated training or inference paths we tested.
The real constraint is not memory alone; it is having a working recipe. The LoRA paths (VoxCPM 1.5, Qwen3-TTS, CosyVoice3) give the fastest iteration loops. The full SFT paths (IndexTTS2, VoxCPM 2) work but need explicit checkpoint management, memory controls, and validation.
If you have a 24GB GPU and want to start today, VoxCPM 1.5 LoRA is the path of least resistance.

Who this is for

This guide is for engineers who have a single consumer or prosumer GPU (RTX 3090, RTX 3090 Ti, RTX 4090, or similar 24GB class) and want to fine-tune a TTS model for custom voice cloning. We ran all benchmarks on a single RTX 3090 Ti (24GB VRAM) in a Tailscale-connected remote desktop.

The question we're answering: which open-source TTS models can you actually run - training and inference - on a single 24GB GPU in 2026?

The short answer

The 24GB class is the comfortable consumer baseline, but many teams start below it. Use this picker before deciding whether to fine-tune locally, run inference only, or use a hosted API while you collect cleaner voice data.

Hardware you have | Inference fit | Fine-tuning fit | First model path to test
CPU only | Feasible only for small or slow local tests | Not practical for modern voice cloning | Use CPU to verify tooling or test Kokoro-style small models; use hosted or remote GPU for real cloning evaluation
4GB VRAM | Very limited, mostly small models or aggressive offload | Not practical | Treat as edge or demo hardware, not a training box
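
To find your row in the picker above, you first need to know how much VRAM the machine exposes. The following is a minimal sketch, not part of our benchmark tooling: it asks `nvidia-smi` for total memory on the first GPU and maps the result to a row label. The function names `detect_vram_gb` and `picker_row` are hypothetical, and the cut-offs (4.5 and 23) are assumptions chosen only to absorb rounding in reported memory; sizes between the rows shown here are reported as not covered.

```python
import shutil
import subprocess
from typing import Optional


def detect_vram_gb() -> Optional[float]:
    """Return total VRAM of the first NVIDIA GPU in GB, or None if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on PATH: treat as CPU only
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return None
    lines = out.strip().splitlines()
    if not lines:
        return None
    try:
        return float(lines[0]) / 1024  # nvidia-smi reports MiB
    except ValueError:
        return None


def picker_row(vram_gb: Optional[float]) -> str:
    """Map detected VRAM to a picker row. Thresholds are illustrative assumptions."""
    if vram_gb is None:
        return "CPU only"
    if vram_gb <= 4.5:
        return "4GB VRAM"
    if vram_gb >= 23:
        return "24GB class"
    return "between rows (not covered above)"


if __name__ == "__main__":
    print(picker_row(detect_vram_gb()))
```

Run it on the target machine before choosing a model path; if it prints "CPU only" on a box that has a GPU, the driver or `nvidia-smi` is probably not installed or not on `PATH`.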
