How Open-Source TTS Architectures Differ - And What It Means for Fine-Tuning (2026)


30 Mar 2026, 00:00 Z

60-second takeaway
Six open-source TTS models dominate the 2026 fine-tuning landscape. They look similar on paper - most support voice cloning, most fit on a 24 GB GPU, most produce natural-sounding output. But they use fundamentally different architectures, and those differences determine which fine-tuning approach works, which LoRA framework you need, how long data preprocessing takes, and whether you can deploy commercially.
We fine-tuned five of these models on the same single-speaker corpus (IMDA NSC FEMALE_01) and analyzed the sixth from its paper and model code. This article explains what the architectures tell you about fine-tuning - before you commit GPU hours.

Who this is for

  • Engineers choosing a TTS model to fine-tune: You have read the benchmarks. You know which models exist. You need to understand the architectural differences before committing to a fine-tuning approach - because picking the wrong approach wastes days, not hours.
  • ML engineers adding TTS to a pipeline: You need to know which models wrap with standard PEFT, which need custom LoRA libraries, and which only support full SFT - before you design your training infrastructure.
  • Technical leads evaluating licenses: Two of these six models have license restrictions that are not obvious from their GitHub repos. This article flags them before you build on top of them.
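Before the model-by-model comparison, it helps to recall what LoRA actually changes and why the choice of framework matters less than the parameter math. The sketch below uses NumPy with illustrative shapes (the hidden size and rank are not taken from any of the six models; real frameworks apply this update to specific attention projections inside the backbone):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen base weight, standing in for one attention projection
# in an LLM backbone (hidden size 2048 is illustrative).
d = 2048
W = rng.standard_normal((d, d)).astype(np.float32)

# LoRA trains a low-rank update: W_eff = W + (alpha / r) * B @ A
r, alpha = 16, 32
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01  # trainable
B = np.zeros((d, r), dtype=np.float32)  # trainable, zero-init so the
                                        # update starts as a no-op

W_eff = W + (alpha / r) * (B @ A)

full_params = W.size                # what full SFT would update
lora_params = A.size + B.size       # what LoRA updates instead
print(f"full fine-tune params per matrix: {full_params:,}")
print(f"LoRA params per matrix:           {lora_params:,}")
print(f"ratio: {lora_params / full_params:.1%}")
```

The ratio is the whole story: per projection matrix, LoRA trains under 2% of the parameters here, which is why a model whose backbone wraps cleanly with a standard PEFT library fine-tunes on a single 24 GB card while a model that only supports full SFT may not.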

The six models

| Model | Total params | LLM backbone | Released | License |
| --- | --- | --- | --- | --- |
| Voxtral 4B | ~4.1B | Ministral-3B | March 2026 | CC BY-NC 4.0 |
| Qwen3-TTS 1.7B | 1.7B | | | |

Voice cloning
