60-second takeaway
Six open-source TTS models dominate the 2026 fine-tuning landscape. They look similar on paper - most do voice cloning, most fit on 24GB, most produce good output. But they use fundamentally different architectures, and those differences determine which fine-tuning approach works, which LoRA framework you need, how long data preprocessing takes, and whether you can deploy commercially. We fine-tuned five of these models on the same single-speaker corpus (IMDA NSC FEMALE_01) and analyzed the sixth from its paper and model code. This article explains what the architectures tell you about fine-tuning - before you commit GPU hours.
Who this is for
Engineers choosing a TTS model to fine-tune: You have read the benchmarks. You know which models exist. You need to understand the architectural differences before committing to a fine-tuning approach - because picking the wrong approach wastes days, not hours.
ML engineers adding TTS to a pipeline: You need to know which models wrap with standard PEFT, which need custom LoRA libraries, and which only support full SFT - before you design your training infrastructure.
Technical leads evaluating licenses: Two of these six models have license restrictions that are not obvious from their GitHub repos. This article flags them before you build on top of them.
The six models
| Model | Total params | LLM backbone | Released | License |
|---|---|---|---|---|
| Voxtral 4B | ~4.1B | Ministral-3B | March 2026 | CC BY-NC 4.0 |
| Qwen3-TTS 1.7B | 1.7B | Qwen3 (talker) | — | Apache-2.0 |
| IndexTTS2 | — | Custom GPT-2 (~560M) | — | Apache-2.0 |
| Chatterbox | — | Llama-based T3 | — | MIT |
| CosyVoice 3 | — | Qwen2 | — | Apache-2.0 |
| Fish Speech S2 Pro | — | Qwen3-4B (Slow AR) | — | Fish Audio Research License |
These are the models with active fine-tuning ecosystems or clear fine-tuning potential as of March 2026. Models where upstream already provides complete fine-tuning tooling (VoxCPM 1.5, GPT-SoVITS v4) are excluded - you do not need a guide for those.
Three generation paradigms
Every model in this list generates speech through a multi-stage pipeline. The stages differ, and that difference determines your fine-tuning strategy. Three paradigms cover all six models.
Paradigm A: Autoregressive LLM → flow-matching decoder → vocoder
Models: Voxtral, Chatterbox, CosyVoice 3
Text → LLM predicts speech tokens (autoregressive)
→ Flow-matching model converts tokens to mel/acoustic representation
→ Vocoder synthesizes waveform
The LLM backbone is the primary fine-tuning target. The flow-matching decoder and vocoder are typically frozen - they handle acoustic rendering, not linguistic content or speaker identity.
Why this matters for fine-tuning: You only need to LoRA or SFT the backbone. The flow-matching decoder (100–390M parameters depending on the model) and vocoder are left untouched. This makes fine-tuning cheaper than the total parameter count suggests - Voxtral is 4.1B total, but you only fine-tune the 3.4B backbone.
Voxtral specifics: The backbone predicts one semantic token per frame from an 8,192-entry codebook. The 390M flow-matching transformer then runs 8 Euler steps per frame to produce 36 acoustic values. The paper explicitly states: freeze text-embedding layers during fine-tuning to prevent overfitting on rare vocabulary. The 300M Voxtral Codec is both the tokenizer and the vocoder - it encodes reference audio into tokens and decodes generated tokens back to 24 kHz waveforms.
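The freeze-embeddings advice from the paper translates to a few lines of PyTorch. A minimal sketch, assuming a HuggingFace-style model that exposes `get_input_embeddings()`; the function name is ours, not Mistral's:

```python
import torch.nn as nn

def freeze_input_embeddings(model: nn.Module) -> int:
    """Freeze the text-embedding table so rare-vocabulary rows
    cannot drift during fine-tuning. Returns parameter count frozen."""
    emb = model.get_input_embeddings()  # HF-style accessor
    frozen = 0
    for p in emb.parameters():
        p.requires_grad = False
        frozen += p.numel()
    return frozen
```

Run this after loading the model and before building the optimizer, so the frozen parameters are excluded from (or ignored by) the update step.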
Chatterbox specifics: The T3 component (text-to-speech-token, Llama-based) is the fine-tuning target. S3Gen (flow-matching) and the HiFT-GAN vocoder are frozen. The S3Tokenizer is shared with CosyVoice 2 - data prepared for one codec family partially transfers.
CosyVoice 3 specifics: The Qwen2 backbone predicts speech tokens from a 6,561-entry vocabulary. The DiT flow-matching decoder (depth 22, dim 1024, ~300M params) converts tokens to 80-band mel spectrograms. A causal HiFiGAN variant produces 24 kHz audio.
Paradigm B: Multi-codebook autoregressive → codec decoder (no separate flow stage)
Models: Qwen3-TTS, Fish Speech S2 Pro
Text → AR model predicts all codebook layers (semantic + residual)
→ Codec decoder reconstructs waveform directly from tokens
There is no separate flow-matching or diffusion stage. The autoregressive model handles both semantic and acoustic token prediction. A lightweight codec decoder (not a full vocoder) converts tokens to audio.
Why this matters for fine-tuning: One LoRA pass covers the entire generation pipeline. You do not need to decide whether to also fine-tune a separate flow model. But the model is doing more work per forward pass (predicting multiple codebook layers), which affects training dynamics.
Qwen3-TTS specifics: A dual-track architecture - the talker predicts codebook-0 (semantic layer) autoregressively, then an MTP (Multi-Token Prediction) sub-talker predicts 15 residual codebooks simultaneously at each position. Total: 16 codebook layers at 12.5 Hz. The model registers as Qwen3TTSForConditionalGeneration - standard HuggingFace PEFT wrapping works natively. This is the simplest LoRA integration of any model in this list.
Fish Speech S2 Pro specifics: DualAR architecture - a 4B Slow AR (built on Qwen3-4B) predicts codebook-0 at each timestep, then a 400M Fast AR (4-layer shallow transformer) predicts codebooks 1–9 depth-wise. The Slow AR hidden state conditions the Fast AR. This is structurally isomorphic to a standard causal LM, so it inherits LLM serving optimizations (KV-cache, speculative decoding). However, LoRA uses a custom loralib implementation, not HuggingFace PEFT.
Paradigm C: Autoregressive backbone → DiT mel predictor → vocoder
Models: IndexTTS2
Text → GPT predicts semantic codes (autoregressive)
→ DiT converts semantic codes to mel spectrogram (non-autoregressive)
→ BigVGAN synthesizes waveform
The GPT backbone is the fine-tuning target. The S2M DiT (Semantic-to-Mel Diffusion Transformer) and BigVGAN vocoder are frozen.
Why this matters for fine-tuning: This is the only model that predicts semantic codes and then uses a separate non-autoregressive model to produce mel spectrograms. The DiT adds quality but cannot be LoRA'd with standard tools, and the GPT backbone is a custom GPT-2 rather than a HuggingFace PreTrainedModel - so full SFT is the only fine-tuning path.
IndexTTS2 specifics: The GPT backbone is ~560M parameters (24 layers, dim 1280, 20 heads) - significantly smaller than often cited. It uses a MaskGCT semantic codec (SeamlessM4T → Wav2Vec2Bert → RepCodec) with an 8,192-entry codebook. Training loss is weighted: 0.2 text cross-entropy + 0.8 mel cross-entropy.
Paradigm summary
| Paradigm | Models | Fine-tuning target | What's frozen | LoRA possible? |
|---|---|---|---|---|
| A: AR + flow-matching + vocoder | Voxtral, Chatterbox, CosyVoice 3 | LLM backbone | Flow model + vocoder | Yes (backbone only) |
| B: Multi-codebook AR + codec | Qwen3-TTS, Fish Speech S2 Pro | Full AR model | Codec encoder/decoder | Yes (single LoRA pass) |
| C: AR + DiT + vocoder | IndexTTS2 | GPT backbone | DiT + BigVGAN | No (full SFT only) |
Codec and tokenizer comparison
The codec determines what the model is actually learning to predict - and misunderstanding this is the fastest way to waste a fine-tuning run.
| Model | Codec | Token rate | Semantic vocab | Tokens per frame | Output sample rate |
|---|---|---|---|---|---|
| Voxtral 4B | Voxtral Codec (VQ + FSQ hybrid) | 12.5 Hz | 8,192 | 37 (1 semantic + 36 acoustic) | 24 kHz |
| Qwen3-TTS | Custom RVQ (16 codebooks) | 12.5 Hz | 2,048 per layer | 16 | 24 kHz |
| IndexTTS2 | MaskGCT semantic (RepCodec) | ~25 Hz | 8,192 | 1 | 22 kHz |
| Chatterbox | S3Tokenizer (FSQ) | 25 Hz | 6,561 | 1 | 24 kHz |
| Fish Speech S2 Pro | Modified DAC (RVQ) | ~21 Hz | 4,096 | 10 (1 slow + 9 fast) | 44.1 kHz |
| CosyVoice 3 | FSQ-MinMo (multi-task) | 25 Hz | 6,561 | 1 | 24 kHz |
What the token rate tells you
Lower token rate (12.5 Hz) means fewer tokens per second of audio, which means shorter training sequences and faster training per sample. Higher token rate (25 Hz) means more temporal resolution but longer sequences.
Voxtral and Qwen3-TTS at 12.5 Hz produce half as many tokens per second as IndexTTS2 or Chatterbox at 25 Hz. For a 10-second audio clip, that is 125 tokens vs 250 tokens - a 2x difference in sequence length that directly affects training memory and speed.
What the tokens-per-frame count tells you
Voxtral predicts 37 tokens per frame (1 semantic + 36 acoustic). Qwen3-TTS predicts 16 tokens per frame. Chatterbox and CosyVoice 3 predict just 1 token per frame.
More tokens per frame means more information is being generated per timestep, which generally means higher audio quality but also more complex training dynamics. Models with 1 token per frame rely on a separate flow-matching or DiT stage to reconstruct the acoustic details.
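The interaction of token rate and tokens-per-frame determines training sequence length. A quick back-of-envelope helper, using the rates from the table above:

```python
def tokens_for_clip(seconds: float, frame_rate_hz: float,
                    tokens_per_frame: int) -> int:
    """Total speech tokens the model must handle for one clip."""
    return int(seconds * frame_rate_hz * tokens_per_frame)

# 10-second clip, per the codec table:
voxtral    = tokens_for_clip(10, 12.5, 37)  # 4625 total (125 frames)
qwen3_tts  = tokens_for_clip(10, 12.5, 16)  # 2000 total (125 frames)
chatterbox = tokens_for_clip(10, 25, 1)     # 250 total (250 frames)
```

Frame count drives autoregressive sequence length (and hence memory); tokens-per-frame drives how much the model must predict at each step.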
The sample rate trap
Five of six models operate at 22–24 kHz. Fish Speech S2 Pro operates at 44.1 kHz. If you switch between models without checking sample rate, training may appear to work - loss decreases normally - but output quality degrades silently. This was the #1 silent failure across our fine-tuning runs. Always verify sample rate before every training run:
```bash
# Check before training - every time
ffprobe -v error -show_entries stream=sample_rate -of csv=p=0 your_audio.wav

# Resample if needed
ffmpeg -i input.wav -ar 24000 output_24k.wav   # For most models
ffmpeg -i input.wav -ar 44100 output_44k.wav   # For Fish Speech
```
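For batch checks inside a training script, the same guard can live in Python. A minimal sketch using the stdlib `wave` module (PCM WAV only; use `soundfile` or `torchaudio` for other formats):

```python
import wave

def assert_sample_rate(path: str, expected_hz: int) -> int:
    """Fail fast before training rather than degrade silently."""
    with wave.open(path, "rb") as wf:
        actual = wf.getframerate()
    if actual != expected_hz:
        raise ValueError(f"{path}: {actual} Hz, expected {expected_hz} Hz")
    return actual
```

Calling this over every file in the manifest as the first step of a run turns the #1 silent failure into a loud one.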
Codec family relationships
Chatterbox's S3Tokenizer is derived from the same family as CosyVoice 2's S3TokenizerV2. CosyVoice 3 upgraded to a new FSQ-MinMo tokenizer trained on multi-task objectives (ASR, speech emotion recognition, language ID, audio event detection). This means CosyVoice 3 tokens carry richer information than CosyVoice 2 tokens - prosody, emotion, and speaker style are encoded, not just phonemic content.
LoRA compatibility: why some models wrap with PEFT and others don't
Not all LoRA is the same. The architecture of the backbone determines which LoRA framework works - and getting this wrong means your training script crashes before the first step.
| Model | LoRA framework | Why | PEFT get_peft_model() works? |
|---|---|---|---|
| Qwen3-TTS | HuggingFace PEFT | Talker is a standard Qwen3 transformer decoder | Yes - native |
| Voxtral 4B | HuggingFace PEFT (expected) | Ministral backbone is HF-compatible | Likely yes - untested |
| CosyVoice 3 | HuggingFace PEFT (on Qwen2) | Qwen2 backbone is HF-compatible | Yes for backbone, not for DiT |
| Chatterbox | HuggingFace PEFT or custom | Llama-based T3 component | Partial - community repos exist |
| Fish Speech S2 Pro | Custom loralib | Not PEFT - uses setup_lora() that replaces nn.Linear layers | No |
| IndexTTS2 | None | Custom GPT-2, not a PreTrainedModel | No - full SFT only |
The PEFT compatibility rule
If the backbone inherits from a HuggingFace PreTrainedModel (Qwen2, Qwen3, Llama, Ministral), standard PEFT LoRA wrapping works. If the backbone is custom (IndexTTS2's GPT-2, Fish Speech's DualAR), you either need the upstream's custom LoRA implementation or you are limited to full SFT.
This is an architectural constraint, not a library limitation. Fish Speech's DualAR uses loralib (a separate library from PEFT) with a custom setup_lora() function that walks the model and replaces nn.Linear layers. You cannot swap in PEFT without rewriting the injection logic.
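To make the distinction concrete, here is the general shape of name-based injection - a simplified sketch of the pattern, not Fish Speech's actual `setup_lora()`:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable rank-r update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False
        self.lora_a = nn.Parameter(torch.zeros(r, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.normal_(self.lora_a, std=0.02)  # B stays zero: delta starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

def inject_lora(model: nn.Module, name_keywords=("attn", "mlp"), r=8):
    """Walk the module tree and swap matching nn.Linear layers -
    the string-matching style that loralib-based scripts rely on."""
    for full_name, module in list(model.named_modules()):
        for child_name, child in list(module.named_children()):
            path = f"{full_name}.{child_name}".lower()
            if isinstance(child, nn.Linear) and any(k in path for k in name_keywords):
                setattr(module, child_name, LoRALinear(child, r=r))
    return model
```

The fragility is visible: matching is by substring on module paths, so a renamed layer silently escapes injection - which is exactly why you cannot swap PEFT in without rewriting this logic.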
LoRA target modules
For models where LoRA works, the target modules follow the backbone architecture:
| Model | Target modules | r | alpha |
|---|---|---|---|
| Voxtral 4B | qkv_proj, o_proj, gate_up_proj, down_proj (from Voxtral Mini ASR) | 32 | 32 |
| Fish Speech S2 Pro | attention, mlp, embeddings, output (both Slow + Fast AR) | 8 | 16 |
| CosyVoice 3 | Qwen2 attention + MLP layers | TBD | TBD |
Qwen3-TTS and Voxtral target the standard attention projections and MLP gate/up/down projections - the same pattern as any Qwen or Mistral language model. Fish Speech targets broader module categories because its custom loralib uses string matching on module names rather than explicit layer references.
Data pipeline complexity
This is where the models diverge most dramatically. Preprocessing time ranges from minutes to over 30 minutes for the same dataset - and skipping a step does not produce an error, it produces bad output.
What the preprocessing tells you about iteration speed
If you are running multiple fine-tuning experiments (different LR, different epoch counts, different dataset subsets), preprocessing time compounds. Chatterbox lets you run 30 experiments in the time IndexTTS2 takes to preprocess once. This is why Chatterbox is the recommended model for exploration and rapid prototyping, even if your final deployment target is a larger model.
Data format summary
| Model | Input | Training format |
|---|---|---|
| Chatterbox | wav (16 kHz auto-resampled) + txt | LJSpeech format or wav+txt pairs |
| CosyVoice 3 | wav (24 kHz) + text + instruct file | Parquet (1,000 utterances per file) |
| Qwen3-TTS | wav (24 kHz) + transcript | JSONL with pre-extracted 16-layer codec codes |
| Voxtral 4B | wav (24 kHz) + transcript | TBD |
| Fish Speech S2 Pro | wav (44.1 kHz) + .lab transcript | Protobuf dataset |
| IndexTTS2 | wav (24 kHz resampled) + transcript | JSONL manifest + .npy feature files |
CosyVoice 3 uniquely requires an instruct file in addition to standard fields - this is new vs CosyVoice 2 and is a common source of configuration errors when adapting CosyVoice 2 recipes.
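Several of these pipelines consume a JSONL manifest, which stdlib tooling produces in a few lines. The field names below are illustrative only, not any model's exact schema - check each repo's data docs before generating real manifests:

```python
import json

def write_manifest(pairs, out_path):
    """pairs: iterable of (wav_path, transcript) tuples.
    Field names ("audio", "text") are illustrative; each pipeline
    defines its own required keys."""
    with open(out_path, "w", encoding="utf-8") as f:
        for wav, text in pairs:
            f.write(json.dumps({"audio": str(wav), "text": text},
                               ensure_ascii=False) + "\n")
    return out_path
```

One JSON object per line keeps the manifest streamable and easy to subset for small pilot runs.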
What we learned fine-tuning on the same corpus
We fine-tuned five of these six models on IMDA NSC FEMALE_01 - a single-speaker dataset with natural Singaporean English accent - using an RTX 3090 Ti (24GB). Voxtral is architecture-analyzed only (the model is 4 days old as of this writing).
| Model | Fine-tuning type | Steps/epochs completed | Best checkpoint | Production-ready? |
|---|---|---|---|---|
| Qwen3-TTS | LoRA | 17 epochs (2,950 steps) | Epoch 10 | Yes |
| IndexTTS2 | Full SFT | 15,949 steps | Step 14,000 | Yes |
| CosyVoice 3 | LoRA (custom scripts) | 5,800+ steps | Rerun pending | Not yet |
| Chatterbox | SFT | 512 steps (2 min 20 s) | Quality eval pending | TBD |
| Fish Speech S2 Pro | LoRA (pilot) | 64 steps | Pilot only | Not evaluated |
| Voxtral 4B | - | - | - | Fine-tuning run planned |
Three patterns that appeared across every model
1. The best checkpoint is never the last one.
VoxCPM peaked at step 4,000 (not 8,999). IndexTTS2 peaked at step 14,000 (not 15,949). Qwen3-TTS peaked at epoch 10 (not 17). TTS models overfit to training prosody quickly - the last checkpoint has the lowest training loss but not the best perceptual quality. Keep all checkpoints. Evaluate by listening, not by loss curve.
2. Sample rate mismatches are the #1 silent failure.
Qwen3-TTS requires 24 kHz. Fish Speech requires 44.1 kHz. IndexTTS2 resamples input to 24 kHz but outputs at 22 kHz. If you switch between models without checking, training appears to work (loss decreases normally) but output quality degrades. We hit this on three separate occasions before making sample rate verification the first step of every run.
3. Default hyperparameters are wrong for single-speaker fine-tuning.
Qwen3-TTS default LR (2e-5) produces noise - use 2e-6. Qwen3-TTS default LoRA scale (1.0) over-steers - use 0.3–0.35. These are not edge cases. No model's defaults are tuned for fine-tuning on a small single-speaker corpus. Always sweep before committing GPU hours.
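A small grid helper makes the sweep explicit. The default values below are seeded from the corrections above; treat anything beyond those as an assumption to adjust per model:

```python
from itertools import product

def sweep_grid(lrs=(2e-6, 5e-6, 1e-5), lora_scales=(0.3, 0.35, 0.5)):
    """Enumerate (lr, lora_scale) combos to try on a short pilot run
    before committing full GPU hours to any single configuration."""
    return [{"lr": lr, "lora_scale": s} for lr, s in product(lrs, lora_scales)]
```

Run each combo for a few hundred steps, listen to the outputs, then commit the full budget to the winner.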
Model-specific findings
Qwen3-TTS had a double label-shift bug in the upstream training script (sft_12hz.py) - the script manually shifts inputs/labels before calling the model, but ForCausalLMLoss in HuggingFace Transformers also shifts internally. This causes the model to learn predictions two positions ahead, which manifests as progressively faster speech with each training epoch. We fixed this in our companion repo and submitted upstream PR #178.
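The failure mode is easy to reproduce on paper. A toy illustration of why shifting both outside the model and inside the loss makes position t target token t+2 (this is conceptual, not the actual sft_12hz.py code):

```python
def shift_for_next_token(labels):
    """Drop the first label so position t is scored against token t+1 -
    the shift a HuggingFace causal-LM loss applies internally."""
    return labels[1:]

tokens = [100, 101, 102, 103, 104]

correct = shift_for_next_token(tokens)  # position 0 targets 101 (t+1)
buggy = shift_for_next_token(correct)   # position 0 targets 102 (t+2)
```

Because each position learns to emit a token from two steps ahead, decoded speech effectively skips forward - which matches the observed symptom of output getting faster with each epoch.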
IndexTTS2 required explicit checkpoint retention management - the default policy deletes older checkpoints before you can evaluate them. The best checkpoint (step 14,000) was not the final step (15,949). We wrote and open-sourced the entire fine-tuning pipeline since the official repo provides inference only: instavar/indextts2-finetuning.
CosyVoice 3 had upstream configuration bugs (GitHub issue #1680) - the speech_token_size in the yaml was stale from CosyVoice 2 (6564 vs 6561), causing shape mismatches during LLM training. Our first run also showed checkpoint drift and long-text instability, which we attribute to training only the LLM while keeping the flow model frozen - the LLM learns new token distributions that the frozen DiT has not seen.
Chatterbox trained in 2 minutes 20 seconds (512 samples) - the fastest of any model. But a pilot with only 64 samples produced loss=0.0, meaning no learning occurred. The minimum viable dataset is approximately 256 samples.
Fish Speech S2 Pro required careful protobuf version management - the DAC dependency needs protobuf<5, which conflicts with most modern ML environments. VQ extraction for ~7 hours of audio took 22 minutes, making it the second-slowest preprocessing pipeline after IndexTTS2.
License matrix: what you can actually deploy
Two of these six models have license restrictions that are easy to miss. Getting this wrong after building a fine-tuning pipeline is expensive.
| Model | Weights license | Commercial fine-tuned weights? | Fine-tuning code |
|---|---|---|---|
| Qwen3-TTS | Apache-2.0 | Yes | Apache-2.0 |
| IndexTTS2 | Apache-2.0 | Yes | Apache-2.0 |
| Chatterbox | MIT | Yes | Apache-2.0 |
| CosyVoice 3 | Apache-2.0 | Yes | Apache-2.0 |
| Voxtral 4B | CC BY-NC 4.0 | No - requires separate Mistral commercial license | Apache-2.0 (ours) |
| Fish Speech S2 Pro | Fish Audio Research License | No - requires separate Fish Audio commercial license | Apache-2.0 (ours) |
If you need commercial deployment, your options are: Qwen3-TTS, IndexTTS2, Chatterbox, or CosyVoice 3. All four are fully permissive.
If you are doing research or non-commercial work, all six models are available without restriction.
The distinction matters because fine-tuning code (which we release under Apache-2.0) is separate from model weights. You can legally use our fine-tuning tools on Voxtral or Fish Speech for research - you just cannot deploy the resulting weights commercially without a separate license from Mistral or Fish Audio respectively.
Last updated: March 30, 2026. This page is updated as we complete fine-tuning runs on additional models. Voxtral 4B and Fish Speech S2 Pro sections will be expanded with first-party data when available.