Qwen3-TTS LoRA Fine-Tuning - Scale Sweeps, Checkpoints, and Production Defaults

07 Feb 2026, 00:00 Z

60-second takeaway
Qwen3-TTS + LoRA worked well on this benchmark once we controlled inference scale and learning rate.
The key lesson was not just checkpoint selection but adapter strength: scale 1.0 over-steered, while 0.3 to 0.35 sounded stable.
The official default LR (2e-5) is too high - use 2e-6 for the 1.7B model.
For this run, epoch 10 plus lora_scale around 0.3 was the best operating point - but this is partly bug-dependent (see the double-shift note below).
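
To make the scale numbers above concrete, here is a minimal sketch of what an inference-time LoRA scale does to one adapted weight. It assumes the standard LoRA update, W' = W + lora_scale * (alpha / rank) * (B @ A). The function names and toy matrices are illustrative and are not taken from the Qwen3-TTS codebase or the companion repo.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base, lora_b, lora_a, alpha, rank, lora_scale):
    """Return base + lora_scale * (alpha / rank) * (B @ A).

    Assumes the standard LoRA formulation; lora_scale is the extra
    inference-time multiplier discussed in this post.
    """
    delta = matmul(lora_b, lora_a)
    factor = lora_scale * alpha / rank
    return [
        [w + factor * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


# Toy 2x2 weight with a rank-1 adapter (all values are made up).
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[2.0], [0.0]]  # shape (2, 1)
lora_a = [[1.0, 1.0]]    # shape (1, 2)

full = merge_lora(base, lora_b, lora_a, alpha=1, rank=1, lora_scale=1.0)
soft = merge_lora(base, lora_b, lora_a, alpha=1, rank=1, lora_scale=0.3)
print(full)  # adapter applied at full strength
print(soft)  # same adapter, moved only 30% of the way from the base weights
```

Under this assumption, lowering the scale shrinks every adapted weight's distance from the base model by the same factor, which is one plausible reading of why 0.3 to 0.35 sounded stable while 1.0 over-steered.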

If you searched for qwen3 tts lora, qwen3 tts finetune, qwen3-tts fine-tuning, or Qwen3-TTS VRAM requirements, this is the main guide. The sections below cover the dataset recipe, 24GB GPU settings, LoRA-vs-full-fine-tune tradeoff, deployment-time scale control, and the training-script bugs you should patch before spending GPU time.

Update (Mar 2026):
Community research surfaced two critical bugs in the official sft_12hz.py that affect training results: a missing text_projection call and a double label-shift causing progressive speech acceleration. The epoch 10 sweet spot we found is likely an artifact of the double-shift bug. See the Known Bugs section below before starting a new run.

Companion repo

All reusable LoRA tooling is published separately:

Where this fits

For founders: this is a strong candidate if you want high quality from single-GPU LoRA runs.
For engineers: this page captures exact run behavior, including where losses flattened and where inference destabilized - plus community-sourced bug fixes and configuration recommendations.

Series overview:

https://instavar.com/blog/IMDA_NSC_Voice_Cloning_Finetuning_Benchmark_2026

Not sure which model to fine-tune? See the TTS Model Decision Tree for a use-case-first comparison across all seven models we benchmarked.

Start by intent:

Dataset requirements: use 10 to 30 minutes of clean single-speaker audio, with 24 kHz codec preparation and stripped non-speech tags.
VRAM requirements: use an RTX 3090 Ti or RTX 4090 class 24GB card for comfortable LoRA sweeps; lower batch size and raise gradient accumulation for long clips.
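
The dataset requirements above can be checked before any codec preparation. The following is a rough sketch using only the Python standard library. It assumes the clips are uncompressed PCM `.wav` files in a single directory; `audit_dataset` and `summarize` are hypothetical helper names, not part of any Qwen3-TTS tooling.

```python
import wave
from pathlib import Path

TARGET_RATE_HZ = 24_000            # sample rate required before codec prep
MIN_MINUTES, MAX_MINUTES = 10, 30  # dataset size range suggested above


def audit_dataset(wav_dir):
    """Return (total_minutes, wrong_rate_files) for every .wav file in wav_dir."""
    total_seconds = 0.0
    wrong_rate = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            total_seconds += wav.getnframes() / rate
            if rate != TARGET_RATE_HZ:
                wrong_rate.append((path.name, rate))
    return total_seconds / 60, wrong_rate


def summarize(minutes, wrong_rate):
    """Turn an audit result into a list of human-readable warnings."""
    warnings = []
    if not MIN_MINUTES <= minutes <= MAX_MINUTES:
        warnings.append(
            f"total audio is {minutes:.1f} min, outside {MIN_MINUTES}-{MAX_MINUTES} min"
        )
    for name, rate in wrong_rate:
        warnings.append(f"{name} is {rate} Hz, resample to {TARGET_RATE_HZ} Hz")
    return warnings


# Usage (the path is a placeholder):
#   minutes, wrong_rate = audit_dataset("data/speaker_a")
#   for line in summarize(minutes, wrong_rate):
#       print(line)
```

The check only covers duration and sample rate. It does not detect noise, multiple speakers, or leftover non-speech tags in transcripts.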

Parameter	Our run	Community consensus (1.7B)
Learning rate	Not specified (used default)	2e-6
Batch size	Not specified	2
Gradient accumulation	Not specified	1-4
Epochs	10 (best)	3-5 (with bug fix applied)
Precision	bfloat16	bfloat16
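
One way to keep the community values from the table in a single place is a small config object. This is a sketch only: `SftConfig` and its field names are hypothetical and do not correspond to the actual flags of sft_12hz.py. The `for_long_clips` helper illustrates the "lower batch size, raise gradient accumulation" advice by keeping the effective batch size constant.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SftConfig:
    """Hypothetical container for the community-consensus values (1.7B model)."""

    learning_rate: float = 2e-6   # official default is 2e-5, reported as too high
    batch_size: int = 2
    grad_accum_steps: int = 4     # community range is 1-4
    epochs: int = 3               # community range is 3-5 with the bug fix applied
    precision: str = "bfloat16"

    @property
    def effective_batch_size(self):
        """Samples contributing to each optimizer step."""
        return self.batch_size * self.grad_accum_steps

    def for_long_clips(self):
        """Halve the per-step batch and raise accumulation to compensate."""
        smaller_batch = max(1, self.batch_size // 2)
        # Ceiling division so the effective batch never shrinks.
        accum = -(-self.effective_batch_size // smaller_batch)
        return replace(self, batch_size=smaller_batch, grad_accum_steps=accum)


default = SftConfig()
long_clips = default.for_long_clips()
print(default.effective_batch_size, long_clips.batch_size, long_clips.grad_accum_steps)
```

Halving the batch trades throughput for peak memory. The optimizer still sees the same number of samples per step, so the learning rate does not need to change for that reason alone.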

Version	Pre-SFT loss	Post-SFT loss	Output
Buggy (double-shift)	22	13	"Very fast speech"
Fixed	8.3	7.8	"Correct speech"
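
The following toy example illustrates why a double label-shift would produce faster speech. It assumes the bug has the common shape where the training script shifts labels by one position and the loss function then shifts them again. The function name and token strings are illustrative; this is not the code from sft_12hz.py or from the fix PR.

```python
def next_token_pairs(tokens, labels):
    """Mimic a causal-LM loss that shifts internally.

    The prediction at position t is scored against labels[t + 1].
    """
    return list(zip(tokens[:-1], labels[1:]))


codec_frames = ["f0", "f1", "f2", "f3", "f4"]

# Correct: pass labels unshifted and let the loss apply the single shift.
correct = next_token_pairs(codec_frames, codec_frames)

# Buggy: the script pre-shifts the labels, then the loss shifts them again.
pre_shifted = codec_frames[1:] + ["<pad>"]
buggy = next_token_pairs(codec_frames, pre_shifted)

print(correct)  # each frame is paired with the frame right after it
print(buggy)    # each frame is paired with the frame two steps ahead
```

A model trained on the buggy pairs is rewarded for skipping a frame at every step. That would be consistent with the "very fast speech" outcome and the higher loss values in the buggy row.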

Failure Mode	Most Likely Cause	Fix
Pure noise output	LR too high (default 2e-5)	Lower LR to 2e-6
Infinite generation (no EOS)	LR too high - model forgets EOS token	Lower LR; set eos_token_id explicitly
Speech gets faster each epoch	Double label-shift bug in sft_12hz.py	Apply PR #178 fix
Monotone output, emotion loss	Fine-tuning flattens expressiveness	Use ICL voice cloning instead
Training crash mid-run	Audio not at 24 kHz	Resample to 24 kHz before codec prep
Progressive timbre shift between chunks	Independent sampling per chunk	Fixed seed + repetition_penalty
Audio truncation at ~24-second mark	chunked_decode over-trims samples	Apply PR #259 fix (unmerged)
First audio token distorted	Cold-start decoder, no warm-up context	Prepend silence tokens; trim output
Accent not transferred	Pre-trained accent bias is very strong	Expect this; plan for phoneme-level approach
Speaker lost after fine-tuning	Training overwrites speaker embedding slots	Apply PR #232, which preserves them
VRAM OOM on long training clips	Speaker encoder batch size	Reduce batch size on speaker encoder path
Memory leak on repeated inference	Known leak in multi-generation loops	Call `torch.cuda.empty_cache()` periodically
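
For the last row of the table, a minimal sketch of periodic cache clearing in a multi-generation loop. The `synthesize` callable stands in for whatever inference function you use and is an assumption, as is the cleanup interval. The PyTorch import is optional so the sketch also runs on machines without it.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run without PyTorch installed
    torch = None


def release_cached_memory():
    """Collect garbage, then return unused cached CUDA blocks to the driver."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def synthesize_all(texts, synthesize, cleanup_every=8):
    """Call synthesize(text) for each text, cleaning up every cleanup_every calls."""
    outputs = []
    for index, text in enumerate(texts, start=1):
        outputs.append(synthesize(text))
        if index % cleanup_every == 0:
            release_cached_memory()
    return outputs
```

`torch.cuda.empty_cache()` only releases cached memory that no tensor references any more. If memory still grows with this in place, something is likely holding on to tensors between generations, and the cleanup call will not fix that by itself.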

Companion repo

Where this fits

Experiment setup

Dataset preparation

Sample rate: 24 kHz mandatory before codec generation

Dataset size: 10-30 minutes of clean audio

Sample-level tricks

Learning rate: the most important configuration change

Known bugs in sft_12hz.py (apply fixes before training)

Bug 1 - Missing text_projection call

Bug 2 - Double label-shift causing progressive speech acceleration

Best checkpoint logic (original run - pre-fix)

Audio evidence

Recommended sample from this run

Recommended inference settings

LoRA scale

Generation parameters

Attention backend

Multi-chunk voice consistency

Cold-start decoder distortion

Failure modes: complete taxonomy

Practical support FAQ

Preset voice or LoRA?

How much data is enough?

What does faster-qwen3-tts change?

What do streaming forks solve?

Why does generation loop or turn into noise?

When are emotion tags reliable?

What VRAM should I expect?

Community tooling worth knowing

Engineer appendix

Key paths from this run

Distribution note

Before starting your next run: checklist

Related deep dives
