Qwen3-TTS LoRA Fine-Tuning - Scale Sweeps, Checkpoints, and Production Defaults


07 Feb 2026, 00:00 Z

60-second takeaway
Qwen3-TTS + LoRA worked well on this benchmark once we controlled inference scale and learning rate.
The key lesson was not just checkpoint selection but adapter strength: scale 1.0 over-steered, while 0.3 to 0.35 sounded stable.
The official default LR (2e-5) is too high - use 2e-6 for the 1.7B model.
For this run, epoch 10 plus lora_scale around 0.3 was the best operating point - but this is partly bug-dependent (see the double-shift note below).
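To make the adapter-strength point concrete, here is a minimal sketch of what an inference-time scale does to one weight matrix. It assumes lora_scale is a plain multiplier on the standard LoRA update (W + scale * alpha/rank * B·A); the companion tooling may apply it differently. The function names and toy numbers are illustrative only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base, lora_a, lora_b, alpha, rank, lora_scale):
    """Return base + lora_scale * (alpha / rank) * (B @ A).

    base:   d_out x d_in frozen weight
    lora_a: rank x d_in down-projection (A)
    lora_b: d_out x rank up-projection (B)
    """
    delta = matmul(lora_b, lora_a)
    factor = lora_scale * alpha / rank
    return [
        [w + factor * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (hypothetical values, not from the run).
base = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # 1 x 2
lora_b = [[0.5], [-0.5]]   # 2 x 1

full = merge_lora(base, lora_a, lora_b, alpha=1, rank=1, lora_scale=1.0)
soft = merge_lora(base, lora_a, lora_b, alpha=1, rank=1, lora_scale=0.3)
```

With lora_scale=0.3 the merged weights move only 30% of the way from the base model toward the fully adapted model, which is the sense in which a lower scale "steers" less.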

This is the main guide for Qwen3-TTS LoRA fine-tuning and its VRAM requirements. The sections below cover the dataset recipe, 24GB GPU settings, the LoRA-vs-full-fine-tune tradeoff, deployment-time scale control, and the training-script bugs you should patch before spending GPU time.

Update (Mar 2026):
Community research surfaced two critical bugs in the official sft_12hz.py that affect training results: a missing text_projection call and a double label-shift causing progressive speech acceleration. The epoch 10 sweet spot we found is likely an artifact of the double-shift bug. See the Known Bugs section below before starting a new run.
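For readers unfamiliar with the term, the following is a generic illustration of a double label shift. It is not the code from sft_12hz.py: the function name and the stand-in tokens are hypothetical, and it assumes a causal-LM loss that already shifts labels by one position internally.

```python
def next_token_pairs(inputs, labels):
    """Mimic a causal-LM loss that shifts internally:
    the prediction at position t is scored against labels[t + 1]."""
    return list(zip(inputs[:-1], labels[1:]))


codes = ["c0", "c1", "c2", "c3", "c4"]  # stand-in codec tokens

# Intended: labels aligned with inputs, so the loss applies the single shift.
single = next_token_pairs(codes, codes)

# Double shift: labels were already shifted once upstream, then shifted again.
pre_shifted = codes[1:] + ["<pad>"]
double = next_token_pairs(codes, pre_shifted)

print(single)  # each input is paired with the token one step ahead
print(double)  # each input is paired with the token two steps ahead
```

In the double-shifted case every position is trained against the token two steps ahead instead of one. A model that learns to skip ahead like this would be consistent with the acceleration described above, though the exact mechanism in the official script should be checked against the bug reports.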

Companion repo

All reusable LoRA tooling is published separately:

Where this fits

  • For founders: this is a strong candidate if you want high quality from single-GPU LoRA runs.
  • For engineers: this page captures exact run behavior, including where losses flattened and where inference destabilized - plus community-sourced bug fixes and configuration recommendations.

Series overview:

Not sure which model to fine-tune? See the TTS Model Decision Tree for a use-case-first comparison across all seven models we benchmarked.

Start by intent:

  • Dataset requirements: use 10 to 30 minutes of clean single-speaker audio, with 24 kHz codec preparation and stripped non-speech tags.
  • VRAM requirements: use an RTX 3090 Ti or RTX 4090 class 24GB card for comfortable LoRA sweeps; lower batch size and raise gradient accumulation for long clips.
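Both bullets above can be sanity-checked before a run. The sketch below assumes the clips are already resampled to 24 kHz PCM WAV files, and that "lower batch size and raise gradient accumulation" means holding the effective batch size constant. The helper names are hypothetical and not part of any Qwen3-TTS tooling.

```python
import math
import wave


def dataset_minutes(wav_paths, expected_rate=24_000):
    """Sum clip durations in minutes; fail on an unexpected sample rate."""
    total_seconds = 0.0
    for path in wav_paths:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            if rate != expected_rate:
                raise ValueError(f"{path}: {rate} Hz, expected {expected_rate} Hz")
            total_seconds += wav.getnframes() / rate
    return total_seconds / 60.0


def accumulation_steps(effective_batch, micro_batch):
    """Gradient-accumulation steps needed to keep the effective batch size."""
    return math.ceil(effective_batch / micro_batch)
```

For example, if long clips force the per-step batch down from 4 to 2 and the target effective batch is 16 (an illustrative number, not one from this run), accumulation_steps(16, 2) gives 8 steps instead of 4. A dataset_minutes result between 10 and 30 matches the range recommended above.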
