Qwen3-TTS LoRA Fine-Tuning - Scale Sweeps, Checkpoints, and Production Defaults

07 Feb 2026, 00:00 Z

60-second takeaway
Qwen3-TTS + LoRA worked well on this benchmark once we controlled inference scale and learning rate.
The key lesson was not just checkpoint selection but adapter strength: scale 1.0 over-steered, while 0.3 to 0.35 sounded stable.
The official default LR (2e-5) is too high - use 2e-6 for the 1.7B model.
For this run, epoch 10 plus lora_scale around 0.3 was the best operating point - but this is partly bug-dependent (see the double-shift note below).
Update (Mar 2026):
Community research surfaced two critical bugs in the official sft_12hz.py that affect training results: a missing text_projection call and a double label-shift causing progressive speech acceleration. The epoch 10 sweet spot we found is likely an artifact of the double-shift bug. See the Known Bugs section below before starting a new run.
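
To make the adapter-strength point concrete, here is a minimal sketch of how an inference-time scale attenuates a LoRA update under the standard formulation W = W0 + scale * (alpha / rank) * (B @ A). The function name apply_lora, the toy matrices, and the assumption that lora_scale multiplies the update in exactly this way are illustrative, not taken from the Qwen3-TTS code or the companion repo.

```python
# Minimal sketch: how an inference-time scale attenuates a LoRA update.
# apply_lora, the toy matrices, and the exact role of lora_scale are
# illustrative assumptions, not the Qwen3-TTS or companion-repo API.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(w0, lora_b, lora_a, alpha, rank, lora_scale):
    """Return W0 + lora_scale * (alpha / rank) * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    factor = lora_scale * alpha / rank
    return [
        [w + factor * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]

# Toy 2x2 base weight with a rank-1 adapter.
w0 = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (2, 1)
lora_a = [[0.5, -0.5]]    # shape (1, 2)

# Sweep the same scales discussed in the takeaway.
for scale in (0.0, 0.3, 0.35, 1.0):
    merged = apply_lora(w0, lora_b, lora_a, alpha=1.0, rank=1, lora_scale=scale)
    print(scale, merged)
```

In this formulation the adapter's contribution grows linearly with the scale, so 0.3 applies roughly a third of the learned update and 1.0 applies all of it. That is consistent with the over-steering reported at 1.0, though the sketch says nothing about why 0.3 to 0.35 sounded best on this dataset.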

Companion repo

All reusable LoRA tooling is published separately in the companion repo.

Where this fits

  • For founders: this is a strong candidate if you want high quality from single-GPU LoRA runs.
  • For engineers: this page captures exact run behavior, including where losses flattened and where inference destabilized - plus community-sourced bug fixes and configuration recommendations.

Not sure which model to fine-tune? See the TTS Model Decision Tree for a use-case-first comparison across all seven models we benchmarked.

Experiment setup

  • Model: Qwen3-TTS 1.7B Base + LoRA
  • Dataset: IMDA NSC FEMALE_01_44k, JSONL + codec prep pipeline
  • Split: train/val/test = 90/5/5
  • Hardware: RTX 3090 Ti 24 GB
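
The 90/5/5 split can be reproduced with a seeded shuffle over the JSONL records. This is a sketch under assumptions: the post does not say how the split was drawn, so the function name split_jsonl_lines, the seed, and the purely random assignment are hypothetical.

```python
import random

def split_jsonl_lines(lines, seed=0, val_frac=0.05, test_frac=0.05):
    """Shuffle once with a fixed seed, then cut into train/val/test."""
    items = list(lines)
    random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_frac)
    n_test = round(len(items) * test_frac)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test

# Stand-in records; a real run would read one JSON object per line from disk.
records = ['{"id": %d}' % i for i in range(200)]
train, val, test = split_jsonl_lines(records)
print(len(train), len(val), len(test))  # 180 10 10
```

Fixing the seed keeps the validation and test sets identical across reruns, which matters when comparing checkpoints or scale sweeps against each other.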

Dataset preparation

Sample rate: 24 kHz mandatory before codec generation

The codec pipeline fails with "AssertionError: Only support 24kHz audio" when it encounters 16 kHz, 44.1 kHz, or 48 kHz input. The crash fires deep in training with no early warning - you may lose several hours of a run before seeing it. Resample everything to 24 kHz yourself before any codec prep step. (PR
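
One way to catch the sample-rate problem before a run is a pre-flight scan of the dataset folder. The sketch below uses only the Python standard library; the helper names find_wrong_rate_wavs and ffmpeg_resample_cmd are hypothetical, and the scan assumes plain PCM WAV files, which is all the wave module can read. The resample command relies on ffmpeg's -ar option to set the output sample rate.

```python
import wave
from pathlib import Path

TARGET_SR = 24000  # rate the codec pipeline expects, per the assertion message

def find_wrong_rate_wavs(root, target_sr=TARGET_SR):
    """Return (path, sample_rate) for every .wav under root not at target_sr."""
    bad = []
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            sr = wf.getframerate()
        if sr != target_sr:
            bad.append((path, sr))
    return bad

def ffmpeg_resample_cmd(src, dst, target_sr=TARGET_SR):
    """Build an ffmpeg command line that resamples src to target_sr."""
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(target_sr), str(dst)]
```

Running find_wrong_rate_wavs on the dataset root and refusing to start codec prep while the returned list is non-empty turns a failure hours into training into one that takes seconds. Each command from ffmpeg_resample_cmd can be passed to subprocess.run to produce the 24 kHz copy.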
