60-second takeaway
Qwen3-TTS + LoRA worked well on this benchmark once we controlled inference scale and learning rate. The key lesson was not just checkpoint selection but adapter strength: scale 1.0 over-steered, while 0.3 to 0.35 sounded stable. The official default LR (2e-5) is too high - use 2e-6 for the 1.7B model. For this run, epoch 10 plus lora_scale around 0.3 was the best operating point - but this is partly bug-dependent (see the double-shift note below).
If you searched for qwen3 tts lora, qwen3 tts finetune, qwen3-tts fine-tuning, or Qwen3-TTS VRAM requirements, this is the main guide. The sections below cover the dataset recipe, 24GB GPU settings, LoRA-vs-full-fine-tune tradeoff, deployment-time scale control, and the training-script bugs you should patch before spending GPU time.
Update (Mar 2026): Community research surfaced two critical bugs in the official sft_12hz.py that affect training results: a missing text_projection call and a double label-shift causing progressive speech acceleration. The epoch 10 sweet spot we found is likely an artifact of the double-shift bug. See the Known Bugs section below before starting a new run.
Companion repo
All reusable LoRA tooling is published separately:
For founders: this is a strong candidate if you want high quality from single-GPU LoRA runs.
For engineers: this page captures exact run behavior, including where losses flattened and where inference destabilized - plus community-sourced bug fixes and configuration recommendations.
Not sure which model to fine-tune? See the TTS Model Decision Tree for a use-case-first comparison across all seven models we benchmarked.
Start by intent:
Dataset requirements: use 10 to 30 minutes of clean single-speaker audio, with 24 kHz codec preparation and stripped non-speech tags.
VRAM requirements: use an RTX 3090 Ti or RTX 4090 class 24GB card for comfortable LoRA sweeps; lower batch size and raise gradient accumulation for long clips.
LoRA vs full fine-tune: use LoRA first when you need adapter control and lower storage churn; reserve full SFT for models where the recipe is known stable.
Deployment and inference: treat lora_scale as a deployment knob. Test 0.2, 0.3, 0.35, and 0.5 before promoting a voice.
Troubleshooting: check the missing text_projection call and double label-shift before interpreting any bad audio sample as a model-quality result.
Sample rate: 24 kHz mandatory before codec generation
The codec pipeline fails with "AssertionError: Only support 24kHz audio" when it encounters 16kHz, 44.1kHz, or 48kHz input. The crash fires deep in training with no early warning, so you may lose several hours of a run before seeing it. Resample to 24kHz manually before any codec prep step. (PR #233 proposes auto-resampling but is unmerged as of March 2026.)
# Resample all audio to 24kHz before codec prep
ffmpeg -i input.wav -ar 24000 output_24k.wav
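Because the assertion only fires once training reaches the offending clip, it can help to scan the dataset up front. The sketch below is one way to do that with Python's standard wave module; the function name and directory layout are illustrative assumptions, and it only reads PCM .wav files, so other formats would need a different reader.

```python
import wave
from pathlib import Path

TARGET_SAMPLE_RATE = 24_000  # the rate the codec pipeline expects, per the text above


def find_wrong_sample_rate(audio_dir, target=TARGET_SAMPLE_RATE):
    """Return (path, sample_rate) for every .wav under audio_dir that is not at `target` Hz."""
    offenders = []
    for path in sorted(Path(audio_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != target:
            offenders.append((str(path), rate))
    return offenders
```

An empty result means every PCM WAV file found is already at 24kHz; anything listed should go through the ffmpeg step before codec prep.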
Dataset size: 10-30 minutes of clean audio
Community benchmarks (GitHub issue #39) confirm working results with as few as 9 minutes / 50 samples. Quality matters more than quantity:
Strip annotation tags from transcriptions. Datasets containing tags like <laugh> confuse the model and reduce speaker similarity. Strip all non-speech annotations before generating the JSONL manifest.
Add 1 second of silence to the end of each training clip. Community reports (#39, user dariox1337) found this prevents overly fast speech and reduces noise hiccups at the end of generations.
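Both clean-up steps above are easy to script. The following is a minimal sketch, assuming annotations use angle-bracket syntax like <laugh> and that clips are PCM WAV files; the helper names are hypothetical and not part of the Qwen3-TTS tooling.

```python
import re
import wave

# Matches angle-bracket annotations such as <laugh>. The exact tag syntax in
# your dataset is an assumption, so adjust the pattern to match it.
TAG_PATTERN = re.compile(r"<[^<>]+>")


def strip_annotation_tags(text):
    """Remove <tag>-style annotations and collapse the leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def append_silence(src_path, dst_path, seconds=1.0):
    """Copy a PCM WAV file and append `seconds` of digital silence."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    # 8-bit PCM is unsigned (silence = 0x80); wider formats are signed (silence = 0x00).
    silence_byte = b"\x80" if params.sampwidth == 1 else b"\x00"
    bytes_per_frame = params.sampwidth * params.nchannels
    silence = silence_byte * (int(params.framerate * seconds) * bytes_per_frame)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames + silence)  # header frame count is fixed up on close
```

Run the tag stripper over each transcription before writing the JSONL manifest, and pad the audio before codec preparation so the codes include the trailing silence.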
Learning rate: the most important configuration change
The official default LR of 2e-5 is widely reported as too high for the 1.7B model. It is the single most common cause of noisy output, EOS token failures (infinite generation), and apparent training divergence despite decreasing loss.
Community-validated settings for the 1.7B model (#39):
| Parameter | Our run | Community consensus (1.7B) |
| --- | --- | --- |
| Learning rate | Not specified (used default) | 2e-6 |
| Batch size | Not specified | 2 |
| Gradient accumulation | Not specified | 1-4 |
| Epochs | 10 (best) | 3-5 (with bug fix applied) |
| Precision | bfloat16 | bfloat16 |
For 24GB VRAM on an RTX 3090 Ti, the practical gradient accumulation options are:
Do not set batch_size=32 without gradient accumulation - it will OOM even on larger GPUs.
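To make the batch-size and accumulation trade-off concrete, here is a small sketch that records the community settings and derives the effective batch size. SftConfig and its field names are hypothetical and do not correspond to actual sft_12hz.py flags; only the numeric values come from the settings listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SftConfig:
    """Hypothetical container; field names are illustrative, not sft_12hz.py flags."""
    learning_rate: float = 2e-6   # community value for the 1.7B model (not the 2e-5 default)
    batch_size: int = 2           # per-device micro-batch
    grad_accum_steps: int = 4     # upper end of the 1-4 range reported above
    num_gpus: int = 1
    precision: str = "bfloat16"

    @property
    def effective_batch_size(self):
        """Samples contributing to each optimizer step."""
        return self.batch_size * self.grad_accum_steps * self.num_gpus


def optimizer_steps_per_epoch(num_samples, config):
    """Optimizer updates per epoch, counting a final partial accumulation window."""
    return -(-num_samples // config.effective_batch_size)  # ceiling division
```

With the defaults above the effective batch size is 8, so a 50-sample dataset yields 7 optimizer steps per epoch. Raising grad_accum_steps increases the effective batch without increasing per-step memory, which is why it is the usual lever on a 24GB card.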
Known bugs in sft_12hz.py (apply fixes before training)
Two confirmed bugs in the official training script affect output quality. Apply both fixes before starting a new run.
Bug 1 - Missing text_projection call
Line 93 of sft_12hz.py passes raw text embeddings to the codec path without going through model.talker.text_projection(...). For the 0.6B model this causes a hard crash (embedding dimension mismatch). For the 1.7B model the dimensions happen to match, so training proceeds silently with incorrect embeddings. PR #188 documents the fix; commit 680d4e9 applied it in the official repo - verify your local copy is at or past this commit.
Bug 2 - Double label-shift
sft_12hz.py manually shifts labels on lines 114-125, then passes them to HuggingFace's ForCausalLMLoss which performs an internal shift again. This double-shift creates a training-inference mismatch. With each successive epoch, generated speech becomes progressively faster until it is unintelligible.
Measured impact (GitHub issue #179, user fumyou13, 8-GPU setup):
| Version | Pre-SFT loss | Post-SFT loss | Output |
| --- | --- | --- | --- |
| Buggy (double-shift) | 22 | 13 | "Very fast speech" |
| Fixed | 8.3 | 7.8 | "Correct speech" |
This is likely why epoch 10 appeared optimal in our run: the model sounded best before the speed acceleration became audible. With the fix applied, speaker similarity continues improving through 20+ epochs rather than degrading after epoch 13.
PR #178 provides the code diff for this fix. The sub-talker loss also needs F.cross_entropy() replacing the HF loss_function to avoid the second shift.
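The double label-shift is easier to see on a toy sequence. The sketch below is not the code from sft_12hz.py; it assumes the manual shift moves labels one step left and uses made-up token ids, purely to show what happens when a causal-LM style loss shifts already-shifted labels.

```python
IGNORE = -100  # label value that HuggingFace-style losses skip


def causal_pairs(labels):
    """Mimic the shift inside a causal-LM loss: position t is scored against labels[t + 1]."""
    return [(t, labels[t + 1]) for t in range(len(labels) - 1)]


def manual_shift(labels):
    """Toy stand-in for a manual pre-shift: move every label one step left."""
    return labels[1:] + [IGNORE]


tokens = [10, 11, 12, 13, 14]  # made-up codec token ids

single = causal_pairs(tokens)                # one shift: the intended behaviour
double = causal_pairs(manual_shift(tokens))  # manual shift, then the loss shifts again

print(single)  # [(0, 11), (1, 12), (2, 13), (3, 14)]
print(double)  # [(0, 12), (1, 13), (2, 14), (3, -100)]
```

With a single shift, position t is trained to predict the next token. With the double shift, position t is trained to predict the token two steps ahead, so the model learns to skip a frame, which is at least consistent with the progressively faster speech reported above.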
Best checkpoint logic (original run - pre-fix)
Validation improved early and flattened around epochs 8 to 12.
Validation started rising after epoch 13 in our continued run.
Best checkpoint by validation trend in this run: epoch 10.
Caveat: this pattern is characteristic of the double-shift bug. With the fix applied, the optimal epoch is likely higher.
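The selection logic above (take the validation minimum, stop once the curve has clearly turned upward) can be sketched as a small helper. The function name and the patience value are illustrative assumptions, not part of the original run.

```python
def best_epoch(val_losses, patience=3):
    """Return (epoch, loss) at the validation minimum, stopping once the loss has
    failed to improve for `patience` consecutive epochs.

    val_losses maps 1-based epoch number -> validation loss.
    Returns None if val_losses is empty.
    """
    best = None
    stale = 0
    for epoch in sorted(val_losses):
        loss = val_losses[epoch]
        if best is None or loss < best[1]:
            best = (epoch, loss)
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best
```

Given the caveat above, treat the returned epoch as a shortlist entry for listening tests rather than a final answer, and keep neighbouring checkpoints until the scale sweep is done.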
Audio evidence
Recommended sample from this run
Settings: epoch 10 adapter, scale 0.35.
Recommended inference settings
LoRA scale
The lora_scale parameter controls how strongly the trained adapter blends into the base model at inference. Scale 1.0 consistently over-steers - the voice sounds forced and unnatural.
Default safe range: 0.25 to 0.35
Recommended starting point: 0.3
Scale sweep protocol: generate 5 samples at 0.2, 0.3, 0.35, and 0.5 before committing to a checkpoint. Do not evaluate only scale 1.0.
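The sweep protocol above can be written down as a job list before any audio is generated. This is a minimal sketch: build_sweep and the job dictionary keys are hypothetical, and the actual generation call depends on your inference script.

```python
from itertools import product

SCALES = (0.2, 0.3, 0.35, 0.5)   # sweep values from the protocol above
SAMPLES_PER_SCALE = 5


def build_sweep(checkpoint, scales=SCALES, samples=SAMPLES_PER_SCALE, base_seed=42):
    """Return one job dict per (scale, sample) pair.

    The same seeds are reused at every scale so that listening differences are
    more likely to come from lora_scale than from sampling noise.
    """
    seeds = [base_seed + i for i in range(samples)]
    return [
        {"checkpoint": checkpoint, "lora_scale": scale, "seed": seed}
        for scale, seed in product(scales, seeds)
    ]
```

For one checkpoint this yields 20 jobs (4 scales x 5 seeds). Using the same text prompts for every job keeps the comparison fair.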
Generation parameters
Community-validated values for natural-sounding output:
temperature = 0.8 # Default 0.9; lower = more stable, less expressive
top_p = 0.85 # Default 1.0
top_k = 30 # Default 50; tighter sampling reduces voice drift
repetition_penalty = 1.05 # Critical for multi-chunk consistency
seed = 42 # Fix seed per generation for reproducible A/B comparisons
max_new_tokens = 2048 # Set explicitly - EOS bug affects ~0.5% of base model inferences
For the EOS failure case (infinite generation), set:
attn_implementation = "sdpa" # Stable on all consumer GPUs
# flash_attention_2 is faster but requires dtype=torch.bfloat16 in from_pretrained()
Multi-chunk voice consistency
When generating long texts in chunks:
Fix the random seed before each chunk to prevent voice timbre shifting across boundaries.
Extract the speaker embedding once from reference audio and reuse it across all chunks - per-chunk re-encoding introduces variability.
Use larger chunks (fewer boundaries) where possible.
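The three rules above fit into a short loop. In this sketch, synth_fn and seed_fn are hypothetical hooks for your own inference call and RNG seeding (for example torch.manual_seed), and the 400-character default is an arbitrary assumption, not a tested value.

```python
import re


def split_into_chunks(text, max_chars=400):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long_text(text, speaker_embedding, synth_fn, seed_fn, seed=42, max_chars=400):
    """Generate audio chunk by chunk with a fixed seed and one shared speaker embedding.

    synth_fn(chunk, speaker_embedding) and seed_fn(seed) are hypothetical hooks.
    """
    audio_parts = []
    for chunk in split_into_chunks(text, max_chars):
        seed_fn(seed)  # re-seed before every chunk, as recommended above
        audio_parts.append(synth_fn(chunk, speaker_embedding))
    return audio_parts
```

The speaker embedding is computed once by the caller and passed in, so no chunk triggers a fresh encoding of the reference audio.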
Cold-start decoder distortion
On the first inference call in a new process, the speech tokenizer decoder has no warm-up context. This causes the first few audio frames to reconstruct incorrectly (documented as issue #219). The community workaround is to prepend silence codec tokens as warm-up context, then trim the corresponding samples from output. Affects short phrases starting with numerals most visibly.
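The warm-up workaround described above amounts to padding on the way in and trimming on the way out. The sketch below shows the bookkeeping only: decode_fn is a hypothetical hook for the speech tokenizer decoder, and silence_code, samples_per_frame, and warmup_frames are assumptions you would need to look up or tune for your tokenizer.

```python
def decode_with_warmup(codes, decode_fn, silence_code, samples_per_frame, warmup_frames=8):
    """Decode codec frames with silent warm-up frames prepended, then trim them off.

    decode_fn(frames) must return a sliceable sequence of audio samples with
    samples_per_frame samples for each input frame.
    """
    padded = [silence_code] * warmup_frames + list(codes)
    audio = decode_fn(padded)
    trim = warmup_frames * samples_per_frame  # samples produced by the warm-up frames
    return audio[trim:]
```

If the decoder does not emit exactly samples_per_frame samples per frame, the trim length needs to be measured rather than computed.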
These are the questions that come up after the main LoRA recipe works. They are intentionally compact because this page should stay focused on the winning Qwen3-TTS LoRA path rather than becoming a broad model-comparison article.
Preset voice or LoRA?
Use preset or in-context voice cloning when you are still testing product fit, speaker consent, or short-form prompts. Move to LoRA when you need the same voice across many utterances, need repeatable deployment settings, or need to tune adapter strength with lora_scale. If your main problem is choosing between Qwen3-TTS, F5-TTS, CosyVoice, or VoxCPM, use the TTS model decision tree first.
How much data is enough?
For this LoRA path, 10 to 30 minutes of clean single-speaker audio is the practical target. Five minutes can expose whether the pipeline works, but it is a weak basis for production judgement. Transcript quality matters as much as duration: strip tags, remove non-speech rows, resample to 24 kHz, and keep a held-out listening script.
What does faster-qwen3-tts change?
faster-qwen3-tts is useful when the model choice is already made and throughput is the bottleneck. It does not fix a bad LoRA, noisy dataset, wrong lora_scale, missing EOS cap, or cold-start decoder distortion. Benchmark quality and failure modes first, then use faster runners to reduce serving cost.
What do streaming forks solve?
Streaming forks help with user-perceived latency by emitting audio earlier. They do not automatically make every deployment voice-agent-ready. You still need to measure first-audio latency, real-time factor, chunk boundary stability, and memory behavior under repeated requests.
Why does generation loop or turn into noise?
The most common causes are the official learning rate being too high, EOS not being capped, the double label-shift bug, and scale 1.0 over-steering at inference. Before changing models, verify LR 2e-6, explicit eos_token_id, max token caps, the sft_12hz.py fixes, and a scale sweep around 0.25 to 0.35.
When are emotion tags reliable?
Treat emotion tags as prompt controls that need per-speaker listening tests. They are more reliable for broad direction than for exact performance. If the speaker identity is fragile, emotion tags can make cloning similarity worse. Lock a neutral voice first, then test emotion on short, medium, and multi-chunk passages.
What VRAM should I expect?
The validated Instavar run used a 24GB RTX 3090 Ti. Smaller GPUs may work with lower batch size and more gradient accumulation, but this page should not be read as proof that every Qwen3-TTS LoRA workflow fits 8GB or 12GB. For hardware-first routing, see Voice Cloning on a 24GB GPU.
Community tooling worth knowing
rekuenkdr/ComfyUI-Qwen3-TTS - training fork with the LR fix, gradient accumulation, resume support, and audio caching.
The companion repo approach keeps upstream Qwen3-TTS clean while allowing pinned, reproducible patches and scripts.
Before starting your next run: checklist
□ Verify sft_12hz.py is at or past commit 680d4e9 (text_projection fix)
□ Apply PR #178 double label-shift fix if not yet merged
□ Resample all audio to 24kHz before codec prep
□ Strip annotation tags from transcriptions
□ Add 1s silence to end of each training clip
□ Set LR = 2e-6 (not the default 2e-5)
□ Set max_new_tokens explicitly in inference scripts
□ Keep all checkpoints until after listening eval sweep