60-second takeaway Qwen3-TTS + LoRA worked well on this benchmark once we controlled inference scale and learning rate. The key lesson was not just checkpoint selection but adapter strength: scale 1.0 over-steered, while 0.3 to 0.35 sounded stable. The official default LR (2e-5) is too high - use 2e-6 for the 1.7B model. For this run, epoch 10 plus lora_scale around 0.3 was the best operating point - but this is partly bug-dependent (see the double-shift note below).
Update (Mar 2026): Community research surfaced two critical bugs in the official sft_12hz.py that affect training results: a missing text_projection call and a double label-shift causing progressive speech acceleration. The epoch 10 sweet spot we found is likely an artifact of the double-shift bug. See the Known Bugs section below before starting a new run.
Companion repo
All reusable LoRA tooling is published separately in a companion repo.
For founders: this is a strong candidate if you want high quality from single-GPU LoRA runs.
For engineers: this page captures exact run behavior, including where losses flattened and where inference destabilized - plus community-sourced bug fixes and configuration recommendations.
Sample rate: 24 kHz mandatory before codec generation
The codec pipeline raises AssertionError: Only support 24kHz audio when it encounters 16kHz, 44.1kHz, or 48kHz input. This crash fires deep in training with no early warning - you may lose several hours of a run before seeing it. Resample to 24kHz manually before any codec prep step. (PR
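The resample step can be sketched with plain linear interpolation so the length math is explicit. This is illustrative only - for production data prep use a proper bandlimited resampler (e.g. soxr or librosa) - but the sample-count check is the same either way:

```python
import numpy as np

TARGET_SR = 24_000

def resample_to_24k(audio: np.ndarray, sr: int) -> np.ndarray:
    """Linear-interpolation resample to 24 kHz (sketch; use soxr/librosa in practice)."""
    if sr == TARGET_SR:
        return audio
    n_out = int(round(len(audio) * TARGET_SR / sr))
    # Map each output sample position back onto the input time axis.
    x_out = np.linspace(0.0, len(audio) - 1, n_out)
    return np.interp(x_out, np.arange(len(audio)), audio).astype(np.float32)

# One second of 16 kHz audio becomes exactly 24,000 samples.
clip = np.random.default_rng(0).standard_normal(16_000).astype(np.float32)
print(len(resample_to_24k(clip, 16_000)))  # 24000
```

Running this over the whole dataset before codec prep guarantees the 24kHz assertion can never fire mid-run.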
Strip annotation tags from transcriptions. Datasets containing tags like <laugh> confuse the model and reduce speaker similarity. Strip all non-speech annotations before generating the JSONL manifest.
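A minimal tag-stripping pass, assuming annotations always use angle-bracket syntax like <laugh> (adjust the pattern if your dataset uses a different convention):

```python
import re

TAG_RE = re.compile(r"<[^<>]+>")  # matches annotations like <laugh>, <cough>

def clean_transcript(text: str) -> str:
    """Remove non-speech annotation tags and collapse leftover whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

print(clean_transcript("Well <laugh> that was <noise> unexpected."))
# Well that was unexpected.
```

Run this over every transcript before writing the JSONL manifest, not after.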
Add 1 second of silence to the end of each training clip. Community reports (#39, user dariox1337) found this prevents overly fast speech and reduces noise hiccups at the end of generations.
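Appending the trailing silence is a one-liner per clip; a sketch assuming clips are already float arrays at 24 kHz:

```python
import numpy as np

SR = 24_000

def pad_tail_silence(audio: np.ndarray, seconds: float = 1.0, sr: int = SR) -> np.ndarray:
    """Append trailing silence so the model learns a clean end-of-utterance."""
    silence = np.zeros(int(sr * seconds), dtype=audio.dtype)
    return np.concatenate([audio, silence])

clip = np.ones(24_000, dtype=np.float32)  # 1 s of audio
padded = pad_tail_silence(clip)           # now 2 s
print(len(padded))  # 48000
```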
Learning rate: the most important configuration change
The official default LR of 2e-5 is widely reported as too high for the 1.7B model. It is the single most common cause of noisy output, EOS token failures (infinite generation), and apparent training divergence despite decreasing loss.
Community-validated settings for the 1.7B model (#39):
| Parameter | Our run | Community consensus (1.7B) |
| --- | --- | --- |
| Learning rate | Not specified (used default) | 2e-6 |
| Batch size | Not specified | 2 |
| Gradient accumulation | Not specified | 1–4 |
| Epochs | 10 (best) | 3–5 (with bug fix applied) |
| Precision | bfloat16 | bfloat16 |
For 24GB VRAM on an RTX 3090 Ti, the practical configuration is batch size 2 with gradient accumulation between 1 and 4 (effective batch size 2 to 8).
Do not set batch_size=32 without gradient accumulation - it will OOM even on larger GPUs.
Known bugs in sft_12hz.py (apply fixes before training)
Two confirmed bugs in the official training script affect output quality. Apply both fixes before starting a new run.
Bug 1 - Missing text_projection call
Line 93 of sft_12hz.py passes raw text embeddings to the codec path without going through model.talker.text_projection(...). For the 0.6B model this causes a hard crash (embedding dimension mismatch). For the 1.7B model the dimensions happen to match, so training proceeds silently with incorrect embeddings. PR #188 documents the fix; commit 680d4e9 applied it in the official repo - verify your local copy is at or past this commit.
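Why the 0.6B model crashes but the 1.7B model fails silently comes down to dimensions. A toy sketch with numpy (the hidden sizes here are illustrative, not the real models'):

```python
import numpy as np

def codec_path(embeds: np.ndarray, expected_dim: int) -> np.ndarray:
    """Stand-in for the codec path: rejects embeddings of the wrong width."""
    assert embeds.shape[-1] == expected_dim, "embedding dimension mismatch"
    return embeds

# Illustrative dims only: text hidden size vs. talker/codec hidden size.
rng = np.random.default_rng(0)
text_embeds = rng.standard_normal((5, 1024))          # raw text embeddings
text_projection = rng.standard_normal((1024, 2048))   # stands in for model.talker.text_projection

# Buggy path: passing text_embeds directly only "works" when the two hidden
# sizes happen to match (the 1.7B case) - and then with wrong values.
# Fixed path: always project first, as PR #188 does.
out = codec_path(text_embeds @ text_projection, expected_dim=2048)
print(out.shape)  # (5, 2048)
```

When the sizes coincide, no error fires, which is exactly why the 1.7B bug went unnoticed.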
Bug 2 - Double label-shift
sft_12hz.py manually shifts labels on lines 114–125, then passes them to HuggingFace's ForCausalLMLoss, which performs an internal shift again. This double shift creates a training-inference mismatch: with each successive epoch, generated speech becomes progressively faster until it is unintelligible.
Measured impact (GitHub issue #179, user fumyou13, 8-GPU setup):
| Version | Pre-SFT loss | Post-SFT loss | Output |
| --- | --- | --- | --- |
| Buggy (double-shift) | 22 | 13 | "Very fast speech" |
| Fixed | 8.3 | 7.8 | "Correct speech" |
This is why epoch 10 appeared optimal in our run. The model sounded best before the speed acceleration became audible. With the fix applied, speaker similarity continues improving through 20+ epochs rather than degrading after epoch 13.
PR #178 provides the code diff for this fix. The sub-talker loss also needs F.cross_entropy() in place of the HF loss_function so that labels are shifted only once.
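The mechanics of the double shift are easy to see on a toy token list. This pure-Python sketch is not the actual HF internals, just the alignment arithmetic:

```python
IGNORE = -100  # HF's ignore_index

def shift_labels(labels):
    """One causal shift: position t is supervised with token t+1."""
    return labels[1:] + [IGNORE]

tokens = [10, 11, 12, 13, 14]

once = shift_labels(tokens)   # correct: position t predicts t+1
twice = shift_labels(once)    # buggy: position t now predicts t+2

print(once)   # [11, 12, 13, 14, -100]
print(twice)  # [12, 13, 14, -100, -100]
```

After the double shift, every position is supervised with the token two steps ahead, so the model learns to emit speech tokens early - audible as progressively faster speech. The fix is either to pass unshifted labels to ForCausalLMLoss (which shifts internally) or to shift once and compute F.cross_entropy() directly.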
Best checkpoint logic (original run - pre-fix)
Validation improved early and flattened around epochs 8 to 12.
Validation started rising after epoch 13 in our continued run.
Best checkpoint by validation trend in this run: epoch 10.
Caveat: this pattern is characteristic of the double-shift bug. With the fix applied, the optimal epoch is likely higher.
Audio evidence
Recommended sample from this run
Settings: epoch 10 adapter, scale 0.35.
Recommended inference settings
LoRA scale
The lora_scale parameter controls how strongly the trained adapter blends into the base model at inference. Scale 1.0 consistently over-steers - the voice sounds forced and unnatural.
Default safe range: 0.25 to 0.35
Recommended starting point: 0.3
Scale sweep protocol: generate 5 samples at 0.2, 0.3, 0.35, and 0.5 before committing to a checkpoint. Do not evaluate only scale 1.0.
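The sweep protocol is mechanical enough to script. A sketch that enumerates the jobs - the dict keys and the per-sample seeding scheme are assumptions to wire into your own inference script:

```python
from itertools import product

SCALES = (0.2, 0.3, 0.35, 0.5)
SAMPLES_PER_SCALE = 5

# Hypothetical job records; feed each one to your inference entry point.
jobs = [
    {"lora_scale": scale, "sample_idx": idx, "seed": 42 + idx}
    for scale, idx in product(SCALES, range(SAMPLES_PER_SCALE))
]
print(len(jobs))  # 20 generations per checkpoint
```

Twenty short generations per checkpoint is cheap compared to discovering over-steering only after shipping a scale-1.0 demo.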
Generation parameters
Community-validated values for natural-sounding output:
temperature = 0.8 # Default 0.9; lower = more stable, less expressive
top_p = 0.85 # Default 1.0
top_k = 30 # Default 50; tighter sampling reduces voice drift
repetition_penalty = 1.05 # Critical for multi-chunk consistency
seed = 42 # Fix seed per generation for reproducible A/B comparisons
max_new_tokens = 2048 # Set explicitly - EOS bug affects ~0.5% of base model inferences
The explicit max_new_tokens cap above is the backstop for the EOS failure case (infinite generation). For the attention implementation, set:
attn_implementation = "sdpa" # Stable on all consumer GPUs
# flash_attention_2 is faster but requires dtype=torch.bfloat16 in from_pretrained()
Multi-chunk voice consistency
When generating long texts in chunks:
Fix the random seed before each chunk to prevent voice timbre shifting across boundaries.
Extract the speaker embedding once from reference audio and reuse it across all chunks - per-chunk re-encoding introduces variability.
Use larger chunks (fewer boundaries) where possible.
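The "fewer boundaries" point amounts to packing sentences greedily up to a size budget. A self-contained sketch (the 300-character budget is an assumption to tune against your model's comfortable context):

```python
import re

def chunk_text(text: str, max_chars: int = 300):
    """Split at sentence boundaries, packing sentences greedily up to max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("A one. B two. C three.", max_chars=10))
# ['A one.', 'B two.', 'C three.']
```

The caller then reuses one speaker embedding and re-fixes the seed before generating each chunk, per the two points above.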
Cold-start decoder distortion
On the first inference call in a new process, the speech tokenizer decoder has no warm-up context. This causes the first few audio frames to reconstruct incorrectly (documented as issue #219). The community workaround is to prepend silence codec tokens as warm-up context, then trim the corresponding samples from output. Affects short phrases starting with numerals most visibly.
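The trim arithmetic for the warm-up workaround is straightforward. This sketch assumes the ~12.5 tokens/s frame rate implied by the sft_12hz.py naming, i.e. 1920 output samples per codec token at 24 kHz - verify against your codec's actual frame rate before relying on it:

```python
SR = 24_000
TOKENS_PER_SEC = 12.5  # assumed from the "12hz" codec naming; verify locally
SAMPLES_PER_TOKEN = int(SR / TOKENS_PER_SEC)  # 1920

def trim_warmup(audio_samples: list, n_warmup_tokens: int) -> list:
    """Drop the audio produced by the prepended silence warm-up tokens."""
    return audio_samples[n_warmup_tokens * SAMPLES_PER_TOKEN:]

# e.g. 5 warm-up tokens -> discard the first 9,600 samples (0.4 s)
audio = [0.0] * (SR * 2)  # 2 s of decoded audio
print(len(trim_warmup(audio, 5)))  # 38400
```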
The companion repo approach keeps upstream Qwen3-TTS clean while allowing pinned, reproducible patches and scripts.
Before starting your next run: checklist
□ Verify sft_12hz.py is at or past commit 680d4e9 (text_projection fix)
□ Apply PR #178 double label-shift fix if not yet merged
□ Resample all audio to 24kHz before codec prep
□ Strip annotation tags from transcriptions
□ Add 1s silence to end of each training clip
□ Set LR = 2e-6 (not the default 2e-5)
□ Set max_new_tokens explicitly in inference scripts
□ Keep all checkpoints until after listening eval sweep