GLM-TTS Technical Report for Production Zero-Shot TTS
14 Feb 2026, 00:00 Z
GLM-TTS is one of the stronger open-source TTS releases from late 2025 because it is framed as a production system, not just a lab demo.
The paper positions the model around three goals that matter in shipping teams: quality, controllability, and operational cost.
Status note (as of February 14, 2026):
The GitHub repo, inference scripts, and checkpoints are public.
The README still marks RL-optimized weights and the 2D Vocos update as "coming soon," so separate what is currently runnable from what is paper-claimed.
60-second takeaway
- GLM-TTS uses a two-stage stack: autoregressive text-to-token generation, then flow-based token-to-waveform synthesis.
- The main technical bet is not one module; it is a bundled system: upgraded tokenizer + GRPO multi-reward RL + hybrid phoneme input + LoRA customization + vocoder upgrades.
- On Seed-TTS-eval zh (paper-reported), GLM-TTS is at CER 1.03 / SIM 76.1, and GLM-TTS_RL improves to CER 0.89 / SIM 76.4.
- Several headline gains (phoneme control and Vocos2D quality) come from internal evaluations, so treat them as promising but not yet independently verified.
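For context on the benchmark numbers above: CER (character error rate) is edit distance between the ASR transcript and the reference text, divided by reference length. A minimal sketch of the standard computation (illustrative only, not the actual Seed-TTS-eval harness):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance for the previous row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

# One dropped character in a 16-character reference -> CER 0.0625 (6.25%)
print(cer("speech synthesis", "speech syntesis"))
```

A CER of 1.03 on Seed-TTS-eval zh therefore means roughly one character error per hundred reference characters.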
What GLM-TTS is trying to solve
The technical report argues that many modern zero-shot TTS systems still have five recurring production pain points:
- pronunciation control for polyphones and rare words
- emotional expressiveness without unstable tuning
- affordable voice customization without full-model finetuning
- robustness under real-world data noise
- quality retention while supporting streaming-like deployment patterns
GLM-TTS is designed as a direct response to those constraints.
Architecture in one view
GLM-TTS follows the now-common hybrid pattern: text to discrete speech tokens, then tokens to waveform.
```mermaid
flowchart LR
A[Text] --> B[AR LLM<br/>Text to Speech Tokens]
P[Prompt Audio] --> C[Speech Tokenizer + Speaker Embedding]
C --> B
B --> D[Flow Model<br/>Tokens to Mel]
D --> E[Vocoder]
E --> F[Waveform]
```

The paper explicitly frames this as a production compromise between controllability and synthesis quality.
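The dataflow in the diagram can be sketched as stub code. All class names, shapes, and sizes below are hypothetical stand-ins to show how the stages compose, not the real GLM-TTS API:

```python
# Hypothetical sketch of the two-stage GLM-TTS dataflow (names/sizes invented).
import random

class SpeechTokenizer:
    """Prompt audio -> (discrete speech tokens, speaker embedding)."""
    def encode(self, prompt_audio):
        tokens = [random.randrange(32_768) for _ in range(25)]  # 25 Hz, 32k vocab
        speaker = [random.gauss(0, 1) for _ in range(192)]      # placeholder vector
        return tokens, speaker

class ARTextToToken:
    """Stage 1: autoregressive LLM maps text (+ prompt context) to speech tokens."""
    def generate(self, text, prompt_tokens, speaker):
        return [random.randrange(32_768) for _ in range(4 * len(text))]

class FlowTokenToMel:
    """Stage 2: flow model maps speech tokens to mel-spectrogram frames."""
    def synthesize(self, tokens):
        return [[random.gauss(0, 1) for _ in range(80)] for _ in tokens]

class Vocoder:
    """Mel frames -> waveform samples (here: 256 samples per frame)."""
    def decode(self, mel):
        return [random.gauss(0, 1) for _ in range(len(mel) * 256)]

def tts(text, prompt_audio):
    tokenizer, ar, flow, voc = SpeechTokenizer(), ARTextToToken(), FlowTokenToMel(), Vocoder()
    prompt_tokens, speaker = tokenizer.encode(prompt_audio)
    speech_tokens = ar.generate(text, prompt_tokens, speaker)
    mel = flow.synthesize(speech_tokens)
    return voc.decode(mel)
```

The key structural point is that the prompt audio conditions the AR stage (voice cloning) while the flow model and vocoder only see discrete tokens, which is what makes the token interface the natural control surface.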
The 6 design choices that matter
1) Tokenizer upgrades are central, not incidental
The speech tokenizer is upgraded from 12.5 Hz to 25 Hz and from a 16k to a 32k vocabulary, with added pitch-estimation constraints.
The paper's own tokenizer ablation reports:
- SIM: 75.2 -> 76.1
- CER: 1.44 -> 1.03
This is important because many TTS stacks fail at the tokenizer layer before downstream modeling even has a chance.
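The scale of the upgrade can be sanity-checked with back-of-envelope arithmetic, assuming the 16k/32k vocabularies are 2^14 and 2^15 and one token per frame (both assumptions mine, not stated in the paper):

```python
import math

def token_bitrate(frame_rate_hz: float, vocab_size: int) -> float:
    """Upper bound on the token stream's information rate in bits/second:
    each frame carries at most log2(vocab_size) bits."""
    return frame_rate_hz * math.log2(vocab_size)

old = token_bitrate(12.5, 16_384)  # 12.5 Hz x 14 bits = 175 bits/s
new = token_bitrate(25.0, 32_768)  # 25 Hz x 15 bits   = 375 bits/s
print(old, new, new / old)         # roughly a 2.1x larger information budget
```

Under these assumptions, the new tokenizer gives the AR model roughly twice the capacity to encode prosody and timbre detail per second of speech, which is consistent with the SIM and CER gains in the ablation.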