GLM-TTS Technical Report for Production Zero-Shot TTS

14 Feb 2026, 00:00 Z

GLM-TTS is one of the stronger open-source TTS releases from late 2025 because it is framed as a production system, not just a lab demo.

The paper positions the model around three goals that matter to shipping teams: quality, controllability, and operational cost.

Status note (as of February 14, 2026):
The GitHub repo, inference scripts, and checkpoints are public.
The README still marks the RL-optimized weights and the 2D Vocos update as "coming soon," so it is worth separating what is currently runnable from what is only paper-claimed.

60-second takeaway

  • GLM-TTS uses a two-stage stack: autoregressive text-to-token generation, then flow-based token-to-waveform synthesis.
  • The main technical bet is not one module; it is a bundled system: upgraded tokenizer + GRPO multi-reward RL + hybrid phoneme input + LoRA customization + vocoder upgrades.
  • On Seed-TTS-eval zh (paper-reported), GLM-TTS is at CER 1.03 / SIM 76.1, and GLM-TTS_RL improves to CER 0.89 / SIM 76.4.
  • Several headline gains (phoneme-control and Vocos2D quality) are from internal evaluations, so treat them as promising, not yet independently verified.

What GLM-TTS is trying to solve

The technical report argues that many modern zero-shot TTS systems still have five recurring production pain points:

  • pronunciation control for polyphones and rare words
  • emotional expressiveness without unstable tuning
  • affordable voice customization without full-model finetuning
  • robustness under real-world data noise
  • quality retention while supporting streaming-like deployment patterns

GLM-TTS is designed as a direct response to those constraints.

Architecture in one view

GLM-TTS follows the now-common hybrid pattern: text to discrete speech tokens, then tokens to waveform.

flowchart LR
  A[Text] --> B[AR LLM<br/>Text to Speech Tokens]
  P[Prompt Audio] --> C[Speech Tokenizer + Speaker Embedding]
  C --> B
  B --> D[Flow Model<br/>Tokens to Mel]
  D --> E[Vocoder]
  E --> F[Waveform]

The paper explicitly frames this as a production compromise between controllability and synthesis quality.
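The two-stage flow above can be sketched as a pipeline. Everything below is an illustrative stand-in, not the GLM-TTS API: the function names, shapes, sample rate, and duration heuristic are assumptions, with random and deterministic placeholders where the AR LLM, flow model, and vocoder would sit. Only the 25 Hz token rate and 32k vocabulary come from the paper.

```python
import numpy as np

TOKEN_RATE_HZ = 25        # paper-reported speech-token rate
VOCAB_SIZE = 32_768       # paper-reported ~32k speech-token vocabulary
SAMPLE_RATE_HZ = 24_000   # assumed output rate; the real system's may differ

def ar_text_to_tokens(text: str, prompt_tokens: np.ndarray,
                      rng: np.random.Generator) -> np.ndarray:
    """Stage 1 stand-in: the AR LLM conditions on text plus prompt-audio
    tokens and decodes speech tokens step by step. Here we just emit
    random token ids with a crude duration heuristic."""
    n_steps = max(1, len(text) // 4) * TOKEN_RATE_HZ // 5
    return rng.integers(0, VOCAB_SIZE, size=n_steps)

def flow_tokens_to_mel(tokens: np.ndarray, n_mels: int = 80) -> np.ndarray:
    """Stage 2 stand-in: the flow model maps discrete tokens to a mel
    spectrogram. Faked here as a deterministic embedding lookup."""
    emb = np.take(np.linspace(-1.0, 1.0, VOCAB_SIZE), tokens)
    return emb[:, None] * np.ones((1, n_mels))

def vocoder_mel_to_wave(mel: np.ndarray) -> np.ndarray:
    """Vocoder stand-in: upsample mel frames (at the token rate) to
    audio samples by naive repetition."""
    hop = SAMPLE_RATE_HZ // TOKEN_RATE_HZ  # samples per token frame
    return np.repeat(mel.mean(axis=1), hop)

def synthesize(text: str, prompt_tokens: np.ndarray, seed: int = 0) -> np.ndarray:
    """Compose the two stages: text -> tokens -> mel -> waveform."""
    rng = np.random.default_rng(seed)
    tokens = ar_text_to_tokens(text, prompt_tokens, rng)
    mel = flow_tokens_to_mel(tokens)
    return vocoder_mel_to_wave(mel)
```

The point of the shape is the interface boundary: controllability interventions (phoneme input, RL-tuned decoding) live in stage 1, while quality work (Vocos upgrades) lives downstream of the token stream.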

The 6 design choices that matter

1) Tokenizer upgrades are central, not incidental

The speech tokenizer is upgraded from 12.5 Hz to 25 Hz and from a 16k to 32k vocabulary, with added pitch-estimation constraints.

The paper's own tokenizer ablation reports:

  • SIM: 75.2 -> 76.1
  • CER: 1.44 -> 1.03

This is important because many TTS stacks fail at the tokenizer layer before downstream modeling even has a chance.
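A back-of-envelope way to see why the upgrade matters is the information rate of the token stream. The rates and vocabulary sizes come from the paper; the bits-per-second framing, and the assumption that "16k/32k" means 2^14 and 2^15 entries, are my own illustration, not a paper metric.

```python
import math

def token_bitrate(rate_hz: float, vocab_size: int) -> float:
    """Upper-bound information rate of a discrete token stream:
    tokens per second times bits per token (log2 of vocab size)."""
    return rate_hz * math.log2(vocab_size)

# Old tokenizer: 12.5 Hz, 16k vocab (assumed 2**14 entries)
old = token_bitrate(12.5, 16_384)
# New tokenizer: 25 Hz, 32k vocab (assumed 2**15 entries)
new = token_bitrate(25.0, 32_768)

print(f"old: {old:.0f} b/s, new: {new:.0f} b/s, ratio: {new/old:.2f}x")
# old: 175 b/s, new: 375 b/s, ratio: 2.14x
```

Roughly doubling the channel capacity gives the downstream flow model far more to work with, which is consistent with the SIM and CER gains in the ablation, though capacity alone does not guarantee them.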

2) GRPO-based RL is used as alignment, not as the whole training story
