GLM-TTS Technical Report for Production Zero-Shot TTS

14 Feb 2026, 00:00 Z

GLM-TTS is one of the stronger open-source TTS releases from late 2025 because it is framed as a production system, not just a lab demo.

The paper positions the model around three goals that matter to shipping teams: quality, controllability, and operational cost.

Status note (as of February 14, 2026):
The GitHub repo, inference scripts, and checkpoints are public.
The README still marks the RL-optimized weights and the 2D Vocos update as "coming soon," so it is worth separating what is currently runnable from what is only paper-claimed.

60-second takeaway

  • GLM-TTS uses a two-stage stack: autoregressive text-to-token generation, then flow-based token-to-waveform synthesis.
  • The main technical bet is not one module; it is a bundled system: upgraded tokenizer + GRPO multi-reward RL + hybrid phoneme input + LoRA customization + vocoder upgrades.
  • On Seed-TTS-eval zh (paper-reported), GLM-TTS is at CER 1.03 / SIM 76.1, and GLM-TTS_RL improves to CER 0.89 / SIM 76.4.
  • Several headline gains (phoneme-control and Vocos2D quality) are from internal evaluations, so treat them as promising, not yet independently verified.

What GLM-TTS is trying to solve

The technical report argues that many modern zero-shot TTS systems still have five recurring production pain points:

  • pronunciation control for polyphones and rare words
  • emotional expressiveness without unstable tuning
  • affordable voice customization without full-model finetuning
  • robustness under real-world data noise
  • quality retention while supporting streaming-like deployment patterns

GLM-TTS is designed as a direct response to those constraints.

Architecture in one view

GLM-TTS follows the now-common hybrid pattern: text to discrete speech tokens, then tokens to waveform.

flowchart LR
  A[Text] --> B[AR LLM<br/>Text to Speech Tokens]
  P[Prompt Audio] --> C[Speech Tokenizer + Speaker Embedding]
  C --> B
  B --> D[Flow Model<br/>Tokens to Mel]
  D --> E[Vocoder]
  E --> F[Waveform]

The paper explicitly frames this as a production compromise between controllability and synthesis quality.
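The two-stage flow above can be sketched as a pipeline. Everything below is an illustrative stand-in, not the GLM-TTS API: the function names, shapes, sample rate, and duration heuristic are assumptions, with random and deterministic placeholders where the AR LLM, flow model, and vocoder would sit. Only the 25 Hz token rate and 32k vocabulary come from the paper.

```python
import numpy as np

TOKEN_RATE_HZ = 25        # paper-reported speech-token rate
VOCAB_SIZE = 32_768       # paper-reported ~32k speech-token vocabulary
SAMPLE_RATE_HZ = 24_000   # assumed output rate; the real system's may differ

def ar_text_to_tokens(text: str, prompt_tokens: np.ndarray,
                      rng: np.random.Generator) -> np.ndarray:
    """Stage 1 stand-in: the AR LLM conditions on text plus prompt-audio
    tokens and decodes speech tokens step by step. Here we just emit
    random token ids with a crude duration heuristic."""
    n_steps = max(1, len(text) // 4) * TOKEN_RATE_HZ // 5
    return rng.integers(0, VOCAB_SIZE, size=n_steps)

def flow_tokens_to_mel(tokens: np.ndarray, n_mels: int = 80) -> np.ndarray:
    """Stage 2 stand-in: the flow model maps discrete tokens to a mel
    spectrogram. Faked here as a deterministic embedding lookup."""
    emb = np.take(np.linspace(-1.0, 1.0, VOCAB_SIZE), tokens)
    return emb[:, None] * np.ones((1, n_mels))

def vocoder_mel_to_wave(mel: np.ndarray) -> np.ndarray:
    """Vocoder stand-in: upsample mel frames (at the token rate) to
    audio samples by naive repetition."""
    hop = SAMPLE_RATE_HZ // TOKEN_RATE_HZ  # samples per token frame
    return np.repeat(mel.mean(axis=1), hop)

def synthesize(text: str, prompt_tokens: np.ndarray, seed: int = 0) -> np.ndarray:
    """Compose the two stages: text -> tokens -> mel -> waveform."""
    rng = np.random.default_rng(seed)
    tokens = ar_text_to_tokens(text, prompt_tokens, rng)
    mel = flow_tokens_to_mel(tokens)
    return vocoder_mel_to_wave(mel)
```

The point of the shape is the interface boundary: controllability interventions (phoneme input, RL-tuned decoding) live in stage 1, while quality work (Vocos upgrades) lives downstream of the token stream.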

The 6 design choices that matter

1) Tokenizer upgrades are central, not incidental

The speech tokenizer is upgraded from 12.5 Hz to 25 Hz and from a 16k to 32k vocabulary, with added pitch-estimation constraints.

The paper's own tokenizer ablation reports:

  • SIM: 75.2 -> 76.1
  • CER: 1.44 -> 1.03

This is important because many TTS stacks fail at the tokenizer layer before downstream modeling even has a chance.
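A back-of-envelope way to see why the upgrade matters is the information rate of the token stream. The rates and vocabulary sizes come from the paper; the bits-per-second framing, and the assumption that "16k/32k" means 2^14 and 2^15 entries, are my own illustration, not a paper metric.

```python
import math

def token_bitrate(rate_hz: float, vocab_size: int) -> float:
    """Upper-bound information rate of a discrete token stream:
    tokens per second times bits per token (log2 of vocab size)."""
    return rate_hz * math.log2(vocab_size)

# Old tokenizer: 12.5 Hz, 16k vocab (assumed 2**14 entries)
old = token_bitrate(12.5, 16_384)
# New tokenizer: 25 Hz, 32k vocab (assumed 2**15 entries)
new = token_bitrate(25.0, 32_768)

print(f"old: {old:.0f} b/s, new: {new:.0f} b/s, ratio: {new/old:.2f}x")
# old: 175 b/s, new: 375 b/s, ratio: 2.14x
```

Roughly doubling the channel capacity gives the downstream flow model far more to work with, which is consistent with the SIM and CER gains in the ablation, though capacity alone does not guarantee them.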

2) GRPO-based RL is used as alignment, not as the whole training story
