Diffusion Speech Denoising in 2025 -- StoRM, SGMSE+, UNIVERSE++, Schrödinger Bridges, and Streaming Variants
24 Apr 2025, 00:00 Z
TL;DR Speech denoising now spans a family of diffusion-flavoured designs. StoRM blends a predictive estimate with a diffusion sampler to tame hallucinations at low cost; SGMSE/SGMSE+ continue to scale score matching with variance-aware schedulers; UNIVERSE++ bakes in an adversarial loss and low-rank adaptation for cross-condition robustness; few-step Schrödinger-Bridge variants target sub-10-step inference; causal diffusion architectures chase streaming deployment; and MossFormer2 remains a strong baseline when you can tolerate separation-first latency.
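The StoRM recipe mentioned above is easy to state in code: run a cheap predictive network once, then let a conditional score model refine that estimate over a handful of reverse steps instead of sampling from pure noise. The sketch below is a minimal illustration of that control flow, assuming hypothetical `predictor` and `score_model` callables and a simple variance-exploding schedule; it is not the authors' reference implementation.

```python
import torch

def storm_style_enhance(noisy_spec, predictor, score_model,
                        n_steps=8, sigma_max=0.5, sigma_min=0.05):
    """Predictive first stage, then a short score-based refinement that
    starts from the predictive estimate rather than from pure noise."""
    # Stage 1: one-shot predictive estimate of the clean spectrogram.
    x = predictor(noisy_spec)

    # Geometric noise schedule from sigma_max down to sigma_min.
    sigmas = sigma_max * (sigma_min / sigma_max) ** torch.linspace(0.0, 1.0, n_steps)

    # Stage 2: perturb the estimate to the first noise level, then take a few
    # Euler steps of the probability-flow ODE, conditioned on the noisy input.
    x = x + sigmas[0] * torch.randn_like(x)
    for i in range(n_steps - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        score = score_model(x, noisy_spec, sigma)        # approx. grad log p(x | noisy)
        x = x + (sigma_next - sigma) * (-sigma * score)  # dx/dsigma = -sigma * score
    return x
```

Starting from the predictive estimate is what allows the step budget to stay in the single digits while the diffusion stage cleans up the artifacts a purely predictive pass tends to leave behind.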
Why teams care in 2025
- Customer support, call centers, and meeting tooling now demand universal denoisers that handle noise, reverberation, codec artifacts, and far-field setups without per-domain tuning.
- Real-time AI voice agents (telephony, kiosks, wearables) force inference budgets down to single-digit diffusion steps -- or hybrids that drop to predictive streams when necessary.
- Evaluation shifted from "cleaner spectrograms" to intelligibility and signal fidelity (STOI, SI-SDR), MOS (subjective listening or DNSMOS), and downstream ASR WER. Modern stacks must show gains across all of them (see the SI-SDR sketch after this list).
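To make the metric shift concrete, here is a minimal SI-SDR implementation using the standard zero-mean, projection-based definition; STOI and DNSMOS scores typically come from the `pystoi` package and Microsoft's released DNSMOS models rather than hand-rolled code. The function name and example signals are illustrative only.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the reference,
    then compare target energy to residual energy."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) /
                           (np.dot(residual, residual) + eps))

# Quick check: mild additive noise at 10% of the signal scale lands near 20 dB.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)
    noisy = clean + 0.1 * rng.standard_normal(16000)
    print(f"SI-SDR (noisy vs clean): {si_sdr(noisy, clean):.1f} dB")
```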
The models, at a glance
| Model | Key idea | Step count | Notable metrics | Where it shines |
| StoRM (Lemercier et al., IEEE/ACM TASLP 2023) | Predictive network provides a guided starting point for the diffusion sampler, suppressing breathing/phonation artifacts. | 8-30 (configurable) | VoiceBank+DEMAND PESQ >= 2.9 with 8 steps; DNSMOS better than pure score-matching at same budget. | Low-latency deployments that still need diffusion-grade quality. |
| SGMSE / SGMSE+ (Richter et al., 2022; Lay & Gerkmann, 2024) | Score-based generative speech enhancement with SDEs; SGMSE+ adds a stronger UNet, variance-aware schedules, and dereverb handling. |