ReStyle-TTS and Relative Style Control in Zero-Shot TTS
14 Feb 2026, 00:00 Z
ReStyle-TTS is one of the more interesting speech papers from early 2026 because it focuses on a practical failure case in zero-shot voice cloning: you can copy timbre from a reference clip, but you often inherit the reference style too strongly, which makes style control clunky.
For production teams, the core claim is simple: instead of forcing absolute style targets ("make this angry"), ReStyle-TTS aims for relative control ("make this slightly angrier than the reference").
Status note (important):
As of February 14, 2026, this is an arXiv v1 paper with no public code/demo.
Treat this post as a research briefing, not a deployment recipe.
60-second takeaway
- What is new: decoupling text guidance from reference guidance, then adding continuous style control via style LoRAs.
- Why it matters: relative controls are easier for editors and creators to use than brittle absolute prompts.
- What looks strong (reported): better generation in contradictory-style settings (where the reference clip's style does not match the target style), while keeping intelligibility and timbre in range.
- What is missing today: reproducible implementation artifacts.
The problem it targets
Most zero-shot TTS pipelines can preserve speaker identity, but style remains sticky: if your reference is calm and low-energy, your output usually stays close to that style unless you over-prompt and risk instability.
This friction is real in production:
- short-form ads need fast style variants
- narration needs controlled energy ramps
- multilingual voiceovers need style edits without re-recording references
ReStyle-TTS frames this as a guidance-balancing problem first, then a style-control problem.
What ReStyle-TTS changes
The paper introduces three components.
1) Decoupled Classifier-Free Guidance (DCFG)
Standard CFG uses one guidance knob, which entangles text fidelity with reference influence. DCFG introduces separate guidance strengths for text and reference, so the model can dial back reference style without sacrificing text alignment as quickly.
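To make the two-knob idea concrete, here is a minimal sketch of one common way to decouple two conditions: a text delta applied on top of the unconditional prediction, and a reference delta applied on top of the text-conditioned prediction. The paper's exact formulation is not public, so the decomposition, function name, and weights below are assumptions.

```python
import numpy as np

def decoupled_cfg(eps_uncond, eps_text, eps_text_ref, w_text, w_ref):
    """Combine three model predictions with separate guidance weights.

    Hypothetical sketch of decoupled classifier-free guidance:
      - eps_uncond:   prediction with no conditioning
      - eps_text:     prediction conditioned on text only
      - eps_text_ref: prediction conditioned on text + reference audio
    w_text scales the text delta, w_ref the reference delta, independently.
    """
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_ref * (eps_text_ref - eps_text))
```

With `w_text = w_ref = 1.0` this reduces to the fully conditioned prediction; setting `w_ref < 1.0` weakens reference influence while leaving text guidance untouched, which is the behavior the paper is after.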
2) Style LoRAs plus Orthogonal LoRA Fusion (OLoRA)
The method trains style-specific LoRAs (pitch, energy, emotions) and combines multiple LoRAs with orthogonal projection to reduce interference. The intended UX is a continuous control surface where each attribute can move independently.
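One plausible reading of "orthogonal projection to reduce interference" is a Gram-Schmidt pass over the flattened per-style weight deltas: each style's update is projected onto the orthogonal complement of the styles already fused, so turning one knob does not silently drag another. The paper's OLoRA details are not public; everything below (function name, flattening choice, weighting) is an illustrative assumption.

```python
import numpy as np

def fuse_loras_orthogonal(deltas, weights):
    """Fuse per-style LoRA weight deltas with orthogonal projection.

    Hypothetical sketch: each delta is orthogonalized (Gram-Schmidt on the
    flattened matrix) against the directions already accumulated, then added
    with its user-chosen strength. deltas and the output share one shape.
    """
    basis = []  # orthonormal directions claimed by earlier styles
    fused = np.zeros_like(deltas[0], dtype=float)
    for delta, w in zip(deltas, weights):
        v = delta.astype(float).ravel().copy()
        for b in basis:
            v -= (v @ b) * b  # remove overlap with earlier style directions
        norm = np.linalg.norm(v)
        if norm > 1e-8:
            basis.append(v / norm)
        fused += w * v.reshape(delta.shape)
    return fused
```

In this sketch, two already-orthogonal style deltas pass through unchanged, while a delta that duplicates an earlier one contributes nothing new, which is the interference-reduction property the fusion is meant to provide.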
3) Timbre Consistency Optimization (TCO)
Weakening reference influence can hurt speaker identity. TCO adds a reward-weighted training signal tied to speaker similarity so timbre consistency recovers while control flexibility remains.
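A reward-weighted training signal tied to speaker similarity could look like the following: compute cosine similarity between speaker embeddings of the generated and reference audio, map it to a [0, 1] reward, and upweight the loss when similarity drops. The paper's reward formulation is not public, so the weighting scheme, `beta`, and names here are assumptions.

```python
import numpy as np

def tco_weighted_loss(base_loss, gen_spk_emb, ref_spk_emb, beta=1.0):
    """Hypothetical reward-weighted objective for timbre consistency.

    Cosine similarity between generated and reference speaker embeddings is
    mapped from [-1, 1] to a [0, 1] reward; low similarity inflates the loss
    by up to a factor of (1 + beta), pushing training back toward the
    reference speaker's timbre.
    """
    cos = (gen_spk_emb @ ref_spk_emb) / (
        np.linalg.norm(gen_spk_emb) * np.linalg.norm(ref_spk_emb))
    reward = 0.5 * (1.0 + cos)  # map [-1, 1] -> [0, 1]
    return base_loss * (1.0 + beta * (1.0 - reward))
```

With identical embeddings the reward is 1 and the loss is unchanged; as the generated voice drifts from the reference speaker, the penalty grows, which matches the stated goal of recovering timbre consistency without giving up control flexibility.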