G-DFlow-TTS Demo

Abstract

Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as conditional infilling of masked acoustic representations, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a continuous-time Markov chain (CTMC) framework for discrete generation, a natural fit. However, for TTS, inference-time control for stable low-step conditional infilling remains underexplored. We propose G-DFlow-TTS, a DFM-based alignment-free TTS model with an inference-time stack: (i) predictor-free guidance (PFG) via CTMC rate blending, (ii) conditional coupling to construct conditional probability paths, and (iii) schedule-constrained remasking that adds token-to-mask transitions to revise early errors. Together, these mechanisms improve conditional control and robustness for practical alignment-free DFM-TTS.

Model Overview

Figure 1: Overview of the G-DFlow-TTS architecture. The model operates on discrete speech tokens using Discrete Flow Matching with predictor-free guidance, conditional coupling and schedule-constrained remasking.
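
To make "guidance via CTMC rate blending" concrete, the sketch below blends conditional and unconditional jump rates for a single token position. It is only an illustration under stated assumptions: the linear blend, the clipping at zero, and the names `blend_rates` and `w` are hypothetical, and the exact blending rule used by G-DFlow-TTS is not given on this page.

```python
def blend_rates(cond, uncond, w):
    """Blend per-token CTMC jump rates (illustrative, not the paper's exact rule).

    cond, uncond: non-negative jump rates toward each candidate token,
                  computed with and without the conditioning signal.
    w:            hypothetical guidance weight; w=0 returns the unconditional
                  rates, w=1 the conditional rates, w>1 extrapolates past them.
    """
    if len(cond) != len(uncond):
        raise ValueError("rate vectors must have the same length")
    # Linear blend, clipped at zero so the result is still a valid rate vector.
    return [max(0.0, u + w * (c - u)) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond = [0.25, 0.75]
    uncond = [0.5, 0.5]
    for w in (0.0, 1.0, 3.0):
        print(w, blend_rates(cond, uncond, w))
```

Clipping is one simple way to keep extrapolated rates non-negative; other choices (for example a geometric blend) are equally plausible.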

Baseline Comparison

CosyVoice 2

Scalable Multilingual TTS with LLM-powered Zero-Shot Cloning

Architecture: LLM + Flow-Matching Decoder
Method: LLM-based Semantic Generation
Training Data: Proprietary multilingual
Features: 9 languages, 3-sec cloning, streaming
Paper: arXiv:2412.10117

F5-TTS

A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Architecture: Diffusion Transformer (DiT)
Method: Flow Matching
Training Data: Emilia (~100K hours)
Features: Non-autoregressive, multilingual
NFE (in this demo): 32
Paper: arXiv:2410.06885

MaskGCT

Zero-Shot TTS with Masked Generative Codec Transformer

Architecture: Masked Generative Transformer
Method: Mask-and-Predict
Training Data: Emilia (~100K hours)
Features: Non-autoregressive, no text-speech alignment
Paper: arXiv:2409.00750

Zero-Shot Voice Cloning

This section tests each model's ability to clone a voice from a short reference audio clip. All models use an infilling approach, which benefits from knowing the prompt transcript.
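
The role of the prompt transcript can be sketched as follows: the model sees the prompt text and target text joined together, alongside the prompt's speech tokens followed by masked positions to be filled in. This is a minimal illustration, not any model's actual preprocessing; `MASK`, `build_infilling_input`, and the token values are hypothetical.

```python
MASK = -1  # hypothetical id for a masked speech-token position


def build_infilling_input(prompt_text, target_text, prompt_tokens, target_len):
    """Assemble inputs for prompt-conditioned infilling (illustrative only).

    Returns the joined transcript, the speech-token sequence with the target
    region masked, and a boolean list marking which positions must be generated.
    """
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    text = prompt_text.strip() + " " + target_text.strip()
    tokens = list(prompt_tokens) + [MASK] * target_len
    is_target = [False] * len(prompt_tokens) + [True] * target_len
    return text, tokens, is_target


if __name__ == "__main__":
    text, tokens, is_target = build_infilling_input(
        "Kids are talking by the door.",
        "The weather forecast predicts sunny skies for the remainder of the week.",
        prompt_tokens=[17, 4, 230, 9],  # made-up codec token ids
        target_len=6,                   # made-up target length
    )
    print(text)
    print(tokens)
    print(is_target)
```

Without the prompt transcript, the model would have no text for the prompt's speech tokens, so the joined transcript would no longer line up with the full token sequence.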

Sample 1

Target Text: "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."

Prompt Audio	CosyVoice 2	F5-TTS	MaskGCT	G-DFlow-TTS (Ours)

Sample 2 (Long text)

Target Text: "Throughout history, great civilizations have risen and fallen, each leaving behind monuments, writings, and cultural achievements that continue to influence our modern world in countless ways."

Prompt Audio	CosyVoice 2	F5-TTS	MaskGCT	G-DFlow-TTS (Ours)

Sample 3

Prompt Text: "Kids are talking by the door."

Target Text: "The weather forecast predicts sunny skies for the remainder of the week."

Prompt Audio	CosyVoice 2	F5-TTS	MaskGCT	G-DFlow-TTS (Ours)

Sample 4

Prompt Text: "Kids are talking by the door."

Target Text: "Please take a moment to relax and breathe deeply before continuing."

Prompt Audio	CosyVoice 2	F5-TTS	MaskGCT	G-DFlow-TTS (Ours)

LibriSpeech Test-Clean

Reconstruction evaluation on LibriSpeech test-clean. The prompt audio and text are identical to the target (same speaker, same content), testing how well models can reproduce the original speech characteristics.

Sample 1 (2830-3980-0000)

Prompt Text: "In every way they sought to undermine the authority of Saint Paul."

Target Text: "In every way they sought to undermine the authority of Saint Paul."

Ground Truth	CosyVoice 2	F5-TTS	MaskGCT	G-DFlow-TTS (Ours)

Sample 2 (1089-134686-0000)

Prompt Text: "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour-fattened sauce."

Target Text: "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour-fattened sauce."

Ground Truth	CosyVoice 2	F5-TTS	MaskGCT	G-DFlow-TTS (Ours)

Sample 3 (1580-141083-0000)

Prompt Text: "I will endeavour in my statement to avoid such terms as would serve to limit the events to any particular place or give a clue as to the people concerned."

Target Text: "I will endeavour in my statement to avoid such terms as would serve to limit the events to any particular place or give a clue as to the people concerned."

Ground Truth	CosyVoice 2	F5-TTS	MaskGCT	G-DFlow-TTS (Ours)

NFE Steps Comparison (G-DFlow-TTS)

How does the number of function evaluations (NFE, i.e., sampling steps) affect synthesis quality? Higher NFE generally produces higher-quality audio at the cost of slower inference.
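
The cost side of this trade-off is simple arithmetic. The sketch below assumes a uniform time grid on [0, 1] and one network call per step, which may differ from the sampler actually used. It lists the step size and the cost relative to NFE=32 for the settings compared in this section; `time_grid`, `step_size`, and `relative_cost` are hypothetical helper names.

```python
def time_grid(nfe):
    """Uniform grid of nfe + 1 time points from 0 to 1 (assumed schedule)."""
    if nfe <= 0:
        raise ValueError("nfe must be positive")
    return [i / nfe for i in range(nfe + 1)]


def step_size(nfe):
    """Time advanced per function evaluation on the uniform grid."""
    return 1.0 / nfe


def relative_cost(nfe, reference=32):
    """Network calls relative to a reference NFE, assuming one call per step."""
    return nfe / reference


if __name__ == "__main__":
    for nfe in (8, 16, 32, 64, 128):
        print(f"NFE={nfe:3d}  step={step_size(nfe):.5f}  "
              f"cost vs NFE=32: {relative_cost(nfe):.2f}x")
```

Under these assumptions, NFE=8 takes a quarter of the network calls of NFE=32, and NFE=128 takes four times as many.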

Sample 1 — Zero-Shot Voice Cloning

Prompt Text: "Some call me nature, others call me Mother Nature."

Synthesized Text: "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."

Prompt	NFE=8	NFE=16	NFE=32 ★	NFE=64	NFE=128

Sample 2 — LibriSpeech (Speaker 1089)

Prompt & Target Text: "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour-fattened sauce."

Ground Truth	NFE=8	NFE=16	NFE=32 ★	NFE=64	NFE=128