G-DFlow-TTS Demo

Comparing Zero-Shot Text-to-Speech Models: CosyVoice 2, F5-TTS, MaskGCT, and G-DFlow-TTS

Abstract

Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as conditional infilling of masked acoustic representations, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a continuous-time Markov chain (CTMC) framework for discrete generation, a natural fit. However, for TTS, inference-time control for stable low-step conditional infilling remains underexplored. We propose G-DFlow-TTS, a DFM-based alignment-free TTS model with an inference-time stack: (i) predictor-free guidance (PFG) via CTMC rate blending, (ii) conditional coupling to construct conditional probability paths, and (iii) schedule-constrained remasking that adds token-to-mask transitions to revise early errors. Together, these mechanisms improve conditional control and robustness for practical alignment-free DFM-TTS.
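
As a rough illustration of items (i) and (iii), the sketch below runs Euler steps of a masked CTMC sampler over a toy vocabulary. It is not the authors' implementation. The geometric form of the blend, the guidance weight `w`, the linear schedule, the fixed remasking probability (a stand-in for the schedule constraint, which is not specified here), and every function name are assumptions made for the example.

```python
import numpy as np

MASK = -1  # hypothetical id for the mask state


def blend_rates(p_cond, p_uncond, w):
    """Geometric blend of conditional and unconditional token posteriors.

    On a masked path the mask->token rate is proportional to the posterior
    over clean tokens, so blending rates reduces to blending posteriors.
    The result is renormalized (a simplification) so the total unmasking
    rate still follows the schedule. w = 0 recovers the conditional model.
    """
    log_p = (1.0 + w) * np.log(p_cond + 1e-12) - w * np.log(p_uncond + 1e-12)
    log_p -= log_p.max(axis=-1, keepdims=True)
    p = np.exp(log_p)
    return p / p.sum(axis=-1, keepdims=True)


def euler_step(tokens, p, unmask_prob, remask_prob, rng):
    """One Euler step: unmask some masked positions, remask some decoded ones."""
    out = tokens.copy()
    masked = tokens == MASK
    jump = masked & (rng.random(tokens.shape) < unmask_prob)
    for i in np.flatnonzero(jump):
        out[i] = rng.choice(p.shape[-1], p=p[i])
    # Token-to-mask transitions let later steps revise early decisions.
    back = (~masked) & (rng.random(tokens.shape) < remask_prob)
    out[back] = MASK
    return out


def sample(p_cond, p_uncond, nfe, w, remask_prob=0.0, seed=0):
    """Run `nfe` uniform steps from all-mask to fully decoded tokens.

    A real model would recompute p_cond and p_uncond from the current
    tokens at every step; fixed toy arrays are used here.
    """
    rng = np.random.default_rng(seed)
    tokens = np.full(p_cond.shape[0], MASK)
    p = blend_rates(p_cond, p_uncond, w)
    for i in range(nfe):
        last = i == nfe - 1
        # Linear schedule kappa_t = t: h / (1 - t) with t = i/nfe, h = 1/nfe.
        unmask_prob = 1.0 / (nfe - i)
        tokens = euler_step(tokens, p, unmask_prob, 0.0 if last else remask_prob, rng)
    return tokens


p_c = np.array([[0.7, 0.2, 0.1]] * 6)   # toy conditional posteriors
p_u = np.array([[0.4, 0.3, 0.3]] * 6)   # toy unconditional posteriors
decoded = sample(p_c, p_u, nfe=8, w=1.0, remask_prob=0.1)
```

Remasking is switched off on the final step so that no position is left masked in the output.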

Model Overview

G-DFlow-TTS Architecture

Figure 1: Overview of the G-DFlow-TTS architecture. The model operates on discrete speech tokens using Discrete Flow Matching with predictor-free guidance, conditional coupling, and schedule-constrained remasking.

Baseline Comparison

CosyVoice 2

Scalable Multilingual TTS with LLM-powered Zero-Shot Cloning

  • Architecture: LLM + Flow-Matching Decoder
  • Method: LLM-based Semantic Generation
  • Training Data: Proprietary multilingual data
  • Features: 9 languages, 3-sec cloning, streaming
  • Paper: arXiv:2412.10117

F5-TTS

A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

  • Architecture: Diffusion Transformer (DiT)
  • Method: Flow Matching
  • Training Data: Emilia (~100K hours)
  • Features: Non-autoregressive, multilingual
  • NFE (In this Demo): 32
  • Paper: arXiv:2410.06885

MaskGCT

Zero-Shot TTS with Masked Generative Codec Transformer

  • Architecture: Masked Generative Transformer
  • Method: Mask-and-Predict
  • Training Data: Emilia (~100K hours)
  • Features: Non-autoregressive, no text-speech alignment
  • Paper: arXiv:2409.00750

Zero-Shot Voice Cloning

This section tests each model's ability to clone a voice from a short reference clip. All models use an infilling approach, which benefits from knowing the prompt transcript.
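
One way to picture the infilling setup: the prompt's speech tokens are kept, the target region is filled with mask tokens, and the text condition is the prompt transcript followed by the target text. The sketch below is illustrative only. The mask id, the `InfillInput` and `build_infill_input` names, and the length heuristic (scaling the prompt length by the character ratio of the two texts) are assumptions, not details taken from any of the four models.

```python
from dataclasses import dataclass
from typing import List

MASK_ID = -1  # hypothetical id of the mask token


@dataclass
class InfillInput:
    text: str          # prompt transcript followed by target text
    tokens: List[int]  # prompt tokens followed by masked target slots
    prompt_len: int    # number of leading positions that stay fixed


def build_infill_input(prompt_text: str, prompt_tokens: List[int],
                       target_text: str) -> InfillInput:
    if not prompt_text or not prompt_tokens:
        raise ValueError("prompt text and tokens must be non-empty")
    # Assumed heuristic: target length scales with the character-count ratio.
    ratio = len(target_text) / len(prompt_text)
    target_len = max(1, round(len(prompt_tokens) * ratio))
    return InfillInput(
        text=f"{prompt_text} {target_text}",
        tokens=list(prompt_tokens) + [MASK_ID] * target_len,
        prompt_len=len(prompt_tokens),
    )


example = build_infill_input("abcd", [1, 2, 3, 4, 5, 6, 7, 8], "ab")
```

Without the prompt transcript, the model would have no text to pair with the fixed prompt region, which is why the transcript helps.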

Sample 1

Target Text: "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
Prompt Audio | CosyVoice 2 | F5-TTS | MaskGCT | G-DFlow-TTS (Ours)

Sample 2 (Long text)

Target Text: "Throughout history, great civilizations have risen and fallen, each leaving behind monuments, writings, and cultural achievements that continue to influence our modern world in countless ways."
Prompt Audio | CosyVoice 2 | F5-TTS | MaskGCT | G-DFlow-TTS (Ours)

Sample 3

Prompt Text: "Kids are talking by the door."
Target Text: "The weather forecast predicts sunny skies for the remainder of the week."
Prompt Audio | CosyVoice 2 | F5-TTS | MaskGCT | G-DFlow-TTS (Ours)

Sample 4

Prompt Text: "Kids are talking by the door."
Target Text: "Please take a moment to relax and breathe deeply before continuing."
Prompt Audio | CosyVoice 2 | F5-TTS | MaskGCT | G-DFlow-TTS (Ours)

LibriSpeech Test-Clean

Reconstruction evaluation on LibriSpeech test-clean. The prompt audio and text are identical to the target (same speaker, same content), testing how well models can reproduce the original speech characteristics.

Sample 1 (2830-3980-0000)

Prompt Text: "In every way they sought to undermine the authority of Saint Paul."
Target Text: "In every way they sought to undermine the authority of Saint Paul."
Ground Truth | CosyVoice 2 | F5-TTS | MaskGCT | G-DFlow-TTS (Ours)

Sample 2 (1089-134686-0000)

Prompt Text: "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour-fattened sauce."
Target Text: "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour-fattened sauce."
Ground Truth | CosyVoice 2 | F5-TTS | MaskGCT | G-DFlow-TTS (Ours)

Sample 3 (1580-141083-0000)

Prompt Text: "I will endeavour in my statement to avoid such terms as would serve to limit the events to any particular place or give a clue as to the people concerned."
Target Text: "I will endeavour in my statement to avoid such terms as would serve to limit the events to any particular place or give a clue as to the people concerned."
Ground Truth | CosyVoice 2 | F5-TTS | MaskGCT | G-DFlow-TTS (Ours)

NFE Comparison (G-DFlow-TTS)

How does the number of function evaluations (NFE, i.e. the number of sampling steps) affect synthesis quality? A higher NFE generally yields higher-quality audio at the cost of slower inference.
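
To make the trade-off concrete: on a uniform time grid, NFE fixes the step size at 1/NFE, and each step costs one forward pass of the model. The sketch below assumes a linear unmasking schedule, which is an assumption for illustration rather than a stated property of G-DFlow-TTS, and computes how many of a hypothetical 200 masked tokens would be revealed per step on average.

```python
from typing import List


def expected_unmasked_per_step(num_tokens: int, nfe: int) -> List[float]:
    """Expected number of tokens revealed at each of `nfe` uniform steps.

    Assumes a linear schedule kappa_t = t, under which a token still masked
    at time t is unmasked by t + h with probability h / (1 - t).
    """
    remaining = float(num_tokens)
    revealed = []
    for i in range(nfe):
        prob = 1.0 / (nfe - i)  # equals h / (1 - t) with t = i/nfe, h = 1/nfe
        step = remaining * prob
        revealed.append(step)
        remaining -= step
    return revealed


per_step = {nfe: expected_unmasked_per_step(200, nfe)[0]
            for nfe in (8, 16, 32, 64, 128)}
```

Under these assumptions each step reveals about 200/NFE tokens, so NFE=8 commits roughly 25 tokens per step while NFE=128 commits fewer than 2, leaving the model more opportunities to condition on earlier decisions.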

Sample 1 — Zero-Shot Voice Cloning

Prompt Audio: "Some call me nature, others call me Mother Nature."
Synthesized Text: "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
Prompt | NFE=8 | NFE=16 | NFE=32 ★ | NFE=64 | NFE=128

Sample 2 — LibriSpeech (Speaker 1089)

Prompt & Target Text: "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour-fattened sauce."
Ground Truth | NFE=8 | NFE=16 | NFE=32 ★ | NFE=64 | NFE=128