Comparing Zero-Shot Text-to-Speech Models: CosyVoice 2, F5-TTS, MaskGCT, and G-DFlow-TTS
Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as conditional infilling of masked acoustic representations, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a continuous-time Markov chain (CTMC) framework for discrete generation, a natural fit. However, for TTS, inference-time control for stable low-step conditional infilling remains underexplored. We propose G-DFlow-TTS, a DFM-based alignment-free TTS model with an inference-time stack: (i) predictor-free guidance (PFG) via CTMC rate blending, (ii) conditional coupling to construct conditional probability paths, and (iii) schedule-constrained remasking that adds token-to-mask transitions to revise early errors. Together, these mechanisms improve conditional control and robustness for practical alignment-free DFM-TTS.
Figure 1: Overview of the G-DFlow-TTS architecture. The model operates on discrete speech tokens using Discrete Flow Matching with predictor-free guidance, conditional coupling and schedule-constrained remasking.
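To make item (i) more concrete, here is a minimal sketch of guidance by rate blending for a single masked position. The abstract does not state the blending rule, so the geometric form used here (conditional rate raised to `gamma`, unconditional rate raised to `1 - gamma`) is an assumption borrowed from the general discrete-guidance literature, not necessarily the rule used in G-DFlow-TTS. The names `blend_rates`, `jump_probs`, `gamma`, `eps`, and the toy rate values are all hypothetical.

```python
def blend_rates(rate_cond, rate_uncond, gamma, eps=1e-12):
    """Blend two sets of nonnegative CTMC jump rates entry by entry.

    ASSUMPTION: geometric blending, r = r_cond**gamma * r_uncond**(1 - gamma).
    gamma = 1 recovers the conditional rates, gamma = 0 the unconditional
    ones, and gamma > 1 pushes further toward the condition.
    Rates are floored at eps so a zero rate never hits a negative exponent.
    """
    return [
        max(rc, eps) ** gamma * max(ru, eps) ** (1.0 - gamma)
        for rc, ru in zip(rate_cond, rate_uncond)
    ]


def jump_probs(rates, h):
    """One Euler step of size h for a position that is currently masked.

    Returns (probability of staying masked, per-token jump probabilities).
    The total leave probability is capped at 1 so the result is a valid
    distribution even for large steps.
    """
    total = sum(rates)
    if total == 0.0:
        return 1.0, [0.0] * len(rates)
    leave = min(1.0, h * total)
    return 1.0 - leave, [leave * r / total for r in rates]


# Toy rates from the mask state to three candidate tokens (made-up numbers).
cond = [0.6, 0.3, 0.1]     # with text/prompt conditioning
uncond = [0.3, 0.4, 0.3]   # with conditioning dropped

guided = blend_rates(cond, uncond, gamma=2.0)
stay, probs = jump_probs(guided, h=0.1)
print("guided rates:", guided)
print("stay masked:", stay, "jump probs:", probs)
```

In this toy case the token favored by the conditional model receives a larger guided rate than under conditioning alone, while the other tokens are suppressed. That is the qualitative effect guidance is meant to have.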
CosyVoice 2: Scalable Multilingual TTS with LLM-powered Zero-Shot Cloning
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
MaskGCT: Zero-Shot TTS with Masked Generative Codec Transformer
This section tests the models' ability to clone a voice from a short reference audio clip. All models use an infilling approach, which benefits from knowing the prompt transcript.
| Prompt Audio | CosyVoice 2 | F5-TTS | MaskGCT | G-DFlow-TTS (Ours) |
|---|---|---|---|---|
Reconstruction evaluation on LibriSpeech test-clean. The prompt audio and text are identical to the target (same speaker, same content), testing how well models can reproduce the original speech characteristics.
| Ground Truth | CosyVoice 2 | F5-TTS | MaskGCT | G-DFlow-TTS (Ours) |
|---|---|---|---|---|
How does the number of function evaluations (NFE, i.e., the number of sampling steps) affect synthesis quality? Higher NFE produces higher-quality audio at the cost of slower inference.
| Prompt | NFE=8 | NFE=16 | NFE=32 ★ | NFE=64 | NFE=128 |
|---|---|---|---|---|---|
| Ground Truth | NFE=8 | NFE=16 | NFE=32 ★ | NFE=64 | NFE=128 |
|---|---|---|---|---|---|
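One way to see why low NFE is the harder regime is to count how many tokens must be committed in parallel at each step. The sketch below assumes a mask-source sampler with uniform Euler steps of size `h = 1/NFE` and a linear scheduler `kappa(t) = t`, under which a still-masked token is revealed with probability `h / (1 - t)`. The page does not state the scheduler, so this choice is an assumption, and the function name `unmask_schedule` is hypothetical.

```python
def unmask_schedule(nfe):
    """Expected unmasking profile for a given number of function evaluations.

    ASSUMPTION: linear scheduler kappa(t) = t with uniform steps h = 1/nfe.
    Returns one (t, per_token_unmask_prob, expected_fraction_revealed) tuple
    per step, where the fraction is relative to the full sequence length.
    """
    h = 1.0 / nfe
    masked = 1.0  # expected fraction of the sequence still masked
    rows = []
    for k in range(nfe):
        t = k * h
        # Probability that a still-masked token is revealed in this step.
        p = min(1.0, h / (1.0 - t))
        revealed = masked * p
        masked -= revealed
        rows.append((t, p, revealed))
    return rows


for nfe in (8, 16, 32, 64, 128):
    rows = unmask_schedule(nfe)
    per_step = rows[0][2]
    print(f"NFE={nfe:>3}: about {100 * per_step:.2f}% of tokens revealed per step")
```

Under these assumptions every step reveals roughly `1/NFE` of the sequence in expectation, so fewer steps mean more tokens are decided simultaneously without seeing each other.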