High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

doi:10.48550/arXiv.2406.17310

We propose a novel two-stage text-to-speech (TTS) framework that uses two types of discrete tokens, semantic and acoustic, for high-fidelity speech synthesis. It has two core components: the Interpreting module, which converts text and a speech prompt into semantic tokens, focusing on linguistic content and alignment, and the Speaking module, which captures the timbre of the target voice and generates acoustic tokens from the semantic tokens, enriching speech reconstruction. The Interpreting stage employs a transducer for its robustness in aligning text to speech. The Speaking stage, in contrast, uses a Conformer-based architecture integrated with a Grouped Masked Language Model (G-MLM) to boost computational efficiency. Our experiments verify that the proposed structure surpasses conventional models in the zero-shot scenario in terms of speech quality and speaker similarity.

Publication:

arXiv e-prints

Pub Date:

June 2024

DOI:

10.48550/arXiv.2406.17310

arXiv:

arXiv:2406.17310

Bibcode:

2024arXiv240617310Y

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print:

Accepted by Interspeech 2024
