The Codec Language Model-based Zero-Shot Spontaneous Style TTS System for CoVoC Challenge 2024

doi:10.48550/arXiv.2412.01100

This paper describes our zero-shot spontaneous style TTS system for the ISCSLP 2024 Conversational Voice Clone Challenge (CoVoC). We propose a LLaMA-based codec language model with a delay pattern to achieve spontaneous style voice cloning. To improve speech intelligibility, we introduce a Classifier-Free Guidance (CFG) strategy in the language model to strengthen conditional guidance on token prediction. To generate high-quality utterances, we apply effective data preprocessing and fine-tune the model on selected high-quality spontaneous speech data. The official evaluations in the CoVoC constrained track show that our system achieves the best speech naturalness MOS of 3.80 and competitive results in speech quality and speaker similarity.
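The abstract does not give the exact CFG formulation used in the system. As an illustration only, the sketch below shows the commonly used form of classifier-free guidance applied to next-token logits, where the conditional prediction is pushed away from the unconditional one by a guidance scale. The function names `cfg_logits` and `softmax`, the three-token vocabulary, and the scale value are hypothetical and not taken from the paper.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Blend logits as uncond + scale * (cond - uncond).

    This is the common CFG form; it is an assumption here, not the
    paper's confirmed formulation. scale == 1.0 returns the conditional
    logits unchanged; scale > 1.0 strengthens the conditioning.
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a hypothetical 3-token codec vocabulary.
cond = [2.0, 1.0, 0.0]    # model output with text/speaker conditioning
uncond = [1.0, 1.0, 1.0]  # model output with the conditioning dropped

guided = cfg_logits(cond, uncond, scale=2.0)
probs = softmax(guided)
```

With a scale above 1, probability mass moves toward the tokens that the conditioning favors, which is one plausible reading of how stronger conditional guidance could help intelligibility.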

Publication: arXiv e-prints

Pub Date: December 2024

DOI: 10.48550/arXiv.2412.01100

arXiv: arXiv:2412.01100

Bibcode: 2024arXiv241201100Z

Keywords: Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print: Accepted by ISCSLP 2024
