Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS
Abstract
This research introduces a comprehensive Bahasa text-to-speech (TTS) dataset and a novel TTS model, EnGen-TTS, designed to enhance the quality and versatility of synthetic speech in the Bahasa language. The dataset, spanning \textasciitilde55.0 hours and 52K audio recordings, integrates diverse textual sources, ensuring linguistic richness. A meticulous recording setup captures the nuances of Bahasa phonetics, employing professional equipment to ensure high-fidelity audio samples. Statistical analysis reveals the dataset's scale and diversity, laying the foundation for model training and evaluation. The proposed EnGen-TTS model performs better than established baselines, achieving a Mean Opinion Score (MOS) of 4.45 $\pm$ 0.13. Additionally, our investigation on real-time factor and model size highlights EnGen-TTS as a compelling choice, with efficient performance. This research marks a significant advancement in Bahasa TTS technology, with implications for diverse language applications. Link to Generated Samples: \url{https://bahasa-harmony-comp.vercel.app/}
- Publication:
-
arXiv e-prints
- Pub Date:
- October 2024
- DOI:
- 10.48550/arXiv.2410.06608
- arXiv:
- arXiv:2410.06608
- Bibcode:
- 2024arXiv241006608K
- Keywords:
-
- Computer Science - Sound;
- Computer Science - Artificial Intelligence;
- Electrical Engineering and Systems Science - Audio and Speech Processing
- E-Print:
- EMNLP 2024