Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

doi:10.48550/arXiv.1810.12598

The state of the art in text-to-speech (TTS) synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from slow sequential inference, while their parallel versions are difficult to train and even more computationally expensive. Meanwhile, generative adversarial networks (GANs) have achieved impressive results in image generation and are making their way into audio applications; parallel inference is among their attractive properties. By adopting recent advances in GAN training techniques, this investigation studies waveform generation for TTS in two domains (speech signal and glottal excitation). Listening test results show that while direct waveform generation with a GAN is still far behind WaveNet, a GAN-based glottal excitation model can achieve quality and voice similarity on par with a WaveNet vocoder.

Publication: arXiv e-prints

Pub Date: October 2018

DOI: 10.48550/arXiv.1810.12598

arXiv: arXiv:1810.12598

Bibcode: 2018arXiv181012598J

Keywords: Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Sound; Statistics - Machine Learning

E-Print: Submitted to ICASSP 2019
