Fitting New Speakers Based on a Short Untranscribed Sample

doi:10.48550/arXiv.1802.06984

Learning-based text-to-speech (TTS) systems have the potential to generalize from one speaker to the next and thus to require only a relatively short sample of any new voice. However, this promise is currently largely unrealized. We present a method designed to capture a new speaker from a short untranscribed audio sample. It employs an additional network that, given an audio sample, places the speaker in the embedding space. This network is trained as part of the speech synthesis system using various consistency losses. Our results demonstrate greatly improved performance both on the dataset speakers and, more importantly, when fitting new voices, even from very short samples.
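The abstract does not specify the encoder architecture or the exact consistency losses, so the following is only a toy sketch of the general idea: a function maps a variable-length audio sample to a fixed-size speaker embedding, and a consistency term penalizes disagreement between embeddings of two segments from the same speaker. Every name here (`embed`, `consistency_loss`, the linear projection, mean pooling, the squared-distance loss) is a hypothetical choice made for illustration, not the authors' method.

```python
import math
import random


def embed(frames, weights):
    """Map a variable-length list of feature frames to one fixed-size vector.

    Hypothetical encoder (an assumption, not from the paper): a linear
    projection of each frame, mean pooling over time, then L2 normalization.
    """
    pooled = [0.0] * len(weights)
    for frame in frames:
        for i, row in enumerate(weights):
            pooled[i] += sum(w * x for w, x in zip(row, frame))
    pooled = [v / len(frames) for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]


def consistency_loss(emb_a, emb_b):
    """Squared Euclidean distance between two embeddings (illustrative loss)."""
    return sum((a - b) ** 2 for a, b in zip(emb_a, emb_b))


rng = random.Random(0)
feat_dim, emb_dim = 4, 3

# Random projection weights stand in for a trained network.
weights = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(emb_dim)]

# Toy "audio": 20 random feature frames standing in for one speaker's sample.
sample = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(20)]
first_half, second_half = sample[:10], sample[10:]

# Two segments of the same sample should map to nearby embeddings;
# training would minimize this value alongside the synthesis loss.
loss = consistency_loss(embed(first_half, weights), embed(second_half, weights))
print(f"consistency loss between segments: {loss:.4f}")
```

In a real system the embedding would condition the synthesizer, and no transcript of the sample would be needed because the encoder sees only audio.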

Publication: arXiv e-prints

Pub Date: February 2018

DOI: 10.48550/arXiv.1802.06984

arXiv: arXiv:1802.06984

Bibcode: 2018arXiv180206984N

Keywords: Computer Science - Machine Learning; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
