Neural Network-Based Modeling of Phonetic Durations

doi:10.48550/arXiv.1909.03030

A deep neural network (DNN)-based model has been developed to predict non-parametric distributions of phoneme durations in specified phonetic contexts, and has been used to explore which factors influence durations most. The major factors in US English are pre-pausal lengthening, lexical stress, and speaking rate. The model can be used to check that text-to-speech (TTS) training speech follows the script and that words are pronounced as expected. Duration prediction is poorer for automatic speech recognition (ASR) training speech, because such a corpus typically consists of single utterances from many speakers and is often noisy or casually spoken. Low-probability durations in ASR training material nevertheless mostly correspond to non-standard speech, some of it containing disfluencies. Children's speech is disproportionately represented among these utterances, since children show much more variation in timing.
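The checking step described in the abstract, flagging phones whose observed duration is improbable under the predicted distribution, might be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the non-parametric distribution is represented, so the sketch assumes a softmax over fixed-width duration bins. The function names, the 10 ms bin width, and the 0.01 threshold are all hypothetical.

```python
import math


def softmax(logits):
    """Turn raw per-bin scores into a probability distribution over duration bins."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def duration_bin(duration_ms, bin_width_ms=10.0, num_bins=50):
    """Map a duration to a bin index; the last bin absorbs all longer durations.

    The 10 ms bin width is an assumption made for this sketch.
    """
    return min(int(duration_ms // bin_width_ms), num_bins - 1)


def flag_unlikely(phones, threshold=0.01, bin_width_ms=10.0):
    """Return (label, duration_ms, probability) for improbable durations.

    `phones` is a list of (label, observed_duration_ms, logits) tuples, where
    `logits` stands in for a model's per-bin output for that phone in context.
    The threshold is an illustrative choice, not a value from the paper.
    """
    flagged = []
    for label, duration_ms, logits in phones:
        probs = softmax(logits)
        idx = duration_bin(duration_ms, bin_width_ms, len(probs))
        if probs[idx] < threshold:
            flagged.append((label, duration_ms, probs[idx]))
    return flagged


# Hypothetical model output: a distribution peaked near 80 ms (bin 8).
example_logits = [-((i - 8) ** 2) / 4.0 for i in range(50)]
suspicious = flag_unlikely([
    ("AE", 80.0, example_logits),   # near the peak, not flagged
    ("AE", 400.0, example_logits),  # far in the tail, flagged
])
```

In a corpus check of this kind, the flagged phones would be candidates for manual review, for example to find places where the speech departs from the script.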

Publication: arXiv e-prints

Pub Date: September 2019

DOI: 10.48550/arXiv.1909.03030

arXiv: arXiv:1909.03030

Bibcode: 2019arXiv190903030W

Keywords: Computer Science - Sound; Computer Science - Machine Learning; Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print: 5 pages, 5 figures
