HiFi-Glot: Neural Formant Synthesis with Differentiable Resonant Filters
Abstract
We introduce an end-to-end neural speech synthesis system that uses the source-filter model of speech production. Specifically, we apply differentiable resonant filters to a glottal waveform generated by a neural vocoder. The aim is to obtain a controllable synthesiser, similar to classic formant synthesis but with much higher perceptual quality, filling a research gap in current neural waveform generators and responding to hitherto unmet needs in the speech sciences. Our setup generates audio from a core set of phonetically meaningful speech parameters, with the filters providing direct control over formant frequencies in synthesis. Direct synthesis control is a key feature for reliable stimulus creation in speech science experiments. We show that the proposed source-filter method gives better perceptual quality than the industry standard for formant manipulation (i.e., Praat), whilst being competitive in terms of formant frequency control accuracy.
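To make the source-filter idea concrete, the following is a minimal sketch of classic formant synthesis: an excitation signal passed through a cascade of two-pole resonators, one per formant. It is not the system described in the abstract. It assumes a plain impulse train in place of the neural glottal waveform, uses the textbook digital resonator recurrence rather than the authors' differentiable filters, and runs in NumPy without gradients. All function names, formant values, and bandwidths are illustrative assumptions.

```python
import numpy as np


def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of a two-pole resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    Textbook digital resonator with unity gain at DC (a + b + c = 1).
    Assumption: this stands in for the paper's differentiable filters.
    """
    c = -np.exp(-2.0 * np.pi * bw_hz / fs)
    b = 2.0 * np.exp(-np.pi * bw_hz / fs) * np.cos(2.0 * np.pi * freq_hz / fs)
    a = 1.0 - b - c
    return a, b, c


def resonate(x, freq_hz, bw_hz, fs):
    """Run one resonator over the signal x, sample by sample."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y = np.zeros(len(x))
    y1 = y2 = 0.0  # previous two output samples
    for n in range(len(x)):
        y[n] = a * x[n] + b * y1 + c * y2
        y2, y1 = y1, y[n]
    return y


def synthesise(f0_hz, formants, fs=16000, dur_s=0.5):
    """Impulse-train source filtered by a cascade of formant resonators.

    `formants` is a list of (centre frequency, bandwidth) pairs in Hz.
    Assumption: the impulse train replaces the neural glottal source.
    """
    n = int(fs * dur_s)
    source = np.zeros(n)
    period = int(round(fs / f0_hz))
    source[::period] = 1.0  # one pulse per pitch period
    out = source
    for freq_hz, bw_hz in formants:
        out = resonate(out, freq_hz, bw_hz, fs)
    return out


def resonator_peak_hz(freq_hz, bw_hz, fs, n_points=4096):
    """Frequency (Hz) at which one resonator's magnitude response peaks."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    w = np.linspace(0.0, np.pi, n_points)
    z = np.exp(-1j * w)
    h = a / (1.0 - b * z - c * z**2)
    return w[np.argmax(np.abs(h))] * fs / (2.0 * np.pi)


# Hypothetical vowel-like setting: three formants with assumed bandwidths.
audio = synthesise(100.0, [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)])
```

Changing a centre frequency in the `formants` list moves the corresponding spectral peak directly, which is the kind of explicit formant control the abstract refers to. In a differentiable setting the same recurrence would be expressed in an autodiff framework so that gradients can flow back to the filter parameters.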
- Publication: arXiv e-prints
- Pub Date: September 2024
- arXiv: arXiv:2409.14823
- Bibcode: 2024arXiv240914823J
- Keywords: Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
- E-Print: Submitted to ICASSP 2025