HiFi-Glot: Neural Formant Synthesis with Differentiable Resonant Filters

doi:10.48550/arXiv.2409.14823

We introduce an end-to-end neural speech synthesis system that uses the source-filter model of speech production. Specifically, we apply differentiable resonant filters to a glottal waveform generated by a neural vocoder. The aim is to obtain a controllable synthesiser, similar to classic formant synthesis but with much higher perceptual quality, filling a research gap in current neural waveform generators and responding to hitherto unmet needs in the speech sciences. Our setup generates audio from a core set of phonetically meaningful speech parameters, with the filters providing direct control over formant frequencies in synthesis. Direct synthesis control is a key feature for reliable stimulus creation in speech science experiments. We show that the proposed source-filter method gives better perceptual quality than the industry standard for formant manipulation (i.e., Praat), whilst being competitive in terms of formant frequency control accuracy.
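As a rough illustration of the source-filter idea described above, the sketch below passes a simple excitation through a cascade of two-pole resonators, one per formant. It is not the paper's implementation: the impulse-train source stands in for the neural glottal waveform, the coefficients follow the classic Klatt-style digital resonator, and all function names and the example formant values are assumptions made for illustration. In a differentiable setting the same recursion would be expressed in an autodiff framework so that gradients can flow through the filter parameters.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of a Klatt-style two-pole resonator with unity gain at DC.

    Implements y[n] = a*x[n] + b*y[n-1] + c*y[n-2].
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a = 1.0 - b - c
    return a, b, c


def apply_resonator(signal, coeffs):
    """Run the two-pole recursion over a list of samples."""
    a, b, c = coeffs
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Placeholder excitation: one unit impulse per pitch period.

    Assumption: this stands in for the glottal waveform that the paper
    obtains from a neural vocoder.
    """
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize(f0_hz, formants, duration_s=0.05, sample_rate=16000):
    """Filter the excitation through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        coeffs = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
        signal = apply_resonator(signal, coeffs)
    return signal


# Illustrative formant frequencies and bandwidths in Hz (not taken from the paper).
audio = synthesize(120.0, [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)])
```

Changing a `(frequency, bandwidth)` pair moves the corresponding resonance directly, which is the kind of explicit formant control the abstract refers to.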

Publication: arXiv e-prints

Pub Date: September 2024

DOI: 10.48550/arXiv.2409.14823

arXiv: arXiv:2409.14823

Bibcode: 2024arXiv240914823J

Keywords: Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print: Submitted to ICASSP 2025
