High-Fidelity Neural Phonetic Posteriorgrams

doi:10.48550/arXiv.2402.17735

A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to their ability to disentangle pronunciation features from speaker identity, allowing accurate reconstruction of pronunciation (e.g., voice conversion) and coarse-grained pronunciation editing (e.g., foreign accent conversion). In this paper, we demonstrably improve the quality of PPGs to produce a state-of-the-art interpretable PPG representation. We train an off-the-shelf speech synthesizer using our PPG representation and show that high-quality PPGs yield independent control over pitch and pronunciation. We further demonstrate novel uses of PPGs, such as an acoustic pronunciation distance and fine-grained pronunciation control.
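As a concrete illustration of the definition above, a PPG can be stored as one categorical distribution per audio frame, and two time-aligned PPGs can then be compared frame by frame. The sketch below is a minimal example under stated assumptions, not the authors' implementation: the three-phoneme inventory, the function names (`to_ppg`, `js_divergence`, `ppg_distance`), and the choice of an averaged Jensen-Shannon divergence as the distance are all illustrative and are not taken from the abstract.

```python
import math

# Hypothetical phoneme inventory, for illustration only.
PHONEMES = ["aa", "k", "t"]


def softmax(logits):
    """Turn one frame of raw scores into a categorical distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def to_ppg(frame_logits):
    """Build a PPG: one distribution over PHONEMES per time frame."""
    return [softmax(frame) for frame in frame_logits]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    The value lies in [0, 1]: 0 for identical distributions and
    1 for distributions with disjoint support.
    """
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mix = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def ppg_distance(ppg_a, ppg_b):
    """Average per-frame divergence between two time-aligned PPGs.

    Assumption: both PPGs have the same number of frames. Real
    utterances would need to be aligned first.
    """
    if len(ppg_a) != len(ppg_b):
        raise ValueError("PPGs must have the same number of frames")
    per_frame = [js_divergence(p, q) for p, q in zip(ppg_a, ppg_b)]
    return sum(per_frame) / len(per_frame)


# Two frames of made-up classifier scores over PHONEMES.
ppg = to_ppg([[4.0, 0.5, 0.1], [0.2, 3.0, 0.3]])
same = ppg_distance(ppg, ppg)
disjoint = ppg_distance([[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]])
```

Each row of `ppg` sums to one, which is what makes the representation interpretable: every entry is the estimated probability of one phoneme at one frame. Averaging a bounded divergence keeps the distance between 0 and 1 regardless of utterance length.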

Publication:

arXiv e-prints

Pub Date:

February 2024

DOI:

10.48550/arXiv.2402.17735

arXiv:

arXiv:2402.17735

Bibcode:

2024arXiv240217735C

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing;
Computer Science - Sound

E-Print:

Accepted to ICASSP 2024 Workshop on Explainable Machine Learning for Speech and Audio
