Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis

doi:10.48550/arXiv.2406.05128

Training the linear prediction (LP) operator end-to-end for audio synthesis in modern deep learning frameworks is slow because of its recursive formulation. In addition, frame-wise approximation, used as an acceleration method, does not generalise well to test-time conditions, where LP is computed sample-wise. Efficient differentiable sample-wise LP for end-to-end training is the key to removing this barrier. We generalise the efficient time-invariant LP implementation from the GOLF vocoder to time-varying cases. Combining this with the classic source-filter model, we show that the improved GOLF learns LP coefficients and reconstructs the voice better than its frame-wise counterparts. Moreover, in our listening test, synthesised outputs from GOLF scored higher in quality ratings than the state-of-the-art differentiable WORLD vocoder.
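To make the recursion mentioned in the abstract concrete, here is a minimal sketch of sample-wise time-varying LP synthesis (an all-pole filter whose coefficients change at every sample). It is a naive NumPy reference loop, not the GOLF implementation. The function name `time_varying_lp`, the array shapes, and the sign convention y[n] = x[n] - sum_k a[n, k] * y[n - k] are assumptions made for illustration. Each output sample depends on the previous outputs, which is why the operator is slow to train when written this way.

```python
import numpy as np

def time_varying_lp(x, a):
    """Naive sample-wise time-varying all-pole filter (illustrative only).

    x : (T,) excitation signal.
    a : (T, p) LP coefficients, one set of p coefficients per sample.
    Assumed convention: y[n] = x[n] - sum_{k=1..p} a[n, k-1] * y[n-k],
    with zero initial conditions.
    """
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    T, p = a.shape
    y = np.zeros(T)
    for n in range(T):
        acc = x[n]
        # Only past outputs that exist contribute (zero initial state).
        for k in range(1, min(p, n) + 1):
            acc -= a[n, k - 1] * y[n - k]
        y[n] = acc
    return y

# Impulse through a first-order filter with a constant coefficient of -0.5.
x = np.array([1.0, 0.0, 0.0, 0.0])
a = np.full((4, 1), -0.5)
y = time_varying_lp(x, a)  # 1.0, 0.5, 0.25, 0.125
```

A frame-wise approximation would instead hold each row of `a` fixed over a whole frame, which is the mismatch with sample-wise evaluation that the abstract refers to.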

Publication:

arXiv e-prints

Pub Date:

June 2024

DOI:

10.48550/arXiv.2406.05128

arXiv:

arXiv:2406.05128

Bibcode:

2024arXiv240605128Y

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing;
Computer Science - Sound

E-Print:

Accepted at Interspeech 2024
