Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

doi:10.48550/arXiv.2306.09417

With read-aloud speech synthesis achieving high naturalness scores, there is growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that learns to synthesise speech and gestures jointly. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach. Please see https://shivammehta25.github.io/Diff-TTSG/ for video examples, data, and code.

Publication:

arXiv e-prints

Pub Date:

June 2023

DOI:

10.48550/arXiv.2306.09417

arXiv:

arXiv:2306.09417

Bibcode:

2023arXiv230609417M

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing;
Computer Science - Artificial Intelligence;
Computer Science - Computer Vision and Pattern Recognition;
Computer Science - Human-Computer Interaction;
Computer Science - Machine Learning;
68T07 (Primary);
68T42 (Secondary);
I.2.7;
I.2.6;
G.3;
H.5.5

E-Print:

7 pages, 2 figures, presented at the ISCA Speech Synthesis Workshop (SSW) 2023
