EMA2S: An End-to-End Multimodal Articulatory-to-Speech System

doi:10.48550/arXiv.2102.03786

Speech synthesized from articulatory movements has real-world uses: for patients with vocal cord disorders, in situations requiring silent speech, and in high-noise environments. In this work, we present EMA2S, an end-to-end multimodal articulatory-to-speech system that directly converts articulatory movements to speech signals. We use a neural-network-based vocoder combined with multimodal joint training that incorporates spectrogram, mel-spectrogram, and deep features. The experimental results confirm that the multimodal approach of EMA2S outperforms the baseline system on both objective and subjective evaluation metrics. Moreover, the results demonstrate that joint training with mel-spectrogram and deep feature losses can effectively improve system performance.
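The abstract does not give the form of the joint training objective. As a rough illustration only, one common way to combine several feature representations is a weighted sum of per-representation distances. The sketch below assumes a mean absolute error for each term and hypothetical names (`joint_loss`, `weights`, the keys `spec`, `mel`, `deep`); none of these come from the paper, and this is not the authors' implementation.

```python
from typing import Dict, List

# A feature map is a list of time frames, each a list of bin values.
Frames = List[List[float]]


def mean_abs_error(predicted: Frames, target: Frames) -> float:
    """Mean absolute difference over all values of two equally shaped feature maps."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(predicted, target):
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("feature maps must not be empty")
    return total / count


def joint_loss(
    predicted: Dict[str, Frames],
    target: Dict[str, Frames],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-representation errors (assumed form, not from the paper)."""
    return sum(
        weight * mean_abs_error(predicted[name], target[name])
        for name, weight in weights.items()
    )


# Toy values standing in for spectrogram, mel-spectrogram, and deep features.
predicted = {"spec": [[1.0, 2.0]], "mel": [[0.0]], "deep": [[1.0]]}
target = {"spec": [[0.0, 0.0]], "mel": [[1.0]], "deep": [[1.0]]}
weights = {"spec": 1.0, "mel": 0.5, "deep": 1.0}  # illustrative weights only

loss = joint_loss(predicted, target, weights)
```

Setting a weight to zero drops that representation from the objective, which is one simple way to compare, for example, mel-spectrogram plus deep feature training against the full combination.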

Publication: arXiv e-prints

Pub Date: February 2021

DOI: 10.48550/arXiv.2102.03786

arXiv: arXiv:2102.03786

Bibcode: 2021arXiv210203786C

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing;
Computer Science - Machine Learning;
Computer Science - Sound;
Electrical Engineering and Systems Science - Signal Processing
