Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
Abstract
For personalized speech generation, a neural text-to-speech (TTS) model must work well with only limited data from a target speaker. To this end, the baseline TTS model needs to generalize well to out-of-domain data (i.e., the target speaker's speech). However, approaches to this out-of-domain generalization problem in TTS have yet to be thoroughly studied. In this work, we propose an effective pruning method for transformers, known as sparse attention, to improve the TTS model's generalization ability. In particular, we prune redundant connections from self-attention layers, namely those whose attention weights fall below a threshold. To flexibly determine the pruning strength when searching for the optimal degree of generalization, we also propose a new differentiable pruning method that allows the model to learn the thresholds automatically. Evaluations on zero-shot multi-speaker TTS verify the effectiveness of our method in terms of voice quality and speaker similarity.
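The threshold-based pruning described in the abstract can be illustrated on a single row of attention weights. The following is a minimal sketch, not the authors' implementation: the function names, the renormalization of surviving weights, and the sigmoid relaxation with a `temperature` parameter are all assumptions made for illustration. The abstract only states that weights below a threshold are pruned and that the thresholds can be learned through a differentiable method.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prune_attention(weights, threshold):
    """Hard pruning: zero out weights below `threshold`.

    Renormalizing the surviving weights so they sum to 1 is an
    assumption of this sketch, not something the abstract specifies.
    """
    kept = [w if w >= threshold else 0.0 for w in weights]
    total = sum(kept)
    if total == 0.0:
        # Everything was pruned: fall back to the unpruned row.
        return list(weights)
    return [w / total for w in kept]

def soft_prune_attention(weights, threshold, temperature=0.01):
    """Soft relaxation of the hard step (hypothetical).

    A sigmoid gate replaces the step function, so in an autodiff
    framework gradients could reach `threshold`. This is one common
    way to make a threshold learnable, not necessarily the paper's.
    """
    gates = [1.0 / (1.0 + math.exp(-(w - threshold) / temperature))
             for w in weights]
    gated = [w * g for w, g in zip(weights, gates)]
    total = sum(gated)
    return [x / total for x in gated]

# One query attending over four keys.
row = softmax([2.0, 1.0, 0.1, -1.0])
pruned = prune_attention(row, threshold=0.1)
soft = soft_prune_attention(row, threshold=0.1)
```

With a threshold of 0.1, the two smallest weights in `row` are removed by the hard variant and strongly attenuated by the soft one, while the remaining weights are rescaled to sum to 1. Lowering `temperature` makes the soft gate approach the hard step.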
- Publication: arXiv e-prints
- Pub Date: August 2023
- DOI:
- arXiv: arXiv:2308.14909
- Bibcode: 2023arXiv230814909Y
- Keywords: Computer Science - Sound; Computer Science - Machine Learning; Electrical Engineering and Systems Science - Audio and Speech Processing
- E-Print: INTERSPEECH 2023