Preset-Voice Matching for Privacy Regulated Speech-to-Speech Translation Systems
Abstract
In recent years, there has been increased demand for speech-to-speech translation (S2ST) systems in industry settings. Although successfully commercialized, cloning-based S2ST systems expose their distributors to liabilities when misused by individuals and can infringe on personality rights when exploited by media organizations. This work proposes a regulated S2ST framework called Preset-Voice Matching (PVM). PVM removes cross-lingual voice cloning in S2ST by first matching the input voice to a similar prior consenting speaker voice in the target-language. With this separation, PVM avoids cloning the input speaker, ensuring PVM systems comply with regulations and reduce risk of misuse. Our results demonstrate PVM can significantly improve S2ST system run-time in multi-speaker settings and the naturalness of S2ST synthesized speech. To our knowledge, PVM is the first explicitly regulated S2ST framework leveraging similarly-matched preset-voices for dynamic S2ST tasks.
- Publication:
-
arXiv e-prints
- Pub Date:
- July 2024
- DOI:
- arXiv:
- arXiv:2407.13153
- Bibcode:
- 2024arXiv240713153P
- Keywords:
-
- Computer Science - Computation and Language;
- Computer Science - Cryptography and Security;
- Computer Science - Machine Learning;
- Computer Science - Sound;
- Electrical Engineering and Systems Science - Audio and Speech Processing
- E-Print:
- Accepted to the ACL PrivateNLP 2024 Workshop, 7 pages, 2 figures