AdaptVC: High Quality Voice Conversion with Adaptive Learning

doi:10.48550/arXiv.2501.01347

The goal of voice conversion is to transform the speech of a source speaker so that it sounds like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches use various methods to isolate the two, generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.
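The abstract names conditional flow matching but does not give its formulation. As a rough illustration only, the sketch below uses the generic straight-line probability path between a noise sample and a data sample, for which the regression target is the constant velocity x1 - x0. The function names, the toy 3-dimensional "feature", and the Euler sampler are assumptions made for this example; the paper's actual decoder, path parameterization, and cross-attention speaker conditioning are not reproduced here.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Return (x_t, target_velocity) for the straight-line path from x0 to x1.

    x0 is a noise sample, x1 a data sample, and t lies in [0, 1].
    A real model would be trained to regress target_velocity from (x_t, t)
    plus any conditioning; that network is omitted in this sketch.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                  # toy "data" vector (hypothetical)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample

x_t, v = cfm_training_pair(x0, x1, 0.25)

# Stand-in for a trained network: an oracle that returns the true velocity.
x_end = euler_sample(lambda x, t: v, x0, steps=10)
```

With the oracle velocity, Euler integration carries the noise sample to the data sample up to floating-point error, which is the behavior a trained velocity network approximates; few integration steps are needed because the path is straight.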

Publication: arXiv e-prints

Pub Date: January 2025

DOI: 10.48550/arXiv.2501.01347

arXiv: arXiv:2501.01347

Bibcode: 2025arXiv250101347K

Keywords: Computer Science - Sound; Computer Science - Computation and Language; Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print: 4 pages, 3 figures. Audio samples are available on the demo page: https://mm.kaist.ac.kr/projects/AdaptVC
