FastSVC: Fast Cross-Domain Singing Voice Conversion with Feature-wise Linear Modulation

doi:10.48550/arXiv.2011.05731

This paper presents FastSVC, a lightweight cross-domain singing voice conversion (SVC) system that achieves high conversion performance with inference 4x faster than real time on CPUs. FastSVC uses a Conformer-based phoneme recognizer to extract singer-agnostic linguistic features from singing signals. A generator based on feature-wise linear modulation synthesizes the waveform directly from the linguistic features, leveraging information from sine-excitation signals and loudness features. The waveform generator can be trained conveniently using a multi-resolution spectral loss and an adversarial loss. Experimental results show that, compared with a computationally heavy baseline system, the proposed FastSVC system achieves comparable conversion performance in some scenarios and significantly better conversion performance in others. Moreover, FastSVC achieves desirable cross-lingual singing conversion performance. Its inference is 3x faster than the baseline system on GPUs and 70x faster on CPUs.
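As a rough illustration of the feature-wise linear modulation named in the abstract, the sketch below applies a per-channel scale and shift to a small sequence of feature frames. It is not the paper's generator. In feature-wise linear modulation the scale and shift are normally predicted from conditioning inputs (here that would be the sine-excitation and loudness features), whereas this sketch assumes fixed toy values. The names `film`, `hidden`, `gamma`, and `beta` are hypothetical.

```python
from typing import List


def film(
    features: List[List[float]],
    gamma: List[float],
    beta: List[float],
) -> List[List[float]]:
    """Feature-wise linear modulation: out[t][c] = gamma[c] * features[t][c] + beta[c]."""
    if len(gamma) != len(beta):
        raise ValueError("gamma and beta must have the same number of channels")
    out = []
    for frame in features:
        if len(frame) != len(gamma):
            raise ValueError("each frame must have one value per channel")
        # Scale and shift every channel of this time frame independently.
        out.append([g * x + b for x, g, b in zip(frame, gamma, beta)])
    return out


# Two time frames with three channels each (toy values, not from the paper).
hidden = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
# Assumption: constant scale and shift; a real model would predict these
# from the conditioning signal.
gamma = [2.0, 0.5, 1.0]
beta = [0.0, 1.0, -1.0]

modulated = film(hidden, gamma, beta)
print(modulated)
```

The modulation itself has no learned parameters of its own. All of the conditioning information enters through the scale and shift values, which is why the operation is cheap to apply at every layer.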

Publication: arXiv e-prints

Pub Date: November 2020

DOI: 10.48550/arXiv.2011.05731

arXiv: arXiv:2011.05731

Bibcode: 2020arXiv201105731L

Keywords: Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print: Accepted by IEEE International Conference on Multimedia and Expo (ICME) 2021
