On the Use of Audio Fingerprinting Features for Speech Enhancement with Generative Adversarial Network
Abstract
The advent of learning-based methods in speech enhancement has revived the need for robust and reliable training features that can compactly represent speech signals while preserving their vital information. Time-frequency domain features, such as the Short-Time Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficients (MFCC), are preferred in many approaches. While the MFCC provide a compact representation, they ignore the dynamics and distribution of energy in each mel-scale subband. In this work, a speech enhancement system based on a Generative Adversarial Network (GAN) is implemented and tested with a combination of Audio FingerPrinting (AFP) features obtained from the MFCC and the Normalized Spectral Subband Centroids (NSSC). The NSSC capture the locations of speech formants and complement the MFCC in a crucial way. In experiments with diverse speakers and noise types, GAN-based speech enhancement with the proposed AFP feature combination achieves the best objective performance while reducing memory requirements and training time.
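To make the idea of a subband centroid concrete, the following is a minimal sketch of how a normalized centroid per mel-scale subband might be computed for a single frame. It is not the authors' implementation: the function name `normalized_subband_centroids`, the rectangular (non-overlapping) mel-spaced bands, the Hann window, and the normalization of each centroid to [0, 1] within its band are all assumptions made for illustration, and the paper's exact definition of the NSSC may differ.

```python
import numpy as np


def hz_to_mel(f):
    # Common mel-scale formula (O'Shaughnessy variant).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def normalized_subband_centroids(frame, sample_rate, n_bands=8):
    """Illustrative (hypothetical) per-band spectral centroid, scaled to [0, 1].

    Assumptions: rectangular, non-overlapping mel-spaced bands and
    normalization by the band's lower edge and width.
    """
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # Band edges equally spaced on the mel scale from 0 Hz to Nyquist.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_bands + 1)
    edges = mel_to_hz(mel_edges)

    centroids = np.empty(n_bands)
    for b in range(n_bands):
        lo, hi = edges[b], edges[b + 1]
        mask = (freqs >= lo) & (freqs < hi)
        energy = power[mask].sum()
        if energy > 0:
            # Power-weighted mean frequency inside the band.
            c = (freqs[mask] * power[mask]).sum() / energy
        else:
            # No energy in the band: fall back to the band center.
            c = 0.5 * (lo + hi)
        centroids[b] = (c - lo) / (hi - lo)
    return centroids


# Usage: a 1 kHz tone sampled at 16 kHz pulls the centroid of the band
# containing 1 kHz toward the tone's position within that band.
t = np.arange(512) / 16000.0
tone = np.sin(2.0 * np.pi * 1000.0 * t)
print(normalized_subband_centroids(tone, 16000))
```

Whereas a cepstral coefficient summarizes how much energy each subband carries, a centroid of this kind indicates where within the subband that energy is concentrated, which is the sense in which the two feature types complement each other.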
- Publication:
- arXiv e-prints
- Pub Date:
- July 2020
- DOI:
- 10.48550/arXiv.2007.13258
- arXiv:
- arXiv:2007.13258
- Bibcode:
- 2020arXiv200713258F
- Keywords:
- Electrical Engineering and Systems Science - Audio and Speech Processing;
- Computer Science - Machine Learning;
- Computer Science - Sound
- E-Print:
- 6 pages, 2020 IEEE Workshop on Signal Processing Systems (SiPS)