Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification

doi:10.48550/arXiv.1909.11886

Voice activity detection (VAD), which classifies frames as speech or non-speech, is an important module in many speech applications including speaker verification. In this paper, we propose a novel method, called self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD into a deep speaker embedding system. The proposed method is a combination of the following two approaches. The first approach is soft VAD, which performs a soft selection of frame-level features extracted from a speaker feature extractor. The frame-level features are weighted by their corresponding speech posteriors estimated from the DNN-based VAD, and then aggregated to generate a speaker embedding. The second approach is self-adaptive VAD, which fine-tunes the pre-trained VAD on the speaker verification data to reduce the domain mismatch. Here, we introduce two unsupervised domain adaptation (DA) schemes, namely speech posterior-based DA (SP-DA) and joint learning-based DA (JL-DA). Experiments on a Korean speech database demonstrate that the verification performance is improved significantly in real-world environments by using self-adaptive soft VAD.
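The soft selection step described above can be illustrated with a minimal sketch. It assumes the aggregation is a posterior-weighted mean over frames; the abstract does not specify the exact pooling operation, so this is one plausible reading and not the authors' implementation. The function name `soft_vad_pooling`, its parameters, and the `eps` stabilizer are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def soft_vad_pooling(
    frame_features: Sequence[Sequence[float]],
    speech_posteriors: Sequence[float],
    eps: float = 1e-8,
) -> List[float]:
    """Aggregate frame-level features into one utterance-level vector.

    Each frame is weighted by its speech posterior (a value in [0, 1]
    that would come from a VAD model), so likely non-speech frames
    contribute little. Weighted-mean pooling is an assumption here.
    """
    if len(frame_features) != len(speech_posteriors):
        raise ValueError("need exactly one posterior per frame")
    if len(frame_features) == 0:
        raise ValueError("need at least one frame")

    dim = len(frame_features[0])
    # eps avoids division by zero when every posterior is 0.
    total = sum(speech_posteriors) + eps

    embedding = [0.0] * dim
    for feat, posterior in zip(frame_features, speech_posteriors):
        if len(feat) != dim:
            raise ValueError("all frames must have the same dimension")
        weight = posterior / total
        for d in range(dim):
            embedding[d] += weight * feat[d]
    return embedding


# Toy input: two speech-like frames and one loud non-speech frame.
features = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
posteriors = [1.0, 1.0, 0.0]
embedding = soft_vad_pooling(features, posteriors)
```

With a posterior of 0 on the third frame, that frame drops out of the result entirely. Intermediate posteriors would down-weight a frame instead of discarding it, which is the difference between soft selection and a hard speech/non-speech decision.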

Publication: arXiv e-prints

Pub Date: September 2019

DOI: 10.48550/arXiv.1909.11886

arXiv: arXiv:1909.11886

Bibcode: 2019arXiv190911886J

Keywords: Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Computation and Language; Computer Science - Machine Learning; Computer Science - Sound; Statistics - Machine Learning

E-Print: Accepted at 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)
