Improving Speaker Verification with Self-Pretrained Transformer Models

doi:10.48550/arXiv.2305.10517

Fine-tuning large pre-trained Transformer models on downstream datasets has recently attracted growing interest. Despite their success, it remains challenging to disentangle the benefits of large-scale datasets and Transformer structures from the limitations of the pre-training. In this paper, we introduce a hierarchical training approach, named self-pretraining, in which Transformer models are pre-trained and fine-tuned on the same dataset. Three pre-trained models, HuBERT, Conformer and WavLM, are evaluated on four speaker verification datasets of varying sizes. Our experiments show that these self-pretrained models achieve competitive performance on downstream speaker verification tasks, such as VoxCeleb1 and CNCeleb1, with only one-third of the data compared to Librispeech pre-training. Furthermore, when pre-trained only on VoxCeleb2-dev, the Conformer model outperforms the one pre-trained on 94k hours of data using the same fine-tuning settings.

Publication:

arXiv e-prints

Pub Date:

May 2023

DOI:

10.48550/arXiv.2305.10517

arXiv:

arXiv:2305.10517

Bibcode:

2023arXiv230510517P

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print:

Accepted to Interspeech 2023
