Challenging margin-based speaker embedding extractors by using the variational information bottleneck
Abstract
Speaker embedding extractors are typically trained using a classification loss over the training speakers. Over the last few years, the standard softmax/cross-entropy loss has been replaced by margin-based losses, yielding significant improvements in speaker recognition accuracy. Motivated by the fact that the margin merely reduces the logit of the target speaker during training, we consider a probabilistic framework that has a similar effect. The variational information bottleneck provides a principled mechanism for making deterministic nodes stochastic, resulting in an implicit reduction of the posterior of the target speaker. We experiment with a wide range of speaker recognition benchmarks and scoring methods and report results competitive with those obtained with the state-of-the-art Additive Angular Margin loss.
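The two mechanisms contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the margin and scale values, the diagonal Gaussian posterior, and the standard normal prior are all assumptions chosen for illustration. The first function shows how an additive angular margin lowers the target-speaker logit; the other two show the usual way a variational information bottleneck makes a deterministic node stochastic (reparameterized sampling plus a KL regularizer).

```python
import math
import random


def aam_target_logit(cos_theta, margin=0.2, scale=30.0):
    """Target-class logit under an additive angular margin (illustrative values).

    The plain logit would be scale * cos(theta); the margin replaces it with
    scale * cos(theta + margin), which is smaller whenever theta + margin <= pi.
    """
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp for numerical safety
    return scale * math.cos(theta + margin)


def vib_sample(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1) per dimension.

    Assumes a diagonal Gaussian posterior; this is a generic VIB choice,
    not necessarily the configuration used in the paper.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), a common VIB regularizer."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


if __name__ == "__main__":
    cos_theta = 0.8
    print("plain target logit: ", 30.0 * cos_theta)
    print("margin target logit:", aam_target_logit(cos_theta))

    rng = random.Random(0)  # fixed seed so the sample is reproducible
    mu, log_var = [1.0, -1.0], [0.0, 0.0]
    print("stochastic embedding:", vib_sample(mu, log_var, rng))
    print("KL regularizer:      ", kl_to_standard_normal(mu, log_var))
```

In this sketch the margin acts on the logit directly, whereas the sampling noise in `vib_sample` perturbs the embedding and thereby lowers the expected target posterior only indirectly, which is the analogy the abstract draws.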
- Publication: arXiv e-prints
- Pub Date: June 2024
- DOI: 10.48550/arXiv.2406.12622
- arXiv: arXiv:2406.12622
- Bibcode: 2024arXiv240612622S
- Keywords: Electrical Engineering and Systems Science - Audio and Speech Processing
- E-Print: Accepted at Interspeech 2024