Discriminative Neural Clustering for Speaker Diarisation
Abstract
In this paper, we propose Discriminative Neural Clustering (DNC) that formulates data clustering with a maximum number of clusters as a supervised sequence-to-sequence learning problem. Compared to traditional unsupervised clustering algorithms, DNC learns clustering patterns from training data without requiring an explicit definition of a similarity measure. An implementation of DNC based on the Transformer architecture is shown to be effective on a speaker diarisation task using the challenging AMI dataset. Since AMI contains only 147 complete meetings as individual input sequences, data scarcity is a significant issue for training a Transformer model for DNC. Accordingly, this paper proposes three data augmentation schemes: sub-sequence randomisation, input vector randomisation, and Diaconis augmentation, which generates new data samples by rotating the entire input sequence of L2-normalised speaker embeddings. Experimental results on AMI show that DNC achieves a reduction in speaker error rate (SER) of 29.4% relative to spectral clustering.
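The abstract describes Diaconis augmentation only as rotating the entire input sequence of L2-normalised speaker embeddings. The sketch below illustrates that idea under stated assumptions and is not the authors' implementation: the helper names (`random_orthogonal`, `rotate_sequence`, `l2_normalise`), the 4-dimensional toy embeddings, and the use of Gram-Schmidt on Gaussian vectors to draw the matrix are all hypothetical choices made for illustration. The matrix drawn here is orthogonal, so it may include a reflection as well as a rotation. Either way, applying one shared matrix to every embedding keeps each vector at unit length and leaves all pairwise cosine similarities unchanged, which is why the speaker labels of the sequence remain valid.

```python
import math
import random


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def random_orthogonal(dim, rng):
    """Draw a random orthogonal matrix via Gram-Schmidt on Gaussian vectors.

    Assumption: this sampling scheme is an illustrative choice; the
    abstract does not say how the rotation is generated.
    """
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the components along the rows accepted so far.
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip a (very unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def rotate_sequence(embeddings, matrix):
    """Apply the same matrix to every embedding in the sequence."""
    return [
        [sum(m * x for m, x in zip(row, emb)) for row in matrix]
        for emb in embeddings
    ]


# Toy input: a sequence of five 4-dimensional unit-length "embeddings".
rng = random.Random(0)
sequence = [l2_normalise([rng.gauss(0.0, 1.0) for _ in range(4)]) for _ in range(5)]

# One matrix is drawn per augmented copy and shared across the whole sequence.
matrix = random_orthogonal(4, rng)
augmented = rotate_sequence(sequence, matrix)
```

Drawing a fresh matrix for each augmented copy yields many distinct training sequences from one meeting while keeping the relative geometry of the embeddings, and hence the cluster structure, intact.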
- Publication: arXiv e-prints
- Pub Date: October 2019
- DOI: 10.48550/arXiv.1910.09703
- arXiv: arXiv:1910.09703
- Bibcode: 2019arXiv191009703L
- Keywords: Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning; Computer Science - Sound
- E-Print: Accepted as a conference paper at the 8th IEEE Spoken Language Technology Workshop (SLT 2021)