Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

doi:10.48550/arXiv.2409.14486

We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach achieves state-of-the-art results similar to those of the new ES-KMeans+ method, while being almost five times faster.
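The two-stage idea in the abstract (boundaries from dissimilarity between adjacent features, then clustering of the resulting segments) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the dissimilarity measure, the boundary rule, the segment representation, or the clustering algorithm, so cosine dissimilarity, a thresholded local-peak rule, mean pooling, and a small k-means with farthest-point initialization are all assumptions here, as are the function names and the `threshold` parameter. Real self-supervised features are replaced by hand-made 2-D vectors.

```python
import math


def cosine_dissimilarity(a, b):
    """1 - cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


def detect_boundaries(frames, threshold=0.5):
    """Return indices i such that a new segment starts at frame i.

    Assumed rule: a boundary is a local peak of the adjacent-frame
    dissimilarity curve that also reaches `threshold`.
    """
    # d[j] compares frame j with frame j + 1.
    d = [cosine_dissimilarity(frames[j], frames[j + 1])
         for j in range(len(frames) - 1)]
    boundaries = []
    for j, value in enumerate(d):
        left = d[j - 1] if j > 0 else -math.inf
        right = d[j + 1] if j + 1 < len(d) else -math.inf
        if value >= threshold and value >= left and value > right:
            boundaries.append(j + 1)
    return boundaries


def segment_embeddings(frames, boundaries):
    """Mean-pool the frames of each segment (an assumed representation)."""
    edges = [0] + list(boundaries) + [len(frames)]
    embeddings = []
    for start, end in zip(edges[:-1], edges[1:]):
        segment = frames[start:end]
        embeddings.append([sum(col) / len(segment) for col in zip(*segment)])
    return embeddings


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy input: three "word" tokens, where the first and third are the same type.
frames = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3 + [[1.0, 0.0]] * 3
boundaries = detect_boundaries(frames)
labels = kmeans(segment_embeddings(frames, boundaries), k=2)
print(boundaries, labels)
```

On this toy input the first and third segments fall into the same cluster, which is the sense in which clustering the predicted segments yields a lexicon of recurring word-like types. The number of clusters `k` is fixed by hand here; how the lexicon size is chosen in practice is not stated in the abstract.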

Publication:

arXiv e-prints

Pub Date:

September 2024

DOI:

10.48550/arXiv.2409.14486

arXiv:

arXiv:2409.14486

Bibcode:

2024arXiv240914486M

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing;
Computer Science - Computation and Language;
Computer Science - Sound

E-Print:

3 figures, 3 tables
