Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
Abstract
In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a 2-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on 4 other languages from the ZeroSpeech Challenge, in some cases beating the previous state-of-the-art.
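The abstract's minimum-cut segmentation step can be illustrated with a small sketch. The paper's exact algorithm and feature extractor are not reproduced here; the function name `segment_min_cut`, the cosine-similarity graph, and the dynamic program over contiguous segments are illustrative assumptions, not the authors' implementation. The idea it demonstrates is the one named in the abstract: choose segment boundaries over frame-level features so that similarity mass stays inside segments rather than being "cut" between them.

```python
import numpy as np

def segment_min_cut(feats, n_segments):
    """Partition a (T, D) sequence of frame features into `n_segments`
    contiguous spans, maximizing total within-span similarity
    (equivalently, minimizing the similarity cut between spans).

    Illustrative sketch only: a simple DP stand-in for the min-cut
    boundary detector described in the paper, not the authors' code.
    """
    T = feats.shape[0]
    # Cosine similarity between every pair of frames.
    norm = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T
    # 2D prefix sums so any block sum sim[i:j, i:j] is O(1).
    P = np.zeros((T + 1, T + 1))
    P[1:, 1:] = np.cumsum(np.cumsum(sim, axis=0), axis=1)

    def span(i, j):
        # Total similarity inside the span [i, j).
        return P[j, j] - P[i, j] - P[j, i] + P[i, i]

    NEG = -1e18
    # dp[k, j]: best score covering frames [0, j) with k spans.
    dp = np.full((n_segments + 1, T + 1), NEG)
    back = np.zeros((n_segments + 1, T + 1), dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(1, T + 1):
            for i in range(k - 1, j):  # last span is [i, j)
                s = dp[k - 1, i] + span(i, j)
                if s > dp[k, j]:
                    dp[k, j] = s
                    back[k, j] = i
    # Walk back-pointers to recover interior boundary indices.
    bounds, j = [], T
    for k in range(n_segments, 0, -1):
        j = back[k, j]
        bounds.append(int(j))
    return sorted(bounds)[1:]  # drop the leading 0
```

For example, ten frames whose first five features point one way and last five another are split at frame 5 when asked for two segments. The cubic DP is fine at this sketch scale; a production segmenter would need something faster over long utterances.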
- Publication:
- arXiv e-prints
- Pub Date:
- May 2023
- DOI:
- 10.48550/arXiv.2305.11435
- arXiv:
- arXiv:2305.11435
- Bibcode:
- 2023arXiv230511435P
- Keywords:
- Electrical Engineering and Systems Science - Audio and Speech Processing;
- Computer Science - Artificial Intelligence;
- Computer Science - Computation and Language;
- Computer Science - Sound
- E-Print:
- Interspeech 2023. Code &