Bidirectional Representations for Low Resource Spoken Language Understanding

doi:10.48550/arXiv.2211.14320

Most spoken language understanding systems use a pipeline approach composed of an automatic speech recognition interface and a natural language understanding module. This approach forces hard decisions when converting continuous inputs into discrete language symbols. Instead, we propose a representation model that encodes speech into rich bidirectional encodings that can be used for downstream tasks such as intent prediction. The approach uses a masked language modelling objective to learn the representations, and thus benefits from both the left and right contexts. We show that, before fine-tuning, the resulting encodings outperform comparable models on multiple datasets, and that fine-tuning the top layers of the representation model improves on the current state of the art on the Fluent Speech Commands dataset, including in a low-data regime where only a limited amount of labelled data is used for training. Furthermore, we propose class attention as a spoken language understanding module that is efficient in terms of both speed and number of parameters. Class attention can also be used to visually explain the model's predictions, which helps in understanding how the model reaches them. We perform experiments in English and in Dutch.

Publication: arXiv e-prints

Pub Date: November 2022

DOI: 10.48550/arXiv.2211.14320

arXiv: arXiv:2211.14320

Bibcode: 2022arXiv221114320M

Keywords: Computer Science - Computation and Language

E-Print: Appl. Sci. 2023, 13, 11291
