Improving End-to-End SLU performance with Prosodic Attention and Distillation

doi:10.48550/arXiv.2305.08067

Rajaa, Shangeth

Most End-to-End SLU methods depend on pretrained ASR or language model features for intent prediction. However, other essential information in speech, such as prosody, is often ignored. Recent research has shown improved results in classifying dialogue acts by incorporating prosodic information, but the margins of improvement in these methods are minimal because the neural models ignore the prosodic features. In this work, we propose prosody-attention, which uses the prosodic features differently: to generate attention maps across the time frames of the utterance. We then propose prosody-distillation, which explicitly learns the prosodic information in the acoustic encoder rather than concatenating implicit prosodic features. Both proposed methods improve on the baseline results, and prosody-distillation improves intent classification accuracy by 8% and 2% on the SLURP and STOP datasets, respectively, over the prosody baseline.
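
The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract gives no architectural details, so the linear scoring vector `w`, the projection matrix `proj`, the mean-squared-error distillation loss, and all shapes below are assumptions chosen only to show the general pattern of attention weights derived from frame-level prosodic features and an auxiliary loss that pushes acoustic encoder outputs to predict those features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prosody_attention_pool(acoustic, prosody, w):
    """Pool acoustic frames with attention weights computed from prosody.

    acoustic: (T, d_h) frame-level acoustic encoder outputs
    prosody:  (T, d_p) frame-level prosodic features (e.g. pitch, energy)
    w:        (d_p,)   hypothetical scoring vector (an assumption)
    """
    scores = prosody @ w                # (T,) one score per time frame
    weights = softmax(scores, axis=0)   # attention map over time frames
    pooled = weights @ acoustic         # (d_h,) weighted sum of frames
    return pooled, weights

def prosody_distillation_loss(acoustic, prosody, proj):
    """Auxiliary loss: predict prosodic features from acoustic frames.

    proj: (d_h, d_p) hypothetical linear projection (an assumption).
    MSE is used here for illustration; the paper may use another loss.
    """
    predicted = acoustic @ proj         # (T, d_p)
    return float(np.mean((predicted - prosody) ** 2))

# Toy inputs with made-up sizes, for illustration only.
rng = np.random.default_rng(0)
T, d_h, d_p = 50, 16, 3
acoustic = rng.normal(size=(T, d_h))
prosody = rng.normal(size=(T, d_p))
w = rng.normal(size=d_p)
proj = rng.normal(size=(d_h, d_p))

pooled, weights = prosody_attention_pool(acoustic, prosody, w)
loss = prosody_distillation_loss(acoustic, prosody, proj)
```

In this sketch the pooled vector would feed an intent classifier, and the distillation loss would be added to the classification loss during training so that prosodic information is learned inside the encoder rather than concatenated at the input.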

Publication:

arXiv e-prints

Pub Date:

May 2023

DOI:

10.48550/arXiv.2305.08067

arXiv:

arXiv:2305.08067

Bibcode:

2023arXiv230508067R

Keywords:

Computer Science - Computation and Language;
Computer Science - Machine Learning;
Computer Science - Sound;
Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print:

Submitted to InterSpeech 2023
