Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition

doi:10.48550/arXiv.2008.06667

Categorical speech emotion recognition is typically performed as a sequence-to-label problem, i.e., determining the discrete emotion label of the input utterance as a whole. One of the main challenges in practice is that most existing emotion corpora do not provide ground-truth labels for each segment; instead, labels are available only for whole utterances. To extract segment-level emotional information from such weakly labeled emotion corpora, we propose using multiple instance learning (MIL) to learn segment embeddings in a weakly supervised manner. Moreover, in a sufficiently long utterance, not all segments contain relevant emotional information. We therefore apply three attention-based neural network models to the learned segment embeddings to attend to the most salient part of a speech utterance. Experiments on the CASIA corpus and the IEMOCAP database show results that are better than or highly competitive with other state-of-the-art approaches.
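The idea of attending over segment embeddings when only an utterance-level label exists can be illustrated with a generic attention-pooling step. The following is a minimal sketch under stated assumptions, not any of the three attention models from the paper: the function name `attention_pool`, the parameters `w` and `v`, and all dimensions are hypothetical, and the weights are random rather than learned.

```python
import numpy as np

def attention_pool(segments, w, v):
    """Pool segment embeddings into one utterance embedding.

    segments: (T, D) array, one row per segment (an "instance" in MIL terms).
    w: (D, H) projection and v: (H,) scoring vector; both are hypothetical
       parameters that would be learned in a real model.
    Returns the (D,) utterance embedding and the (T,) attention weights.
    """
    scores = np.tanh(segments @ w) @ v      # one scalar score per segment
    scores = scores - scores.max()          # shift for numerical stability
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum() # softmax over segments
    return weights @ segments, weights      # weighted sum of segment embeddings

# Toy example: 6 segments with 8-dimensional embeddings (random placeholders).
rng = np.random.default_rng(0)
T, D, H = 6, 8, 4
segments = rng.normal(size=(T, D))
w = rng.normal(size=(D, H))
v = rng.normal(size=H)

utterance_embedding, weights = attention_pool(segments, w, v)
```

In a trained model, segments with larger weights would be the ones the model treats as most salient for the utterance-level emotion label; a classifier would then operate on `utterance_embedding`.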

Publication: arXiv e-prints

Pub Date: August 2020

DOI: 10.48550/arXiv.2008.06667

arXiv: arXiv:2008.06667

Bibcode: 2020arXiv200806667M

Keywords: Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Sound
