MMER: Multimodal Multi-task Learning for Speech Emotion Recognition
Abstract
In this paper, we propose MMER, a novel Multimodal Multi-task learning approach for Speech Emotion Recognition. MMER uses a multimodal network based on early fusion and cross-modal self-attention between the text and acoustic modalities, and solves three auxiliary tasks to learn emotion recognition from spoken utterances. In practice, MMER outperforms all our baselines and achieves state-of-the-art performance on the IEMOCAP benchmark. Additionally, we conduct extensive ablation studies and results analysis to demonstrate the effectiveness of our proposed approach.
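As a rough illustration of what cross-modal attention between text and acoustic features means, the sketch below lets each frame of one modality attend over all frames of the other using plain scaled dot-product attention. It is a minimal sketch under stated assumptions, not the MMER architecture: the function names, the toy feature values, and the omission of learned query/key/value projections are all choices made here for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one modality
    and keys/values come from the other. No learned projections are
    applied (an assumption made to keep the sketch short)."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query frame to every key frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value frames.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# Hypothetical toy features: 2 text tokens and 3 acoustic frames, dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]

# Text tokens attend over acoustic frames, and vice versa.
text_attends_audio = cross_modal_attention(text, audio, audio)
audio_attends_text = cross_modal_attention(audio, text, text)
```

Each output row is a weighted average of the other modality's frames, so the result has one row per query frame. A real model would add learned projections, multiple heads, and the fusion and auxiliary-task components that the abstract mentions but does not detail.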
- Publication: arXiv e-prints
- Pub Date: March 2022
- DOI:
- arXiv: arXiv:2203.16794
- Bibcode: 2022arXiv220316794G
- Keywords: Computer Science - Computation and Language; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
- E-Print: InterSpeech 2023 Main Conference