AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations
Abstract
Emotion Recognition in Conversations (ERC) is a popular task in natural language processing that aims to recognize the emotional state of the speaker in a conversation. While current research primarily emphasizes contextual modeling, effective multimodal fusion methods remain underexplored. We propose a novel framework called AIMDiT to address the problem of multimodal fusion of deep features. Specifically, we design a Modality Augmentation Network that performs rich representation learning through dimension transformation of the different modalities and a parameter-efficient inception block. In addition, a Modality Interaction Network performs interaction fusion of the extracted inter-modal and intra-modal features. Experiments with the AIMDiT framework on the public benchmark dataset MELD show improvements of 2.34% and 2.87% in the Acc-7 and w-F1 metrics, respectively, over state-of-the-art (SOTA) models.
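For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how they are commonly computed, assuming Acc-7 denotes plain accuracy over MELD's seven emotion classes and w-F1 denotes the support-weighted average of per-class F1 scores. The function names and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted emotion matches the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)  # number of gold instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy example with three of the seven emotion labels (hypothetical data).
y_true = ["neutral", "neutral", "joy", "anger"]
y_pred = ["neutral", "joy", "joy", "anger"]
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```

Weighting by support matters on MELD because its emotion classes are unevenly represented, so an unweighted (macro) average would give rare classes the same influence as frequent ones.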
- Publication: arXiv e-prints
- Pub Date: April 2024
- DOI:
- arXiv: arXiv:2407.00743
- Bibcode: 2024arXiv240700743W
- Keywords: Computer Science - Multimedia; Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Electrical Engineering and Systems Science - Audio and Speech Processing