AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

doi:10.48550/arXiv.2407.00743

AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Emotion Recognition in Conversations (ERC) is a popular task in natural language processing, which aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, there exists a dearth of investigation into effective multimodal fusion methods. We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features. Specifically, we design a Modality Augmentation Network which performs rich representation learning through dimension transformation of different modalities and parameter-efficient inception block. On the other hand, the Modality Interaction Network performs interaction fusion of extracted inter-modal features and intra-modal features. Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics compared to the state-of-the-art (SOTA) models.

Publication:

arXiv e-prints

Pub Date:

April 2024

DOI:

10.48550/arXiv.2407.00743

arXiv:

arXiv:2407.00743

Bibcode:

2024arXiv240700743W

Keywords:

Computer Science - Multimedia;
Computer Science - Artificial Intelligence;
Computer Science - Computation and Language;
Electrical Engineering and Systems Science - Audio and Speech Processing

ADS

AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Abstract