MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions
Abstract
We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C, which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zero-shot tasks, highlighting the potential of valence and arousal predictions in downstream applications.
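The abstract describes deriving a continuous matching score for cross-modal pairs from their valence-arousal values. The paper does not state the exact formula here, so the following is a minimal sketch under an assumed (and common) choice: a Gaussian kernel over the Euclidean distance in the 2D valence-arousal plane, yielding a similarity in (0, 1]. The function name `va_matching_score` and the `scale` parameter are illustrative, not from the paper.

```python
import math

def va_matching_score(va_a, va_b, scale=1.0):
    """Similarity between two modalities' (valence, arousal) pairs.

    Assumption: a Gaussian kernel over Euclidean distance in VA space;
    the paper's actual scoring function may differ.
    """
    d = math.dist(va_a, va_b)  # Euclidean distance in the VA plane
    return math.exp(-(d ** 2) / (2 * scale ** 2))  # 1.0 for identical VA values

# Identical VA values match perfectly; distant values decay toward 0,
# which supports randomly sampled image-music pairs with soft targets.
print(va_matching_score((0.8, 0.6), (0.8, 0.6)))  # 1.0
print(va_matching_score((0.8, 0.6), (0.2, 0.1)))
```

Because the score is continuous rather than a hard match/no-match label, any randomly sampled image-music pair can serve as a training example with a graded supervision signal.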
- Publication:
- arXiv e-prints
- Pub Date:
- January 2025
- DOI:
- arXiv:2501.01094
- Bibcode:
- 2025arXiv250101094C
- Keywords:
- Computer Science - Sound;
- Computer Science - Artificial Intelligence;
- Computer Science - Multimedia;
- Electrical Engineering and Systems Science - Audio and Speech Processing
- E-Print:
- Paper accepted in Artificial Intelligence for Music workshop at AAAI 2025