LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing

doi:10.48550/arXiv.2412.20872

Audio-visual video parsing classifies videos using only weak labels, identifying each event as visible, audible, or both and localizing its temporal boundaries. Many methods ignore that the modalities are often not aligned, which introduces extra noise during modal interaction. In this work, we introduce a Learning Interaction method for Non-aligned Knowledge (LINK), designed to balance the contributions of the distinct modalities by dynamically adjusting their input during event prediction. Additionally, we leverage the semantic information of pseudo-labels as prior knowledge to mitigate noise from other modalities. Our experiments show that our model outperforms existing methods on the LLP dataset.

Publication:

arXiv e-prints

Pub Date:

December 2024

DOI:

10.48550/arXiv.2412.20872

arXiv:

arXiv:2412.20872

Bibcode:

2024arXiv241220872W

Keywords:

Computer Science - Computer Vision and Pattern Recognition

E-Print:

Accepted by ICASSP 2025
