Can audio-visual integration strengthen robustness under multimodal attacks?
Abstract
In this paper, we propose to make a systematic study on machines multisensory perception under attacks. We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack audio, visual, and both modalities to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. For interpreting the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model to localize sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that audio-visual models are susceptible to multimodal adversarial attacks; audio-visual integration could decrease the model robustness rather than strengthen under multimodal attacks; even a weakly-supervised sound source visual localization model can be successfully fooled; our defense method can improve the invulnerability of audio-visual networks without significantly sacrificing clean model performance.
- Publication:
-
arXiv e-prints
- Pub Date:
- April 2021
- DOI:
- 10.48550/arXiv.2104.02000
- arXiv:
- arXiv:2104.02000
- Bibcode:
- 2021arXiv210402000T
- Keywords:
-
- Computer Science - Computer Vision and Pattern Recognition;
- Computer Science - Cryptography and Security;
- Computer Science - Sound;
- Electrical Engineering and Systems Science - Audio and Speech Processing
- E-Print:
- CVPR 2021