Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering
Abstract
Learning effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method of dynamically fusing multi-modal features with intra- and inter-modality information flow, which alternately passes dynamic information within and across the visual and language modalities. It robustly captures the high-level interactions between the language and vision domains, thus significantly improving the performance of visual question answering. We also show that the proposed dynamic intra-modality attention flow, conditioned on the other modality, can dynamically modulate the intra-modality attention of the target modality, which is vital for multi-modality feature fusion. Experimental evaluations on the VQA 2.0 dataset show that the proposed method achieves state-of-the-art VQA performance. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method.
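To make the two attention flows described in the abstract concrete, the following is a minimal PyTorch sketch (illustrative, not the authors' released code): an inter-modality block in which each modality attends to the other, followed by a dynamic intra-modality block in which self-attention within each modality is gated by a pooled summary of the other modality. The module names, feature dimension, and exact gating form are assumptions for illustration.

```python
# Minimal sketch (illustrative, not the authors' implementation) of one
# inter-modality attention pass followed by a dynamic intra-modality pass.
# Shapes: v is (batch, regions, dim) visual features, q is (batch, words, dim)
# question features; the feature dim and gating form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterModalityAttention(nn.Module):
    """Each modality attends to the other (cross-attention) and is updated."""

    def __init__(self, dim):
        super().__init__()
        self.v_proj = nn.Linear(dim, 3 * dim)  # vision query/key/value
        self.q_proj = nn.Linear(dim, 3 * dim)  # language query/key/value
        self.scale = dim ** -0.5

    def forward(self, v, q):
        vq, vk, vv = self.v_proj(v).chunk(3, dim=-1)
        qq, qk, qv = self.q_proj(q).chunk(3, dim=-1)
        # Vision queries attend over language keys/values, and vice versa.
        v_new = F.softmax(vq @ qk.transpose(1, 2) * self.scale, dim=-1) @ qv
        q_new = F.softmax(qq @ vk.transpose(1, 2) * self.scale, dim=-1) @ vv
        return v + v_new, q + q_new


class DynamicIntraModalityAttention(nn.Module):
    """Self-attention within each modality, modulated by the other modality."""

    def __init__(self, dim):
        super().__init__()
        self.v_proj = nn.Linear(dim, 3 * dim)
        self.q_proj = nn.Linear(dim, 3 * dim)
        self.gate_v = nn.Linear(dim, 2 * dim)  # pooled question -> vision gates
        self.gate_q = nn.Linear(dim, 2 * dim)  # pooled vision -> language gates
        self.scale = dim ** -0.5

    def forward(self, v, q):
        # Pool the conditioning modality and turn it into per-channel gates
        # that rescale the target modality's queries and keys (the "dynamic" part).
        gv_q, gv_k = torch.sigmoid(self.gate_v(q.mean(1))).unsqueeze(1).chunk(2, -1)
        gq_q, gq_k = torch.sigmoid(self.gate_q(v.mean(1))).unsqueeze(1).chunk(2, -1)
        vq, vk, vv = self.v_proj(v).chunk(3, dim=-1)
        qq, qk, qv = self.q_proj(q).chunk(3, dim=-1)
        vq, vk = vq * (1 + gv_q), vk * (1 + gv_k)
        qq, qk = qq * (1 + gq_q), qk * (1 + gq_k)
        v_new = F.softmax(vq @ vk.transpose(1, 2) * self.scale, dim=-1) @ vv
        q_new = F.softmax(qq @ qk.transpose(1, 2) * self.scale, dim=-1) @ qv
        return v + v_new, q + q_new


# Example: 36 region features and 14 word features through one fusion block.
v = torch.randn(2, 36, 512)
q = torch.randn(2, 14, 512)
v, q = InterModalityAttention(512)(v, q)
v, q = DynamicIntraModalityAttention(512)(v, q)
```

In the paper, several such blocks are stacked so that inter- and intra-modality updates alternate; the sketch above shows only a single pass of each.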
- Publication:
- arXiv e-prints
- Pub Date:
- December 2018
- DOI:
- 10.48550/arXiv.1812.05252
- arXiv:
- arXiv:1812.05252
- Bibcode:
- 2018arXiv181205252P
- Keywords:
- Computer Science - Computer Vision and Pattern Recognition;
- Electrical Engineering and Systems Science - Image and Video Processing
- E-Print:
- CVPR 2019 ORAL