Multimodal grid features and cell pointers for scene text visual question answering
Abstract
This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding scene text. Current state-of-the-art models for this task use a dual attention mechanism in which one attention module attends to visual features while the other attends to textual features. A possible issue with this design is that it makes it difficult for the model to reason jointly about both modalities. To fix this problem we propose a new model based on a single attention mechanism that attends to multi-modal features conditioned on the question. The output weights of this attention module over a grid of multi-modal spatial features are interpreted as the probability that a certain spatial location of the image contains the answer text to the given question. Our experiments demonstrate competitive performance on two standard datasets with a model that is 5× faster than previous methods at inference time. Furthermore, we also provide a novel analysis of the ST-VQA dataset based on a human performance study. Supplementary material, code, and data are made available through this link.
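The single-attention idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the question and grid features are fused, so the additive tanh fusion, the projection matrices `w_grid`, `w_question`, `w_score`, and the function name `cell_pointer_attention` are all assumptions made for illustration. The sketch only shows how a softmax over grid cells yields a per-cell probability that can be read as a pointer to the answer location.

```python
import math
import random


def cell_pointer_attention(grid, question, w_grid, w_question, w_score):
    """Return an H x W map of probabilities that each cell holds the answer.

    grid:       H x W x D nested lists of multi-modal cell features
    question:   length-Q question embedding
    w_grid:     D x K projection (hypothetical)
    w_question: Q x K projection (hypothetical)
    w_score:    length-K scoring vector (hypothetical)
    """
    k = len(w_score)
    # Project the question once; it conditions every cell in the same way.
    q_proj = [sum(question[i] * w_question[i][j] for i in range(len(question)))
              for j in range(k)]
    scores = []
    for row in grid:
        for cell in row:
            c_proj = [sum(cell[i] * w_grid[i][j] for i in range(len(cell)))
                      for j in range(k)]
            # Assumed fusion: additive combination followed by tanh.
            fused = [math.tanh(c + q) for c, q in zip(c_proj, q_proj)]
            scores.append(sum(f * s for f, s in zip(fused, w_score)))
    # Softmax over all cells, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    width = len(grid[0])
    return [probs[r * width:(r + 1) * width] for r in range(len(grid))]


def rand_matrix(rows, cols):
    """Random Gaussian matrix standing in for learned weights or features."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
H, W, D, Q, K = 3, 4, 8, 6, 5
grid = [rand_matrix(W, D) for _ in range(H)]          # H x W x D features
question = [random.gauss(0, 1) for _ in range(Q)]     # question embedding
probs = cell_pointer_attention(
    grid, question,
    rand_matrix(D, K), rand_matrix(Q, K),
    [random.gauss(0, 1) for _ in range(K)],
)
# The "cell pointer": the grid location with the highest answer probability.
best = max(((r, c) for r in range(H) for c in range(W)),
           key=lambda rc: probs[rc[0]][rc[1]])
```

In this toy setup the weights are random, so `best` carries no meaning; in a trained model the selected cell would indicate which scene text token to return as the answer.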
- Publication:
- Pattern Recognition Letters
- Pub Date:
- October 2021
- DOI:
- 10.1016/j.patrec.2021.06.026
- arXiv:
- arXiv:2006.00923
- Bibcode:
- 2021PaReL.150..242G
- Keywords:
- Deep learning;
- Scene text;
- Visual question answering;
- Multi-modal learning;
- MSC;
- 41A05;
- 41A10;
- 65D05;
- 65D17;
- Computer Science - Computer Vision and Pattern Recognition
- E-Print:
- This paper is under consideration at Pattern Recognition Letters