Compositional Generalization in Image Captioning
Abstract
Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. We propose a multi-task model to address the poor performance, that combines caption generation and image--sentence ranking, and uses a decoding mechanism that re-ranks the captions according their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.
- Publication:
-
arXiv e-prints
- Pub Date:
- September 2019
- DOI:
- 10.48550/arXiv.1909.04402
- arXiv:
- arXiv:1909.04402
- Bibcode:
- 2019arXiv190904402N
- Keywords:
-
- Computer Science - Machine Learning;
- Computer Science - Computation and Language;
- Computer Science - Computer Vision and Pattern Recognition;
- Statistics - Machine Learning
- E-Print:
- To appear at CoNLL 2019, EMNLP