On the use of human reference data for evaluating automatic image descriptions

doi:10.48550/arXiv.2006.08792

van Miltenburg, Emiel

Automatic image description systems are commonly trained and evaluated using crowdsourced, human-generated image descriptions. The best-performing system is then determined using some measure of similarity to the reference data (BLEU, Meteor, CIDEr, etc.). Thus, both the quality of the systems and the quality of the evaluation depend on the quality of the descriptions. As Section 2 will show, the quality of current image description datasets is insufficient. I argue that there is a need for more detailed guidelines that take into account the needs of visually impaired users, but also the feasibility of generating suitable descriptions. With high-quality data, evaluation of image description systems could use reference descriptions, but we should also look for alternatives.

Publication: arXiv e-prints

Pub Date: June 2020

DOI: 10.48550/arXiv.2006.08792

arXiv: arXiv:2006.08792

Bibcode: 2020arXiv200608792V

Keywords: Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Human-Computer Interaction

E-Print: Originally presented as a (non-archival) poster at the VizWiz 2020 workshop, collocated with CVPR 2020. See: https://vizwiz.org/workshops/2020-workshop/
